About Newspaper Navigator

A project by Benjamin Charles Germain Lee as part of the 2020 Innovator in Residence Program at the Library of Congress.

Q: What is Newspaper Navigator?

A: Newspaper Navigator is a project carried out by Benjamin (Ben) Charles Germain Lee as part of the Innovator in Residence Program at the Library of Congress. The goals of Newspaper Navigator are 1) to extract visual content from 16+ million pages of digitized historic American newspapers in Chronicling America and 2) to re-imagine how you, members of the American public, can search that visual content using machine learning techniques. The first phase (extracting the visual content) was completed in April 2020 and resulted in the full Newspaper Navigator dataset (more info below). The second phase of Newspaper Navigator consists of the application that you are currently using!

The Library of Congress believes that the newspapers in Chronicling America are in the public domain or have no known copyright restrictions. Further rights and reproductions information about the collection can be found on the Chronicling America website. Essentially, newspapers published more than 95 years ago are in the public domain in their entirety. Newspapers published less than 95 years ago could contain copyrighted third-party materials such as images (advertisements and comics especially!). If you plan to use images from Newspaper Navigator that are less than 95 years old, it is your responsibility to make an independent legal assessment of the item and secure any necessary permissions. When viewing an image, you can select the "learn about this newspaper" button for further information on the newspaper publisher.

Q: What is Chronicling America?

A: Chronicling America is a database of 16+ million digitized historic newspaper pages, which the Library of Congress believes to be in the public domain or to have no known copyright restrictions. It is a product of the National Digital Newspaper Program, itself a partnership of the Library of Congress and the National Endowment for the Humanities. You can learn more about Chronicling America here. In addition, you can find visualizations of Chronicling America here.

Q: What can I do with this Newspaper Navigator application?

A: This application enables you to search and explore historic newspaper photographs. You can search by keyword over the photos' captions (extracted from the OCR of each newspaper page), as well as search by visual similarity. The visual similarity search retrieves relevant photos by letting you train a machine learning algorithm on the photos that you select. The application contains 1.56 million photos from the Newspaper Navigator dataset: all extracted photos in the dataset published between 1900 and 1963 with confidence scores above 90%. Advertisements are a separate category in the Newspaper Navigator dataset, but you may still find some advertisements using this application (the visual content recognition model does not classify visual content type perfectly, and many advertisements also include photographs!).
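
To make the selection criteria and the caption search concrete, here is a minimal sketch. The record fields (`year`, `score`, `ocr_caption`), the sample records, and the helper names are hypothetical, and treating the 1900 and 1963 bounds as inclusive is an assumption; the application's own code may work differently.

```python
def is_included(photo):
    """Selection rule described above: 1900-1963 (assumed inclusive), score above 90%."""
    return 1900 <= photo["year"] <= 1963 and photo["score"] > 0.90

def keyword_search(photos, query):
    """Case-insensitive search: every query term must appear in the OCR caption."""
    terms = query.lower().split()
    return [p for p in photos if all(t in p["ocr_caption"].lower() for t in terms)]

# Hypothetical records standing in for entries in the dataset.
photos = [
    {"id": "a", "year": 1912, "score": 0.97, "ocr_caption": "Crowds gather at the county fair"},
    {"id": "b", "year": 1895, "score": 0.99, "ocr_caption": "County fair opens"},   # too early
    {"id": "c", "year": 1930, "score": 0.62, "ocr_caption": "Fair weather ahead"},  # low confidence
    {"id": "d", "year": 1941, "score": 0.93, "ocr_caption": "New bridge dedicated"},
]

included = [p for p in photos if is_included(p)]
matches = keyword_search(included, "county fair")
print([p["id"] for p in matches])
```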

Q: Where can I learn more about the Newspaper Navigator dataset?

A: You can find the full Newspaper Navigator dataset here, along with instructions documenting how to query the dataset using HTTPS and S3 requests, as well as how to download pre-packaged subsets of the dataset. In addition, a whitepaper describing the dataset and its construction can be found here. To learn more about the ways in which machine learning affects discoverability of photos in the dataset, see the data archaeology here. Lastly, we hosted a data jam on May 7, 2020. The full archived recording can be found here. The recording includes a presentation on the dataset, a tutorial for how to use it, and a show-and-tell of projects created by participants.

Q: How does the visual search capability provided by Newspaper Navigator work?

A: The visual search capability behind Newspaper Navigator is powered by a machine learning algorithm that learns from the selections that you make and returns images that are visually similar to your selections. In the language of machine learning, your positive and negative selections are training data for the algorithm. The Newspaper Navigator visual search function is powered by image embeddings. An image embedding is a low-dimensional representation of an image, often a list of a few hundred or a few thousand numbers, that captures much of the image’s semantic content. Image embeddings are typically generated by feeding an image into a pre-trained neural image classification model (e.g., a model that takes in an image and outputs a label of "dog" or "cat") and extracting a representation of the image from one of the model’s hidden layers. The Newspaper Navigator dataset contains ResNet-18 image embeddings for all of the photos (ResNet refers to a specific image classification model architecture). The system trains a machine learner on the fly using the embeddings for your positive and negative examples (as well as some additional randomly-drawn negative examples). The machine learner then predicts over all 1.56 million photos, and the results are sorted according to prediction score. Because the ResNet-18 image embeddings are low-dimensional (512 dimensions to be precise), training and predicting over all 1.56 million photos takes only seconds.
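
The train-then-rank loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the application's implementation: the app uses scikit-learn, but the specific model is not stated here, so the sketch substitutes a small hand-written logistic regression that needs only the Python standard library. The embeddings are synthetic 8-dimensional vectors (the real ones have 512 dimensions), and every name in the code is hypothetical.

```python
import math
import random

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train(examples, labels, epochs=200, lr=0.5):
    """Fit logistic regression with batch gradient descent."""
    dim = len(examples[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(examples, labels):
            err = sigmoid(dot(weights, x) + bias) - y
            grad_b += err
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Toy stand-in for the embedding table: 200 photos, 8 dimensions each.
# Photos 0-19 share one visual concept; the rest share another.
rng = random.Random(0)

def make_embedding(center):
    return [c + rng.gauss(0.0, 0.1) for c in center]

concept = [1.0] * 4 + [0.0] * 4
other = [0.0] * 4 + [1.0] * 4
embeddings = [make_embedding(concept if i < 20 else other) for i in range(200)]

positive_ids = [0, 1, 2]   # photos the user marked as relevant
negative_ids = [50, 51]    # photos the user marked as not relevant
chosen = set(positive_ids) | set(negative_ids)
# Extra negatives drawn at random from the rest of the collection.
pool = [i for i in range(len(embeddings)) if i not in chosen]
random_negative_ids = rng.sample(pool, 10)

train_ids = positive_ids + negative_ids + random_negative_ids
labels = [1] * len(positive_ids) + [0] * (len(negative_ids) + len(random_negative_ids))
weights, bias = train([embeddings[i] for i in train_ids], labels)

# Score every photo in the collection and sort by descending prediction score.
scores = [sigmoid(dot(weights, e) + bias) for e in embeddings]
ranked = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)
print(ranked[:10])
```

Note that the random negatives may occasionally include a photo that actually matches the concept. With only a handful of them the ranking is usually still dominated by the user's explicit selections, which is one reason adding more positive examples tends to sharpen results.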

Q: Machine learning has a fraught history of perpetuating marginalization. Have you studied or audited your system from the perspective of algorithmic bias?

A: This question is incredibly important. If you navigate to this link (also linked to on the main menu), you will find a paper discussing the ways in which Newspaper Navigator mediates the discoverability of newspaper photos, with a particular focus on algorithmic bias. In this writing, which I call a "data archaeology," I trace the journeys of four newspaper photos of W.E.B. Du Bois as they travel through the Chronicling America and Newspaper Navigator pipelines as a case study. If you would like to discuss this data archaeology or have any questions regarding machine learning and how it perpetuates marginalization, please feel free to email me at [email protected] or contact me on Twitter (my email account will be de-activated in the coming months).

Q: The visual search capability with the AI navigators doesn't seem to be learning the topic or concept that I'm interested in. What should I do?

A: I'd recommend experimenting with different tunings by clicking different training examples. If that doesn't seem to work, you might be encountering a limitation of the machine learning algorithm itself. While image embeddings capture an amazing amount of information, they do not always encode relationships between visually similar images!

Q: Why do the search results contain duplicate examples of a photo?

A: Photos were often reprinted across many different newspaper issues and titles. If the same photo appears several times in your results, you have found a photo that went "viral"!

Q: Some of the photo captions have typos. Others are incomplete or contain extraneous text. Is there an explanation for this?

A: The transcription for each Chronicling America page is generated using an automated optical character recognition (OCR) algorithm. Because the resulting transcription is machine-generated, its quality is dependent on a number of factors including scan quality and contrast, typeface, and ink. The photo captions included in this app are part of the Newspaper Navigator dataset, and each caption was identified and extracted by a visual content recognition model (this process is described in more detail here). Because the captions were extracted in an automated fashion, the captions are sometimes clipped or contain neighboring text on the page.

Q: When I filter by state/territory, I don't see any search results. Why is this the case?

A: At the time the Newspaper Navigator dataset was produced, Chronicling America did not contain any pages from New Hampshire, Rhode Island, or Wyoming, so filtering by one of these states returns no results. Try searching by a different state or over all photos!

Q: When I filter by state/territory, some of the search results are photographs from newspapers published in other states. Why is this the case?

A: Great question! The state/territory facet filters by a newspaper's geographic coverage, rather than its publication location (geographic coverage is more robust because the publication location refers only to where the first issue of a newspaper was published). A newspaper title's geographic coverage may encompass multiple states & territories, so even if a title seems as though it is from another state, it was still distributed in the state/territory that you selected.
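
The difference between filtering on coverage and filtering on publication location can be shown with a small sketch. The newspaper titles, field names, and helper below are all hypothetical; they only illustrate the behavior described above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class NewspaperTitle:
    name: str
    publication_state: str       # where the first issue was published
    coverage: FrozenSet[str]     # every state/territory the title covered

def filter_by_state(titles: List[NewspaperTitle], state: str) -> List[NewspaperTitle]:
    """Keep titles whose geographic coverage includes the selected state."""
    return [t for t in titles if state in t.coverage]

# Made-up titles for illustration only.
titles = [
    NewspaperTitle("Example Border Gazette", "Kansas", frozenset({"Kansas", "Missouri"})),
    NewspaperTitle("Example City Herald", "Missouri", frozenset({"Missouri"})),
    NewspaperTitle("Example Prairie News", "Nebraska", frozenset({"Nebraska"})),
]

missouri_matches = [t.name for t in filter_by_state(titles, "Missouri")]
print(missouri_matches)
```

In this sketch, the "Example Border Gazette" was first published in Kansas but still appears under Missouri because its coverage includes that state.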

Q: How can I save my progress & share my collection?

A: On each page, there is a "save" button. Clicking it copies a URL to your clipboard, which you can then paste (Ctrl+V) wherever you would like to keep it. Opening this link restores your progress, and you can also share it with friends!
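
One common way to make a link restore progress is to encode the session state in the URL itself. The sketch below assumes that approach; the base URL, parameter names, and photo identifiers are placeholders, and the application's actual URL format may differ.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://example.org/search"  # placeholder, not the application's URL

def build_save_url(positive_ids, negative_ids, collection_ids):
    """Pack the current selections into a query string."""
    query = urlencode({
        "positive": ",".join(positive_ids),
        "negative": ",".join(negative_ids),
        "collection": ",".join(collection_ids),
    })
    return f"{BASE_URL}?{query}"

def restore_state(url):
    """Recover the selections from a saved URL."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)

    def ids(key):
        raw = params.get(key, [""])[0]
        return raw.split(",") if raw else []

    return {key: ids(key) for key in ("positive", "negative", "collection")}

saved = build_save_url(["photo_12", "photo_98"], ["photo_5"], ["photo_12"])
state = restore_state(saved)
print(saved)
print(state)
```

Under this design, every selected photo adds characters to the query string, so the URL grows with the size of the collection.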

Q: The URL for saving my progress is quite long. Can I shorten it?

A: Yes! There are a number of URL shorteners available online that will drastically compress the URL.

Q: How can I save the photos in my collection?

A: If you click the "download metadata" button on the "My Collection" page, the application will generate a spreadsheet (.csv) with metadata for all of the photos in your collection, as well as links for downloading the photos themselves from the Newspaper Navigator dataset and from Chronicling America with IIIF.
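
As an illustration of what such a spreadsheet might contain, the sketch below writes a small CSV with one IIIF image link per photo. The link follows the general IIIF Image API pattern of identifier, region, size, rotation, and quality. The server address, page identifier, column names, and bounding-box values are placeholders, and whether a given server accepts the `max` size keyword depends on the IIIF version it implements.

```python
import csv
import io
from urllib.parse import quote

IIIF_BASE = "https://example.org/iiif"  # placeholder server, not Chronicling America's

def iiif_url(identifier, region="full", size="max", rotation="0",
             quality="default", fmt="jpg"):
    """Build an IIIF Image API URL; the identifier is percent-encoded."""
    return (f"{IIIF_BASE}/{quote(identifier, safe='')}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")

def collection_to_csv(photos):
    """Write one row per photo, including a link that crops the photo from its page."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "date", "caption", "iiif_url"])
    writer.writeheader()
    for photo in photos:
        x, y, w, h = photo["box_pct"]  # bounding box as percentages of the page
        writer.writerow({
            "title": photo["title"],
            "date": photo["date"],
            "caption": photo["caption"],
            "iiif_url": iiif_url(photo["page_id"], region=f"pct:{x},{y},{w},{h}"),
        })
    return buffer.getvalue()

# Hypothetical collection entry.
collection = [{
    "title": "Example City Herald",
    "date": "1915-06-01",
    "caption": "Parade on Main Street",
    "page_id": "example-page-0001",
    "box_pct": (10, 20, 30, 40),
}]

csv_text = collection_to_csv(collection)
print(csv_text)
```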

Q: The website doesn't look right on my screen. Do you have any suggestions?

A: I would recommend visiting this site on a desktop or laptop with a browser that is relatively up-to-date. The mobile version of the site should function, but the interactions are much more fluid on a larger screen.

Q: Where can I find the code for Newspaper Navigator, and can I re-use any of it?

A: All code for Newspaper Navigator is available at the Newspaper Navigator GitHub repository. This includes all code for the search app, as well as for the creation of the Newspaper Navigator dataset. All code is open source and in the public domain for unrestricted re-use! We encourage you to re-use this code for your own creative computing and digital humanities projects.

Q: How was this application built?

A: I built this application using Python, Flask, HTML, CSS, and vanilla JavaScript. I used scikit-learn for the back-end machine learning components. All photos are served using IIIF image URLs that are mapped to Chronicling America. The application is containerized using Docker. More info on the app can be found at the Newspaper Navigator GitHub repository.

Q: I'd like to make a similar application for my own photo dataset using your code. How feasible is this?

A: This is definitely possible! The application is fully containerized using Docker, so setting up a development version of the application should be as easy as cloning the repo and running "docker-compose up". The bulk of the work will relate to pre-processing your own images and metadata (e.g., creating image embeddings and formatting the metadata properly). If you do end up making a similar application with the Newspaper Navigator code, we'd love to hear about it! Email the team at [email protected] or tweet @LC_Labs with the hashtag #NewspaperNavigator.

Q: Can I share what I've found or created using Newspaper Navigator with you?

A: Definitely, we'd love to hear about what you find and make using Newspaper Navigator! Email the team at [email protected] or tweet @LC_Labs with the hashtag #NewspaperNavigator.

Q: Who are you?

A: My name is Benjamin Charles Germain Lee (Ben for short!), and I am a 2020 Innovator in Residence at the Library of Congress, as well as a third-year Ph.D. student at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. My research focus is at the intersection of artificial intelligence and human-computer interaction, with application to cultural heritage and the digital humanities. A special thank-you to my advisor, Professor Daniel Weld, whose input, advice, and guidance with Newspaper Navigator have been invaluable!

Q: How can I contact you?

A: For press inquiries, collaborations, questions, and feedback, please feel free to email me at [email protected] or contact me on Twitter (my email account will be de-activated in the coming months).

Q: How can I stay updated on Newspaper Navigator and LC Labs's other experiments?

A: You can follow the Library of Congress Labs on Twitter or sign up for their newsletter. You can also follow me on Twitter.