Links to web resources frequently break, and linked content can change at unpredictable rates. These dynamics of the Web are detrimental when references to web resources provide evidence or supporting information. In this paper, we highlight the significance of reference rot, provide an overview of existing techniques and their characteristics to address it, and introduce our Robust Links approach, including its web service and underlying API. Robustifying links offers a proactive, uniform, and machine-actionable way to combat reference rot. In addition, we discuss our reasoning and approach aimed at keeping the approach functional for the long term. To showcase our approach, we have robustified all links in this article.
For traditional library collections, archivists can select a representative sample from a collection and display it in a featured physical or digital library space. Web archive collections may consist of thousands of archived pages, or mementos. How should an archivist display this sample to drive visitors to their collection? Search engines and social media platforms often represent web pages as cards consisting of text snippets, titles, and images. Web storytelling is a popular method for grouping these cards in order to summarize a topic. Unfortunately, social media platforms are not archive-aware and fail to consistently create a good experience for mementos. They also allow no UI alterations for their cards. Thus, we created MementoEmbed to generate cards for individual mementos and Raintale for creating entire stories that archivists can export to a variety of formats.
Tools such as Google News and Flipboard exist to convey daily news, but what about the past? In this paper, we describe how to combine several existing tools with web archive holdings to perform news analysis and visualization of the "biggest story" for a given date. StoryGraph clusters news articles together to identify a common news story. Hypercane leverages ArchiveNow to store URLs produced by StoryGraph in web archives. Hypercane analyzes these URLs to identify the most common terms, entities, and highest quality images for social media storytelling. Raintale then uses the output of these tools to produce a visualization of the news story for a given day. We name this process SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration).
Used by a variety of researchers, web archive collections have become invaluable sources of evidence. If a researcher is presented with a web archive collection that they did not create, how do they know what is inside so that they can use it for their own research? Search engine results and social media links are represented as surrogates, small easily digestible summaries of the underlying page. Search engines and social media have a different focus, and hence produce different surrogates than web archives. Search engine surrogates help a user answer the question "Will this link meet my information need?" Social media surrogates help a user decide "Should I click on this?" Our use case is subtly different. We hypothesize that groups of surrogates together are useful for summarizing a collection. We want to help users answer the question of "What does the underlying collection contain?" But which surrogate should we use? With Mechanical Turk participants, we evaluate six different surrogate types against each other. We find that the type of surrogate does not influence the time to complete the task we presented the participants. Of particular interest are social cards, surrogates typically found on social media, and browser thumbnails, screen captures of web pages rendered in a browser. At p=0.0569, and p=0.0770, respectively, we find that social cards and social cards paired side-by-side with browser thumbnails probably provide better collection understanding than the surrogates currently used by the popular Archive-It web archiving platform. We measure user interactions with each surrogate and find that users interact with social cards less than other types. The results of this study have implications for our web archive summarization work, live web curation platforms, social media, and more.
Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves, selecting their own original resources, or seeds, and creating their own web archive collections. We focus on the collections within Archive-It, a subscription service started by the Internet Archive in 2005 for the purpose of allowing organizations to create their own collections of archived web pages, or mementos. Understanding these collections could be done via their user-supplied metadata or via text analysis, but the metadata is applied inconsistently between collections and some Archive-It collections consist of hundreds of thousands of seeds, making it costly in terms of time to download each memento. Our work proposes using structural metadata as an additional way to understand these collections. We explore structural features currently existing in these collections that can unveil curation and crawling behaviors. We adapt the concept of the collection growth curve for understanding Archive-It collection curation and crawling behavior. We also introduce several seed features and come to an understanding of the diversity of resources that make up a collection. Finally, we use the descriptions of each collection to identify four semantic categories of Archive-It collections. Using the identified structural features, we reviewed the results of runs with 20 classifiers and are able to predict the semantic category of a collection using a Random Forest classifier with a weighted average F1 score of 0.720, thus bridging the structural to the descriptive. Our method is useful because it saves the researcher time and bandwidth. Identifying collections by their semantic category allows further downstream processing to be tailored to these categories.
Web archive collections are created with a particular purpose in mind. A curator selects seeds, or original resources, which are then captured by an archiving system and stored as archived web pages, or mementos. The systems that build web archive collections are often configured to revisit the same original resource multiple times. This is incredibly useful for understanding an unfolding news story or the evolution of an organization. Unfortunately, over time, some of these original resources can go off-topic and no longer suit the purpose for which the collection was originally created. They can go off-topic due to web site redesigns, changes in domain ownership, financial issues, hacking, technical problems, or because their content has moved on from the original topic. Even though they are off-topic, the archiving system will still capture them, thus it becomes imperative to anyone performing research on these collections to identify these off-topic mementos. Hence, we present the Off-Topic Memento Toolkit, which allows users to detect off-topic mementos within web archive collections. The mementos identified by this toolkit can then be separately removed from a collection or merely excluded from downstream analysis. The following similarity measures are available: byte count, word count, cosine similarity, Jaccard distance, Sörensen-Dice distance, Simhash using raw text content, Simhash using term frequency, and Latent Semantic Indexing via the gensim library. We document the implementation of each of these similarity measures. We possess a gold standard dataset generated by manual analysis, which contains both off-topic and on-topic mementos. Using this gold standard dataset, we establish a default threshold corresponding to the best F1 score for each measure. We also provide an overview of potential future directions that the toolkit may take.
In this paper, we explore the use of Memento with the Internet Archive as a means of avoiding spoilers in fan wikis. We conduct two experiments: one to determine the probability of encountering a spoiler when using Memento with the Internet Archive for a given wiki page, and a second to determine which date prior to an episode to choose when trying to avoid spoilers for that specific episode.
A reader who visits a web at large resource by following a URI reference in an article, some time after its publication, is led to believe that the resource’s content is representative of what the author originally referenced. However, due to the dynamic nature of the web, that may very well not be the case. We reuse a dataset from a previous study in which several authors of this paper were involved, and investigate to what extent the textual content of web at large resources referenced in a vast collection of Science, Technology, and Medicine (STM) articles published between 1997 and 2012 has remained stable since the publication of the referencing article.
We quantify the extent to which references to papers in scholarly literature use persistent HTTP URIs that leverage the Digital Object Identifier infrastructure. We find a significant number of references that do not, speculate why authors would use brittle URIs when persistent ones are available, and propose an approach to alleviate the problem.
In the course of conducting a study with almost 700,000 web pages, we encountered issues acquiring mementos and extracting text from them. The acquisition of memento content via HTTP is expected to be a relatively painless exercise, but we have found cases to the contrary. For the benefit of others acquiring mementos across many web archives, we document those experiences here.
Enterprising readers might browse the wiki in a web archive so as to view the page prior to a specific episode date and thereby avoid spoilers. We find that when accessing fan wiki pages in the Internet Archive there is as much as a 66% chance of encountering a spoiler.
Enterprising readers might browse the wiki in a web archive so as to view the page prior to a specific episode date and thereby avoid spoilers. We quantify how the current heuristic used for choosing an archived web page based on a date is inadequate for avoiding spoilers, analyzing data collected from fan wikis and the Internet Archive. We find that when accessing fan wiki pages in the Internet Archive there is as much as a 66% chance of encountering a spoiler.
We have implemented the Memento MediaWiki Extension Version 2.0, which brings the Memento Protocol to MediaWiki, used by Wikipedia and the Wikimedia Foundation. Test results show that the extension has a negligible impact on performance.