Individual web archive collections can contain thousands of documents. If a researcher wants to use one of these collections, which one best meets their information need? How does the researcher differentiate them? Ideally, a user could glance at a visualization and gain an understanding of the collection, but existing visualizations demand substantial cognitive load and training just to convey a single aspect of a collection. Social media storytelling provides us with an approach. We want to use this proven technique because readers already understand how to view these visualizations. The Dark and Stormy Archives (DSA) Project explores how to summarize web archive collections through these visualizations. We make our DSA Toolkit freely available so that others can explore web archive collections through storytelling.
On Friday, Twitter suspended Donald Trump's account due to concerns that his current and future tweets might continue to foment violence in the United States. Hayes Brown from MSNBC and Marshall Cohen from CNN echoed a concern I had when developing MementoEmbed: what happens to the embed if the source material or a service managing the embed goes away?
On July 25, more than 1,300 registrants around the world opened their laptops and started attending SIGIR 2020. Via Zoom and the streaming capabilities of the conference web portal, we were able to watch speakers, raise hands, ask questions, and chat with attendees. SIGIR 2020 was in Xi'an, China, but few registrants could attend in person due to the travel restrictions imposed to curb the COVID-19 pandemic. The SIGIR 2020 program committee successfully converted their in-person conference to an online variant. Here I summarize the conference.
Links on the web break all the time. We frequently experience the infamous “404 – Page not found” message, also known as “a broken link” or “link rot.” Sometimes we follow a link and discover that the linked page has significantly changed and its content no longer represents what was originally referenced, a scenario known as “content drift.” Both link rot and content drift are forms of “reference rot,” a significant detriment to our web experience. In the realm of scholarly communication, where we increasingly reference web resources such as blog posts, source code, videos, social media posts, and datasets in our manuscripts, we recognize that we are losing our scholarly record to reference rot.
In Part 1, we introduced Hypercane, a tool for automatically sampling mementos from web archive collections. Web archive collections consist of thousands of documents, and humans need tools to intelligently select mementos for a given purpose. Hypercane's goal is to supply us with a list of memento URI-Ms derived from the input we provide. In Part 2, I highlighted how Hypercane's synthesize action converts its input into other formats, like JSON for Raintale stories, WARCs for the Archives Unleashed Toolkit, or boilerplate-free files for Gensim. This post focuses on the advanced actions that serve as the primitives of Hypercane's sampling algorithms. We can mix and match these primitives to arrive at the sample that best meets our needs.
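To illustrate the mix-and-match idea, here is a minimal sketch, NOT Hypercane's actual API: each hypothetical primitive accepts and returns a list of memento records, so primitives can be chained in any order to form a sampling algorithm. The record fields (`topic_score`, `digest`, `datetime`) and the sample data are illustrative.

```python
def remove_offtopic(records, threshold=0.3):
    """Drop mementos whose (precomputed) topic score falls below threshold."""
    return [r for r in records if r["topic_score"] >= threshold]

def deduplicate(records):
    """Keep only the first memento seen for each content digest."""
    seen, kept = set(), []
    for r in records:
        if r["digest"] not in seen:
            seen.add(r["digest"])
            kept.append(r)
    return kept

def take_earliest(records, k):
    """Select the k earliest mementos by Memento-Datetime."""
    return sorted(records, key=lambda r: r["datetime"])[:k]

records = [
    {"urim": "urim-a", "topic_score": 0.9, "digest": "d1", "datetime": "2017-03-01"},
    {"urim": "urim-b", "topic_score": 0.1, "digest": "d2", "datetime": "2017-01-01"},
    {"urim": "urim-c", "topic_score": 0.8, "digest": "d1", "datetime": "2017-02-01"},
    {"urim": "urim-d", "topic_score": 0.7, "digest": "d3", "datetime": "2017-04-01"},
]

# Chaining the primitives yields one possible sampling algorithm.
sample = take_earliest(deduplicate(remove_offtopic(records)), k=2)
```

Because each primitive has the same input and output shape, swapping the order of operations, or substituting a different selection primitive, yields a different sampling algorithm with no other changes.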
In Part 1 of this series of blog posts, I introduced Hypercane, a tool for automatically sampling mementos from web archive collections. If a human wishes to create a sample of documents from a web archive collection, they are confronted with thousands of documents from which to choose, and most collections contain insufficient metadata for making such decisions. Hypercane's focus is to supply us with a list of memento URI-Ms derived from the input we provide. One of the uses for this sampling is summarization. The previous blog post in this series focused on Hypercane's high-level sample and report actions and how they can be used for storytelling. This post focuses on how to generate output for other tools via Hypercane's synthesize action.
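The core of the synthesize idea can be sketched as follows: convert a sample of URI-Ms into a JSON story file that a storytelling tool can render. The field names here are illustrative, not Raintale's actual input schema.

```python
import json

def urims_to_story(title, urims):
    """Wrap a list of URI-Ms in a (hypothetical) JSON story structure."""
    return json.dumps({
        "title": title,
        "elements": [{"type": "link", "value": u} for u in urims],
    }, indent=2)

story_json = urims_to_story("Hurricane Coverage", [
    "https://wayback.archive-it.org/2950/20130515170155/http://example.com/",
])
```

The same sample could instead be written out as a list of records for a WARC downloader or as plain-text files for a topic-modeling tool; only the serialization step changes.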
Yasmin AlNoamany experimented with summarizing a web collection by choosing a small number of exemplars and then visualizing them with social media storytelling. This is in contrast to approaches that try to account for all members of the collection. When I took over the Dark and Stormy Archives project from her in 2017, the goal was to improve upon her excellent work. Her existing code relied heavily upon the Storify platform to render its stories. Storify was discontinued in May 2018. We discovered that other platforms rendered mementos poorly, so we developed MementoEmbed to render individual surrogates and later Raintale to render whole stories. We discovered that cards are probably the best surrogate for stories. We now publish stories to the DSA-Puddles web site on a regular basis. Up to this point, we have relied upon sources such as Nwala's StoryGraph or human selection to generate the list of mementos rendered in our stories. The document selection is key to the entire process. What tool can we rely on to automate the selection of mementos for these stories and other purposes? Hypercane.
My research focuses on summarizing existing web archive collections through social media storytelling. For this effort, we developed Raintale to tell the stories produced by a selection of mementos. Collections exist at various web archives, like Archive-It and the UK Web Archive. As shown by Klein et al., we can build collections of mementos by conducting focused crawling of web archives. Raintale works well for these cases involving existing mementos, but what if we want to make a story about live web resources, like current events from the news?
Students, professors, industry experts, and others came to Beijing to attend the 28th ACM International Conference on Information and Knowledge Management (CIKM). This was the first time CIKM had accepted a long paper from the Old Dominion University Web Science and Digital Libraries Research Group (WS-DL) and I was happy to represent us at this prestigious conference.
On October 28, 2019, web archiving experts met with librarians and archivists at the George Washington University in Washington, DC. As part of the Continuing Education to Advance Web Archiving (CEDWARC) effort, we covered several different modules related to tools and technologies for web archives. The event consisted of morning overview presentations and afternoon lab portions. Here I will provide an overview of the topics we covered.
One of the most challenging problems to solve while conducting user studies is recruiting participants. Amazon's Mechanical Turk (MT) solves this problem by providing a marketplace where participants can earn money by completing studies for researchers. This blog post summarizes the lessons I have learned from other studies that have successfully employed MT. I have found parts of this information scattered throughout different bodies of knowledge, but not gathered in one place; thus, I hope it is a useful starting place for future researchers.
Raintale is the latest entry in the Dark and Stormy Archives project. Our goal is to provide research studies and tools for combining web archives and social media storytelling. Raintale provides the storytelling capability. It has been designed to visualize a small number of mementos selected from an immense web archive collection, allowing a user to summarize and visualize the whole collection or a specific aspect of it.
Since 2013, I have been a principal contributor to the Memento MediaWiki Extension. We recently released version 2.2.0 to support MediaWiki versions 1.31.1 and greater. During the extension's development, I have detailed some of its concepts on this blog, presented it at WikiConference USA 2014, and even helped the W3C adopt it. It became the cornerstone of my Master's Thesis, where I showed how the Memento MediaWiki Extension could help people avoid spoilers on fan wikis. Why do Memento and MediaWiki belong together?
On Tuesday, we released our latest pre-print "Social Cards Probably Provide Better Understanding of Web Archive Collections". My work builds on AlNoamany's work of using social media storytelling to provide a visualization that summarizes web archive collections. In previous blog posts I discussed different storytelling services. A key component of their capability to convey understanding is the surrogate, a small visualization of a web page that provides a summary of that page, like the surrogate within the Twitter Tweet example shown below. However, there are many types of surrogates. We want to use a group of surrogates together as a story to provide a summary of a web archive collection. Which type of surrogate works best for helping users understand the underlying collection?
Google+ will be shut down on April 2, 2019. In this blog post I cover how much of Google+ is archived and how to archive its pages.
With the death of Storify, I've been examining alternatives for summarizing web archive collections. Key to these summaries are surrogates. I have discovered that there exist services that provide users with embeds. These embeds allow an author to insert a surrogate into the HTML of their blog post or other web page. These containing pages often use the surrogate to further illustrate some concept from the surrounding content. Unfortunately, not all services generate good surrogates for mementos. After some reading, I came to the conclusion that we can fill in the gap with our own embeddable surrogate service: MementoEmbed.
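As a sketch of how a client might request a surrogate from a MementoEmbed deployment: the endpoint path below follows MementoEmbed's social card service pattern, but treat the host and port as placeholders for your own installation.

```python
def socialcard_url(service_host, urim):
    """Build the URL of the social card service for a given URI-M."""
    return f"{service_host}/services/product/socialcard/{urim}"

url = socialcard_url(
    "http://localhost:5550",
    "https://web.archive.org/web/20180128152127/http://www.cnn.com/",
)
```

Fetching that URL would return an embeddable snippet for the memento, which an author can paste into a blog post or other page.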
Two US government websites are in danger: the National Guideline Clearinghouse (https://www.guideline.gov) and the National Quality Measures Clearinghouse (https://qualitymeasures.ahrq.gov). Both store medical guidelines, and both will "not be available after July 16, 2018". Seeing that these two sites will be shut down on July 16, 2018, how well are they archived?
At iPres 2018, I will be presenting "The Many Shapes of Archive-It", a paper that focuses on some structural features inherent in Archive-It collections. The paper is now available as a preprint on arXiv. As part of the data gathering for "The Many Shapes of Archive-It", and also as part of the development of the Off-Topic Memento Toolkit, I had to write code that extracts metadata and seeds from public Archive-It collections. This capability will be useful to several aspects of our storytelling and summarization work, so I used the knowledge gained from those projects and produced a standalone Python library named Archive-It Utilities (AIU).
Inspired by AlNoamany's work from "Detecting off-topic pages within TimeMaps in Web archives" I am pleased to announce an alpha release of the Off-Topic Memento Toolkit (OTMT). The results of testing with this software will be presented at iPres 2018 and those results are now available as a preprint.
On June 3, 2018, PhD students arrived in Fort Worth, Texas to attend the Joint Conference on Digital Libraries Doctoral Consortium. This is a pre-conference event associated with the ACM and IEEE-CS Joint Conference on Digital Libraries. This event gives PhD students a forum in which to discuss their dissertation work with others in the field. The Doctoral Consortium was well attended, not only by the presenting PhD students, their advisors/supervisors, and organizers, but also by those who were genuinely interested in emerging work.
Web resources can be represented in a variety of ways. In this blog post I go over work that has been done to create surrogates, or representations of web resources, for use on social media, search engine results, and more.
The Storify platform will be discontinued in May 2018. Here I outline some options for those trying to preserve their work before it disappears.
We engaged in discussions about a very important topic: the preservation of online news content. Brewster Kahle is well known in digital preservation and especially web archiving circles. I tried to cover elements of all presentations while live tweeting during the event, and I wish I could go into more detail here, but, as usual, I will only cover a subset.
The crowds descended upon Arlington, Virginia for the 80th annual meeting of the Association for Information Science and Technology. I attended this meeting to learn more about ASIS&T, including its special interest groups. Also attending with me was former ODU Computer Science student and current Los Alamos National Laboratory librarian Valentina Neblitt-Jones. Here I cover the event.
This post is a re-examination of the landscape since AlNoamany's dissertation to see if there are tools other than Storify that the Dark and Stormy Archives project can use. It covers the tools living in the spaces of content curation, storytelling, and social media.
I was fortunate enough to have the opportunity to present Yasmin AlNoamany's work at Web Science 2017. Dr. Nelson offers an excellent class on Web Science, but years had passed since I took it, and I was uncertain about the current state of the art. Web Science 2017 took place in Troy, a small city in upstate New York that is home to Rensselaer Polytechnic Institute (RPI). The RPI team organized an excellent conference focused on a variety of Web Science topics, including cyberbullying, taxonomies, social media, and ethics.
Though scholars write articles and papers, they also post a lot of content on the web. Datasets, blog posts (like this one), presentations, and more are posted by scholars as part of scholarly communications. What if we could aggregate the content by scholar, instead of by web site?
Given a scholar's identity on a portal, how can we crawl the scholarly portal to ensure that we capture all of their content? In this post, I evaluate a number of scholarly portals to find their boundaries, the URI patterns that allow us to capture the content of a user.
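The notion of a portal's boundary can be made concrete with a URI pattern. The pattern below is purely illustrative, for a hypothetical portal where all of a user's content lives under /users/<id>/; each real portal needs its own pattern, which is what this post sets out to find.

```python
import re

# Hypothetical boundary pattern for an imaginary scholarly portal.
USER_BOUNDARY = re.compile(r"^https://portal\.example/users/(?P<user>[^/]+)/")

def belongs_to(uri, user):
    """Return True if the URI falls within the given user's boundary."""
    m = USER_BOUNDARY.match(uri)
    return bool(m) and m.group("user") == user
```

A crawler armed with such a pattern can follow links freely and keep only the pages that fall inside the scholar's boundary.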
In this post, I examine different trusted timestamping methods. I start with some of the more traditional methods before discussing OriginStamp, a solution by Gipp, Meuschke, and Gernandt that uses the Bitcoin blockchain for timestamping.
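The first step of OriginStamp-style trusted timestamping can be sketched as follows: hash the document locally and submit only the digest for anchoring in the Bitcoin blockchain, so the content itself never leaves your machine. The submission step is omitted here; this shows only the local hashing.

```python
import hashlib

def document_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest that would be submitted for timestamping."""
    return hashlib.sha256(content).hexdigest()
```

Anyone who later possesses the same bytes can recompute the digest and compare it against the blockchain record, proving the content existed no later than the anchoring transaction.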
As we celebrate the 20th anniversary of the Internet Archive, I realize that using Memento and the Wayback Machine has become second nature when solving certain problems, not only in my research, but also in my life. Those who have read my Master's Thesis, Avoiding Spoilers on Mediawiki Fan Sites Using Memento, know that I am a fan of many fictional television shows and movies. URIs are discussed in these fictional worlds, and sometimes the people making the fiction actually register these URIs, seen in the example below, creating an additional vector for fans to find information on their favorite characters and worlds.
We are pleased to report that the W3C has embraced Memento for versioning its specifications and its wiki. Completing this effort required collaboration between the W3C and the Los Alamos National Laboratory (LANL) Research Library Prototyping Team. Here we inform others of the brief history of this effort and provide an overview of the technical aspects of the work done to make Memento at the W3C.
In a previous post, we discussed a way to use the existing Memento protocol combined with link headers to access unaltered (raw) archived web content. Interest in unaltered content has grown as more use cases arise for web archives. Ilya Kreymer and David Rosenthal had previously suggested that a new dimension of content negotiation would be necessary to allow clients to access unaltered content. That idea was not originally pursued, because it would have required the standardization of new HTTP headers. At the time, none of us were aware of the standard Prefer header from RFC 7240. Prefer can solve this problem in an intuitive way, much like their original suggestion of content negotiation.
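To make the idea concrete, here is a sketch of what a client request using Prefer (RFC 7240) might look like. The preference token "original-content" and the host are illustrative; the actual tokens are those defined in the proposal discussed in this post.

```python
def raw_memento_request(path, host):
    """Compose a raw HTTP/1.1 request asking the archive for unaltered content."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Prefer: original-content\r\n"
        "\r\n"
    )

request = raw_memento_request(
    "/memento/20180128/http://example.com/", "archive.example"
)
```

A compliant server would echo the applied preference back in a Preference-Applied response header, so the client can tell whether it received raw or rewritten content.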
On June 16, 2016, the Library of Congress hosted a one day Symposium entitled Saving the Web: The Ethics and Challenges of Preserving What's on the Internet.
I was fortunate to present a poster at the 25th International World Wide Web Conference, held from April 11 to April 15, 2016. Though my primary mission was to represent both the WS-DL and the LANL Prototyping Group, I gained a better appreciation for the state of the art of the World Wide Web. The conference was held in Montréal, Canada, at the Palais des congrès de Montréal.
Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives. These web pages span much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe this is an unparalleled attempt to acquire and extract text from mementos themselves.
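The extraction step can be sketched with the standard library alone: strip markup from a memento's HTML while ignoring script and style content. This is a minimal sketch; real pipelines must also remove archive-inserted banners and page boilerplate, which this omits.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Running such an extractor over hundreds of thousands of mementos produces the plain-text corpus on which the rest of the analysis depends.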