With the death of Storify, I've been examining alternatives for summarizing web archive collections. Key to these summaries are surrogates. I have discovered that there exist services that provide users with embeds. These embeds allow an author to insert a surrogate into the HTML of their blog post or other web page. These containing pages often use the surrogate to further illustrate some concept from the surrounding content. Unfortunately, not all services generate good surrogates for mementos. After some reading, I came to the conclusion that we can fill in the gap with our own embeddable surrogate service: MementoEmbed.
There are two US government websites in danger, the National Guideline Clearinghouse (https://www.guideline.gov) and the National Quality Measures Clearinghouse (https://qualitymeasures.ahrq.gov). Both store medical guidelines. Both will "not be available after July 16, 2018". Seeing as these two sites will be shut down on July 16, 2018, how well are they archived?
At iPres 2018, I will be presenting "The Many Shapes of Archive-It", a paper that focuses on some structural features inherent in Archive-It collections. The paper is now available as a preprint on arXiv. As part of the data gathering for "The Many Shapes of Archive-It", and also as part of the development of the Off-Topic Memento Toolkit, I had to write code that extracts metadata and seeds from public Archive-It collections. This capability will be useful to several aspects of our storytelling and summarization work, so I used the knowledge gained from those projects and produced a standalone Python library named Archive-It Utilities (AIU).
Inspired by AlNoamany's work from "Detecting off-topic pages within TimeMaps in Web archives", I am pleased to announce an alpha release of the Off-Topic Memento Toolkit (OTMT). The results of testing with this software will be presented at iPres 2018, and those results are now available as a preprint.
On June 3, 2018, PhD students arrived in Fort Worth, Texas to attend the Joint Conference on Digital Libraries Doctoral Consortium. This is a pre-conference event associated with the ACM and IEEE-CS Joint Conference on Digital Libraries. This event gives PhD students a forum in which to discuss their dissertation work with others in the field. The Doctoral Consortium was well attended, not only by the presenting PhD students, their advisors/supervisors, and organizers, but also by those who were genuinely interested in emerging work.
Web resources can be represented in a variety of ways. In this blog post I go over work that has been done to create surrogates, or representations of web resources, for use on social media, search engine results, and more.
The Storify platform will be discontinued in May 2018. Here I outline some options for those trying to preserve their work before it disappears.
We engaged in discussions about a very important topic: the preservation of online news content. Brewster Kahle is well known in digital preservation and especially web archiving circles. I tried to cover elements of all presentations while live tweeting during the event, and wish I could go into more detail here, but, as usual, I will only cover a subset.
The crowds descended upon Arlington, Virginia for the 80th annual meeting of the Association for Information Science and Technology. I attended this meeting to learn more about ASIS&T, including its special interest groups. Also attending with me was former ODU Computer Science student and current Los Alamos National Laboratory librarian Valentina Neblitt-Jones. Here I cover the event.
This post is a re-examination of the landscape since AlNoamany's dissertation to see if there are tools other than Storify that the Dark and Stormy Archives project can use. It covers the tools living in the spaces of content curation, storytelling, and social media.
I was fortunate enough to have the opportunity to present Yasmin AlNoamany's work at Web Science 2017. Dr. Nelson offers an excellent class on Web Science, but it has been years since I took it, and I was still uncertain about the current state of the art. Web Science 2017 took place in Troy, a small city in upstate New York that is home to Rensselaer Polytechnic Institute (RPI). The RPI team organized an excellent conference focused on a variety of Web Science topics, including cyberbullying, taxonomies, social media, and ethics.
Though scholars write articles and papers, they also post a lot of content on the web. Datasets, blog posts (like this one), presentations, and more are posted by scholars as part of scholarly communications. What if we could aggregate the content by scholar, instead of by web site?
Given a scholar's identity on a portal, how can we crawl the scholarly portal to ensure that we capture all of their content? In this post, I evaluate a number of scholarly portals to find their boundaries, the URI patterns that allow us to capture the content of a user.
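To make the idea of a boundary concrete, here is a minimal sketch of testing whether a candidate URI falls inside one scholar's portion of a portal. The portal hostname, the `/users/<name>` layout, and the `in_boundary` helper are all hypothetical and chosen only for illustration; real portals each have their own URI patterns.

```python
import re

# Hypothetical portal and URI layout, for illustration only.
# A real portal's boundary pattern has to be worked out by inspection.
USER_BOUNDARY = r"^https://portal\.example\.org/users/{user}(/.*)?$"


def in_boundary(uri, user):
    """Return True if uri belongs to the given user's part of the portal."""
    pattern = USER_BOUNDARY.format(user=re.escape(user))
    return re.match(pattern, uri) is not None


candidates = [
    "https://portal.example.org/users/jdoe",
    "https://portal.example.org/users/jdoe/datasets/1",
    "https://portal.example.org/users/jdoe2/datasets/9",
    "https://portal.example.org/about",
]

# Keep only the URIs a crawler should follow for the user "jdoe".
to_crawl = [uri for uri in candidates if in_boundary(uri, "jdoe")]
print(to_crawl)
```

A crawler could apply a check like this to every discovered link so that it stays within one scholar's content instead of wandering across the whole portal.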
In this post, I examine different trusted timestamping methods. I start with some of the more traditional methods before discussing OriginStamp, a solution by Gipp, Meuschke, and Gernandt that uses the Bitcoin blockchain for timestamping.
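Trusted timestamping schemes generally work on a digest of the content rather than on the content itself. The sketch below shows only that first step, computing a digest that could later be submitted to a timestamping service. The choice of SHA-256 and the `content_digest` helper are assumptions made for illustration; this does not call OriginStamp or any other service.

```python
import hashlib


def content_digest(content):
    """Return a hex SHA-256 digest of the given bytes.

    SHA-256 is an assumption here; a given timestamping service
    may require a different hash function.
    """
    return hashlib.sha256(content).hexdigest()


original = b"<html><body>Archived page content</body></html>"
altered = b"<html><body>Archived page content!</body></html>"

# The same bytes always produce the same digest, while any change to the
# content produces a different one, so a timestamped digest can later be
# compared against a fresh digest of the content in question.
print(content_digest(original))
print(content_digest(original) == content_digest(altered))
```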
As we celebrate the 20th anniversary of the Internet Archive, I realize that using Memento and the Wayback Machine has become second nature when solving certain problems, not only in my research, but also in my life. Those who have read my Master's Thesis, Avoiding Spoilers on Mediawiki Fan Sites Using Memento, know that I am a fan of many fictional television shows and movies. URIs are discussed in these fictional worlds, and sometimes the people making the fiction actually register these URIs, seen in the example below, creating an additional vector for fans to find information on their favorite characters and worlds.
We are pleased to report that the W3C has embraced Memento for versioning its specifications and its wiki. Completing this effort required collaboration between the W3C and the Los Alamos National Laboratory (LANL) Research Library Prototyping Team. Here we inform others of the brief history of this effort and provide an overview of the technical aspects of the work done to make Memento at the W3C.
In a previous post, we discussed a way to use the existing Memento protocol combined with link headers to access unaltered (raw) archived web content. Interest in unaltered content has grown as more use cases arise for web archives. Ilya Kremer and David Rosenthal had previously suggested that a new dimension of content negotiation would be necessary to allow clients to access unaltered content. That idea was not originally pursued, because it would have required the standardization of new HTTP headers. At the time, none of us were aware of the standard Prefer header from RFC7240. Prefer can solve this problem in an intuitive way much like their original suggestion of content negotiation.
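As a rough sketch of what this looks like from a client's point of view, the snippet below builds (but does not send) an HTTP request that carries a `Prefer` header. The archive URI and the preference token `raw` are placeholders invented for this example; the actual preference names an archive understands are not specified here.

```python
from urllib.request import Request


def build_raw_memento_request(urim, preference="raw"):
    """Build a request for a memento that asks for unaltered content.

    The default preference token is a hypothetical placeholder, not a
    value defined by RFC 7240 or by any particular web archive.
    """
    return Request(urim, headers={"Prefer": preference})


# Hypothetical memento URI, for illustration only.
req = build_raw_memento_request(
    "https://archive.example.org/web/20160101000000/http://example.com/"
)
print(req.get_header("Prefer"))
```

Under RFC 7240, a server that honors a preference can say so with the `Preference-Applied` response header, which gives the client a way to tell whether it actually received unaltered content.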
On June 16, 2016, the Library of Congress hosted a one-day symposium entitled Saving the Web: The Ethics and Challenges of Preserving What's on the Internet.
I was fortunate to present a poster at the 25th International World Wide Web Conference, held April 11-15, 2016. Though my primary mission was to represent both the WS-DL and the LANL Prototyping Group, I gained a better appreciation for the state of the art of the World Wide Web. The conference was held in Montréal, Canada at the Palais des congrès de Montréal.
Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives. These web pages spanned much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe that this is an unparalleled attempt to acquire and extract text from mementos themselves.