A protester throwing cookies at the parliament.
Here are some things that caught our ear this fine Thursday at the International Internet Preservation Consortium web archiving conference:
-
Tom Storrar at the UK Government Web Archive reports on a user research project: ~20 in-person interviews and ~130 WAMMI surveys, resulting in 5 character vignettes. “WAMMI” replaces “WASAPI” as our favorite acronym.
-
How do we integrate user research into day-to-day development? We’ll be chewing more on that one.
-
Jefferson Bailey shares the Internet Archive’s learnings (ups and downs) with Archive-It Research Services. Projects from the last year include .GOV (100TB of .gov data in a Hadoop cluster donated by Altiscale), the L3S Alexandria Project, and something we didn’t catch with Ian Milligan at Archives.ca.
-
You too can learn archive research with Vinay Goel’s Archive Research Services Workshop.
-
PLUS, Jefferson threw in some amazing stuff we still haven’t quite figured out involving IPython Notebooks with connections to big data sets.
-
What the WAT? We hear a lot about WATs this year. Common Crawl has a good explainer.
-
Ditte Laursen sets out to answer a big research question: “What does Danish web look like?” What is the shape of .dk? Eld Zierau reports that in a comparison of the Royal Danish Library’s .dk collection with the Internet Archive’s collection of Danish-language sites, only something like 10% were in both.
-
Hugo Huurdeman asks an important question: what exactly _is_ a website? Is it a host, a domain, or a set of pages that share the same CSS? To visualize change in whatever that is, he uses ssdeep, a fuzzy hashing mechanism for page comparison.
-
Let’s just pause to say how inspiring this all is. It’s at about this point in the day that we started totally rethinking a project we’ve been working on for months.
-
Justin Littman shares the Social Feed Manager, his happenin’ stack to harvest tweets and such.
-
We learned that TWARC is either twerking for WARCs or a Twitter-harvesting Python package – we’re not entirely sure. Either way it’s our new new favorite acronym. Sorry, WAMMI.
-
Nick Ruest and Ian Milligan give a very cool talk about sifting through hashtagged content on Twitter. Did you know that researchers have only 7-9 days to grab tweets under a hashtag before Twitter makes the full stream available only for a fee? (We did not know that.)
-
We were also impressed by Canada’s huge amount of political social media engagement. Even though Canada isn’t a huge country [Ian’s words, not ours], 55,000 tweets were generated in one day with the #elxn42 tag.
-
Fernando Melo of Arquivo.pt pointed out that the struggle is real with live-web leaks in his research comparing OpenWayback and pywb. Fernando says in his tests OpenWayback was faster but pywb has higher-quality playbacks (more successes, fewer leaks). Both tools are expected to improve soon. We say it’s time for something like arewefastyet.com to make this a proper competition.
-
Nicola Bingham is self-deprecating about the British Library’s extensive QA efforts: “This talk title isn’t quite right because it implies that we have Quality Assurance Practices in the Post Legal Deposit Environment.” They use the Web Curator Tool QA Module, but are having to go beyond that for domain-scale archiving.
-
We’re also curious about this paper: Current Quality Assurance Practices in Web Archiving.
-
Todd Stoffer demos NC State’s QA tool, a clever blend of tools like Google Forms, Trello, and IFTTT that lets student employees provide archive feedback during downtime. Here are Todd’s [snazzy HTML/JS] slides.
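For the curious, here is a minimal sketch of the idea behind Hugo’s page comparison: score how similar two captures of a page are, then flag a change when the score drops. To be clear, this does not use ssdeep. We swap in Python’s standard-library `difflib` as a stand-in so the example runs without extra installs. The function names, the sample HTML, and the 0.9 threshold are all our own assumptions, not anything from the talk.

```python
from difflib import SequenceMatcher


def page_similarity(old_html: str, new_html: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two page captures.

    Stand-in for a fuzzy-hash comparison: difflib compares the raw text
    directly rather than comparing compact hashes as ssdeep would.
    """
    # autojunk=False keeps difflib from discarding frequent characters
    # on long inputs, which would skew the score for large pages.
    return SequenceMatcher(None, old_html, new_html, autojunk=False).ratio()


def has_changed(old_html: str, new_html: str, threshold: float = 0.9) -> bool:
    """Flag a capture as changed when similarity falls below the threshold.

    The 0.9 default is an arbitrary, illustrative cutoff.
    """
    return page_similarity(old_html, new_html) < threshold


# Hypothetical captures of the same page at different times.
capture_a = "<html><body><h1>Welcome</h1><p>News from 2015</p></body></html>"
capture_b = "<html><body><h1>Welcome</h1><p>News from 2016</p></body></html>"
capture_c = "<main><article>Completely new layout</article></main>"

minor_edit_score = page_similarity(capture_a, capture_b)
redesign_score = page_similarity(capture_a, capture_c)
```

A small text edit should score close to 1.0 and a redesign noticeably lower. A real fuzzy hash has the advantage that you only need to store and compare short digests, not the full captures.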
TL;DR: lots of exciting things happening in the archiving world. Also exciting: the Icelandic political landscape. On the way to dinner, the team happened upon a relatively small protest right outside of the parliament. There was pot clanging, oil barrel banging, and an interesting use of an active smoke alarm machine as a noise maker. We were also handed “red cards” to wave at the government.
Now we’re off to look for the northern lights!