The first wave of digital evangelism has passed. After the bursting of the dot-com bubble, and with the help of John Seely Brown and Paul Duguid, among others, we are much less attracted to the pitch that every problem will be solved by the application of massive amounts of data.
Around the same time, Google proved that there is an extremely usable middle ground between cataloged, curated information sets and hopelessly disjoint stacks of data. Users increasingly choose the convenience of Google, and more recently Wikipedia, Flickr, and YouTube, over the authoritative thoroughness of library-mediated research.
Librarians cringe at amateur cataloging. It’s like home dentistry. Google’s black-box PageRank reflects the “uniquely democratic nature of the web” by choosing relevant information based on proxies for trust, reputation, and authoritativeness (not expert assessments of those qualities). Flickr and YouTube use “Web 2.0” social tagging techniques to roughly categorize content. Even non-librarians can appreciate the pitfalls of letting just anyone add meta-data – they’ll get it wrong.
But talking to librarians, I’ve started to appreciate whole other levels of control over the process and content of cataloging. IANAL — I am not a Librarian. The following discussion is for entertainment purposes only.
First of all, current implementations of tagging are “flat” – there is no meta-meta-data about what type of label a tag represents. Is “Mona Lisa” the title, the artist, a location, a genre, an art movement? Is it the band whose album cover this painting appears on? All I know based on Flickr is that it’s been tagged “Mona Lisa”. Librarians have invented cataloging standards, e.g. MARC, VRA Core, and Dublin Core, which specify and standardize the types of metadata.
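The difference is easy to see in a few lines of code. Here is a minimal sketch contrasting a flat tag list with a typed record; the field names are real Dublin Core element names, but the record values and the `works_by` helper are illustrative, not any particular system’s API.

```python
# A flat tag list: nothing says what role each tag plays.
flat_tags = ["Mona Lisa", "Leonardo da Vinci", "Louvre", "Renaissance"]

# The same facts as typed metadata, using a few Dublin Core
# element names (title, creator, coverage, subject).
typed_record = {
    "dc:title": "Mona Lisa",
    "dc:creator": "Leonardo da Vinci",
    "dc:coverage": "Louvre, Paris",
    "dc:subject": "Renaissance painting",
}

def works_by(records, creator):
    """With typed fields, 'works created by X' is an unambiguous query."""
    return [r for r in records if r.get("dc:creator") == creator]

print(works_by([typed_record], "Leonardo da Vinci"))
```

Against the flat list, the same question is unanswerable: “Leonardo da Vinci” might be a creator, a subject, or the name of a café where the photo was taken.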
Second, and fairly obviously, tags are susceptible to spelling errors and multiple listings for the same category (“Italian”, “italian”, “italy”, “Italians”… what do I search for?). This is a case where more information is less informative. With each variation in tag formulation, the content in that category is further fractured. Librarians have implemented controlled vocabularies to handle this issue.
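A controlled vocabulary can be sketched as a simple lookup from free-form variants to one preferred term. The mapping below is hypothetical, just enough to show how the fractured “Italian/italy/Italians” tags collapse into a single searchable category:

```python
# Hypothetical controlled vocabulary: free-form tag variants
# mapped to a single preferred term.
PREFERRED = {
    "italian": "Italy",
    "italy": "Italy",
    "italians": "Italy",
}

def normalize(tag):
    # Fall back to the raw tag when no preferred term is known.
    return PREFERRED.get(tag.strip().lower(), tag)

tags = ["Italian", "italian", "italy", "Italians"]
print(sorted({normalize(t) for t in tags}))  # → ['Italy']
```

Real controlled vocabularies also handle synonyms, broader/narrower terms, and “see also” references, but the core move is the same: many surface forms, one authoritative term.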
Third, even if you know what attribute you’re supposed to label, and have a controlled list of values to pick from, ambiguity may persist in how to summarize multiple or conflicting facts about the object. Things get especially tricky when trying to decide between cataloging the object or the representation of the object — a sort of existential problem for digital objects, where there is no “original”. Librarians have entire rulebooks to sort out these issues.
The bottom line is, you need a degree in Library Science to do this right. Where does that leave the vast amount of digital information that is piling up? Early projections are speculative, but according to this study, the amount of information in the world doubled in roughly three years. It is a serious and valid question whether it will be humanly possible to catalog even the fraction of information that we find worth keeping.
This limitation, combined with the competition from amateur, democratic, rough-and-ready categorization, means that professional cataloging must adapt or die. The good news is that the information technology community is finally ready to hear what the librarians have been saying. The brute force, flat data approach doesn’t scale.
One intriguing approach to finding answers is Amazon’s Mechanical Turk. Like the chess-playing hoax automaton it’s named after, the system looks like artificial intelligence but uses human intelligence at its core. As with the recently retired Google Answers, the problem is one of scale. Paying enough experts to spend enough time answering the questions is just too expensive. As you scale up, paying less for less-expert answers, sooner or later you’re paying next to nothing for amateur opinions: the original problematic situation.
Eventually, robots might catalog for us. (Librarians shudder.) What we now know is just how far away that is – bot catalogers will need much better AI than currently exists. But in order for this project to even be possible, we have to make our data bot-readable. That means implementing some of the cataloging technologies invented and refined by librarians over the centuries.
We need to standardize meta-data format and content. Digital resources need not only meta-data but also meta-meta-data describing the standards they conform to. Catalog and search solutions need to read this information and pass it on when communicating with other systems.
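What that meta-meta-data buys us can be sketched in a few lines. The schema URI below is the real Dublin Core element-set namespace, and “conformsTo” echoes a Dublin Core terms property, but the record shape and the `can_interpret` helper are illustrative assumptions, not a standard API:

```python
# A record carries meta-meta-data: a declaration of which
# standard its descriptive fields follow.
record = {
    "conformsTo": "http://purl.org/dc/elements/1.1/",  # the meta-meta-data
    "fields": {
        "title": "Mona Lisa",
        "creator": "Leonardo da Vinci",
    },
}

def can_interpret(record, known_schemas):
    """A harvester checks the declared standard before trying to parse."""
    return record.get("conformsTo") in known_schemas

print(can_interpret(record, {"http://purl.org/dc/elements/1.1/"}))  # → True
```

A bot that encounters such a record doesn’t have to guess what “creator” means; the declaration tells it which rulebook to open.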
The Semantic Web specifications put forth by the W3C, along with metadata interchange standards like the OMG’s XML Metadata Interchange (XMI), are steps in that direction. We owe it to ourselves to give the bots a fair shot at this task. Using open standards is the first step.