The Virtues and Limits of Cataloging

The first wave of digital evangelism has passed. With the dot-com bubble burst and with the help of John Seely Brown and Paul Duguid, among others, we are much less attracted to the pitch that all problems will be solved by the application of massive amounts of data.

Around the same time, Google proved that there is an extremely usable middle ground between cataloged, curated information sets and hopelessly disjoint stacks of data. Users increasingly choose the convenience of Google, and more recently Wikipedia, Flickr, and YouTube, over the authoritative thoroughness of library-mediated research.

Librarians cringe at amateur cataloging. It’s like home dentistry. Google’s black-box PageRank reflects the “uniquely democratic nature of the web” by choosing relevant information based on proxies for trust, reputation, and authoritativeness rather than expert assessments of those qualities. Flickr and YouTube use “Web 2.0” social tagging techniques to roughly categorize content. Even non-librarians can appreciate the pitfalls of letting just anyone add metadata: they’ll get it wrong.

But talking to librarians, I’ve started to appreciate whole other levels of control over the process and content of cataloging. IANAL — I am not a Librarian. The following discussion is for entertainment purposes only.

First of all, current implementations of tagging are “flat” – there is no meta-meta-data about what type of label a tag represents. Is “Mona Lisa” the title, author, location, genre, art movement? Is it the band whose album cover this painting appears on? All I know based on Flickr is that it’s been tagged “Mona Lisa”. Librarians have invented cataloging standards, e.g. MARC, VRA, and Dublin Core, which specify and standardize the types of metadata.
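To make the difference concrete, here is a rough sketch in Python (my own illustration, not how Flickr or any catalog system actually stores things). The keys of `typed_record` are element names borrowed from Dublin Core; the values and the helper `roles_of` are made up for the example.

```python
# Flat tagging: every label lives in one undifferentiated set.
flat_tags = {"Mona Lisa", "Leonardo da Vinci", "Louvre", "portrait"}

# Typed metadata: each value is filed under a named element.
# The keys are Dublin Core element names; the values are illustrative only.
typed_record = {
    "title": "Mona Lisa",
    "creator": "Leonardo da Vinci",
    "type": "painting",
    "subject": "portrait",
}

def roles_of(value, record):
    """Return the element names under which `value` appears in a typed record."""
    return [element for element, v in record.items() if v == value]

# The flat set can only say that the label is present.
print("Mona Lisa" in flat_tags)
# The typed record can also say what kind of label it is.
print(roles_of("Mona Lisa", typed_record))
```

With the flat set, a search can only ask whether a label is present. With the typed record, it can ask for items whose title is “Mona Lisa”, which is the distinction the cataloging standards encode.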

Second, and fairly obviously, tags are susceptible to spelling errors and multiple listings for the same category (“Italian”, “italian”, “italy”, “Italians”… what do I search for?). This is a case where more information is less informative. With each variation in tag formulation, the content in that category is further fractured. Librarians have implemented controlled vocabularies to handle this issue.
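A toy version of a controlled vocabulary might look like the sketch below. The mapping `CONTROLLED_TERMS` and both function names are hypothetical, and deciding whether “italy” belongs with “Italian” or stands as its own term is exactly the kind of call a cataloger would make; I have just picked one.

```python
# Hypothetical controlled vocabulary: each known variant maps to one preferred term.
CONTROLLED_TERMS = {
    "italian": "Italian",
    "italians": "Italian",
    "italy": "Italy",
}

def normalize(tag):
    """Return the preferred term for a tag, or None if the tag is not in the vocabulary."""
    return CONTROLLED_TERMS.get(tag.strip().lower())

def group_by_term(tagged_items):
    """Group item ids by preferred term; unrecognized tags are collected under None."""
    groups = {}
    for item_id, tag in tagged_items:
        groups.setdefault(normalize(tag), []).append(item_id)
    return groups

items = [(1, "Italian"), (2, "italian"), (3, "Italians"), (4, "italy"), (5, "Itallian")]
print(group_by_term(items))
```

The first three items end up in one category instead of three, and the misspelled tag is set aside for review rather than silently creating a new category.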

Third, even if you know what attribute you’re supposed to label, and have a controlled list of values to pick from, ambiguity may persist in how to summarize multiple or conflicting facts about the object. Things get especially tricky when trying to decide between cataloging the object or the representation of the object — a sort of existential problem for digital objects, where there is no “original”. Librarians have entire rulebooks to sort out these issues.

The bottom line is, you need a degree in Library Science to do this right. Where does that leave the vast amount of digital information that is piling up? Early projections are speculative, but according to this study, the amount of information in the world doubled in roughly three years. It is a serious and valid question whether it will be humanly possible to catalog even the fraction of information that we find worth keeping.

This limitation, combined with the competition of amateur, democratic, rough-and-ready categorization, means that professional cataloging must adapt or die. The good news is that the information technology community is finally ready to hear what the librarians have been saying. The brute-force, flat-data approach doesn’t scale.

One intriguing approach to finding answers is Amazon’s Mechanical Turk. Just like the hoax robot it’s named after, this system looks like artificial intelligence, but uses human intelligence at its core. As with the recently retired Google Answers, the problem is one of scale. Paying enough experts to spend enough time to answer the questions is just too expensive. As you scale up, paying less for less-expert answers, sooner or later you’re paying next to nothing for amateur opinions: the original problematic situation.

Eventually, robots might catalog for us. (Librarians shudder.) What we now know is just how far away that is – bot catalogers will need much better AI than currently exists. But in order for this project to even be possible, we have to make our data bot-readable. That means implementing some of the cataloging technologies invented and refined by librarians over the centuries.

We need to standardize metadata format and content. Digital resources need not only metadata but also meta-metadata describing the standards they conform to. Catalog and search solutions need to read this information and pass it on when communicating with other systems.
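As a minimal sketch of what that could mean in practice, a record can carry a declaration of the standard it follows, and a consuming system can check that declaration before trusting the fields. Everything here is an assumption for illustration: the `conforms_to` key, the `KNOWN_STANDARDS` registry, and the `read_record` function are invented names, and the element list is only a subset of Dublin Core.

```python
# Hypothetical registry of metadata standards this system understands.
# The element names listed are a subset of the Dublin Core element set.
KNOWN_STANDARDS = {
    "dublin-core": {"title", "creator", "subject", "date", "type"},
}

def read_record(record):
    """Return (standard, fields) if the record declares a known standard
    and uses only that standard's elements; otherwise raise ValueError."""
    standard = record.get("conforms_to")
    if standard not in KNOWN_STANDARDS:
        raise ValueError(f"unknown or missing standard: {standard!r}")
    unexpected = set(record["fields"]) - KNOWN_STANDARDS[standard]
    if unexpected:
        raise ValueError(f"fields not defined by {standard}: {sorted(unexpected)}")
    return standard, record["fields"]

record = {
    "conforms_to": "dublin-core",  # the declaration of which standard applies
    "fields": {"title": "Mona Lisa", "creator": "Leonardo da Vinci"},
}
print(read_record(record))
```

A record with no declaration, or with fields the declared standard does not define, is rejected instead of being guessed at, and the declaration travels with the record when it is passed to another system.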

The Semantic Web specifications put forth by the W3C and the XML Metadata Interchange standard are steps in that direction. We owe it to ourselves to give the bots a fair shot at this task. Using open standards is the first step.

13 replies on “The Virtues and Limits of Cataloging”

@ David:

Good points. Wikipedia’s disambiguation has 3 unattractive features: it’s tangential to, rather than a gateway for, the ambiguous pages; it’s indexed and searched separately (note that this would not be solved by a change in the first feature); most saliently, it’s equally subject to bad edits and nonstandard decision-making.

Since Wikipedia is usually not a primary source, you probably shouldn’t cite it in that context. On the other hand, when you need to refer to a common understanding of, or debate over, an issue, it makes perfect sense to point to the collaborative consensus that a Wikipedia entry represents. I’ve already cited it many times in this blog.

Very interesting, and a very true thesis. The web as an information tool could be easier to use, more precise, and even validated as to source and authenticity. Who will catalog the internet, and by what or whose standards? And how can this be done while maintaining the free-flowing, participatory nature of the web that has vastly improved everyone’s access to information and thought?

The library profession is coming at this in a slightly oblique way, by introducing the notion of Library 2.0. It keeps the professional classification of information as a sort of bulwark between us, who know and have standards, and you, who don’t but whom we serve, while inviting a general response to and engagement with it, such as making library online catalogs and websites more like Amazon.com. The public (!) is invited to react to the entries in the catalog and write reviews of the materials therein. Also, like Amazon, some public libraries now contact the public (we call you “patrons”) to let them know when books of possible interest are available (“If you liked ‘Heartbreaking Work’, may we recommend ‘What Is the What’?”).

Search Library 2.0. You’ll find some interesting articles.

Nice post, Eli. I’m glad to see that we LIBRARIANS/ARCHIVISTS are being heard by at least one information technologist! Thanks for listening and helping us work through this scary amount of info. Look forward to more postings.

An IT Guy Looks at Cataloging Standards

Here’s a thoughtful post by Eli Jacobowitz, Manager of Digital Technology in Academic Computing at Sarah Lawrence College. It’s titled The Virtues and Limits of Cataloging. He looks at traditional library cataloging within the context of the ascent o…
