Koha Library System

The SLC library currently uses proprietary catalog software. It’s expensive, we can’t add features we want, and it won’t interoperate with our other systems like web servers, image databases, and our learning management system (which is a whole other problem in itself). So everyone was pleased when the opportunity arose to consider a different solution: Koha. It’s an open source integrated library system.

The bad news is that it’s still an immature product and lacks some features we would need, like a reserves module. The good news is that some of the developers close to the project have started a service company, LibLime, which will develop features and customizations and add them to the software. Rather than paying a software license fee to the proprietary vendor, who has little incentive to implement our feature requests, we could directly pay the developers to build the software we want.

LibLime’s approach is to treat customizations as preferences — switches that can be flipped to give different functionality from the same build of the software. This prevents forking and versioning issues, which were my key concerns with mission-critical open source software. The developers themselves take an integrative approach; they seem very interested in developing an extensive feature list in response to what librarians need and dealing with any conflicts at the preference level.
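To make the idea concrete, here is a minimal sketch of preference-driven behavior, with an in-memory SQLite table standing in for the ILS database. The table layout, preference name, and functions are all hypothetical illustrations, not Koha's actual schema or API:

```python
import sqlite3

# One build of the software, many behaviors: every site runs the same code,
# and a preferences table decides which features are switched on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE preferences (name TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO preferences VALUES ('UseReservesModule', 'on')")

def pref(name, default=""):
    """Look up a site-level preference switch."""
    row = conn.execute(
        "SELECT value FROM preferences WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else default

def place_hold(item):
    # Same code path at every library; the preference flips the behavior,
    # so no site needs its own fork of the software.
    if pref("UseReservesModule") == "on":
        return f"hold placed on {item}"
    return "reserves module disabled at this site"

print(place_hold("Walden"))
```

Because a custom feature lives behind a switch rather than in a forked codebase, every site can upgrade to the same new version and simply leave unwanted switches off.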

Often with proprietary software, one preference is forced on all users because that is less work for the developers. By contrast, the paid-development/open-source model means that the developers get paid for exactly how much work they do, so they can afford to do things the hard way if that’s what users want.

Down the road, I’m concerned with making sure that the systems we implement are standards-compliant and talk to each other. The possibility of tying together a catalog/search solution like Koha with a web platform like Plone, another open source software, really raises the prospect of free and easy information flow around campus. The open source model means that these tools keep getting better and more available; what starts in the library and expands to the campus continues to spread across the entire internet.


Blog Archiving Survey

Jessamyn West posts about a survey put out by UNC-Chapel Hill’s School of Information and Library Science. They’re gathering data about bloggers’ habits and perceptions, with an eye to preserving blog content permanently.

It’s an interesting question — many blogs certainly fall into the category of journalism, and would be as useful as newspaper archives for historians and researchers. On the other hand, many bloggers post casually and treat the medium as ephemeral. Twitter takes that approach to the extreme.

In any case, I’m curious to see what the UNC folks make of the results.


Laura Quilter Copyright Talk

Librarian, copyright attorney, and researcher Laura Quilter addressed our Library Copyright group today. She did a great job of giving the group a background on copyright law and how it affects college libraries. She also answered our questions about how we currently try to avoid copyright infringement. I think it went well — participants, please add your comments! Thanks to Sha Fagan and Julie Auster for making Laura’s visit possible.

Here is Laura’s PowerPoint presentation, so I won’t summarize the whole talk, but I did come away with some interesting points that I don’t think were explicit in her notes:

  • Because copyright is privately enforced (there are no copyright police), copyright holders have an incentive to be as aggressive as possible. Therefore their vehemence doesn’t reflect, and shouldn’t be used to measure, the merits of their claims.
  • Copyright is a quasi-property right. Property rights have powerful imagery and we should beware carrying all the implications of property across an analogy to “intellectual property”.
  • Restricting access to on-campus users probably keeps us safe within “educational purposes”, which are specially protected under copyright law.
  • Forwarding an email, if you don’t get permission first, is a prima facie copyright infringement. We engage in many practices that could get us in trouble. We should not concern ourselves with eliminating all behavior that could conceivably lead to a lawsuit, but with reducing the amount and severity of risk.
  • Educational institutions making good-faith efforts to respect copyright can have infringement damages waived and simply be enjoined from the infringing use, if infringement is found. Considering the privileged status of education in copyright law, together we are in a very strong legal position.
  • Every effort we make to comply is a “plus point” to be considered with the other circumstances of possible infringement. The important thing is that we make a good-faith effort to respect copyright.
  • Thinking through our policy is one of the best ways to bolster our case, should we get in trouble.

See also: Copyright Resources


Social Search

Found a neat summary of social search by Arnaud Fischer. Web technology has gone through a few distinct phases. First (early-to-mid 1990s) came digitizing and hyperlinking information, making its interconnectedness literal. Second, Google (1998) revolutionized search: you no longer need to know where information is in order to get it. But, as I’ve previously posted, there are benefits to cataloging information rather than just sifting through an undifferentiated mess. It seems that any algorithm less complex than an intelligent agent is, in addition to being less effective at finding good results, susceptible to manipulation.

Throughout the past decade, a search engine’s most critical success factors – relevance, comprehensiveness, performance, freshness, and ease of use – have remained fairly stable. Relevance is more subjective than ever and must take into consideration the holistic search experience one user at a time. Inferring each user’s intent from a mere 2.1 search terms remains at the core of the relevance challenge.

Social search addresses relevance head-on. After “on-the-page” and “off-the-page” criteria (web connectivity and link authority), relevance is now increasingly augmented by implicit and explicit user behaviors, social networks, and communities.

Attempts to literally harness human judgement to do the work of a cataloging engine (see the Mechanical Turk) don’t scale to internet proportions. What we need is a way to collect social information without imposing a burden on users. Some sites have succeeded in providing a platform where users freely contribute linked content, e.g. Wikipedia, and some have further gotten their users to add cataloging information — e.g. YouTube’s tags. Visualizations like tag clouds make these implementations easier to use, but no deeper. And they still require intentional effort, and therefore goodwill and trust — two of the least scalable human resources.
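A tag cloud, incidentally, is a very simple computation over that user-supplied metadata: scale each tag’s display size by how often it has been applied. A quick sketch, with made-up tag counts and scaling parameters:

```python
import math

# Hypothetical tag frequencies collected from users
tag_counts = {"library": 120, "copyright": 45, "koha": 8, "search": 60}

def cloud_sizes(counts, min_pt=10, max_pt=32):
    """Map tag frequencies to font sizes, log-scaled so a few very
    popular tags don't drown out the long tail."""
    logs = [math.log(c) for c in counts.values()]
    lo, hi = min(logs), max(logs)
    span = (hi - lo) or 1.0  # avoid division by zero if all counts are equal
    return {
        tag: round(min_pt + (math.log(c) - lo) / span * (max_pt - min_pt))
        for tag, c in counts.items()
    }

print(cloud_sizes(tag_counts))
```

The log scaling is the whole trick: it makes the visualization legible, but it adds no new information beyond the raw counts, which is why tag clouds are easier to use without being any deeper.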

I fundamentally agree with Fischer’s conclusion that using self-organizing groups to generate consensus is a much better way to measure relevance. The big question is how to balance the needs of social data collection with the freedom of association that it depends on. The public, machine-readable nature of most web forums amplifies any chilling effect into a snowstorm of suppression. Further, when a web service becomes popular, there is a strong temptation to monetize that popularity with ads, sponsored content, selling user information to marketers, etc. That background probably skews the opinions expressed in that forum, and by extension, the extractable metadata.

But even more fundamentally, there is something different about web discourse that makes participants espouse opinions they normally wouldn’t. Try researching software on a users’ forum: many seemingly sincere opinions are not based on experience but are, in effect, meta-opinions about the consensus answer. In fact, I bet most posters really are sincere — I have caught myself doing this. This reification is what actually creates consensus, and having posted on the accepted side of an issue recursively increases a poster’s reputation. There is no check on this tendency because we lack the normal social status indicators online. I would bet that posters regress toward the mean opinion much faster than they do in offline discourse. Any social scientists out there want to test this out?
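As a thought experiment, the reification effect can be modeled in a few lines: suppose each poster holds a private opinion but, having read the thread first, shifts what they actually post toward the visible consensus. Every parameter here is invented, so this is a sketch of the hypothesis, not evidence for it:

```python
import random
import statistics

random.seed(1)

def run_forum(n_posters=200, conformity=0.7):
    """Each poster has a private opinion in [0, 1], but posts a blend of
    that opinion and the mean of the posts already visible in the thread."""
    private_opinions, posted_opinions = [], []
    for _ in range(n_posters):
        private = random.random()
        if posted_opinions:
            visible_mean = statistics.mean(posted_opinions)
            posted = (1 - conformity) * private + conformity * visible_mean
        else:
            posted = private  # the first poster has no thread to read
        private_opinions.append(private)
        posted_opinions.append(posted)
    return private_opinions, posted_opinions

private, posted = run_forum()
# Posted opinions cluster far more tightly than the opinions people
# actually hold: the forum manufactures its own consensus.
print(statistics.pstdev(private), statistics.pstdev(posted))
```

Even this toy version shows the recursion: each conforming post drags the visible mean along, which in turn pulls the next poster harder, so the spread of posted opinions collapses well below the spread of privately held ones.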

Speaking of which, there are already many empirically supported social biases which affect our judgements and opinions. Are we ready for Search by Stereotype? The tyranny of the majority and the tragedy of the commons are as dangerous to the freedom of (digital) association as government suppression or market influences.

The web as it currently stands, the non-semantic web, is largely non-judgemental in its handling of information. Divergent opinions have equal opportunity to be tagged, linked, and accepted. This non-function is actually a feature. Before we trust online consensus to generate relevance measurements for our social search engines, we need to understand and try to compensate for its biases.


Learning 2.0

Wired runs a story about Learning 2.0, a self-directed, web-based learning tool. It was designed for the Public Library of Charlotte and Mecklenburg County to help their librarians get more web-savvy.

Using tools is the best way to learn them, and this project teaches about web 2.0 technologies by using those technologies. It gives you a nice list of tasks like starting a blog, using Technorati, and playing with YouTube. It’s not just for librarians — this looks like a great way to get familiar with new web tech for anyone who feels a little left behind by some of this new stuff.


Senate Reconsiders Open Access to Research

Wired covers the reconsideration of a stalled Senate bill to require publicly funded research to be made freely available to the public. (See my previous post.)

Sen. John Cornyn (R-Texas) has pledged this year to resurrect the Federal Research Public Access Act (S.2695), which would require federally funded research to become publicly available online within six months of being published.

“When it’s the taxpayers that are underwriting projects in the federal government, they deserve to access the very things they’re paying for,” said Cornyn spokesman Brian Walsh. “This research is funded by American taxpayers and conducted by researchers funded by public institutions. But it’s not widely available.”

As it stands, researchers are “encouraged” but not required to post their results in open-access journals. So they generally don’t. With limited budgets, libraries can’t afford to give the public access to all of the research that the public itself funds. This is kind of a no-brainer for everyone except the journal publishers, but libraries in particular should be interested in passing this law.

Library advocacy group SPARC is part of the lobbying effort. They link to this Public Access to Research petition. I signed.


Data Explosion Continues

Wired runs this AP piece about the latest estimate of the amount of digital data on Earth: we’re up to 161 exabytes! (161 billion gigabytes.) The previous estimate was 5 exabytes in 2003, so data has increased 32-fold in four years. Pretty impressive – and impossible to catalog.
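The arithmetic behind those figures, as a quick sanity check:

```python
# Growth of digital data between the two estimates cited above
data_2003_eb = 5      # exabytes, 2003 estimate
data_2007_eb = 161    # exabytes, 2007 estimate
years = 4

total_growth = data_2007_eb / data_2003_eb
annual_growth = total_growth ** (1 / years)

print(f"{total_growth:.1f}x overall")    # ~32.2x over four years
print(f"{annual_growth:.2f}x per year")  # ~2.38x, i.e. data more than doubles yearly
```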

Of course, that curve should level off somewhat as more fields complete their transition to digital. We can be fairly confident that some next-generation storage technology (e.g. holographic storage) will prevent us from running out of room in the physical world for all this data. But just as we have found that you can’t simply throw computers at a knowledge problem, you also can’t simply throw storage at a data problem.

Not every piece of data is suitable to be posted on the web where google can crawl and index it. Not every search needs an algorithmic best guess based on keywords, incoming links, tags, or user profiling. How we search and catalog our 161 exabytes of data is the next big question… and getting bigger.


Free Culture

I have been talking about copyright issues from the perspective of taking Lawrence Lessig’s Free Culture for granted. Not all of my readers have that perspective. So I’d like to take a moment to get us on the same page.

Lessig’s 2004 book was a watershed moment in the copyright discussion. He articulates a sharp distinction between publishing’s need for some copyright protections, on the one hand, and, on the other, the value of the once-unregulated copying that has been lost as modern copyright expanded to regulate digital media.

We used to be able to rely on the relative narrowness of the copying process (only people with printing presses could copy books) to set a reasonable scope for copyright. Most uses of books (reading, excerpting, repurposing, etc.) were therefore not regulated by copyright, striking a good balance between the public benefit of promoting publishers and the public benefit of unregulated use of the work, including its free use in the public domain once the term expired (originally, under the 1790 Act, just 14 years).


Fair Use Act of 2007

Wired covers a new bill in the House that would limit DMCA restrictions on copying: “Librarians would be allowed to bypass DRM technology to update or preserve their collections.”

It would also cap statutory damages for copyright infringement. Looks like a good thing, if not a full reversal of modern copyright law’s excesses. (Recommended reading on that topic: Free Culture by Lawrence Lessig.)

“I think the main thing (the current bill) does for that individual is that he’ll be able to move around within his home network the material that he has lawfully acquired.” — Rep. Rick Boucher

“Limiting the availability of statutory damages in this way is a huge step in the right direction…. It would give innovators much-needed breathing room.” — EFF’s Derek Slater


Copyright Policy

Library and Academic Computing staff met today to start a discussion of copyright policy. When we looked at library practices in our DAM working group, it quickly became clear that (a) copyright practices vary widely among the different areas of the library (books, reserves, music and slides are all handled differently) and (b) fear of infringing copyright is a driving force in decision-making.

In order to bring our varying practices into accord, we need a library-wide policy on copyright. And in order to write that policy, we need to learn much more about copyright — we’ll try to reduce the fear of infringement by reducing the uncertainty involved. Unfortunately this area of law is relatively unsettled, and certainty may only be found in the courtroom. But if/when it comes to that, it will help us and other libraries and colleges to have a robust established policy. The opposite, a history of compliance with publishing industry demands, will only work against us. Either way, having an established policy is a crucial first step.

For the next steps, we’ll be bringing in expert speakers on copyright and colleges and libraries. Each of the group participants will also do some research about existing copyright practices in their field to bring to the next meeting. In the longer term, we’ll need to educate other SLC community members about these issues and work towards a campus-wide policy.

I’m very happy just to start the discussion, and quite optimistic that we can come up with a consensus that works with everyone’s interests. If anyone wants to contribute research material, please post a link in the comments of my previous post, Copyright Resources.