Search Innovations

Read/Write Web posts:

There are an abundance of new search engines (100+ at last count) – each pioneering some innovation in search technology. Here is a list of the top 17 innovations that, in our opinion, will prove disruptive in the future. These innovations are classified into four types: Query Pre-processing; Information Sources; Algorithm Improvement; Results Visualization and Post-processing.


Refining Google

Via Digg, I found an interesting article on Google’s attempts to prevent people from “gaming” its search results. Google’s PageRank algorithm, while secret, is known to consider the number and quality of incoming links to a site in its rankings. In effect, PageRank embodies working models of reputation and trust.
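To make "number and quality of incoming links" concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is an illustration only, not Google's production ranking, which is secret: the function name, the toy link graph, and the damping value of 0.85 are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank over a dict {page: [pages it links to]}.

    A simplified illustration; the damping factor and iteration
    count are assumed values, not anything Google has confirmed.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own rank among the pages it links to,
                # so a link from a highly ranked page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical five-page web: three pages link to "popular".
toy_web = {
    "hub": ["a", "b"],
    "a": ["popular"],
    "b": ["popular"],
    "popular": ["hub"],
    "loner": ["popular"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Note that "hub" ends up ranked well even though only one page links to it, because that one page is the highly ranked "popular". That is the "quality" half of the model, and it is also exactly what link sellers try to exploit.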

In the article, Carsten Cumbrowski leans heavily on jargon, and the writing is elliptical and dense at times, but the information he presents and links to gives very good background on the issues with PageRank. He analyzes the NOFOLLOW attribute, an attempt to reduce the credence given to paid or otherwise less meaningful links. He also covers improvements to PageRank’s trust model:

It is like with people. You do not trust anybody you just have met. How quickly you trust somebody is less a time factor, but has to do with what the person is doing or is not doing and how much it does match what the person says about himself, his intentions and his plans.

Therefore the age of a site is a poor proxy for trustworthiness, and PageRank’s naive reliance on it was faulty. As I’ve posted before, an extreme amount of time and effort goes into reverse-engineering search algorithms, along a whole spectrum from benign “search engine optimization” to malicious exploitation of flaws. It’s an arms race in which the complexity of the system is determined as much by competitive pressure from its exploiters as by the desire for more useful search results.

Remember that the next time you rely on a search algorithm — or build a web service that relies on one.
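For anyone building such a service, the NOFOLLOW attribute discussed above is easy to honor in your own link analysis. The sketch below uses only Python's standard html.parser module to separate ordinary links from links marked rel="nofollow". The class name and the sample HTML are made up for illustration, and how any real search engine weighs the two groups is not something I can claim to know.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects hrefs, split by whether the link is marked nofollow."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel holds space-separated tokens; compare case-insensitively.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


# Hypothetical page fragment with one editorial link and two nofollow links.
sample_html = """
<p><a href="https://example.org/review">an editorial link</a>
<a rel="nofollow" href="https://example.com/ad">a paid link</a>
<a rel="external NOFOLLOW" href="https://example.net/comment">a comment link</a></p>
"""

collector = LinkCollector()
collector.feed(sample_html)
print("counted:", collector.followed)
print("discounted:", collector.nofollow)
```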


Koha Library System

The SLC library currently uses proprietary catalog software. It’s expensive, we can’t add features we want, and it won’t interoperate with our other systems like web servers, image databases, and our learning management system (which is a whole other problem in itself). So everyone was pleased when the opportunity arose to consider a different solution: Koha. It’s an open source integrated library system.

The bad news is that it’s still an immature product and lacks some features we would need, like a reserves module. The good news is that some of the developers close to the project have started a service company, LibLime, which will develop features and customizations and add them to the software. Rather than paying a software license fee to the proprietary vendor, who has little incentive to implement our feature requests, we could directly pay the developers to build the software we want.

LibLime’s approach is to treat customizations as preferences — switches that can be flipped to give different functionality from the same build of the software. This prevents forking and versioning issues, which were my key concerns with mission-critical open source software. The developers themselves take an integrative approach; they seem very interested in developing an extensive feature list in response to what librarians need and dealing with any conflicts at the preference level.
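As a rough sketch of what "customizations as preferences" can look like in code: one code path, with behavior chosen by a per-library settings object rather than by a forked build. Every name below (the Preferences fields, loan_days, can_place_reserve) is hypothetical and invented for this example; none of it is taken from Koha's actual preference system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preferences:
    """Per-library switches. All field names are hypothetical."""
    reserves_enabled: bool = False
    loan_period_days: int = 21
    extended_faculty_loans: bool = False


def loan_days(prefs: Preferences, is_faculty: bool) -> int:
    """Same code for every library; the preference decides the behavior."""
    if prefs.extended_faculty_loans and is_faculty:
        return prefs.loan_period_days * 2
    return prefs.loan_period_days


def can_place_reserve(prefs: Preferences, item_on_loan: bool) -> bool:
    """Reserves only exist for libraries that switched them on."""
    return prefs.reserves_enabled and item_on_loan


# Two libraries running the identical build with different switches.
college_library = Preferences(reserves_enabled=True, extended_faculty_loans=True)
public_library = Preferences(loan_period_days=14)

print(loan_days(college_library, is_faculty=True))
print(loan_days(public_library, is_faculty=True))
print(can_place_reserve(public_library, item_on_loan=True))
```

The point of the design is that a feature one library paid for ships to everyone, switched off by default, so there is only ever one codebase to maintain.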

Often with proprietary software, one preference is forced on all users because that is less work for the developers. By contrast, the paid-development/open-source model means that the developers get paid for exactly the work they do, so they can afford to do things the hard way if that’s what users want.

Down the road, I’m concerned with making sure that the systems we implement are standards-compliant and talk to each other. The possibility of tying together a catalog/search solution like Koha with a web platform like Plone, another open source project, really raises the prospect of free and easy information flow around campus. The open source model means that these tools keep getting better and more available; what starts in the library and expands to the campus continues to spread across the entire internet.


"Spock" People Search

TechCrunch previews Spock, a people-oriented search service. Like Google, they are indexing the entire web, but with some built-in data structure assumptions. If Google tried to catalog pages rather than just associate them, it would need a metadata standard. It would make no sense to have, e.g., a “first name” field attached to a page about Linux, Limburger cheese, or the limbic system. Since Spock knows it’s dealing with people, e.g., John Linell, it knows what blank fields to create and try to fill.

Spock auto-creates tags for individuals based on the information they find. Prominent tags for Bill Clinton, for example, include “former U.S. President,” “Great Leader,” “Womanizer,” “Left Handed,” “Democrat,” and “Saxophonist,” among others. Spock also auto detects other relevant meta data about the individual – age, location and sex.

This specialization should allow Spock to give higher-quality results about people than a generalized search engine. Of course, it’s only as savvy about people as its designers know how to make it, so this approach would not scale to every specialized type of data. But since we care a lot about people — Spock claims 30% of web searches are for people — this is probably a useful, if limited, approach.

And for those of you who are relying on privacy through obscurity, there are already several indexing tools that will bring potential stalkers right to your door. If you have a little bit of personal info on several sites, that data could be automatically aggregated to build a full set of personal info — be careful!
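Here is a minimal sketch of both ideas at once: a record type that knows in advance which blank fields a person should have, and an aggregation step that fills those blanks from scraps found on different sites. The PersonRecord class, its field names, and the sample data are all invented for illustration; I have no knowledge of how Spock actually stores or merges its data.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class PersonRecord:
    """A fixed, people-specific schema. Field names are hypothetical."""
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    age: Optional[int] = None
    location: Optional[str] = None
    tags: set = field(default_factory=set)

    def absorb(self, scraped: dict) -> None:
        """Fill still-blank fields from one source; tags accumulate."""
        for f in fields(self):
            if f.name == "tags":
                self.tags.update(scraped.get("tags", ()))
            elif getattr(self, f.name) is None and scraped.get(f.name) is not None:
                setattr(self, f.name, scraped[f.name])

    def missing(self) -> list:
        """The blanks the engine still needs to go looking for."""
        return [f.name for f in fields(self)
                if f.name != "tags" and getattr(self, f.name) is None]


# Fictional scraps of data, each harmless on its own site.
record = PersonRecord()
record.absorb({"first_name": "Jane", "last_name": "Example", "tags": ["saxophonist"]})
record.absorb({"first_name": "J.", "location": "Springfield", "tags": ["left handed"]})

print(record)
print("still unknown:", record.missing())
```

Two partial profiles already produce a fuller picture than either site shows alone, which is the privacy problem in miniature.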

Spock is pre-release and for now, you need an invitation to try it.


Social Search

Found a neat summary of social search by Arnaud Fischer. Web technology has gone through a few distinct phases. The first (early to mid 1990s) was just digitizing and hyperlinking information, making its interconnectedness literal. In the second, Google (1998) revolutionized search; you no longer need to know where information is in order to get it. But, as I’ve previously posted, there are benefits to cataloging information rather than just sifting through an undifferentiated mess. It seems that any algorithm less complex than an intelligent agent is not only less effective at finding good results but also susceptible to manipulation.

Throughout the past decade, a search engine’s most critical success factors – relevance, comprehensiveness, performance, freshness, and ease of use – have remained fairly stable. Relevance is more subjective than ever and must take into consideration the holistic search experience one user at a time. Inferring each user’s intent from a mere 2.1 search terms remains at the core of the relevance challenge.

Social search addresses relevance head-on. After “on-the-page” and “off-the-page” criteria, web connectivity and link authority, relevance is now increasingly augmented by implicit and explicit user behaviors, social networks and communities.

Attempts to literally harness human judgement to do the work of a cataloging engine (see the Mechanical Turk) don’t scale to internet proportions. What we need is a way to collect social information without imposing a burden on users. Some sites have succeeded in providing a platform where users freely contribute linked content, e.g. Wikipedia, and some have further gotten their users to add cataloging information, e.g. YouTube’s tags. Visualizations like tag clouds make these implementations easier to use, but no deeper. And they still require intentional effort, and therefore goodwill and trust, two of the least scalable human resources.

I fundamentally agree with Fischer’s conclusion that using self-organizing groups to generate consensus is a much better way to measure relevance. The big question is how to balance the needs of social data collection with the freedom of association that it depends on. The public, machine-readable nature of most web forums amplifies any chilling effect into a snowstorm of suppression. Further, when a web service becomes popular, there is a strong temptation to monetize that popularity with ads, sponsored content, selling user information to marketers, etc. That background probably skews the opinions expressed in that forum, and by extension, the extractable metadata.

But even more fundamentally, there is something different about web discourse that makes participants espouse opinions they normally wouldn’t. Try researching software on a users’ forum. Many seemingly sincere opinions are based not on experience but on a sort of meta-opinion about what the consensus answer is. In fact I bet most posters really are sincere; I have caught myself doing this. This reification is what actually creates consensus, and having posted on the accepted side of an issue recursively increases a poster’s reputation. There is no check on this tendency because we lack the normal social status indicators online. I would bet that posters regress toward the mean opinion much faster than in offline discourse. Any social scientists out there want to test this out?

Speaking of which, there are already many empirically supported social biases which affect our judgements and opinions. Are we ready for Search by Stereotype? The tyranny of the majority and the tragedy of the commons are as dangerous to the freedom of (digital) association as government suppression or market influences.

The web as it currently stands, the non-semantic web, is largely non-judgemental in its handling of information. Divergent opinions have equal opportunity to be tagged, linked, and accepted. This non-function is actually a feature. Before we trust online consensus to generate relevance measurements for our social search engines, we need to understand and try to compensate for its biases.


Merit-Based Search Results?

This Slashdot post by Bennett Haselton proposes a new direction for search algorithms. The current best, Google’s PageRank, is a trade secret. Although they claim that their “methods make human tampering with our results extremely difficult,” in fact Google is in a continual arms race with people manipulating the system through “spamdexing”, i.e. achieving high-ranked search results for reasons other than user satisfaction.

Haselton’s suggestion looks like a good start — an open-source algorithm that uses samples of the user base rather than aggregating all users’ “votes” (clicks). This would certainly render current spamdexing schemes obsolete. Of course, as statisticians will tell you, getting a properly randomized, representative sample is a problem in itself. Usually, in order to factor out influences you don’t want to measure, you need to gather some demographic data from participants — raising privacy concerns.
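A toy simulation shows why sampling blunts vote-stuffing. This is my own simplified reading of the idea, not Haselton's actual algorithm: the account counts, the 5% genuine approval rate, and the function names are all assumptions. The contrast is between counting whoever chooses to vote (which sock puppets can flood) and polling a random sample chosen by the engine (which they cannot opt into).

```python
import random


def self_selected_approval(accounts):
    """Approval among accounts that chose to vote (easy to stuff)."""
    votes = [a["likes_page"] for a in accounts if a["volunteers_vote"]]
    return sum(votes) / len(votes)


def sampled_approval(accounts, sample_size, rng):
    """Approval among a random sample picked by the engine."""
    sample = rng.sample(accounts, sample_size)
    return sum(a["likes_page"] for a in sample) / sample_size


# Assumed population: 10,000 genuine users, 5% of whom like the page,
# and 1% of whom bother to vote (none of the voters happen to like it).
accounts = [
    {"likes_page": i % 20 == 0, "volunteers_vote": i % 100 == 7}
    for i in range(10_000)
]
# Plus 200 sock-puppet accounts that always vote in favor.
accounts += [{"likes_page": True, "volunteers_vote": True} for _ in range(200)]

rng = random.Random(0)  # fixed seed so the run is repeatable
stuffed = self_selected_approval(accounts)
sampled = sampled_approval(accounts, sample_size=500, rng=rng)
print(f"self-selected approval: {stuffed:.2f}")
print(f"random-sample approval: {sampled:.2f}")
```

With these made-up numbers the puppets own two thirds of the voluntary votes but only about 2% of the accounts, so they barely move a random sample. The sketch assumes every sampled account answers honestly and that the sample is truly random, which is exactly the hard part.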

And what is “merit” anyway? Is popular reaction its best measure? How can an algorithm distinguish sincere offerings from click greed?


The Virtues and Limits of Cataloging

The first wave of digital evangelism has passed. With the dot-com bubble burst and with the help of John Seely Brown and Paul Duguid, among others, we are much less attracted to the pitch that all problems will be solved by the application of massive amounts of data.

Around the same time, Google proved that there is an extremely usable middle ground between cataloged, curated information sets and hopelessly disjoint stacks of data. Users increasingly choose the convenience of Google, and more recently Wikipedia, Flickr, and YouTube, over the authoritative thoroughness of library-mediated research.

Librarians cringe at amateur cataloging. It’s like home dentistry. Google’s black-box PageRank reflects the “uniquely democratic nature of the web” by choosing relevant information based on proxies for trust, reputation, and authoritativeness (not expert assessments of those qualities). Flickr and YouTube use “Web 2.0” social tagging techniques to roughly categorize content. Even non-librarians can appreciate the pitfalls of letting just anyone add metadata: they’ll get it wrong.

But talking to librarians, I’ve started to appreciate whole other levels of control over the process and content of cataloging. IANAL — I am not a Librarian. The following discussion is for entertainment purposes only.