This Slashdot post by Bennet Haselton proposes a new direction for search algorithms. The details of the current leader, Google’s PageRank-based ranking, are a trade secret. Although Google claims that its “methods make human tampering with our results extremely difficult,” in fact the company is in a continual arms race with people manipulating the system through “spamdexing”, i.e. achieving high-ranked search results for reasons other than user satisfaction.
Haselton’s suggestion looks like a good start — an open-source algorithm that uses samples of the user base rather than aggregating all users’ “votes” (clicks). This would certainly render current spamdexing schemes obsolete. Of course, as statisticians will tell you, getting a properly randomized, representative sample is a problem in itself. Usually, in order to factor out influences you don’t want to measure, you need to gather some demographic data from participants — raising privacy concerns.
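To make the contrast concrete, here is a minimal toy sketch of sampling versus aggregating. It is not Haselton’s actual algorithm: the population sizes, the `preference` rule, and the function names are all made up for illustration. It assumes a spammer controls a handful of accounts that can click as often as they like, while a sampled panel gives each selected account exactly one vote.

```python
import random
from collections import Counter

GOOD, SPAM = "good-page", "spam-page"

# Hypothetical population: 1000 genuine users plus 5 spammer-controlled accounts.
genuine = [f"user{i}" for i in range(1000)]
spammers = [f"bot{i}" for i in range(5)]

def preference(user):
    """Page this account would pick if asked exactly once."""
    if user.startswith("bot"):
        return SPAM
    # Toy assumption: one genuine user in ten actually likes the spam page.
    return SPAM if int(user[4:]) % 10 == 0 else GOOD

def aggregate_clicks(clicks_per_bot=1000):
    """Count every click; nothing stops a bot from clicking repeatedly."""
    tally = Counter(preference(u) for u in genuine)
    tally[SPAM] += clicks_per_bot * len(spammers)
    return tally

def sampled_votes(sample_size=300, seed=0):
    """Ask a randomly chosen panel, one vote per selected account."""
    rng = random.Random(seed)
    panel = rng.sample(genuine + spammers, sample_size)
    return Counter(preference(u) for u in panel)

clicks = aggregate_clicks()
panel = sampled_votes()
print("aggregate winner:", clicks.most_common(1)[0][0])
print("sampled winner:  ", panel.most_common(1)[0][0])
```

In this toy setup the spam page wins the raw click count, while the sampled panel picks the good page, because the bots’ influence is capped at their share of accounts. The sketch quietly assumes fake accounts are rare; a spammer who can register accounts in bulk would still skew the sample, which is the representativeness problem again.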
And what is “merit” anyway? Is popular reaction its best measure? How can an algorithm distinguish sincere offerings from click greed?
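It may be impossible to create a system that can’t be “gamed”. Fellow Slashdotter “attonitus” mentions Arrow’s Theorem, which shows that no rule for merging individual rankings of three or more options into one group ranking can satisfy a short list of reasonable-sounding fairness conditions all at once. (Although I bet a clever designer could find a way to display circular rankings…)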
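For a feel of what a “circular ranking” looks like, here is a small sketch of the classic Condorcet paradox, which is a close cousin of Arrow’s Theorem rather than the theorem itself. The three ballots and page names are invented for illustration; each hypothetical user ranks three pages from best to worst, and we check which page a majority prefers in every head-to-head pairing.

```python
# Three hypothetical users, each ranking three pages from best to worst.
ballots = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]
pages = ["A", "B", "C"]

def majority_prefers(x, y, ballots):
    """True if more ballots rank x above y than rank y above x."""
    wins = sum(b.index(x) < b.index(y) for b in ballots)
    return wins > len(ballots) - wins

# Collect every ordered pair (winner, loser) decided by simple majority.
beats = {
    (x, y)
    for x in pages
    for y in pages
    if x != y and majority_prefers(x, y, ballots)
}
print(sorted(beats))  # [('A', 'B'), ('B', 'C'), ('C', 'A')]
```

A beats B, B beats C, and C beats A, each by two votes to one, so there is no consistent “best” page to put at the top of a list even though every individual ballot is perfectly orderly.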
On the other hand, we don’t need perfect search — as I argue in another post, we just(?) need to meet professional cataloging standards. And in the short term, any improvement would be welcome. Check out Nutch, an open-source search tool still in early development.
It’s an empirical question whether a closed or open model produces a better search algorithm. The analogous debate rages in security; there is not yet a clear winner. We do know that full disclosure of security vulnerabilities tends to produce better fixes in the long term, but in the short term amounts to advertising “come get me”. (See Bruce Schneier’s take.) Because Google doesn’t disclose its flaws, from the moment someone finds a loophole until the moment it’s fixed, users are unwittingly exploited. Full disclosure of PageRank flaws is a step Google could take today.