Last October Google changed its ranking algorithm: "the great ranking shakeup", which, among other things, led to an absurd lawsuit against Google.
In July I had written an article analyzing PageRank pollution and how Google should be doing something about it. To its credit, Google did indeed reduce the effect of Googlebombing somewhat. What had such a profound impact on the ranking, however, wasn't the link topology factor that I had analyzed but the link text factor, which I completely failed to account for.
Today I would speculate that link topology analysis has a very marginal role in determining the final page score. Additionally, the actual content of a page that is a link target is of importance only as a cross check with the link text.
This is pretty obvious if you consider that a global search engine's main problem is weeding spam out of the results to get better relevance. Google's current relevance ranking should really be called "popularity ranking"; as Krishna Bharat once told me, "if 1 million people say that that's IBM, for us that's IBM".
Objective relevance, if it even exists, isn't really needed on a global scale, and it implies delving into page content to check against the query, which in turn exposes you to spam. On a global scale something that is popular on average is good enough. Google doesn't even do word stemming (just in case something spelled slightly differently is more relevant), since there typically are so many results anyway.
In other words, you can visualize link topology analysis as a vote for the target page: a link means "this page is authoritative" or, more generally, "this page deserves more weight than the other 2,999,999,999 I didn't link to". Link text analysis, on the other hand, can be visualized as a vote on what the page is about, like "this page talks about foo". With 3 billion pages it all averages out and (unintuitively) works. This kills spam because it doesn't really matter what's on a page; what matters is that people also link to the page, "describing it" closely enough to what it contains.
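To make the two kinds of vote concrete, here is a toy sketch in Python. It is emphatically not Google's algorithm: the link data, the `tally_votes` and `search` names, and the scoring rule (link text votes first, inbound link count as a tie-breaker) are all assumptions made up for illustration. Note that the content of the target pages is never looked at.

```python
from collections import Counter, defaultdict

# Hypothetical toy link graph: (source page, target page, link text).
LINKS = [
    ("a.example", "ibm.example", "IBM"),
    ("b.example", "ibm.example", "IBM"),
    ("c.example", "ibm.example", "big blue"),
    ("a.example", "foo.example", "foo tutorial"),
]


def tally_votes(links):
    """Count both kinds of vote for every link target.

    weight[target] is the number of inbound links (topology vote).
    topics[target] counts the words used in link text (topic vote).
    """
    weight = Counter()
    topics = defaultdict(Counter)
    for _source, target, text in links:
        weight[target] += 1
        for word in text.lower().split():
            topics[target][word] += 1
    return weight, topics


def search(query, links):
    """Rank targets by link text votes for the query words.

    Inbound link count is only used to break ties (an assumption of
    this sketch, mirroring the speculation in the text).
    """
    weight, topics = tally_votes(links)
    query_words = query.lower().split()
    scores = {}
    for target, words in topics.items():
        text_votes = sum(words[w] for w in query_words)
        if text_votes:
            scores[target] = text_votes
    return sorted(scores, key=lambda t: (scores[t], weight[t]), reverse=True)
```

Calling `search("IBM", LINKS)` returns only `ibm.example`, because that is what the linking pages "say" it is; whatever text the page itself contains plays no role.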
Incidentally, this anti-spam optimization of the current PageRank is why I think that the Google Search Appliance (which appears to be based on the same ranking algorithms) isn't very well suited for small to medium scale search applications. But I should disclose that my company develops search technologies, so I'm biased.
Google has also been relying excessively on dmoz directory listings, over which it doesn't have very much control at all.
Anyway this is the state of the art, and Teoma and Overture are catching up to challenge Google. Google must be working on something new, some earth-shaking, lawsuit-collecting ranking improvement, but in what direction?
Google's acquisition of Pyra and Google's content-targeted ads appearing on Blogger weblogs mean that Google is entering the content arena.
There are many ways to deal with content, machine learning technologies being one. Like Bayesian spam filters, machine learning works on collections of words statistically, and is usually language independent. Yet Google kick-started its content targeting with an English-only solution; witness the note on the AdWords page:
Content-Targeted Advertising is only available for AdWords campaigns with English as a target language and any of the following target country selections: All Countries, the US, UK, Australia, or Canada.
So, switching from mild speculation to wild speculation, my guess is that Google is experimenting with English language technology via content-targeted advertising on Blogger weblogs.
Later on, English language processing will be good enough (and fast enough) to be applied to the English language documents on the web, and Google will start powering search results with it.
The search engine I worked on supports searching for arbitrary metadata associated with a page, including language information. This is an extremely powerful information retrieval technique. For example, imagine searching for "vehicle" and finding relevant pages on cars, bikes, boats, etc., or searching for the name of the digital camera you want to buy and sorting the results by good, neutral or bad review.
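A rough sketch of what metadata search could look like, in Python. This is not the engine mentioned above; the `Page` class, the metadata keys (`language`, `category`, `review`) and the sample pages are hypothetical, and the sketch assumes some earlier language processing step has already attached the metadata to each page.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    text: str
    # Arbitrary metadata, assumed to be produced by an earlier analysis step.
    metadata: dict = field(default_factory=dict)


# Hypothetical sample index.
PAGES = [
    Page("cars.example", "sedans and hatchbacks",
         {"language": "en", "category": "vehicle"}),
    Page("boats.example", "sailing dinghies",
         {"language": "en", "category": "vehicle"}),
    Page("cam-bad.example", "FooCam 100 review",
         {"language": "en", "review": "bad"}),
    Page("cam-good.example", "FooCam 100 review",
         {"language": "en", "review": "good"}),
    Page("cam-mid.example", "FooCam 100 review",
         {"language": "en", "review": "neutral"}),
]

# Sort order for review sentiment: good first, bad last.
REVIEW_ORDER = {"good": 0, "neutral": 1, "bad": 2}


def search_metadata(pages, **criteria):
    """Return pages whose metadata matches every key/value in criteria."""
    return [
        p for p in pages
        if all(p.metadata.get(key) == value for key, value in criteria.items())
    ]


def search_sorted_by_review(pages, query):
    """Find reviewed pages containing the query text, best reviews first."""
    hits = [
        p for p in pages
        if query.lower() in p.text.lower() and "review" in p.metadata
    ]
    return sorted(
        hits,
        key=lambda p: REVIEW_ORDER.get(p.metadata["review"], len(REVIEW_ORDER)),
    )
```

`search_metadata(PAGES, category="vehicle")` finds the car and boat pages even though neither contains the word "vehicle", and `search_sorted_by_review(PAGES, "FooCam 100")` lists the good review before the neutral and bad ones.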
Notice that this is not semantic web trickery. Sergey Brin is notoriously skeptical about the semantic web:
Look, putting angle brackets around things is not a technology, by itself. I'd rather make progress by having computers understand what humans write, than by forcing humans to write in ways computers can understand.
Yes. It will be interesting. (Oh, and Google will be upgrading the web again, no semantic web needed).
© 2002-2003 by Duncan Wilcox