Discussion of Sections 1 -4 of Kleinberg
For the week of Feb. 4, 2007, we are discussing Authoritative Sources in a Hyperlinked Environment, with today’s posting focused on sections 1 through 4. This paper really has some meat on it and I found myself having to reread sections and even dig out my old Linear Algebra text.
As is typical of most academic papers, section 1 introduces us to the problem that is being discussed, in this case how to determine the authoritative sites on the web. Specifically, in the prehistoric web days, search engines would bring you back lots of hits, but seldom did they bring you back the most important hits, at least not high in the rankings (we are relatively spoiled these days in this regard.) As Brin and Page alluded to, doing a query on “microsoft” on these old engines may not even have microsoft.com as the first result, even though most users would tell you that it should be. Klienberg asserts that the big thing missing from search engines in these days is the use of link analysis to determine the authoritative sites on the web so that they may be ranked higher. Unlike closed corpus search engines that suffer from the scarcity problem (not enough good content to search), the web suffers from an abundance problem, namely too much content. Given this, the goal, Kleinberg says, is to identify the authoritative pages that are relevant to the user’s query.
Much like Brin and Page and Mihalcea (see eariler posts), Kleinberg suggests a graph based approach based on the links in HTML to determine the authoritative pages. Additionally, he also defines “hub” pages, i.e. those pages that have links to multiple relevant authoritative pages. Hubs never point to hubs and authoritative pages never point to authoritative pages, which yields a dense directed bipartite subgraph (don’t worry, I had to look it up as well.) Hubs and authorities reinforce each other, “a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.” Intuitively this make sense as we all know pages that we trust and we know where to go to find them. I am not sure if this is correct, but I think to some extent in our post-google world, the hubs are now the search engines themselves. That is, being ranked highly in the engine, tends to reinforce a site as being the authority.
Unlike the previous papers we looked at, this one actually has a fair amount of math, especially in section 2, so dig out your Linear Algebra book if you really want an in-depth understanding. The algorithm for calculating the hubs and authorities is very similar to both the PageRank and TextRank algorithims we discussed in prior posts and can be calculated using a similar, simple, iterative algorithm (I implemented it in a few hours). The graph in use for the algorithm consists of a root set of nodes identified by a search engine (Alta Vista in this case) and is then expanded to include pages that are pointed to by a node in the set, or point to a node in the set (possibly bounded by some predetermined amount.) Edges are the links themselves. Given the graph it comes down to calculating the authority weights and the hub weights using the iterative algorithm. Authority weights are the sum of the hub weights and the hub weights are the sum of the authority weights. See pages 9, 10, 11 and 12 for the exact details of proving why this works. You may want to refresh your knowledge of Eigenvectors, etc. Seriously, though, don’t you love when “discovered solutions” to problems have such a pure mathematical relationship? That is, isn’t it great when the math matches up with your intuition?
Section 3 breaks expands on the notion of hubs and authorities to discover communities of links, and, also, “anti” communities. For instance, using a variant of the main algorithm, one can determine clusters of sites that are related. For instance, one can easily determine those sites in favor of abortion and those opposed. Mathematically, this correlates to working with the non-principal eigenvectors, whereas the primary algorithm involves working with the principal eigenvector.
Section 4 continues Section 3 and shows how it is fairly easy to use the link structure to find similar pages given an initial query. Unlike relevance feedback, Kleinberg’s approach relies solely on the link structure.
Tomorrow or Friday, I will look at the remaining sections of the paper.
graph theory, Kleinberg, PageRank, TextRank, Mihalcea, Brin and Page, eigenvectors, link analysis, bibliometrics, linear algebra, information retrieval, search engines
Popularity: 6% [?]
Technorati Tags: graph theory, Kleinberg, PageRank, TextRank, Mihalcea, Brin and Page, eigenvectors, link analysis, bibliometrics, linear algebra, information retrieval, search engines

