<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper of the Week &#187; Google</title>
	<atom:link href="http://www.paperoftheweek.com/category/google/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.paperoftheweek.com</link>
	<description>Read. Learn. Discuss.</description>
	<lastBuildDate>Tue, 14 Aug 2007 01:35:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Discussion of LexRank by Erkan and Radev</title>
		<link>http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/</link>
		<comments>http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/#comments</comments>
		<pubDate>Fri, 23 Feb 2007 02:03:40 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[linear algebra]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[text summarization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/</guid>
		<description><![CDATA[POTW 2/18/07: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization The LexRank paper by Erkan and Radev is another PageRank/Graph Theory based approach to working with text, this time applied to the task of summarization. Key parts of sections 1 and 2 discuss the general problem of corpus-based summarization.  Unlike TextRank by Mihalcea, Erkan [...]]]></description>
			<content:encoded><![CDATA[<p>POTW 2/18/07: <a href="http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html">LexRank: Graph-based Lexical Centrality as Salience in Text Summarization</a></p>
<p>The LexRank paper by Erkan and Radev is another PageRank/Graph Theory based approach to working with text, this time applied to the task of summarization.</p>
<p>Key parts of sections 1 and 2 discuss the general problem of corpus-based summarization.  Unlike Mihalcea with TextRank, Erkan and Radev are interested in summarizing groups of documents (although the approach can be applied to individual documents as well).  They describe a &#8220;centroid-based&#8221; model that tries to determine the most important sentences in a group of documents.  In this approach, sentences containing more words that lie near the centroid of the document cluster are deemed more important and so receive a higher weight.  In section 3, the authors lay out their method for determining the similarity between two sentences using a TF-IDF formula (think vector space search).  From these similarity scores, a matrix can be constructed holding the similarity between every pair of sentences.</p>
<p>After doing some fancy linear algebra, they come to define LexRank, which is really just Google&#8217;s PageRank applied to this particular problem.  Section 3.2 provides all the gory details on the math and how the problem can be described in terms of stochastic matrices and Markov chains, which all seems to boil down to the lovely formula that Brin and Page gave us.  This section also lays out in more detail the iterative algorithm for calculating PageRank/TextRank/LexRank/YouFillInTheBlankRank.</p>
<p>In practice, we treat the similarity matrix as a graph: each sentence is a node, the edges are weighted by the similarity scores, and the iterative algorithm runs over that graph.</p>
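<p>To make that pipeline concrete, here is a minimal Python sketch of the idea as I read it, not the authors&#8217; code.  The function names (<code>tfidf_vectors</code>, <code>cosine</code>, <code>lexrank</code>), the smoothed IDF, and the PageRank-style use of the damping value are all my own assumptions; the paper defines its own similarity formula and parameters.</p>

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Turn tokenized sentences into sparse TF-IDF vectors (plain dicts)."""
    n = len(sentences)
    doc_freq = Counter(word for sent in sentences for word in set(sent))
    # Assumption: a smoothed IDF, so very common words keep a small weight.
    idf = {w: math.log((1 + n) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in Counter(sent).items()} for sent in sentences]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences, damping=0.85, tol=1e-8, max_iter=200):
    """Score sentences by running a damped power iteration over the similarity graph."""
    vectors = tfidf_vectors(sentences)
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Normalize every row to sum to 1 so the matrix is stochastic.
    rows = [[s / (sum(row) or 1.0) for s in row] for row in sim]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new = [
            (1 - damping) / n + damping * sum(scores[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = max(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if tol > delta:  # stop once the scores have settled
            break
    return scores
```

<p>Feeding it a handful of tokenized sentences returns one score per sentence, and the scores sum to 1.  A sentence that shares words with several others should come out ahead of one that shares words with none.</p>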
<p>I know this sounds like skimping on my part, but the rest of the paper is just about the experiments they ran and I don&#8217;t feel like rehashing those.  Guess what?  It did pretty well.  One interesting thing is that the MEAD summarization system is available at <a href="http://www.summarization.com/">http://www.summarization.com/</a>.</p>
<p>Like most of the graph-based approaches we have seen so far, it does well with noisy data, which is one of the big selling points for me.</p>
<p>Next week, on to something new: question answering!  See you then!</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/LexRank" rel="tag">LexRank</a>, <a href="http://technorati.com/tag/Radev" rel="tag">Radev</a>, <a href="http://technorati.com/tag/Erkan" rel="tag">Erkan</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/Brin" rel="tag">Brin</a>, <a href="http://technorati.com/tag/Page" rel="tag">Page</a>, <a href="http://technorati.com/tag/Google" rel="tag">Google</a>, <a href="http://technorati.com/tag/linear+algebra" rel="tag">linear algebra</a>, <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discussion of Sections 1-4 of Kleinberg</title>
		<link>http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/</link>
		<comments>http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/#comments</comments>
		<pubDate>Thu, 08 Feb 2007 02:29:23 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[linear algebra]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/</guid>
		<description><![CDATA[For the week of Feb. 4, 2007, we are discussing Authoritative Sources in a Hyperlinked Environment, with today&#8217;s posting focused on sections 1 through 4.  This paper really has some meat on it and I found myself having to reread sections and even dig out my old Linear Algebra text. As is typical of most [...]]]></description>
			<content:encoded><![CDATA[<p>For the week of Feb. 4, 2007, we are discussing <a href="http://citeseer.ist.psu.edu/87928.html">Authoritative Sources in a Hyperlinked Environment</a>, with today&#8217;s posting focused on sections 1 through 4.  This paper really has some meat on it and I found myself having to reread sections and even dig out my old Linear Algebra text.</p>
<p>As is typical of most academic papers, section 1 introduces us to the problem that is being discussed, in this case how to determine the authoritative sites on the web.  Specifically, in the prehistoric web days, search engines would bring you back lots of hits, but seldom did they bring you back the most important hits, at least not high in the rankings (we are relatively spoiled these days in this regard).  As Brin and Page alluded to, a query for &#8220;microsoft&#8221; on these old engines might not even return microsoft.com as the first result, even though most users would tell you that it should.  Kleinberg asserts that the big thing missing from search engines in those days was the use of link analysis to determine the authoritative sites on the web so that they could be ranked higher.  Unlike closed corpus search engines that suffer from the scarcity problem (not enough good content to search), the web suffers from an abundance problem, namely too much content.  Given this, the goal, Kleinberg says, is to identify the authoritative pages that are relevant to the user&#8217;s query.</p>
<p>Much like Brin and Page and Mihalcea (see earlier posts), Kleinberg suggests a graph-based approach that uses the links in HTML to determine the authoritative pages.  He also defines &#8220;hub&#8221; pages, i.e. those pages that have links to multiple relevant authoritative pages.  Hubs never point to hubs and authoritative pages never point to authoritative pages, which yields a <a href="http://en.wikipedia.org/wiki/Bipartite_graph">dense directed bipartite subgraph</a> (don&#8217;t worry, I had to look it up as well).  Hubs and authorities reinforce each other: &#8220;a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.&#8221;  Intuitively this makes sense, as we all know pages that we trust and we know where to go to find them.  I am not sure if this is correct, but I think to some extent in our post-Google world, the hubs are now the search engines themselves.  That is, being ranked highly in the engine tends to reinforce a site as being the authority.</p>
<p>Unlike the previous papers we looked at, this one actually has a fair amount of math, especially in section 2, so dig out your Linear Algebra book if you really want an in-depth understanding.  The algorithm for calculating the hubs and authorities is very similar to both the PageRank and TextRank algorithms we discussed in prior posts and can be calculated using a similar, simple, iterative algorithm (I implemented it in a few hours).  The graph used by the algorithm starts from a root set of nodes identified by a search engine (Alta Vista in this case), which is then expanded to include pages that are pointed to by a node in the set or that point to a node in the set (possibly bounded by some predetermined amount).  Edges are the links themselves.  Given the graph, it comes down to calculating the authority weights and the hub weights using the iterative algorithm.  A page&#8217;s authority weight is the sum of the hub weights of the pages that point to it, and its hub weight is the sum of the authority weights of the pages it points to; the weights are normalized after each round.  See pages 9, 10, 11 and 12 for the exact details of proving why this works.  You may want to refresh your knowledge of <a href="http://en.wikipedia.org/wiki/Eigenvector">Eigenvectors</a>, etc.  Seriously, though, don&#8217;t you love when &#8220;discovered solutions&#8221; to problems have such a pure mathematical relationship?  That is, isn&#8217;t it great when the math matches up with your intuition?</p>
<p>Section 3 expands on the notion of hubs and authorities to discover communities of links and also &#8220;anti&#8221; communities.  Using a variant of the main algorithm, one can determine clusters of sites that are related.  For instance, one can easily separate those sites in favor of abortion from those opposed.  Mathematically, this corresponds to working with the non-principal eigenvectors, whereas the primary algorithm involves working with the principal eigenvector.</p>
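<p>For the curious, the iterative update can be sketched in a few lines of Python.  This is a rough illustration under my own assumptions, not Kleinberg&#8217;s pseudocode: the function name <code>hits</code>, the dictionary-of-lists input, and the fixed iteration count are all made up for the example.</p>

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub weights of the pages pointing at each page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: add up the authority weights of the pages each page points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the squared weights sum to 1, which keeps the values bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth
```

<p>Given three hub-like pages where two link to both candidate authorities and one links to only the first, the first authority ends up with the larger authority weight, and the two hubs that link to both outrank the third.</p>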
<p>Section 4 continues Section 3 and shows how it is fairly easy to use the link structure to find similar pages given an initial query.  Unlike <a href="http://en.wikipedia.org/wiki/Relevance_feedback">relevance feedback</a>, Kleinberg&#8217;s approach relies solely on the link structure.</p>
<p>Tomorrow or Friday, I will look at the remaining sections of the paper.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a>, <a href="http://technorati.com/tag/Kleinberg" rel="tag">Kleinberg</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/TextRank" rel="tag">TextRank</a>, <a href="http://technorati.com/tag/Mihalcea" rel="tag">Mihalcea</a>, <a href="http://technorati.com/tag/Brin+and+Page" rel="tag">Brin and Page</a>, <a href="http://technorati.com/tag/eigenvectors" rel="tag">eigenvectors</a>, <a href="http://technorati.com/tag/link+analysis" rel="tag">link analysis</a>, <a href="http://technorati.com/tag/bibliometrics" rel="tag">bibliometrics</a>, <a href="http://technorati.com/tag/linear+algebra" rel="tag">linear algebra</a>, <a href="http://technorati.com/tag/information+retrieval" rel="tag">information retrieval</a>, <a href="http://technorati.com/tag/search+engines" rel="tag">search engines</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Addendum to TextRank</title>
		<link>http://www.paperoftheweek.com/2007/02/03/addendum-to-textrank/</link>
		<comments>http://www.paperoftheweek.com/2007/02/03/addendum-to-textrank/#comments</comments>
		<pubDate>Sat, 03 Feb 2007 23:35:08 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/03/addendum-to-textrank/</guid>
		<description><![CDATA[I implemented this last night using JGraphT just to see what was involved. I would guess it took me around 4 hours and is pretty straight forward. I did have a little trouble getting down how to best estimate the error for calculating convergence, but did eventually arrive at a solution after simplifying my tests [...]]]></description>
			<content:encoded><![CDATA[<p>I implemented this last night using <a href="http://jgrapht.sourceforge.net/">JGraphT</a> just to see what was involved.  I would guess it took me around 4 hours and is pretty straight forward.  I did have a little trouble getting down how to best estimate the error for calculating convergence, but did eventually arrive at a solution after simplifying my tests to use some simpler input.  One of the cool things about this approach is it doesn&#8217;t have to apply just to text.  The JGraphT library supports Java 1.5 and generics, so you can put pretty much anything you want on a node and define a relation between nodes and then run the ranking algorithm.</p>
<p>I&#8217;m not going to post the code just yet, but I may in the future, but here&#8217;s the Gist of what I did:</p>
<ol>
<li>Create your graph and relations using JGraphT.  I created Vertex and Edge objects.  The Vertex has two Generic properties (Key, Value) and maintains a score value as a double.  The Edge has a weight as a double and what I call an EdgeLoad, namely a Generic property that can hold some object that is associated with an edge.  While this is overkill for my initial implementation of keyword extraction, I think having the ability to store info on the vertex and edge will come in handy later as I explore different approaches.</li>
<li>Create a GraphRanker interface that has a rank method that is overloaded to take in either an Undirected or Directed Graph.</li>
<li>Implement the GraphRanker interface to rank the Graph.  This involved:
<ol>
<li>While the error rate is greater than the threshold and we haven&#8217;t reached the maximum number of iterations:</li>
<li>for each vertex in the Graph</li>
<li>calculate the rank of the vertex
<ol>
<li>The rank is calculated according to the formula in the paper.  Just remember that it doesn&#8217;t need to be recursive; you can use any dummy starting value when calculating the first S(V<sub>j</sub>).</li>
</ol>
</li>
<li>Set the error rate (estimated) to be the new rank &#8211; the old rank</li>
<li>Repeat from step 1 until convergence is reached or you max out your iterations</li>
<li>Sort the vertices by score and return</li>
</ol>
</li>
</ol>
<p>Some tips to know you got it:  The average rank of an unweighted graph should approach 1.  Start very simple and do the keyword extraction version first.  Put only two keywords in your graph.  Since they co-occur within 2 positions, they should have equal rank upon convergence since they both contribute evenly to each other.</p>
<p>Give it a try and let me know if you have any questions.  This really is quite easy to implement.  If your graph gets really large, you may want to look into using <a href="http://lucene.apache.org/hadoop">Hadoop</a> to distribute the work.</p>
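<p>Since I am holding back the Java for now, here is a stripped-down Python sketch of the ranking loop from step 3, enough to try the sanity checks above.  It is not my JGraphT code: the <code>rank</code> function, the dictionary-of-sets graph, the default values, and the use of the largest absolute change as the error estimate are all assumptions made for the example.</p>

```python
def rank(neighbors, damping=0.85, threshold=1e-6, max_iterations=100):
    """neighbors maps each vertex to the set of vertices it shares an edge with."""
    scores = {v: 1.0 for v in neighbors}  # any dummy starting value works
    for _ in range(max_iterations):
        error = 0.0
        new_scores = {}
        for vertex, adjacent in neighbors.items():
            # Each neighbor splits its current score evenly across its own edges.
            incoming = sum(scores[u] / len(neighbors[u]) for u in adjacent)
            new_scores[vertex] = (1 - damping) + damping * incoming
            # Assumed error estimate: the largest change seen in this pass.
            error = max(error, abs(new_scores[vertex] - scores[vertex]))
        scores = new_scores
        if threshold > error:  # converged
            break
    # Highest-scoring vertices first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

<p>Calling <code>rank({"graph": {"rank"}, "rank": {"graph"}})</code> gives both keywords the same score, and on a small star-shaped graph the center vertex comes out on top while the average score stays at 1.</p>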
<p>Also, go out and see what you can apply it to, don&#8217;t just limit yourself to text.  Let me know what works and doesn&#8217;t work for you.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/TextRank" rel="tag">TextRank</a>, <a href="http://technorati.com/tag/Graph+Theory" rel="tag">Graph Theory</a>, <a href="http://technorati.com/tag/JGraphT" rel="tag">JGraphT</a>, <a href="http://technorati.com/tag/Hadoop" rel="tag">Hadoop</a>, <a href="http://technorati.com/tag/Implementation" rel="tag">Implementation</a>, <a href="http://technorati.com/tag/Java" rel="tag">Java</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/Google" rel="tag">Google</a>, <a href="http://technorati.com/tag/Mihalcea" rel="tag">Mihalcea</a>, <a href="http://technorati.com/tag/Brin+and+Page" rel="tag">Brin and Page</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/03/addendum-to-textrank/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discussion of Section 3 of TextRank</title>
		<link>http://www.paperoftheweek.com/2007/01/31/discussion-of-section-3-of-textrank/</link>
		<comments>http://www.paperoftheweek.com/2007/01/31/discussion-of-section-3-of-textrank/#comments</comments>
		<pubDate>Wed, 31 Jan 2007 23:29:08 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/31/discussion-of-section-3-of-textrank/</guid>
		<description><![CDATA[Section 3 of TextRank: Bringing Order Into Texts covers the first application of the TextRank approach proposed in section 2.  The authors have chosen keyword extraction to demonstrate the capabilities of the approach.  Keyword extraction is the problem of determining the keywords that best describe a document.  It can be thought of as a precursor [...]]]></description>
			<content:encoded><![CDATA[<p>Section 3 of <a href="http://www.paperoftheweek.com/2007/01/29/textrank-by-rada-mihalcea-and-paul-tarau/">TextRank: Bringing Order Into Texts</a> covers the first application of the TextRank approach proposed in section 2.  The authors have chosen keyword extraction to demonstrate the capabilities of the approach.  Keyword extraction is the problem of determining the keywords that best describe a document.  It can be thought of as a precursor to document summarization, I guess.  Those familiar with academic papers might imagine a system that automatically assigns keywords to a paper.  In the blogosphere, you can imagine how nice it would be to not have to pick your own keywords,  just let the system pick them for you (hmmm, that gives me an idea&#8230; anyone want to write a wordpress plugin?)</p>
<p>With the problem defined, Mihalcea describes the preprocessing steps for building the graph.  Namely, words are tokenized and part of speech is assigned.  These words are then added as nodes to the graph (note, they optionally can filter out certain types of words, such as all adverbs or all nouns) and edges are added between two nodes if the two words co-occur within some variable number of words, usually somewhere between 2 and 10.  Once the graph is setup, the convergence algorithm is run until some approximated error threshold is met and the results are ranked and returned.</p>
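<p>The graph-building step is easy to sketch in Python.  Treat this as an illustration under assumptions rather than the paper&#8217;s procedure: the input is assumed to be already tokenized and tagged, the tag names (<code>NOUN</code>, <code>ADJ</code>) and the function name <code>cooccurrence_graph</code> are hypothetical, and distances are measured on the original word positions.</p>

```python
def cooccurrence_graph(tagged_words, window=2, keep_tags=("NOUN", "ADJ")):
    """Link kept words whose positions fall inside the same window of `window` words."""
    neighbors = {}
    for i, (word, tag) in enumerate(tagged_words):
        if tag not in keep_tags:
            continue  # filtered-out words never become nodes
        neighbors.setdefault(word, set())
        # Look ahead at the following words that still fall inside the window.
        for j in range(i + 1, min(i + window, len(tagged_words))):
            other, other_tag = tagged_words[j]
            if other_tag in keep_tags and other != word:
                # Undirected graph: record the edge on both endpoints.
                neighbors[word].add(other)
                neighbors.setdefault(other, set()).add(word)
    return neighbors
```

<p>With a window of 2 only adjacent words are linked; widening the window adds edges between words that sit a little further apart, which is the knob the paper varies between 2 and 10.</p>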
<p>Performance wise, the algorithm does quite well, beating the best reported supervised result of the time.  It should be noted that this approach is completely unsupervised, which is a real benefit.  In my mind, if you can get better results without supervision, you have a winner.  The rest of section 3 is a discussion of the TextRank results versus other comparable systems.  See the paper for details.</p>
<p>Next post we will discuss the approach applied to Sentence extraction for automatic summarization, and we will end the week with the remainder of the paper on Friday.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/TextRank" rel="tag">TextRank</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/Google" rel="tag">Google</a>, <a href="http://technorati.com/tag/Mihalcea" rel="tag">Mihalcea</a>, <a href="http://technorati.com/tag/Keyword+Extraction" rel="tag">Keyword Extraction</a>, <a href="http://technorati.com/tag/Natural+Language+Processing" rel="tag">Natural Language Processing</a>, <a href="http://technorati.com/tag/NLP" rel="tag">NLP</a>, <a href="http://technorati.com/tag/machine+learning" rel="tag">machine learning</a>, <a href="http://technorati.com/tag/algorithms" rel="tag">algorithms</a>, <a href="http://technorati.com/tag/computer+science" rel="tag">computer science</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/31/discussion-of-section-3-of-textrank/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>TextRank by Rada Mihalcea and Paul Tarau</title>
		<link>http://www.paperoftheweek.com/2007/01/29/textrank-by-rada-mihalcea-and-paul-tarau/</link>
		<comments>http://www.paperoftheweek.com/2007/01/29/textrank-by-rada-mihalcea-and-paul-tarau/#comments</comments>
		<pubDate>Mon, 29 Jan 2007 14:05:59 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/29/textrank-by-rada-mihalcea-and-paul-tarau/</guid>
		<description><![CDATA[This week we will be continuing our graph-theory based approach to NLP and take a look at TextRank: Bringing Order to Texts. The paper claims to show us how to use graph-ranking approaches in some unsupervised learning tasks such as keyword and sentence extraction. TextRank, NLP, natural language processing, unsupervised learning, PageRank, HITS, computer science, [...]]]></description>
			<content:encoded><![CDATA[<p>This week we will be continuing our graph-theory based approach to NLP and take a look at<a href="http://www.cs.unt.edu/~rada/papers/mihalcea.emnlp04.pdf"> TextRank: Bringing Order to Texts</a>. The paper claims to show us how to use graph-ranking approaches in some unsupervised learning tasks such as keyword and sentence extraction.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/TextRank" rel="tag">TextRank</a>, <a href="http://technorati.com/tag/NLP" rel="tag">NLP</a>, <a href="http://technorati.com/tag/natural+language+processing" rel="tag">natural language processing</a>, <a href="http://technorati.com/tag/unsupervised+learning" rel="tag">unsupervised learning</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/HITS" rel="tag">HITS</a>, <a href="http://technorati.com/tag/computer+science" rel="tag">computer science</a>, <a href="http://technorati.com/tag/algorithms" rel="tag">algorithms</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/29/textrank-by-rada-mihalcea-and-paul-tarau/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Discussion of &#8220;The Anatomy of a Search Engine&#8221;</title>
		<link>http://www.paperoftheweek.com/2007/01/25/discussion-of-the-anatomy-of-a-search-engine/</link>
		<comments>http://www.paperoftheweek.com/2007/01/25/discussion-of-the-anatomy-of-a-search-engine/#comments</comments>
		<pubDate>Fri, 26 Jan 2007 02:07:36 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Information Retrieval]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/25/discussion-of-the-anatomy-of-a-search-engine/</guid>
		<description><![CDATA[While much of this paper is now &#8220;quaint&#8221; in light of the amount of money that Google is making, what with their &#8220;noble&#8221; goals to save the world and all, the fact remains that there are many interesting ideas in the paper and it does a good job of bridging the gap between real world [...]]]></description>
			<content:encoded><![CDATA[<p>While much of this paper is now &#8220;quaint&#8221; in light of the amount of money that <a href="http://www.google.com">Google</a> is making, what with their &#8220;noble&#8221; goals to save the world and all, the fact remains that there are many interesting ideas in the paper and it does a good job of bridging the gap between real world systems and academic research.  Essentially, this paper comes down to two main parts that I am interested in:</p>
<ol>
<li>PageRank &#8211; The formula and approach for ranking web pages.</li>
<li>System Anatomy &#8211; The discussion of how Brin and Page set up their initial system.</li>
</ol>
<p>I&#8217;ll skip the details on performance, etc. since it should be pretty obvious that it works.  It should be noted, though, that this new approach did wonders for how the web was searched.  If you remember the days before Google, most search results were pretty low quality, with the exception of Yahoo!, which relied on a directory structure.<br />
<strong>PageRank</strong><br />
Here is the discussion of PageRank from <a href="http://infolab.stanford.edu/~backrub/google.html">http://infolab.stanford.edu/~backrub/google.html</a></p>
<blockquote><p><em>We assume page A has pages T1&#8230;Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:</em></p>
<p><em>PR(A) = (1-d) + d (PR(T1)/C(T1) + &#8230; + PR(Tn)/C(Tn))</em></p>
<p><em>Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages&#8217; PageRanks will be one.</em></p></blockquote>
<p>PageRank is really just the notion that the more pages that point to a page, the higher that page should rank.  Additionally, the effect is cumulative: if A points at B and B points at C, A&#8217;s PR is also a factor in calculating C&#8217;s PR.  The paper doesn&#8217;t go into details about the exact algorithm for calculating PR, simply stating that it is a simple iterative algorithm.  Of course, this is now part of Google&#8217;s proprietary system, so the world may never know, much to the chagrin of many an SEO consultant.  In the upcoming papers, we will see references to PR and how it can be used for doing other NLP tasks, so it is good to know the basic idea.</p>
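<p>To see the cumulative effect in numbers, here is a small Python sketch that just applies the quoted formula over and over.  It is my own guess at a &#8220;simple iterative algorithm&#8221;, not Google&#8217;s implementation; the function name <code>pagerank</code>, the starting value of 1.0, and the fixed iteration count are assumptions.</p>

```python
def pagerank(out_links, d=0.85, iterations=50):
    """out_links maps each page to the list of pages it links to."""
    pages = set(out_links) | {t for targets in out_links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # T1..Tn in the formula: every page that links to this one.
            inbound = [src for src, targets in out_links.items() if page in targets]
            # PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
            new_pr[page] = (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in inbound)
        pr = new_pr
    return pr
```

<p>For the chain A &#8594; B &#8594; C with d = 0.85, A has no inbound links and settles at 0.15, B gets 0.15 + 0.85 &#215; 0.15 = 0.2775, and C gets 0.15 + 0.85 &#215; 0.2775 = 0.385875, so A&#8217;s score really does flow through B into C.</p>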
<p><strong>System Anatomy</strong></p>
<p>From an engineering standpoint, the System Anatomy section of the paper is quite interesting, especially since it goes into some detail about how to set up a basic Google-like search engine.  Of note is the discussion, albeit brief, of the distributed file system, called BigFiles (section 4.2.1).  For an open source version, see <a href="http://lucene.apache.org/hadoop">Hadoop</a>, which has both a DFS and an implementation of Google&#8217;s Map/Reduce algorithm.  Moving on, there is some discussion of how the lexicon is stored as well as the documents.  These sections strike me as standard indexing techniques.</p>
<p>Following the lexicon subsection, though, is the discussion of how they store the &#8220;Hit Lists&#8221;, that is, the list of occurrences of terms in documents, along with payload information about the terms such as font size, capitalization, etc.  In comparison, Lucene currently only supports the term occurrence info and not payload information, although there is a submitted patch that allows for indexing payloads.  These Hit Lists are then stored in both the forward index and the inverted index.</p>
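<p>As a toy illustration of the idea (and nothing like the compact encoding the paper describes), here is a Python sketch that keeps hit lists with a small payload in both a forward and an inverted index.  The <code>Hit</code> tuple, the <code>build_indexes</code> function, the whitespace tokenizer, and the choice of capitalization as the only payload are all assumptions for the example.</p>

```python
from collections import defaultdict, namedtuple

# Assumed payload: word position plus a capitalization flag.
Hit = namedtuple("Hit", ["position", "capitalized"])


def build_indexes(documents):
    """documents maps a doc id to its text; returns (forward, inverted) indexes."""
    forward = {}                   # doc id -> term -> hit list
    inverted = defaultdict(dict)   # term -> doc id -> hit list
    for doc_id, text in documents.items():
        hits_by_term = defaultdict(list)
        for position, token in enumerate(text.split()):
            hits_by_term[token.lower()].append(Hit(position, token[0].isupper()))
        forward[doc_id] = dict(hits_by_term)
        for term, hit_list in hits_by_term.items():
            inverted[term][doc_id] = hit_list
    return forward, dict(inverted)
```

<p>The forward index answers &#8220;which terms, and where, are in this document?&#8221; while the inverted index answers &#8220;which documents, and where, contain this term?&#8221;; both point at the same hit lists.</p>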
<p>Section 4.4 concerns indexing the web and dealing with the plethora of errors that occur in malformed HTML as well as how to create the various indexes.  Currently, I think there are a number of good solutions that can help solve this problem, given the right amount of hardware (see <a href="http://lucene.apache.org/nutch">Nutch</a> for an Open Source version.)</p>
<p>Section 4.5 deals with the issues of searching, namely how to handle single word queries and multi-word queries.  Multi-word queries can be a bit tricky since you want to weight hits with terms that occur closer together in a document higher than those that are farther apart.</p>
<p>The final section (section 5) covers results and performance, which I&#8217;m not going to go into because I think it is obvious to anyone who has had an online pulse in the last 10 years that the Google approach works.</p>
<p>Next week, we will start looking into some graph based papers that have some basis in the PageRank calculation, but use it in different contexts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/25/discussion-of-the-anatomy-of-a-search-engine/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Anatomy of a Search Engine</title>
		<link>http://www.paperoftheweek.com/2007/01/22/the-anatomy-of-a-search-engine/</link>
		<comments>http://www.paperoftheweek.com/2007/01/22/the-anatomy-of-a-search-engine/#comments</comments>
		<pubDate>Mon, 22 Jan 2007 12:40:40 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Information Retrieval]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/01/22/the-anatomy-of-a-search-engine/</guid>
		<description><![CDATA[The Anatomy of a Search Engine is the Paper of the Week for the week of January 22, 2007. OK, I&#8217;ll admit I&#8217;ve read this one before (two or three times, actually) but it&#8217;s going to serve as the intro to several other pieces on using Graph Theory for doing NLP. Plus, it is the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://infolab.stanford.edu/~backrub/google.html">The Anatomy of a Search Engine</a> is the Paper of the Week for the week of January 22, 2007.<br />
OK, I&#8217;ll admit I&#8217;ve read this one before (two or three times, actually) but it&#8217;s going to serve as the intro to several other pieces on using Graph Theory for doing NLP.  Plus, it is <strong><em>the</em></strong> <a href="http://www.google.com">Google</a> paper, so it is always interesting to re-read to find new tidbits you passed over before.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/Google" rel="tag">Google</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/Information+Retrieval" rel="tag">Information Retrieval</a>, <a href="http://technorati.com/tag/Graph+Theory" rel="tag">Graph Theory</a>, <a href="http://technorati.com/tag/Brin" rel="tag">Brin</a>, <a href="http://technorati.com/tag/Page" rel="tag">Page</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/01/22/the-anatomy-of-a-search-engine/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

