<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper of the Week &#187; Graph Theory</title>
	<atom:link href="http://www.paperoftheweek.com/category/computer-science/algorithms/graph-theory/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.paperoftheweek.com</link>
	<description>Read. Learn. Discuss.</description>
	<lastBuildDate>Tue, 14 Aug 2007 01:35:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Discussion of LexRank by Erkan and Radev</title>
		<link>http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/</link>
		<comments>http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/#comments</comments>
		<pubDate>Fri, 23 Feb 2007 02:03:40 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[linear algebra]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[text summarization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/</guid>
		<description><![CDATA[POTW 2/18/07: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization The LexRank paper by Erkan and Radev is another PageRank/Graph Theory based approach to working with text, this time applied to the task of summarization. Key parts of sections 1 and 2 discuss the general problem of corpus-based summarization.  Unlike TextRank by Mihalcea, Erkan [...]]]></description>
			<content:encoded><![CDATA[<p>POTW 2/18/07: <a href="http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html">LexRank: Graph-based Lexical Centrality as Salience in Text Summarization</a></p>
<p>The LexRank paper by Erkan and Radev is another PageRank/Graph Theory based approach to working with text, this time applied to the task of summarization.</p>
<p>Key parts of sections 1 and 2 discuss the general problem of corpus-based summarization.  Unlike Mihalcea&#8217;s TextRank work, Erkan and Radev are interested in summarizing groups of documents (although the approach can be applied to individual documents as well).  They describe a &#8220;centroid-based&#8221; model that tries to determine the most important sentences in a group of documents.  In this approach, the sentences that contain more words close to the centroid of the document cluster are deemed more important and are therefore given higher weight.  In section 3, the authors lay out their method for determining the similarity between two sentences using a TF-IDF formula (think vector space search).  From these similarity scores, a matrix can be constructed that records the similarity between every pair of sentences.</p>
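<p>To make the similarity matrix concrete, here is a minimal Python sketch of TF-IDF cosine similarity between sentences.  It is not the authors&#8217; code: the whitespace tokenizer, the function names, and the +1.0 IDF smoothing are all my own assumptions, chosen only to keep the example short.</p>

```python
import math
from collections import Counter


def idf_table(sentences):
    """Inverse document frequency, treating each sentence as a 'document'."""
    n = len(sentences)
    df = Counter(word for s in sentences for word in set(s.lower().split()))
    # The +1.0 smoothing is an assumption so that words appearing in every
    # sentence still carry a little weight.
    return {word: math.log(n / count) + 1.0 for word, count in df.items()}


def tfidf_vector(sentence, idf):
    """Sparse TF-IDF vector for one sentence (naive whitespace tokenizer)."""
    tf = Counter(sentence.lower().split())
    return {word: count * idf[word] for word, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(sentences):
    """Square matrix holding the similarity of every pair of sentences."""
    idf = idf_table(sentences)
    vectors = [tfidf_vector(s, idf) for s in sentences]
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

<p>Calling <code>similarity_matrix</code> on a list of sentence strings returns a symmetric matrix with ones on the diagonal and zeros for pairs of sentences that share no words.</p>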
<p>After doing some fancy linear algebra, they come to define LexRank, which is really just Google&#8217;s PageRank applied to this particular problem.  Section 3.2 provides all the gory details on the math and how the problem can be described in terms of stochastic matrices and Markov chains, which all seems to boil down to the lovely formula that Brin and Page gave us.  This section also lays out in more detail the iterative algorithm for calculating PageRank/TextRank/LexRank/YouFillInTheBlankRank.</p>
<p>Taking this matrix, we treat it as a graph: the sentences are the nodes, the edges are weighted according to our similarity, and the iterative algorithm is run over it.</p>
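<p>As a rough illustration of that last step, the sketch below runs a PageRank-style power iteration over a similarity matrix.  Treat it as an assumption-laden sketch rather than the paper&#8217;s exact algorithm: the function name, the 0.85 damping value, and the convergence test are mine.</p>

```python
def lexrank(similarity, damping=0.85, tolerance=1e-8, max_iterations=200):
    """Score sentences by power iteration over a square similarity matrix."""
    n = len(similarity)
    # Row-normalize so that each row is a probability distribution,
    # which turns the similarity matrix into a stochastic matrix.
    transition = []
    for row in similarity:
        total = sum(row)
        if total > 0:
            transition.append([value / total for value in row])
        else:
            transition.append([1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(max_iterations):
        new_scores = [
            (1.0 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
        # Largest change in any single score since the previous pass.
        change = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if tolerance >= change:
            break
    return scores
```

<p>The scores sum to one, and a sentence that is similar to many other sentences ends up with a higher score than one that sits off on its own.</p>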
<p>I know this sounds like skimping on my part, but the rest of the paper is just about the experiments they ran and I don&#8217;t feel like rehashing those.  Guess what?  It did pretty well.  One interesting thing is that the MEAD summarization system is available at <a href="http://www.summarization.com/">http://www.summarization.com/</a>.</p>
<p>Like most of the graph-based approaches we have seen so far, it does well with noisy data, which is one of the big selling points for me.</p>
<p>Next week, on to something new: question answering!  See you then!</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/LexRank" rel="tag">LexRank</a>, <a href="http://technorati.com/tag/Radev" rel="tag">Radev</a>, <a href="http://technorati.com/tag/Erkan" rel="tag">Erkan</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/Brin" rel="tag">Brin</a>, <a href="http://technorati.com/tag/Page" rel="tag">Page</a>, <a href="http://technorati.com/tag/Google" rel="tag">Google</a>, <a href="http://technorati.com/tag/linear+algebra" rel="tag">linear algebra</a>, <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 2/18/07: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization</title>
		<link>http://www.paperoftheweek.com/2007/02/18/potw-21807-lexrank-graph-based-lexical-centrality-as-salience-in-text-summarization/</link>
		<comments>http://www.paperoftheweek.com/2007/02/18/potw-21807-lexrank-graph-based-lexical-centrality-as-salience-in-text-summarization/#comments</comments>
		<pubDate>Sun, 18 Feb 2007 16:57:41 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[text summarization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/18/potw-21807-lexrank-graph-based-lexical-centrality-as-salience-in-text-summarization/</guid>
		<description><![CDATA[The POTW for 2/18/07 is another graph-based approach, this time by another leader in this area, Dragomir Radev.  The paper is LexRank: Graph-based Lexical Centrality as Salience in Text Summarization Enjoy!]]></description>
			<content:encoded><![CDATA[<p>The POTW for 2/18/07 is another graph-based approach, this time by another leader in this area, Dragomir Radev.  The paper is</p>
<p><a href="http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html">LexRank: Graph-based Lexical Centrality as Salience in Text Summarization</a></p>
<p>Enjoy!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/18/potw-21807-lexrank-graph-based-lexical-centrality-as-salience-in-text-summarization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 2/11/07: Discussion of sections 5-8 of Minkov et al.</title>
		<link>http://www.paperoftheweek.com/2007/02/16/potw-21107-discussion-of-sections-5-8-of-minkov-et-al/</link>
		<comments>http://www.paperoftheweek.com/2007/02/16/potw-21107-discussion-of-sections-5-8-of-minkov-et-al/#comments</comments>
		<pubDate>Fri, 16 Feb 2007 11:54:09 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[disambiguation]]></category>
		<category><![CDATA[email]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/16/potw-21107-discussion-of-sections-5-8-of-minkov-et-al/</guid>
		<description><![CDATA[POTW 2/11/07: Contextual Search and Name Disambiguation in Email Using Graphs Discussion of Sections 5 through 8 The remaining sections of this paper are discussions of two applications of the algorithms plus the body of related work and conclusions that can be drawn from the work. Section 5 gives the details on what corpora were [...]]]></description>
			<content:encoded><![CDATA[<p>POTW 2/11/07: <a href="http://www.cs.cmu.edu/%7Eeinat/sigir-06.pdf">Contextual Search and Name Disambiguation in Email Using Graphs</a></p>
<p><strong>Discussion of Sections 5 through 8</strong></p>
<p>The remaining sections of this paper are discussions of two applications of the algorithms plus the body of related work and conclusions that can be drawn from the work.  Section 5 gives the details on what corpora were used (Enron, plus an internal email set not publicly available) before proceeding to the task of name disambiguation.</p>
<p>Name disambiguation in email is the task of correlating the mention of a name in an email with the actual person.  While this is fairly straightforward in most cases for people reading their own email, it becomes difficult when reading other people&#8217;s email, since one may not know everyone the other person does.  Multiply this by a collection of emails filled with nicknames and initials and it is not hard to see why this is difficult.  The task is useful for establishing social networks as well as other applications.  One could easily imagine an automated system that retrieved relevant information about a person mentioned in an email (bio, address, phone, past conversations, etc.) and made it available to the reader for instant access.</p>
<p>The remaining parts of section 5 go into the details of applying the algorithm and the results that are achieved.   The interesting thing in my mind is that the graph often connects different types of nodes and associates different probabilities with the transitions from one node to the other.  Applying the graph walking strategy then leads to the desired results.  Suffice it to say, the new approach performs better than the baseline approach!</p>
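<p>To give a feel for a walk over a graph with typed edges, here is a small Python sketch.  It is only my simplified reading of the idea, not the formulation from the paper: the edge types, the weights, the <code>stay</code> probability, and every name in the code are hypothetical.</p>

```python
from collections import defaultdict


def walk(edges, type_weights, start, steps=3, stay=0.5):
    """Spread probability mass from a start node over a typed graph.

    edges is a list of (source, edge_type, target) triples and
    type_weights maps each edge type to a relative weight.
    """
    # Group outgoing edges by source node and then by edge type.
    outgoing = defaultdict(lambda: defaultdict(list))
    for source, edge_type, target in edges:
        outgoing[source][edge_type].append(target)

    distribution = {start: 1.0}
    for _ in range(steps):
        updated = defaultdict(float)
        for node, mass in distribution.items():
            by_type = outgoing.get(node)
            if not by_type:
                # Dead end: all of the mass stays where it is.
                updated[node] += mass
                continue
            # Keep part of the mass on the current node.
            updated[node] += stay * mass
            total = sum(type_weights[t] for t in by_type)
            for edge_type, targets in by_type.items():
                # Pick an edge type in proportion to its weight, then
                # pick uniformly among the targets of that type.
                share = (1.0 - stay) * mass * type_weights[edge_type] / total
                for target in targets:
                    updated[target] += share / len(targets)
        distribution = dict(updated)
    return distribution
```

<p>Starting the walk at a name mention and reading off the person nodes with the most probability mass gives a ranked list of candidates, with heavier edge types pulling more of the mass toward the nodes they lead to.</p>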
<p>Identifying threads of emails is the second application the authors use to demonstrate their approach.  Threading is the problem of identifying one or more messages that are related to some chosen email.  Many email systems do a basic job of this by comparing subject lines, especially those that use the &#8220;RE:&#8221; prefix.  However, we all know people often treat the subject line differently.  Furthermore, people tend to quote the previous messages in the thread differently.  Some use &#8220;&gt;&#8221;, while others use &#8220;|&#8221;, while still others use nothing at all.  Add in inline replies, which are especially common on mailing lists, and you see why the problem becomes difficult.  Section 5.4 lays out the graph walk approach and compares it to a TF-IDF IR approach, which, of course, it outperforms, especially when combined with a machine learning re-ranking step.</p>
<p>The rest of the paper is on related work and conclusions.  I am glad the authors address the performance in terms of scalability in the conclusions section, as I had my doubts about how well the approach could perform on large amounts of data.  In fact, I find a lot of papers in the NLP realm fail to account for performance, so it is refreshing to see it addressed.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a>, <a href="http://technorati.com/tag/email" rel="tag">email</a>, <a href="http://technorati.com/tag/natural+language+processing" rel="tag">natural language processing</a>, <a href="http://technorati.com/tag/NLP" rel="tag">NLP</a>, <a href="http://technorati.com/tag/information+retrieval" rel="tag">information retrieval</a>, <a href="http://technorati.com/tag/IR" rel="tag">IR</a>, <a href="http://technorati.com/tag/threading" rel="tag">threading</a>, <a href="http://technorati.com/tag/name+disambiguation" rel="tag">name disambiguation</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/16/potw-21107-discussion-of-sections-5-8-of-minkov-et-al/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW 2/11/07: Contextual Search and Name Disambiguation in Email Using Graphs by Minkov et al.</title>
		<link>http://www.paperoftheweek.com/2007/02/11/potw-21107-contextual-search-and-name-disambiguation-in-email-using-graphs-by-minkov-et-al/</link>
		<comments>http://www.paperoftheweek.com/2007/02/11/potw-21107-contextual-search-and-name-disambiguation-in-email-using-graphs-by-minkov-et-al/#comments</comments>
		<pubDate>Sun, 11 Feb 2007 18:39:04 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[disambiguation]]></category>
		<category><![CDATA[email]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/11/potw-21107-contextual-search-and-name-disambiguation-in-email-using-graphs-by-minkov-et-al/</guid>
		<description><![CDATA[This week&#8217;s paper is another graph theory paper (do you sense a trend?) by Minkov et al., titled &#8220;Contextual Search and Name Disambiguation in Email Using Graphs&#8221;.  It appeared in SIGIR &#8217;06. email, graph theory, disambiguation Technorati Tags: email, graph theory, disambiguation]]></description>
			<content:encoded><![CDATA[<p>This week&#8217;s paper is another graph theory paper (do you sense a trend?) by Minkov et al., titled &#8220;<a href="http://www.cs.cmu.edu/~einat/sigir-06.pdf">Contextual Search and Name Disambiguation in Email Using Graphs</a>&#8221;.  It appeared in SIGIR &#8217;06.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/email" rel="tag">email</a>, <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a>, <a href="http://technorati.com/tag/disambiguation" rel="tag">disambiguation</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/11/potw-21107-contextual-search-and-name-disambiguation-in-email-using-graphs-by-minkov-et-al/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Remaining Sections of Kleinberg</title>
		<link>http://www.paperoftheweek.com/2007/02/08/remaining-sections-of-kleinberg/</link>
		<comments>http://www.paperoftheweek.com/2007/02/08/remaining-sections-of-kleinberg/#comments</comments>
		<pubDate>Fri, 09 Feb 2007 03:11:09 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[linear algebra]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/08/remaining-sections-of-kleinberg/</guid>
		<description><![CDATA[Sections 5 through 7 of Authoritative Sources in a Hyperlinked Environment cover some more applications of the hubs and authorities approach and then wrap up with the conclusion. Section 5 examines the quality of the authority measure by comparing the results achieved by running the Kleinberg algorithm against some of the searchable hierarchies that exist [...]]]></description>
			<content:encoded><![CDATA[<p>Sections 5 through 7 of <a href="http://citeseer.ist.psu.edu/87928.html">Authoritative Sources in a Hyperlinked Environment</a> cover some more applications of the hubs and authorities approach and then wrap up with the conclusion.</p>
<p>Section 5 examines the quality of the authority measure by comparing the results achieved by running the Kleinberg algorithm against some of the searchable hierarchies that exist on the web (Yahoo, others that are long gone) and shows that this approach does pretty well at the task.</p>
<p>Section 6 then covers one of the problems with the approach, which Kleinberg labels diffusion.  Diffusion happens when the algorithm produces results that are not on the original topic.  In most cases, the results are more general than the specific query called for.  Kleinberg&#8217;s example is the query &#8220;medical conferences&#8221;, which yields results that are mainly about medicine.  A proposed solution is to reintroduce the query back into the problem after the algorithm has run by using the terms of the query to rerank the results in the result set.  Section 6 gives a fair amount of detail on how this reranking can take place.</p>
<p>Well, that wraps up another week.  I hope people are finding this useful.  I know I am (even if I cheat a little bit like I did tonight and gloss over the details a little more).  If anyone is interested in discussing a paper on a particular topic, please feel free to leave a comment suggesting one.  For now, I am going to do one or two more weeks on graph-based approaches to NLP and then I&#8217;m going to start looking at some Question Answering papers.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/NLP" rel="tag">NLP</a>, <a href="http://technorati.com/tag/Graph+Theory" rel="tag">Graph Theory</a>, <a href="http://technorati.com/tag/Kleinberg" rel="tag">Kleinberg</a>, <a href="http://technorati.com/tag/diffusion" rel="tag">diffusion</a>, <a href="http://technorati.com/tag/information+retrieval" rel="tag">information retrieval</a>, <a href="http://technorati.com/tag/IR" rel="tag">IR</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/08/remaining-sections-of-kleinberg/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Computing Eigenvectors and Eigenvalues</title>
		<link>http://www.paperoftheweek.com/2007/02/08/computing-eigenvectors-and-eigenvalues/</link>
		<comments>http://www.paperoftheweek.com/2007/02/08/computing-eigenvectors-and-eigenvalues/#comments</comments>
		<pubDate>Fri, 09 Feb 2007 02:57:39 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[linear algebra]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/08/computing-eigenvectors-and-eigenvalues/</guid>
		<description><![CDATA[Help computing eigenvectors and eigenvalues is available at: Computing Eigenvectors and Eigenvalues linear algebra, eigenvalues, eigenvectors Technorati Tags: linear algebra, eigenvalues, eigenvectors]]></description>
			<content:encoded><![CDATA[<p>Help computing eigenvectors and eigenvalues is available at:</p>
<p><a href="http://cnx.org/content/m12083/latest/">Computing Eigenvectors and Eigenvalues</a></p>
<p>Technorati Tags: <a href="http://technorati.com/tag/linear+algebra" rel="tag">linear algebra</a>, <a href="http://technorati.com/tag/eigenvalues" rel="tag">eigenvalues</a>, <a href="http://technorati.com/tag/eigenvectors" rel="tag">eigenvectors</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/08/computing-eigenvectors-and-eigenvalues/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discussion of Sections 1-4 of Kleinberg</title>
		<link>http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/</link>
		<comments>http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/#comments</comments>
		<pubDate>Thu, 08 Feb 2007 02:29:23 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[linear algebra]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/</guid>
		<description><![CDATA[For the week of Feb. 4, 2007, we are discussing Authoritative Sources in a Hyperlinked Environment, with today&#8217;s posting focused on sections 1 through 4.  This paper really has some meat on it and I found myself having to reread sections and even dig out my old Linear Algebra text. As is typical of most [...]]]></description>
			<content:encoded><![CDATA[<p>For the week of Feb. 4, 2007, we are discussing <a href="http://citeseer.ist.psu.edu/87928.html">Authoritative Sources in a Hyperlinked Environment</a>, with today&#8217;s posting focused on sections 1 through 4.  This paper really has some meat on it and I found myself having to reread sections and even dig out my old Linear Algebra text.</p>
<p>As is typical of most academic papers, section 1 introduces us to the problem that is being discussed, in this case how to determine the authoritative sites on the web.  Specifically, in the prehistoric web days, search engines would bring you back lots of hits, but seldom did they bring you back the most important hits, at least not high in the rankings (we are relatively spoiled these days in this regard).  As Brin and Page alluded to, a query for &#8220;microsoft&#8221; on these old engines might not even return microsoft.com as the first result, even though most users would tell you that it should.  Kleinberg asserts that the big thing missing from search engines in those days was the use of link analysis to determine the authoritative sites on the web so that they could be ranked higher.  Unlike closed corpus search engines that suffer from the scarcity problem (not enough good content to search), the web suffers from an abundance problem, namely too much content.  Given this, the goal, Kleinberg says, is to identify the authoritative pages that are relevant to the user&#8217;s query.</p>
<p>Much like Brin and Page and Mihalcea (see earlier posts), Kleinberg suggests a graph-based approach that uses the links in HTML to determine the authoritative pages.  Additionally, he defines &#8220;hub&#8221; pages, i.e. those pages that have links to multiple relevant authoritative pages.  Hubs never point to hubs and authoritative pages never point to authoritative pages, which yields a <a href="http://en.wikipedia.org/wiki/Bipartite_graph">dense directed bipartite subgraph</a> (don&#8217;t worry, I had to look it up as well).  Hubs and authorities reinforce each other: &#8220;a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.&#8221;  Intuitively this makes sense, as we all know pages that we trust and we know where to go to find them.  I am not sure if this is correct, but I think to some extent in our post-Google world, the hubs are now the search engines themselves.  That is, being ranked highly in the engine tends to reinforce a site as being the authority.</p>
<p>Unlike the previous papers we looked at, this one actually has a fair amount of math, especially in section 2, so dig out your Linear Algebra book if you really want an in-depth understanding.  The algorithm for calculating the hubs and authorities is very similar to both the PageRank and TextRank algorithms we discussed in prior posts and can be calculated using a similar, simple, iterative algorithm (I implemented it in a few hours).  The graph in use for the algorithm consists of a root set of nodes identified by a search engine (AltaVista in this case), which is then expanded to include pages that are pointed to by a node in the set, or that point to a node in the set (possibly bounded by some predetermined amount).  Edges are the links themselves.  Given the graph, it comes down to calculating the authority weights and the hub weights using the iterative algorithm.  A page&#8217;s authority weight is the sum of the hub weights of the pages that point to it, and its hub weight is the sum of the authority weights of the pages it points to.  See pages 9, 10, 11 and 12 for the exact details of proving why this works.  You may want to refresh your knowledge of <a href="http://en.wikipedia.org/wiki/Eigenvector">Eigenvectors</a>, etc.  Seriously, though, don&#8217;t you love when &#8220;discovered solutions&#8221; to problems have such a pure mathematical relationship?  That is, isn&#8217;t it great when the math matches up with your intuition?</p>
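<p>The hub and authority updates can be sketched in a few lines of Python.  This is a minimal sketch of the iterative idea and not my original implementation: the function name, the fixed iteration count, and the dict-of-lists graph format are assumptions made for the example.</p>

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority weights for a small link graph.

    links maps each page to the list of pages it points to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages linking in.
        authority = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                authority[target] += hub[source]
        # Hub weight: sum of the authority weights of pages linked to.
        hub = {
            page: sum(authority[target] for target in links.get(page, []))
            for page in pages
        }
        # Normalize both vectors so the weights do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in authority.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        authority = {page: v / a_norm for page, v in authority.items()}
        hub = {page: v / h_norm for page, v in hub.items()}
    return hub, authority
```

<p>On a toy graph where one page links to two authorities and another page links to only one of them, the first page gets the larger hub weight and the shared target gets the larger authority weight.</p>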
<p>Section 3 expands on the notion of hubs and authorities to discover communities of links and also &#8220;anti&#8221; communities.  Using a variant of the main algorithm, one can determine clusters of sites that are related.  For instance, one can easily separate those sites in favor of abortion from those opposed.  Mathematically, this corresponds to working with the non-principal eigenvectors, whereas the primary algorithm works with the principal eigenvector.</p>
<p>Section 4 continues Section 3 and shows how it is fairly easy to use the link structure to find similar pages given an initial query.  Unlike <a href="http://en.wikipedia.org/wiki/Relevance_feedback">relevance feedback</a>, Kleinberg&#8217;s approach relies solely on the link structure.</p>
<p>Tomorrow or Friday, I will look at the remaining sections of the paper.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a>, <a href="http://technorati.com/tag/Kleinberg" rel="tag">Kleinberg</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/TextRank" rel="tag">TextRank</a>, <a href="http://technorati.com/tag/Mihalcea" rel="tag">Mihalcea</a>, <a href="http://technorati.com/tag/Brin+and+Page" rel="tag">Brin and Page</a>, <a href="http://technorati.com/tag/eigenvectors" rel="tag">eigenvectors</a>, <a href="http://technorati.com/tag/link+analysis" rel="tag">link analysis</a>, <a href="http://technorati.com/tag/bibliometrics" rel="tag">bibliometrics</a>, <a href="http://technorati.com/tag/linear+algebra" rel="tag">linear algebra</a>, <a href="http://technorati.com/tag/information+retrieval" rel="tag">information retrieval</a>, <a href="http://technorati.com/tag/search+engines" rel="tag">search engines</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POTW for Feb 4, 2007: Authoritative Sources In a Hyperlinked Environment</title>
		<link>http://www.paperoftheweek.com/2007/02/04/potw-for-feb-4-2007-authoritative-sources-in-a-hyperlinked-environment/</link>
		<comments>http://www.paperoftheweek.com/2007/02/04/potw-for-feb-4-2007-authoritative-sources-in-a-hyperlinked-environment/#comments</comments>
		<pubDate>Sun, 04 Feb 2007 23:04:29 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/04/potw-for-feb-4-2007-authoritative-sources-in-a-hyperlinked-environment/</guid>
		<description><![CDATA[This week&#8217;s Paper of the Week (POTW) is another graph-based approach for ranking web pages. It is titled Authoritative Sources In a Hyperlinked Environment and is by Jon Kleinberg, who was at Cornell at the time. PageRank, HITS, Jon Kleinberg, Paper of the Week, Cornell, graph theory, ranking Technorati Tags: PageRank, HITS, Jon Kleinberg, [...]]]></description>
			<content:encoded><![CDATA[<p>This week&#8217;s Paper of the Week (POTW) is another graph-based approach for ranking web pages.  It is titled <a href="http://citeseer.ist.psu.edu/87928.html"><em>Authoritative Sources In a Hyperlinked Environment</em></a> and is by Jon Kleinberg, who was at Cornell at the time.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/HITS" rel="tag">HITS</a>, <a href="http://technorati.com/tag/Joe+Kliengberg" rel="tag">Jon Kleinberg</a>, <a href="http://technorati.com/tag/Paper+of+the+Week" rel="tag">Paper of the Week</a>, <a href="http://technorati.com/tag/Cornell" rel="tag">Cornell</a>, <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a>, <a href="http://technorati.com/tag/ranking" rel="tag">ranking</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/04/potw-for-feb-4-2007-authoritative-sources-in-a-hyperlinked-environment/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Addendum to TextRank</title>
		<link>http://www.paperoftheweek.com/2007/02/03/addendum-to-textrank/</link>
		<comments>http://www.paperoftheweek.com/2007/02/03/addendum-to-textrank/#comments</comments>
		<pubDate>Sat, 03 Feb 2007 23:35:08 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/03/addendum-to-textrank/</guid>
		<description><![CDATA[I implemented this last night using JGraphT just to see what was involved. I would guess it took me around 4 hours and it is pretty straightforward. I did have a little trouble working out how best to estimate the error for calculating convergence, but did eventually arrive at a solution after simplifying my tests [...]]]></description>
			<content:encoded><![CDATA[<p>I implemented this last night using <a href="http://jgrapht.sourceforge.net/">JGraphT</a> just to see what was involved.  I would guess it took me around 4 hours and it is pretty straightforward.  I did have a little trouble working out how best to estimate the error for calculating convergence, but did eventually arrive at a solution after simplifying my tests to use some simpler input.  One of the cool things about this approach is that it doesn&#8217;t have to apply just to text.  The JGraphT library supports Java 1.5 and generics, so you can put pretty much anything you want on a node, define a relation between nodes, and then run the ranking algorithm.</p>
<p>I&#8217;m not going to post the code just yet, although I may in the future, but here&#8217;s the gist of what I did:</p>
<ol>
<li>Create your graph and relations using JGraphT.  I created Vertex and Edge objects.  The Vertex has two Generic properties (Key, Value) and maintains a score value as a double.  The Edge has a weight as a double and what I call an EdgeLoad, namely a Generic property that can hold some object that is associated with an edge.  While this is overkill for my initial implementation of keyword extraction, I think having the ability to store info on the vertex and edge will come in handy later as I explore different approaches.</li>
<li>Create a GraphRanker interface that has a rank method that is overloaded to take in either an Undirected or Directed Graph</li>
<li>Implement the GraphRanker interface to rank the Graph.  This involved:
<ol>
<li>While the error rate is greater than the threshold and we haven&#8217;t reached the maximum number of iterations:</li>
<li>for each vertex in the Graph</li>
<li>calculate the rank of the vertex
<ol>
<li>The rank is calculated according to the formula in the paper.  Just remember that it doesn&#8217;t need to be recursive; you can use any dummy starting value when calculating the first S(V<sub>j</sub>).</li>
</ol>
</li>
<li>Set the (estimated) error rate to the new rank minus the old rank</li>
<li>Repeat from step 1 until convergence is reached or you max out your iterations</li>
<li>Sort the vertices by score and return</li>
</ol>
</li>
</ol>
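<p>The loop in step 3 can be sketched roughly as follows.  This is not the JGraphT-based code described above: it is a minimal illustration that assumes the graph is given as a plain weight matrix, uses the weighted form of the paper&#8217;s formula, and estimates the error as the largest absolute change in any score.  The class name <code>RankSketch</code>, the starting value of 0.25, and all parameter names are made up for this example, and the final sort by score is left out.</p>

```java
import java.util.Arrays;

/** Hypothetical sketch of the iterative ranking loop; not the original JGraphT-based code. */
final class RankSketch {

    /**
     * weights[j][i] is the weight of the edge j -> i (0 means no edge).
     * An undirected graph is a symmetric matrix; an unweighted graph uses 1 for every edge.
     */
    static double[] rank(double[][] weights, double damping, double threshold, int maxIterations) {
        int n = weights.length;

        // Total outgoing weight per vertex: the denominator in the ranking formula.
        double[] outWeight = new double[n];
        for (int j = 0; j < n; j++) {
            outWeight[j] = Arrays.stream(weights[j]).sum();
        }

        // Any dummy starting value works; 0.25 is an arbitrary choice.
        double[] scores = new double[n];
        Arrays.fill(scores, 0.25);

        double error = Double.MAX_VALUE;
        for (int iter = 0; iter < maxIterations && error > threshold; iter++) {
            double[] next = new double[n];
            error = 0.0;
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    // Only vertices with an edge into i contribute to its score.
                    if (weights[j][i] > 0.0 && outWeight[j] > 0.0) {
                        sum += weights[j][i] / outWeight[j] * scores[j];
                    }
                }
                next[i] = (1.0 - damping) + damping * sum;
                // Assumed error estimate: the largest absolute change in any score.
                error = Math.max(error, Math.abs(next[i] - scores[i]));
            }
            scores = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        // Two keywords joined by a single undirected edge.
        double[][] twoKeywords = {{0, 1}, {1, 0}};
        System.out.println(Arrays.toString(rank(twoKeywords, 0.85, 1e-6, 200)));
    }
}
```

<p>Each pass builds a fresh score array from the previous one, so the order in which vertices are visited does not affect the result.  Updating scores in place would also work and may converge in fewer passes.</p>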
<p>Some tips to know you got it:  The average rank of an unweighted graph should approach 1.  Start very simple and do the keyword extraction version first.  Put only two keywords in your graph.  Since they co-occur within 2 positions, they contribute evenly to each other and should have equal rank upon convergence.</p>
<p>Give it a try and let me know if you have any questions.  This really is quite easy to implement.  If your graph gets really large, you may want to look into using <a href="http://lucene.apache.org/hadoop">Hadoop</a> to distribute the computation.</p>
<p>Also, go out and see what you can apply it to; don&#8217;t limit yourself to text.  Let me know what works and doesn&#8217;t work for you.</p>
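<p>Both checks can be worked out by hand, assuming the unweighted formula from the paper with damping factor d (with d less than 1) and a graph in which every vertex has at least one outgoing edge.  The symbols A and B for the two keywords and n for the number of vertices are introduced here only for illustration.</p>

```latex
\begin{align*}
% Two keywords A and B joined by one edge, so |Out(A)| = |Out(B)| = 1:
S(A) &= (1 - d) + d\,S(B), \qquad S(B) = (1 - d) + d\,S(A) \\
S(A) - S(B) &= -d\,\bigl(S(A) - S(B)\bigr) \;\Rightarrow\; S(A) = S(B) = S \\
S &= (1 - d) + d\,S \;\Rightarrow\; S = 1 \\[6pt]
% General case: sum the formula over all n vertices.
\sum_{i} S(V_i) &= n(1 - d) + d \sum_{j} S(V_j) \sum_{V_i \in Out(V_j)} \frac{1}{|Out(V_j)|}
                 = n(1 - d) + d \sum_{j} S(V_j) \\
\frac{1}{n} \sum_{i} S(V_i) &= 1
\end{align*}
```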
<p>Technorati Tags: <a href="http://technorati.com/tag/TextRank" rel="tag">TextRank</a>, <a href="http://technorati.com/tag/Graph+Theory" rel="tag">Graph Theory</a>, <a href="http://technorati.com/tag/JGraphT" rel="tag">JGraphT</a>, <a href="http://technorati.com/tag/Hadoop" rel="tag">Hadoop</a>, <a href="http://technorati.com/tag/Implementation" rel="tag">Implementation</a>, <a href="http://technorati.com/tag/Java" rel="tag">Java</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/Google" rel="tag">Google</a>, <a href="http://technorati.com/tag/Mihalcea" rel="tag">Mihalcea</a>, <a href="http://technorati.com/tag/Brin+and+Page" rel="tag">Brin and Page</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/03/addendum-to-textrank/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discussion of Remaining Sections of TextRank</title>
		<link>http://www.paperoftheweek.com/2007/02/02/discussion-of-remaining-sections-of-textrank/</link>
		<comments>http://www.paperoftheweek.com/2007/02/02/discussion-of-remaining-sections-of-textrank/#comments</comments>
		<pubDate>Fri, 02 Feb 2007 13:31:00 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/02/discussion-of-remaining-sections-of-textrank/</guid>
		<description><![CDATA[Section 4 of TextRank: Bringing Order Into Texts (http://www.cs.unt.edu/~rada/papers/mihalcea.emnlp04.pdf) by Mihalcea and Tarau continues the paper with another application of the TextRank algorithm.  Taking the keyword extraction process to the next level, Mihalcea applies the algorithm to Sentence Extraction.  The goal of Sentence Extraction is to identify the sentences of a document that best represent [...]]]></description>
			<content:encoded><![CDATA[<p>Section 4 of <em>TextRank: Bringing Order Into Texts</em> (<a href="http://www.cs.unt.edu/~rada/papers/mihalcea.emnlp04.pdf">http://www.cs.unt.edu/~rada/papers/mihalcea.emnlp04.pdf</a>) by Mihalcea and Tarau continues the paper with another application of the TextRank algorithm.  Taking the keyword extraction process to the next level, Mihalcea and Tarau apply the algorithm to Sentence Extraction.  The goal of Sentence Extraction is to identify the sentences that best represent a document, i.e., those sentences that summarize it.  For sentence extraction, the graph nodes are sentences, and the edges are weighted by a similarity measure that reflects how much content two sentences have in common.  Specifically, the similarity measure is based on how many tokens the two sentences share.  As with keyword extraction, filters can be applied so that only certain tokens are used when calculating overlap.  Once the graph structure is in place, the ranking algorithm is run to convergence and the top-ranking nodes are returned.  TextRank does quite well in this task, and section 4.2 provides a discussion of the results.  While it isn&#8217;t the best-performing system in the comparison, it is in the top 5.  Add in that it is a fully unsupervised process and you have a pretty compelling argument for adopting the approach.</p>
<p>Section 5 discusses why TextRank works from a conceptual point of view.  The idea is that nodes end up recommending other nodes based on the strength of the connections between them, much like how people build up a structural understanding of concepts.</p>
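<p>As a rough illustration of the edge weights, here is a small sketch of the overlap measure.  It assumes the form used in the paper, where the number of shared words is divided by the sum of the logarithms of the two sentence lengths so that long sentences are not favored.  The class name <code>SentenceSimilaritySketch</code>, the lowercase whitespace tokenizer, and the choice to return 0 when the normalizer is not positive are assumptions made for this example.</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a token-overlap sentence similarity; names are illustrative only. */
final class SentenceSimilaritySketch {

    /** Shared distinct tokens divided by log(|first|) + log(|second|). */
    static double similarity(String first, String second) {
        String[] a = tokenize(first);
        String[] b = tokenize(second);
        if (a.length == 0 || b.length == 0) {
            return 0.0;
        }
        // Two one-word sentences give log(1) + log(1) = 0, so guard the division.
        double norm = Math.log(a.length) + Math.log(b.length);
        if (norm <= 0.0) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(Arrays.asList(a));
        common.retainAll(new HashSet<>(Arrays.asList(b)));
        return common.size() / norm;
    }

    /** Assumed tokenizer: lowercase and split on whitespace, with no filtering. */
    static String[] tokenize(String sentence) {
        String trimmed = sentence.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

<p>A token filter, for example keeping only nouns and adjectives, would slot into <code>tokenize</code> before the overlap is counted.</p>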
<p>The biggest strength of TextRank is that it is almost completely portable and requires no knowledge of the domain other than how you want to model your nodes and relations.  No training data is needed.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/TextRank" rel="tag">TextRank</a>, <a href="http://technorati.com/tag/keyword+extraction" rel="tag">keyword extraction</a>, <a href="http://technorati.com/tag/sentence+extraction" rel="tag">sentence extraction</a>, <a href="http://technorati.com/tag/document+summarization" rel="tag">document summarization</a>, <a href="http://technorati.com/tag/text+summarization" rel="tag">text summarization</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/Mihalcea" rel="tag">Mihalcea</a>, <a href="http://technorati.com/tag/Tarau" rel="tag">Tarau</a>, <a href="http://technorati.com/tag/Paper+of+the+Week" rel="tag">Paper of the Week</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/02/discussion-of-remaining-sections-of-textrank/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

