<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper of the Week &#187; bibliometrics</title>
	<atom:link href="http://www.paperoftheweek.com/category/bibliometrics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.paperoftheweek.com</link>
	<description>Read. Learn. Discuss.</description>
	<lastBuildDate>Tue, 14 Aug 2007 01:35:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Discussion of LexRank by Erkan and Radev</title>
		<link>http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/</link>
		<comments>http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/#comments</comments>
		<pubDate>Fri, 23 Feb 2007 02:03:40 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Natural Language Processing (NLP)]]></category>
		<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[linear algebra]]></category>
		<category><![CDATA[text summarization]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/</guid>
		<description><![CDATA[POTW 2/18/07: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization The LexRank paper by Erkan and Radev is another PageRank/Graph Theory based approach to working with text, this time applied to the task of summarization. Key parts of sections 1 and 2 discuss the general problem of corpus-based summarization.  Unlike TextRank by Mihalcea, Erkan [...]]]></description>
			<content:encoded><![CDATA[<p>POTW 2/18/07: <a href="http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html">LexRank: Graph-based Lexical Centrality as Salience in Text Summarization</a></p>
<p>The LexRank paper by Erkan and Radev is another PageRank/Graph Theory based approach to working with text, this time applied to the task of summarization.</p>
<p>Key parts of sections 1 and 2 discuss the general problem of corpus-based summarization. Unlike Mihalcea with TextRank, Erkan and Radev are interested in summarizing groups of documents (although the approach can be applied to individual documents as well). They describe a &#8220;centroid-based&#8221; model for determining the most important sentences in a group of documents. In this approach, sentences that contain more words close to the centroid of the document cluster are deemed more important and are therefore given higher weight. In section 3, the authors lay out their method for determining the similarity between two sentences using a TF-IDF formula (think vector space search). From these similarity scores, a matrix can be constructed that records the similarity between every pair of sentences.</p>
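<p>To make the similarity matrix concrete, here is a minimal sketch of TF-IDF cosine similarity between sentences. It is not the authors' code: the tokenizer, the smoothed IDF formula <code>log(1 + N/df)</code>, the function names, and the sample sentences are all my own assumptions for illustration.</p>

```python
import math
from collections import Counter


def tokenize(sentence):
    # Naive tokenizer (an assumption): split on whitespace, lowercase,
    # and strip basic punctuation from each word.
    words = (w.strip(".,;:!?()\"'").lower() for w in sentence.split())
    return [w for w in words if w]


def idf_table(tokenized):
    # Smoothed IDF (an assumption): log(1 + N / df), which is always positive.
    n = len(tokenized)
    df = Counter(w for tokens in tokenized for w in set(tokens))
    return {w: math.log(1.0 + n / df[w]) for w in df}


def tfidf_vector(tokens, idf):
    # Sparse vector: word -> term frequency times IDF.
    tf = Counter(tokens)
    return {w: tf[w] * idf[w] for w in tf}


def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(sentences):
    # Entry [i][j] is the cosine similarity between sentences i and j.
    tokenized = [tokenize(s) for s in sentences]
    idf = idf_table(tokenized)
    vectors = [tfidf_vector(t, idf) for t in tokenized]
    return [[cosine(u, v) for v in vectors] for u in vectors]


sentences = [
    "Graph methods rank sentences by centrality.",
    "Centrality in a graph ranks the sentences.",
    "Question answering is next week.",
]
sim = similarity_matrix(sentences)
```

<p>In this toy input the first two sentences share several words and so score well above zero, while the third shares none with the first and scores zero against it.</p>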
<p>After doing some fancy linear algebra, they come to define LexRank, which is really just Google&#8217;s PageRank applied to this particular problem. Section 3.2 provides all the gory details on the math and on how the problem can be described in terms of stochastic matrices and Markov chains, which all seems to boil down to the lovely formula that Brin and Page gave us. This section also lays out in more detail the iterative algorithm for calculating PageRank/TextRank/LexRank/YouFillInTheBlankRank.</p>
<p>Taking the similarity matrix, we treat it as a graph in which the sentences are nodes and the edges are weighted by the similarity scores, and then run the iterative algorithm over it.</p>
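<p>The iterative step can be sketched roughly as follows. This is a simplified power iteration in the PageRank style, not the exact procedure from the paper: the function name, the jump probability of 0.15, the convergence tolerance, and the hand-written similarity matrix are all assumptions of mine.</p>

```python
def lexrank(sim, damping=0.15, tol=1e-8, max_iter=200):
    """Score sentences by a damped random walk over the similarity graph.

    sim is a square matrix of non-negative similarities. damping is the
    probability of jumping to a random sentence (an assumed default).
    """
    n = len(sim)
    # Row-normalize so that each row is a probability distribution,
    # which turns the similarity matrix into a stochastic matrix.
    trans = []
    for row in sim:
        total = sum(row)
        if total == 0.0:
            trans.append([1.0 / n] * n)  # isolated sentence: jump uniformly
        else:
            trans.append([x / total for x in row])

    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = [
            damping / n
            + (1.0 - damping) * sum(trans[j][i] * scores[j] for j in range(n))
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Hypothetical similarities for four sentences; sentence 0 is similar to
# most of the others, sentence 3 is nearly unrelated to everything.
sim = [
    [1.0, 0.6, 0.7, 0.0],
    [0.6, 1.0, 0.2, 0.1],
    [0.7, 0.2, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
]
scores = lexrank(sim)
best = max(range(len(scores)), key=scores.__getitem__)
```

<p>The scores form a probability distribution over sentences, and a summary would take the highest-scoring ones. Here the most connected sentence, index 0, comes out on top.</p>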
<p>I know this sounds like skimping on my part, but the rest of the paper is just about the experiments they ran, and I don&#8217;t feel like rehashing those. Guess what? It did pretty well. One interesting note is that the MEAD summarization system is available at <a href="http://www.summarization.com/">http://www.summarization.com/</a>.</p>
<p>Like most of the graph-based approaches we have seen so far, LexRank does well with noisy data, which is one of the big selling points for me.</p>
<p>Next week, on to something new: question answering!  See you then!</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/LexRank" rel="tag">LexRank</a>, <a href="http://technorati.com/tag/Radev" rel="tag">Radev</a>, <a href="http://technorati.com/tag/Erkan" rel="tag">Erkan</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/Brin" rel="tag">Brin</a>, <a href="http://technorati.com/tag/Page" rel="tag">Page</a>, <a href="http://technorati.com/tag/Google" rel="tag">Google</a>, <a href="http://technorati.com/tag/linear+algebra" rel="tag">linear algebra</a>, <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/22/discussion-of-lexrank-by-erkan-and-radev/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Discussion of Sections 1-4 of Kleinberg</title>
		<link>http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/</link>
		<comments>http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/#comments</comments>
		<pubDate>Thu, 08 Feb 2007 02:29:23 +0000</pubDate>
		<dc:creator>grant.ingersoll</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Graph Theory]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[linear algebra]]></category>

		<guid isPermaLink="false">http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/</guid>
		<description><![CDATA[For the week of Feb. 4, 2007, we are discussing Authoritative Sources in a Hyperlinked Environment, with today&#8217;s posting focused on sections 1 through 4.  This paper really has some meat on it and I found myself having to reread sections and even dig out my old Linear Algebra text. As is typical of most [...]]]></description>
			<content:encoded><![CDATA[<p>For the week of Feb. 4, 2007, we are discussing <a href="http://citeseer.ist.psu.edu/87928.html">Authoritative Sources in a Hyperlinked Environment</a>, with today&#8217;s posting focused on sections 1 through 4.  This paper really has some meat on it and I found myself having to reread sections and even dig out my old Linear Algebra text.</p>
<p>As is typical of most academic papers, section 1 introduces us to the problem being discussed, in this case how to determine the authoritative sites on the web. Specifically, in the prehistoric web days, search engines would bring you back lots of hits, but seldom did they bring you back the most important hits, at least not high in the rankings (we are relatively spoiled these days in this regard). As Brin and Page alluded to, a query for &#8220;microsoft&#8221; on these old engines might not even return microsoft.com as the first result, even though most users would tell you that it should be. Kleinberg asserts that the big thing missing from search engines in those days was the use of link analysis to determine the authoritative sites on the web so that they could be ranked higher. Unlike closed corpus search engines that suffer from the scarcity problem (not enough good content to search), the web suffers from an abundance problem, namely too much content. Given this, the goal, Kleinberg says, is to identify the authoritative pages that are relevant to the user&#8217;s query.</p>
<p>Much like Brin and Page and Mihalcea (see earlier posts), Kleinberg suggests a graph-based approach that uses the links in HTML to determine the authoritative pages. Additionally, he defines &#8220;hub&#8221; pages, i.e. those pages that have links to multiple relevant authoritative pages. Hubs never point to hubs and authoritative pages never point to authoritative pages, which yields a <a href="http://en.wikipedia.org/wiki/Bipartite_graph">dense directed bipartite subgraph</a> (don&#8217;t worry, I had to look it up as well). Hubs and authorities reinforce each other: &#8220;a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.&#8221; Intuitively this makes sense, as we all know pages that we trust and we know where to go to find them. I am not sure if this is correct, but I think that to some extent, in our post-Google world, the hubs are now the search engines themselves. That is, being ranked highly in the engine tends to reinforce a site as being the authority.</p>
<p>Unlike the previous papers we looked at, this one actually has a fair amount of math, especially in section 2, so dig out your Linear Algebra book if you really want an in-depth understanding. The algorithm for calculating the hubs and authorities is very similar to both the PageRank and TextRank algorithms we discussed in prior posts and can be calculated using a similar, simple, iterative algorithm (I implemented it in a few hours). The graph used by the algorithm starts from a root set of nodes identified by a search engine (Alta Vista in this case), which is then expanded to include pages that are pointed to by a node in the set, or that point to a node in the set (possibly bounded by some predetermined limit). Edges are the links themselves. Given the graph, it comes down to calculating the authority weights and the hub weights using the iterative algorithm. A page&#8217;s authority weight is the sum of the hub weights of the pages that point to it, and its hub weight is the sum of the authority weights of the pages it points to, with the weights normalized on each pass. See pages 9 through 12 for the exact details of proving why this works. You may want to refresh your knowledge of <a href="http://en.wikipedia.org/wiki/Eigenvector">eigenvectors</a>, etc. Seriously, though, don&#8217;t you love when &#8220;discovered solutions&#8221; to problems have such a pure mathematical relationship? That is, isn&#8217;t it great when the math matches up with your intuition?</p>
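<p>For the curious, the iterative hub/authority computation can be sketched like this. It is a minimal illustration and not the implementation I mentioned above: the function name, the fixed iteration count, and the toy link graph (with made-up page names) are assumptions.</p>

```python
import math


def hits(links, iterations=50):
    """Iteratively compute hub and authority weights.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]

        # Hub weight: sum of authority weights of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}

        # Normalize each weight vector so its squares sum to 1.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm

    return hub, auth


# Hypothetical link graph: three hub-like pages, three authority-like pages.
links = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA", "authB", "authC"],
    "hub3": ["authA"],
}
hub, auth = hits(links)
top_auth = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

<p>In this toy graph the page linked to by every hub ends up as the top authority, and the page linking to every authority ends up as the top hub, which matches the mutual reinforcement idea.</p>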
<p>Section 3 expands on the notion of hubs and authorities to discover communities of links and also &#8220;anti&#8221; communities. Using a variant of the main algorithm, one can determine clusters of related sites. For instance, one can easily separate the sites in favor of abortion from those opposed. Mathematically, this corresponds to working with the non-principal eigenvectors, whereas the primary algorithm works with the principal eigenvector.</p>
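<p>As a rough sketch of the eigenvector idea, the snippet below finds the principal eigenvector of the authority matrix (A transposed times A) by power iteration, then removes it by deflation and iterates again to get a non-principal one. Deflation is my choice of technique for illustration, not necessarily what the paper does, and the helper names and the eight-page toy graph with two disconnected communities are assumptions.</p>

```python
import math


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def power_iteration(m, iterations=200):
    # Returns the dominant eigenvalue and eigenvector of a symmetric matrix.
    v = normalize([1.0] * len(m))
    for _ in range(iterations):
        v = normalize(mat_vec(m, v))
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v


def authority_matrix(adj):
    # Entry [i][j] counts the pages that link to both page i and page j.
    n = len(adj)
    return [
        [sum(adj[k][i] * adj[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def deflate(m, eigenvalue, v):
    # Subtract the found eigen-direction so the next one becomes dominant.
    n = len(m)
    return [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]


# Hypothetical graph: pages 0, 1, 2 link to pages 3 and 4 (one community);
# pages 5 and 6 link to page 7 (a second, smaller community).
edges = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 3), (2, 4), (5, 7), (6, 7)]
n = 8
adj = [[0.0] * n for _ in range(n)]
for src, dst in edges:
    adj[src][dst] = 1.0

m = authority_matrix(adj)
lam1, v1 = power_iteration(m)
lam2, v2 = power_iteration(deflate(m, lam1, v1))

# Pages with a sizable weight in each eigenvector.
community1 = [i for i, x in enumerate(v1) if abs(x) > 0.1]
community2 = [i for i, x in enumerate(v2) if abs(x) > 0.1]
```

<p>Because the two communities here are completely disconnected, the principal eigenvector picks out the authorities of the larger one and the next eigenvector picks out the authority of the smaller one. On a connected graph the non-principal eigenvectors have both positive and negative entries, and it is the two ends that suggest opposing clusters.</p>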
<p>Section 4 continues Section 3 and shows how it is fairly easy to use the link structure to find similar pages given an initial query.  Unlike <a href="http://en.wikipedia.org/wiki/Relevance_feedback">relevance feedback</a>, Kleinberg&#8217;s approach relies solely on the link structure.</p>
<p>Tomorrow or Friday, I will look at the remaining sections of the paper.</p>
<p>Technorati Tags: <a href="http://technorati.com/tag/graph+theory" rel="tag">graph theory</a>, <a href="http://technorati.com/tag/Kleinberg" rel="tag">Kleinberg</a>, <a href="http://technorati.com/tag/PageRank" rel="tag">PageRank</a>, <a href="http://technorati.com/tag/TextRank" rel="tag">TextRank</a>, <a href="http://technorati.com/tag/Mihalcea" rel="tag">Mihalcea</a>, <a href="http://technorati.com/tag/Brin+and+Page" rel="tag">Brin and Page</a>, <a href="http://technorati.com/tag/eigenvectors" rel="tag">eigenvectors</a>, <a href="http://technorati.com/tag/link+analysis" rel="tag">link analysis</a>, <a href="http://technorati.com/tag/bibliometrics" rel="tag">bibliometrics</a>, <a href="http://technorati.com/tag/linear+algebra" rel="tag">linear algebra</a>, <a href="http://technorati.com/tag/information+retrieval" rel="tag">information retrieval</a>, <a href="http://technorati.com/tag/search+engines" rel="tag">search engines</a></p>]]></content:encoded>
			<wfw:commentRss>http://www.paperoftheweek.com/2007/02/07/discussion-of-sections-1-4-of-kleinberg/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
