Discussion of LexRank by Erkan and Radev
POTW 2/18/07: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
The LexRank paper by Erkan and Radev is another PageRank/Graph Theory based approach to working with text, this time applied to the task of summarization.
Key parts of sections 1 and 2 discuss the general problem of corpus-based summarization. Unlike TextRank by Mihalcea, Erkan and Radev are interested in summarizing groups of documents (although it can be applied to individual documents as well.) They propose a “centroid-based” model whereby they try to determine the most important sentences in a group of documents. In this approach, the sentences that have more words near to the center of the cluster of documents are deemed to be more important, thus giving higher weight. The authors in section 3 then layout their methods for determining the similarity between two sentences using a TF-IDF formula (think vector space search). From these similarity scores, a matrix can be constructed tallying all of the similarities between all the sentences.
After doing some fancy linear algebra, they come to define LexRank, which is really just Google’s PageRank applied to this particular problem. Sections 3.2 do provides all the gory details on the math and how the problem can be described in terms of stochastic matrices and markov chains, which all seems to boil down to the lovely formula that Brin and Page gave us. This section also lays out in more detail the iterative algorithm for calculating PageRank/TextRank/LexRank/YouFillInTheBlankRank.
Taking this matrix, we put it into a graph and weight the edges according to our similarity and then run the iterative algorithm over it.
I know this sounds like skimping on my part, but the rest of the paper is just about the experiments they ran and I don’t feel like rehashing those. Guess what? It did pretty well. One thing that is interesting is the availability of the MEAD summarization system, available at http://www.summarization.com/.
Like most of the graph based approaches we have seen so far, it does well with noisy data, which is one of the big selling points for me.
Next week, on to something new: question answering! See you then!
LexRank, Radev, Erkan, PageRank, Brin, Page, Google, linear algebra, graph theory
Popularity: 8% [?]
Technorati Tags: LexRank, Radev, Erkan, PageRank, Brin, Page, Google, linear algebra, graph theory

