The first two sections of TextRank: Bringing Order into Text by Mihalcea and Tarau provide valuable background information and introduction to applying graph theory (see http://en.wikipedia.org/wiki/Graph_theory for background) to NLP based programs. As we saw in Brin and Page last week, we can use graph based approaches for determining node importance through citation analysis. Mihalcea asks why not apply this same notion to lexical and semantic terminology (see Liddy’s http://www.cnlp.org/publications/03NLP.LIS.Encyclopedia.pdf for a discussion of the levels of language) to try and solve problems like keyword extraction and word sense disambiguation (WSD.) Section 2 then lays out the graph ranking formula that Brin and Page and how they intend to use it.

Brin and Page Graph Formula

I am not sure why they chose the same damping factor for d that Brin and Page did. The algorithm for calculating TextRank iterates over the graph until convergence is achieved. According to footnote 1, convergence is achieved when the error rate of the calculation falls below a specified threshold. This error rate is not known ahead of time, so they estimate it by comparing the differences between two successive iterations of the TextRank calculation S(Vi). When the algorithm converges, each node has a score assigned to it. The authors expand on Brin and Page by showing how the formula can be applied to both undirected graphs and weighted graphs. Weighted graphs aren’t necessarily appropriate in the context of web page links, but they can be appropriate in texts because one might have multiple links into a given text. The authors amend the S(Vi) formula to account for the weight these links may have into a given text, yielding the formula:

Weighted Graph Formula

where wji is the weight of the edge between the Vi and Vj.

With the groundwork in place, section 2.3 lays out how to apply the TextRank algorithm. Namely, text units are added as nodes in the graph and relations between the nodes are edges. Edges can be directed or undirected and weighted or unweighted. Once the structure is built, we iterate the ranking algorithm until we have convergence. Final results are attained by sorting the vertices based on their final score.

That covers the intro and section 2 of the paper, next time we will look at the applications of this approach to a few NLP tasks.

TextRank, algorithms, Graph Theory, NLP, Natural Language Processing, sentence detection

Technorati Tags: , , , , ,