Section 3 of TextRank: Bringing Order Into Texts covers the first application of the TextRank approach proposed in section 2.  The authors chose keyword extraction to demonstrate what the approach can do.  Keyword extraction is the problem of determining the keywords that best describe a document.  It can arguably be thought of as a precursor to document summarization.  Those familiar with academic papers might imagine a system that automatically assigns keywords to a paper.  In the blogosphere, imagine how nice it would be to not have to pick your own keywords and just let the system pick them for you (hmmm, that gives me an idea… anyone want to write a WordPress plugin?)

With the problem defined, Mihalcea describes the preprocessing steps for building the graph.  Namely, the text is tokenized and each word is tagged with its part of speech.  The words are then added as nodes to the graph (note that a filter can optionally restrict the graph to certain parts of speech, for example nouns and adjectives only), and an edge is added between two nodes if the two words co-occur within a window of some number of words, usually somewhere between 2 and 10.  Once the graph is set up, the ranking algorithm from section 2 is iterated until the scores converge to within a small error threshold, and the words are then sorted by score and the top-ranked ones returned.
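To make those steps concrete, here is a rough Python sketch of the pipeline as I read it.  It is not the authors' code, and several things are my own assumptions: the function names (build_graph, textrank, keywords) are made up for this post, a tiny stopword list stands in for the real part-of-speech filter, the damping factor of 0.85 and the threshold of 1e-4 are conventional PageRank-style defaults, and the graph is treated as undirected and unweighted.

```python
from collections import defaultdict

# Assumption: a small stopword list stands in for the part-of-speech filter;
# a real system would run a POS tagger and keep, say, nouns and adjectives.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "are", "with"}


def build_graph(tokens, window=2):
    """Link two kept words if they occur at most window - 1 tokens apart."""
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        if w in STOPWORDS:
            continue
        for j in range(i + 1, min(i + window, len(tokens))):
            v = tokens[j]
            if v not in STOPWORDS and v != w:
                graph[w].add(v)  # undirected edge, stored in both directions
                graph[v].add(w)
    return dict(graph)


def textrank(graph, d=0.85, tol=1e-4, max_iter=100):
    """Iterate PageRank-style scores until the largest change drops below tol."""
    scores = {w: 1.0 for w in graph}
    for _ in range(max_iter):
        new = {
            w: (1 - d) + d * sum(scores[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
        delta = max(abs(new[w] - scores[w]) for w in graph)
        scores = new
        if delta < tol:
            break
    return scores


def keywords(text, window=2, top_n=5):
    """Tokenize naively, build the co-occurrence graph, return the top-ranked words."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    scores = textrank(build_graph(tokens, window))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    sample = "graph ranking uses a graph of words; graph nodes link graph edges"
    print(keywords(sample, window=2, top_n=3))
```

Calling keywords(text, window=2, top_n=5) returns the five highest-scoring words.  Widening the window adds more edges, which is the knob the 2-to-10 range above refers to.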

Performance-wise, the algorithm does quite well, beating the best reported supervised result of the time on precision and F-measure.  It should be noted that this approach is completely unsupervised, so it needs no labeled training data, which is a real benefit.  In my mind, if you can get better results without supervision, you have a winner.  The rest of section 3 is a discussion of the TextRank results versus other comparable systems.  See the paper for details.

In the next post we will discuss the approach applied to sentence extraction for automatic summarization, and we will end the week with the remainder of the paper on Friday.

TextRank, PageRank, Google, Mihalcea, Keyword Extraction, Natural Language Processing, NLP, machine learning, algorithms, computer science
