Addendum to TextRank
I implemented this last night using JGraphT just to see what was involved. I would guess it took me around 4 hours and is pretty straight forward. I did have a little trouble getting down how to best estimate the error for calculating convergence, but did eventually arrive at a solution after simplifying my tests to use some simpler input. One of the cool things about this approach is it doesn’t have to apply just to text. The JGraphT library supports Java 1.5 and generics, so you can put pretty much anything you want on a node and define a relation between nodes and then run the ranking algorithm.
I’m not going to post the code just yet, but I may in the future, but here’s the Gist of what I did:
- Create your graph and relations using JGraphT. I created a Vertex and Edge objects. The Vertex has two Generic properties (Key, Value) and maintains a score value as a double. The Edge has a weight as a double and what I call an EdgeLoad, namely a Generic property that can hold some object that is associated with an edge. While this is overkill for my initial implementation of keyword extraction, I think having the ability to store info on the vertex and edge will come in handy later as I explore different approaches.
- Create a GraphRanker interface that has a rank method that is overloaded to take in either an Undirected or Directed Graph
- Implement the GraphRanker interface to rank the Graph. This involved:
- while error rate is > than the threshold and we haven’t reached the maximum number of iterations
- for each vertex in the Graph
- calculate the rank of the vertex
- The rank is calculated according to the formula in the paper. Just remember that it doesn’t need to be recursive, you can just use any dummy starting value when calculating the first S(Vj)
- Set the error rate (estimated) to be the new rank - the old rank
- Repeat from step 1 until convergence is reached or you max out your iterations
- Sort the vertices by score and return
Some tips to know you got it: The average rank of an unweighted graph should be approach 1. Start very simple and do the keyword extraction version first. Put only two keywords in your graph. Since they co-occur within 2 positions, they should have equal rank upon convergence since they both contribute evenly to each other.
Give it a try and let me know if you have any questions. This really is quite easy to implement. If you graph gets really large, you may want to look into using Hadoop to distribute.
Also, go out and see what you can apply it to, don’t just limit yourself to text. Let me know what works and doesn’t work for you.
TextRank, Graph Theory, JGraphT, Hadoop, Implementation, Java, PageRank, Google, Mihalcea, Brin and Page
Popularity: 6% [?]
Technorati Tags: TextRank, Graph Theory, JGraphT, Hadoop, Implementation, Java, PageRank, Google, Mihalcea, Brin and Page

