POTW 5/14/07: Discussion of “Discovering Trends in Text Databases” by Lent et al.
This week’s paper, “Discovering Trends in Text Databases” by Lent et al., is my first look at some text mining tools and applications. The paper discusses a method for identifying trends in text databases. In this case, a trend is defined as “a specific subsequence of the history of a phrase that satisfies the users’ query over the histories”. Essentially, the authors identify phrases in timestamped text and then match those phrases' histories against users’ queries about things like spikes in the usage of particular phrases.
After covering some related work on Latent Semantic Indexing (I suppose I should look into that some day), the authors delve into the methodology of identifying phrases and their histories. There are 3 steps to the process: 1) identify frequent phrases, 2) generate histories for the phrases and 3) identify the phrases that match a given trend.
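As a rough illustration of steps 1 and 2, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a phrase is just a fixed-length run of words (an n-gram) and that "frequent" means appearing in at least min_support documents overall. The names ngrams, phrase_histories and min_support are made up for this example.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_histories(docs, n=2, min_support=2):
    """Build per-period document counts for frequent phrases.

    docs is an iterable of (period, text) pairs, e.g. (2002, "some text").
    Returns (periods, histories), where histories maps each frequent phrase
    to a list of counts aligned with the sorted list of periods.
    This is an assumed simplification, not the method from the paper.
    """
    per_period = defaultdict(Counter)
    for period, text in docs:
        tokens = text.lower().split()
        # Count each phrase at most once per document.
        for phrase in set(ngrams(tokens, n)):
            per_period[period][phrase] += 1

    periods = sorted(per_period)
    totals = Counter()
    for counts in per_period.values():
        totals.update(counts)

    # Step 1: keep frequent phrases. Step 2: lay out their histories.
    histories = {
        phrase: [per_period[p][phrase] for p in periods]
        for phrase, total in totals.items()
        if total >= min_support
    }
    return periods, histories


docs = [
    (2001, "text mining is fun"),
    (2002, "text mining finds trends"),
    (2002, "more text mining today"),
]
periods, histories = phrase_histories(docs)
print(periods, histories)
```

Each history is simply a list of counts, one per time period, which is the kind of sequence a trend query can then be run against.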
Phrases in this paper go beyond a simple sequence of terms, introducing the notion of a “k-phrase”. A k-phrase is essentially a nesting of phrases, and it can span sentences, etc. when appropriate. For the histories, each word gets a transaction id and associated timestamps. Given this information, the authors use a shape query language to mine the phrases and histories. The shape query language lets the user specify an interest in items that are “spiking” or “trending downward”, etc. There is a reference for the shape language in the paper.
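To give a feel for what a shape query does, here is a small Python sketch. It is not the syntax or semantics of the shape language the paper references. It assumes a history is a list of counts per time period, turns each step between periods into one of three made-up symbols ("up", "down", "stable"), and looks for a pattern of those symbols. The function names and the tolerance parameter are hypothetical.

```python
def to_shape(history, tolerance=0):
    """Describe each step between consecutive periods as up, down or stable."""
    symbols = []
    for prev, cur in zip(history, history[1:]):
        if cur - prev > tolerance:
            symbols.append("up")
        elif prev - cur > tolerance:
            symbols.append("down")
        else:
            symbols.append("stable")
    return symbols


def find_shape(history, pattern, tolerance=0):
    """Return the start indices where pattern occurs in the history's shape.

    An index i means the match begins at the step from period i to i + 1.
    """
    shape = to_shape(history, tolerance)
    k = len(pattern)
    return [
        i for i in range(len(shape) - k + 1)
        if shape[i:i + k] == list(pattern)
    ]


# A "spike" here is assumed to mean one step up followed by one step down.
spike = ["up", "down"]
print(find_shape([1, 1, 5, 2, 2], spike))
```

A query for phrases that are “trending downward” could be written the same way, for example as a pattern of two or more consecutive "down" steps.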
Finally, the paper closes with a discussion of how IBM used the approach in a patent mining system to identify trends in patents from the US Patent Office.

