POTW 5/21/07: Discussion of “A Study on Retrospective and On-Line Event Detection” by Yang, Pierce and Carbonell
Yang’s paper on on-line event detection (”A Study on Retrospective and On-Line Event Detection“) discusses the use of common text retrieval techniques to automatically detect events in news streams.
Imagine that you are responsible for monitoring all the major news feeds in every single country your company does business in order to advise the CEO on trends and events that effect your company. Obviously, you can’t read all of them on a daily basis and having a large staff to help would be costly. This is exactly the kind of scenario that online event detection is meant to solve. You need a program that can identify and organize news feeds as they come in, allowing you to see the key events (or even minor events) as they are reported, not days later.
This paper discusses the approach of a group of researchers at Carnegie Mellon University in the field of Topic Detection and Tracking. After providing some background information on the topic, the authors dive into the details of their approach. The task at hand was to analyze a corpus of documents in a temporal fashion to identify events and track them. In other words, even though they had the whole corpus at the time, they had to pretend like they were receiving the news in chronological order just like you and I do on a daily basis. They could not look “into the future”, if you will, in doing their analysis.
Yang’s group attempts to solve this problem by using a clustering approach. In fact, they are modifying Cutting’s Scatter/Gather approach that we discussed here and also a single-pass, incremental approach. They approach the problem in two ways. First, they use the scatter/gather approach on a “retrospective” collection containing articles that occurred in the “simulated” past. This is done to build up statistics about past events, figuring that new events will contain similar structures and statistics (TF/IDF, etc.) albeit with variations due to new names and events. Section 3.1 discusses the representation of the clusters and 3.2 discusses their modifications of Cutting’s Scatter/Gather approach into what they call Group Average Clustering (GAC).
To solve the real-time, online problem, an incremental, single-pass approach was used. To do this kind of thing, one needs to somehow estimate corpus statistics like IDF (inverse document frequency) in order to come up with reasonable estimations in order to come up with the proper weights for the term vectors used in the clustering algorithm. The CMU group solves this problem by originally estimating the IDF using the stats from the retrospective process and then updating it as new information becomes available in the real-time approach.
Much of the rest of the paper is about picking parameters and evaluation. Section 4.3 has some interesting discussion of “Behavior Analysis” that is worth looking into. The gist of it being that the GAC approach seemed to be good at identifying large news bursts, while the incremental approach is better at tracking at long-lasting events. In our scenario, you will most likely be interested in both kinds of events. The key, of course, is having the ability to zoom in/out on the various news feeds and to setup alerts, etc. that help you manage the clusters.
Popularity: 15% [?]

