POTW 2/11/07: Discussion of Sections 5-8 of Minkov et al.
POTW 2/11/07: Contextual Search and Name Disambiguation in Email Using Graphs
Discussion of Sections 5 through 8
The remaining sections of the paper cover two applications of the algorithms, followed by related work and the conclusions that can be drawn from the work. Section 5 describes the corpora used (Enron, plus an internal email set that is not publicly available) before proceeding to the task of name disambiguation.
Name disambiguation in email is the task of correlating the mention of a name in an email with the actual person. While this is fairly straightforward in most cases for people reading their own email, it becomes difficult when reading someone else's email, since one may not know all the people that person knows. Multiply this across a collection of emails filled with nicknames and initials and it is not hard to see why the task is difficult. It is useful for establishing social networks as well as other applications. One could easily imagine an automated system that retrieves relevant information about a person mentioned in an email (bio, address, phone, past conversations, etc.) and makes it available to the reader for instant access.
The remaining parts of Section 5 go into the details of applying the algorithm and the results achieved. To my mind, the interesting thing is that the graph connects different types of nodes and associates different probabilities with the transitions from one node to another. Applying the graph walking strategy to this graph then leads to the desired results. Suffice it to say, the new approach performs better than the baseline approach!
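To make the idea of walking a graph with typed nodes and weighted transitions a bit more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy graph, the edge type names, the weights in `EDGE_WEIGHTS`, and the `stay` probability are all assumptions of mine for illustration, and edges are treated as one-directional to keep it short.

```python
from collections import defaultdict

# Hypothetical toy graph: (source node, edge type, target node).
EDGES = [
    ("term:andy", "in-message", "msg:1"),
    ("term:andy", "in-message", "msg:2"),
    ("msg:1", "sent-from", "person:andrew"),
    ("msg:1", "sent-to", "person:beth"),
    ("msg:2", "sent-from", "person:andrew"),
]

# Assumed per-edge-type weights; in a real system these would be tuned.
EDGE_WEIGHTS = {"in-message": 1.0, "sent-from": 2.0, "sent-to": 1.0}


def build_transitions(edges, weights):
    """Turn typed edges into per-node transition probabilities."""
    raw = defaultdict(dict)
    for src, edge_type, dst in edges:
        raw[src][dst] = raw[src].get(dst, 0.0) + weights[edge_type]
    transitions = {}
    for src, targets in raw.items():
        total = sum(targets.values())
        transitions[src] = {dst: w / total for dst, w in targets.items()}
    return transitions


def graph_walk(transitions, start, steps=2, stay=0.5):
    """Spread probability mass from `start` for a fixed number of steps.

    At each step a node keeps `stay` of its mass and passes the rest
    along its outgoing edges. Nodes with no outgoing edges keep it all.
    """
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in dist.items():
            targets = transitions.get(node)
            if not targets:
                nxt[node] += mass
                continue
            nxt[node] += stay * mass
            for dst, prob in targets.items():
                nxt[dst] += (1.0 - stay) * mass * prob
        dist = dict(nxt)
    return dist


def rank_persons(dist):
    """Keep only person nodes, ordered by probability mass."""
    persons = [(n, p) for n, p in dist.items() if n.startswith("person:")]
    return sorted(persons, key=lambda item: -item[1])


transitions = build_transitions(EDGES, EDGE_WEIGHTS)
dist = graph_walk(transitions, "term:andy")
ranked = rank_persons(dist)
```

Starting from the name mention `term:andy`, the walk ranks `person:andrew` above `person:beth`, since two messages and a heavier edge type lead to him. The real system obviously works with a much richer graph, but the mechanics are in this spirit.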
Identifying threads of emails is the second application the authors use to demonstrate their approach. Threading is the problem of identifying one or more messages that are related to some chosen email. Many email systems do a basic job of this by comparing subject lines, especially those that use the “RE:” prefix. However, we all know people often treat the subject line differently. Furthermore, people tend to quote the previous messages in the thread differently. Some use “>”, others use “|”, and still others use nothing at all. Add in inline replies, which are especially common on mailing lists, and you see why the problem becomes difficult. Section 5.4 lays out the graph walk approach and compares it to a TF-IDF IR approach, which it of course outperforms, especially when a machine learning re-ranking approach is used.
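For reference, the kind of basic subject-line threading mentioned above might look roughly like the following. This is my own sketch of the naive heuristic, not anything from the paper; the function names and the sample messages are made up for illustration.

```python
import re
from collections import defaultdict

# Matches one or more leading reply/forward prefixes such as "RE:" or "Fwd:".
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes, surrounding whitespace, and case."""
    return _PREFIX.sub("", subject).strip().lower()


def group_by_subject(messages):
    """Group (message id, subject) pairs by normalized subject line."""
    threads = defaultdict(list)
    for msg_id, subject in messages:
        threads[normalize_subject(subject)].append(msg_id)
    return dict(threads)


# Hypothetical sample data.
messages = [
    (1, "Budget meeting"),
    (2, "RE: Budget meeting"),
    (3, "Fwd: RE: budget meeting "),
    (4, "Lunch?"),
    (5, "About the budget"),  # same conversation, but a rewritten subject
]
threads = group_by_subject(messages)
```

Messages 1 through 3 end up in one thread, but message 5 is left on its own because its author rewrote the subject line. That is exactly the kind of case where using more of the message context, as the graph walk does, should help.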
The rest of the paper covers related work and conclusions. I am glad the authors address scalability in the conclusions section, as I had my doubts about how well the approach would perform on large amounts of data. In fact, I find that a lot of papers in the NLP realm fail to account for performance, so it is refreshing to see it addressed.
graph theory, email, natural language processing, NLP, information retrieval, IR, threading, name disambiguation

