Intro

This week we are reading Contextual Search and Name Disambiguation in Email Using Graphs
by Einat Minkov, William W. Cohen and Andrew Y Ng, all of Carnegie Mellon University. Like the past few papers, this paper also focuses on how to use graph theory to solve some common NLP papers. Unlike the past few by Klienberg, Brin and Page and Mihalcea, this one adds a machine learning component and also is much more current, appearing at SIGIR ‘06, although I ran across an earlier version first at HLT ‘06 in the TextGraphs workshop.

Discussion

Section 1 provides the introduction to the paper and posits that in most current info retrieval applications, the documents being searched are connected to both structured and unstructured items, such as other documents and meta-data. This linkage suggests a graph based approach similar to PageRank would be useful in finding documents that are similar to a user’s query. The author’s distinguishing factor, to quote is:

“In contrast to this previous work [Brin and Page, others], we consider schemes for propogating similarity across a graph that naturally models a structured dataset like an email corpus…”

The authors then propose to demonstrate this on two email related tasks: name disambiguation and email threading.

The next section, section 2, provides us with the details of how to construct a graph out of email. This is much more involved than my first guess, as they not only account for the social networking aspects of the email (to, from, cc, etc.) when creating the graph but they also account for the text in the mail messages. Table 1 in the paper lays out the 5 different source types they can have for nodes in the graph along with what types of edges and targets can be linked to from the given source type. This is another differentiation from past approaches, namely they have several different type of nodes and edges. Additionally, they define the inverse relation for all edge labels, which results in a cyclic graph. It is not clear to me yet what the benefit of the mixed nodes are in one graph, but I’m sure that is to come.

To determine the similarity between two nodes in the graph, the authors propose to do a “lazy walk” of the graph. Whereas Brin and Page and others propose a neverending graph walk, Minkov incorporates a probability that the walk can be halted at any given step in addition to the probability of transitioning to another node.   Sections 3.1 and 3.2 have the math to support this.  The walk, the math shows, can be boiled down to  doing matrix multiplication on a set of sparce matrices.  Finally, queries  are described as  a distribution over the nodes (think vector just like in the tradition vector space, at least I think this is right) plus a statement about what type of output is desired.  The result then is a list of nodes of this type.  Section 3.3 lays out the case that this formulation is very similar to the traditional TF-IDF approach.  I guess, though, that the added benefit here is the ability to get other output types.

Section 4 describes how a learning approach can be used to rerank the nodes.  I’m not sure I understand the details just yet, so it probably is worth reading the reference given in the paper that goes into more depth.  Essentially, I think they are just using a classifier to improve upon the results they get from the initial search.

graph theory, machine learning, email, disambiguation, NLP, natural language processing, Paper of the Week, POTW, information retrieval, PageRank

Popularity: 7% [?]

Technorati Tags: , , , , , , , , ,