Saturday, February 16, 2008

TREC blog retrieval

Jonathan Elsas presented in the Social Media Reading Group yesterday. He walked us through the very successful approach the CMU team took for the Blog Retrieval task at TREC 2007. Details are in this paper (pdf).

He brought up the point that TREC has the cool property of being "task oriented", which is not always the case with data mining research (and is a criticism of the 'what do evolving graphs look like?' approach I tend to take with my own research).

Another point he made is that no teams at TREC (successfully) used two common properties of data that *are* important in the non-task-oriented research in social media: timestamps and link analysis. He did not seem to think that these simply aren't "useful" properties, only that nobody has figured out how to use them properly.

While I think that link analysis could be used, it certainly could not be used without some significant text analysis. My impression is that link analysis is useful for tasks like measuring influence or information diffusion, or trust and authority. Relevance seems to me to be a much more text-dependent property.

Furthermore, relevance is a subjective measure, just like influence and authority. In fact, much of the difficulty in data mining research lies in finding a good way to evaluate your results. TREC scored the entries with human tagging. If the goal was to find relevant blog posts, then each team's algorithm would find some candidate posts, and the competitors themselves would then vote on which seemed best. And that's probably the best we could do for something so imprecise as "relevance".
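
To make that evaluation loop a bit more concrete, here is a rough sketch of how I picture it: pool the top candidates from each team, have people vote on each pooled post, and score every team's ranking against the majority opinion. This is only my illustration and not the actual TREC procedure. The team names, post ids, votes, pool depth, and the majority-vote rule are all made up for the example, and average precision is just one common choice of score.

```python
def pool_candidates(runs, depth):
    """Union of the top-`depth` posts from every team's ranked list."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def majority_relevant(votes):
    """Posts that more than half of the voters marked relevant (1 = relevant)."""
    return {post for post, ballots in votes.items()
            if sum(ballots) * 2 > len(ballots)}


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k posts that were judged relevant."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for post in top if post in relevant) / len(top)


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant post appears."""
    hits = 0
    total = 0.0
    for rank, post in enumerate(ranking, start=1):
        if post in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked results from two teams for a single query.
runs = {
    "team_a": ["p1", "p2", "p3", "p4"],
    "team_b": ["p5", "p3", "p1", "p6"],
}

# Only the pooled posts get judged; anything outside the pool counts as not relevant.
pool = pool_candidates(runs, depth=3)

# Hypothetical votes from three human judges on each pooled post.
votes = {
    "p1": [1, 1, 0],
    "p2": [0, 0, 1],
    "p3": [1, 1, 1],
    "p5": [0, 1, 0],
}
relevant = majority_relevant(votes)

scores = {team: average_precision(ranking, relevant)
          for team, ranking in runs.items()}
for team, score in sorted(scores.items()):
    print(team, round(score, 3))
```

Even in this toy version the subjectivity is easy to see: flip a single vote on p2 or p5 and the set of "relevant" posts changes, and the team scores change with it.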

It's really hard to do "science" when such complex beings as humans are involved in the measurements.

Jon also kindly lent me the WSDM proceedings, which I copied to my laptop and intend to review soon.

8 comments:

Jon said...

Mary — just stumbled across blog & had to read this post :)

A couple of further thoughts since I spoke to our reading group: Document retrieval is not a task... it's a tool (I think Matt Hurst actually said this at ICWSM). It services many information seeking tasks, but by itself it's just a tool. It can be evaluated in the context of information seeking, TREC style or otherwise. It also has the convenient property that an evaluation corpus is relatively easy to create, as compared to creating an evaluation corpus for web site authority or influence or other similarly hard-to-define concepts.

As far as using the link structure in retrieval evaluation, I think you're absolutely right. These "social" features are much more likely to correlate with information flow than relevance. But, measuring information flow, authority, influence, etc. is yet another tool. And, I would say this type of tool is almost as important as measuring relevance for many online information seeking tasks, especially the one we looked at — feed retrieval.
