The Dataphiles: data mining

Saturday, August 9, 2008

What is Eliza crossed with a Magic 8 ball, times a billion?

I just got my weekend entertainment from http://bossy.appspot.com. It's an "ask" app inspired by Ask MSR, a paper written 7 years ago for TREC. (I'm not sure the authors ever intended or expected such a thing, but there it is.)

For 50 lines of API code, it's pretty impressive-- and when it's wrong it's usually entertaining. It does well on short and simple word-association queries, like "Who is Batman?" or "What is xkcd?", and seems to do reasonably well on easily-searchable names such as my own.

It even has some political opinions. Try "Who is the worst president in history?", "Who lost the 2004 election?", or "What is the United States?" and you'll get some cynical answers. It's also pacifist, at least for certain queries.

It tends to get snarky when asked other binary queries. I was chagrined when I asked it "Which is better, CMU or MIT?" It also defects when asked to decide between Microsoft and Google, or between OU and OSU. However, it does have a preference with respect to the statistics cults.

Alas, it does not seem to be immune to spam.

Saturday, February 16, 2008

TREC blog retrieval

Jonathan Elsas presented in the Social Media Reading Group yesterday. He walked us through the very successful approach the CMU team took for the Blog Retrieval task at TREC 2007. Details are in this paper (pdf).

He brought up the point that TREC has the cool property of being "task oriented", which is not always the case with data mining research (and is a criticism of the 'what do evolving graphs look like?' approach I tend to take with my own research).

Another point he made is that no teams at TREC (successfully) used two common properties of data that *are* important in the non-task-oriented research in social media: timestamps and link analysis. He did not seem to think that they simply aren't "useful" properties, only that nobody has figured out how to use them properly.

While I think that link analysis could be used, it certainly could not be used without some significant text analysis. My impression is that link analysis is useful for tasks like measuring influence or information diffusion, or trust and authority. Relevance seems to me to be a much more text-dependent property.

Furthermore, relevance is a subjective measure, just like influence and authority. In fact, the difficulty in a lot of data mining research is finding a good evaluation of your results. TREC scored the entries with human tagging. If the goal was to find relevant blog posts, then each team's algorithm would find some candidate posts, and the competitors themselves would then vote on which seemed best. And that's probably the best we could do for something so imprecise as "relevance".

It's really hard to do "science" when such complex beings as humans are involved in the measurements.

Jon also kindly lent me the WSDM proceedings, which I copied to my laptop and intend to review soon.

Thursday, December 13, 2007

Andrew Tomkins on search data privacy

Yesterday Andrew Tomkins gave a talk at CMU. He addressed some social media, but a large part of his talk was about how tough it is to "anonymize" search data. Using the AOL scandal as an example, he basically granularized that data and showed some surprising ways in which you can identify people.

One point he brought up was "person attacks". While "trace attacks" such as finding credit card information or SSNs in the queries are dangerous enough, an attacker might decide to exploit one person. For instance, if they know that you vacationed in Tahiti and recently bought a Honda minivan, they can look through the data and identify anyone who has queries on both Honda minivans and Tahiti vacation packages-- you could probably do well with this even without narrowing the search to people who have searched for, say, "* in Pittsburgh PA". Or, if an attacker knows the victim and is at his house for a party or something, the attacker might ask to use the victim's computer and put in a unique term-- then when the attacker obtains the "anonymized" data later he can find that person. With knowledge of what the victim has searched for, for instance "AIDS clinics" or some adult term, the attacker could use blackmail.
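To make it concrete why this works, here is a toy sketch of the intersection step. Everything in it is made up for illustration: the log format (pairs of an anonymous ID and a query string), the IDs, the queries, and the helper name users_matching_all are my assumptions, not anything from the talk or from the real AOL data.

```python
from collections import defaultdict


def users_matching_all(query_log, known_terms):
    """Return the anonymous IDs whose queries mention every known term.

    query_log   -- iterable of (anon_id, query) pairs (assumed format)
    known_terms -- phrases the attacker already associates with the victim
    """
    terms = [t.lower() for t in known_terms]
    hits_by_user = defaultdict(set)
    for anon_id, query in query_log:
        q = query.lower()
        for i, term in enumerate(terms):
            if term in q:  # plain substring match; a real log would need more care
                hits_by_user[anon_id].add(i)
    # Keep only the IDs that matched all of the known terms.
    return sorted(uid for uid, hits in hits_by_user.items()
                  if len(hits) == len(terms))


# Entirely fabricated toy log.
toy_log = [
    (101, "honda minivan prices"),
    (101, "tahiti vacation packages"),
    (102, "honda minivan recall"),
    (103, "tahiti weather"),
    (103, "pizza in pittsburgh pa"),
]

candidates = users_matching_all(toy_log, ["honda minivan", "tahiti"])
print(candidates)
```

Each term on its own matches two IDs here, but the pair matches only one. That is the whole point: a couple of mundane facts can be nearly unique in combination, even when the IDs themselves carry no information.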

It makes me want to play with the AOL search data-- I have some ideas for trend analysis. I'd downloaded it at some point but never got around to doing anything with it.

Wednesday, November 28, 2007

Breaking power laws

Cosma Shalizi just sent me a paper, Power Law Distributions in Empirical Data, that uses real statistics to fit power laws and other models that power laws are often mistaken for. Yesterday's discussion in the seminar we're in was on a Barabasi paper and Stouffer et al's response to it, which was apparently a family feud between two students who shared the same advisor (Barabasi and Amaral, IIRC). I've fit power laws in some of my work-- and according to the Clauset, Shalizi, and Newman work, I've been fitting them wrongly. I intend to fix that.

What's strange is that it's so widely accepted in the (CS side of the) data mining community that degree distributions follow power laws-- not one of my reviewers has said anything about it, and at ICWSM, everybody who analyzed the blog dataset said the degree distributions were power laws (in-degree, maybe, but even visually I'm not convinced about out-degree). Maybe this has to do with the idea of having preferential attachment be the generative model, which I am also not a huge fan of (people with lots of friends may be more likely to gain more friends, but not because new arrivals to the network immediately link to them).

The first thing I would like to check on is the observation we had (in SDM 07) that the popularity drop-off of blog posts follows a power law with exponent 1.5. We truncated it because we only had reliable in-link information for about 30 days. First I intend to look at a longer time period, and see if we can avoid truncating it. Then I'll run some of the code Clauset posted that relates to the paper. If something fits better than the power law, I'd like to know what it is.
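For reference, here is a minimal sketch of the maximum-likelihood estimate of the exponent for a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)), with standard error of roughly (alpha - 1) / sqrt(n). This is not Clauset's posted code. The function names are my own, x_min is assumed to be known rather than estimated, and the data is synthetic, drawn with exponent 1.5 purely so the estimate can be checked. It also skips the goodness-of-fit testing and the comparison against alternative distributions, and real day-binned blog data would call for the discrete version of the estimator.

```python
import math
import random


def fit_power_law_alpha(samples, x_min):
    """MLE of alpha for p(x) ~ x^(-alpha), x >= x_min (continuous case).

    Returns (alpha_hat, standard_error). x_min is taken as given here.
    """
    tail = [x for x in samples if x >= x_min]
    if not tail:
        raise ValueError("no samples at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("all samples equal x_min; alpha is undefined")
    alpha = 1.0 + len(tail) / log_sum
    std_err = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, std_err


def sample_power_law(n, alpha, x_min, rng):
    """Draw n samples by inverse-transform sampling (synthetic test data)."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


rng = random.Random(0)
data = sample_power_law(20000, alpha=1.5, x_min=1.0, rng=rng)
alpha_hat, se = fit_power_law_alpha(data, x_min=1.0)
print(f"alpha_hat = {alpha_hat:.3f} +/- {se:.3f}")
```

The contrast is with the usual habit of drawing a straight line through a log-log histogram and reading off the slope, which is the kind of fit I take the paper to be warning against.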

Monday, October 8, 2007

Dataphiles 2.0

How had I not heard about this until now?

Friday, July 20, 2007

Google data miners have come across a scientific breakthrough.

They have proved that GTalk users are very, very lonely, and are probably about 16 years old.
