The Dataphiles: November 2007

Cosma Shalizi just sent me a paper, "Power Law Distributions in Empirical Data," that uses real statistics to fit power laws and the alternative models that are often mistaken for them. Yesterday's discussion in the seminar we're in was on a Barabasi paper and Stouffer et al.'s response to it, which was apparently a family feud between two students who shared the same advisor (Barabasi and Amaral, IIRC). I've fit power laws in some of my work, and according to the Clauset, Shalizi, and Newman paper, I've been fitting them wrongly. I intend to fix that.

What's strange is how widely accepted it is in the (CS side of the) data mining community that degree distributions follow power laws. Not one of my reviewers has said anything about it, and at ICWSM, everybody who analyzed the blog dataset said the degree distributions were power laws (in-degree, maybe, but even visually I'm not convinced about out-degree). Maybe this has to do with treating preferential attachment as the generative model, which I am also not a huge fan of (people with lots of friends may be more likely to gain more friends, but not because new arrivals to the network immediately link to them).

The first thing I would like to check is our observation (in SDM 07) that the popularity drop-off of blog posts follows a power law with exponent 1.5. We truncated the fit because we only had reliable in-link information for about 30 days. First I intend to look at a longer time period and see if we can avoid truncating it. Then I'll run some of the code Clauset posted to accompany the paper. If something fits better than the power law, I'd like to know what it is.
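For reference, here is a rough sketch of the kind of estimate I mean: the standard maximum-likelihood estimator for a continuous power-law exponent, alpha = 1 + n / sum(ln(x_i / xmin)), rather than a least-squares line on a log-log plot. This is not Clauset's posted code. The function names are made up for illustration, the data is synthetic (drawn with exponent 1.5 just to echo the number above), and xmin is assumed to be known rather than estimated.

```python
import math
import random


def fit_power_law_mle(samples, xmin):
    """Estimate alpha for p(x) ~ x^(-alpha), x >= xmin (continuous case).

    Returns (alpha, approximate standard error, number of tail points).
    xmin is taken as given here; choosing it from the data is a separate step.
    """
    tail = [x for x in samples if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if n < 2 or log_sum <= 0.0:
        raise ValueError("need at least two tail points above xmin")
    alpha = 1.0 + n / log_sum
    # Large-n approximation to the standard error of the estimate.
    sigma = (alpha - 1.0) / math.sqrt(n)
    return alpha, sigma, n


def sample_power_law(alpha, xmin, size, rng):
    """Draw synthetic power-law samples by inverse-transform sampling."""
    # rng.random() is in [0, 1), so (1 - u) is in (0, 1] and never zero.
    return [
        xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
        for _ in range(size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    data = sample_power_law(alpha=1.5, xmin=1.0, size=20000, rng=rng)
    alpha_hat, sigma, n = fit_power_law_mle(data, xmin=1.0)
    print(f"alpha = {alpha_hat:.3f} +/- {sigma:.3f} (n = {n})")
```

Recovering the exponent from synthetic data only shows the estimator works when the data really is a power law. It says nothing about whether a power law is the right model for the blog data, which is the question that needs a goodness-of-fit test and a comparison against alternatives.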

Wednesday, November 28, 2007

Breaking power laws