The Dataphiles

Thursday, August 7, 2008

The male-female demographic in social media

Mike on Ads has a cool script that infers whether a user is male or female based on browser history. If you have it analyze your history, it gives you a list of the sites you visited and the corresponding male:female ratio. He got the list of sites to poll from Quantcast, but I can't tell if the demographics came from there as well. The numbers seem to be different when I plug them in, so I'm guessing he either used more/older data than what's currently up, or got it elsewhere.

Here are the ratios for various social media I plugged in, in order from "most male" to "most female":

Site M:F ratio
Digg 1.56
Flickr 1.15
Feedburner 1.11
Worldofwarcraft 1.08
Blogger 1.06
Youtube 1
Last.fm 0.96
Linkedin 0.94
Pandora 0.9
Facebook 0.83
Myspace 0.74
Livejournal 0.68

Twitter and Wikipedia didn't seem to be in the feature set. However, straight from Quantcast it seems Twitter's ratio is 0.97 and Wikipedia's is 1.07. Quantcast also lists a ton of other demographic info, which is interesting to look at.
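
For readers who find a ratio harder to read than a percentage, here is a minimal sketch of the conversion. It assumes the listed number is simply male visitors divided by female visitors (the script's exact definition isn't stated here), so a ratio r corresponds to a male share of r / (1 + r). The function name is made up for illustration, and only a few sites from the list are included.

```python
# M:F ratios copied from the list above (assumed: male visitors per female visitor).
ratios = {
    "Digg": 1.56,
    "Youtube": 1.0,
    "Facebook": 0.83,
    "Livejournal": 0.68,
}


def male_share(ratio):
    """Convert an M:F ratio r into the fraction of visitors who are male: r / (1 + r)."""
    return ratio / (1.0 + ratio)


for site, r in ratios.items():
    share = male_share(r)
    print(f"{site}: {share:.1%} male, {1 - share:.1%} female")
```

Under that assumption, Digg's 1.56 works out to roughly 61% male, and Livejournal's 0.68 to roughly 40% male.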

Tuesday, August 5, 2008

How our brains deal with large numbers

Via Andrew Gelman, a recent Science article claims that humans innately use a logarithmic scale.

When asked to point toward the correct location for a spoken number word onto a line segment labeled with 0 at left and 100 at right, even kindergarteners understand the task and behave nonrandomly, systematically placing smaller numbers at left and larger numbers at right. They do not distribute the numbers evenly, however, and instead devote more space to small numbers, imposing a compressed logarithmic mapping. For instance, they might place number 10 near the middle of the 0-to-100 segment.

(Full text here, SciAm report here)

When I was a little kid my dad helped me "count to a million" using log scale (1,2,3...10,20,...100,200,...). Even then it seemed intuitive. I knew that there were increasingly more numbers in between counts as it got higher, and I felt I was "cheating" by skipping them, but I did not understand how long it truly would have taken if we'd counted all the numbers in between (I probably would have guessed it'd have taken hours, rather than days).

It's not that people cannot grasp large numbers-- they just have trouble converting back to a linear scale. :-)
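
Here is a small sketch of the arithmetic behind both the counting game and the "10 near the middle" observation. The assumptions are mine: the log-scaled line runs from 1 to 100 (log of 0 is undefined, so the 0 end can't sit on a true log scale), and linear counting goes at one number per second. The function names are made up for illustration.

```python
import math


def log_count_sequence(limit=1_000_000):
    """Numbers named when counting 1, 2, ..., 10, 20, ..., 100, 200, ... up to limit."""
    seq = []
    step = 1
    while step < limit:
        # 1..9, then 10..90, then 100..900, and so on.
        seq.extend(step * k for k in range(1, 10))
        step *= 10
    seq.append(limit)
    return seq


def log_position(n, low=1, high=100):
    """Fractional position of n on a log-scaled segment running from low to high."""
    return (math.log(n) - math.log(low)) / (math.log(high) - math.log(low))


seq = log_count_sequence()
print(len(seq))              # how many numbers the log-scale count actually names
print(log_position(10))      # where 10 lands between 1 and 100 on a log scale
print(1_000_000 / 86_400)    # days needed to say every number at one per second
```

The log-scale count names only 55 numbers, 10 lands halfway along the 1-to-100 log segment, and saying every number at one per second would take about eleven and a half days without a break.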

Monday, August 4, 2008

Memeorandum : Scandal : : Techmeme : ?

A few days ago, I posted that the top "most discussed" links on Memeorandum were related to scandal, violence, or both. Akshay suggested I do the same for Techmeme. His bid on the #1 discussed story was the Microsoft-Yahoo merger. For updates since September 2005, that turned out to be #2, and my bid, the Hans Reiser case, was nowhere close to the top.

So, what do Techmeme's sources *really* care about? Smartphones.

"Dear early adopters: Sorry we made iPhone available to the proles. Here's some iTunes. Love, Steve Jobs [smartphones] 152

Microsoft's bid for Yahoo! [acquisitions] 138

"Dear iTunes customers: No DRM-free music for you. Love, Steve Jobs [IP] 122

Macworld 2008 Keynote [smartphones, gadgets] 121

Google announces Android [smartphones] 113

Digg tells DMCA to bug off [IP] 112

Google on Microsoft's "hostile" bid for Yahoo! [acquisitions] 111

Google acquires Youtube [acquisitions] 107

Steve Jobs announces cheaper iPhone [smartphones] 104

Macworld 2007 (and announcement of the iPhone) [smartphones, gadgets] 101

Gadgets, with a smattering of IP and corporate bureaucracy. Spots 11-20 seem to be more of the same.

One thing worth noting is that each "most discussed" entry is a single link, not a whole story. So, if several news sites cover the same story and "split the vote", each drawing only a share of the discussion links, the story may not surface in this list. With Techmeme the picture is a little clearer, since most bloggers seem to link to the official corporate press releases. With memeorandum, I'm trusting that preferential attachment (NYTimes and Washington Post dominate) means the vote isn't split often enough to dramatically misrepresent reality.
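
To make the "split the vote" caveat concrete, here is a toy sketch with entirely made-up records (the links, story labels, and counts are hypothetical, not from the crawl). It compares ranking by individual parent link, which is what the lists above do, against ranking after grouping links that cover the same story.

```python
from collections import Counter

# Hypothetical (parent link, story label, discussion count) records.
records = [
    ("nytimes.com/a", "story A", 90),
    ("washingtonpost.com/a", "story A", 80),
    ("press-release.example/b", "story B", 120),
]


def top_by_link(records):
    """Rank individual parent links by their own discussion count."""
    return sorted(((link, n) for link, _, n in records), key=lambda item: -item[1])


def top_by_story(records):
    """Rank stories by total discussion count across all links covering them."""
    totals = Counter()
    for _, story, n in records:
        totals[story] += n
    return totals.most_common()


print(top_by_link(records)[0])
print(top_by_story(records)[0])
```

In this toy data, story B wins when ranked by single link, but story A wins once its two links are pooled. Grouping links into stories is the hard part in practice, and this sketch just assumes the labels are given.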

Feed experimentation

I'm attempting to redirect my default Blogger RSS feed to http://feeds.feedburner.com/dataphiles. Google Reader seems to take a while to figure these things out, but if you don't see another post from me in the next week, you might check to make sure you're getting the right feed. Of course, given my history there's also some chance I just dropped off the face of the blogosphere, but we'll try not to let that happen.

Thursday, July 31, 2008

Telling us what we already know

Via Thursday Bram, communications agency Universal McCann recently conducted the third wave of their global study on social media usage. The results indicated, of course, growing usage of all kinds of social media worldwide. The report also notes that "blogs are a mainstream media worldwide and as a collective rival any traditional media" (emphasis mine). Sooner or later, it seems we'll have to be more specific when we say "mainstream media". :-)

You can see a complete slide show of results here. (Warning: It's very colorful, and people with a sensitivity to circles should not consume.) It should be useful to cite whenever a social media research paper needs a convincing intro.

What has been happening to the political Usenet?

When I first heard about the Netscan project (see Marc Smith's homepage), my thought was, "People still post on Usenet? Last I heard about it, one of the more active groups was alt.fan.spice-girls." Working on a related project, I've gotten a similar reaction from other people I've mentioned it to. So an overarching theme of my project has been to answer whether Usenet is a distinct community, or simply a sample of what we already know about online communication.

One advantage to studying Usenet is that since it's been around for so long, it's easy to get historical data and say something about its evolution. Furthermore, it's easier to call what we know of it a "community" (although we're still forced to sample it, for our purposes), whereas we never really know if we've crawled *all* the blogs.

So far, we have obtained data since 2003 for 200 newsgroups with "polit" somewhere in the newsgroup name. Here's some over-time behavior: a plot of the number of posts per day and the number of hyperlinks (in original, non-quoted content) per day:

Posts and Links for All Political Newsgroups

This is a smoothed version of the data, meant to illustrate the general trend. The first thing you'll notice is the bump in November 2004, which we can attribute to the US Presidential Election. The next thing you'll notice is that while the number of posts is declining, the number of links remains stable.
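
For the curious, here is a rough sketch of how per-day post and link counts like these could be computed and smoothed. It is not the actual pipeline. It assumes quoted text is marked by a leading ">" (the usual Usenet convention), uses a simple regex to spot URLs, and smooths with a 7-day trailing moving average. The smoothing method behind the plot isn't stated here, and all function names are made up for illustration.

```python
import re
from collections import defaultdict

# Simple URL pattern; an assumption, not the pattern used for the plots.
URL_RE = re.compile(r"https?://\S+")


def count_new_links(body):
    """Count URLs on lines that are not quoted (lines not starting with '>')."""
    return sum(
        len(URL_RE.findall(line))
        for line in body.splitlines()
        if not line.lstrip().startswith(">")
    )


def daily_counts(posts):
    """posts: iterable of (date_string, body). Returns {date: [post count, link count]}."""
    counts = defaultdict(lambda: [0, 0])
    for day, body in posts:
        counts[day][0] += 1
        counts[day][1] += count_new_links(body)
    return dict(counts)


def moving_average(values, window=7):
    """Trailing moving average; early points average over the days available so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Feeding the per-day post counts and link counts through `moving_average` separately would give two smoothed series to plot against each other.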

Here's the same data for a small subset, can.politics:

Posts and Links for can.politics

Predictably, the "USA Election bump" doesn't hold for all the groups. For uncultured folks like me who had to look it up, the last election in Canada was January 23, 2006. We do still notice an increasing tendency to link, per post. Perhaps people on Usenet are going more to outside sources. Or, as another intern put it, "They're getting lazy."

Tuesday, July 29, 2008

Scandal sells

Since starting at Live Labs I've gotten to play with a lot of data, including the political Usenet and crawled memeorandum hourly data (since mid-September 2005, following Katrina). Today I came across something less-than-surprising.

Here are the top 10 links on memeorandum, ranked by number of 'discussion' links-- that is, the number of discussions (usually blog posts) related to a parent story (usually news):

"For McCain, Self-Confidence on Ethics Poses its Own Risk" [McCain and scandals] 219

"Spitzer is Linked to Prostitution Ring" 178

"Embattled Attorney General Resigns" [Gonzales and scandals] 170

[Text of Obama's race speech] 158

"NSA has massive database of Americans' phone calls" 129

"The Long Run-Up" [McCain and scandal] 119

"Craig Arrested, Pleads Guilty Following Incident in Airport Restroom" 116

"US Web Primer Is Said to Reveal a Nuclear Primer" [Iraq and Nukes] 115

"Digging Out More CNN/Youtube Plants" [Youtube politics and staged debates] 115

"Dark Suspicions About the NIE" [Iran and Nukes] 107

So, for the most part, what sells is sex and violence.