Sunday, August 1, 2010

Why can't we be journal-driven like real scientists?

Disclaimer: I have been in the depths of thesis-draft writing, so I believe that around 60% of my angst is an extremely acute case of senioritis. However, the other 40% is founded on actual frustrations with academia, which I am working on articulating bit by bit.

-------

I just got back from my last conference as a grad student. There were several papers I thought were really neat, some great talks, and lots of fun people to hang out with, which definitely made it worthwhile. But conferences have their share of frustrations (which at one point led me to tweet, "Academic conferences should provide more of a venue for punching people in the face"). I think some of these frustrations would be less significant if CS were journal-driven, like almost every other field.

This has been well articulated by Lance Fortnow in "Time for Computer Science to Grow Up" (pdf). He reasons that the quality of publications would increase if we moved completed research to journals, and that using conferences as endpoints simply detracts from the main purpose of conferences: to bring people together. I'm not sure about the former, since others have pointed out to me that some flaws he mentions (biased reviewers, overspecialization, sloppy pre-deadline rushes, publish-or-perish, etc.) exist regardless of publication model. But, I think there are some things to be said for modifying the model just so conferences will be more fun.

And I'm way more concerned with how much fun our field is than its publication quality.

Most conference talks suck. It takes a huge amount of time to make a quality presentation, and there's little incentive to do so. Unless your talk is just plain offensive, the worst outcome for a thrown-together presentation is that people will forget it. For posterity, there's a good chance the line on your CV for the publication is all that matters. Yet if you get a paper into a conference you're typically required to give a talk, so we end up with a lot of mediocrity. (Panos Ipeirotis proposed having peer review for talks, which sounds great in theory, but good luck finding enough people for that review committee.)

Poster-centric conferences like NIPS avoid a lot of that, but conference attendance is expensive. And it's a shame that people should have to pick which venue to submit to based on where they can get travel visas.

Journal-driven fields treat conferences like our (less-well-attended) workshops, which seems more appropriate. CS conference talks leave little room for discussion. Difficult questions are typically perceived as an attack and tabled with "That's a good question! Let's discuss this offline." After all, the paper has already been published in essentially final form-- what's the point in arguing with the authors except to make yourself look smart? Furthermore, if one does have a significant issue with a paper that one would like to address in a public forum, there's no "letters to the editor" section as there is in several journals. There's only Open Mic Night, and most of the audience is either checking email or leaving to go see a talk in a different track.

I'm sure there's a joke about how the super-introverted CS crowd wouldn't go to conferences unless they had to. But overall, I think the conference as the end point of publication creates a high-pressure situation that detracts from the open forum it should be, and makes attendees less sociable.

In lieu of changing the model, I would advocate each conference having a Punch-People-In-The-Face Plenary Melee.

Saturday, September 26, 2009

Machine Learning Protest at G20

I have returned to the blogosphere to report on our successful voicing of machine learning concerns at the G20 People's March yesterday.

[Photos: protest signs]

(Top: "Support Vector Machines," "Repeal Power Laws," "End Duality Gap," "MapReduce, MapReuse, MapRecycle: Green Data Processing." Bottom: "Bayesians Against Discrimination," "Free Variables," "Ban Genetic Algorithms")


Several CMU SCS grad students and postdocs gathered at CMU and walked to Oakland, where the march was to begin. As we carried our signs, the LOL-to-puzzlement ratio decreased (but the LOLs by no means disappeared) as the distance from CMU increased. Then we marched with the crowd towards dahntahn.

John Oliver of the Daily Show showed up.

At some point we realized we were walking in front of the United Steelworkers contingent, who were all wearing the same hard hats as me (my friend having snagged mine from the USW booth at Netroots Nation last month). Here I am with a USW member, comparing causes.

All in all, it was a success. For the full set of pictures, go HERE.

For publications related to machine learning activism, see "Data Mining Disasters: A Report" [pdf], from SIGBOVIK 2008, and "MapReuse and MapRecycle: Two More Frameworks for Eco-Friendly Data Processing" [pdf], from SIGBOVIK 2009.

Sunday, September 7, 2008

The popular get richer

Pardon the hiatus; I have been busy at KDD and moving back to Pittsburgh (which included a 4-day scenic drive across the country).

In the spirit of the season, I've been looking at a large dataset of campaign donations, from 1980 to 2006. The data is free to the public from the FEC; I've parsed it and made it available on my website.

One can form a bipartite graph of committees (such as the Underwater Basketweavers' Political Action Committee) and candidates (such as Abraham Lincoln). Individual donations are all filtered through committees (usually a candidate has one or several designated committees), so the organization-candidate graph is the best way to measure donations to specific candidates.
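
For the curious, here is roughly how one might assemble such a weighted bipartite graph in Python with networkx. This is only a sketch: the file name and "committee,candidate,amount" layout are invented for illustration, not the real FEC format.

    import networkx as nx

    # Sketch: weighted bipartite org-candidate graph from parsed donations.
    # The input file and its column layout are hypothetical.
    G = nx.Graph()
    with open("donations.csv") as f:
        for line in f:
            committee, candidate, amount = line.strip().split(",")
            if G.has_edge(committee, candidate):
                # Repeat donations accumulate on the same weighted edge.
                G[committee][candidate]["weight"] += float(amount)
            else:
                G.add_edge(committee, candidate, weight=float(amount))

    print(G.number_of_edges(), "unique org-candidate pairs")
    print(G.size(weight="weight"), "total $ donated")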

A surprising observation in our KDD paper was the "fortification effect". First, if one takes the number of unique edges added to the graph (that is, the number of distinct org-candidate interactions) and compares it with the total weight of the graph (that is, the total $ donated), one finds super-linear behavior: the more unique donor-candidate relationships, the higher the average check becomes. The power law exponent in the org-cand graph was 1.5. (This also holds for $ vs. non-unique edges, i.e. the number of checks, with exponent 1.15.)
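
A quick way to estimate such an exponent is a least-squares line fit in log-log space. A minimal sketch with made-up numbers (not the actual FEC data):

    import numpy as np

    # Fit W ~ E^alpha by least squares on log-log scales. E = unique
    # donor-candidate edges, W = total $ donated; toy values only.
    E = np.array([100.0, 500.0, 2000.0, 10000.0])
    W = np.array([2e5, 2.5e6, 2e7, 2.2e8])
    alpha, intercept = np.polyfit(np.log10(E), np.log10(W), 1)
    print("estimated exponent:", round(alpha, 2))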

Even more interestingly, similar behavior emerges if one looks closer at the individual candidates. The more donors a candidate has, the higher the average amount received per donor becomes. The plot below shows money received by candidates vs. number of donor organizations.

[Plot: total $ received per candidate vs. number of donating organizations]

Each green point represents one candidate, with the y-axis being the total money that candidate has received and the x-axis being the number of donating organizations. The lower points represent the medians over intervals of edge counts, with upper-quartile error bars drawn. The red line is the power law fit-- again we have super-linear behavior between the number of donors and the amount donated (with exponent 1.17). And once more the same is true for non-unique donations-- the more checks, the higher the average check.

Again, this does not include the 2008 data. I hear that Obama's donation patterns are different (lots of little checks, they tell me), but haven't confirmed this yet.

Wednesday, August 27, 2008

KDD 2008 <3's social networks

KDD this year put a lot of focus on social networks research. In addition to the long-running WebKDD, there was SNA-KDD (pronounced "Snack DD". Or at least it should be.) The Matrix and Tensor Methods workshop also included a paper featuring graph mining, and of course matrix methods in general are important for SNA. There were also two relevant tutorials I plan to look over, now that I have the proceedings: Neville and Provost's "Predictive Modeling with Social Networks," and Liu and Agarwal's "Blogosphere: Research Issues, Applications, and Tools."

And so far I've only listed the SNA-related things going on the first day. The Social Networks research track session was completely full: all seats taken, people crammed into the aisles, and overflowing into the hall. It was unfortunate that several people just couldn't get in, but it was a lot of fun for the speakers to have such a great audience. There was also a panel on social networks, featuring Backstrom, Faloutsos, Jensen, and Leskovec. The conference took a cue from the overflowing session and switched the panel to a bigger room. (Unfortunately, it was scheduled at the same time as some regular sessions, which didn't receive the audience they might have had they been paired only with other sessions.) There was an industry track session also devoted to social networks, and other relevant papers appeared in the Text Mining and Graph Mining sessions.

It's pretty exciting that social networks research is getting so much more attention, even in the last year. It will be interesting to see how long it lasts, and what sort of Big Questions get answered.

Time to board. It looks like the plane back to Seattle is mostly full of a different kind of geek: PAX attendees.

Tuesday, August 19, 2008

Butterflies in Vegas

I'd like to take this opportunity to self-promote our talk, "Weighted Graphs and Disconnected Components", at KDD 2008 next week in Las Vegas. This is work from CMU, with Leman Akoglu and Christos Faloutsos. In it we look at some commonly overlooked features of social networks: "disconnected components", or the components that are not connected to the largest connected component; and weighted edges, edges formed through repeated or valued interactions. We also propose the "Butterfly model" to generate graphs that match some of these new properties, along with previously-established graph properties.
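
To make the terminology concrete, here's a small illustrative sketch (not the paper's code) of splitting a graph into its giant connected component and the rest, using networkx and a stand-in random graph:

    import networkx as nx

    # Toy example: a connected core plus a couple of small islands.
    G = nx.barabasi_albert_graph(1000, 2)
    G.add_edges_from([(2000, 2001), (2002, 2003)])

    components = sorted(nx.connected_components(G), key=len, reverse=True)
    gcc, rest = components[0], components[1:]
    print(len(gcc), "nodes in the GCC;", len(rest), "disconnected components")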

For those planning to attend KDD, ours will be the third talk in the Social Networks Research session, on Monday morning (with a poster Monday evening). Even if I haven't convinced you to attend our talk, you will want to see the other talks in the session (one of which is reviewed by Akshay Java here). They'll include the usual suspects from Yahoo! Research and Cornell, plus a paper analyzing Sprint's call network. It should be a fascinating couple of hours!

Monday, August 18, 2008

Usenet vs blog linking in the 2004 election

As a historical study, I've taken a subset of the political Usenet links and compared them to blog links in the same time period. I took links from October-November 2004, made "A-lists" for both liberal and conservative blogs, and compared them with the A-lists found by Adamic and Glance in 2004.
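
The ranking itself is just inlink counting. A minimal sketch, with a hypothetical list of (source, target-domain) link pairs:

    from collections import Counter

    # Rank target domains by number of incoming links; the data is made up.
    links = [("post1", "dailykos.com"), ("post2", "dailykos.com"),
             ("post3", "freerepublic.com")]
    a_list = Counter(target for _, target in links).most_common(10)
    print(a_list)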

Conservative A-List

[Ranked list of top conservative blogs]

Liberal A-List

[Ranked list of top liberal blogs]

I did little in the way of pruning the list, while Adamic and Glance were careful to include only traditionally-formatted blogs, but I have removed the ones they explicitly mentioned omitting. As in their study, I've left off drudgereport (originally #2 in conservative) and democraticunderground (originally #1 in liberal), as they were not "weblogs" in the traditional sense-- drudgereport is an aggregator and democraticunderground is a message board. freerepublic.com is also message-board-like, so that might have warranted removal from the list as well, but it was not mentioned in the paper. Bluelemur also claims to be the liberal version of drudgereport (ETA: and realclearpolitics is another news aggregator), so those might also have been eliminated from the original study. Gadflyer.com, #2 on the liberal list, is no longer in existence, which is interesting in itself considering its popularity just a few years ago. But it is also notably missing from the blog A-list, whether by the numbers or by classification.

It is perhaps curious that the blog/message-board hybrids democraticunderground and freerepublic topped their respective lists on Usenet, which is itself closer to message-board format. I wonder where they actually ranked on the pre-pruned blog list.

My overall impression from looking at this data is that Usenet is "edgier"; we more commonly see conspiracy theories and wingers getting a lot of attention. Given that, and the apparent popularity of message-board-like blogs, I wonder if we could consider the Usenet to be even more "democratic" (or "ruled by the mob", to be more cynical) than blogs.

Thursday, August 14, 2008

Preferential Installation of Facebook Apps [SIGCOMM WOSN]

I'm reading over the proceedings of SIGCOMM's Workshop On Social Networks, which is in Seattle next Monday.

Minas Gjoka, Michael Sirivianos, Athina Markopoulou, and Xiaowei Yang, a team of authors at UCI, wrote a paper, "Poking Facebook: Characterization of OSN Applications," which looks at data on Facebook application installation and use.

First, they seem to have gotten a pretty successful crawl, which is saying something, since Facebook is pretty selfish with its data. Here is a PDF (probability density function) of application installations, according to both Facebook's stats and their crawled dataset, which match up pretty well:

[Plot: application installation PDF, Facebook stats vs. crawl]

They also modeled the histogram of installed-apps-per-user as preferential, running a simulation with users as "bins" and different apps as "different colored balls", iteratively assigning balls to bins. For instance, saying that 100 users have installed the Zombies application would translate to "gray balls appear in 100 bins".

For each iteration, one goes through each "ball" (application installation), starting with the "most popular color" (application with the most installations). For each ball one then assigns an additional "bin that doesn't already contain that color" (picks a new user that hasn't already installed the app) according to a probability:

P(i) = (balls(i) + init)^rho / Σ_{j in B} (balls(j) + init)^rho

where balls(i) is the number of applications user i has installed, B is the set of users that haven't already installed the application, init is a parameter to moderate the initial activity, and rho is the preferential exponent, chosen in simulations to be 1.6. In the end you get heavy-tailed behavior, with most users installing a couple of apps and a few who go nuts with application installs. It fits pretty well:

[Plot: installed-apps-per-user distribution, crawl vs. simulation]
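
Here's a tiny toy reimplementation of that ball-and-bin process, as I understand it from the paper's description (a sketch only-- the function and parameter names are mine, and the authors' actual simulation surely differs):

    import random

    # Toy ball-and-bin simulation (my own reading of the paper, not the
    # authors' code). app_counts holds installs per app; init and rho are
    # the parameters described above.
    def simulate(num_users, app_counts, init=1.0, rho=1.6):
        installed = [set() for _ in range(num_users)]  # each user is a "bin"
        # Go through apps from most to least popular ("most popular color" first).
        for app, count in enumerate(sorted(app_counts, reverse=True)):
            for _ in range(count):
                # B: users who haven't already installed this app.
                B = [u for u in range(num_users) if app not in installed[u]]
                # Preferential weights: (balls(i) + init)^rho.
                weights = [(len(installed[u]) + init) ** rho for u in B]
                chosen = random.choices(B, weights=weights)[0]
                installed[chosen].add(app)
        return [len(apps) for apps in installed]  # apps-per-user counts

    print(simulate(num_users=50, app_counts=[30, 20, 10, 5]))
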
One of the fun parts of this sort of data is the outliers-- the users who go nuts on something. (Netflix users rating thousands of movies, etc.) It looks like in the crawled data there are a few users with 500 apps installed!

In the paper there is also a fit to the "coverage of applications"-- that is, how many of the ranked apps we need to go through before every user has at least one of those apps installed. It appears the simulation reaches full coverage a little too quickly, so perhaps the most popular applications are absorbing too many users in the simulation.
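
Coverage itself is cheap to compute from the data. A minimal sketch, assuming a hypothetical {app: set of user ids} mapping:

    # Coverage at rank k: fraction of users who have installed at least
    # one of the k most popular apps. The data layout here is hypothetical.
    def coverage_curve(app_users, num_users):
        ranked = sorted(app_users.values(), key=len, reverse=True)
        covered, curve = set(), []
        for users in ranked:
            covered |= users
            curve.append(len(covered) / num_users)
        return curve

    print(coverage_curve({"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}, num_users=6))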

What's somewhat surprising to me is that this isn't at all based on the behavior of a user's friends, but on the entire Facebook network at large. I suspect that in reality friends' behavior does govern user behavior, but for large-scale patterns one can overlook it. This might be different when actually modeling how an application catches on. (Using other features, like network effects, is listed by the authors as "future work" for refining the model.)