The Dataphiles: 2008

Sunday, September 7, 2008

The popular get richer

Pardon the hiatus; I have been busy at KDD and moving back to Pittsburgh (which included a 4-day scenic drive across the country).

In the spirit of the season I've been looking at a large dataset of campaign donations, from 1980 to 2006. This data is free to the public from the FEC; I've parsed and made it available on my website.

One can form a bipartite graph of committees (such as the Underwater Basketweavers' Political Action Committee) and candidates (such as Abraham Lincoln). Individual donations are all filtered through committees (usually a candidate has one or several designated committees), so the organization-candidate graph is the best way to measure donations to specific candidates.

A surprising observation in our KDD paper was the "fortification effect". First, if one takes the number of unique edges added to the graph (that is, the number of interactions between orgs and candidates) and compares with the total weight of the graph (that is, the total $ donated), one finds super-linear behavior. That is, the more unique donor-candidate relationships, the higher the average check becomes. The power law exponent in the org-cand graph was 1.5. (This also holds for $ vs nonunique edges, or number of checks, with exponent 1.15).
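For readers who want to try this on their own data, here is a minimal sketch of how such an exponent can be estimated: an ordinary least-squares fit on log-log axes. It is not the exact procedure from the paper; the function name `fit_power_law` and the numbers below are made up purely for illustration.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = c * x**a by least squares on log(x), log(y); return (a, c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    a = sxy / sxx                        # slope on log-log axes = exponent
    c = math.exp(mean_y - a * mean_x)    # intercept on log-log axes = log(c)
    return a, c

# Synthetic data (NOT the FEC numbers): total dollars grow as edges**1.5.
edges = [10, 100, 1000, 10000]
dollars = [2.0 * e ** 1.5 for e in edges]
exponent, scale = fit_power_law(edges, dollars)
print(exponent, scale)
```

If the fitted exponent a is above 1, the average weight per edge, c * E**(a - 1), grows with the number of edges E, which is the super-linear behavior described above.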

Even more interestingly, if one looks closer into the individual candidates, similar behavior emerges. The more donors a candidate has, the higher the average amount received from a donor becomes. The plot below shows money received by candidates vs. number of donor organizations.

Each green point represents one candidate, with the y-axis being the total money that candidate has received, and the x-coordinate being the number of donating organizations. The lower points represent the median for edge-intervals, with upper quartile error bars drawn. The red line is the power law fit-- here we have super-linear behavior between number of donors and the amount donated (with exponent 1.17). And again, the same is true for non-unique donations-- the more checks, the higher the average check.

Again, this does not include the 2008 data. I hear that Obama's donation patterns are different (lots of little checks, they tell me), but haven't confirmed this yet.

Wednesday, August 27, 2008

KDD 2008 <3's social networks

KDD this year put a lot of focus on social networks research. In addition to the long-running WebKDD, there was SNA-KDD (pronounced "Snack DD". Or at least it should be.) The Matrix and Tensor Methods workshop also included a paper featuring graph mining, and of course matrix methods in general are important for SNA. There were also two relevant tutorials I plan to look over, now that I have the proceedings: Neville and Provost's "Predictive Modeling with Social Networks", and Liu and Agrawal's "Blogosphere: Research Issues, Applications, and Tools."

And so far I've only listed SNA-related things going on the first day. The Social Networks research track session was completely full: all seats taken, people crammed into the aisles, and overflowing into the hall. It was unfortunate that there were several people who just couldn't get in, but it was a lot of fun for the speakers to have such a great audience. There was also a panel on social networks, featuring Backstrom, Faloutsos, Jensen, and Leskovec. The conference took a cue from the overflowing session and switched the panel to a bigger room. (Unfortunately, it was scheduled at the same time as some other regular sessions that didn't receive the audience they might have were they only paired with other sessions.) There was an industry track session also devoted to social networks, and other relevant papers were in the Text Mining and Graph Mining sessions.

It's pretty exciting that social networks research is getting so much more attention, even in the last year. It will be interesting to see how long it lasts, and what sort of Big Questions get answered.

Time to board. It looks like the plane back to Seattle is mostly full of a different kind of geek: PAX attendees.

Tuesday, August 19, 2008

Butterflies in Vegas

I'd like to take this opportunity to self-promote our talk, "Weighted Graphs and Disconnected Components", at KDD 2008 next week in Las Vegas. This is work from CMU, with Leman Akoglu and Christos Faloutsos. In it we look at some commonly overlooked features of social networks: "disconnected components", or the components that are not connected to the largest connected component; and weighted edges, edges formed through repeated or valued interactions. We also propose the "Butterfly model" to generate graphs that match some of these new properties, along with previously-established graph properties.

For those planning to attend KDD, it will be the third in the Social Networks Research session, on Monday morning (with a poster Monday evening). Even if I haven't convinced you to attend our talk, you will want to see the other talks in the session (one of which is reviewed by Akshay Java here). They'll include the usual suspects from Yahoo! Research and Cornell, plus a paper analyzing Sprint's call network. It should be a fascinating couple of hours!

Monday, August 18, 2008

Usenet vs blog linking in the 2004 election

As a historical study, I've taken a subset of the political Usenet links and compared them to blog links in the same time period. I took links from October-November 2004 and made "A lists" for both liberal and conservative blogs, and compared with the A-lists found by Adamic and Glance in 2004.

Conservative A-List

Liberal A-List

I did little in the way of pruning the list, while Adamic and Glance were careful to only include traditionally-formatted blogs, but I have removed ones they explicitly mentioned omitting. As in their study, I've left off drudgereport (originally #2 in conservative) and democraticunderground (originally #1 in liberal), as they were not "weblogs" in the traditional sense-- drudgereport is an aggregator and democraticunderground is a message board. freerepublic.com is also message board-like, so that might have warranted removal from the list as well, but it was not mentioned in the paper. Bluelemur also claims to be the liberal version of drudgereport, (ETA: and realclearpolitics is another news aggregator) so those might also have been eliminated from the original study. Gadflyer.com, #2 in Liberals, is no longer in existence, which is interesting in itself, considering its popularity just a few years ago. But it is also notably missing from the Blog A-list, either by the numbers or by classification.

It is perhaps curious that blog-message boards democraticunderground and freerepublic have topped both lists in Usenet, which is closer to message board format. I wonder where they actually ranked on the pre-pruned blog list.

My overall impression from looking at this data is that Usenet is "edgier"; we more commonly see conspiracy theories and wingers getting a lot of attention. Given that, and the apparent popularity of message-board-like blogs, I wonder if we could consider the Usenet to be even more "democratic" (or "ruled by the mob", to be more cynical) than blogs.

Thursday, August 14, 2008

Preferential Installation of Facebook Apps [SIGCOMM WOSN]

I'm reading over the proceedings of SIGCOMM's Workshop On Social Networks, which is in Seattle next Monday.

Minas Gjoka, Michael Sirivianos, Athina Markopoulou, and Xiaowei Yang, a team of authors at UCI, wrote a paper, Poking Facebook: Characterization of OSN Applications, which looks at data from Facebook application installation and use.

First, they seem to have gotten a pretty successful crawl, which is saying something since Facebook is pretty selfish with data. Here is a PDF of application installation, both according to Facebook stats and their crawled dataset, which match up pretty well:

They also modeled the histogram of installed-apps-per-user as preferential, running a simulation with "users as bins" and different apps as "different colored balls", iteratively assigning balls to bins. For instance, saying that 100 users have installed the Zombies application would translate to "gray balls appear in 100 bins".

For each iteration, one goes through each "ball" (application installation), starting with the "most popular color" (application with the most installations). For each ball one then assigns an additional "bin that doesn't already contain that color" (picks a new user that hasn't already installed the app) according to a probability:

Where balls(i) is the number of applications a user i has installed, and B is the set of users that hasn't already installed the application. init is a parameter to moderate the initial activity, and rho is the preferential exponent, chosen in simulations to be 1.6. In the end you get a sort of heavy-tailed behavior, with most users installing a couple apps and a few who go nuts with application installs. It fits pretty well:
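To make the procedure concrete, here is a rough sketch of the balls-and-bins simulation as I read it. The big assumption: I take the probability of picking user i to be proportional to (balls(i) + init)^rho, normalized over the set B of users who don't yet have the app. That matches the description above but is my reconstruction, not a formula copied from the paper. The function name, the app names, and the counts are all made up, and I process one app at a time for simplicity.

```python
import random

def simulate_installs(app_installs, num_users, rho=1.6, init=1.0, seed=0):
    """Assign each app's installs to users, favoring users with many apps.

    ASSUMPTION: P(user i) is proportional to (balls(i) + init) ** rho over
    the users who have not installed the app yet (my reading of the post).
    """
    rng = random.Random(seed)
    balls = [0] * num_users                      # balls(i): apps user i has
    has_app = [set() for _ in range(num_users)]
    # Go through apps ("colors") from most to least popular.
    for app, count in sorted(app_installs.items(), key=lambda kv: -kv[1]):
        if count > num_users:
            raise ValueError("an app cannot have more installs than users")
        for _ in range(count):
            # B: users that do not already hold a ball of this color.
            candidates = [u for u in range(num_users) if app not in has_app[u]]
            weights = [(balls[u] + init) ** rho for u in candidates]
            user = rng.choices(candidates, weights=weights, k=1)[0]
            has_app[user].add(app)
            balls[user] += 1
    return balls

# Hypothetical install counts, for illustration only.
demo_apps = {"zombies": 5, "poke": 3, "quiz": 2}
demo_balls = simulate_installs(demo_apps, num_users=10)
print(sorted(demo_balls, reverse=True))
```

With a larger rho the already-busy users soak up more of the installs, which is what produces the heavy tail; with rho = 0 every eligible user is equally likely.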

One of the fun parts of this sort of data is the outliers-- the users who go nuts on something. (Netflix users rating thousands of movies, etc.) It looks like in the crawled data there are a few users with 500 apps installed!

In the paper there is also a fit to the "coverage of applications"-- that is, how many of the ranked apps we need to go through before we have all the users with one of those apps installed. It appears the simulation reaches coverage a little too quickly, so perhaps the most popular applications are taking too many users in the simulation.

What's somewhat surprising to me is that this isn't at all based on the behavior of a user's friends, but of the entire Facebook network at large. I suspect that in reality friends' behavior does govern user behavior, but for large-scale patterns one can overlook it. This might be different for actually modeling how an application catches on. (Using other features like network effects is listed as "future work" for the authors in refining the model.)

Tuesday, August 12, 2008

Shared authors in the political Usenet

My apologies if this image messes with your RSS feed readers. It doesn't show up well on the main page, so go here for full view.

Using Marc Smith's .Netmap plug-in for Excel, I visualized some data I had. These are shared authorships of political Usenet groups, based on Jaccard coefficient (similar to cosine similarity). A thin edge indicates a coefficient > 0.3, a thick edge indicates > 0.5.
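For reference, the Jaccard coefficient of two groups is the number of shared authors divided by the number of distinct authors across both. Here is a small sketch with the two edge thresholds from the plot; the helper names and the author sets are invented for illustration.

```python
def jaccard(authors_a, authors_b):
    """Jaccard coefficient: |A intersect B| / |A union B| (0.0 if both empty)."""
    a, b = set(authors_a), set(authors_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def edge_style(coef):
    """Map a coefficient to the edge drawn in the plot, or None for no edge."""
    if coef > 0.5:
        return "thick"
    if coef > 0.3:
        return "thin"
    return None

# Made-up author sets: 2 shared authors out of 4 distinct ones.
group_a = {"ann", "bob", "cy"}
group_b = {"bob", "cy", "dee"}
coef = jaccard(group_a, group_b)
print(coef, edge_style(coef))
```

Note that the measure is symmetric, so a small group that is entirely contained in a huge one still gets a low coefficient.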

The alt.politics.* and talk.politics groups were a tangled mess that's nearly a clique, but there is some interesting behavior with local groups. In the top left are the Canadian local groups. Quebec's qc.politique doesn't appear at all in this graph (nodes are only visible if there is an edge associated), probably due to the language barrier. Then, we have Saskatchewan and Manitoba connected with a thick edge, and British Columbia, Alberta, and Ontario connected with thick edges. Only the latter group is connected with can.politics, the general Canada group. Looking at the Canadian map, the split isn't regional, since ONT is east of SK and MAN. However, there is something that does correlate the groups: population density. The group of three has a higher population, and a significantly higher population density, than the group of two. What comes with that is a higher-traffic local group, and more authors to share with even larger groups-- giving a higher coefficient.

The same thing may be happening with the US local groups, too. I've circled the "connected" US groups-- that is, the ones that share lots of authors with the alt.politics.* group. What's interesting is that these high-traffic groups form a bridge between alt.politics.* and the other local US groups. Several of these statewide, lower-traffic groups share authors amongst themselves, but with only a couple exceptions, they don't venture outside the local politics sphere. And again, there are some that don't show up here (most notably Virginia and Maryland-- I would imagine their nearest neighbor would be dc.politics, but they didn't have enough volume to get a high share-rate).

Just some cool-looking effects of the Jaccard index. I think another interesting way to visualize this might be to use a digraph, with an arrow from A to B if "p% of A's authors post in B". I bet that would get Virginia and Maryland to show up.
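Here is a quick sketch of that directed version, just to show why it would treat small groups differently. Everything in it is hypothetical: the function name, the group names, the authors, and the threshold p = 0.5.

```python
def directed_edges(groups, p=0.5):
    """Return (A, B, share) whenever at least a fraction p of A's authors post in B."""
    edges = []
    for name_a, authors_a in groups.items():
        if not authors_a:
            continue                     # no authors, no outgoing arrows
        for name_b, authors_b in groups.items():
            if name_a == name_b:
                continue
            share = len(authors_a & authors_b) / len(authors_a)
            if share >= p:
                edges.append((name_a, name_b, share))
    return edges

# A tiny group whose authors all post in a much bigger group.
groups = {
    "small.group": {"ann", "bob"},
    "big.group": {"ann", "bob"} | {f"user{i}" for i in range(8)},
}
arrows = directed_edges(groups, p=0.5)
print(arrows)
```

In this toy case the Jaccard coefficient is only 2/10 = 0.2, below the thin-edge cutoff, while the directed share from the small group to the big one is 100%, so the small group gets an arrow even though it had no edge before.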

Disney and CMU collaborate on graphics, pixie dust analysis

The newest resident to the Collaborative Innovation Center (CIC) on the CMU campus is the Disney Research Pittsburgh Lab, joining Google, Intel, and Apple Pittsburgh labs. This is, of course, pretty sweet.

One thing CMU has to navigate is a relative lack of nearby tech industry compared to schools on the coasts. Industry collaborations have not been a huge problem, as they can be done over long distance and CMU is very good at encouraging them through internships/sabbaticals. (Of course, having folks next door at the CIC makes things easier.) However, when it comes to tech-industry couples, it's always good to have more potential workplaces. Sometimes CMU doesn't have two faculty positions, or two grad student positions, etc. For more admin-type analysis, see post from CSD head Peter Lee at CSDiary.

CMU will collaborate with Disney Research Pittsburgh primarily on autonomous systems, graphics, and other entertainment technologies. Collaborations thus far have proved difficult, however, due to Donald Duck's frequent temper tantrums and insertion of nonsensical paragraphs into the text of papers. Also, Tigger keeps tearing up the lab equipment.

Monday, August 11, 2008

The end of Usenet?

In this post I put up some plots with the posting and linking rates in our political Usenet data set. Were the political Usenet to continue declining at its current linear rate (according to the past 4 years, and using a rough linear regression in R), the post rate would approach 0 around July 2014. Here is the (unsmoothed) data for posting rates, and the completely unoptimized linear fit in R.

Of course, it's safe to say that regressing on 4 years of data and projecting it 6 years into the future will leave a huge margin of error-- even the fit "looks" like it should be a little steeper. (Smoothed data puts the end times a year earlier, and I imagine more sophisticated time series analysis would project something completely different.) Errors involved in fitting aside, there's no telling what other sorts of things could happen between now and then to bring it back or (perhaps more likely) speed its decline.
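The extrapolation itself is nothing fancy: fit a line, then solve for where it hits zero. I did the fit in R; the sketch below shows the same idea in plain Python on synthetic numbers (not the Usenet data), with helper names I made up.

```python
def linear_fit(ts, ys):
    """Ordinary least squares for y = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    stt = sum((t - mean_t) ** 2 for t in ts)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sty / stt
    intercept = mean_y - slope * mean_t
    return slope, intercept

def zero_crossing(slope, intercept):
    """Time at which the fitted line reaches 0, or None if it never declines."""
    if slope >= 0:
        return None
    return -intercept / slope

# Synthetic example: 48 months of data losing 80 posts/day each month.
months = list(range(48))
posts_per_day = [10000 - 80 * m for m in months]
slope, intercept = linear_fit(months, posts_per_day)
end_month = zero_crossing(slope, intercept)
print(slope, intercept, end_month)
```

The zero crossing is -intercept / slope, so a small error in the slope moves the projected end date a lot when the crossing is far outside the observed range, which is exactly the caveat above.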

Saturday, August 9, 2008

What is Eliza crossed with a Magic 8 ball, times a billion?

I just got my weekend entertainment from http://bossy.appspot.com. It's an "ask" app inspired by Ask MSR, a paper written 7 years ago for TREC. (I'm not sure the authors ever intended or expected such a thing, but there it is.)

For 50 lines of API code, it's pretty impressive-- and when it's wrong it's usually entertaining. It does well on short and simple word-association queries, like "Who is Batman?" or "What is xkcd?", and seems to do reasonably well on easily-searchable names such as my own.

It even has some political opinions. Try "Who is the worst president in history?", "Who lost the 2004 election?", or "What is the United States?" and you'll get some cynical answers. It's also pacifist, at least for certain queries.

It tends to get snarky when asked other binary queries. I was chagrined when I asked it "Which is better, CMU or MIT?" It also defects when asked to decide between Microsoft and Google, or between OU and OSU. However, it does have a preference with respect to the statistics cults.

Alas, it does not seem to be immune to spam.

Thursday, August 7, 2008

The male-female demographic in social media

Mike on Ads has a cool script to infer whether a user is male or female, based on browser history. If you have it analyze your history, it will give you a list of the sites you visited and the corresponding male:female ratio. He got the sites to poll from quantcast, but I can't tell if the demographics came from there as well. The numbers seem to be different when I plug them in, so I'm guessing he either used more/older data than what's currently up, or got it elsewhere.

Here are the ratios for various social media I plugged in, in order from "most male" to "most female":

Site M:F ratio
Digg 1.56
Flickr 1.15
Feedburner 1.11
Worldofwarcraft 1.08
Blogger 1.06
Youtube 1
Last.fm 0.96
Linkedin 0.94
Pandora 0.9
Facebook 0.83
Myspace 0.74
Livejournal 0.68

Twitter and Wikipedia didn't seem to be in the feature set. However, straight from Quantcast it seems Twitter's ratio is 0.97 and Wikipedia's is 1.07. Quantcast also lists a ton of other demographic info, which is interesting to look at.

Tuesday, August 5, 2008

How our brains deal with large numbers

Via Andrew Gelman, a recent Science article claims that humans innately use a logarithmic scale.

When asked to point toward the correct location for a spoken number word onto a line segment labeled with 0 at left and 100 at right, even kindergarteners understand the task and behave nonrandomly, systematically placing smaller numbers at left and larger numbers at right. They do not distribute the numbers evenly, however, and instead devote more space to small numbers, imposing a compressed logarithmic mapping. For instance, they might place number 10 near the middle of the 0-to-100 segment.

(Full text here, SciAm report here)

When I was a little kid my dad helped me "count to a million" using log scale (1,2,3...10,20,...100,200,...). Even then it seemed intuitive. I knew that there were increasingly more numbers in between counts as it got higher, and I felt I was "cheating" by skipping them, but I did not understand how long it truly would have taken if we'd counted all the numbers in between (I probably would have guessed it'd have taken hours, rather than days).
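Out of curiosity, here is a small sketch of how much "cheating" that shortcut involves. The function names are mine, and the `log_position` helper assumes a pure logarithmic mapping on 1 to 100, which is a simplification of the quoted task (the line there starts at 0).

```python
import math

def log_scale_count(limit):
    """Count 1,2,...,10,20,...,100,200,... up to and including limit."""
    seq = []
    step = 1
    n = 1
    while n <= limit:
        seq.append(n)
        if n == step * 10:      # reached the next power of ten: widen the step
            step *= 10
        n += step
    return seq

def log_position(n, top=100):
    """Fraction of the way along a 1-to-top line under a pure log mapping."""
    return math.log(n) / math.log(top)

counts = log_scale_count(1_000_000)
days_linear = 1_000_000 / 86_400    # one number per second, nonstop
print(len(counts), round(days_linear, 1), log_position(10))
```

The shortcut takes 55 numbers to get to a million, while saying every number at one per second would take about 11.6 days without a break. Under the log mapping, 10 sits halfway between 1 and 100, which matches the kindergarteners in the quote.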

It's not that people cannot grasp large numbers-- they just have trouble converting back to a linear scale. :-)

Monday, August 4, 2008

Memeorandum : Scandal : : Techmeme : ?

A few days ago, I posted that the top "most discussed" links on Memeorandum were related to scandal, violence, or both. Akshay suggested I do the same for Techmeme. His bid on the #1 discussed story was the Microsoft-Yahoo merger. For updates since September 2005, that turned out to be #2, and my bid, the Hans Reiser case, was nowhere close to the top.

So, what do Techmeme's sources *really* care about? Smartphones.

"Dear early adopters: Sorry we made iPhone available to the proles. Here's some iTunes. Love, Steve Jobs" [smartphones] 152

Microsoft's bid for Yahoo! [acquisitions] 138

"Dear iTunes customers: No DRM-free music for you. Love, Steve Jobs" [IP] 122

Macworld 2008 Keynote [smartphones, gadgets] 121

Google announces Android [smartphones] 113

Digg tells DMCA to bug off [IP] 112

Google on Microsoft's "hostile" bid for Yahoo! [acquisitions] 111

Google acquires Youtube [acquisitions] 107

Steve Jobs announces cheaper iPhone [smartphones] 104

Macworld 2007 (and announcement of the iPhone) [smartphones, gadgets] 101

Gadgets, with a smattering of IP and corporate bureaucracy. Spots 11-20 seem to be more of the same.

One thing that is worth noting is the "most discussed" story is a single link. So, if a number of news sites "split the vote" and have several discussion links apiece, the story may not surface in this list. With techmeme it is a little more obvious, since most bloggers seem to link to the official corporate press releases. With memeorandum, I'm trusting that preferential attachment (NYTimes and Washington Post dominate) makes it so the vote isn't split often enough to dramatically misrepresent reality.

Feed experimentation

I'm attempting to redirect my default Blogger RSS feed to http://feeds.feedburner.com/dataphiles. Google Reader seems to take awhile to figure these things out, but if you don't see another post from me in the next week, you might check to make sure you're getting the right feed. Of course, given my history there's also some chance I just dropped off the face of the blogosphere, but we'll try to not let that happen.

Thursday, July 31, 2008

Telling us what we already know

Via Thursday Bram, communications agency Universal McCann recently conducted the third wave of their global study on social media usage. The results indicated, of course, a growing usage of all kinds of social media worldwide. Also, it notes that "blogs are a mainstream media worldwide and as a collective rival any traditional media" (emphasis mine). Sooner or later, it seems we'll have to be more specific when we say "mainstream media". :-)

You can see a complete slide show of results here. (Warning: It's very colorful, and people with a sensitivity to circles should not consume.) It should be useful for citing whenever a convincing intro to a SM research paper is needed.

What has been happening to the political Usenet?

When I first heard about the Netscan project (see Marc Smith's homepage), my thought was, "People still post on Usenet? Last I heard about it, one of the more active groups was alt.fan.spice-girls." Working on a related project, I've gotten a similar reaction from other people I've mentioned it to. So an overarching theme of my project has been to answer whether Usenet is a distinct community, or simply a sample of what we already know about online communication.

One advantage to studying Usenet is that since it's been around for so long, it's easy to get historical data and say something about its evolution. Furthermore, it's easier to call what we know of it a "community" (although we're still forced to sample it, for our purposes), whereas we never really know if we've crawled *all* the blogs.

What we have done so far is obtained data since 2003 for 200 newsgroups with "polit" somewhere in the newsgroup name. Here's some over-time behavior, a plot of number of posts per day, and number of hyperlinks (in original, non-quoted content) per day:

Posts and Links for All Political Newsgroups

This is a smoothed version of the data, so as to illustrate a general trend. The first thing you'll notice is the bump in November 2004, which we can attribute to the US Presidential Election. The next thing you'll notice is that while the number of posts is declining, the number of links remains stable.

Here's the same data for a small subset, can.politics:

Posts and Links for can.politics

Predictably, the "USA Election bump" doesn't hold for all the groups. For uncultured folks like me who had to look it up, the last election in Canada was January 23, 2006. We do still notice an increasing tendency to link, per post. Perhaps people on Usenet are going more to outside sources. Or, as another intern put it, "They're getting lazy."

Tuesday, July 29, 2008

Scandal sells

Since starting at Live Labs I've gotten to play with a lot of data, including the political Usenet and crawled memeorandum hourly data (since mid-September 2005, following Katrina). Today I came across something less-than-surprising.

Top 10 links on memeorandum, ranked by number of 'discussion' links-- that is, number of discussions (usually blogged) that are related to a parent story (usually news).

"For McCain, Self-Confidence on Ethics Poses its Own Risk" [McCain and scandals] 219

"Spitzer is Linked to Prostitution Ring" 178

"Embattled Attorney General Resigns" [Gonzales and scandals] 170

[Text of Obama's race speech] 158

"NSA has massive database of Americans' phone calls" 129

"The Long Run-Up" [McCain and scandal] 119

"Craig Arrested, Pleads Guilty Following Incident in Airport Restroom" 116

"US Web Archive Is Said to Reveal a Nuclear Primer" [Iraq and Nukes] 115

"Digging Out More CNN/Youtube Plants" [Youtube politics and staged debates] 115

"Dark Suspicions About the NIE" [Iran and Nukes] 107

So, for the most part, what sells is sex and violence.

Sunday, June 8, 2008

Book: Beyond Fear, by Bruce Schneier

I read this a couple months ago and failed to take it with me to Seattle, so I've lost the notes I took on it, but it at least bears mentioning.

He proposes looking at a security problem/solution using the following steps:
1. What assets are you trying to protect?
2. What are the risks to these assets?
3. How does the proposed security solution mitigate those risks?
4. What other risks does the solution cause?
5. What trade-offs and costs does the solution impose?

It's a good introduction to some of the principles and key terms in security (at least, from what I can tell, as someone who knows very little about the field). He uses examples of national security throughout the book, essentially telling readers that terrorism isn't as much of a threat as everyday dangers like heart disease and car accidents, and that the current solutions do not mitigate the risks well. What I liked most about it was that he can frame anything in terms of a security problem and explore it in-depth (including a lot of things I wouldn't normally have thought of in that way, such as maintaining a population of honeybees), which puts it in the category of "books that help you learn to think differently". If I were put in the position to teach an undergrad-level course on computer security I would make it required reading in the first couple weeks, just to get students in the right frame of mind to think about security problems and solutions.

Tuesday, June 3, 2008

E coli: not just for health scares

Today MSR had Carl Zimmer visiting to give a talk on his latest book Microcosm: E coli and the New Science of Life, following a pre-talk backyard burger grilling (not really). I watched over the live-streaming video. Zimmer addressed how E coli has been used in the past for scientific experiments, and some new directions that microbiology is taking.

E coli has been used in bioengineering to make synthetic insulin, jet fuel, and cancer treatments, to name a few. Some students even found a way to make it "take pictures". E coli has around 2,000 "core" genes, while the entire genome (all strains of E coli) has nearly 10,000 that have been found so far (for comparison, humans have 30,000). Some scientists believe that the "bare minimum" of genes necessary for its survival is around 200. Venter and company have already been working with a different smaller-genomed species, and "keep knocking out genes, to see if it still lives." Their count is down to 350. Potential experiments are to take these O(100) genes and begin adding more to create "new life" specialized for some purposes, which is very futuristic-sounding.

Other interesting experiments involve finding bacteria that are already suited for human needs. For instance, a teenager in Canada already isolated bacteria that eat plastic bags. These sorts of experiments could solve a lot of problems. I wonder if there are bacteria that turn lead into gold. :-)

Sunday, June 1, 2008

Newsflash: Flying is Frustrating

Via The Consumerist, Americans are flying less because it's such a frustrating process, according to the Travel Industry Association. Detailed survey results are here (PDF).

Oddly enough they don't say anything about fuel costs, which I imagine have a much larger impact. For one, people are also driving less, and presumably this is not a reaction to the fact they're just sick and tired of having to fight their neighbor for the armrest.

For two, people have a greater tendency to grin and bear it when they're paying less for something (just ask any Southwest Airlines customer*). But when flights start costing more, whether on the ticket or by new-and-improved fees ("Now you want $15 to lose my bag, a service that used to be free?"), people expect a better experience, even if logically they know the cash is just getting pumped into the fuel tanks.

Perhaps I'm missing something. I haven't paid much attention to flight prices over the past year; I'm just guessing they've increased. (And if they haven't, that might explain why airlines can't get their stuff together enough to satisfy their customers.) Does anybody have solid data on this? Better, does anybody have solid data on how many people actually fly, not just what a consumer survey says?

*- I kid, but SWA flight attendants have been known to say during the pre-flight recitation, "Please do not tamper with the lavatory smoke detectors, as the penalty for disabling a smoke detector is up to $2000. And we know that if you had $2000, you'd be flying American."

Started at MSR/LL

I'm in Bellevue, WA now, and just finished my first week as an intern at Microsoft Live Labs. I'm working with Matt Hurst on some social media stuff. So far MSFT has been a fun place to work; everyone seems really happy.

One of the things I'm most excited about is the puzzle culture. I did PuzzleQuest, sponsored by MSFT, once awhile back and really enjoyed it. I hear there is an intern puzzle day as well as weekend-long "The Game" (not to be confused with The Game that I just lost). The latter is apparently invite-only, so I will have to get more details later.

Other notes:

-We found out that some recent work with Leman and Christos was accepted to KDD, so I will be in Las Vegas at the end of August. With my trusty free Microsoft Research nalgene bottle, so as not to dehydrate.
-As I tend to do when I travel, I've done an unusual (for me) amount of non-work-related reading in the past couple months. Will update later with some notes.

Thursday, April 17, 2008

How to make time for literature review

Answer: just wait until you're completely unmotivated to do anything else. Sunny days with perfect weather are really the only times I get a chance to do any significant literature reviews. This afternoon, when I was unable to get myself to stay in my windowless office, I (finally) sifted through the WSDM proceedings that I'm most interested in, and read a couple papers on trust/distrust propagation. I'm getting better at adding papers to my bibsonomy [rss]. The top 10 or so should be what I covered today.

Also a fun article: via Physics Arxiv Blog, To How Many Politicians Should Government Be Left? The article looks at the "efficacy" of a government compared to its cabinet size, and makes a rather nifty model of how opinions are formed in small networks. Another interesting bit is that while cabinet size ranged from 5 to 54, not a single government of the nearly 200 surveyed had a cabinet of size 8-- apparently it is common knowledge that that is bad luck, or something.

I also discovered that Jure was smart enough to submit last year's SDM paper to ArXiv, which yielded a citation. That has prompted me to register so I can post other publications.

This is related to a recent pet peeve of mine-- the fact that it's difficult to get conference proceedings. The ACM/Citeseer folks don't always have things from workshops and the like that I'm interested in. Most authors have the sense to post their papers on their websites, but I much prefer being able to get a conference all in one place. Of course, professional organizations don't like to do that. I find it hard to believe that they really make money off of conference proceedings, so I can only guess that it has to do with publisher/copyright/legalities rules outside their control. Maybe someday CC/GPL will be able to wrest away some control.

Sunday, April 6, 2008

ICWSM, semi-supervised learning

Returned from ICWSM, and was inspired to perhaps start blogging again, but we'll see how long that lasts.

The tutorial at ICWSM went well (pdf slides available at that link, ppt available by emailing me). I will be giving it again at NESCAI. There were a lot of great talks and posters at ICWSM; a lot more toward the text/sentiment mining side of things than last year, but still a great variety of concepts.

While in Seattle I missed the 10-601 class lectures on semi-supervised learning, and had to prepare a recitation anyway. So as part of that preparation I came across a good survey paper by Xiaojin Zhu. It has an entire section devoted to graph-based methods, some of which I hadn't heard of, so this was useful to me beyond giving me interesting things to talk about in recitation. It might be of use to try some of these algorithms on community detection in networks.

Tuesday, February 19, 2008

Open problems in movie stunt coordination

Via Fark, stuntman is attempting a 24-mile skydive.

But Steve, of East London, said: “It’s the last great challenge left on Earth. Obviously it will be dangerous. We’re playing with a lot of unknowns. But it’s my job to assess risk and I don’t believe the problems are insurmountable.”

Last great challenge on Earth? Looks like all of us scientists can quit our jobs soon! :-)

Saturday, February 16, 2008

TREC blog retrieval

Jonathan Elsas presented in the Social Media Reading Group yesterday. He walked us through the very successful approach the CMU team took for the Blog Retrieval task at TREC 2007. Details are in this paper (pdf).

He brought up the point that TREC has the cool property of being "task oriented", which is not always the case with data mining research (and is a criticism of the 'what do evolving graphs look like?' approach I tend to take with my own research).

Another point he made is that no teams at TREC (successfully) used two common properties of data that *are* important in the non-task-oriented research in social media: timestamps and link analysis. He did not seem to think that they simply aren't "useful" properties, only that nobody has figured out how to use them properly.

While I think that link analysis could be used, it certainly could not be used without some significant text analysis. My impression is that link analysis is useful for tasks like measuring influence or information diffusion, or trust and authority. Relevance seems to me to be a much more text-dependent property.

Furthermore, relevance is a subjective measure, just like influence and authority. In fact, the difficulty in a lot of data mining research is the difficulty of finding a good evaluation of your results. TREC scored the entries with human-tagging. If the goal was to find relevant blog posts, then each team's algorithm would find some candidate posts, and the competitors themselves would then vote on which seemed best. And that's probably the best we could do for something so imprecise as "relevance".

It's really hard to do "science" when such complex beings as humans are involved in the measurements.

Jon also kindly lent me the WSDM proceedings, which I copied to my laptop and intend to review soon.

Tips on the Interview Process

Jeannette Wing, former CSD chair now at the NSF, came back to give her famous talk "Tips on the Job Interview Process". I was one of the "younger" grad students there-- since it had been three years since the last time she gave the talk there's the potential it wouldn't be given again in time for me to graduate. Also, sometimes the interview process for internships is similar.

Slides from the 2005 version of the talk.

Friday, February 15, 2008

Large-scale visualization reading group

Independent of the social media reading group (though I imagine some folks will participate in both), a visualization group has been founded by Peter Landwehr and Anita Sarma. And the first group meeting is on graph visualization (Thursday at 12:30 in the gradlounge). I'm stoked.

For the schedule and to subscribe to the mailing list, visit their wiki page.

Sunday, February 10, 2008

Boycotting assistant professorships

I'm currently reading A Ph.D. Is Not Enough: A Guide To Survival In Science. There is one chapter devoted to deciding on a career path, mainly between academia and industry/government.

It brings up some good arguments against going into academia. One oft-cited reason is the "begging for money"-- while you have the freedom to study whatever you like, you're limited by what you can find funding for. One thing that isn't often mentioned is that even fully-tenured professors still don't get to do whatever they like. By the time someone has tenure they're pretty well-known and have a lot of administrative stuff to take care of. Giving invited talks and schmoozing with NSF officials tends to crowd out time to spend meeting with students, never mind actually doing nitty-gritty research like they did in the golden days of grad school.

According to the chapter, junior professors generally have that minus the job security. Grants are often awarded based on track-record, which junior profs haven't had time to get. Also, in the first few years they need to start teaching courses from scratch rather than re-using things from past years. And, of course, they need to make a place for themselves in the community by reviewing papers, serving on PCs, writing papers, mentoring students, etc. Then, if they don't get tenure, they have to go away and start all over somewhere else.

Feibelman makes an argument that we simply shouldn't stand for that. By accepting an asst. prof job as-is, one consents to that mistreatment by The Man. And the intense competition that goes on for these few prized positions isn't giving The Man any incentive to change the way he does things. Feibelman suggests that one closely evaluate his or her priorities and recognize the inherent bias one who's spent so long in school may have toward academia (as in, we want to emulate our heroic advisors). He also suggests the option of making a name for oneself in industry/government labs and then walking right up to a university and getting a tenured or almost-tenured job right away.

I would tend to agree that a 6+ year hazing period, if that is indeed what it is, isn't good for the system. It appears from the outside that of the set of {JuniorProfessorship, Sleep, Family}, a mortal being can pick two at most. And whining about it won't do jack until enough sought-after PhDs start making ultimatums. However, I am not convinced that the ones fighting for professorships are going into it blindly. Anybody who's been in grad school for a few months knows the demands on their professors. People who have a choice between academia and industry and choose academia are usually willing to make sacrifices somewhere, whether it means their family or their hobbies or their health. Scientific research is the greater good. People who actually score academic jobs probably have done little else besides work for the last 10 years of their life-- if they didn't like it that way they would have changed some time ago. It works for them.

On the other hand, being focused to the point of peripheral blindness does somewhat correlate with being in grad school, so maybe some people are shooting for academia without knowing all the costs.

Friday, February 8, 2008

Slashdot moderation

In the social media reading group today, Yi-Chia Wang led discussion over Slash(dot) and Burn: Distributed Moderation in a Large Online Conversation Space, by Cliff Lampe and Paul Resnick at UMichigan. It's an interesting study of comment moderation on Slashdot.

I don't participate in comments on /., but I occasionally read them if I really don't have enough other things on the internet to distract me, and was always a little confused about how comments were moderated and decided upon (I never read this little FAQ of theirs). Users can vote for comments to be upgraded or downgraded. Most comments are not moderated (only 28% are), so there is a high tendency for the default values between -1 and 2 to remain as they are: -1 for trolls, 0 for Anonymous Cowards, 1 for regular users, and 2 for users with good "karma", which is decided by whether you've posted.

One not-surprising thing was that at a median, 18 hours is how long it takes for 90% of a post's comments to happen. Not sure if it necessarily follows a power law dropoff, but if it does it is somewhat steeper than the -1.5 power law for post-responses in general for blogs that we wrote about in this paper, so I wonder if that is the case with comments for all blogs, or just high-traffic ones like Slashdot.
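A rough way to check a claim like this is to fit a straight line to the counts on log-log axes and read the slope off as the exponent. The sketch below does that with plain least squares. The function name and the numbers are made up for illustration (the counts are generated from an exact t^-1.5 curve, not taken from Slashdot or from our blog data), and a log-log fit is only a quick sanity check, not a rigorous power-law estimate.

```python
import math

def fit_power_law_exponent(xs, ys):
    """Return the least-squares slope of log(y) against log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope = covariance(log x, log y) / variance(log x)
    cov = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    var = sum((lx - mean_x) ** 2 for lx in log_x)
    return cov / var

# Synthetic example: comments arriving t hours after a post,
# generated from an exact t^-1.5 curve (an assumption, not real data).
hours = [1, 2, 4, 8, 16, 32]
counts = [1000 * t ** -1.5 for t in hours]

exponent = fit_power_law_exponent(hours, counts)
print(round(exponent, 3))
```

With real comment timestamps one would bin the delays first, and a slope noticeably below -1.5 would point to the steeper dropoff I'm guessing at here.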

It's certainly an interesting moderating scheme, considering the computational methods for finding "interesting" things are not there yet.

Tuesday, February 5, 2008

Teaching Seminar

I just took my second grad student teaching seminar from the Eberly Teaching Center.
Unlike the education courses I took as an education major in undergrad, these seem to be useful. Today's was on Teaching Perspectives. There were five main perspectives presented:

Transmission: "Teacher pitches content to students."
Apprenticeship: "Teacher, who is knowledgeable about the content, can serve as a model / guide students."
Developmental: "Students learn by interacting with content."
Nurturing: "Encourage learning by forging relationship between student and teacher."
Social Reform: "Focus on ideals, everything else is bonus."

We took a Seventeen-magazine-style questionnaire that showed how we "scored" on each perspective with respect to our beliefs, actions, and intentions. I scored high on the first three, and low on the last two, while I would have expected myself to score lower on transmission and apprenticeship than I did.

However, I think the perspective depends on the class. I'm currently TAing for an undergrad/Master's level course in Machine Learning. While I think that social reform and emotional growth are important for young adults, I don't think that's my job. These students are paying a ton of tuition money, and here they're paying to be taught about machine learning. I'm much better equipped to give them their tuition's worth in cold, hard knowledge than in a great teacher-student relationship. They're perfectly capable of getting qualitative ideas from their philosophy courses and extracurriculars, and their nurturing from their friends and other relationships. However, if I were teaching, say, a course on statistical bullshit detection, I'd have a high priority on social reform, and if I were teaching a freshman class on literature or composition, I might want to incorporate more nurturing into the class.

Anyway, it's at least something to think about while I'm teaching. And I'll probably go to more of these seminars.

Wednesday, January 23, 2008

Metafilter data

Here is a bunch of datasets, dumped from Metafilter. Looks like it could be interesting.

Thursday, January 10, 2008

Bad Computer Science Writing

I just now ran across a 1997 writing of Jonathan Shewchuk: Three Sins of Authors in Computer Science and Math. They are:

1. Grandmothering. That is, writing an introduction that does not tell what the paper is really about, often making it both inaccessible to newbies and obvious and irrelevant to experts.
2. A paragraph-long table of contents in the introduction. (e.g. "In section 2 we survey related work. In section 3 we go over some preliminaries...")
3. Essentially copy-pasting the introduction into the conclusions.

I've been guilty of at least the last two simply out of the oral tradition of CS folk. Oops. I did always think the Table of Contents thing, while logical for book introductions, was a little silly for an 8-page paper where you're already worried about space. As for #3, I suspect it's to make sure reviewers do have a "takeaway" message, in case they're too lazy to go back and read your introduction. However, if you have to re-state all your major findings on the last page in order for people to figure out what you've done, then the rest of your paper must have been poorly-written.

I am rather ashamed at how my writing skills have slipped since I changed majors five years ago. I could probably still pass freshman comp (and I have it easier than many of my fellow grad students since I get to write in my native language), but it's nothing like what I could do in the heyday of my high school journalism career.