The Dataphiles: 2007

Thursday, December 13, 2007

Andrew Tomkins on search data privacy

Yesterday Andrew Tomkins gave a talk at CMU. He touched on social media, but a large part of his talk was about how tough it is to "anonymize" search data. Using the AOL scandal as an example, he broke that data down at a fine-grained level and showed some surprising ways that you can identify people.

One point he brought up was "person attacks". While "trace attacks" such as finding credit card information or SSNs in the queries are dangerous enough, an attacker might decide to exploit one person. For instance, if they know that you vacationed in Tahiti and recently bought a Honda minivan, they can look through the data and identify anyone who has queries on both Honda minivans and Tahiti vacation packages-- you could probably do well with this even without narrowing the pool to people who have searched for, say, "* in Pittsburgh PA". Or, if an attacker knows the victim and is at his house for a party or something, the attacker might ask to use the victim's computer and put in a unique term-- then when the attacker obtains the "anonymized" data later he can find that person. With knowledge of what the victim has searched for, for instance "AIDS clinics" or some adult term, the attacker could use blackmail.
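
To make the "person attack" concrete, here is a minimal sketch of the intersection step: given facts the attacker already knows about the victim, keep only the anonymized user ids whose queries mention all of them. The log, the user ids, and the function name `users_matching_all` are all made up for illustration; this is not the AOL data or anything shown in the talk.

```python
from collections import defaultdict

# Hypothetical toy log of (anonymized_user_id, query) pairs. Not real AOL data.
LOG = [
    (101, "honda minivan prices"),
    (101, "tahiti vacation packages"),
    (101, "aids clinics"),
    (202, "honda minivan recall"),
    (202, "pizza in pittsburgh pa"),
    (303, "tahiti weather"),
]


def users_matching_all(log, known_terms):
    """Return ids of users with at least one query containing each known term."""
    seen = defaultdict(set)
    for user_id, query in log:
        for term in known_terms:
            if term in query.lower():
                seen[user_id].add(term)
    wanted = set(known_terms)
    return sorted(u for u, terms in seen.items() if terms == wanted)


# The attacker knows two facts about the victim and intersects on them.
candidates = users_matching_all(LOG, ["honda minivan", "tahiti"])
print(candidates)
```

Once the candidate list is down to a single id, every other query under that id (here, the clinic search) is tied to the victim. The "unique term" variant is the same lookup with one planted term instead of two known ones.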

It makes me want to play with the AOL search data-- I have some ideas for trend analysis. I'd downloaded it at some point but never got around to doing anything with it.

Emailing prospective schools

FemaleScienceProfessor has a post on pre-accepted students writing professors for advice on grad school, queries regarding availability, etc.

In the end, I don't think that e-mailing professors in such a manner actually helps your application. What does help is *really* getting your foot in the door, such as doing REUs or other summer programs, or getting a job as a research programmer (this is good if you want to take a year "off" between degrees anyway). Asking professors about this sort of availability is a good idea-- the programs aren't always easy to find on department main pages. Of course, you have to do that a year before your apps are due.

Also, going to a conference while you're an undergrad, if you have the opportunity, is helpful-- you get more of a chance to demonstrate you know what you're talking about. And if nothing else you can at least talk to other grad students, who are less intimidating and more likely to ask you to join them for a beer.

Friday, December 7, 2007

Consequences of geographic distance and social networks

A well-known phenomenon here at CMU SCS is the NewellSimon-Wean barrier. There are several sub-departments of SCS-- including Computer Science, Machine Learning, Language Technologies, Robotics, Human-Computer Interaction, Software Research, and probably others I've forgotten. CSD, MLD, and ISR are in Wean; LTI, HCII, and Robotics are in NSH. (Then there are students with offices in Doherty or the CIC, etc.) There is a covered bridge about 20m long connecting the two buildings.

And yet somehow I know disproportionately more students in CSD, MLD, and ISR than in the others, even though LTI and Robotics have more overlap with my department in terms of research interests. I think this has to do with socializing factors. The NSH departments have their own lounges, whereas all the departments in Wean share a lounge (ISR and MLD are both fairly small). Each department has its own social organization to some extent, but the all-SCS social organization, Dec/5, is mostly CSD and ISR people (with growing MLD representation). That's even though all of our events happen in Newell-Simon, and I believe our happy hours are well attended by people from both buildings.

Of course, anecdotal evidence reveals that Dec/5 participation has a lot to do with personal connections. It is a time commitment, after all, and it's very easy to flake out on volunteer organizations because any given graduate student is "too busy". It's not so easy to do that if your best buddy is in the organization too and will have to pick up the slack. While we get a lot of great volunteers toward the beginning of the semester, once November/April hits it becomes very difficult to put on a TG (happy hour) and for the most part only people in the central "clique" sign up to help out-- and usually out of peer pressure. I also recognize that if I'm not friends with people I'm volunteering with, even if I like them as people, I'm going to get kind of bored.

This makes me think that the key to retaining Dec/5 volunteers is to integrate them quickly through separate social activities. If they can become friends with existing committed folks, they're more likely to become committed themselves.

Wednesday, November 28, 2007

Breaking power laws

Cosma Shalizi just sent me a paper, Power Law Distributions in Empirical Data, that uses real statistics to fit power laws and other models that power laws are often mistaken for. Yesterday's discussion in the seminar we're in was on a Barabasi paper and Stouffer et al's response to it, which was apparently a family feud between two students who shared the same advisor (Barabasi and Amaral, IIRC). I've fit power laws in some of my work-- and according to the Clauset, Shalizi, and Newman work, I've been fitting them wrongly. I intend to fix that.

What's strange is that it's so widely accepted in the (CS side of the) data mining community that degree distributions follow power laws-- not one of my reviewers has said anything about it, and at ICWSM, everybody who analyzed the blog dataset said the degree distributions were power laws (in-degree, maybe, but even visually I'm not convinced about out-degree). Maybe this has to do with the idea of having preferential attachment be the generative model, which I am also not a huge fan of (people with lots of friends may be more likely to gain more friends, but not because new arrivals to the network immediately link to them).

The first thing I would like to check on is the observation we had (in SDM 07) that the popularity drop-off of blog posts follows a power law with exponent 1.5. We truncated it, because we only had reliable in-link information for about 30 days. First I intend to look at a longer time period, and see if we can avoid truncating it. Then I'll run some of the code Clauset posted that relates to the paper. If something fits better than the power law, I'd like to know what it is.
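
For reference, the core of the maximum-likelihood approach for continuous data is short enough to sketch. With a lower cutoff xmin, the estimate is alpha = 1 + n / sum(ln(x_i / xmin)), with standard error roughly (alpha - 1) / sqrt(n). The code below is only a minimal sketch under simplifying assumptions: xmin is taken as given rather than estimated, there is no goodness-of-fit test, and no comparison against alternatives such as the log-normal, all of which the paper covers. The function names and the synthetic data are mine, not from Clauset's posted code.

```python
import math
import random


def fit_power_law_exponent(data, xmin):
    """MLE of alpha for a continuous power law p(x) ~ x^(-alpha), x >= xmin.

    Assumes xmin is already known and that some tail value exceeds xmin.
    """
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum
    stderr = (alpha - 1.0) / math.sqrt(n)  # leading-order standard error
    return alpha, stderr, n


def sample_power_law(alpha, xmin, size, seed=0):
    """Draw synthetic power-law samples by inverse-transform sampling."""
    rng = random.Random(seed)
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(size)]


# Sanity check on synthetic data with a known exponent of 1.5.
samples = sample_power_law(alpha=1.5, xmin=1.0, size=20000)
alpha_hat, stderr, n_tail = fit_power_law_exponent(samples, xmin=1.0)
print(alpha_hat, stderr, n_tail)
```

This is different from the common habit of fitting a straight line to a log-log histogram, which is one of the practices the paper argues against.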

Monday, October 8, 2007

Dataphiles 2.0

How had I not heard about this until now?

Sunday, September 9, 2007

Brains, politics, and PhDs

According to this study, liberals more easily handle change than conservatives. I'm not on campus, but I'm not sure that CMU gets that journal anyway. I'd be interested to read the study, because I'd like to know if they controlled for education, or if they ran the same study varying educational levels while keeping political bent constant. I would assume that to get approved for publication they would have at least controlled for age.

My curiosity re: education is because a while back I read something (sorry, forgot source) saying that people who had PhDs had less "mature" brains because they had to constantly learn new things, or something. Which would explain why it seems a lot of professors tend to retain childlike qualities. And how most of us go to grad school to avoid growing up.

Tuesday, September 4, 2007

The many faces of Milgram

In the networks seminar class, we are currently discussing Milgram's small world experiment, and the various papers written about it. In the small world experiment, a bunch of people across the U.S. were selected to have a big packet sent to them. The packet directed them to pass it along through a chain of "handshakes" (people acquainted on a first-name basis) to get it to a stockbroker in Boston.

What I had forgotten was that this character is the same guy who did the obedience to authority studies: the rather disturbing experiments suggesting that people don't mind causing pain to other beings, so long as a guy in a lab coat says it's OK. The funny thing is that the authority studies took place in 1963, four years before the small world experiment. Obviously, "Milgram studies" wasn't enough of a household name by 1967 to get people who received the folder to immediately toss it, saying "Whatever, this guy may say he wants me to get this packet to some dude in Boston, but really he's just going to kill my family."

The small-world study, of course, has some holes in it, but our conclusion was "pretty good for a sociologist in the '60s". They didn't have the resources available to us data-mining folk (according to [2], the Nebraska portion of the study had a budget of $680), and they weren't as interested in the statistics.

Dodds et al. [1] replicated the experiment on a much larger scale. I'm sure they had to go through a lot of IRB red tape to get that sort of permission, too. Thanks in part to Milgram, of course.

[1] P. Dodds, et al. An Experimental Study of Search in Global Social Networks. Science 301, 827 (2003).

[2] J. Kleinfeld. The Small World Problem. Society 39(2), 61-66 (2002).

Seminars and semester

I'm in two seminars now, which I intend to use for blog fodder. We'll see how long that lasts, since I obviously have bursty blogging habits. The first seminar is Statistical Models and Methods for Networks, taught by Steve Fienberg, and the second is Analysis of Social Media, led by William Cohen and Natalie Glance (of Google Pittsburgh). Both seem very promising, as the syllabi list a number of papers that are in my embarrassingly large "should read but haven't gotten around to" pile. Plus, since grades are based on class participation (and perhaps a one-hour presentation), they should be low-stress, compared to other courses I've taken.

This semester Christos and I will be putting together a tutorial on graph mining for ICWSM 2008. (We're currently putting together the abstract/bios, so the actual tutorial page will be up soon!) I also have a number of research projects going on, including putting together a paper from my summer work at PricewaterhouseCoopers, actually looking toward a thesis topic, and getting some new datasets. In my spare time I'm compiling election campaign donation data into social-network form, as my good deed to the machine learning community (and out of personal curiosity). As of now I just need to get it into MATLAB-readable format and get about half a gig of webspace so I can post the compressed files. Also this semester I will be continuing my involvement in Dec/5, helping run the first few TGs as the torch is passed to new members.

Friday, July 27, 2007

Talking college students out of drinking

This study involved different punishments for drug/alcohol violations at a college campus. It compared the effectiveness of motivational discussions vs. a $300 fine for a violation. Both were pretty effective in the short term (4 months). But a little more surprisingly, the discussions were more effective in the longer term (15 months).

Frankly I'm surprised the little interventions worked that well. I would think the short-term effects would happen because it's embarrassing, but I find it hard to believe they're convincing. All the alcohol education seminars I heard about in college were kind of a joke.

Of course, this would depend on the type of violation we're talking about. A violation where other people are put at risk (hazing at frats, drunk driving) is something that somebody can be guilted into not doing again. Offenses for loud parties or possession, where the most the motivators can argue for is the hallmate's study habits or the offender's own health, are a little harder to use guilt against.

Now, if part of this "motivational discussion" included them saying to the offender "Next time we're turning you over to the cops", I would understand.

Friday, July 20, 2007

Google data miners have come across a scientific breakthrough.

They have proved that GTalk users are very, very lonely, and are probably about 16 years old.

Wednesday, July 18, 2007

Fiction as an exaggeration of inner fiction

There's an interesting post in Overcoming Bias. It suggests that our biases about reality tend in the direction of fiction. That is, (successful) fiction is simply a further exaggeration of things we already tend to overestimate. I think the suggestion is that our biases cause such fiction to be well-written and well-received, not that exposure to fiction causes the biases. Hanson then suggests that we "Find ways in which fiction tends to deviate from reality, and then move your estimates of reality in the other direction."

A few possible human biases that this "fix" would identify:
-Your boss at the office probably isn't as socially inept and ignorant as you think.
-Your adversaries or competitors are not as evil and immoral as you think.
-Solutions cannot be wrapped up as quickly as you think.
-Serial killers are not as interesting as you think.
-People don't have nearly as much sex as you think.

This is related to the idea that everybody is their own protagonist.

Tuesday, July 10, 2007

Oh dear. I found a new (to me) toy, STEM: Spatio-Temporal Epidemiological/Event Modeler. It comes built-in with a simulation of what would happen if the 1918 Spanish Flu outbreak happened in 2000, with various starting points. You can create your own SIR/SIS/SEIR virus and set it loose to infect the entire world by way of bird migratory patterns, air traffic, etc. What's more, they've released the source code, allowing a whole new level of tinkering.

Full press release here.

For some background on the mathematics, see The Mathematics of Infectious Disease, by Herbert Hethcote.
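
For a feel of what an SIR model does, here is a minimal sketch of the basic version: susceptibles become infected at rate beta * S * I / N and infecteds recover at rate gamma * I, integrated with simple Euler steps. It has nothing to do with STEM's actual code and has no geography, bird migration, or air traffic. The parameter values are made up for illustration, not estimates for the 1918 flu.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Euler integration of the basic SIR model in a closed population.

    dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I
    Returns a list of (time, S, I, R) tuples.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population stays constant
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history


# Illustrative (made-up) parameters: 10 infected people among 10,000.
history = simulate_sir(beta=0.5, gamma=0.2, s0=9990, i0=10, r0=0, days=100)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(peak_time, peak_infected)
```

SIS and SEIR are variations on the same loop: in SIS, recovered people go back to being susceptible, and SEIR adds an "exposed" stage between susceptible and infected.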

I am beginning this blog in an attempt to maintain a public blog. I'm too shy to make public my personal blog, but too vain to not broadcast at least some of my thoughts to the unknown.

I'm a grad student studying data mining for social networks. While I would like to be able to write about things related to my field on a regular basis, I will probably write to the tune of whatever I'm reading. I read blogs and books focusing on data mining, stats, sci/tech, public health, social sciences (particularly w.r.t. gender issues), and humor. Since my goal is just to write regularly, off-topic posts are better than none at all.
