December, 2009


31
Dec 09

Project I Haven’t Started: Liberate All My Data Every Night Forever

I want to collect a bunch of python scripts that log in to your account on a network service, scrape out as much data of yours as possible, and save that data in an easily parseable format. For some services, such as gmail and Google Docs (and some other Google services, thanks in part to the Data Liberation Front), this is mostly a question of doing the logging in and then clicking a few buttons to use something convenient like an “export” feature. For other services, such as facebook, a bunch of python needs to be written that manually crawls through pages and grabs all relevant data.
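To make the "manually crawls through pages" part concrete, here is a minimal sketch that uses only Python's standard `html.parser` module. It assumes the page has already been downloaded (logging in is left out entirely), and the `LinkCollector` name and the sample markup are made up for illustration, not taken from any real service.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every <a href=...> on a page as a {"href", "text"} dict."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(
                {"href": self._href, "text": "".join(self._text).strip()}
            )
            self._href = None


def extract_links(html):
    """Parse an HTML string and return the links found in it."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

A real scraper for a given service would look for whatever markup that service uses for photos, messages, and so on; links are just the simplest thing to pull out.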

I want to have all of these scripts in a central, open repository that invites contributions. (As an aside, I’d also like a more general central repository/tracker for bits of python that are useful for crawling websites–from a full, manual wikipedia scraper to something like a single function that gets you a logged-in cookie for gmail.com). Once I have this repository, I want to pack all the software together with an easy front-end where you simply enter your login credentials for all the network services that you use. I want this to run every single night, doing incremental backups of all of my data.
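One way the pieces could fit together, sketched under heavy assumptions: each contributed script registers itself under a service name, and a single runner calls them all and writes one JSON file per service. Every name here (`EXPORTERS`, `exporter`, `run_all`, `example-service`) is hypothetical, and the example exporter returns canned data instead of actually logging in anywhere.

```python
import json
from pathlib import Path

# Registry mapping a service name to a function that returns that
# service's data as a list of plain dicts. All names are hypothetical.
EXPORTERS = {}


def exporter(service):
    """Decorator that registers an export function under a service name."""
    def register(func):
        EXPORTERS[service] = func
        return func
    return register


@exporter("example-service")
def export_example(credentials):
    # A real exporter would log in with `credentials` and scrape pages or
    # use an export feature; this stand-in just returns canned records.
    return [{"id": "1", "kind": "note", "text": "hello"}]


def run_all(credentials_by_service, out_dir):
    """Run every registered exporter and write one JSON file per service."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for service, func in sorted(EXPORTERS.items()):
        records = func(credentials_by_service.get(service, {}))
        path = out_dir / f"{service}.json"
        path.write_text(
            json.dumps(records, indent=2, sort_keys=True), encoding="utf-8"
        )
        written.append(path)
    return written
```

The front-end would then only need to collect credentials and call `run_all`; the "every single night" part could be left to something like cron rather than built into the tool.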

I want this for its implications for computing autonomy–using proprietary network services is still bad because I can’t understand or build on (or share) the code that I am running (or, more closely, causing to be run), but at least this way I can feel like I have more “control” over my data (more on this two paragraphs down).

I also want this for the practical benefits. Techies sometimes say about computer data that “if it isn’t backed up, it doesn’t exist.” We often (rightfully) think of putting our data into a network service as making it more likely to live forever–after all, they often have their own backup system which is much more robust and heavily funded than ours. But of course we should not assume anything about what a proprietary network service is doing behind the scenes! There may be no backup system at all, or it may fail (see the sidekickocalypse), or the service-providers may just decide to pull the plug and let all your data die (see the Geocitenocide). Also, if a service goes down but it has backups, there is nothing you can do to expedite the process of having those backups restored. Your data is “safe” but it’s locked up until further notice. This is why I want a client that runs these scrapes every night, ideally incrementally (maybe a browser plugin can track when you add data to a site? Maybe this whole idea ought to be implemented as a browser plugin?).
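The incremental part could be approximated without any help from the service, assuming each scrape yields a list of JSON-friendly records: hash every record and only write the ones that are not on disk yet. This is just one possible approach, and the function names below are invented for the sketch.

```python
import hashlib
import json
from pathlib import Path


def record_digest(record):
    """Stable hash of a record, independent of key order."""
    blob = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def incremental_backup(records, backup_dir):
    """Store only records not already on disk; return how many were new."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    new_count = 0
    for record in records:
        path = backup_dir / f"{record_digest(record)}.json"
        if path.exists():
            continue  # identical record was saved by an earlier run
        path.write_text(json.dumps(record, sort_keys=True), encoding="utf-8")
        new_count += 1
    return new_count
```

Note what this does and does not buy you: nothing already saved is ever overwritten, and an edited record simply shows up as a new file next to its old version, but the scraper still has to fetch everything each night. Skipping the fetch too would need the kind of change tracking the browser-plugin idea hints at.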

Of course, knowing that I always have access to my data in easily parseable formats has another important advantage: it makes it easier to leave one network service for another (or just altogether), especially if an additional part of this project is to collect scripts that can take these exported chunks of data and import them back into other services or just other pieces of software (I’d like all of my flickr photos on facebook and also in my on-disk copy of F-Spot. Also post profile data from facebook back on my myspace. kthxbai.) Locking up data has long been a malicious strategic device used to keep people using your software/service even after they’ve decided they don’t want to anymore–from network services that are data black holes to locally-run software that uses DRM or generally proprietary file formats.

With this type of control over my data, it’s easier to leave a proprietary network service (this is a good reminder that computing autonomy is strongly related to data control–or data ownership, if you prefer). This has useful implications for people who occupy a middle ground on computing freedom in relation to network services. These people may believe that the usefulness of a computing resource is more important than its respect for one’s freedom–these are the “I’d use a Free alternative if one existed that was as good (or at least nearly as good)” people. These people will have fewer excuses when a great piece of free software comes together that fits their needs (and, conversely, they’ll have the freedom to leave if/when a nicer proprietary service comes along).

Of course, the TOS for some (most?) of these services may be violated by the use of the software that I’m describing. However, I for one would love to see what happens when a bunch of people violate TOSs by doing this. Can we script cleverly enough that the service can never tell? Do people get found out and all lose their accounts? Do they really care that much, now that they have all of their data? Do they write angry blog posts about how they were “booted off of facebook for trying to download [or even 'take back'] their own data,” which eventually end up on larger news outlets? Do these stories make people care more about data portability and computing autonomy in network services? Does facebook come back with its tail between its legs and implement its own export feature?


29
Dec 09

“I Wish The ‘is now following you’ emails from Twitter were more useful”

Every few days, I get an email from twitter telling me that someone new is following me. 70% of the time it’s spammers, 20% of the time it’s people I don’t care about, and 10% of the time it’s people who I want to follow back (and I’m thankful that twitter gave me the heads up!)

The problem:
Though these emails include a few statistics about the user (number of tweets, number of followers, number of people following), they don’t include any of their actual tweets. This is unfortunate, because the best way to tell whether or not you care about someone on twitter is to look at what they say! As it is, I have to click a link in the email to visit their profile if I want to do that.

The solution:
Some python that I wrote, which logs in to your email, finds these messages from twitter, deletes them, and sends you more useful messages that include recent tweets by the person who has just followed you!
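This is not the script itself, only a small sketch of its two middle steps using Python's standard `email` package: spotting a follow notification and composing the replacement message. The subject pattern is an assumption about how the notification is worded, the function names are invented, and logging in to the mailbox, deleting the original, fetching the tweets, and sending are all left out.

```python
import re
from email import message_from_string
from email.message import EmailMessage

# Assumed subject format; the real notification wording may differ.
FOLLOW_SUBJECT = re.compile(
    r"^(?P<name>.+?) is now following you", re.IGNORECASE
)


def follower_from_notification(raw_message):
    """Return the follower's name, or None if this is not a follow notice."""
    msg = message_from_string(raw_message)
    match = FOLLOW_SUBJECT.match(msg.get("Subject", ""))
    return match.group("name") if match else None


def build_digest(recipient, follower, recent_tweets):
    """Build a replacement email that quotes the follower's recent tweets."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["From"] = recipient  # sent to yourself in this sketch
    msg["Subject"] = f"{follower} is now following you (with recent tweets)"
    lines = [f"Recent tweets from {follower}:", ""]
    lines += [f"- {tweet}" for tweet in recent_tweets] or ["(no tweets found)"]
    msg.set_content("\n".join(lines))
    return msg
```

In a full version, the standard `imaplib` and `smtplib` modules would be the obvious candidates for the mailbox and sending steps, with the list of tweets coming from whatever source is available for that user.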

In the future, I’d like to make this happen for identica as well. Also, I’d like to (and am very confident that I can write the code to) be able to follow the person back by responding to the email.