29
Mar 11

“Usable,” More Than Just “Open”: How Do We Make Public Domain Image Search Actually Work?

Alice Wants Public Domain Images

Alice is an educator who needs an image of a poodle for a lecture slide. Alice wants to share her work (say, under Creative Commons Attribution), so she needs to make sure that she has permission to share all of the pieces she is borrowing.

Alice searches Google for “public domain image poodle,” but doesn’t quite get what she wants. She digs further and discovers that public domain images are scattered across the web:

  • There are the common repositories, like archive.org, Wikimedia Commons, Flickr, and Openclipart.
  • Then there are the hundreds of repositories maintained independently by libraries, universities, federal government agencies, and even hobbyists–some but not all of them are listed on the CC wiki.
  • Finally, there are the broader search engines, like Google Image Search and Yahoo Image Search, which let you search across a bunch of these for openly licensed images, but often miss things for various reasons (perhaps the licensing information was not articulated in a way that the search engine could pick up).

Alice is confused and overwhelmed.

Suddenly, finding an image of a poodle is a huge can of worms for Alice.

Crucially, what we usually call “openness” isn’t going to help Alice here—the images are already “open” because they are available online under a permissive license. Yet they’re not really discoverable.

In fact, “open” misses several features that Alice should really expect from her image search (hi-res files, standard formats, etc).

Parker is trying to Help Alice

For the past year I’ve been working in this problem space, starting with a project last winter to scrape out and mirror all of the images in the CDC’s Public Health Image Library. This summer, I drafted the Usable Data Criteria 1.0, which enumerates all of the things that Alice wants in her public domain image search, beyond just openness.

This Spring, with generous funding from Digital Studies at Dartmouth College, I’m continuing work on making federally owned, public domain images (listing of databases here) more usable, with a focus on discoverability.

My approach is two-pronged:
Crowdsourced Metadata Tags to Improve Search
I’m working with the Metadata Games Project, which is making it easy for the crowd to add metadata tags to images by playing games. My focus will be on making it easy to add an image set to Metadata Games and to dynamically retrieve the metadata tags in a machine-readable format. This way, Bob the librarian, who has an image set, can use Metadata Games to dynamically retrieve tags for his images that improve over time (and he can give his patrons an easy and fun way to help with tagging). Better tags mean a better search experience for Alice.
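To make "tags that improve over time" a little more concrete, here is a minimal sketch of merging tag submissions from many players into one machine-readable record. The function name, the agreement threshold, and the JSON layout are assumptions made for illustration; this is not the Metadata Games API.

```python
import json
from collections import Counter


def aggregate_tags(submissions, min_votes=2):
    """Keep tags that at least `min_votes` players agreed on.

    `submissions` is a list of tag lists, one list per player.
    """
    counts = Counter()
    for tags in submissions:
        # Normalize case and whitespace, and count each tag once per player.
        counts.update({t.strip().lower() for t in tags if t.strip()})
    # Sort by vote count (descending), then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"tag": tag, "votes": n} for tag, n in ranked if n >= min_votes]


# Hypothetical submissions from three players for one image.
submissions = [["Poodle", "dog"], ["dog", "white"], ["poodle ", "dog", "grass"]]
record = {"image_id": "example-001", "tags": aggregate_tags(submissions)}
print(json.dumps(record, indent=2))
```

As more people play, more tags cross the threshold, so the record Bob retrieves gets richer without him doing anything.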

Downloading and Cross-Posting Public Domain Images from Federal Agency Archives
Using web scraping techniques, I’m downloading images and their metadata from various federal agency image archives and making them more accessible to Alice in a number of ways:

  • providing additional tags from Metadata Games
  • cross-posting the images to other image archives
  • most importantly, providing one meta-index, updated nightly, with a machine-readable interface

The machine-readable meta-index will make all of the images in the federal image archives in my system easy to add to any larger search engine or dataset.
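As a rough sketch of what one nightly build of such a meta-index might produce, the snippet below turns a list of scraped records into a single JSON document. The field names (`source_archive`, `rights`, and so on), the record IDs, and the example URLs are all hypothetical; the real index format is still undecided.

```python
import json


def build_index(records, generated):
    """Return a JSON string listing every image with its source and rights."""
    entries = [
        {
            "id": rec["id"],
            "title": rec.get("title", ""),
            "source_archive": rec["source_archive"],
            "image_url": rec["image_url"],
            "rights": rec["rights"],
            # De-duplicate and sort tags so nightly diffs stay small.
            "tags": sorted(set(rec.get("tags", []))),
        }
        for rec in records
    ]
    entries.sort(key=lambda entry: entry["id"])
    index = {"generated": generated, "count": len(entries), "images": entries}
    return json.dumps(index, indent=2)


# Hypothetical records, as a scraper might emit them.
records = [
    {
        "id": "archive-b:0042",
        "title": "Poodle at a county fair",
        "source_archive": "archive-b",
        "image_url": "http://archive-b.example.gov/images/0042.jpg",
        "rights": "public-domain",
        "tags": ["dog", "poodle"],
    },
    {
        "id": "archive-a:0001",
        "source_archive": "archive-a",
        "image_url": "http://archive-a.example.gov/img/0001.tif",
        "rights": "public-domain",
        "tags": ["virus", "microscopy", "virus"],
    },
]
print(build_index(records, generated="2012-01-01"))
```

A larger search engine would only need to fetch this one document each night to learn about every image in the system.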

The code that I’m writing for image archive scraping will be generalized into a python library that’ll be available under the GPL.
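The library does not exist yet, so the following is only a sketch of what one per-archive parser could look like, using nothing but the Python standard library. The class name, the function name, and the example archive URL are hypothetical, and a real archive would likely need more specific rules than "every img tag on the page".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageListingParser(HTMLParser):
    """Collect image URLs and alt text from the <img> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if src:
            self.images.append({
                # Resolve relative paths against the page they came from.
                "image_url": urljoin(self.base_url, src),
                "title": attr.get("alt") or "",
            })


def parse_listing(html, base_url):
    """Parse one listing page and return a list of image records."""
    parser = ImageListingParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.images


# Hypothetical listing page from an imaginary archive.
page = '<div><img src="/img/1234.jpg" alt="Electron micrograph"><img src="thumbs/5.png"></div>'
for image in parse_listing(page, "http://archive.example.gov/gallery/page1.html"):
    print(image)
```

Keeping the fetching separate from the parsing means each archive's parser can be tested against saved HTML without touching the network.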

Help me help Alice

What do I do next?

In terms of these federal agency images, what is the best way to put them where Alice will see them? Is it to cross-post to archive.org and Wikimedia Commons? Is it to create Yet Another Public Domain Image Search that makes them all accessible through one search engine? Is it to simply host mirror copies with semantic metadata articulating that the images are in the public domain, so that, e.g., Google Image Search can correctly index them? Do we really just need One Usable Image Search to Rule Them All?
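For the "mirror copies with semantic metadata" option, one possibility is to mark each mirrored image with a `rel="license"` link pointing at the Creative Commons Public Domain Mark. The sketch below generates such a fragment; the function name and the figure/figcaption layout are assumptions, and whether any particular search engine honors this markup is not something the sketch can promise.

```python
from html import escape

PUBLIC_DOMAIN_MARK = "https://creativecommons.org/publicdomain/mark/1.0/"


def image_with_license(image_url, title,
                       license_url=PUBLIC_DOMAIN_MARK,
                       license_name="Public Domain Mark 1.0"):
    """Return an HTML fragment tying an image to a machine-readable license link."""
    return (
        "<figure>\n"
        f'  <img src="{escape(image_url)}" alt="{escape(title)}">\n'
        f"  <figcaption>{escape(title)} "
        f'(<a rel="license" href="{escape(license_url)}">{escape(license_name)}</a>)'
        "</figcaption>\n"
        "</figure>"
    )


# Hypothetical mirrored image.
print(image_with_license("http://mirror.example.org/phil/1234.jpg", "Standard poodle"))
```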

I need some advice from people who know a thing or two about image search, library and archival science, etc. I’m also looking for good journal articles about image search and image tagging. And if you’re interested in helping write or review code (such as html parsers for individual image archives), that would rule too.

Drop me a line: parker@madebyparker.com

I’m living and working on this project in Boston this spring.


Heavily edited 4/13/12. Thanks, Asheesh, for your feedback!


26
May 10

How are you Going to Live Your Life Differently Starting Right Now?

I spent the better part of the day today reading/watching/thinking about stuff by the ever-awesome Merlin Mann. If you’re a fan of his work, you’ll see where I’m coming from when I say that I decided that I had better do something generative and fulfilling before the day ends, so here I am at 3am scribbling some half-baked thoughts into my browser window.

Because I’m interested in playing an active role in creating an awesome college experience for myself, I often think back to my idealistic 12th-grade vision of my future small-school liberal arts education. Probably the biggest part of that vision was intellectually stimulating late-night conversations in dormitory hallways.

And recently, especially tonight, I’m thinking that I don’t just want to have “intellectually stimulating” conversations with my peers; I want to read words and watch lectures by people who have/had really great ideas about morality and creativity and personal fulfillment and everything else that is inspiring and potentially life-changing. Then, I want people—students, profs, both—to hold my feet to the fire and say this to me:

You’ve encountered this inspiring idea, and you’ve taken some time to reflect on it alone or through conversation. Armed with this new knowledge, how are you going to live your life differently starting right now?

Then I want them to hold me to it. I want them to check in with me and to give me unsolicited feedback when I’m talking the talk without walking the walk (or worse, ceasing to talk the talk when I realize how hard it is to walk the walk). I want them to encourage me to write on my blog about my personal goals and convictions so that they are fully articulated and I am publicly accountable.

Then I want to do the same for my peers. I want to be surrounded by purposeful personal growth.


22
May 10

My Contrarian Stance on O’Reilly’s Contrarian Stance on Facebook

This is in response to Tim O’Reilly’s piece on the O’Reilly Radar: My Contrarian Stance on Facebook and Privacy. O’Reilly’s thesis is simple:

The essence of my argument is that there’s enormous advantage for users in giving up some privacy online and that we need to be exploring the boundary conditions

In his essay it quickly becomes clear that O’Reilly presupposes that the only/best way to provide web services is through centralized servers owned and operated by private (and probably large) companies. I think that this is really super extremely wrong.

O’Reilly writes, “We give up our location in order to get turn by turn directions on our phone.” But you could imagine a service where you hosted all of the map code and data on your own computer—maybe a machine sitting under your bed at home, or maybe your cellphone itself (hey, they’re getting faster and faster). This mapping software gains no benefit from any sort of network—it’s purely Software as a Service (SaaS).

Of course, the network service doesn’t have to be SaaS to benefit from thinking outside the box of centralization. This is what Diaspora* and a few other projects are hoping to prove. If the network is decentralized, you don’t have one primary party that can access all of the data, so you have more privacy. See also email, where your messages go directly to the email provider of the person receiving them. Sure, we can all choose to give up on the decentralized internet and use gmail (guilty), but if we decide that we really care about controlling who sees our messages, we can still choose to host our own mailservers (plus there’s always end-to-end encryption). No one person is controlling this information flow—we just agreed on a protocol and then moved on.

I don’t doubt that there are some examples of services where the very nature of the service means that information from a huge number of people makes it better, and thus some amount of centralization (or at least reporting back to the mother ship) makes sense. Recommendation engines, especially ones that use machine learning algorithms, are a clear example (think Pandora, Grooveshark, Netflix). However, this class of services is only one slice of the social web, and Facebook does not lie within this slice.


14
Mar 10

Dartmouth’s Mission Statement: The Most Proprietary in the Ivy League?

A logical early step in creating conversation about OpenCourseWare on Dartmouth’s campus is to look to the school’s mission statement to see where it connects. I expected Dartmouth’s mission statement to include at least one statement about “educating its students and the world” or “broadly disseminating knowledge” or at least “advancing knowledge and research for the greater good worldwide.”

Here is Dartmouth’s mission statement (full text at the bottom of this post).

The closest passage that I could find was this:
“Dartmouth fosters lasting bonds among faculty, staff, and students, which encourage a culture of integrity, self-reliance, and collegiality and instill a sense of responsibility for each other and for the broader world.”

This comes close, but it’s disappointing in that “the broader world” is tacked on at the very end, and doesn’t play a central role in the sentence, which is actually about bonds among university affiliates. If you pick it apart (it seems to me that being picky and semantic is warranted for something as important and central as a mission statement) it says that the bonds which Dartmouth fosters “instill a sense of responsibility for each other and for the broader world” among the individual members of the college. In other words, the school does not specifically make a point of spreading knowledge throughout the world, but the individual members feel “responsible” for the world in a broad sense, which might include a responsibility to spread knowledge throughout the world. Even though this statement involves a global scope, it says nothing specific about education or dissemination of knowledge on a global scope.

There’s also this statement:
“Dartmouth faculty and student research contributes substantially to the expansion of human understanding.”

This is only useful if we can understand course materials as research, which is a stretch to say the least.

I thought this was weird, so I did some research. My friend at Yale suggested that I consult the Dartmouth faculty handbook. But as far as I can tell, it’s of no help. The mission statement listed within is just a longer version of the one mentioned above, and makes no additional mention of educating the broader world.

Though our mission statement is lacking in this department, President Kim’s rhetoric seems to easily support OpenCourseWare. His saying, “make the world’s problems your problems,” easily applies if you think of the lack of access to educational resources worldwide as one of the world’s problems.

I understand that our mission statement was last revised in 2006. I think that it would be beneficial to add another revision to include a phrase that explicitly gives a global scope to our school’s goal of education.

As a comparison, I looked up the mission statements for all of the universities in the Ivy League. Some make an explicit point about global knowledge dissemination, and some come close. I would argue that they all offer more support for something like OpenCourseWare, though for a couple (mostly just Harvard), that’s easily contestable.

Continue reading →


18
Feb 10

Proposal: OCW at Dartmouth College

I’ve been working on this proposal for OpenCourseWare at Dartmouth College to help with my advocacy there. I’m hoping to present it to President Kim after I return to campus in the Spring. I also hope that it will be useful for activists at other schools. It’s of course CC licensed. Click here for a recent version in .doc format. There’s an unformatted (wiki markup only) copy/paste of the full text after the jump. It’s also up on the freeculture.org wiki, if you want to play around with it in wiki format. However, since I may not check the wiki page often, the best way to suggest changes back to me is probably to email me or write in the comments.
Continue reading →


31
Dec 09

Project I Haven’t Started: Liberate All My Data Every Night Forever

I want to collect a bunch of python scripts that log in to your account on a network service, scrape out as much data of yours as possible, and save that data in an easily parseable format. For some services, such as gmail and Google Docs (and some other Google services, thanks in part to the Data Liberation Front), this is mostly a question of doing the logging in and then clicking a few buttons to use something convenient like an “export” feature. For other services, such as facebook, a bunch of python needs to be written that manually crawls through pages and grabs all relevant data.

I want to have all of these scripts in a central, open repository that invites contributions. (As an aside, I’d also like a more general central repository/tracker for bits of python that are useful for crawling websites–from a full, manual wikipedia scraper to something like a single function that gets you a logged-in cookie for gmail.com). Once I have this repository, I want to pack all the software together with an easy front-end where you simply enter your login credentials for all the network services that you use. I want this to run every single night, doing incremental backups of all of my data.
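One way such a repository could hang together is a small registry that each contributed script plugs into, with a single runner that calls every registered exporter and saves the result as JSON. This is only a sketch under that assumption: the decorator, the `run_all` function, and the "example-service" exporter are all hypothetical, and the placeholder does no real logging in or scraping.

```python
import json
from pathlib import Path

# Maps a service name to the function that exports data from it.
EXPORTERS = {}


def exporter(service_name):
    """Decorator that registers a function as the exporter for one service."""
    def register(func):
        EXPORTERS[service_name] = func
        return func
    return register


@exporter("example-service")
def export_example(credentials):
    # Placeholder: a real script would log in here and scrape pages
    # or trigger the service's own export feature.
    return {"user": credentials["username"], "items": []}


def run_all(credentials_by_service, out_dir):
    """Run every known exporter and write one JSON file per service."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, creds in credentials_by_service.items():
        func = EXPORTERS.get(name)
        if func is None:
            continue  # nobody has contributed a script for this service yet
        path = out_dir / f"{name}.json"
        path.write_text(json.dumps(func(creds), indent=2, sort_keys=True),
                        encoding="utf-8")
        written.append(path)
    return written
```

The front-end would then only need to collect credentials and call something like `run_all({"example-service": {"username": "alice"}}, "backups")` from a nightly job.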

I want this for the implication for computing autonomy–using proprietary network services is still bad because I can’t understand or build on (or share) the code that I am running (or, more closely, causing to be run), but at least this way I can feel like I have more “control” over my data (more on this two paragraphs down).

I also want this for the practical benefits. Techies sometimes say about computer data that “if it isn’t backed up, it doesn’t exist.” We often (rightfully) think of putting our data into a network service as making it more likely to live forever–after all, they often have their own backup system which is much more robust and heavily funded than ours. But of course we should not assume anything about what a proprietary network service is doing behind the scenes! There may be no backup system at all, or it may fail (see the sidekickocalypse), or the service-providers may just decide to pull the plug and let all your data die (see the Geocitenocide). Also, if a service goes down but it has backups, there is nothing you can do to expedite the process of having those backups restored. Your data is “safe” but it’s locked up until further notice. This is why I want a client that runs these scrapes every night, ideally incrementally (maybe a browser plugin can track when you add data to a site? Maybe this whole idea ought to be implemented as a browser plugin?).
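A simple approximation of nightly incremental backups is to hash each export and write a new dated snapshot only when the hash has changed since the last run. The sketch below assumes the export is JSON-serializable; the function name and the file naming scheme are made up for illustration. It skips unchanged nights rather than storing true diffs.

```python
import hashlib
import json
from pathlib import Path


def save_if_changed(service, data, backup_dir, date_stamp):
    """Write a dated snapshot only when the data differs from the last one saved.

    Returns the snapshot path, or None when nothing changed.
    """
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # sort_keys makes the serialization stable, so equal data hashes equally.
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()

    marker = backup_dir / f"{service}.latest-sha256"
    if marker.exists() and marker.read_text(encoding="utf-8").strip() == digest:
        return None  # nothing new since the previous run

    snapshot = backup_dir / f"{service}-{date_stamp}.json"
    snapshot.write_bytes(payload)
    marker.write_text(digest, encoding="utf-8")
    return snapshot
```

Because every snapshot is a complete copy, restoring is just a matter of opening the most recent file, at the cost of more disk space than a diff-based scheme would use.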

Of course, knowing that I always have access to my data in easily parseable formats has another important advantage: it makes it easier to leave one network service for another (or just altogether), especially if an additional part of this project is to collect scripts that can take these exported chunks of data and import them back in to other services or just other pieces of software (I’d like all of my flickr photos on facebook and also in my on-disk copy of F-Spot. Also post profile data from facebook back on my myspace. kthxbai.) Locking up data has long been a malicious strategic device used to keep people using your software/service even after they’ve decided they don’t want to anymore–from network services that are data black holes to locally-run software that uses DRM or generally proprietary file formats.

With this type of control over my data, it’s easier to leave a proprietary network service (this is a good reminder that computing autonomy is strongly related to data control–or data ownership, if you prefer). This has useful implications for people who occupy a middle ground on computing freedom in relation to network services. These people may believe that the usefulness of a computing resource is more important than its respect for one’s freedom–these are the “I’d use a Free alternative if one existed that was as good (or at least nearly as good)” people. These people will have fewer excuses when a great piece of free software comes together that fits their needs (and, conversely, they’ll have the freedom to leave if/when a nicer proprietary service comes along).

Of course, the TOS for some (most?) of these services may be violated by the use of the software that I’m describing. However, I for one would love to see what happens when a bunch of people violate TOSs by doing this. Can we script cleverly enough that the service can never tell? Do people get found out and all lose their accounts? Do they really care that much, now that they have all of their data? Do they write angry blog posts about how they were “booted off of facebook for trying to download [or even 'take back'] their own data,” which eventually end up on larger news outlets? Do these stories make people care more about data portability and computing autonomy in network services? Does facebook come back with its tail between its legs and implement its own export feature?


29
Dec 09

“I Wish The ‘is now following you’ emails from Twitter were more useful”

Every few days, I get an email from twitter telling me that someone new is following me. 70% of the time it’s spammers, 20% of the time it’s people I don’t care about, and 10% of the time it’s people who I want to follow back (and I’m thankful that twitter gave me the heads up!)

The problem:
Though these emails include a few statistics about the user (number of tweets, number of followers, number of people following), they don’t include any of their actual tweets. This is unfortunate, because the best way to tell whether or not you care about someone on twitter is to look at what they say! As it is, I have to click a link in the email to visit their profile if I want to do that.

The solution:
Some python that I wrote, which logs in to your email, finds these messages from twitter, deletes them, and sends you more useful messages that include recent tweets by the person who has just followed you!
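The script itself is not reproduced here. As an independent sketch of the middle step (recognizing a follow notification and composing the more useful message), the snippet below uses only the standard library. The subject-line pattern is an assumption about how these emails were worded, and the mail login and tweet fetching are left out entirely.

```python
import re
from email import message_from_string

# Assumed subject format; the real wording of the notification may differ.
FOLLOW_SUBJECT = re.compile(r"\(@(?P<handle>\w+)\) is now following you",
                            re.IGNORECASE)


def follower_handle(raw_email):
    """Return the new follower's handle, or None if this is not a follow notice."""
    msg = message_from_string(raw_email)
    match = FOLLOW_SUBJECT.search(msg.get("Subject", ""))
    return match.group("handle") if match else None


def build_digest(handle, recent_tweets):
    """Compose the body of the replacement email."""
    lines = [f"@{handle} is now following you. Recent tweets:"]
    lines += [f"  - {tweet}" for tweet in recent_tweets]
    return "\n".join(lines)


# Hypothetical notification email.
raw = (
    "From: notify@example.com\n"
    "Subject: Alice Example (@alice_ex) is now following you on Twitter!\n"
    "\n"
    "Alice Example is now following your tweets.\n"
)
handle = follower_handle(raw)
if handle:
    print(build_digest(handle, ["First tweet", "Second tweet"]))
```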

In the future, I’d like to make this happen for identica as well. Also, I’d like to (and am very confident that I can write the code to) be able to follow the person back by responding to the email.


27
Nov 09

I sometimes write code that is almost useful

For example, today I wrote a python script that scrapes the images out of a gallery2 install. I did this for a few reasons:
  • I just learned a bunch of new tricks for vim, so I wanted to exercise them before I forget them
  • I want to become comfortable writing python
  • I want to become comfortable with (at least some of) the python libraries involved in web scraping

Code on pastebin

update: github’d
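Separately from the linked script, here is a small sketch of one chore any gallery scraper runs into: choosing a unique local filename for each image URL, since different albums often reuse names like IMG_01.jpg. It assumes the URLs end in a filename, which may not hold for every gallery2 install; the function name and example URLs are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlparse


def plan_downloads(image_urls):
    """Map each image URL to a unique local filename."""
    planned = {}
    used = set()
    for url in image_urls:
        if url in planned:
            continue  # same URL listed twice; one download is enough
        # Take the last path segment, decoding escapes such as %20.
        name = posixpath.basename(unquote(urlparse(url).path)) or "image"
        stem, ext = posixpath.splitext(name)
        candidate, n = name, 1
        while candidate in used:
            # Name collision: append a counter before the extension.
            candidate = f"{stem}-{n}{ext}"
            n += 1
        used.add(candidate)
        planned[url] = candidate
    return planned


# Hypothetical gallery URLs.
urls = [
    "http://gallery.example.org/albums/a/IMG_01.jpg",
    "http://gallery.example.org/albums/b/IMG_01.jpg",
    "http://gallery.example.org/albums/b/my%20dog.png",
]
for url, filename in plan_downloads(urls).items():
    print(filename, "<-", url)
```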


5
Oct 09

A Response Op-Ed on Dartmouth OCW

Update: Huge Huge thanks to Cole Ott, Kevin Donovan, and Jared Benedict for their help collecting arguments and editing drafts. Sorry I forgot to include this note initially. This piece would have sucked far more without you guys’ help!

I’m still waiting to see whether or not The Dartmouth will publish my response. So far I haven’t heard anything back.

I am writing in response to “No Such Thing as Free Learning,” from the September 30th issue of The Dartmouth.

In her article, Johnson questions the benefits of OpenCourseWare, writing, “we should not fool ourselves into thinking that publishing course material serves any purpose but to garner publicity for the College.” Publicity is only one advantage of OpenCourseWare, and it is perhaps the least important one. OCW also has immense advantages within the university. The 2005 audit of MIT OCW showed that 71% of students, 42% of alumni, and 59% of faculty used it. Students use it to learn more about a course that they’re considering signing up for, or to follow along with one that they can’t fit into their schedules. Alumni use it for continuing education and to maintain a feeling of connection with their alma mater (donations, anyone?). Most importantly, professors use OCW to observe their colleagues (both on campus and at other schools) in order to learn from their teaching methods and to identify potential collaborations. In this way, OpenCourseWare expands learning across generations within the university itself.

However, the whole point of OCW is that it expands learning beyond the university. In her article, Johnson challenges the idea that OCW systems are effective learning resources, writing “A student would have to be a very rare breed of self-starter to be able to gain anything from the available course materials.” Now, I’ll be the first to say that there’s nothing quite like being in the classroom and taking part in a dialogue involving both students and professors. It is my humble opinion that this is the most effective way to educate, and Dartmouth professors ought to stress discussion more in their courses. This is also the reason why a Dartmouth OCW system would never de-incentivize enrollment in the college.

That being said, lectures and syllabi can be incredibly useful to people who don’t have access to high-quality learning resources or who just need something that is free, accessible, and fast. Imagine the farmer in Kenya who wants to increase his crop yield or the student in Argentina who can’t understand her out-of-date textbook. One student in the US used MIT’s OCW to help him study for the physics AP exam. The front page of the MIT OCW website has a large banner linking to a page with many of these stories of how their system has been useful to students, educators, and independent learners around the world.

Clearly, OCW is not just something that worked once for MIT because they’re a big name and they were the first. The OpenCourseWare Consortium has over 200 member universities, and that doesn’t include many other open course projects such as Yale Open Courses and Harvard Medical School’s MyCourses. The future of education allows people of all backgrounds access to learning resources from top professors around the world with the click of a button.

If we are to make the world’s problems our own problems, as President Kim has urged us to, there is an obvious moral argument for why we as an institution should not be hoarding the great learning resources that we are creating. The demand for higher education is increasing far more rapidly than our universities can accommodate. Our mission should be to expand education and knowledge worldwide, from Hanover to Hanoi.

Johnson’s article brings up a cost-benefit analysis, which is an important thing to consider. It’s true, MIT OCW is expensive. However, it is largely funded by outside grants (from the William and Flora Hewlett Foundation and the Andrew W. Mellon Foundation, to name a few), thus it is not true that every dollar put towards OCW is taken away from another aspect of the university. Furthermore, MIT is not the only model—the University of Michigan significantly reduces costs in their OCW system by using students in the publishing process through their dScribe system. Here at Dartmouth, many courses in both Thayer and the Physics & Astronomy department are already being captured on video for internal use. Also, many courses in the computer science department have lecture notes, syllabi, homework assignments, and even practice exams that are publicly available from the department’s website. The cost of making these materials OpenCourseWare would be very small—the barriers are almost purely administrative.

Finally, implementing OpenCourseWare at Dartmouth would be far more than simply a hop on the higher-ed bandwagon. Because of the transparency of OCW, each new system has the ability to observe existing ones to learn from and build off of them with fresh ideas. Though it’s clear that OpenCourseWare is part of the future of higher education, it’s not yet clear what the most effective system looks like. Dartmouth could really push the movement by exploring how OCW systems could be more collaborative and participatory. I sincerely believe that by building on the work of our peers and adding our own twist, Dartmouth can use OpenCourseWare to truly advance higher education in a lasting way.


9
Sep 09

LolBro

This thing I put together instead of doing something else: Dontxmebro.com.