Alice Wants Public Domain Images
Alice is an educator who needs an image of a poodle for a lecture slide. Alice wants to share her work (say, under Creative Commons Attribution), so she needs to make sure that she has permission to share all of the pieces she is borrowing.
Alice searches Google for “public domain image poodle,” but doesn’t quite get what she wants. She digs further and discovers that public domain images are scattered across the web:
- There are the common repositories, like archive.org, Wikimedia Commons, Flickr, and Openclipart.
- Then there are the hundreds of repositories maintained independently by libraries, universities, federal government agencies, and even hobbyists. Some, but not all, of them are listed on the CC wiki.
- Finally, there are the broader search engines, like Google Image Search and Yahoo Image Search, which let you search across a bunch of these for openly licensed images, but often miss things for various reasons (perhaps the licensing information was not articulated in a way that the search engine could pick up).
Alice is confused and overwhelmed.
Suddenly, finding an image of a poodle is a huge can of worms for Alice.
Crucially, what we usually call “openness” isn’t going to help Alice here: the images are already “open” because they are available online under a permissive license. Yet they’re not really discoverable.
In fact, “open” misses several features that Alice should really expect from her image search (hi-res files, standard formats, etc).
Parker Is Trying to Help Alice
For the past year I’ve been working in this problem space, starting with a project last winter to scrape out and mirror all of the images in the CDC’s Public Health Image Library. This summer, I drafted the Usable Data Criteria 1.0, which enumerates all of the things that Alice wants in her public domain image search, beyond just openness.
This spring, with generous funding from Digital Studies at Dartmouth College, I’m continuing work on making federally owned, public domain images (listing of databases here) more usable, with a focus on discoverability.
My approach is two-pronged:
Crowdsourced Metadata Tags to Improve Search
I’m working with the Metadata Games project, which makes it easy for the crowd to add metadata tags to images by playing games. My focus will be on making it easy to add an image set to Metadata Games and to dynamically retrieve the metadata tags in a machine-readable format. This way, Bob the librarian, who has an image set, can use Metadata Games to retrieve tags for his images that improve over time (and he can give his patrons an easy and fun way to help with tagging). Better tags mean a better search experience for Alice.
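To make "machine-readable tags" a little more concrete, here is a minimal sketch of how Bob might consume a tag export. The payload shape, the `SAMPLE_EXPORT` data, and the `tags_by_image` helper are all hypothetical; the actual Metadata Games output may look quite different.

```python
import json
from collections import Counter

# Hypothetical export: one row per tag submitted by a player.
# This is NOT the real Metadata Games format, just an illustrative shape.
SAMPLE_EXPORT = """
[
  {"image_id": "img-001", "tag": "poodle"},
  {"image_id": "img-001", "tag": "Poodle "},
  {"image_id": "img-001", "tag": "dog"},
  {"image_id": "img-002", "tag": "microscope"}
]
"""


def tags_by_image(export_json, min_votes=1):
    """Group crowd tags per image, keeping tags submitted at least min_votes times."""
    counts = {}
    for row in json.loads(export_json):
        # Normalize so that "Poodle " and "poodle" count as the same tag.
        tag = row["tag"].strip().lower()
        if not tag:
            continue
        counts.setdefault(row["image_id"], Counter())[tag] += 1
    return {
        image_id: sorted(t for t, n in counter.items() if n >= min_votes)
        for image_id, counter in counts.items()
    }


if __name__ == "__main__":
    print(tags_by_image(SAMPLE_EXPORT))
```

Raising `min_votes` is one simple way to let tags "improve over time": a tag only shows up in search once enough players have agreed on it.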
Downloading and Cross-Posting Public Domain Images from Federal Agency Archives
Using web scraping techniques, I’m downloading images and their metadata from various federal agency image archives and making them more accessible to Alice in a number of ways:
- providing additional tags from Metadata Games
- cross-posting the images to other image archives
- most importantly, providing one meta-index, updated nightly, with a machine-readable interface
The machine-readable meta-index will make all of the images in the federal image archives in my system easy to add to any larger search engine or dataset.
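As a rough illustration of what one such meta-index could look like, the sketch below serializes image records as JSON Lines (one JSON object per line). The field names, the `make_record` / `write_index` / `read_index` helpers, and the example.org URL are assumptions made for this example, not the format the project actually uses.

```python
import json


def make_record(source, image_url, title, tags):
    """Build one index entry; field names here are illustrative only."""
    return {
        "source": source,
        "image_url": image_url,
        "title": title,
        "tags": sorted(set(tags)),
        "rights": "public domain",
    }


def write_index(records):
    """Serialize records as JSON Lines so consumers can stream the index."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def read_index(text):
    """Parse a JSON Lines index back into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    records = [
        make_record(
            "Example Agency Archive",
            "https://example.org/images/poodle.jpg",
            "Poodle",
            ["dog", "poodle", "dog"],
        )
    ]
    index_text = write_index(records)
    print(index_text)
    assert read_index(index_text) == records
```

A line-oriented format like this is easy to regenerate nightly and easy for a larger search engine to ingest incrementally, though a feed or an API would be equally reasonable choices.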
The code that I’m writing for image archive scraping will be generalized into a Python library that’ll be available under the GPL.
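For a sense of what a per-archive parser involves, here is a minimal sketch that uses only Python's standard library to pull image URLs and alt text out of an archive page. The `ImageCollector` class, the `extract_images` function, and the example.org markup are hypothetical; this is not the library's actual code, and real archives usually need site-specific handling for captions and credits.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the URL and alt text of every <img> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if src:
            self.images.append({
                # Resolve relative paths against the page URL.
                "url": urljoin(self.base_url, src),
                "caption": attr.get("alt") or "",
            })


def extract_images(html, base_url):
    """Return a list of {'url', 'caption'} dicts found in the given HTML."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.images


if __name__ == "__main__":
    page = '<p><img src="/img/1.jpg" alt="Poodle"><img alt="no source"></p>'
    print(extract_images(page, "https://example.org/archive/"))
```

Downloading the page itself (and the image files) is left out on purpose, since each archive has its own pagination and rate limits to respect.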
Help Me Help Alice
What do I do next?
In terms of these federal agency images, what is the best way to put them where Alice will see them? Is it to cross-post to archive.org and Wikimedia Commons? Is it to create Yet Another Public Domain Image Search that makes them all accessible through one search engine? Is it to simply host mirror copies with semantic metadata articulating that the images are in the public domain, so that, e.g., Google Image Search can correctly index them? Do we really just need One Usable Image Search to Rule Them All?
I need some advice from people who know a thing or two about image search, library and archival science, etc. I’m also looking for good journal articles about image search and image tagging. And if you’re interested in helping write or review code (such as HTML parsers for individual image archives), that would rule too.
Drop me a line: parker@madebyparker.com
I’m living and working on this project in Boston this spring.
Heavily edited 4/13/12. Thanks, Asheesh, for your feedback!
