Monday, March 9, 2009

Making the Masses Useful

You know those word verification thingys that everyone hates? They show up on websites where you can leave comments or on sites where you can purchase event tickets. A lot of times it’s nonsense words like inquive or jublut or sisupfix. The technical term for these things is “captcha,” which is short for (and I’m not making this up) “Completely Automated Public Turing test to tell Computers and Humans Apart.” They serve several purposes, but primarily they ensure that a living, breathing human is behind the comment or purchase, not a computer or automated spamming system. With captchas, a company can’t go online and buy all the Dave Matthews Band tickets for annual shows at the Gorge and then resell them at higher prices, because the captchas slow them down. But sometimes, you’ll notice, the captchas aren’t nonsense words at all; they are real words that are just difficult to read. In those cases, it’s not a captcha, it’s a reCaptcha.

reCaptchas are the brainchild of Luis von Ahn, an assistant professor of computer science at Carnegie Mellon University in Pittsburgh. He invented captchas in 2000, but was willing to admit that, while they served their security purpose, they also seemed a huge waste of time for most people. So in 2007, he had another great idea. He teamed up with the Open Content Alliance, an organization out of San Francisco that aims to scan all public domain printed material into one giant, free, online library. They mainly scan older books that have outlived their copyright protections. As you can imagine, it’s a bit of a grueling process to scan all the pages into the system. The process is made harder by the fact that in many old books the type is small, smeared, faded or otherwise difficult to read. If the computer can’t make sense of a word, then it is flagged to be checked individually by a staff member, which slows down the whole process.

So von Ahn came up with a brilliant way to speed things up. When a word is flagged as unreadable, it is sent out to a captcha system, to be used on the web as a verification tool. In a regular old captcha (still in use by some sites), a random nonsense word comes up on the screen for verification. With a reCaptcha, two words come up. The first is a control word that the computer already knows. The second is one it can’t decipher. Once several people have typed in the word the same way, that spelling is taken as correct and is sent back to the Open Content Alliance, where it is placed back in its proper place within the digitized manuscript. The system’s accuracy rate of 99.1 percent is about the same as that of professional human transcribers. It’s a brilliant use of manpower, with each person on the web contributing a tiny amount of their time.
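For the programmers in the audience, here is a rough sketch in Python of the two-word idea described above. It is only my guess at the general shape, not the actual reCaptcha code: the function names, the agreement threshold of three, and the choice to ignore capitalization are all assumptions made up for illustration.

```python
from collections import Counter
from typing import Optional

# Hypothetical threshold: how many matching answers count as "several people".
AGREEMENT_THRESHOLD = 3


def normalize(word: str) -> str:
    """Ignore surrounding spaces and capitalization (an assumption)."""
    return word.strip().lower()


def record_answer(
    votes: Counter,
    typed_control: str,
    known_control: str,
    typed_unknown: str,
) -> Optional[str]:
    """Record one visitor's answer for a two-word challenge.

    Returns the accepted spelling of the unknown word once enough
    visitors agree on it, otherwise None.
    """
    # The control word is the one the computer already knows. If the
    # visitor gets it wrong, their guess at the unknown word is ignored.
    if normalize(typed_control) != normalize(known_control):
        return None

    guess = normalize(typed_unknown)
    votes[guess] += 1

    # Once enough people have typed the word the same way, accept it.
    if votes[guess] >= AGREEMENT_THRESHOLD:
        return guess
    return None
```

Each flagged word would get its own `Counter` of votes. Only visitors who get the control word right have their guess counted, and once a spelling comes back from `record_answer`, it could be sent back to the scanning project.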

One of the Open Content Alliance’s latest projects was to digitize the entire catalogue of the New York Times. It is estimated that over 400 million people helped with this project, whether or not they knew they were helping. That is about 6% of the world's population!

More on reCaptchas here.

8 comments:

Herbal Amanda said...

That is just too cool! I love the internets!

...and when will Blogger get with the program?

Sean said...

I am so glad you discovered a word for them...Captcha...ahhhh :))

Sean said...

slondod

Sean said...

bralla

Sean said...

sersestm extersh dingueti...are they just making these words up?

Jillian said...

Sometimes I click to read the comments on someone's blog, and the captcha is so funny that I feel I have to leave a comment just so I can type it out :).

wv: unadwag

Troy said...

Sometimes I feel like I'm typing something very dirty.

For example, "He was arrested for being a FRANOT in public."

Herbal Amanda said...

i like this one lentlog