reCAPTCHA - Reusing the energy of fighting evil for social good

The other day I discovered reCAPTCHA, one of those really innovative, nifty little ideas that warm my heart. Every day all of us netizens spend so much of our energy fighting the evils of spam - service providers keep warding off waves of attacks, while users keep assuring providers through various means that they are not bots working on behalf of the dark side (read: spammers).

One of the ways of ensuring that the person accessing a web page is actually a human is to have them pass a CAPTCHA test. These are those weirdly distorted, often multicolored, mostly English words that you are asked to type in on various sites like Yahoo, Google, Orkut, message boards, etc.

Millions of such captchas are solved every day, and the beauty of the reCAPTCHA project is that it channels the effort you and others put into solving these puzzles into a noble cause - a giant project currently being taken up by the Internet Archive to digitize our literary heritage. Many out-of-print books of the past, which would otherwise be lost and forgotten, are being scanned and their text versions provided freely to the world. The scanning process is fraught with errors that come from scanning blotted or disfigured text. The reCAPTCHA process, a brilliant idea by a CMU student, connects the energy spent on solving captchas with the massive human input needed to solve the scanning problem. How does it work? Their page explains it very well, but here is a shortened extract.

reCAPTCHA improves the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher. More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly.

But if a computer can’t read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here’s how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
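To make the extract above a little more concrete, here is a minimal Python sketch of the two-word trick as I understand it. This is not reCAPTCHA's actual code: the class name, the method names and the vote threshold are all made up for illustration, and the extract only says the word goes to "a number of other people" without giving a figure.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers we want before we
# trust a transcription. The real number is not given in the extract.
MIN_AGREEING_VOTES = 3


def normalize(text):
    """Compare answers ignoring case and surrounding whitespace."""
    return text.strip().lower()


class WordPairChallenge:
    """Illustrative sketch: one known control word plus one unknown word."""

    def __init__(self, min_votes=MIN_AGREEING_VOTES):
        self.min_votes = min_votes
        # unknown word id -> Counter of answers typed by users who passed
        self.votes = defaultdict(Counter)

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Return True if the user typed the known control word correctly."""
        if normalize(control_answer) != normalize(control_truth):
            # Failed the word we can check, so the guess for the
            # unknown word is thrown away as well.
            return False
        # Passed the control word: count the other answer as one vote.
        self.votes[unknown_id][normalize(unknown_answer)] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed transcription, or None if not confident yet."""
        counts = self.votes.get(unknown_id)
        if not counts:
            return None
        answer, count = counts.most_common(1)[0]
        return answer if count >= self.min_votes else None
```

The point of the design is that the site only ever grades the control word; the unknown word simply collects votes from people who passed, and it is accepted once enough of them agree.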

They also have a nifty related idea that channels the effort spent on email obfuscation into the same cause - it is called Mailhide.
