A friend writes:
I was just buying some tickets and had to type the security word thing, and as I was getting it wrong four times in a row I learned that I was actually helping (or perhaps hurting, by getting them wrong) to:
“Digitize books one word at a time by entering the words in the box, you are also helping to digitize books from the Internet Archive and preserve literature that was written before the computer age.”
I’ve spent my whole day trying to figure out how this possibly could work.
My guess is that the mistakes give them a similarity measure between letters in different fonts. For instance e’s are similar to a’s because people often mistake an e for an a and vice versa, but e’s are not similar to k’s since people rarely make that mistake. This means the mistakes are more useful to them than correct answers. But I’m not sure what the road is from a similarity matrix to OCR software. Suggestions?
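To make the guess concrete, here is a minimal Python sketch of how typing mistakes could be tallied into a letter-similarity table. It is only an illustration of the idea in the email, not a description of what the ticket site does. The function name `confusion_counts` and the sample data are hypothetical, and the sketch assumes the correct word is already known for each typed answer.

```python
from collections import Counter

def confusion_counts(pairs):
    """Tally how often two letters are mistaken for each other.

    `pairs` is an iterable of (shown_word, typed_word) tuples. This assumes
    the shown word is known, which is an assumption made for illustration.
    """
    counts = Counter()
    for shown, typed in pairs:
        # Only compare answers of the same length so letters line up.
        if len(shown) != len(typed):
            continue
        for a, b in zip(shown, typed):
            if a != b:
                # frozenset makes the count symmetric: e-for-a and a-for-e
                # land in the same bucket.
                counts[frozenset((a, b))] += 1
    return counts

# Hypothetical sample data.
sample = [("bread", "braad"), ("beat", "beet"), ("kite", "kite")]
counts = confusion_counts(sample)
print(counts[frozenset("ea")])
print(counts[frozenset("ek")])
```

Higher counts mean the two letters are confused more often, which is the similarity measure the email describes.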

5 comments
Tuesday ~ April 27th, 2010 at 11:51 pm
Joshua Zelinsky
This doesn’t work in a way nearly that sophisticated. The process is called reCAPTCHA. Outline of how it works: Books are scanned with standard OCR. When the OCR cannot recognize a word, it saves the image of the word. That image is later used in a reCAPTCHA, which is a CAPTCHA consisting of two words: one that is unknown and one that is already known. You don’t know which is which, so you need to enter both. Each word requiring reCAPTCHA is presented multiple times to different people. If many of them agree on the same string, the system decides that this is what the word actually says. This is just a clever capture of computing that would already occur. Nothing as fancy as a neural net. However, the stored data from reCAPTCHA might one day be used to train neural nets.
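The agreement step described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not reCAPTCHA's actual code: the function names, the thresholds `min_votes` and `min_share`, and the case-insensitive matching are all made up for the example.

```python
from collections import Counter

def record_answer(known_word, typed_known, typed_unknown, pool):
    """Keep the unknown-word answer only if the known word was typed correctly.

    `pool` is a plain list of answers collected so far for one unknown word.
    """
    if typed_known.strip().lower() == known_word.lower():
        pool.append(typed_unknown)
        return True
    return False

def consensus(answers, min_votes=3, min_share=0.6):
    """Return the agreed transcription, or None if there is no clear winner.

    The thresholds are assumptions chosen for illustration.
    """
    # Normalize so trivial differences in case or spacing do not split votes.
    votes = Counter(a.strip().lower() for a in answers if a.strip())
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    total = sum(votes.values())
    if count >= min_votes and count / total >= min_share:
        return word
    return None

# Hypothetical answers from four different people for one unknown word.
pool = []
record_answer("tickets", "tickets", "morning", pool)
record_answer("tickets", "Tickets", "Morning ", pool)
record_answer("tickets", "tikcets", "xxxx", pool)      # failed the known word
record_answer("tickets", "tickets", "morning", pool)
record_answer("tickets", "tickets", "mourning", pool)
print(consensus(pool))
```

The known word acts as the check that a person is typing carefully, and the vote over the unknown word does the digitizing.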
Wednesday ~ April 28th, 2010 at 8:25 am
Adam Ozimek
Thanks Joshua, that is much simpler than what I had in mind… BTW, I didn’t actually think they used neural nets; the title is from a Terminator 2 quote. But I am intrigued that you say they could.
Wednesday ~ April 28th, 2010 at 9:48 am
Joshua Zelinsky
Yes, thanks I recognized the quote. I do know that there has been work on teaching neural nets to do OCR but I don’t know how much success there has been (my impression is not much).
Wednesday ~ April 28th, 2010 at 11:11 am
Adam Ozimek
I’ve seen multidimensional scaling used for handwriting recognition, but the similarity measure there comes from eigenanalysis of a pixelated image of the actual text. In principle I suppose the same thing could be done with digitized books. I’m not sure how the data from this type of CAPTCHA could inform that process.
I’m pretty unfamiliar with neural nets, do they work similarly to that?
Wednesday ~ April 28th, 2010 at 12:29 pm
Joshua Zelinsky
I don’t know enough to comment. The end of neural nets I’m familiar with (and even then not very familiar at all) is the underlying mathematical modeling of basic neural net systems and a few theorems to the effect of what sort of functions you can actually represent from R^m to R^n with a neural net. Is there someone else reading who has more of an idea?