Parts of Speech, Bayesian Analysis, Magic

Oh, spelling and grammar. My greatest educational and professional bane. My (close to) greatest source of embarrassment. Until roughly the late 19th century, spelling for most people, even the “educated classes,” was essentially a process of deciding what looked good. It wasn’t even probabilistic so much as optimistic. Note that Terry Pratchett described one character’s writing approach as more “ballistic” than probabilistic. That is, he fired away and charged on regardless.

It was only the creation of publicly available schooling that really started to alter the spelling landscape. That and Samuel Johnson’s dictionary, which was essentially the first of its kind.

Bayesian analysis is, honestly, a little bit like magic. It is, for me, very reminiscent of the Arthur C. Clarke quotation that inspired my blog’s name: “Any sufficiently advanced technology is indistinguishable from magic.” Bayesian analysis is, in many respects, quite accessible once you have read through the formula a few times. At the same time, when actually working with a good training set and a good test set, its ability to identify patterns (parts of speech, spam, etc.) is really nothing short of magical. I’m always reminded of the silly question “how does the aspirin know where the headache is?” But it does seem that way. The impact of probability on our ability to perform these sorts of analyses is transformational.
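
For anyone who has not stared at the formula yet, here is a minimal sketch of Bayes’ rule applied to a single word in a spam filter. The function name and every probability in it are made up purely for illustration; none of them come from a real training set.

```python
def posterior(prior, likelihood, likelihood_other):
    """Bayes' rule for two classes: P(class | evidence)."""
    # Total probability of seeing the evidence, summed over both classes.
    evidence = likelihood * prior + likelihood_other * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 40% of mail is spam, and a given word shows up
# in 20% of spam messages but only 1% of legitimate ones.
p_spam_given_word = posterior(prior=0.4, likelihood=0.20, likelihood_other=0.01)
print(f"P(spam | word) = {p_spam_given_word:.2f}")
```

The “magic” is mostly this one division, repeated over many words and many classes.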

However, I also want to note that I have been working with variants of these technologies for many, many years, and the expression about “almost only being acceptable for horseshoes and hand grenades” applies equally to grammar, spelling, OCR, writing and speech recognition.

I have been working, as an extension of my professional responsibilities, with OCR, grammar checking, handwriting recognition and speech recognition for more than 20 years. I have tried every commercially available OCR package, grammar checking package (yes, there used to be more than one) and handwriting recognition package (LiveScript currently makes the best on the market, but people have completely forgotten the Apple Newton and the early Palm recognition applications). Similarly, Kurzweil’s speech recognition systems were originally hardware based and there were multiple other packages out there.

OCR, in particular, is critical in the legal profession. The core of litigation is to conduct “discovery” wherein you review *enormous* quantities of documents looking for materials relevant to the matter. Historically, this has meant paper. Boxes and boxes and boxes of paper. Warehouses full of paper that had to be OCR’d. For many years, one couldn’t expect an accuracy of much more than 75% – 80%. That means one in every 4 or 5 words is wrong. Not just wrong, but *completely* wrong. When you’re reviewing or searching through hundreds of thousands of documents, this is a serious issue. Speech recognition was about the same and handwriting recognition was an unmitigated disaster.

Over the last 10 years or so, the accuracy has gone up significantly, so that many vendors claim 99%+ on “good OCR” and 95% on speech and handwriting recognition.

I feel comfortable in saying that (1) these numbers are still wildly inflated and (2) even if they are accurate, you need to look at a broader context. Namely, documents tend to average 250 to 500 words per page. A typical “bankers box” (a standard box for storing papers) holds between 5,000 and 15,000 pages. This means that, even at 99% accuracy, there will be between 12,500 and 75,000 misrecognized words in a single box of documents. If you are relying on word and phrase search to review these documents (which you must), then this is a major problem. There are no better solutions, but OCR is still really only a mediocre technology at best. When it attempts to handle formatting and the like, it tends to fail utterly.
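
The per-box figures are easy to check. This is only a back-of-the-envelope sketch: it takes the vendors’ 99% claim at face value, and it assumes that accuracy is measured per word and that errors are spread evenly. The function name is my own.

```python
def expected_ocr_errors(words_per_page, pages, accuracy):
    """Expected number of misrecognized words at a given word-level accuracy."""
    return words_per_page * pages * (1.0 - accuracy)


# Low end: 250 words/page, 5,000 pages. High end: 500 words/page, 15,000 pages.
low = expected_ocr_errors(250, 5_000, 0.99)
high = expected_ocr_errors(500, 15_000, 0.99)
print(f"{low:,.0f} to {high:,.0f} wrong words per box")
```

Drop the accuracy to the 75% to 80% that used to be normal and the same function returns wrong words in the hundreds of thousands to millions per box.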

Speech recognition and handwriting recognition are, at best, 95% accurate. Again, a number that I believe to be highly overstated. Considering that the average human being speaks at between 80 and 120 words per minute in normal conversation, this means for every minute of dictation or transcribed dialog, there will be at least 4 to 6 completely incomprehensible errors. Worse, they will not be misspellings but completely wrong words. If you are going to dictate for 20 minutes, you are going to spend at least another 20 minutes identifying the misinterpretations that the system has generated.
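
The same rough arithmetic works for dictation. Again, this is a sketch that assumes the 95% figure is a per-word accuracy; the function name is hypothetical.

```python
def wrong_words_per_minute(words_per_minute, accuracy):
    """Expected wrong words per minute of speech at a given word accuracy."""
    return words_per_minute * (1.0 - accuracy)


for wpm in (80, 120):
    per_minute = wrong_words_per_minute(wpm, 0.95)
    # Scale up to the 20-minute dictation session mentioned above.
    print(f"{wpm} wpm: about {per_minute:.0f} wrong words per minute, "
          f"{per_minute * 20:.0f} over 20 minutes")
```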

In short, I feel all of these technologies still have significant room for improvement.

Getting the code to work was, frankly, quite a bit more difficult than previous work, and unfortunately, I still don’t have anything to show.
