Corrections to the blogosphere, the consensus, and the world

Wednesday, April 22, 2009

Why I hate Bill Gates

One of the irritating things about my life is that I’ve always thought that I had no route to vast riches; the money-making schemes I think up have always been thought of before. One of the even more irritating things is when this turns out not to have been so, and that nobody had in fact thought of it.

Some ten or so years ago, I noted that the spelling programs used to correct OCR text were falling down consistently on tasks that should have been simple. Any halfway decent linguist – more to the point, I myself – could, after ten minutes’ work, suggest fixes that would lift the performance by orders of magnitude. I could at that time have grabbed a coder and settled down to do this work, but I assumed – and I still don’t blame myself too much – that the fixes were so obvious that they’d be made in the next iteration and I’d have wasted my time.

Ten years later and the spelling correction programs in Microsoft Word are as bad and as unthinking as ever. And I could now, I suppose, still grab a coder and settle down to do this work, but I still can’t believe that the flaws aren’t so blindingly obvious that they will finally be fixed in the next iteration and I’ll have wasted my time.
The basic principle is unarguable, and it’s utterly opaque to me why the spelling programs don’t grasp it.

Let’s input those principles now. Hell, that principle; there's only one, with three sections.

Text mistakes are not random. They occur because
(1) in scanning, some combinations of letters look very like other letters.
(2) in typing, some letters are next to other letters.
At the moment, absolutely all the attention of the programs is focussed on
(3) in writing, some words are misspelled.
Which is probably the least important.
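
For what it's worth, point (2) is easy to sketch in Python. This is only an illustration under loud assumptions: a QWERTY layout, same-row neighbours only, and a toy word list. The names `ROWS`, `neighbours` and `typing_candidates` are made up for this example and do not come from any real spell checker.

```python
# Hypothetical sketch: treat a typo as "finger landed one key to the left or right".
# Assumes a QWERTY keyboard and only looks at neighbours within the same row.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def neighbours(ch):
    """Return the keys immediately left and right of ch on its row."""
    for row in ROWS:
        i = row.find(ch)
        if i != -1:
            return [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return []


def typing_candidates(word, words):
    """Swap each letter for an adjacent key and keep results that are real words."""
    found = []
    lower = word.lower()
    for i, ch in enumerate(lower):
        for n in neighbours(ch):
            cand = lower[:i] + n + lower[i + 1:]
            if cand in words and cand not in found:
                found.append(cand)
    return found


# "s" sits next to "a", so "fsmily" is one slipped key away from "family".
print(typing_candidates("fsmily", {"family", "members"}))
```

A real version would presumably also consider the rows above and below, but the point stands: the candidate list is tiny because the mistakes are not random.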

Let’s look at the spelling program working through a page scanned off a rather poor fax.



Misspelled word | Actual word | Suggestion
Departrnent     | Department  | no suggestions
Concems         | Concerns    | conches
fa~nily         | family      | fancily
mernbers        | members     | mourners
DISABIUTY       | DISABILITY  | DISUNITY
Syndrorne       | Syndrome    | Sandrine
I)ocuments      | Documents   | I)documents

Or, by simple deduction,

Misspelled word | Rule that was applied | Suggestion
Departrnent     | What word begins with Departr? | no suggestions
Concems         | What word beginning with conc shares most letters with Concems? | conches
fa~nily         | What word shares most letters with fanily? | fancily
mernbers        | What word shares most letters with mernbers? | mourners
DISABIUTY       | What word shares most letters with DISABIUTY? | DISUNITY
Syndrorne       | Search me | Sandrine
I)ocuments      | Scramble, scramble, what word shares most letters with ocuments? | I)documents

The problem here is that the scrambling of letters, which is the primary worry of this software, is a fairly rare source of errors.

This would not be particularly difficult to fix.

First page.
Are you
    Correcting scanning? Click here [X]
    Correcting typing? Click here [ ]

Misspelled word | Rule that's needed | Actual word
Departrnent     | r n looks like m – check if substitution produces a word | Department
Concems         | m looks like r n – check if substitution produces a word | Concerns
fa~nily         | ~n could be m or r n – check if substitution produces a word | family
mernbers        | r n looks like m – check if substitution produces a word | members
DISABIUTY       | LI looks like U – check if substitution produces a word | DISABILITY
Syndrorne       | r n looks like m – check if substitution produces a word | Syndrome
I)ocuments      | I) looks like D – check if substitution produces a word | Documents
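
The rules in the table above can be sketched in a few lines of Python. This is a minimal illustration, not a finished corrector: the confusion list holds only the pairs from the table, the word list is a stand-in for a real dictionary, and the names `CONFUSIONS`, `WORDS`, `candidates` and `correct_scanned` are invented for the example.

```python
# Hypothetical confusion table: (what the scan shows, what was probably printed).
# Only the pairs from the table above; a real list would be built from many examples.
CONFUSIONS = [
    ("rn", "m"),   # r n looks like m
    ("m", "rn"),   # m looks like r n
    ("~n", "m"),   # ~n could be m
    ("U", "LI"),   # LI looks like U
    ("I)", "D"),   # I) looks like D
]

# Toy dictionary standing in for a real word list (an assumption for the sketch).
WORDS = {"department", "concerns", "family", "members",
         "disability", "syndrome", "documents"}


def candidates(word):
    """Yield each string made by applying one confusion rule at one position."""
    for seen, meant in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            yield word[:start] + meant + word[start + len(seen):]
            start = word.find(seen, start + 1)


def correct_scanned(word, words=WORDS):
    """Return the substitutions that turn a scanned string into a known word."""
    if word.lower() in words:
        return [word]
    found = []
    for cand in candidates(word):
        if cand.lower() in words and cand not in found:
            found.append(cand)
    return found


for bad in ["Departrnent", "Concems", "fa~nily", "mernbers",
            "DISABIUTY", "Syndrorne", "I)ocuments"]:
    print(bad, "->", correct_scanned(bad))
```

Each rule is tried at every position where it matches, and a candidate survives only if the result is in the dictionary, which is exactly the "check if substitution produces a word" step.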

The fact that one sub-rule would have removed half the errors gives some clue to the ease of the enterprise. Run through a couple of thousand examples and you’d knock over 99% of the problems. WHY HAS NOBODY DONE THIS?

2 comments:

Hammy said...

There is no shortage of people who have. I find the standard Firefox spell checker much better.

Google searches on a misspelt word are usually more effective as well.

Anonymous said...

Hiya Chris,

How about suggesting it to the OpenOffice developers now that OOo 3.0 runs natively on Macs ?

cheers!
Chris
