Some ten or so years ago, I noted that the spelling programs that correct OCR text were falling down consistently on tasks that should have been simple. Any halfway decent linguist – more to the point, I myself – could, after ten minutes’ work, suggest fixes that would lift the performance by orders of magnitude. I could at that time have grabbed a coder and settled down to do this work, but I assumed – and I still don’t blame myself too much – that these fixes were so obvious that they’d be made in the next iteration and I’d have wasted my time.
Ten years later and the spelling correction programs in Microsoft Word are as bad and as unthinking as ever. And I could now, I suppose, still grab a coder and settle down to do this work, but I still can’t believe that the flaws aren’t so blindingly obvious that they will finally be fixed in the next iteration and I’ll have wasted my time.
The basic principle is unarguable, and it’s utterly opaque to me why the spelling programs don’t grasp it.
Let’s set out those principles now. Hell, that principle; there's only one, with three sections.
Text mistakes are not random. They occur because
(1) in scanning, some combinations of letters look very like other letters.
(2) in typing, some letters are next to other letters.
At the moment, absolutely all the attention of the programs is focussed on
(3) in writing, some words are misspelled.
Which is probably the least important.
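Point (2) is easy to make concrete. What follows is a minimal sketch, not anything a real spelling program is known to do: it assumes a plain QWERTY layout, only counts keys immediately left and right on the same row as "next to" each other, and uses a tiny stand-in word list. The names `QWERTY_ROWS`, `build_neighbours` and `typing_candidates` are made up for the illustration, and so are the two typos.

```python
# Assumption: a plain QWERTY layout, letters only.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbours(rows=QWERTY_ROWS):
    """Map each key to the keys immediately left and right of it on its row."""
    neighbours = {}
    for row in rows:
        for i, key in enumerate(row):
            # Slicing copes with the ends of a row: a missing neighbour is "".
            neighbours[key] = set(row[max(0, i - 1):i] + row[i + 1:i + 2])
    return neighbours


NEIGHBOURS = build_neighbours()


def typing_candidates(token, words, neighbours=NEIGHBOURS):
    """Return dictionary words reachable by swapping one letter for a neighbouring key."""
    found = []
    lower = token.lower()
    for i, ch in enumerate(lower):
        for repl in sorted(neighbours.get(ch, ())):
            candidate = lower[:i] + repl + lower[i + 1:]
            if candidate in words and candidate not in found:
                found.append(candidate)
    return found


# Stand-in dictionary; a real one would hold the whole language.
WORDS = {"family", "documents"}

print(typing_candidates("fsmily", WORDS))     # 's' sits next to 'a'
print(typing_candidates("documemts", WORDS))  # 'm' sits next to 'n'
```

A fuller version would also treat keys on the rows above and below as neighbours, but the shape of the check stays the same: try the likely slip, see if a word falls out.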
Let’s look at the spelling program working through a page scanned off a rather poor fax.
|Misspelled Word||Actual word||Suggestion|
Or, by simple deduction,
|Misspelled Word||Rule that was applied||Suggestion|
|Departrnent||What word begins with Departr?||no suggestions|
|Concems||What word beginning with conc shares most letters with Concems?||conches|
|fa~nily||What word shares most letters with fanily?||fancily|
|mernbers||What word shares most letters with mernbers?||mourners|
|DISABIUTY||What word shares most letters with DISABIUTY?||DISUNITY|
|I)ocuments||Scramble, scramble, what word shares most letters with ocuments?||I)documents|
The problem here is that the scrambling of letters, which is the primary worry of this software, is a fairly rare source of errors.
This would not be particularly difficult to fix. Start by letting the user tell the program what kind of text it is correcting:
Correcting scanning? Click here X
Correcting typing? Click here
|Misspelled Word||Rule that's needed||Actual word|
|Departrnent||r n looks like m – check if substitution produces a word||Department|
|Concems||m looks like rn – check if substitution produces a word||Concerns|
|fa~nily||~n could be m or r n – check if substitution produces a word||family|
|mernbers||r n looks like m – check if substitution produces a word||members|
|DISABIUTY||LI looks like U – check if substitution produces a word||DISABILITY|
|Syndrorne||r n looks like m – check if substitution produces a word||Syndrome|
|I)ocuments||I) looks like D – check if substitution produces a word||Documents|
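The "check if substitution produces a word" rule in the table can be sketched in a few lines. This is only an illustration under stated assumptions: the confusion pairs are the ones from the rows above, the word list is a seven-word stand-in for a real dictionary, and the names `OCR_CONFUSIONS`, `WORDS` and `ocr_candidates` are invented for the example.

```python
# Each pair is (what the scanner produced, what was probably on the page).
# Only the confusions from the table above; a real list would be much longer.
OCR_CONFUSIONS = [
    ("rn", "m"),
    ("m", "rn"),
    ("~n", "m"),
    ("U", "LI"),
    ("I)", "D"),
]

# Stand-in dictionary, lower-cased; a real one would hold the whole language.
WORDS = {
    "department", "concerns", "family", "members",
    "disability", "syndrome", "documents",
}


def ocr_candidates(token, words=WORDS, confusions=OCR_CONFUSIONS):
    """Return dictionary words reachable from token by one confusion substitution."""
    found = []
    for seen, meant in confusions:
        start = token.find(seen)
        while start != -1:
            # Substitute at this one position and keep the rest of the token as-is.
            candidate = token[:start] + meant + token[start + len(seen):]
            if candidate.lower() in words and candidate not in found:
                found.append(candidate)
            start = token.find(seen, start + 1)
    return found


for bad in ["Departrnent", "Concems", "fa~nily", "mernbers",
            "DISABIUTY", "Syndrorne", "I)ocuments"]:
    print(bad, "->", ocr_candidates(bad))
```

The sketch tries one substitution at a time and keeps the original capitalisation, which is all these examples need. Tokens with two scanning errors would need the substitutions chained, and a long confusion list would need some ranking of the candidates.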
The fact that one sub-rule would have removed half the errors gives some clue to the ease of the enterprise. Run through a couple of thousand examples and you’d knock over 99% of the problems. WHY HAS NOBODY DONE THIS?