A live toad every morning: Why I hate Bill Gates

Wednesday, April 22, 2009

Why I hate Bill Gates

One of the irritating things about my life is that I’ve always thought that I had no route to vast riches; the money-making schemes I think up have always been thought of before. One of the even more irritating things is when this turns out not to have been so, and that nobody had in fact thought of it.

Some ten or so years ago, I noted that the spelling programs that correct OCR text were falling down consistently on tasks that should have been simple. Any halfway decent linguist – more to the point, I myself – could suggest after ten minutes’ work fixes that would lift the performance by orders of magnitude. I could at that time have grabbed a coder and settled down to do this work, but I assumed – and I still don’t blame myself too much – that these were so obvious that they’d be fixed in the next iteration and I’d have wasted my time.

Ten years later and the spelling correction programs in Microsoft Word are as bad and as unthinking as ever. And I could now, I suppose, still grab a coder and settle down to do this work, but I still can’t believe that the flaws aren’t so blindingly obvious that they will finally be fixed in the next iteration and I’ll have wasted my time.
The basic principle is unarguable, and it’s utterly opaque to me why the spelling programs don’t grasp it.

Let’s input those principles now. Hell, that principle; there's only one, with three sections.

Text mistakes are not random. They occur because
(1) in scanning, some combinations of letters look very like other letters.
(2) in typing, some letters are next to other letters.
At the moment, absolutely all the attention of the programs is focussed on
(3) in writing, some words are misspelled.
Which is probably the least important.
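
To make point (2) concrete, here is a minimal sketch of what a typing-aware check might look like. Everything in it is an assumption for illustration: the QWERTY row layout, the toy word list, and the function names are mine, not anything a real spelling program exposes, and it only considers same-row neighbours.

```python
# Hypothetical illustration: a US QWERTY layout and a toy word list are assumed.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
WORDS = {"family", "members", "department"}


def neighbours(ch):
    """Return the keys immediately left and right of ch on its keyboard row."""
    for row in QWERTY_ROWS:
        i = row.find(ch)
        if i != -1:
            return [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return []  # not a letter key we know about


def typing_candidates(word, words=WORDS):
    """Swap each letter in turn for an adjacent key; keep results that are words."""
    lower = word.lower()
    found = set()
    for i, ch in enumerate(lower):
        for n in neighbours(ch):
            candidate = lower[:i] + n + lower[i + 1:]
            if candidate in words:
                found.add(candidate)
    return sorted(found)


print(typing_candidates("fsmily"))   # 's' sits next to 'a'
print(typing_candidates("mrmbers"))  # 'r' sits next to 'e'
```

A fuller version would also look at the rows above and below, but even this much uses the fact that a slipped finger lands on a nearby key rather than a random one.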

Let’s look at the spelling program working through a page scanned off a rather poor fax.

Misspelled Word	Actual word	Suggestion
Departrnent	Department	no suggestions
Concems	Concerns	conches
fa~nily	family	fancily
mernbers	members	mourners
DISABIUTY	DISABILITY	DISUNITY
Syndrorne	Syndrome	Sandrine
I)ocuments	Documents	I)documents

Or, by simple deduction,

Misspelled Word	Rule that was applied	Suggestion
Departrnent	What word begins with Departr?	no suggestions
Concems	What word beginning with conc shares most letters with Concems?	conches
fa~nily	What word shares most letters with fanily?	fancily
mernbers	What word shares most letters with mernbers?	mourners
DISABIUTY	What word shares most letters with DISABIUTY?	DISUNITY
Syndrorne	Search me	Sandrine
I)ocuments	Scramble, scramble, what word shares most letters with ocuments?	I)documents

The problem here is that the scrambling of letters, which is the primary worry of this software, is a fairly rare source of errors.

This would not be particularly difficult to fix.

First page.
Are you
Correcting scanning? Click here X
Correcting typing? Click here

Misspelled Word	Rule that's needed	Actual word
Departrnent	r n looks like m – check if substitution produces a word	Department
Concems	m looks like rn – check if substitution produces a word	Concerns
fa~nily	~n could be m or r n – check if substitution produces a word	family
mernbers	r n looks like m – check if substitution produces a word	members
DISABIUTY	LI looks like U – check if substitution produces a word	DISABILITY
Syndrorne	r n looks like m – check if substitution produces a word	Syndrome
I)ocuments	I) looks like D – check if substitution produces a word	Documents
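
The rules in the table above can be sketched in a few lines of Python. This is a rough illustration rather than a finished corrector: the confusion pairs are lifted straight from the table, while the tiny word list, the function names, and the case-restoring helper are my own assumptions standing in for a real dictionary and a real program.

```python
# Confusion pairs from the table: (what the scanner produced, what was printed).
# Matching is done in lower case, so "I)" appears here as "i)".
CONFUSIONS = [
    ("rn", "m"),
    ("m", "rn"),
    ("~n", "m"),
    ("~n", "rn"),
    ("u", "li"),
    ("i)", "d"),
]

# Toy word list (an assumption; a real checker would use its full dictionary).
WORDS = {"department", "concerns", "family", "members",
         "disability", "syndrome", "documents"}


def match_case(original, candidate):
    """Give the candidate the same rough capitalisation as the scanned word."""
    if original.isupper():
        return candidate.upper()
    if original[:1].isupper():
        return candidate.capitalize()
    return candidate


def scan_candidates(word, words=WORDS, confusions=CONFUSIONS):
    """Try each confusion at each position; keep substitutions that are words."""
    lower = word.lower()
    found = []
    for seen, meant in confusions:
        start = lower.find(seen)
        while start != -1:
            candidate = lower[:start] + meant + lower[start + len(seen):]
            if candidate in words and candidate not in found:
                found.append(candidate)
            start = lower.find(seen, start + 1)
    return [match_case(word, c) for c in found]


for scanned in ["Departrnent", "Concems", "fa~nily", "mernbers",
                "DISABIUTY", "Syndrorne", "I)ocuments"]:
    print(scanned, "->", scan_candidates(scanned))
```

Only one substitution is tried at a time, which is enough for every word in the sample; a word with two scanning errors would need the candidates fed back through a second pass.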

The fact that one sub-rule (the rn/m confusion, in one direction or the other) would have removed more than half of the errors in this sample gives some clue to the ease of the enterprise. Run through a couple of thousand examples and you’d knock over 99% of the problems. WHY HAS NOBODY DONE THIS?
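
Running through the examples need not even be done by hand. As a sketch of one possible approach (my assumption, not a description of how any existing product works), Python's standard `difflib` can line up each scanned word against its correction and count which chunks were swapped for which. The function name and the sample pairs, taken from the tables above, are illustrative only.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_confusions(pairs):
    """Count (scanned chunk, actual chunk) replacements across example pairs."""
    counts = Counter()
    for scanned, actual in pairs:
        a, b = scanned.lower(), actual.lower()
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":  # a stretch of a was swapped for a stretch of b
                counts[(a[i1:i2], b[j1:j2])] += 1
    return counts


# The seven examples from the fax page above.
EXAMPLES = [
    ("Departrnent", "Department"),
    ("Concems", "Concerns"),
    ("fa~nily", "family"),
    ("mernbers", "members"),
    ("DISABIUTY", "DISABILITY"),
    ("Syndrorne", "Syndrome"),
    ("I)ocuments", "Documents"),
]

for (seen, meant), n in learn_confusions(EXAMPLES).most_common():
    print(f"{seen!r} scanned where {meant!r} was meant: {n} time(s)")
```

With a couple of thousand pairs instead of seven, the most frequent entries in that count would be the substitution rules worth trying first.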

2 comments:

Hammy said... (2:27 PM)
There is no shortage of people who have. I find the standard Firefox spell checker much better.

Google searches on a misspelt word are usually more effective as well.

Anonymous said... (9:52 PM)
Hiya Chris,

How about suggesting it to the OpenOffice developers now that OOo 3.0 runs natively on Macs?

cheers!
Chris
