Spelling Suggestion on the JVM #1: Norvig’s Approach to Spelling Correction

Ricardo Rocha
4 min readMar 4, 2018

--

This tutorial project implements a basic spelling suggestion service. The motivation is didactic: becoming familiar with “better java” languages around a simple but instructional example. Ease of implementation outranks optimal performance so readers can focus on the JVM languages as such. Examples and new concepts are introduced in Java 9 first so they’re immediately understandable to the experienced Java programmer. Crucially, the Java implementation is accompanied by equivalent (but idiomatic) implementations in today’s most relevant alternative JVM languages: Kotlin, Scala and Xtend. Code is available at Spellbound_’s Github repo.

This first post discusses Norvig’s approach to generating typo corrections.

The next post (Implementing Norvig’s Algo in Java) illustrates how to implement Norvig’s spelling corrector in Java 9 following a functional approach.

A Word on Spelling Suggestion

Spelling Suggestion is all about helping document writers spell words correctly.This service is pervasive among editors, word processors and, nowadays, even IDE’s.

We’ve chosen spelling suggestion as a tutorial subject because of how familiar an application domain it is. This application domain also lends itself to showcase modern language constructs such as functional programming, extension methods or type inference, to name a few.

Simple, readable code examples are first presented in Java 9 so as to ensure they’re readily understood; this reduces the cognitive effort of perusing implementations written in not yet familiar languages.

Typo Edits

A typo is the mistaken transcription of a valid, in-dictionary word. Most typos originate in one or more of the following cases:

deletes

A letter is omitted from the word: dilbert becomes dlbert

inserts

A letter is inserted that doesn’t belong in the word: dogbert becomes dogvbert

transposes

Two contiguous letters are swapped: alice becomes alcie

replaces

A letter is changed to another letter: wally becomes walky

Note a delete could be amended with an insert. Correspondingly, a transpose could be fixed with an appropriate replace. We have two groups of symmetric, opposite edits.

Hence, upon detecting a typo, we can hope to reconstitute the original word by applying an opposite edit. This insightful realization is the core of Norvig’s algorithm.

Obviously, we don’t know what letter was omitted or what letters were transposed. Therefore, we need to traverse the entire problem space by brute-force generating all possible opposite edits.

Thus, for example, we can try to amend dlbert with dalbert, dbbert, dcbert, …, dzbert and check to see if one or more of the resulting strings actually occurs in the dictionary (which, happily in this case, happens to appear as dilbert).

The general strategy is, then, to exhaustively apply four edits to the typo in the hope of finding at least one dictionary word that was distorted by the typo.

Statistically, most typos arise from one edit only (which Norvig dubs edit1). Two-edit typos (edit2), while less frequent, are sufficiently common so as to warrant their own two-pass attempt at fixing the typo. edit2 is used only when edit1 fails to find at least one matching dictionary word; edit2‘s input is the result of a failed edit1.

These two scenarios should suffice for the vast majority of typo cases. Of course, it’s still possible for a typo to slip through our crib anyway but we’re not aiming for perfection ;-)

Word Splits

Another frequent type of typo is the joining of two contiguous words by omitting the intervening space: two-word poor asok mutates to poorasok.

Norvig’s solution to this conundrum is to apply edit1 and, possibly, edit2 to all possible typo word splits. Word split pairs progress from ("", "poorasok") to ("p", "oorasok") to ("poo", "rasok"") all the way down to ("poorasok", "").

Note: for the sake of simplicity we’re not addressing the case where a space is inserted inside a single word, which effectively turns it into two separate words. By operating on entire sentences rather than individual words we can detect the occurrence of two consecutive typos and try to compensate by joining them. If the result is not a dictionary word then we apply vanilla single-word correction to both typos.

Next in our plate: Implementing Norvig’s Algo in Java

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

--

--

Ricardo Rocha
Ricardo Rocha

No responses yet

Write a response