How to write a spelling corrector — in Lisp
Back in 2007 Peter Norvig shared a simple spelling corrector (last updated in 2016); I thought that I’d share my translation into Lisp, with some comments. Norvig’s original article explains the actual algorithm he’s using; I’ll focus on stuff specific to my own version.
I’m definitely not a Lisp wizard (although I’ve been using it for many years), so it’s entirely possible that I’ve messed one thing or another up — I’m glad to receive corrections or comments.
Literal translation
I pretty literally translated it from Python. Where Norvig uses a Python Counter, I just used an EQUAL hash table mapping strings to counts.
You can download the literal translation.
Boilerplate
It’s good practice to set up a special package for anything you’re working on, although for toys you really don’t need to.
(ql:quickload "cl-ppcre")
(defpackage spelling-corrector
(:use #:cl #:cl-ppcre)
(:export #:correction))
(in-package #:spelling-corrector)
This:
- Uses Quicklisp to load CL-PPCRE, an excellent library which implements portable Perl-compatible regexps. It’s not idiomatic to do it this way, but it’s simple.
- Defines a new package SPELLING-CORRECTOR.¹ SPELLING-CORRECTOR uses COMMON-LISP (whose nickname is the shorter CL) and CL-PPCRE (whose nickname PPCRE I didn’t bother to use) and exports the symbol CORRECTION, which is the name of the function which will actually suggest spelling corrections for words.
- Switches the current package to SPELLING-CORRECTOR.
Dictionary
If we’re going to correct misspelt words, we need to have some idea of properly spelt words.
(defparameter *words* (make-hash-table :test 'equal))
(defun words (text)
(all-matches-as-strings "\\w+" (nstring-downcase text)))
(with-open-file (words #P"big.txt")
(do ((line (read-line words nil)
(read-line words nil)))
((null line))
(dolist (word (words line))
(incf (gethash word *words* 0)))))
(let ((n (loop for word being the hash-keys of *words* using (hash-value count)
sum count)))
(defun p (word)
(/ (gethash word *words* 0) n)))
This:
- Defines a variable named *WORDS*; it’s an EQUAL hash table. Lisp supports a number of different types of hash table, based on the sort of identity relation one wishes to use for keys; EQUAL means that strings will be considered equal if they have the same characters. Each time DEFPARAMETER is executed (e.g. if we reload the file), *WORDS* is reset to an empty hash table, which is what we’d want. It maps words to the number of times they’ve been seen. Norvig uses a collections.Counter here, which is a Python library object which does much the same thing: maintain the number of times the items it stores have been seen.
- Defines a function WORDS which takes a string and splits it out into lowercase words (by collecting maximal runs of word characters, i.e. matches of the regexp \w+).
- Reads in a text corpus (the file big.txt) line-by-line, splits each line into words and stores the number of times each word has been seen in *WORDS*.
- Defines a function P which calculates the probability of a word: how many times it’s been seen out of how many times all words have been seen. This function captures the value of N when it is defined, so that we don’t need to calculate the sum of all word counts every time P is called.
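Once the corpus is loaded, the closure can be exercised at the REPL. The exact values depend on the corpus, but with Norvig’s big.txt the most common word, ‘the,’ accounts for roughly 7% of all tokens:

```lisp
;; Counts and probabilities at the REPL (values depend on the corpus):
(gethash "the" *words*)   ; the raw count of "the"
(p "the")                 ; a rational; roughly 7/100 with Norvig's big.txt
(float (p "the"))         ; the same probability as a float, around 0.07
```

Because P closes over N, reloading the corpus requires re-evaluating the LET form that defines P, or N will be stale.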
Functions
CORRECTION
(defun correction (word)
  (loop with max = -1 and best = nil
        for candidate in (candidates word)
        for p = (p candidate)
        when (> p max)
          do (setf max p
                   best candidate)
        finally (return best)))
CORRECTION is the spelling-corrector itself. Given a word, first it
finds all the candidates generated for the word (i.e., all those words
in the corpus for which that word could have been typed, e.g. ‘taht’
would generate the candidates ‘tat,’ ‘that,’ ‘tact,’ ‘taft’ & ‘taut’),
and then picks whichever candidate has the highest probability of
being the chosen word. The whole thing’s really that simple!
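A quick REPL check of that example (assuming the corpus has been loaded):

```lisp
;; 'taht' is not in the corpus, so CORRECTION picks the most probable
;; of its one-edit candidates; with big.txt this should be "that",
;; by far the most frequent of them.
(correction "taht")
```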
CANDIDATES
(defun candidates (word)
(or (known (list word))
(known (edits-1 word))
(known (edits-2 word)) (list word)))
CANDIDATES returns the first non-empty set of potential words that a
word might have been intended to be: the word itself, if it’s known
from the corpus; otherwise all known words one edit away from the
word; otherwise all known words two edits away from the word;
otherwise the word itself.
E.g. (candidates "blog")
→ ("log" "bog" "flog" "bloc" "blot" "blow"),
because ‘blog’ is not in the corpus, so instead it finds all
known words one edit away.
EDITS-1
(defun edits-1 (word &aux (letters "abcdefghijklmnopqrstuvwxyz"))
(let* ((splits (loop for i from 0 upto (length word)
collect (cons (subseq word 0 i)
(subseq word i))))
(deletes (loop for (l . r) in splits
unless (string-equal r "")
collect (concatenate 'string l (subseq r 1))))
(transposes (loop for (l . r) in splits
when (> (length r) 1)
collect (concatenate 'string l (subseq r 1 2) (subseq r 0 1) (subseq r 2))))
(replaces (loop for (l . r) in splits
unless (string-equal r "")
nconc (loop for c across letters
collect (concatenate 'string l (list c) (subseq r 1)))))
(inserts (loop for (l . r) in splits
nconc (loop for c across letters
collect (concatenate 'string l (list c) r)))))
(append deletes transposes replaces inserts)))
EDITS-1:
- Finds all potential words created by deleting a single letter from WORD.
- Finds all potential words created by transposing adjacent letters in WORD.
- Finds all potential words created by replacing a letter in WORD with another letter.
- Finds all potential words created by inserting a letter in WORD.
The function is pretty much a direct translation of Norvig’s Python,
and it could use some optimisation: we don’t really need to store the
list of splits; we don’t need to iterate over SPLITS
four times;
we don’t need to iterate over LETTERS
twice.
I do use one unusual Lisp feature, &AUX
arguments. They’re a
convenient way of defining a new variable within the function without
having to clutter things up with a LET
or LET*
form.
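As a sanity check: for a nine-letter word the four lists contain 9 deletes, 8 transposes, 26 × 9 = 234 replaces and 26 × 10 = 260 inserts:

```lisp
(length (edits-1 "somthing"))   ; ⇒ 511
;; Norvig's Python reports 442 for the same word, because his version
;; builds a set and so deduplicates the results.
```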
EDITS-2
(defun edits-2 (word)
  (loop for e1 in (edits-1 word)
        nconc (loop for e2 in (edits-1 e1)
                    collect e2)))
EDITS-2 finds all words two edits away from WORD: first it gets
all words one edit away, then it collects all words one edit away from
those words. Note that it doesn’t do anything to deduplicate the
results, e.g. ‘teh’ would generate the deletions ‘eh,’ ‘th’ & ‘te,’
each of which would generate 26 insertions, which would each
regenerate ‘teh’ (by inserting ‘t,’ ‘e’ & ‘h,’ respectively). This is
okay in this toy code, because the only thing we end up doing with the
results is calling P; it might even be more expensive to deduplicate.
That’s something to check later.
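If deduplication ever turns out to be worth it, one variant to benchmark (EDITS-2-UNIQUE is a name I’ve made up; it’s a sketch, not something I’ve measured) would be:

```lisp
;; A deduplicating variant of EDITS-2: REMOVE-DUPLICATES with a
;; string test collapses the many paths that regenerate the same word.
(defun edits-2-unique (word)
  (remove-duplicates
   (loop for e1 in (edits-1 word)
         nconc (edits-1 e1))
   :test #'string=))
```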
KNOWN
(defun known (words) (remove-if-not (lambda (x) (gethash x *words*)) words))
KNOWN just filters a list of words, removing any which weren’t seen
in the corpus.
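For example (again assuming big.txt has been loaded):

```lisp
;; "the" is certainly in the corpus; "zzzz" almost certainly isn't:
(known '("the" "zzzz"))   ; ⇒ ("the")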
Test suite
I also translated the test suite from Norvig’s original article.
(ql:quickload "split-sequence")
(defpackage test-spelling-corrector
(:use #:cl #:spelling-corrector #:split-sequence)
(:export #:make-testset
#:spelltest))
(in-package #:test-spelling-corrector)
(defun make-testset (path)
(with-open-file (file path)
(loop for line = (read-line file nil)
while line for (right wrongs) = (split-sequence:split-sequence #\: line)
nconc (loop for wrong in (split-sequence:split-sequence
#\Space
wrongs
:remove-empty-subseqs t)
collect (cons right wrong)))))
(defun spelltest (tests
&optional verbose
&aux (start (get-universal-time)) (n (length tests)))
(loop with good = 0 and unknown = 0
for (right . wrong) in tests for w = (correction wrong)
do (if (string-equal w right)
(incf good)
(progn
(unless (gethash right spelling-corrector::*words*)
(incf unknown))
(when verbose
(format t
"~&correction(~a) → ~a (~a); expected ~a (~a)"
wrong
w
(gethash w spelling-corrector::*words*)
right
(gethash right spelling-corrector::*words*)))))
finally
(format
t
"~&~,,2f% of ~a correct (~,,2f% unknown) at ~f words per second"
(/ good n) n (/ unknown n) (/ n (- (get-universal-time) start)))))
And then tested against his two test sets:
(test-spelling-corrector:spelltest (test-spelling-corrector:make-testset #P"spell-testset1.txt"))
(test-spelling-corrector:spelltest (test-spelling-corrector:make-testset #P"spell-testset2.txt"))
Which yields the results:
76.666665% of 270 correct (5.5555556% unknown) at 54.0 words per second
69.0% of 400 correct (10.75% unknown) at 40.0 words per second
To compare, Norvig’s Python 3 version generates these results:
75% of 270 correct (6% unknown) at 44 words per second
68% of 400 correct (11% unknown) at 38 words per second
Interestingly, the Lisp version — despite being compiled to machine code — is not appreciably faster than the Python version. This suggests that there are some potential optimisations, which will (hopefully) be the subject of a later blog entry.
1. Note that Lisp package names can contain hyphens.