Back in 2007 Peter Norvig shared a simple spelling corrector (last updated in 2016); I thought that I’d share my translation into Lisp, with some comments. Norvig’s original article explains the actual algorithm he’s using; I’ll focus on stuff specific to my own version.
I’m definitely not a Lisp wizard (although I’ve been using it for many years), so it’s entirely possible that I’ve messed one thing or another up — I’m glad to receive corrections or comments.
I translated it pretty literally from Python. Where Norvig uses a Counter, I just used an EQUAL hash table mapping strings to counts. You can download the literal translation.
It’s good practice to set up a special package for anything you’re working on, although for toys you really don’t need to.
    (ql:quickload "cl-ppcre")

    (defpackage spelling-corrector
      (:use #:cl #:cl-ppcre)
      (:export #:correction))

    (in-package #:spelling-corrector)
- Uses Quicklisp to load CL-PPCRE, an excellent library which implements portable Perl-compatible regexps. It’s not idiomatic to do it this way, but it’s simple.
- Defines a new package SPELLING-CORRECTOR, which uses COMMON-LISP (whose nickname is the shorter CL) and CL-PPCRE (whose nickname PPCRE I didn’t bother to use), and exports the symbol CORRECTION, which is the name of the function which will actually suggest spelling corrections for words.
- Switches the current package to SPELLING-CORRECTOR.
If we’re going to correct misspelt words, we need to have some idea of properly spelt words.
    (defparameter *words* (make-hash-table :test 'equal))

    (defun words (text)
      (all-matches-as-strings "\\w+" (nstring-downcase text)))

    (with-open-file (words #P"big.txt")
      (do ((line (read-line words nil) (read-line words nil)))
          ((null line))
        (dolist (word (words line))
          (incf (gethash word *words* 0)))))

    (let ((n (loop for word being the hash-keys of *words*
                     using (hash-value count)
                   sum count)))
      (defun p (word)
        (/ (gethash word *words* 0) n)))
Defines a variable named *WORDS*; it’s an EQUAL hash table. Lisp supports a number of different types of hash table, based on the sort of identity relation one wishes to use for keys; EQUAL means that strings will be considered equal if they have the same characters. Each time DEFPARAMETER is executed (e.g. if we reload the file), *WORDS* is reset to an empty hash table, which is what we’d want. It maps words to the number of times they’ve been seen.
Norvig uses a collections.Counter here, which is a Python library object which does much the same thing: maintain the number of times the items it stores have been seen.
Defines a function WORDS which takes a string and splits it out into lowercase words (by matching maximal runs of word characters; punctuation and whitespace separate words).
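A quick sketch of WORDS at the REPL (NSTRING-DOWNCASE destructively modifies its argument, which is why COPY-SEQ appears here: downcasing a literal string in place is undefined behaviour):

```lisp
;; Matches maximal runs of word characters in the lowercased text;
;; punctuation and spaces simply separate matches.
(words (copy-seq "The quick brown fox."))
;; → ("the" "quick" "brown" "fox")
```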
Reads in a text corpus — the file big.txt — line-by-line, splits each line into words and stores the number of times each word has been seen in *WORDS*.
Defines a function P which calculates the probability of a word: how many times it’s been seen out of how many times all words have been seen. This function captures the value of N when it is defined, so that we don’t need to recalculate the sum of all word counts every time P is called.
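This ‘capture at definition time’ pattern is easy to see in miniature (a toy example, not part of the corrector): the closed-over total is computed once, when the DEFUN runs, not on every call.

```lisp
;; Toy illustration of closing over a value computed once:
;; N is fixed when FRACTION is defined.
(let ((n (+ 1 2 3)))
  (defun fraction (k)
    (/ k n)))

(fraction 3) ; → 1/2 (Lisp keeps exact rationals)
```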
    (defun correction (word)
      (loop with max = -1 and best
            for candidate in (candidates word)
            for p = (p candidate)
            when (> p max)
              do (setf max p
                       best candidate)
            finally (return best)))
CORRECTION is the spelling-corrector itself. Given a word, first it
finds all the candidates generated for the word (i.e., all those words
in the corpus for which that word could have been typed, e.g. ‘taht’
would generate the candidates ‘tat,’ ‘that,’ ‘tact,’ ‘taft’ & ‘taut’),
and then picks whichever candidate has the highest probability of
being the chosen word. The whole thing’s really that simple!
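The LOOP above is just an argmax over the candidates; the same idiom in isolation, with a stand-in scoring function (this ARGMAX helper is my own illustration, not part of the corrector):

```lisp
;; Pick the element of LIST for which FN returns the largest value.
(defun argmax (fn list)
  (loop with best and best-score
        for x in list
        for score = (funcall fn x)
        when (or (null best-score) (> score best-score))
          do (setf best x best-score score)
        finally (return best)))

(argmax #'length '("a" "abc" "ab")) ; → "abc"
```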
    (defun candidates (word)
      (or (known (list word))
          (known (edits-1 word))
          (known (edits-2 word))
          (list word)))
CANDIDATES returns the first non-empty set of potential words that a word might have been intended to be: the word itself, if it’s known from the corpus; otherwise all known words one edit away from the word; otherwise all known words two edits away; otherwise the word itself.
    (candidates "blog")
    → ("log" "bog" "flog" "bloc" "blot" "blow")

(because ‘blog’ is not in the corpus, it instead finds all known words one edit away).
    (defun edits-1 (word &aux (letters "abcdefghijklmnopqrstuvwxyz"))
      (let* ((splits (loop for i from 0 upto (length word)
                           collect (cons (subseq word 0 i) (subseq word i))))
             (deletes (loop for (l . r) in splits
                            unless (string-equal r "")
                              collect (concatenate 'string l (subseq r 1))))
             (transposes (loop for (l . r) in splits
                               when (> (length r) 1)
                                 collect (concatenate 'string
                                                      l
                                                      (subseq r 1 2)
                                                      (subseq r 0 1)
                                                      (subseq r 2))))
             (replaces (loop for (l . r) in splits
                             unless (string-equal r "")
                               nconc (loop for c across letters
                                           collect (concatenate 'string
                                                                l
                                                                (list c)
                                                                (subseq r 1)))))
             (inserts (loop for (l . r) in splits
                            nconc (loop for c across letters
                                        collect (concatenate 'string
                                                             l
                                                             (list c)
                                                             r)))))
        (append deletes transposes replaces inserts)))
- Splits WORD into every possible pair of a left part and a right part.
- Finds all potential words created by deleting a single letter from WORD.
- Finds all potential words created by transposing two adjacent letters in WORD.
- Finds all potential words created by replacing a letter in WORD with another letter.
- Finds all potential words created by inserting a letter anywhere in WORD.
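Assuming EDITS-1 as written above, the size of its result depends only on the word’s length n: n deletes, n − 1 transposes, 26n replaces and 26(n + 1) inserts, i.e. 54n + 25 strings, duplicates included.

```lisp
(length (edits-1 "blog")) ; → 241, i.e. 54 × 4 + 25
```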
The function is pretty much a direct translation of Norvig’s Python, and it could use some optimisation: we don’t really need to store the list of splits; we don’t need to iterate over SPLITS four times; and we don’t need to iterate over LETTERS separately for the replaces and the inserts.
I do use one unusual Lisp feature, &AUX arguments. They’re a convenient way of defining a new variable within the function without having to clutter things up with a LET.
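An &AUX parameter is effectively shorthand for wrapping the whole body in a LET; these two toy definitions behave identically:

```lisp
(defun f (x &aux (y (* x x)))
  (+ x y))

;; ... is equivalent to:

(defun g (x)
  (let ((y (* x x)))
    (+ x y)))

(f 3) ; → 12
(g 3) ; → 12
```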
    (defun edits-2 (word)
      (loop for e1 in (edits-1 word)
            nconc (loop for e2 in (edits-1 e1)
                        collect e2)))
EDITS-2 finds all words two edits away from
WORD: first it gets
all words one edit away, then it collects all words one edit away from
those words. Note that it doesn’t do anything to deduplicate the
results: e.g. ‘teh’ would generate the deletions ‘eh,’ ‘th’ & ‘te,’
and among the insertions generated from each of those is ‘teh’ itself
(re-created by inserting ‘t,’ ‘e’ & ‘h,’ respectively). This is
okay in this toy code, because the only thing we end up doing with the
results is calling
P; it might even be more expensive to deduplicate
— that’s something to check later.
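That check is a one-liner with REMOVE-DUPLICATES (my own aside, not part of the corrector); comparing the raw and deduplicated sizes shows how much repetition EDITS-2 produces:

```lisp
;; Raw size of the two-edit neighbourhood, versus the number of
;; distinct strings in it.
(length (edits-2 "teh"))
(length (remove-duplicates (edits-2 "teh") :test #'equal))
```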
    (defun known (words)
      (remove-if-not (lambda (x) (gethash x *words*)) words))
KNOWN just filters a list of words, removing any which weren’t seen
in the corpus.
I also translated the test suite from Norvig’s original article.
    (ql:quickload "split-sequence")

    (defpackage test-spelling-corrector
      (:use #:cl #:spelling-corrector #:split-sequence)
      (:export #:make-testset #:spelltest))

    (in-package #:test-spelling-corrector)

    (defun make-testset (path)
      (with-open-file (file path)
        (loop for line = (read-line file nil)
              while line
              for (right wrongs) = (split-sequence:split-sequence #\: line)
              nconc (loop for wrong in (split-sequence:split-sequence
                                        #\Space wrongs :remove-empty-subseqs t)
                          collect (cons right wrong)))))

    (defun spelltest (tests &optional verbose
                      &aux (start (get-universal-time)) (n (length tests)))
      (loop with good = 0 and unknown = 0
            for (right . wrong) in tests
            for w = (correction wrong)
            do (if (string-equal w right)
                   (incf good)
                   (progn
                     (unless (gethash right spelling-corrector::*words*)
                       (incf unknown))
                     (when verbose
                       (format t "~&correction(~a) → ~a (~a); expected ~a (~a)"
                               wrong w (gethash w spelling-corrector::*words*)
                               right (gethash right spelling-corrector::*words*)))))
            finally (format t "~&~,,2f% of ~a correct (~,,2f% unknown) at ~f words per second"
                            (/ good n) n (/ unknown n)
                            (/ n (- (get-universal-time) start)))))
And then tested against his two test sets:
    (test-spelling-corrector:spelltest
     (test-spelling-corrector:make-testset #P"spell-testset1.txt"))

    (test-spelling-corrector:spelltest
     (test-spelling-corrector:make-testset #P"spell-testset2.txt"))
Which yields the results:
    76.666665% of 270 correct (5.5555556% unknown) at 54.0 words per second
    69.0% of 400 correct (10.75% unknown) at 40.0 words per second
To compare, Norvig’s Python 3 version generates these results:
    75% of 270 correct (6% unknown) at 44 words per second
    68% of 400 correct (11% unknown) at 38 words per second
Interestingly, the Lisp version — despite being compiled to machine code — is not appreciably faster than the Python version. This suggests that there are some potential optimisations, which will (hopefully) be the subject of a later blog entry.
- Note that Lisp package names can contain hyphens.