How to write a spelling corrector — in Lisp
Back in 2007 Peter Norvig shared a simple spelling corrector (last updated in 2016); I thought that I’d share my translation into Lisp, with some comments. Norvig’s original article explains the actual algorithm he’s using; I’ll focus on stuff specific to my own version.
I’m definitely not a Lisp wizard (although I’ve been using it for many years), so it’s entirely possible that I’ve messed one thing or another up — I’m glad to receive corrections or comments.
Literal translation
I pretty literally translated it from Python. Where Norvig uses a Python Counter, I just used an EQUAL hash table mapping strings to counts.
You can download the literal translation.
Boilerplate
It’s good practice to set up a special package for anything you’re working on, although for toys you really don’t need to.
(ql:quickload "cl-ppcre")
(defpackage spelling-corrector
(:use #:cl #:cl-ppcre)
(:export #:correction))
(in-package #:spelling-corrector)
This:
- Uses Quicklisp to load CL-PPCRE, an excellent library which implements portable Perl-compatible regexps. It’s not idiomatic to do it this way, but it’s simple.
- Defines a new package SPELLING-CORRECTOR.¹ SPELLING-CORRECTOR uses COMMON-LISP (whose nickname is the shorter CL) and CL-PPCRE (whose nickname PPCRE I didn’t bother to use) and exports the symbol CORRECTION, which is the name of the function which will actually suggest spelling corrections for words.
- Switches the current package to SPELLING-CORRECTOR.
Dictionary
If we’re going to correct misspelt words, we need to have some idea of properly spelt words.
(defparameter *words* (make-hash-table :test 'equal))
(defun words (text)
(all-matches-as-strings "\\w+" (nstring-downcase text)))
(with-open-file (words #P"big.txt")
(do ((line (read-line words nil)
(read-line words nil)))
((null line))
(dolist (word (words line))
(incf (gethash word *words* 0)))))
(let ((n (loop for word being the hash-keys of *words* using (hash-value count)
sum count)))
(defun p (word)
(/ (gethash word *words* 0) n)))
This:
- Defines a variable named *WORDS*; it’s an EQUAL hash table. Lisp supports a number of different types of hash table, based on the sort of identity relation one wishes to use for keys; EQUAL means that strings will be considered equal if they have the same characters. Each time DEFPARAMETER is executed (e.g. if we reload the file), *WORDS* is reset to an empty hash table, which is what we’d want. It maps words to the number of times they’ve been seen. Norvig uses a collections.Counter here, which is a Python library object which does much the same thing: maintain the number of times the items it stores have been seen.
- Defines a function WORDS which takes a string and splits it out into lowercase words (by collecting maximal runs of word characters, i.e. matches of the regexp \w+).
- Reads in a text corpus (the file big.txt) line-by-line, splits each line into words and stores the number of times each word has been seen in *WORDS*.
- Defines a function P which calculates the probability of a word: how many times it’s been seen out of how many times all words have been seen. This function captures the value of N when it is defined, so that we don’t need to calculate the sum of all word counts every time P is called.
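Once the corpus is loaded, the closure can be exercised at the REPL. The exact values depend on the corpus, but with Norvig’s big.txt the most common word, ‘the,’ accounts for roughly 7% of all tokens:

```lisp
;; Counts and probabilities at the REPL (values depend on the corpus):
(gethash "the" *words*)   ; the raw count of "the"
(p "the")                 ; a rational; roughly 7/100 with Norvig's big.txt
(float (p "the"))         ; the same probability as a float, around 0.07
```

Because P closes over N, reloading the corpus requires re-evaluating the LET form that defines P, or N will be stale.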
Functions
CORRECTION
(defun correction (word)
  (loop with max = -1 and best = nil
        for candidate in (candidates word)
        for p = (p candidate)
        when (> p max)
          do (setf max p
                   best candidate)
        finally (return best)))
CORRECTION is the spelling-corrector itself. Given a word, first it
finds all the candidates generated for the word (i.e., all those words
in the corpus for which that word could have been typed, e.g. ‘taht’
would generate the candidates ‘tat,’ ‘that,’ ‘tact,’ ‘taft’ & ‘taut’),
and then picks whichever candidate has the highest probability of
being the chosen word. The whole thing’s really that simple!
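A quick REPL check of that example (assuming the corpus has been loaded):

```lisp
;; 'taht' is not in the corpus, so CORRECTION picks the most probable
;; of its one-edit candidates; with big.txt this should be "that",
;; by far the most frequent of them.
(correction "taht")
```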
CANDIDATES
(defun candidates (word)
(or (known (list word))
(known (edits-1 word))
(known (edits-2 word)) (list word)))
CANDIDATES returns the first non-empty set of potential words that a
word might have been intended to be: the word itself, if it’s known
from the corpus; otherwise all known words one edit away from the
word; otherwise all known words two edits away from the word;
otherwise the word itself.
E.g. (candidates "blog")
→ ("log" "bog" "flog" "bloc" "blot" "blow"),
because ‘blog’ is not in the corpus, so instead it finds all
known words one edit away.
EDITS-1
(defun edits-1 (word &aux (letters "abcdefghijklmnopqrstuvwxyz"))
(let* ((splits (loop for i from 0 upto (length word)
collect (cons (subseq word 0 i)
(subseq word i))))
(deletes (loop for (l . r) in splits
unless (string-equal r "")
collect (concatenate 'string l (subseq r 1))))
(transposes (loop for (l . r) in splits
when (> (length r) 1)
collect (concatenate 'string l (subseq r 1 2) (subseq r 0 1) (subseq r 2))))
(replaces (loop for (l . r) in splits
unless (string-equal r "")
nconc (loop for c across letters
collect (concatenate 'string l (list c) (subseq r 1)))))
(inserts (loop for (l . r) in splits
nconc (loop for c across letters
collect (concatenate 'string l (list c) r)))))
(append deletes transposes replaces inserts)))
EDITS-1:
- Finds all potential words created by deleting a single letter from WORD.
- Finds all potential words created by transposing adjacent letters in WORD.
- Finds all potential words created by replacing a letter in WORD with another letter.
- Finds all potential words created by inserting a letter in WORD.
The function is pretty much a direct translation of Norvig’s Python,
and it could use some optimisation: we don’t really need to store the
list of splits; we don’t need to iterate over SPLITS
four times;
we don’t need to iterate over LETTERS
twice.
I do use one unusual Lisp feature, &AUX
arguments. They’re a
convenient way of defining a new variable within the function without
having to clutter things up with a LET
or LET*
form.
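As a sanity check: for a nine-letter word the four lists contain 9 deletes, 8 transposes, 26 × 9 = 234 replaces and 26 × 10 = 260 inserts:

```lisp
(length (edits-1 "somthing"))   ; ⇒ 511
;; Norvig's Python reports 442 for the same word, because his version
;; builds a set and so deduplicates the results.
```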
EDITS-2
(defun edits-2 (word)
  (loop for e1 in (edits-1 word)
        nconc (loop for e2 in (edits-1 e1)
                    collect e2)))
EDITS-2 finds all words two edits away from WORD: first it gets
all words one edit away, then it collects all words one edit away from
those words. Note that it doesn’t do anything to deduplicate the
results, e.g. ‘teh’ would generate the deletions ‘eh,’ ‘th’ & ‘te,’
each of which would generate 26 insertions, which would each
regenerate ‘teh’ (by inserting ‘t,’ ‘e’ & ‘h,’ respectively). This is
okay in this toy code, because the only thing we end up doing with the
results is calling P; it might even be more expensive to deduplicate.
That’s something to check later.
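If deduplication ever turns out to be worth it, one variant to benchmark (EDITS-2-UNIQUE is a name I’ve made up; it’s a sketch, not something I’ve measured) would be:

```lisp
;; A deduplicating variant of EDITS-2: REMOVE-DUPLICATES with a
;; string test collapses the many paths that regenerate the same word.
(defun edits-2-unique (word)
  (remove-duplicates
   (loop for e1 in (edits-1 word)
         nconc (edits-1 e1))
   :test #'string=))
```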
KNOWN
(defun known (words) (remove-if-not (lambda (x) (gethash x *words*)) words))
KNOWN just filters a list of words, removing any which weren’t seen
in the corpus.
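For example (again assuming big.txt has been loaded):

```lisp
;; "the" is certainly in the corpus; "zzzz" almost certainly isn't:
(known '("the" "zzzz"))   ; ⇒ ("the")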
Test suite
I also translated the test suite from Norvig’s original article.
(ql:quickload "split-sequence")
(defpackage test-spelling-corrector
(:use #:cl #:spelling-corrector #:split-sequence)
(:export #:make-testset
#:spelltest))
(in-package #:test-spelling-corrector)
(defun make-testset (path)
(with-open-file (file path)
(loop for line = (read-line file nil)
while line for (right wrongs) = (split-sequence:split-sequence #\: line)
nconc (loop for wrong in (split-sequence:split-sequence
#\Space
wrongs
:remove-empty-subseqs t)
collect (cons right wrong)))))
(defun spelltest (tests
&optional verbose
&aux (start (get-universal-time)) (n (length tests)))
(loop with good = 0 and unknown = 0
for (right . wrong) in tests for w = (correction wrong)
do (if (string-equal w right)
(incf good)
(progn
(unless (gethash right spelling-corrector::*words*)
(incf unknown))
(when verbose
(format t
"~&correction(~a) → ~a (~a); expected ~a (~a)"
wrong
w
(gethash w spelling-corrector::*words*)
right
(gethash right spelling-corrector::*words*)))))
finally
(format
t
"~&~,,2f% of ~a correct (~,,2f% unknown) at ~f words per second"
(/ good n) n (/ unknown n) (/ n (- (get-universal-time) start)))))
And then tested against his two test sets:
(test-spelling-corrector:spelltest (test-spelling-corrector:make-testset #P"spell-testset1.txt"))
(test-spelling-corrector:spelltest (test-spelling-corrector:make-testset #P"spell-testset2.txt"))
Which yields the results:
76.666665% of 270 correct (5.5555556% unknown) at 54.0 words per second
69.0% of 400 correct (10.75% unknown) at 40.0 words per second
To compare, Norvig’s Python 3 version generates these results:
75% of 270 correct (6% unknown) at 44 words per second
68% of 400 correct (11% unknown) at 38 words per second
Interestingly, the Lisp version — despite being compiled to machine code — is not appreciably faster than the Python version. This suggests that there are some potential optimisations, which will (hopefully) be the subject of a later blog entry.
1. Note that Lisp package names can contain hyphens.