Anagram Solving in Python – Part 1: The NLTK

An anagram is a fun bit of wordplay that rearranges the letters of one phrase into another. For example, “Andrew Tremblay” is an anagram of “Amble, Wander, Try”, and “Apple Macintosh” rearranged is “Laptop Machines”. Does making a program that finds all the anagrams of your name sound like fun? Well then, keep reading!


First off, if you just want a few anagrams of your own name, there are plenty of websites that claim to be anagram solvers, but they come with a lot of drawbacks:

  • You’re limited to whatever anagrams they find for you, with no explanation of how they found them.
  • They can limit your searches or your word length.
  • They can be slow and still return fewer results.
  • They only work when you’re connected to the Internet.

And most importantly they aren’t nearly as cool as making your own anagram solver!

Setup

The first thing we need to do is set up our programming environment. You need to download and install three things:

  • the Python programming language,
    • The language we’ll be writing in. It’s named after Monty Python so you know it’s legit.
    • You already have this language installed if you’re running a Mac
  • the pip package manager,
    • package managers are great because they let you download code that’s already written
    • you need this to set up the NLTK
  • and the Natural Language Toolkit (NLTK).
    • a super powerful toolkit/library that analyzes words, grammar, and languages
    • we need this so we don’t spend too much time explaining to Python what words are

Once you’ve installed Python, pip, and NLTK, launch the Python shell and see how your versions compare to mine. If you’re using any version of Python 2.7 and somewhere around NLTK version 3, you’re probably fine.

My Anagram Solver Setup (2015-12-11)

Okay! You’re all set up. Let’s get started. Or you can just scroll to the end, where I have the full script.

NLTK and Using a Corpus

Programs are only as smart as you make them, and our anagram solver doesn’t even know what a word is yet. Let’s fix that.

NLTK comes bundled with a slew of corpora, which are, broadly speaking, just files full of words. Download or copy the snippet below into a Python file and run it.
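
The gist isn’t embedded in this copy of the post, but judging by the word count it prints, it loads NLTK’s bundled words corpus. A minimal sketch (the variable name wordlist is mine):

```python
import nltk

# Fetch the bundled word corpus (a no-op if it's already downloaded).
nltk.download('words')

from nltk.corpus import words

wordlist = words.words()
print('%d total words loaded' % len(wordlist))
```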

The output should be something like:

236736 total words loaded

236,736 words! That’s not every single word in the English language, but it’s good enough to start.

Introducing: FreqDist

Okay, now that we have words, how do we find the anagrams? Well, let’s start with some jumbled letters that I’d like to solve as an anagram. Add the following line to your py file:
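
The original line isn’t embedded here; presumably it just assigns the jumble to a variable, which the rest of the code refers to as jumbled_letters:

```python
# The letters we want to solve: all lowercase, no spaces.
jumbled_letters = 'andrewtremblay'
```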

My name has 14 letters. If you wrote down every arrangement of ‘a’, ‘n’, ‘d’, etc., onto a lined sheet of paper, that paper would need to be 87,178,291,200 (14 factorial) lines long. Inside those lines would be every anagram my name could make, as well as my own name, as well as a whole lot of gibberish. That’s quite a large set of combinations for my computer to search through, so let’s trim that down a bit.

Three of my letters (e, a, and r) each occur twice, so out of my 14! = 87,178,291,200 possible combinations, there are only 14! / (2! × 2! × 2!) = 10,897,286,400 unique permutations that we need to look at. We already shrank our search space down to an eighth of what it was!
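
If you want to sanity-check that arithmetic, the Python standard library can do it for you:

```python
import math

total = math.factorial(14)     # 87178291200 arrangements of 14 letters
unique = total // (2 * 2 * 2)  # divide by 2! for each of the three repeated letters
print(total, unique)           # 87178291200 10897286400
```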

You’d be surprised, but one of the main differences between good programming and bad programming is knowing about the data you’re dealing with. By knowing the frequency of each letter in the jumbled letters (in this case, my name), you save yourself a lot of hassle.

NLTK has a way to compute this frequency distribution of the letters in a word: it’s called FreqDist:
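
Here’s a quick taste. FreqDist counts any sequence you hand it, so passing it a string counts letters:

```python
from nltk import FreqDist

# Count how often each letter occurs in the jumble.
jumble_dist = FreqDist(jumbled_letters)

print(jumble_dist['a'])  # 2 -- 'a' appears twice in 'andrewtremblay'
print(jumble_dist['x'])  # 0 -- letters that never appear count as zero
```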

FreqDist is great, and you can read all about it in Chapter 1 of the NLTK book. For now, though, just think of it as a dictionary of letter occurrences (although it is a LOT more than that).

We’re going to use this to trim our corpus. Remember our corpus? This part’s about the corpus.
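
The gist isn’t embedded in this copy, but a minimal sketch of that filtering step, reusing wordlist and jumble_dist from above (the helper name fits_in_jumble is mine), looks like this:

```python
from nltk import FreqDist

def fits_in_jumble(word, jumble_dist):
    """True if the word uses no letter more times than the jumble contains it."""
    word_dist = FreqDist(word.lower())
    return all(word_dist[letter] <= jumble_dist[letter] for letter in word_dist)

matching_wordlist = [w for w in wordlist if fits_in_jumble(w, jumble_dist)]
print('%d words could fit in the jumble' % len(matching_wordlist))
```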

In the snippet above we take our initial code that loads the corpus and sets our jumble, then filter the word list by comparing frequency distributions. If a word in our corpus needs more of any letter than the jumble can supply, it is discarded. For example, “axe” is discarded because even though it has “a” and “e”, there is no ‘x’ in ‘andrewtremblay’, so using it in an anagram would be impossible.

Sorted Lists for the Win

If we count our matching_wordlist now, we’ll get 1,794 words, which still allows a lot of different word combinations. Even if we are looking for three-word anagrams only, that’s 1,794 * 1,793 * 1,792 = 5,764,222,464 possible anagrams. That goes up to 5,773,874,184 permutations if we can reuse words (let’s not deal with that).

Here’s a bombshell: we don’t need to keep the original order of our wordlist! Let’s start by sorting the words by their length (longest first).
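
That sort is a one-liner in Python:

```python
# Longest words first, so oversized candidates get ruled out early.
sorted_wordlist = sorted(matching_wordlist, key=len, reverse=True)
```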

It might not be obvious, but now that we’re working with a sorted list, we can run our comparisons much more quickly. Here are the facts about sorted_wordlist:

  • The first word will always be equal to or longer than the second word,
  • The second word will always be as long as or longer than the third,
  • There’s power in that knowledge

Finally Getting Anagrams

Let’s finally get to an okay solution. For this case, let’s solve for all anagrams of up to three words.
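
The original gist isn’t embedded in this copy, but here’s a rough sketch of the same approach, built on the sorted_wordlist and jumble_dist from earlier (the helper remaining_after is my own naming):

```python
def remaining_after(word, letter_dist):
    """Return the letters left over after removing word from letter_dist,
    or None if word needs a letter the distribution doesn't have."""
    leftover = letter_dist.copy()
    for letter in word:
        if leftover[letter] <= 0:
            return None
        leftover[letter] -= 1
    return leftover

anagrams = []
count = len(sorted_wordlist)
i = 0
while i < count:
    first = sorted_wordlist[i]
    rest1 = remaining_after(first.lower(), jumble_dist)
    if rest1 is not None:
        if sum(rest1.values()) == 0:  # used every letter: a one-word anagram
            anagrams.append([first])
        else:
            j = i  # never revisit earlier (longer) words
            while j < count:
                second = sorted_wordlist[j]
                rest2 = remaining_after(second.lower(), rest1)
                if rest2 is not None:
                    if sum(rest2.values()) == 0:  # a two-word anagram
                        anagrams.append([first, second])
                    else:
                        k = j
                        while k < count:
                            third = sorted_wordlist[k]
                            rest3 = remaining_after(third.lower(), rest2)
                            if rest3 is not None and sum(rest3.values()) == 0:
                                anagrams.append([first, second, third])
                            k += 1
                j += 1
    i += 1

# Write every anagram to a csv named after the jumble, one anagram per row.
with open(jumbled_letters + '_anagrams.csv', 'w') as out:
    for combo in anagrams:
        out.write(','.join(combo) + '\n')

print('%d anagrams found' % len(anagrams))
```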

The above sketch is a rough-but-working example of finding one-, two-, and three-word anagrams from a jumbled_letters variable. It’s quite long, but really it’s only using the two things we introduced here: NLTK and frequency distributions. Looking closer, you’ll see that the nested while loops all basically do the same thing:

  1. Get the next word in the sorted_wordlist
  2. If the word has the exact same letters as the remaining letters, you’ve found an anagram!
  3. If the word fits into some of the remaining letters in our jumble, remove the word from the remaining letters and start again at step #1 with the remaining letters.

This process repeats for every word in the list, but a search branch stops once three fitting words have been found, anagram or not. This means we will only ever get a three-word anagram at most.

The sharper-eyed among you will notice that, in order to save on iterations, we take advantage of the sorted wordlist by never going back and repeating the comparison for earlier words. The reason? Comparing the frequency distributions of two words is an expensive operation, so minimizing the number of times we do it is essential. We can solve this with a bit of common sense:

Two words, each six letters long, cannot be an anagram of an 11-letter jumble.

– Common Sense

This is why we sorted our words by length, longest first: once we find a word that fits, all previous words that we checked are either (1) the same length but with an unfit letter distribution, or (2) too long, which automatically disqualifies them. This vastly improves the speed of our program, because we’re never comparing the same word combination twice, and we’re never wasting time on word combos that are clearly too long to be an anagram.

After all the anagrams are found, they’re written to a csv (comma-separated values) file in the same location as the py file, with a filename matching whatever anagram jumble you put in.

For example: ‘andrewtremblay_anagrams.csv’ looks like this:

Perfect for sorting! Excel and Google Sheets both support this filetype.

Take the script for a spin and see if there are any improvements you can make. I can think of two just by looking at the csv example I pasted.

So What’s Next?

This tutorial code could be cleaned up in a number of ways, which we’ll handle in the next tutorial. But here are a few more things to think about before then:

When people point out anagrams, they’re usually actually talking about aptagrams: anagrams whose meaning says something poignant about the original phrase. As another example, “astronomers” is an anagram/aptagram of “moon starers”. How could we use language analysis to filter out the nonsense anagrams and find these kinds of statements? The answer lies inside the Natural Language Toolkit.

What if you want to prefilter for specific words, like having ‘an’ or ‘and’ in all the anagrams?

What if you want to try anagrams using a different corpus, like finding anagrams that only use Shakespearean words? Well, all you have to do is import that corpus or write your own. See the NLTK site for a list of corpora and how to download them.