Anagram Solving in Python – Part 2 : Recursion

In my last post about anagram solving I talked about how to set up a decent way to get anagrams out of a string of letters. This time I’m going to take it a step further and refactor the code into a recursive function so that you can find anagrams beyond the current three word limit.

Our Anagram Solver So Far

Here’s what we ended on last time: A working anagram solver for NLTK and Python.

and here’s what the output looked like in the csv file that it wrote:

Removing “Weird” Words and Duplicates

The first reaction you’re probably thinking is: “m”? “r”? those aren’t words! Well unfortunately, our original corpus thinks they are. Remember when I told you in the last post that computer programs are only as smart as you make them? Well apparently we didn’t make them smart enough. The corpus that we selected out of the NLTK ( nltk.corpus.words.words() ) is also the default dictionary used for spellchecking on Unix machines. And in that dictionary, single letters are included since they are commonly used for abbreviations.  Looking further down the list you’ll see that some two-letter “words” like “wy,” “wa,” and “te” also get through in the results.

These are actually english words, but only in the most technical of terms. If that doesn’t bother you, fine. Maybe you want the abbreviation for Wyoming in your results. Personally, I don’t, and the easiest way to getting around these false positives is to trim out all words in the corpus that are less than three letters long and then re-adding the common words like ‘a’, ‘an’, ‘and’, ‘so’, ‘on’, and so on. Add “len(w) > 2” to your trimmed_lowercase_wordlist declaration.

Then you can extend the list with a hand-defined list of two or one letter words. Here’s mine.

That takes care of most of the weird results while keeping some of the common two letter words relevant. You could also try appending the nltk.corpus.stopwords.words(“english”) wordlist (a list of words composed only of words like “the”, “who”, etc.).

Alternatively (and arguably even easier than tweaking a corpus, which is what we’re doing), we could look online or on the NLTK website for a better corpus to find Anagrams. I’ve heard the WordNet corpus is a decent one, but out of laziness simplicity I haven’t included it in this tutorial.

You also noticed that there are some duplicates. Why are there duplicates? Well I looked into it and it seems like our wordlist has duplicates in them. The easiest way to remove duplicates is to convert our list of words to a set of words and then back to a list (before we sort it, of course).

As long as our wordlist isn’t astronomically large, this will not be an expensive operation.

Better Living Through Recursion

Another problem with the original code is that it’s just too long for what it does. Like a bad conversation, it repeats itself often.

Let’s look at our code again, what are the absolutely essential objects that we need to find an anagram?

  • A bunch of jumbled letters.
  • A list of words that might be inside these jumbled letters.

Let’s write a function with those two objects as parameters, which returns a list of anagram results.

Something like that to start with. Now let’s add a variation of our current word finding iterator (making sure that we copy our parameters to local values).

Why do we copy? Well, because Python doesn’t copy objects you pass during a function call ever. And since we’re recursing with list objects, we need to make sure we aren’t having one call to this function alter the data in another. If you don’t believe me, remove the copy lines (marked with #copy) and see what happens. It’s a mess.

Alright, let’s re-add our wordlist prep code in the main function.

and run! What do we get?

A lot of results, huh? Well look again: Those matches are repeating each time in different orders. If you let this process run with 14 characters like in my example you’d be there for at least an hour watching the program repeat itself over and over and over.

So let’s change one line of code and then add one line of code to shorten our recursion considerably.

See what we’re doing? Basically what our index-related approach was doing. We’re splitting the sorted list so that we only pass the remaining words to the recursive function call. Since splitting the list doesn’t affect the sorted order of the subsection, then we can safely do that without losing any possible combinations (including anagrams with a word repeating multiple times).

Here’s our program as a recursive function, much nicer looking.

If we run the program again, we see things running a LOT faster. However, the process still never seems to end. This is because 4-word anagrams take several times longer to find than three- or two-word anagrams. Out of my 82,204 unique anagram results, 75,419 were 4-word anagrams while only 6783 were three-word anagrams and only 28 two-word anagrams.

anagram_type_amts_img

In our next post, we’ll look at some more advanced optimizations that will trim this long-ass process down to less than a minute or two. Read on to find out!