NLTK in Python – Introduction to Corpora


Corpora. Now that’s an intimidating word, isn’t it? Especially when it’s up there in its big huge letters. Just letting you think about yourself. About what you’ve done with your life other than learn what “corpora” means. Truthfully, though, a corpus is nothing more than a collection of text, and “corpora” is simply the plural. A corpus is usually a LARGE collection of words; the entire works of Shakespeare is a commonly used one, and “Romeo and Juliet” could then be considered a single text inside the Shakespeare corpus. But really a corpus can be as simple as a file full of words.

That means loading your own “corpus” could look as simple as this:
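
A minimal sketch of the idea in plain Python, assuming the file holds whitespace-separated words. The `load_corpus` helper and the five-word stand-in file are made up for illustration, so the snippet runs without the real 1000_words.txt:

```python
import tempfile
from pathlib import Path


def load_corpus(path):
    """Read a plain-text file and return its words as a list."""
    text = Path(path).read_text(encoding="utf-8")
    return text.split()  # split on any run of whitespace


# Write a tiny stand-in for 1000_words.txt into a temporary folder
# (hypothetical contents, only here so the example is self-contained).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "1000_words.txt"
    sample.write_text("the\nof\nand\nto\na\n", encoding="utf-8")
    corpus = load_corpus(sample)

print(len(corpus), "words loaded:", corpus)
```

With a real file on disk, `load_corpus("1000_words.txt")` would be the whole job.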

That 1000_words.txt file is available in the full gist here:

Why Should You Care?

The corpus is one of the main cornerstones of all language processing. It’s the experience that your program draws from. You can use it to tell whether new text your program receives mentions a certain topic (without just looking for explicit words), whether it states something that is actually correct or incorrect, or even whether it expresses a certain emotion.

There’s no need to create your own corpora straight off the bat. There are a great many corpora riding along with NLTK, all of which can be found in the NLTK Downloader feature, easily accessible from inside the Python shell.

Load corpora through the handy dandy NLTK downloader

Among the selection in the NLTK Downloader is a variety of historic corpora. I find myself using the following ones most often:

  • A Selection of Public Domain Books from Project Gutenberg
  • The Collected Works of William Shakespeare
  • Every Inaugural Address by every US President
  • The Carnegie Mellon Pronouncing Dictionary and the Brown Corpus
  • The Chat-80 data files (a geography database for question answering, not actual chat logs)
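
Here is a rough sketch of grabbing one of these from code instead of clicking through the downloader window. The package ids in `PACKAGES` are what I understand the downloader index to call these corpora, so treat them as assumptions and check them against the list in the downloader. The `fetch` helper is just a name made up for this example, and the download needs a network connection:

```python
import nltk

# Downloader ids for the corpora listed above (assumed from the NLTK index).
PACKAGES = {
    "Project Gutenberg selection": "gutenberg",
    "Shakespeare": "shakespeare",
    "Inaugural addresses": "inaugural",
    "Carnegie Mellon Pronouncing Dictionary": "cmudict",
    "Brown Corpus": "brown",
    "Chat-80": "chat80",
}


def fetch(package_id):
    """Download one package; nltk.download returns True on success."""
    return nltk.download(package_id, quiet=True)


# Calling nltk.download() with no arguments opens the interactive
# downloader instead of fetching a single package.
downloaded = fetch(PACKAGES["Project Gutenberg selection"])

if downloaded:
    from nltk.corpus import gutenberg

    print(gutenberg.fileids()[:3])  # a few of the bundled books
    hamlet = gutenberg.words("shakespeare-hamlet.txt")
    print(len(hamlet), "tokens in Hamlet")
else:
    print("Download failed; check your network connection.")
```

The other entries work the same way: download by id, then import the matching reader from `nltk.corpus`.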

But the real adventure starts once you strike out and build your own corpora for your own projects. There’s no glory in reinventing the wheel, however. Feel free to google around the internet or use my own corpus findings here.
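
If you do build your own, one way to wire a folder of text files into NLTK is `PlaintextCorpusReader`. This is only a sketch: the two file names and their contents are invented for the example, and a temporary folder stands in for wherever your real files live. Reading words this way needs no extra downloads, since the reader's default word tokenizer is a plain regular expression.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as root:
    # Two made-up documents standing in for a real collection.
    Path(root, "intro.txt").write_text(
        "Corpora are collections of text.", encoding="utf-8"
    )
    Path(root, "notes.txt").write_text(
        "A corpus can be one small file.", encoding="utf-8"
    )

    # Treat every .txt file under root as one document in the corpus.
    my_corpus = PlaintextCorpusReader(root, r".*\.txt")
    file_ids = my_corpus.fileids()
    # words() is lazy, so materialize it before the folder is removed.
    intro_words = list(my_corpus.words("intro.txt"))

print(file_ids)
print(intro_words)
```

From there the reader behaves like the built-in corpora: ask it for `fileids()`, then pull `words()` or `raw()` for any document.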