Syllables Matter

The Natural Language Toolkit

The software I am writing for Dr. Kelly calculates a Flesch-Kincaid grade level for every assignment he grades. The formula is straightforward enough.
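For reference, the standard published form of the Flesch-Kincaid grade level formula can be sketched in a few lines of Python. The function name and argument names are illustrative only and are not taken from Dr. Kelly's software, and the counts in the example are made up.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Return the Flesch-Kincaid grade level for the given counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    # float() keeps the division exact on older Python versions too.
    words_per_sentence = total_words / float(total_sentences)
    syllables_per_word = total_syllables / float(total_words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Made-up counts for illustration: 100 words, 5 sentences, 150 syllables.
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```

Because syllables per word is multiplied by 11.8, the syllable count dominates the result: counting 160 syllables instead of 150 in the same 100 words raises the grade from 9.91 to 11.09.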

However, I was surprised to find that different implementations of this formula diverge dramatically from each other. The same text can have wildly different grade levels depending on the software you use.

For example, Microsoft Word says that this blog entry reads at a grade 7 level, but another implementation says that it reads at a grade 9 level. Who is right? How does this happen?

It turns out that writing code to count the syllables in a word is not easy. Early this summer, I tried to use the Natural Language Toolkit (NLTK) to count the syllables in a word, but the code did not work. I threw the problem into the too-hard basket for a while and came back to it last week. A new version, patched by Alex Rudnick, has been posted, and it works.
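To see why counting syllables is hard, consider a deliberately crude counter that treats each run of vowels as one syllable and drops a trailing silent 'e'. This is a generic heuristic written for illustration, not the NLTK code, and the function name is hypothetical.

```python
import re


def naive_syllable_count(word):
    """Estimate syllables by counting runs of vowels; crude on purpose."""
    word = word.lower()
    # Drop a trailing 'e', which is often silent (as in "make").
    if word.endswith("e"):
        word = word[:-1]
    # Each run of consecutive vowels counts as one syllable.
    return len(re.findall(r"[aeiouy]+", word))


for word in ["hello", "make", "the", "table"]:
    print("%s %d" % (word, naive_syllable_count(word)))
```

This gets "hello" (2) and "make" (1) right, but it reports 0 for "the" and 1 for "table", because the silent-e rule fires on words where the final 'e' is actually sounded or marks a syllable. Every rule you add to fix one word tends to break another.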

Here are two lines of code that show you how easy it is to count syllables in a word using the NLTK:

>>> from nltk_contrib.readability.textanalyzer import syllables_en
>>> print syllables_en.count("hello")

If you get the latest version of nltk_contrib, the code above should run without trouble. I did find one small problem, however: this code insists that the word ‘the’ has zero syllables. The fix was easy enough, but I am not sure this is how the maintainers would fix it:

Edit the file called syllables_en.py. It contains a list of special cases called specialSyllables_en. Add ‘the’ to the list, and specify that it has one syllable.

It is easy to see how this list works. For example, “Mr.” has two syllables, but there is no easy way for code to figure out that “Mr.” is pronounced “mister.” I may find other exceptions as I go.
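The general idea of an exception list can be sketched as a lookup that runs before any rule-based counting. The actual format of specialSyllables_en is not shown here, so the dictionary, the placeholder heuristic, and all names below are assumptions made for illustration.

```python
import re

# Hypothetical exception table: words a rule-based counter gets wrong.
SPECIAL_SYLLABLES = {
    "the": 1,
    "mr.": 2,  # pronounced "mister"
}


def heuristic_count(word):
    """Placeholder heuristic: count runs of vowels."""
    return len(re.findall(r"[aeiouy]+", word.lower()))


def count_syllables(word):
    """Check the exception table first, then fall back to the heuristic."""
    key = word.lower()
    if key in SPECIAL_SYLLABLES:
        return SPECIAL_SYLLABLES[key]
    return heuristic_count(key)


print(count_syllables("Mr."))    # 2, from the exception table
print(count_syllables("hello"))  # 2, from the heuristic
```

Without the table entry, the placeholder heuristic would return 0 for "Mr." because the abbreviation contains no vowels.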

However, having fussed with the code and downloaded the latest version, I am pleased to report that the Flesch-Kincaid score I generate matches the score generated by Microsoft Word. My syllable count is correct, and the rest just follows.

My code will probably never agree perfectly with Microsoft Word on every text. It turns out that English speakers from different parts of the world pronounce some words with different numbers of syllables. Also, the code in the Natural Language Toolkit is sometimes wrong. For example, it says that the word ‘calculates’ has four syllables, when it has three. However, it is close enough in the great majority of cases. You should give it a try.