The software I am writing for Dr. Kelly calculates a Flesch-Kincaid grade level for every assignment he grades. The formula is straightforward enough.
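For reference, the Flesch-Kincaid grade level is usually given as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. Here is a minimal sketch of that arithmetic. The function name and the error handling are my own choices for illustration, not taken from any particular library.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Return the Flesch-Kincaid grade level from three raw counts.

    Hypothetical helper for illustration; the constants are the
    commonly published ones for the grade-level formula.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / float(total_sentences)
    syllables_per_word = total_syllables / float(total_words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# 100 words in 5 sentences with 150 syllables:
# 0.39 * 20 + 11.8 * 1.5 - 15.59
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # prints 9.91
```

The word and sentence counts are easy to get right. The syllable count is the input that takes real work.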
However, I was surprised to find that different implementations of this formula diverge dramatically from each other. The same text can have wildly different grade levels depending on the software you use.
For example, Microsoft Word says that this blog entry reads at a grade 7 level, but another implementation says that it reads at a grade 9 level. Who is right? How does this happen?
It turns out that writing code to count the syllables in a word is not easy. Early this summer, I tried to use the Natural Language Toolkit (NLTK) to count the syllables in a word, but the code did not work. I threw the problem into the too-hard basket for a while and came back to it last week. A new version, patched by Alex Rudnick, has been posted, and it works.
Here are two lines of code that show you how easy it is to count syllables in a word using the NLTK:
>>> from nltk_contrib.readability.textanalyzer import syllables_en
>>> print(syllables_en.count("hello"))
If you get the latest version of nltk_contrib, the code should run without trouble. I did find one small problem, though: this code insists that the word ‘the’ has zero syllables. The fix was easy enough, but I am not sure this is how the maintainers would fix it:
Edit the file called syllables_en.py. It contains a list of special cases named specialSyllables_en. Add ‘the’ to the list, and specify that it has one syllable.
It is easy to see how this list works. For example, “Mr.” has two syllables, but there is no easy way for code to figure out that “Mr.” is pronounced “mister.” I may find other exceptions as I go.
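To show the idea behind an exception list, here is a small sketch of my own. It is not the nltk_contrib code: the table name SPECIAL_SYLLABLES, the function count_syllables, and the crude vowel-group fallback are all assumptions made for illustration, and the real specialSyllables_en may be structured differently.

```python
import re

# Hypothetical override table, loosely modelled on the idea of
# specialSyllables_en. Keys are lower-cased words; values are the
# syllable counts to report instead of the heuristic's guess.
SPECIAL_SYLLABLES = {
    "the": 1,
    "mr.": 2,  # pronounced "mister"
}

# One or more consecutive vowels (including "y") count as one group.
VOWEL_GROUP = re.compile(r"[aeiouy]+")


def count_syllables(word):
    """Check the override table first, then fall back to a rough guess."""
    key = word.lower()
    if key in SPECIAL_SYLLABLES:
        return SPECIAL_SYLLABLES[key]
    groups = VOWEL_GROUP.findall(key)
    # Treat a trailing "e" as silent (as in "made"), but not in "-le"
    # endings such as "table", where it does form a syllable.
    if key.endswith("e") and not key.endswith("le") and len(groups) > 1:
        return len(groups) - 1
    return max(1, len(groups))


print(count_syllables("hello"))  # prints 2
print(count_syllables("Mr."))    # prints 2, from the override table
```

The lookup comes first so that a word the heuristic cannot handle never reaches it. Without the “mr.” entry, this fallback would find no vowels at all and report one syllable.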
However, having fussed with the code and downloaded the latest version, I am pleased to report that the Flesch-Kincaid score I generate matches the score generated by Microsoft Word. My syllable count is correct, and the rest just follows.
My code will probably never agree perfectly with the score generated by Microsoft Word. It turns out that some English words are pronounced with a different number of syllables by English speakers from different parts of the world. Also, the code in the Natural Language Toolkit is sometimes wrong. For example, it says that the word ‘calculates’ has four syllables, when it has three. However, it is close enough in the great majority of cases. You should give it a try.
Since early June, a professor from Carleton University and I have been designing and building software to help people write more clearly. Professor John Medicine Horse Kelly is a journalism professor who won a teaching award for a system of instruction he developed during his twenty years of teaching.
He is an interesting man, humble and soft-spoken. He is a natural listener, and he will listen to almost anything you have to say with a gentle and sympathetic smile. But if you can stop talking and start listening, he will fill any silence you create with lovely stories about his grandfather, Peter Kelly, or Bill Reid, or any of his Haida relatives and friends. His soft-spokenness masks deep passions. One is teaching.
He has developed a system for teaching students to write well by identifying patterns that students can edit to improve their writing. I have written code that uses the Natural Language Toolkit to read students’ assignments and to identify these patterns. Even I have learned to apply Dr. Kelly’s principles to create writing that is clear, lively and easy to understand. Here are things to watch for:
I was able to improve this article by using our tool. The first draft of this blog entry read at the grade 10.5 level, and it contained several complexifiers. I was able to eliminate these words, and my writing improved. The draft you are now reading reads at a grade 8.6 level. As I use Dr. Kelly’s system, I see that it works.
Plain writing is important. People can find it hard to understand medical consent forms, legal documents, and privacy policies. Is it ethical to write important documents that ordinary people cannot understand? Is it smart?
In the next few weeks, I am going to write about our tool. We call it WISE – Writing Instruction Software for Educators. Recently, we realized that we may have to call it Writing Instruction Software for Everyone. Who knows, if the code is easy to use, and if the principles can be easily explained, we may share the code with anybody who wants to try it. We have to get some legal advice first, but we hope to make a file available for download some time between now and Christmas.