’Tis hard to say, if greater Want of Skill
Appear in Writing or in Judging ill
1 Prolegomenon
Automated Essay Scoring has been contemplated as an application of machine learning since its earliest days. The ETS began using its proprietary e-rater in 1999, and the system, alongside a human cohort, now grades the SAT essay. In 2012, the Hewlett Foundation sponsored the Automated Student Assessment Prize (ASAP), offering a $100,000 reward for the best scoring system. Not long after, Mark D. Shermis and Ben Hamner1 found that automated scoring systems performed similarly to human graders, a claim met with both praise and skepticism. Les Perelman, for example, inveighed that e-rater looked for particular stylistic cues without considering their rhetorical effect:
E-Rater, [Perelman] said, does not like short sentences.
Or short paragraphs.
Or sentences that begin with “or.” And sentences that start with “and.” Nor sentence fragments.
However, he said, e-Rater likes connectors, like “however,” which serve as programming proxies for complex thinking. Moreover, “moreover” is good, too.
Gargantuan words are indemnified because e-Rater interprets them as a sign of lexical complexity. “Whenever possible,” Mr. Perelman advises, “use a big word. ‘Egregious’ is better than ‘bad.’”
And in a more thorough rejoinder,2 Perelman contests the statistical results as cherry-picked:
The clearest omission is the failure of the authors to report the fairly large percentage of machine values for the Pearson and the Quadratic Weighted Kappa that fell below the minimum standard of 0.7. […] Any value below 0.7 will be predicting significantly less than half the population and, because this is an exponential function, small decreases in value produce large decreases in the percentage accurately predicted. […] Yet for the Quadratic Weighted Kappa, 28 of the 81 machine scores, 35.6%, are below the minimally acceptable level of 0.7, even though the machines had the advantage in half of the essay sets of matching an inflated Resolved Score. In contrast, the human readers, who had to match each other with no artificial advantage, had only one Quadratic Weighted Kappa below 0.7, for the composite score on essay set #8 or only 1 out of 9 or 11.1%.
Besides these issues, and the ethics of eschewing a human reader’s eye, criticism of these systems has focused on the ease of gaming them: Donald E. Powers et al.,3 for example, managed to finagle higher scores from e-rater than human raters were willing to grant (though not lower ones). Perelman himself, in response to a prompt about whether “the rising cost of a college education is the fault of students who demand […] luxuries”, wrote an essay, excerpted below, which, despite earning e-rater’s highest possible score of 6, is laden with solecisms, factual errors, and non sequiturs, including a full line of Allen Ginsberg’s “Howl” (the full essay is reproduced in Appendix A):
I live in a luxury dorm. In reality, it costs no more than rat infested rooms at a Motel Six. The best minds of my generation were destroyed by madness, starving hysterical naked, and publishing obscene odes on the windows of the skull. Luxury dorms pay for themselves because they generate thousand and thousands of dollars of revenue. In the Middle Ages, the University of Paris grew because it provided comfortable accommodations for each of its students, large rooms with servants and legs of mutton. Although they are expensive, these rooms are necessary to learning. The second reason for the five-paragraph theme is that it makes you focus on a single topic. Some people start writing on the usual topic, like TV commercials, and they wind up all over the place, talking about where TV came from or capitalism or health foods or whatever. But with only five paragraphs and one topic you’re not tempted to get beyond your original idea, like commercials are a good source of information about products. You give your three examples, and zap! you’re done. This is another way the five-paragraph theme keeps you from thinking too much.
With the above criticisms leveled, I should disclaim that I am training a model to predict essay scores, not to score essays, which is a much harder task (and should be held to a much higher standard) and not an obviously meaningful thing to ask of a mathematical model to begin with. However, the results show that much—even if not all—of what constitutes an essay grade is not the je ne sais quoi only a human evaluator can glimpse, but rather mechanical issues that can be straightforwardly calculated and modeled.
2 Data exploration & cleaning4
2.1 Essay set selection
The corpus is in the form of 13,000 essays, totaling 2.9 million words—more than twice the length of Proust’s In Search of Lost Time. The length, however, was not as immediate an obstacle as the composition, shown in tbl. 1. The eight essay sets were not only responding to different prompts, but were of different lengths and genres, written by students of different grade levels, and, most importantly, scored using incommensurate rubrics and scoring protocols.
Essay set | Grade level | Genre | Train size | Test size | Avg. length (words) | Rubric range | Resolved score range | Adjudication |
---|---|---|---|---|---|---|---|---|
1 | 8 | Persuasion | 1,785 | 592 | 350 | 1–6 | 2–12 | Sum if adjacent, else third scorer |
2 | 10 | Persuasion | 1,800 | 600 | 350 | 1–6, 1–4 | 1–6, 1–4 | First |
3 | 10 | Exposition | 1,726 | 575 | 150 | 0–3 | 0–3 | Higher if adjacent, else third scorer |
4 | 10 | Exposition | 1,772 | 589 | 150 | 0–3 | 0–3 | Higher if adjacent, else third scorer |
5 | 8 | Exposition | 1,805 | 601 | 150 | 0–4 | 0–4 | Higher |
6 | 10 | Exposition | 1,800 | 600 | 150 | 0–4 | 0–4 | Higher |
7 | 7 | Narrative | 1,730 | 576 | 250 | 0–15 | 0–30 | Sum |
8 | 10 | Narrative | 918 | 305 | 650 | 0–30, 0–30 | 0–60 | Sum if adjacent, else third scorer |
Limiting myself to a single essay set would have produced a somewhat feeble model, as words idiosyncratic to the topic in question would have become artificially elevated in importance. In the end, I combined sets 3 and 4, which both consisted of expository essays written by tenth graders, graded on a scale from 0 (worst) to 3 (best). These scores are holistic, i.e., not broken down into categories representing grammar and mechanics, relevance, organization, etc., which makes them easier for a model to predict.
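In code, the subsetting is a single filter. The sketch below is a minimal version; the file name, separator, and encoding reflect the public Kaggle release of the dataset rather than anything shown elsewhere in this report:

import pandas as pd

# Load the ASAP training data (file name and encoding as in the public Kaggle release)
essays = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

# Keep only the two expository sets written by tenth graders
essays = essays[essays["essay_set"].isin([3, 4])].reset_index(drop=True)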
2.2 Data cleaning
The scores are broken down, for each essay set, into “domain scores” representing the valuations of the individual scorers. In the interest of having a single number to try to predict, I combined these scores by taking the mean:
import numpy as np

# If only one score exists, use that. Otherwise, take the mean of both scores.
# (np.nanmean ignores the NaN in domain2_score for sets with a single domain.)
essays["score"] = list(map(np.nanmean, zip(essays["domain1_score"], essays["domain2_score"])))
We can then look at the way scores are distributed among the essays in our chosen subset.
In fig. 1, we see that the scorers of the fourth essay set were somewhat less lenient than those grading the third, the latter of whom awarded the highest score to a full quarter of the papers and the lowest score of 0 to only 39 unhappy test-takers. Putting these together, we have a roughly normal-looking distribution, with many ones and twos and fewer zeroes and threes. This gives us a baseline to use for the modeling below: a dumb model that assigned every essay to the plurality score class, giving every essay a score of 1, would have an accuracy of 35%. This is the number our models must beat.
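The baseline itself is a one-liner to verify (a minimal sketch, reusing the essays DataFrame and the combined score column from above):

# Plurality-class baseline: predict the most common score for every essay
plurality_score = essays["score"].mode()[0]
baseline_accuracy = (essays["score"] == plurality_score).mean()
print(f"Always predicting {plurality_score:.0f} yields {baseline_accuracy:.0%} accuracy")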
The essays themselves are in little need of cleaning: they are hand-transcribed from the originals, and have been anonymized by replacing named entities, including names, dates, addresses, and numbers, with enumerated placeholders.
2.3 Data exploration5
A basic exploration of the essays shows some striking patterns. For example, as fig. 2 illustrates, score is highly correlated with length: over half the variation in score can be explained by variation in length alone. In other words, all else held equal, adding 82 words corresponds to a one-point increase in score.
One interesting thing we see is that, despite the correlation, there are many essays earning top marks that are almost impossibly short. The following are recorded in the dataset as having earned a top score (both prompts instructed students to use examples from the texts):
The features of the setting affect the cyclist in many ways. It made him tired thirsty and he was near exaustion [sic].6
Because she saying when the @CAPS1 grow back she will be @CAPS2 to take the test again.7
Reserved need to check keenly8
That that gnomic last “essay” (yes, that’s the whole text!) earned a coveted score of 3 is almost certainly an error, though the source of the error (the recording of the scores, the compilation of the dataset, or the scoring process itself) is as mysterious as the cryptic phrase’s meaning. Since there doesn’t seem to be an objective way of pruning these aberrant rows from the dataset, I left them in.
Other measures are telling as well. For instance, we can look at the rate of misspelled words, by tokenizing with spaCy, and counting each token that is not in a given wordlist.9
import spacy

nlp = spacy.load("en")

# Generate wordlist
with open("/usr/share/dict/words", "r") as infile:
    wordlist = set(infile.read().lower().strip().split("\n"))

# Number of words that are misspelled, per essay
essays["misspellings"] = [sum(1 for word in nlp(essay)
                              if not word.is_space
                              and not word.is_punct
                              and not word.text.startswith("@")  # anonymized named entities
                              and not word.text.startswith("'")  # contractions
                              and word.text.lower() not in wordlist)
                          for essay in essays["essay"]]

# Percentage of words misspelled (relative to the essay's total token count)
essays["misspellings"] /= essays["tokens"]
The results, in fig. 3, are curiously complementary to those in fig. 2: the average rate of misspellings is practically the same across score classes, but essays at the extreme, with 10% or more of their words misspelled, are overwhelmingly likely to be low scorers.
The question of assessing prompt-relevance is trickier. One way of tackling it is to calculate the document vector of the story to which the students are responding, and calculate its cosine similarity with the document vector of each essay. We can see the results in fig. 4.
The results aren’t bad, especially considering that the outliers for score 3 are the same bizarrely short essays we saw above, including our Delphic “reserved need to check keenly”.
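Concretely, the similarity computation can be sketched as follows. This assumes the source text each set responded to is available as a string (prompt_text below is a placeholder; in practice, essays from sets 3 and 4 are each compared against their own source text) and that the loaded spaCy model includes word vectors:

# Document vector of the source text the students responded to
prompt_doc = nlp(prompt_text)

# Doc.similarity() is the cosine similarity of the two averaged word vectors
essays["prompt_similarity"] = [nlp(essay).similarity(prompt_doc)
                               for essay in essays["essay"]]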
At this point, we must ask what the value of this metadata is. The ETS claims that its e-rater accounts for prompt-relevance, as well as:
- errors in grammar (e.g., subject–verb agreement)
- usage (e.g., preposition selection)
- mechanics (e.g., capitalization)
- style (e.g., repetitious word use)
- discourse structure (e.g., presence of a thesis statement, main points)
- vocabulary usage (e.g., relative sophistication of vocabulary)
- sentence variety
- source use
- discourse coherence quality
While it would take a sophisticated natural-language parser to incorporate these details into our model, we may be able to approximate them using metadata as proxies. Type–token ratio, for instance, could stand in for “repetitious word use”, and vector similarity to the prompt for relevance. As an alternative to parsing for narrative structure, I included a count of “linking words” that would likely signal a transition between paragraphs,10 but this bore little relationship to the human scorers’ judgments (a sketch of this count, together with the type–token ratio, appears at the end of this section). And as a proxy for sentence complexity, I used spaCy to parse the syntactic tree of each sentence and took the depth of its longest branch, thus rewarding complex sentences with prepositional phrases and dependent clauses.
# Depth of the longest branch in the dependency tree: the deepest token's
# distance (number of ancestors) from the root of its sentence
essays["max_depth"] = [max(len(list(token.ancestors))
                           for token in nlp(essay))
                       for essay in essays["essay"]]
This fared somewhat better as a metric. Finally, I tried to measure “relative sophistication of vocabulary” by quantifying the uncommonness of the words used. I did this by building a word-frequency list from the 14-million-word American National Corpus, the details of which are in Appendix B. The resulting measure correlated well with score, although it was no doubt standing in somewhat for length.
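For concreteness, here is a minimal sketch of the two simpler proxies mentioned above. The linking_words variable is assumed to hold the phrases listed in the footnote; only single-word connectors are counted here, since multi-word phrases like “on the other hand” would need a phrase matcher (e.g., spaCy’s PhraseMatcher):

# Type–token ratio: distinct word forms divided by total word tokens
def type_token_ratio(doc):
    words = [token.text.lower() for token in doc
             if not token.is_space and not token.is_punct]
    return len(set(words)) / len(words) if words else 0.0

essays["type_token_ratio"] = [type_token_ratio(nlp(essay)) for essay in essays["essay"]]

# Linking-word count: how many tokens are single-word transition words
single_word_connectors = {w for w in linking_words if " " not in w}
essays["linking_words"] = [sum(token.text.lower() in single_word_connectors
                               for token in nlp(essay))
                           for essay in essays["essay"]]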
3 Modeling
3.1 Classical models11
As hinted at by the correlations above, we can get fair prediction scores by modeling on metadata alone. The first step, after splitting our essays into train and test sets, is to standardize the data by scaling each column to z-scores. I then ran principal component analysis (PCA) on the data, because many of the columns (e.g., type count and token count) encoded essentially the same information in parallel. The PCA transformation extracts the components that encode the greatest variance; together, the ten components extracted accounted for 98% of the variance within the metadata.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Split into train and test sets (fix the random seed so later splits stay aligned)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Standardize to z-score
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)
# PCA-transform
pca = PCA(n_components=10)
Z_train = pca.fit_transform(X_train_sc)
Z_test = pca.transform(X_test_sc)
The modeling itself is fairly straightforward. I modeled the data both with and without the PCA transform, and found the latter to have a slight edge, although all models achieved similar test scores (tbl. 2).
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier
from sklearn.metrics import cohen_kappa_score

# Fit each classifier on the PCA-transformed data
# (for the non-PCA runs, substitute X_train_sc and X_test_sc)
gnb = GaussianNB().fit(Z_train, y_train)
svm = SVC(kernel="rbf", C=1).fit(Z_train, y_train)
ext = ExtraTreesClassifier().fit(Z_train, y_train)
ada = AdaBoostClassifier().fit(Z_train, y_train)

for model in [gnb, svm, ext, ada]:
    print("Test score:", model.score(Z_test, y_test))
    print("Test kappa:", cohen_kappa_score(model.predict(Z_test), y_test,
                                           weights="quadratic"))
I also included the weighted Cohen’s kappa,12 which was the metric used for the original competition, although Cohen’s kappa is typically used to compare model results to each other, not to a gold standard.
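For reference, quadratic weighting penalizes a disagreement by the square of its distance: writing $O_{ij}$ and $E_{ij}$ for the observed and chance-expected proportions of essays scored $i$ by one rater and $j$ by the other, over $k$ score classes,

$$\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}, \qquad w_{ij} = \frac{(i - j)^2}{(k - 1)^2},$$

so adjacent disagreements cost little, distant ones cost much more, and $\kappa_w = 1$ indicates perfect agreement.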
Model | Test acc. | PCA test acc. | Test κ | PCA test κ |
---|---|---|---|---|
Naïve Bayes | 59.8% | 59.3% | 0.710 | 0.613 |
Support vector machine | 65.1% | 63.6% | 0.690 | 0.674 |
ExtraTrees | 63.7% | 62.4% | 0.695 | 0.679 |
AdaBoost | 60.0% | 56.8% | 0.664 | 0.670 |
The support vector machine and ExtraTrees models performed slightly better than their rivals, and in fact made similar predictions to each other. We should also take into account that on essay sets 3 and 4, the human graders agreed with each other only about 75% of the time, with weighted Cohen’s kappas of 0.77 and 0.85, respectively.13
3.2 Recurrent Neural Network14
One of the state-of-the-art tools in text processing is the recurrent neural network, which consumes ordered data one element at a time while carrying forward a hidden state, allowing it to learn patterns across the sequence. The first step to doing this with word data is to convert the words to numerical indices (so “a” becomes 1, “aardvark” becomes 2, “Aaron” becomes 3, etc.), then pad the sequences to equal length.
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Define vocabulary
vocab = set(token.text for essay in essays["essay"] for token in nlp.tokenizer(essay))
# Convert words to numerical indices, starting at 1 so that 0 can be reserved for padding
# <https://www.tensorflow.org/tutorials/text/text_generation>
word2idx = {u: i + 1 for i, u in enumerate(vocab)}
# Convert essays to vectors of indices
X_vector = [[word2idx[token.text]
for token in nlp.tokenizer(essay)]
for essay in essays["essay"]]
# Create padded sequences
X_vector = pad_sequences(X_vector)
# Split into train and test sets, with the same random_state as before so the
# rows stay aligned with y_train and y_test
X_vector_train, X_vector_test = train_test_split(X_vector, random_state=42)
This then goes into an embedding layer, which condenses it into a dense vector.
With neural networks, it is possible to include both the vectorized document and the metadata, by processing the former in a GRU or LSTM layer, concatenating the latter to its output neurons, and processing both in a regular perceptron structure.15 Following the example in this blog post, I implemented the code below:16
from tensorflow.keras.layers import Dense, GRU, Embedding, Input, Bidirectional, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2
# Define inputs
vector_input = Input(shape=(X_vector.shape[1],)) # Text vectors, in series of length 1,000
meta_input = Input(shape=(X_meta.shape[1],)) # Scaled metadata (types, tokens, etc.)
# Embedding layer turns lists of word indices into dense vectors
rnn = Embedding(
    input_dim = len(vocab) + 1,  # +1 because index 0 is reserved for padding
    output_dim = 128,
    input_length = X_vector.shape[1],
)(vector_input)
# GRU layers for RNN
rnn = Bidirectional(GRU(128, return_sequences=True, kernel_regularizer=l2(0.01)))(rnn)
rnn = Bidirectional(GRU(128, return_sequences=False, kernel_regularizer=l2(0.01)))(rnn)
# Incorporate metadata
rnn = Concatenate()([rnn, meta_input])
# Define hidden and output layers
rnn = Dense(128, activation="relu", kernel_regularizer=l2(0.01))(rnn)
rnn = Dense(128, activation="relu", kernel_regularizer=l2(0.01))(rnn)
rnn = Dense(4, activation="softmax")(rnn)
# Define model
model = Model(inputs=[vector_input, meta_input], outputs=[rnn])
# Compile the model; the loss and optimizer here are reasonable defaults
# (integer score labels with a 4-way softmax output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# Fit model
model.fit([X_vector_train, X_meta_train_sc], y_train,
          validation_data=([X_vector_test, X_meta_test_sc], y_test))
The results are surprisingly close to the models in sec. 3.1 above. Amending our previous table:
Model | Test acc. | Test κ |
---|---|---|
Naïve Bayes | 59.8% | 0.710 |
Support vector machine | 65.1% | 0.690 |
ExtraTrees | 63.7% | 0.695 |
AdaBoost | 60.0% | 0.664 |
RNN | 63.6% | 0.704 |
It seems that the metadata was more valuable in predicting test scores than the vectorized documents—or else, that the RNN couldn’t make better use of the two than a support vector machine could of the one. Nevertheless, I have shown that using a few key linguistic metrics, we can train a simple model to predict essay scores in fairly good agreement with human scorers.
Appendix A: Perelman’s (2012) essay
Prompt:
Question: “The rising cost of a college education is the fault of students who demand that colleges offer students luxuries unheard of by earlier generations of college students—single dorm rooms, private bathrooms, gourmet meals, etc.”
Discuss the extent to which you agree or disagree with this opinion. Support your views with specific reasons and examples from your own experience, observations, or reading.
Response:
In today’s society, college is ambiguous. We need it to live, but we also need it to love. Moreover, without college most of the world’s learning would be egregious. College, however, has myriad costs. One of the most important issues facing the world is how to reduce college costs. Some have argued that college costs are due to the luxuries students now expect. Others have argued that the costs are a result of athletics. In reality, high college costs are the result of excessive pay for teaching assistants.
I live in a luxury dorm. In reality, it costs no more than rat infested rooms at a Motel Six. The best minds of my generation were destroyed by madness, starving hysterical naked, and publishing obscene odes on the windows of the skull. Luxury dorms pay for themselves because they generate thousand and thousands of dollars of revenue. In the Middle Ages, the University of Paris grew because it provided comfortable accommodations for each of its students, large rooms with servants and legs of mutton. Although they are expensive, these rooms are necessary to learning. The second reason for the five-paragraph theme is that it makes you focus on a single topic. Some people start writing on the usual topic, like TV commercials, and they wind up all over the place, talking about where TV came from or capitalism or health foods or whatever. But with only five paragraphs and one topic you’re not tempted to get beyond your original idea, like commercials are a good source of information about products. You give your three examples, and zap! you’re done. This is another way the five-paragraph theme keeps you from thinking too much.
Teaching assistants are paid an excessive amount of money. The average teaching assistant makes six times as much money as college presidents. In addition, they often receive a plethora of extra benefits such as private jets, vacations in the south seas, a staring roles in motion pictures. Moreover, in the Dickens novel Great Expectation, Pip makes his fortune by being a teaching assistant. It doesn’t matter what the subject is, since there are three parts to everything you can think of. If you can’t think of more than two, you just have to think harder or come up with something that might fit. An example will often work, like the three causes of the Civil War or abortion or reasons why the ridiculous twenty-one-year-old limit for drinking alcohol should be abolished. A worse problem is when you wind up with more than three subtopics, since sometimes you want to talk about all of them.
There are three main reasons while Teaching Assistants receive such high remuneration. First, they have the most powerful union in the United States. Their union is greater than the Teamsters or Freemasons, although it is slightly smaller than the international secret society of the Jedi Knights. Second, most teaching assistants have political connections, from being children of judges and governors to being the brothers and sisters of kings and princes. In Heart of Darkness, Mr. Kurtz is a teaching assistant because of his connections, and he ruins all the universities that employ him. Finally, teaching assistants are able to exercise mind control over the rest of the university community. The last reason to write this way is the most important. Once you have it down, you can use it for practically anything. Does God exist? Well, you can say yes and give three reasons, or no and give three different reasons. It doesn’t really matter. You’re sure to get a good grade whatever you pick to put into the formula. And that’s the real reason for education, to get those good grades without thinking too much and using up too much time.
In conclusion, as Oscar Wilde said, “I can resist everything except temptation.” Luxury dorms are not the problem. The problem is greedy teaching assistants. It gives me an organizational scheme that looks like an essay, it limits my focus to one topic and three subtopics so I don’t wander about thinking irrelevant thoughts, and it will be useful for whatever writing I do in any subject.1 I don’t know why some teachers seem to dislike it so much. They must have a different idea about education than I do. By Les Perelman
Appendix B: ANC wordlist
The following code generates the wordlist I used (see sec. 2). It took about 15 minutes to run. The ANC data is available from anc.org, and is, per that website, “fully open and unrestricted for any use”. The resulting wordlist obeys Zipf’s law, as shown in fig. 5, and is part-of-speech tagged, so homographs of different frequencies (e.g., saw as a verb vs. saw as a noun) can be distinguished.
The actual frequency measure used was the sum of the frequency ranks of an essay’s word tokens (a sketch of this calculation follows the listing below). While this gave higher values for longer essays, and was therefore correlated with token count, a single very uncommon word could give the score an order-of-magnitude boost.
#!/usr/bin/env python3
# Libraries
import glob
import spacy
from unidecode import unidecode
# Options
anc_path = "/home/alex/Data/ANC/" # freely downloadable from anc.org
dict_path = "/usr/share/dict/words" # wamerican-insane v2017.08.24-1
freq_per = 100_000 # scaling factor (i.e., compute freq. per this many words)
include_hapaxes = True
# Initialize spaCy
nlp = spacy.load("en")
freqs = {}
total_tokens = 0
with open(dict_path, "r") as file:
dictionary = set(file.read().split("\n"))
# Get all text files recursively <https://stackoverflow.com/a/45172387>
for filename in glob.iglob(anc_path + "**/*.txt", recursive=True):
# Open each file in the corpus
with open(filename, "r") as file:
# Remove diacritics, parse, & tokenize
for token in nlp(unidecode(file.read())):
# Eliminate non-words
if not token.is_punct and not token.is_space:
# Lemmatize and remove diacritics/ligatures
lemma = token.lemma_.lower().strip("-")
# Only use dictionary words
if lemma in dictionary:
# Add lemma/part-of-speech tag
type_pos = ",".join([lemma, token.pos_])
# Update our dictionary
freqs[type_pos] = freqs.setdefault(type_pos, 0) + 1
# Update our running total
total_tokens += 1
print("{:,} tokens,".format(total_tokens),
"{:,} types".format(len(freqs.keys())))
# Sort by count, most frequent first <https://stackoverflow.com/a/9001529>
freqs_sorted = dict(sorted(freqs.items(), key=lambda item: item[1], reverse=True))
with open("anc_frequency_list.csv", "w") as file:
# CSV header
file.write(f"lemma,pos,count,freq_per_{freq_per}\n")
# CSV rows
for word, freq in freqs_sorted.items():
if include_hapaxes or freq > 1:
file.write(f"{word},{freq},{freq_per*freq/total_tokens}\n")
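To show how the list feeds into the per-essay measure, here is a minimal sketch of the sum-of-ranks score described above. The column names follow the CSV header written by the script; the rank-by-descending-count ordering and the treatment of out-of-list words as maximally rare are assumptions for illustration:

import pandas as pd

# Load the frequency list and rank lemmas by descending count (rank 1 = most common)
anc = pd.read_csv("anc_frequency_list.csv")
anc["rank"] = anc["count"].rank(ascending=False, method="min").astype(int)
rank_of = {(lemma, pos): rank
           for lemma, pos, rank in zip(anc["lemma"], anc["pos"], anc["rank"])}

def uncommonness(doc):
    """Sum of frequency ranks over an essay's word tokens; rarer words add larger ranks."""
    max_rank = len(rank_of)  # out-of-list words are treated as maximally rare
    return sum(rank_of.get((token.lemma_.lower(), token.pos_), max_rank)
               for token in doc
               if not token.is_punct and not token.is_space)

essays["uncommonness"] = [uncommonness(nlp(essay)) for essay in essays["essay"]]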
“Contrasting State-of-the-Art Automated Scoring of Essays: Analysis,” in Handbook of Automated Essay Evaluation: Current Applications and New Directions, ed. Mark D. Shermis and Jill Burstein (New York: Routledge, 2013), 313–46, doi:10.4324/9780203122761.ch19.↩
“Critique of Mark D. Shermis & Ben Hamner, ‘Contrasting State-of-the-Art Automated Scoring of Essays: Analysis’,” Journal of Writing Assessment 6, no. 1 (2013), http://www.journalofwritingassessment.org/article.php?article=69.↩
“Stumping E-Rater: Challenging the Validity of Automated Essay Scoring,” Computers in Human Behavior 18, no. 2 (March 2002): 103–34, doi:10.1016/S0747-5632(01)00052-8.↩
Essay no. 6332, set 3↩
Essay no. 10057, set 4↩
Essay no. 9870, set 4↩
I used wamerican-insane v2017.08.24-1, which contains 654,749 entries.↩
Phrases culled from Wiktionary (1, 2). The full list:
accordingly, additionally, alphabetically, alphanumerically, also, alternatively, antepenultimately, anyway, at any rate, before, besides, by the way, chronologically, consequently, conversely, eighthly, either, eleventhly, equally, fifthly, fiftiethly, finally, first, first of all, first off, first up, firstly, for another thing, for example, for instance, for one thing, fortiethly, fourthly, further, furthermore, hence, however, hundredthly, in addition, in other words, in the first place, incidentally, indeed, lastly, likewise, moreover, neither, nevertheless, next, nextly, ninthly, nonetheless, on the contrary, on the gripping hand, on the one hand, on the other hand, otherwise, parenthetically, penultimately, rather, secondly, serially, seventhly, similarly, sixthly, sixtiethly, still, tenthly, that is, that is to say, then again, therefore, thirdly, thirteenthly, thirtiethly, though, thus, to that end, too, twelfthly, twentiethly, wherefore
Jacob Cohen, “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement 20, no. 1 (1960): 37–46, doi:10.1177/001316446002000104.↩
Shermis and Hamner, “Contrasting State-of-the-Art Automated Scoring of Essays,” 316.↩
See, e.g., Linzi Xing and Michael J. Paul, “Incorporating Metadata into Content-Based User Embeddings,” in Proceedings of the 3rd Workshop on Noisy User-Generated Text, ed. Leon Derczynski et al. (Association for Computational Linguistics, 2017), 45–49, doi:10.18653/v1/W17-4406.↩
The schema is, roughly:↩