The History of every major Galactic Civilization tends to pass through three distinct and recognizable phases, those of Survival, Inquiry and Sophistication, otherwise known as the How, Why and Where phases.
For instance, the first phase is characterized by the question “How can we eat?”, the second by the question “Why do we eat?” and the third by the question, “Where shall we have lunch?”
1 Introduction
Since 2011, the New York Public Library has maintained “What’s on the menu?”, a collection of tens of thousands of restaurant menus going back to the mid-nineteenth century. This is an invaluable collection not just for data scientists, but for food historians, since it is well known that foods go in and out of fashion like clothing. For example, the famed Oyster Bar at Grand Central Terminal, which in 1941 featured “cream of chicken à la reine”, “broiled sweetbreads on toast with Virginia ham”, and “farina custard pudding, Melba sauce” (not to mention oysters for a nickel apiece), today serves such mouthfuls as “poached farmed Norwegian salmon over baby red oak-watercress salad with charred scallion-honey vinaigrette, avocado, and goat cheese.”
These vicissitudes are due partly to economics. In 1914, avocados, then known as “alligator pears”, could go for $1 each—more than $25 today. But economics can be adulterated by public perception. David Foster Wallace, considering the lobster, wrote that:1
Up until sometime in the 1800s, though, lobster was literally low-class food, eaten only by the poor and institutionalized. Even in the harsh penal environment of early America, some colonies had laws against feeding lobsters to inmates more than once a week because it was thought to be cruel and unusual, like making people eat rats. One reason for their low status was how plentiful lobsters were in old New England. “Unbelievable abundance” is how one source describes the situation, including accounts of Plymouth pilgrims wading out and capturing all they wanted by hand, and of early Boston’s seashore being littered with lobsters after hard storms—these latter were treated as a smelly nuisance and ground up for fertilizer. There is also the fact that premodern lobster was often cooked dead and then preserved, usually packed in salt or crude hermetic containers. Maine’s earliest lobster industry was based around a dozen such seaside canneries in the 1840s, from which lobster was shipped as far away as California, in demand only because it was cheap and high in protein, basically chewable fuel.
By 1941, of course, the Oyster Bar’s menu lists “alligator pear salad” for 45¢ and “lobster pan roast”—one of the priciest items named—for $1.45.2 In this project I wanted to see if, using this data,3 I could predict the year a menu was served based on the dishes listed.
2 Data structuring
The NYPL provides the data in the very simple snowflake schema shown in fig. 1. The central table is MenuItem.csv, in which each of the 1,334,417 rows represents an item on a menu. Each item references a particular dish, named in Dish.csv (426,959 comestibles in total), and a page in MenuPage.csv; each page in turn references a row in Menu.csv, the particular bill of fare on which it appears.
Since all I wanted was the name of the dish (from Dish.csv), the date (from Menu.csv), and the menu ID (in case I wanted to group dishes by menu), I merged the data frames like so:
# Add menu id to each menu item
df = pd.merge(
    left=menu_item[["dish_id", "menu_page_id"]],
    right=menu_page[["id", "menu_id"]],
    how="right",
    left_on="menu_page_id",
    right_on="id",
)
# Add menu date to each menu id
df = pd.merge(
    left=df,
    right=menu[["id", "date"]],
    how="right",
    left_on="menu_id",
    right_on="id",
)
# Add dish name to each menu item
df = pd.merge(
    left=df,
    right=dish[["id", "name"]],
    how="right",
    left_on="dish_id",
    right_on="id",
)
# Remove intermediate columns
df = df[["name", "date", "menu_id"]]
This left me with a single data frame of menu items to clean and parse.
3 Data cleaning
3.1 Dates
The first work to be done was on the dates: 638 of them turned out to be incorrect or malformed. In some, like 1091-01-27, the error was transparent, but with less than 0.05% of the data so corrupted, I decided to just drop them. However, another 68,438 items—5% of the data—were missing dates altogether. Since they were useless to the analysis, and I couldn't know whether there was a pattern to the missingness, I was forced to drop these as well. Then, from the 1.27 million well-formed dates remaining, I extracted the year and decade, the latter of which would prove a more reasonable target for modeling than the former.
# Drop malformed dates
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df.dropna(inplace=True)
# Calculate year and decade
df["year"] = df["date"].dt.year
df["decade"] = df["year"] // 10 * 10
3.2 Class balance
A further problem is that the dataset's classes are heavily imbalanced. Looking at fig. 2, we see that the great preponderance of the data—a full 63% of menus and 62% of items—comes from the first two decades of the twentieth century.
This could cause problems during modeling, not only because a naïve model could latch onto the majority class and return it without considering the inputs, but also because the meagerness of the data—particularly before 1880 and after 1990—will leave the model with inadequate information to categorize any menu as being from those eras.
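As a minimal sketch, the distribution behind fig. 2 can be tabulated directly from the merged data frame (reusing df and the decade column defined above):
# Tabulate menus and items per decade
decade_counts = df.groupby("decade").agg(
    menus=("menu_id", "nunique"),
    items=("name", "count"),
)
print(decade_counts / decade_counts.sum())  # share of menus and items per decade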
3.3 Dishes
The dish names themselves were somewhat less tractable. The menus are transcribed by hand, eliminating the need to deal with OCR errors, but many items were unreasonably long.
As the histogram in fig. 4 shows, these go well beyond gusty descriptions like our “poached farmed Norwegian salmon”—the longest, from first class on a 1993 Virgin Atlantic flight, reads with the paragraph breaks removed like a deranged Basil Fawlty monologue:
Afternoon Tea- A Great British Tradition- Tea, the most universally consumed of all drinks, is especially popular in Britain where the annual consumption is something in the region of 512 million cups. W. E. Gladstone observed “If you are cold, tea will warm you- if you are heated, it will cool you- if you are depressed, it will cheer you- if you are excited, it will calm you.” First brought to England c. 1559 by Giambattista Rusmusio, tea did not evolve into an afternoon meal until the end of the 18th century. Anna, Duchess of Bedford, invented afternoon tea to fill the long gap between early lunch and dinner which bored many house parties. It became a meal surrounded by etiquette and customs, delicate china, silver, cake stands and doilies- a time when friend and family meet. Famous tea parties include Mad Hatter’s (Alice’s Adventures in Wonderland by Lewis Carroll 1865), the Boston Tea Party, 1773, and not forgetting HM Queen Elizabeth II’s annual garden parties at Buckingham Palace. The Duke of Wellington declared that “Tea cleared my head and left no misapprehensions.” He was right- tea contains small amounts of two B vitamins, and has no calories, artificial flavourings or colourings. It is said to cure gout, apoplexy, epilepsy, gall stones and sleepiness, and one’s longevity is assured. “Thank God for Tea! What would the world do without tea?”- Sydney Smith
Like the erroneous dates, these were sparse—only 6,682, or 0.5%, of the listed dishes exceeded 100 characters, and these held only 3.7% of the dataset's total characters. However, the decision to drop them was less clear-cut, since they potentially contained period-specific text which could be used to inform a model. I ultimately stetted them for this reason.
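As a rough check, here is a minimal sketch of the length computation on the merged data frame (the exact counts depend on whether one counts unique dishes or menu items):
# Flag item names longer than 100 characters and measure their share of the text
lengths = df["name"].str.len()
long_items = lengths > 100
print(long_items.sum(), long_items.mean())        # count and share of long items
print(lengths[long_items].sum() / lengths.sum())  # share of the dataset's characters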
I then processed the text. A total of 251 menu items contained non-ASCII characters: 126 were in French,4 56 Chinese, 21 German, 20 Swedish, 11 Greek, 10 Hindi, 3 Hungarian, and one lonely item in Polish. The remaining three were English with special characters such as ½. Happily, the excellent Python library Unidecode can do most of the heavy lifting here, stripping accents and romanizing the Greek, Hindi, and Chinese.
from unidecode import unidecode
# Remove special characters by transliterating to ASCII
df["name"] = df["name"].apply(unidecode)
# Check for any remaining non-ASCII characters anywhere in a name
df[df["name"].str.contains(r"[^\x00-\x7F]")]["name"]
It then remained only to normalize to lower case and tokenize on the regular expression [a-z'-]+, which captures strings of letters, apostrophes, and hyphens and throws out other punctuation and numbers (since this destroys ordering information, I put the result into a new column called tokens).
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
tokenizer = RegexpTokenizer("[a-z'-]+")
stopwords = set(stopwords.words("english"))
def tokenize(text):
    tokens = set(tokenizer.tokenize(text))
    return " ".join(tokens - stopwords)
df["name"] = df["name"].str.lower()
df["tokens"] = df["name"].apply(tokenize)
I also threw out stopwords such as “the”. The text was now ready for exploring trends and building a predictive model.
4 Data exploration
4.1 Cross-sectional
I first examined the top words from each decade, excerpted in fig. 5–fig. 8.
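The figures themselves were generated separately, but a minimal sketch of the underlying counts, assuming a simple split-and-explode over the tokens column, might look like this:
# Count the ten most frequent tokens in each decade (cf. figs. 5-8)
top_words = (
    df.assign(token=df["tokens"].str.split())
      .explode("token")
      .groupby("decade")["token"]
      .value_counts()
      .groupby(level=0)
      .head(10)
)
print(top_words)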
There are only a few menus from the 1850s in the collection, but we can get a sense of the palate. Names of wines—madeira, sherry, and claret, and probably also château and pale—feature prominently, and the most common cooking methods are boiling and roasting.
By the 1910s, we have a new picture: fried foods are popular, as are cream sauces. Chicken and beef have made the list, as, notably, does salad, as fresh fruits and vegetables become more available to the average patron.
We see fewer changes in the post-war era, but "fresh" has advanced, as have French de ‘of’ and German mit ‘with’, indicating that European dishes, or at least European phrasings, were coming into vogue.
In the 1990s, "fried" has been replaced by "grilled", and a renewed interest in French cuisine seems to be the dernier cri: five of the top ten words are French grammatical words not filtered by the English stopword list, with de outpacing the next-highest word by almost 2 to 1.
4.2 Longitudinal
Equally illuminating is to look at the waxing and waning of particular foods across time, in the style of Google Ngrams.
# Flag the menu items whose tokens contain each word
words = ["lobster", "oyster"]
occurrences = {word: df["tokens"].str.contains(rf"\b{word}\b") for word in words}
# Proportion of menu items mentioning each word, by year
menu_item_prop = pd.DataFrame(occurrences).join(df["year"]).groupby("year").mean()
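A line plot of that frame gives charts in the style of figs. 9-11; a minimal sketch, assuming matplotlib as the plotting backend:
import matplotlib.pyplot as plt

# Plot the per-year share of menu items mentioning each word
menu_item_prop.plot()
plt.ylabel("share of menu items")
plt.show()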
The Madeira that was so popular in the 1850s dropped off steeply soon after (fig. 9).
We see the Sun rise and set on the age of Jell-O in fig. 10, and in fig. 11 the nascence of tofu.
It is also interesting to look at descriptors. Organic food is rooted in the environmental movement of the ’60s and ’70s, but doesn’t appear on restaurant menus until the turn of the millennium.
Health terms such as “diet” show a similar trend, appearing in numbers in the ’70s.
We can also use foreign words that commonly appear on menus as a rough proxy for how fashionable those cuisines were in different periods.
A clear post-war interest in French and German cuisine manifests itself—although the 40% of menu items in 2005 containing mit is more likely an artifact of a small sample than a genuine trend.
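The computation is the same as for the individual dishes above; the particular list of foreign function words here is only illustrative:
# Share of menu items containing common French and German function words, by year
foreign_words = ["de", "la", "au", "mit", "und"]
foreign_occurrences = {
    word: df["tokens"].str.contains(rf"\b{word}\b") for word in foreign_words
}
cuisine_prop = pd.DataFrame(foreign_occurrences).join(df["year"]).groupby("year").mean()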
5 Modeling
The first step is to sample from the data, partly to correct the class imbalance seen in fig. 2 above, but more importantly to avoid the enormous matrix we would get if we vectorized the entire data frame.5
# We want each class to have only as many as the smallest class
sample_size = df.groupby("decade")["name"].count().min()
# Sample randomly from each decade
sample = df.sample(frac=1).groupby("decade").head(sample_size)
Then we define our variables and create train and test sets:
from sklearn.model_selection import train_test_split
X = sample["tokens"]
y = sample["decade"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
Since the tokens column has already been cleaned, we can prepare it for modeling using tf-idf vectorization, which assigns high scores to words that are highly localized, occurring, say, only in the 1970s and nowhere else. This turns the column of token strings into a matrix of tf-idf scores.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)
Now the model itself can be constructed. I use a Gaussian naïve Bayes classifier; naïve Bayes is a standard choice for text classification, even though the tf-idf scores do not really follow a Gaussian distribution.
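A minimal sketch of the fitting step, assuming scikit-learn's GaussianNB and the variable name bayes that the later snippets use (GaussianNB cannot take a sparse matrix, hence the toarray() calls):
from sklearn.naive_bayes import GaussianNB

# Fit the classifier on the dense tf-idf matrix
bayes = GaussianNB()
bayes.fit(X_train_vec.toarray(), y_train)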
Finally, we assess the accuracy of the model.
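Again as a sketch, assuming the same bayes object, the accuracies reported below can be computed with scikit-learn's score method, which returns mean accuracy:
# Mean accuracy on the training and test sets
train_acc = bayes.score(X_train_vec.toarray(), y_train)
test_acc = bayes.score(X_test_vec.toarray(), y_test)
print(train_acc, test_acc)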
The results are pretty disappointing: 77.5% train accuracy and 17.6% test accuracy mean that the model has overfit, latching onto words that weren't really indicative of their decades. This is due to our miserliness with the data: the model was only allowed to see 0.2% of the menu items. We can fix this by setting the sample size to that of the second-smallest class rather than the smallest. This gives us mostly balanced classes, but with fewer items in the 1870s, to which only 174 menus are dated.
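A minimal sketch of that change, reusing the sampling code from above:
# Use the size of the second-smallest class as the per-decade sample size
class_counts = df.groupby("decade")["name"].count().sort_values()
sample_size = class_counts.iloc[1]
# Sample randomly from each decade, as before
sample = df.sample(frac=1).groupby("decade").head(sample_size)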
The results are more promising: 52.7% train and 21.7% test accuracy. Another consideration is that, since our classes are ordinal, even when the model is wrong, it may only be wrong by a decade or two. We can check this by defining a “fuzzy accuracy” score, which counts a prediction as accurate if it is within a given tolerance.
import numpy as np

def fuzzy_accuracy(y_true, y_pred, tolerance):
    # Count a prediction as correct if it is within `tolerance` years of the truth
    return np.mean(np.abs(y_true - y_pred) <= tolerance)

fuzzy_accuracy(y_train, bayes.predict(X_train_vec.toarray()), tolerance=10)
fuzzy_accuracy(y_test, bayes.predict(X_test_vec.toarray()), tolerance=10)
We now get 62.7% train and 40.2% test accuracy with a tolerance of one decade. With more memory available, it would be possible to feed more of the dataset into the model, and perhaps create a yet more accurate model.
6 Final thoughts
I’ll end with an amusing and instructive story: as I was working on this, I reimported my data and was surprised to see dishes listed as missing. I verified that no rows were missing in the original dataset and tried to figure out which dishes had been dropped. It turned out that an Indian menu from 1981 listed naan bread, but spelled it “nan”, and pandas interpreted that as NaN, not a number. The solution was simply to pass na_filter=False to pd.read_csv(), but the lesson learned was always to check that the data read in is the same as the data written out.
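In code, the fix is a single argument to the reading step; a minimal sketch, assuming the Dish.csv file from section 2 is being reloaded:
# Keep literal strings such as "nan" from being parsed as missing values
dish = pd.read_csv("Dish.csv", na_filter=False)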
1. “Consider the Lobster,” Gourmet, August 2004, 55.
2. In 2020, $7.85 and $25.29, respectively.
3. Retrieved April 27, 2020.
4. Or else in what Chesterton called “a sort of super-French employed by cooks, but quite unintelligible to Frenchmen.”
5. Sampling function borrowed from https://stackoverflow.com/a/56841648