May 19, 2010
Mapping the Demographics of American English with Twitter

An absolutely amazing post, courtesy of David Bamman at The Language Log.

It took me a while to really make sense of Twitter. For the longest time, it was (to me) the stomping ground of 14-year-olds and Ashton Kutcher, each issuing a minute-by-minute feed of their lives. Around the time Twitter arrived, however, I had just had a breakthrough on YouTube’s enormous popularity - it was only after watching a dozen different videos of the Super Mario Brothers theme song performed a dozen different ways that I finally got it: I may not care about cats playing the keyboard or wedding parties dancing down the aisle, but somebody does, and without a distribution system for people to broadcast whatever their hearts felt like, I never would have had my life improved by that kid with the beatboxing flute or the one with the double guitar.

So I waited for a similar breakthrough with Twitter. It came, at long last, after I realized that it was exactly what I first thought it was: 14-year-olds (and Ashton Kutcher) chronicling the minutiae of their lives. It is colloquial language, constrained by 140 characters: everyday conversations about waiting in line at the grocery store, your flight just landing at ORD, what to do this Saturday night, “omg did u see hr dress?” In spurts it is, of course, much more than that, as its use during the protests of the 2009 Iranian election proved, but in its unmarked use, it’s the language of how millions of people across the world talk to their friends.

To say Twitter is colloquial is putting it lightly. “Brother,” for example, occurs in Twitter data during the week of May 10-17, 2010 with an average frequency of once every 7,338 words, not too distant from its frequency in its closest cousin, the Corpus of Contemporary American English (once every 9,405 words). The difference for “bro,” however, is much more dramatic: in the Twitter data during that same period, it occurs once every 5,833 words (more frequently, in fact, than “brother”), while in the COCA it occurs once every 757,575 words - two orders of magnitude less frequently.
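To make the comparison concrete, the "once every N words" figures can be turned into rates and ratios. The following is only a small sketch using the four numbers quoted above; the function and variable names are illustrative and not part of the original analysis.

```python
import math

# "Once every N words" figures quoted in the paragraph above.
WORDS_PER_OCCURRENCE = {
    ("brother", "twitter"): 7_338,
    ("brother", "coca"): 9_405,
    ("bro", "twitter"): 5_833,
    ("bro", "coca"): 757_575,
}


def per_million(n_words):
    """Convert 'once every n_words words' into occurrences per million words."""
    return 1_000_000 / n_words


def twitter_to_coca_ratio(word):
    """How many times more frequent the word is on Twitter than in COCA."""
    return (WORDS_PER_OCCURRENCE[(word, "coca")]
            / WORDS_PER_OCCURRENCE[(word, "twitter")])


for word in ("brother", "bro"):
    ratio = twitter_to_coca_ratio(word)
    print(f"{word}: {ratio:.1f}x more frequent on Twitter "
          f"(log10 = {math.log10(ratio):.1f})")
```

The ratio for "brother" comes out at roughly 1.3, while the ratio for "bro" is roughly 130, which is where the "two orders of magnitude" figure comes from.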

In April 2010, Twitter had approximately 106M registered users. The volume of data that flows through the Twitter pipe dwarfs any other publicly available linguistic corpus in existence (except the web itself), and unlike fixed corpora, it still flows. Such a huge dataset has proven itself to be a fertile resource for a number of natural language processing tasks (such as trend detection and sentiment analysis), but its value as a collection of colloquial language begs to be used for lexicography as well: if the purpose of a dictionary is to record actual usage, then Twitter data allows us to broaden the scope of our corpus beyond newswire, literary works and other forms of privileged publication and include the unedited language of everyday folks as well.

Inducing language demographics

In addition to letting us capture the colloquial language of over one hundred million people, Twitter also provides us with a rich data source for inducing the demographics of that language community.

The data that Twitter releases as part of its public datastream includes a number of features for each “tweet.” In addition to the content of the message itself, each tweet is accompanied by the creator’s username, display information such as the URL of the profile picture and background color, follower count, and a wealth of other metadata, including a timestamp for its creation and user-defined geographical information - for me (@dbamman), this is “Boston, MA” but a user can write anything (“New York,” “NYC,” “in the intertubes”) or choose to have their location tagged with precise latitude and longitude coordinates (though only a small fraction actually do so). The datasource is ostensibly so rich in order to enable third-party development (e.g., location-enabled iPhone apps), but all of this information contains valuable demographic indicators for plotting language use across time, space and different populations.

Geography

The geographic information embedded in each tweet allows us to map language use across the US and, like the Dictionary of American Regional English, report on the nuances of language that are characteristic of certain communities.

The user-defined geographic information is noisy data: while “Boston, MA” can be automatically disambiguated relatively easily to a physical location on the earth (corresponding to coordinates 42.35843, -71.05977), others (“Springfield”) are more difficult (there are many Springfields); others still are nearly impossible (“home of that boy Biggie,” a reference to New York City quoted from Jay-Z’s “Empire State of Mind”), and some (“in ur fridge eatin ur foodz”) don’t map to any space in physical reality. The sheer volume of data, however, gives us the flexibility to focus more on precision than on overall accuracy - we can throw away all tweets where we aren’t over 99% sure of the physical location.
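The precision-first filtering described above might look something like the following. This is a sketch under stated assumptions, not the actual disambiguation pipeline: the gazetteer, its confidence values, and the function names are all hypothetical, and a real system would need a far richer way of scoring location strings.

```python
# Hypothetical gazetteer: normalized location string -> (state, confidence).
# The entries and confidence values are made up for illustration.
GAZETTEER = {
    "boston, ma": ("MA", 0.999),
    "nyc": ("NY", 0.995),
    "springfield": ("IL", 0.30),  # many Springfields, so low confidence
}


def resolve_state(location, threshold=0.99):
    """Return a state code only when the match clears the confidence threshold."""
    match = GAZETTEER.get(location.strip().lower())
    if match is None:
        return None  # unknown or unmappable string, e.g. a joke location
    state, confidence = match
    return state if confidence > threshold else None


def filter_tweets(tweets, threshold=0.99):
    """Keep (state, text) pairs for tweets whose location resolves confidently."""
    kept = []
    for tweet in tweets:
        state = resolve_state(tweet["location"], threshold)
        if state is not None:
            kept.append((state, tweet["text"]))
    return kept


sample = [
    {"location": "Boston, MA", "text": "go sox"},
    {"location": "Springfield", "text": "hello"},
    {"location": "in ur fridge eatin ur foodz", "text": "lol"},
]
confident = filter_tweets(sample)
```

Only the "Boston, MA" tweet survives here. Ambiguous and unmappable locations are simply discarded, which is affordable only because the stream is so large.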

With this disambiguated data, we can map the usage of words and phrases across the US by normalizing each word’s count by the volume of total data coming out of each state (to avoid biasing the statistics toward populous states such as New York and California). Comparing these resulting ratios allows us to get a demographic picture of word use across the US. Here “grand canyon” is visualized on a map using the Google Charts API (lighter blue represents more characteristic usage).
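The normalization step can be sketched in a few lines. The version below assumes that "volume" means the number of tweets from each state and that a phrase is matched by a simple case-insensitive substring test; both are simplifications chosen for illustration, and the function name is hypothetical.

```python
from collections import Counter


def characteristic_usage(tweets, phrase):
    """Share of each state's tweets that contain the phrase.

    tweets: iterable of (state, text) pairs whose states are already
    disambiguated. Substring matching is a simplification.
    """
    totals = Counter()
    hits = Counter()
    needle = phrase.lower()
    for state, text in tweets:
        totals[state] += 1
        if needle in text.lower():
            hits[state] += 1
    # Dividing by each state's own volume keeps populous states from dominating.
    return {state: hits[state] / totals[state] for state in totals}


sample = [
    ("AZ", "hiking the Grand Canyon today"),
    ("AZ", "lunch"),
    ("NY", "grand canyon pics from my trip"),
    ("NY", "stuck on the subway"),
    ("NY", "coffee"),
    ("NY", "what to do this saturday night"),
]
usage = characteristic_usage(sample, "grand canyon")
```

In this toy sample both states mention the phrase once, but Arizona's ratio (1 in 2) is twice New York's (1 in 4), so the phrase is more characteristic of Arizona despite identical raw counts.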

Figure 1: Demographics of “grand canyon”

A sanity check reveals that, yes, it is in Boston that the Red Sox are most characteristically talked about (this is different from “most talked about” which, again, could even be a state like New York given its large population). Californians talk characteristically of earthquakes and Wisconsinites of the Green Bay Packers (even in May).

The same method that detects prevalent topics in certain areas also allows us to detect regionalisms in slang. The southern US is the focal point for words like “bruh” and “ima” (and its orthographic variants i’ma and imma), while “hella” is centered in California and “rad” in the Pacific Northwest. “Wicked” is more characteristic of New England (especially Massachusetts), but less strongly than the others (perhaps because of its polysemy - its meaning as “very” is likely a regionalism but its older sense of “evil” is almost certainly not). This gets to one limitation of this method: statistics are all computed on a token level (the word form), not at the level of individual senses, so the clear regional distinction between “pop” and “soda” that we would love to see gets blurred, since “pop” is used throughout the US not simply as a synonym of “soda” but in other senses as well (“pop music,” “pop out,” etc.).

Figure 2: Demographics of “bruh”

Age and gender

While Twitter doesn’t explicitly ask for or subsequently publish any age or gender data for its users, we can approximate both on a large scale using common demographic indicators such as the user’s first name. While some names like “David” have relatively even distributions across birth years (which we can compute using information from the US Social Security Administration), other names are heavily biased toward certain generations. A “Jasmyn,” for instance, is far more likely to be a teenager now than a “Pearl” is; if your name is “Arsenio” and you were born in the US, it’s over 99% likely that you’re a male born between 1989 and 1991. With this statistical information, we can compute a probability distribution over the entire age range between 12 and 75 for each user and increment the weight count of each word that user writes according to this distribution.
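One way to picture the weighted counting is sketched below. Everything specific in it is an assumption made for illustration: the name counts are invented rather than taken from real birth records, the age range is collapsed into four coarse buckets, and the function names are hypothetical.

```python
from collections import defaultdict

AGE_GROUPS = ["12-17", "18-24", "25-34", "35+"]

# Hypothetical number of people with each first name per age group.
# These counts are made up; real figures would come from birth-name records.
NAME_COUNTS = {
    "jasmyn": {"12-17": 80, "18-24": 15, "25-34": 5, "35+": 0},
    "pearl": {"12-17": 2, "18-24": 3, "25-34": 5, "35+": 90},
}


def age_distribution(name):
    """Estimate P(age group | first name); None for names we have no data on."""
    counts = NAME_COUNTS.get(name.lower())
    if not counts:
        return None
    total = sum(counts.values())
    return {group: counts[group] / total for group in AGE_GROUPS}


def add_weighted_counts(word_weights, name, words):
    """Spread one occurrence of each word across age groups by P(group | name)."""
    dist = age_distribution(name)
    if dist is None:
        return  # skip users whose name tells us nothing
    for word in words:
        for group, probability in dist.items():
            word_weights[word][group] += probability


word_weights = defaultdict(lambda: defaultdict(float))
add_weighted_counts(word_weights, "Jasmyn", ["bruh"])
add_weighted_counts(word_weights, "Pearl", ["bruh"])
```

Each tweet contributes a total weight of one per word, split across age groups, so no single user is ever assigned a definite age; the picture only emerges in aggregate.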

With these distributional counts for each word and phrase of interest in the corpus, we can chart out the demographics using the same normalizing technique used for mapping geographical distributions above: for each word or phrase, dividing the observed weight count in each age group by the total volume of tweets for that age (otherwise the statistics would be biased toward heavily tweeting age groups, such as 12-17 year olds) and then comparing those computed ratios.
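The same idea in code, as a rough sketch: the weighted counts and tweet volumes below are invented numbers, and the helper names are hypothetical. The point is only that the group with the largest raw count is not necessarily the group for which the word is most characteristic.

```python
def normalized_rates(weight_by_group, volume_by_group):
    """Divide a word's weighted count in each age group by that group's volume."""
    return {
        group: weight_by_group.get(group, 0.0) / volume
        for group, volume in volume_by_group.items()
        if volume > 0  # skip groups with no tweets at all
    }


def most_characteristic_group(rates):
    """Return the age group with the highest normalized rate."""
    return max(rates, key=rates.get)


# Made-up weighted counts for one word and made-up total tweet volumes.
bruh_weights = {"12-17": 50.0, "18-24": 40.0, "35+": 5.0}
tweet_volume = {"12-17": 10_000, "18-24": 4_000, "35+": 5_000}

rates = normalized_rates(bruh_weights, tweet_volume)
top_group = most_characteristic_group(rates)
```

Here the 12-17 group has the highest raw weight, but once its heavy tweeting volume is divided out, the 18-24 group has the higher rate.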

A sanity check again reveals that, yes, women aged 12-24 who tweet do love Grey’s Anatomy, Gossip Girl and 90210, while men aged 35+ like TV shows such as 24 and Mythbusters. To complete the demographic picture of “bruh” above, we can see that it’s used predominantly by males aged 18-24. In contrast, a word like “bro” is used comparatively more frequently by females and by all age groups, while a formal word like “brother” is used with more or less equal geographic and gender distribution across the entire US.

Figure 3: Gender demographics of “bruh”

Figure 4: Age demographics of “bruh”

Inducing age demographics in this way results in a noisier picture than inducing gender and geographic information, for which the indicators are much more clear (it’s my intuition, at least, that people over the age of 65 are probably using “bruh” a little less than reported). The sheer volume of information coming from this data source does, however, help to offset this noise - even if there is some fixed amount artificially inflating the probabilities of each age group, we can at least begin to see a picture emerging of the major age groups involved.

An evolving dictionary

The goal of the Lexicalist project is to develop a dictionary that depicts, in real time, the changing demographics of English in the United States, a dictionary that supplements the fundamental meaning of a word or phrase with the current cultural backdrop that’s informing its use today. My work in the NEH-funded Dynamic Lexicon project has taught me that (for Ancient Greek and Latin at least) the language of a given era is not a homogeneous beast able to be captured in a single volume (or caged in a set of fascicles); it is the language of Caesar plus the language of Vergil and so on. English two thousand years later in the United States is no different: it is the sum of the hundreds of millions of people who use it, often in very different ways. By focusing on the demographics of contemporary usage, my hope is to shine a spotlight on all of those millions of individuals and see American English as the product of their distinct and discernible voices.