Linguistic tone is related to the population frequency of the adaptive haplogroups of two brain size genes, ASPM and Microcephalin

by Dan Dediu and D. Robert Ladd

- Further information -

Here is the official version (PNAS Early Edition), and here is a pre-print PDF version as well.

Summary

This paper, which was published in the Proceedings of the National Academy of Sciences of the USA (PNAS) on 30 May 2007, has attracted a fair amount of press coverage. Since newspaper stories are often cut to fit the amount of space available and since the published paper goes into a lot of technical detail about both the genetic background and the statistical techniques we used, we have posted the following description of our work for interested readers.

Our paper reports a statistical study of the relationship between the geographical distribution of two genes and the geographical distribution of tone languages.

The two genes, ASPM and Microcephalin, have attracted a lot of attention in the last couple of years, following two papers (1 and 2) published in Science in 2005 by a Chicago research group led by Bruce Lahn. Lahn’s group showed that there are two variants (alleles), one for each of these two genes, which emerged fairly recently (estimated 6,000 years ago for ASPM and 37,000 years ago for Microcephalin) and that these new alleles seem to be spreading quickly in the human species (and are therefore probably “adaptive”, or favoured by natural selection). They also showed that these “derived” alleles (as they are known) are unevenly distributed in the world’s populations, being especially rare in sub-Saharan Africa and most common in Europe, North Africa and Western Asia.


The distribution of the "derived" allele of ASPM in the Old World populations we studied in our paper.

Each circle represents one population and the intensity of blue reflects the allele frequency (min 0%, max 60%).


The distribution of the "derived" allele of Microcephalin in the Old World populations we studied in our paper.

Each circle represents one population and the intensity of green reflects the allele frequency (min 3%, max 100%).

Tone languages are languages (like Chinese, Thai, Yoruba, and Zulu) in which the pitch or “tone” of words and syllables makes a difference to word meaning. For example, in Chinese huār (with a high level pitch) means ‘flower’ and huàr (with a falling pitch) means ‘picture’. In non-tonal languages (like English or Spanish), pitch is only used at the sentence level, for emphasis and overall meanings like questioning. Roughly half the languages in the world are tonal and half are non-tonal, but they’re fairly unevenly distributed: tone languages are the norm in sub-Saharan Africa and are common in Southeast Asia and among Native American languages, especially in parts of Central and South America. Non-tone languages are the norm in Europe and Central, South and West Asia, and among the aboriginal languages of Australia. For more details about their distribution you can consult, for example, the entry on tone in the World Atlas of Language Structures.

(Please go here for another Chinese example, with sound files. In Yoruba, igba spoken with different tones means different things (recordings courtesy of Dr. Lawrence Olufemi Adewole of Ile-Ife University, Nigeria): LowHigh = a kind of tree, MidMid = '200', MidHigh = 'gourd' and LowLow = 'time'.)


The distribution of tone languages in the Old World populations we studied in our paper.

Each square represents one population: yellow stands for non-tone languages and gray for tone languages.

(But what about the Americas?)

Superficially, the distribution of the older (i.e., non-"derived") alleles, as reported by Lahn’s group, resembles the distribution of tone languages. Because the two genes in question are known to be involved in brain growth and development, and because there is some evidence that differences in performance on language-related experimental tasks can be linked to differences in brain structure, we hypothesised that the proportion of the older alleles of ASPM and Microcephalin in a given population would correlate with whether the language spoken by the population is tonal.

This means that our approach is different from the well-known work of Cavalli-Sforza and his colleagues, which aims to correlate genetic and linguistic classifications of populations, using known or hypothesised historical relations between languages and language families. Their question is: do genetically similar populations also tend to be linguistically similar? Here genetic similarity involves many independent loci and linguistic similarity involves historical, ancestor-descendant relationships. Our work instead investigates correlations between genetic markers and typological features of languages. Our question is: do populations having certain alleles tend to speak languages using the same feature? This question makes no reference to overall genetic similarity or to historical linguistic classifications.

Language typology studies the ways in which languages can differ. Some of this is fairly familiar: for example, in French and English adjectives and nouns go in the opposite order - that’s word order typology. But there are typological differences in sound structure and word structure, too. In most Australian aboriginal languages, there are no fricative sounds (sounds like S or SH or F), whereas in most European languages there are lots - yet most Australian languages have lots of different N and L and R sounds that many English speakers struggle to tell apart. Or again: in many languages (e.g. Turkish, Inuktitut (Eskimo) and Swahili) the verb forms have lots of prefixes or suffixes to indicate the subject, the object, the tense, and so forth; in English or Chinese there’s hardly any of this kind of marking. All these kinds of differences are what language typology is about.

By comparing nearly 1000 genetic markers and 26 linguistic features (the linguistic data with details on our sources and methods can be found here), we were able to show that, as most people would expect, there is generally no correlation between population genetics and language typology – but the relation between tone and the two genes under study was confirmed to be especially strong in all our analyses. It’s because there generally isn’t a correlation between population genetics and language typology that the correlation we’ve found may be interesting.

This relationship remains important and statistically highly significant when we consider the correlations between tone and both ASPM and Microcephalin simultaneously, when we take into account the fact that neighbouring populations tend to share both genes and languages, and in some further tests. (Go here for more details of what we did.)

The distribution of the correlations between all pairs of genetic markers and linguistic features in our database.

The horizontal axis represents the strength of the correlation (Pearson's r, between -1 and +1, 0 means no correlation).

It can be seen that most correlations are around zero, but that the correlations between tone and ASPM, and between tone and Microcephalin, are very improbable (stronger than 98.6% of all the correlations).

It must be noted that the correlations between tone and ASPM, and between tone and Microcephalin, are highly significant.
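As a rough illustration of the comparison summarised in this figure, the Python sketch below computes Pearson's r for every pairing of a genetic marker with a linguistic feature and then asks what share of all pairs is weaker than one pair of interest. It is only a sketch under stated assumptions, not the analysis code used for the paper: the data are randomly generated, the names `pearson_r`, `share_weaker`, `markers` and `features` are hypothetical, and ranking by absolute value is an assumption about how "stronger than" should be read.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: treat an undefined correlation as "no correlation".
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def share_weaker(target_r, all_rs):
    """Fraction of correlations whose absolute value is below |target_r|."""
    return sum(1 for r in all_rs if abs(r) < abs(target_r)) / len(all_rs)


# Synthetic data only: 40 made-up populations, NOT the populations in the paper.
rng = random.Random(0)
n_pop = 40
tone = [1] * 20 + [0] * 20  # 1 = tone language, 0 = non-tone language

# Binary linguistic features, one value per population.
features = {"tone": tone}
for k in range(4):
    features[f"feature_{k}"] = [rng.randint(0, 1) for _ in range(n_pop)]

# Allele frequencies per population; most markers are pure noise.
markers = {f"marker_{k}": [rng.random() for _ in range(n_pop)] for k in range(200)}
# One planted marker whose frequency is lower in the tone-language populations.
markers["planted"] = [0.5 - 0.4 * t + rng.uniform(-0.05, 0.05) for t in tone]

# Correlate every marker with every feature.
all_rs = [
    pearson_r(freqs, values)
    for freqs in markers.values()
    for values in features.values()
]
planted_r = pearson_r(markers["planted"], features["tone"])
share = share_weaker(planted_r, all_rs)
print(f"r = {planted_r:.2f}, stronger than {100 * share:.1f}% of all pairs")
```

Because the features are coded as 0 and 1, Pearson's r here is what is usually called a point-biserial correlation. The noise markers cluster around zero, which is the pattern the histogram shows for the real data.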

The distribution of tone and non-tone languages as a function of the population frequency of the "derived" alleles of ASPM (horizontal axis) and Microcephalin (vertical axis).

Tone languages are represented by empty squares and non-tone languages by black squares.

It can be seen that in the bottom-left quadrant there are only tone languages and in the top-right quadrant only non-tone languages, while in the top-left quadrant there is a balanced mixture (the Americas fit here, supporting our prediction).

The bottom-right quadrant contains no populations in our sample, and the reason is not known.
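The quadrant reading of this scatterplot can be sketched in a few lines of Python. This is a minimal illustration only: the cut-off values `aspm_cut` and `mcph_cut` that split the axes are hypothetical (the text above does not state where the quadrants are divided), the function names are invented for the example, and the six populations are made-up numbers, not data from the paper.

```python
from collections import Counter


def quadrant(aspm_d, mcph_d, aspm_cut=0.25, mcph_cut=0.5):
    """Name the scatterplot quadrant for one population.

    aspm_d and mcph_d are frequencies (0..1) of the "derived" alleles.
    The cut-offs are hypothetical placeholders, not values from the paper.
    """
    horizontal = "left" if aspm_d < aspm_cut else "right"
    vertical = "bottom" if mcph_d < mcph_cut else "top"
    return f"{vertical}-{horizontal}"


def tally(populations):
    """Count tone and non-tone languages in each quadrant."""
    counts = {}
    for aspm_d, mcph_d, is_tonal in populations:
        label = "tone" if is_tonal else "non-tone"
        counts.setdefault(quadrant(aspm_d, mcph_d), Counter())[label] += 1
    return counts


# Made-up example populations: (ASPM-derived, Microcephalin-derived, tonal?)
populations = [
    (0.05, 0.20, True),
    (0.10, 0.30, True),
    (0.40, 0.80, False),
    (0.45, 0.90, False),
    (0.05, 0.85, True),
    (0.08, 0.75, False),
]

for name, counter in sorted(tally(populations).items()):
    print(name, dict(counter))
```

With real frequencies in place of the invented ones, a tally like this would show at a glance whether any quadrant mixes tone and non-tone languages.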

We believe that this correlation may reflect some sort of predisposition or cognitive bias induced by the two genes in question. We don’t have any detailed idea of what this bias might consist of, but we assume it is very small and would only manifest itself in language change over many generations. We know, of course, that any normal human infant can learn the language of any human community that it’s brought up in – genes don’t play any role at the individual level. But subtle differences in the way children acquire language might lead to changes in the long run. All languages change over time (as anyone who has struggled with Shakespeare knows), and computer simulations and mathematical models have suggested that small differences in the way children acquire language could, over enough generations, give rise to big differences in the way a language is structured. And if those subtle differences are influenced by a child’s genetic make-up, that could explain the kind of correlation we’ve found.
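To make the idea of a small bias adding up over generations more concrete, here is a deliberately simple toy model in Python. It is not one of the published simulations referred to above, and every detail is an assumption made for illustration: `p` is the share of speakers in a community who use some linguistic variant, `bias` is a tiny per-generation nudge in favour of that variant among learners, and the update rule `p + bias * p * (1 - p)` is just one convenient way of writing such a nudge.

```python
def next_generation(p, bias):
    """Share of the next generation using the variant.

    Learners mostly reproduce what they hear (p), with a small push towards
    the variant that is strongest when usage is mixed. Toy assumption only.
    """
    return p + bias * p * (1.0 - p)


def simulate(p0, bias, generations):
    """Return the share of speakers using the variant in each generation."""
    history = [p0]
    for _ in range(generations):
        history.append(next_generation(history[-1], bias))
    return history


# A 1% bias is barely visible after 10 generations but dominates after 1000.
biased = simulate(p0=0.5, bias=0.01, generations=1000)
unbiased = simulate(p0=0.5, bias=0.0, generations=1000)
print(f"after 10 generations:   {biased[10]:.3f}")
print(f"after 1000 generations: {biased[-1]:.3f}")
print(f"without any bias:       {unbiased[-1]:.3f}")
```

In this toy model a community that starts evenly split stays evenly split without bias, while a bias far too small to notice in any single generation eventually carries the variant to near-universal use.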

What about the Americas?

There were two main reasons for not including these languages in our analysis: one is the difficulty of knowing whether genetic data are contaminated by recent contact with people of European ancestry, and the other is the fact that the Americas are very diverse linguistically, and that diversity is not represented by the small sample of five languages available in the database we used (which included two Amazonian languages, Karitiana and Surui (both members of the Tupi language family), an Arawakan language of Colombia, and two Mexican languages). We had no control over the sample of languages, because that came from the Lahn group's published papers, so it seemed better to exclude them.

The Americas represent a test case for our idea in the sense that American populations seem to have, in general, very low proportions of "derived" ASPM and quite high proportions of "derived" Microcephalin. This means that we would expect a mix of tonal and non-tonal languages (as in the upper left quadrant of our scatterplot), which is exactly what we find in the Americas generally. But if we wanted to do a more detailed analysis of American languages, we would have to have a much larger sample of populations and languages, and obtaining genetic data uncontaminated by European admixture would be very difficult or impossible.

The next step is to do experiments in which we look for evidence of the nature of the predisposition or bias. The work of Patrick Wong and his colleagues provides one possible lead here: they have shown that some monolingual adults find it much harder than others to learn an artificial language vocabulary that makes use of tone or pitch distinctions, and that the differences between these groups show up in subtle differences of brain structure as well. If we could show that these differences also reflect differences in genetic make-up, it would go some way to showing that the correlation we have found is based on a real causal link.

Our work has no immediate practical implications, but its longer term interest would lie in discovering that there’s a causal link between population genetics and language typology. (Again, we haven’t found that: we’ve just demonstrated some very unlikely correlations that suggest there might be such a link.) If that link can be found, then it will fit into the rapidly growing scientific understanding of how genetic make-up influences behaviour and cognitive development. That’s important work with lots of practical ethical dimensions: as science finds out more and more about specific genetic influences, society is really going to have to start dealing with a lot of policy questions that have only been theoretical up till now. But at this point all our paper does is report something that might be a piece of the overall jigsaw puzzle.

What the paper doesn't show or claim

First, we are not claiming that there is any direct connection between an individual’s genes and an individual’s language. We’re talking about small individual biases adding up to group effects over many generations of language change. People acquire the language(s) they’re exposed to in early childhood, regardless of their genes.

Second, we’re not making any suggestion of “superiority” or “selective advantage” for one language over another. Our work provides absolutely no reason to think that non-tonal languages are easier or “more advanced” than tonal languages (or vice-versa). There’s also no reason to think that there’s any evolutionary advantage to non-tonal languages: Chinese society developed advanced technology and politics and philosophy with a tonal language just as successfully as Eastern Mediterranean societies at about the same time with non-tonal languages.

Third, we’re not offering any new findings about the effects of these genes on brain development. We make only very limited suggestions about the detailed neurocognitive mechanisms that might be involved. Not much is known about the functions of these genes in brain development anyway, though this is certainly a hot topic in genetics. Since we’re not geneticists, we’re not involved in the front-line biochemical research, so we’re not really in a position to speculate about what exactly might be going on in the brain.

Finally, we’re not suggesting that language is involved in the selective pressure for the "derived" alleles of ASPM and Microcephalin. Nobody really knows what the selective pressures were (although a lot of people would certainly like to find out). Bruce Lahn’s group were very explicit that they didn’t know what the selective advantage might be. Some people have even argued that there is no selective advantage and that the whole story is just a matter of genetic drift. We assume that the “cognitive bias” we propose could be an accidental by-product of whatever it is that these genes are doing.

Sample media coverage:

COSMOS Magazine (Australia);
The Times Online (UK);
New Scientist (UK);
Scientific American (USA);
Science (USA);
National Geographic (USA);
Wissenschaft.de (Germany);
Science.ORF.at (Austria);
Ciência Hoje (Brazil);
Noorderlicht Nieuws (Netherlands);
NeoFronteras (Spanish);
CBC radio, Quirks & Quarks programme (Canada);
Mark Liberman's Language Log and our response to his posting.

Last updated: 25 June 2007
D.R. Ladd & Dan Dediu