Approaching linguistic issues with the tools of data science - Christophe Coupé

At the intersection of computational modelling, data analysis and statistics, data science is gradually developing across all scientific domains, and the humanities and social sciences are no exception. Especially in these latter fields, data science rests on an epistemology and a range of techniques whose strengths and weaknesses differ from those of longer-established approaches (Kitchin, 2014; Leonelli, 2010). This opens the way to fruitful collaborations between scholars with complementary expertise, and can offer fresh perspectives on old and complex issues.

As an illustration, linguistics has seen a steady increase in quantitative approaches in recent years, and large datasets are being built to support data-driven research (e.g. Dryer & Haspelmath, 2013; Kirby et al., 2016). Some studies have emphasized the potential impact of environmental factors on linguistic diversity (Axelsen & Manrubia, 2014; Everett, 2013; Everett, Blasi, & Roberts, 2015). Others have studied how social structures underlie situations of language change, language contact and multilingualism (Ke, Gong, & Wang, 2008; Loureiro-Porto & Miguel, 2017; Lupyan & Dale, 2010). Others yet have shed light on a trade-off between information density and speech rate across languages (Coupé, Oh, Pellegrino, & Marsico, 2014; Pellegrino, Coupé, & Marsico, 2011).

Along these lines, I will report on two case studies. First, I will describe an ongoing investigation in child language acquisition, which aims to model the age of acquisition of words on the basis of longitudinal data. I will illustrate how so-called censored distributions, which account for incomplete information, can be combined with advanced statistical models to yield robust predictions. Second, I will introduce attempts to highlight the role of vegetation, humidity and temperature in the composition of phonemic inventories, relying on a large dataset of languages, satellite imagery and geographic information systems (Coupé, Hombert, Marsico, & Pellegrino, 2013; Maddieson & Coupé, 2015). More precisely, I will give acoustic arguments in favor of a relationship between the number of obstruent consonants and these environmental factors. I will assess this proposal while accounting for a range of potentially confounding factors, such as the number of speakers or the genealogical relationships between languages.
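
To make the notion of censoring in the first case study more concrete, here is a minimal sketch in Python. It is not the model used in the investigation: it assumes, purely for illustration, that the age of acquisition of a word follows a normal distribution, that some children had not yet produced the word when observation stopped (right-censoring), and that parameters can be found by a coarse grid search. The data, the function names and the grid bounds are all hypothetical.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def normal_logsf(x, mu, sigma):
    """Log of P(X > x) for a normal distribution (log survival function)."""
    z = (x - mu) / sigma
    # Floor the probability to avoid log(0) for extreme z values.
    return math.log(max(0.5 * math.erfc(z / math.sqrt(2)), 1e-300))

def censored_loglik(ages, observed, mu, sigma):
    """Log-likelihood where unobserved acquisitions are right-censored.

    observed[i] is True  -> the word was acquired at ages[i];
    observed[i] is False -> the word was still not acquired at ages[i],
                            so we only know acquisition happens later.
    """
    total = 0.0
    for age, seen in zip(ages, observed):
        if seen:
            total += normal_logpdf(age, mu, sigma)
        else:
            total += normal_logsf(age, mu, sigma)
    return total

def fit_grid(ages, observed, mus, sigmas):
    """Return the (mu, sigma) pair on the grid with the highest log-likelihood."""
    best = max(
        (censored_loglik(ages, observed, m, s), m, s)
        for m in mus
        for s in sigmas
    )
    return best[1], best[2]

# Hypothetical data (ages in months). Three children had not produced
# the word when observation stopped at 24 months.
ages = [14, 16, 17, 19, 20, 24, 24, 24]
observed = [True, True, True, True, True, False, False, False]

# Hypothetical search grid: mu in [10, 40], sigma in [1, 15], step 0.1.
mus = [10 + 0.1 * i for i in range(301)]
sigmas = [1 + 0.1 * i for i in range(141)]

mu_hat, sigma_hat = fit_grid(ages, observed, mus, sigmas)

# Naive estimate that simply drops the censored children.
seen_ages = [a for a, seen in zip(ages, observed) if seen]
naive_mean = sum(seen_ages) / len(seen_ages)

print("censored estimate of mean age:", round(mu_hat, 1))
print("naive mean (censored cases dropped):", round(naive_mean, 1))
```

Dropping the censored children biases the estimated age of acquisition downwards, since the late learners are precisely the ones excluded. The censored likelihood keeps them in the analysis by using only what is known about them, namely that acquisition had not occurred by the last observation.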