Supplementary materials: Life of p

Steven Moran and Adriano Lameira

(06 February, 2023)

Overview
Analyses
References

Overview

This RMarkdown report contains supplementary materials for the manuscript “Life of p: A consonant older than speech”. It uses the R programming language (R Core Team 2021) and the following R libraries (Wickham et al. 2019; Xie 2021; Revell 2012; J. Zhang 2017; Yu 2020; Wickham 2011):

library(tidyverse)
library(knitr)
library(phytools)
library(phylotools)
library(ggtree)
library(testthat)

In what follows, we undertake several analyses including:

Investigating the cross-linguistic frequency of labial segments including /p/ and /b/ in a large sample of the world’s languages
Examining how common the contrastive feature labial is in the documented phonological inventories of the world
Identifying the presence of labial segments in ancient and reconstructed languages
Looking at labials in language families from a diachronic perspective, i.e., investigating whether they are prominent or not within large language families for which we have robust computational phylogenetics data

Analyses

Modern cross-linguistic segment frequencies

For cross-linguistic analyses of segment frequencies, we combine the PHOIBLE sample of phonological inventories (Moran and McCloy 2019) and append to them the linguistic and non-linguistic metadata data associated with the Glottolog, a catalog of the world’s languages (Hammarström et al. 2020).

phoible <- read_csv(url("https://github.com/phoible/dev/blob/646f5e4f64bfefb7868bf4a3b65bcd1da243976a/data/phoible.csv?raw=true"),
  col_types = c(InventoryID = "i", Marginal = "l", .default = "c")
)

glottolog <- read_csv(url("https://cdstar.shh.mpg.de/bitstreams/EAEA0-E62D-ED67-FD05-0/languages_and_dialects_geo.csv"))

phoible <- left_join(phoible, glottolog, by = c("Glottocode" = "glottocode"))

The data look like this.

phoible %>%
  select(InventoryID, LanguageName, Phoneme) %>%
  head() %>%
  kable()

InventoryID	LanguageName	Phoneme
1	Korean	h
1	Korean	j
1	Korean	k
1	Korean	kʰ
1	Korean	kˀ
1	Korean	l

How many inventories (data points) are there in PHOIBLE?

nrow(phoible %>% select(InventoryID) %>% distinct())

## [1] 3020

How many languages (defined as distinct ISO 639-3 language identifiers) are there?

nrow(phoible %>% select(ISO6393) %>% distinct())

## [1] 2093

How many segment types are there?

nrow(phoible %>% select(Phoneme) %>% distinct())

## [1] 3169

Get all rows with phonemes that are voiceless bilabial plosive, i.e., “p” like.

df <- phoible %>% filter(grepl("p", Phoneme))

What are they?

ps <- df %>%
  select(Phoneme) %>%
  distinct()
ps <- df %>%
  select(Phoneme) %>%
  group_by(Phoneme) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
head(ps) %>% kable()

Phoneme	count
p	2594
pʰ	592
kp	373
pʼ	178
p͉	79
p͈	72

# write_csv(ps, 'ps.csv')

Some of the “p-like” segments include labiovelars like /kp/. Let’s drop those.

ps <- ps %>% filter(!(grepl("kp|pf|mp", Phoneme)))

Since PHOIBLE may have multiple analyses for the same language variety (see explanation regarding so-called “doculects” in the PHOIBLE FAQ), we combine the phonological inventories from multiple sources into single entries so that we can examine which languages have been reported to have certain segments or not.

phoible_by_iso <- phoible %>%
  select(ISO6393, Phoneme) %>%
  group_by(ISO6393) %>%
  distinct()

Since the ISO 6393 code mis is for languages that are missing a language name identifier, we drop those.

phoible_by_iso <- phoible_by_iso %>% filter(ISO6393 != "mis")

How many distinct languages are left once they have been aggregated by ISO 639-3 code?

num_languages <- nrow(phoible_by_iso %>% distinct(ISO6393))
num_languages

## [1] 2092

Now we select from these languages which have a p-like segment.

phoible_by_iso_with_p <- phoible_by_iso %>% filter(Phoneme %in% ps$Phoneme)
phoible_by_iso_with_p %>%
  head() %>%
  kable()

ISO6393	Phoneme
kor	p
kor	pʰ
kor	pˀ
lbe	pʰ
lbe	pʼ
lbe	p͈

We summarize their counts.

phoible_by_iso_with_p %>%
  group_by(ISO6393) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

## # A tibble: 1,949 × 2
##    ISO6393     n
##    <chr>   <int>
##  1 lez         6
##  2 sjd         6
##  3 bcq         5
##  4 gle         5
##  5 lbe         5
##  6 yey         5
##  7 acn         4
##  8 ahk         4
##  9 alw         4
## 10 amh         4
## # … with 1,939 more rows

And ask what percentage of languages in PHOIBLE have p-like sounds.

nrow(phoible_by_iso_with_p %>% select(ISO6393) %>% distinct()) / num_languages

## [1] 0.9316444

Which are the languages that contain no “p-like” segments?

phoible_by_iso_no_p <- phoible_by_iso %>%
  filter(!(ISO6393 %in% phoible_by_iso_with_p$ISO6393)) %>%
  select(ISO6393) %>%
  distinct() %>%
  arrange(ISO6393)

phoible_by_iso_no_p %>%
  head() %>%
  kable()

ISO6393
aar
aey
aft
aha
aht
aiw

There are quite a few languages without “p-like” segments.

nrow(phoible_by_iso_no_p)

## [1] 143

Or about 7% of languages in the sample (see above).

nrow(phoible_by_iso_no_p) / num_languages

## [1] 0.06835564

One random example is Afar, which does not contain a voiceless bilabial plosive, but it does contain its voiced counterpart “b”. Another is Somali with the same segment configuration within the bilabial plosives. Both languages are also in Africa.

Let’s look geographically at which languages lack voiceless bilabial plosives.

no_p_by_geography <- phoible %>%
  filter(ISO6393 %in% phoible_by_iso_no_p$ISO6393) %>%
  select(ISO6393, latitude, longitude, macroarea) %>%
  distinct() %>%
  arrange(ISO6393)

How do these language points look on a map? There are no data points in Eurasia nor Australia, but many in Oceania, Africa, and South America.

ggplot(data = no_p_by_geography, aes(x = longitude, y = latitude)) +
  borders("world", colour = "gray50", fill = "gray50") +
  geom_point()

Africa is notable for lacking voiceless bilabial plosives (Houis 1974; Maddieson 1984; Clements and Rialland 2008) and is interesting because within a relatively broad sample of phonological segment borrowings /p/ is the most frequently borrowed speech sound (Grossman et al. 2020). In other words, /p/ seems to have been lost in certain linguistic areas, probably due to regular processes of sound change, but then is easily re-introduced into languages via borrowing, e.g., Tem (Central Gur, Togo), Tigrinya (Semitic, Ethiopia), or !Xóõ (Tuu, Botswana and Namibia) (Clements and Rialland 2008). (Similar observations about cross-linguistically frequently segments missing in certain world areas, which were perhaps lost at some point in the past and this loss then inherited by daughter languages and dialects, is reported by Moran, Lester, and Grossman (2021)).

So of the languages that lack /p/, how many also lack /b/?

tmp <- phoible_by_iso %>% filter(ISO6393 %in% phoible_by_iso_no_p$ISO6393)
tmp <- tmp %>% filter(grepl("b", Phoneme))
tmp <- tmp %>% filter(!grepl("ɡb", Phoneme)) # Let's drop labial velars
tmp <- tmp %>%
  select(ISO6393) %>%
  distinct() %>%
  arrange(ISO6393)
tmp <- phoible_by_iso_no_p %>% filter(!(ISO6393 %in% tmp$ISO6393))
tmp %>% kable()

ISO6393
bvi
chr
eya
kam
kuj
mch
ndh
one
opy
tcb
trr
unk
waw
wic
wya

Where are they spoken?

no_bilabials <- phoible %>%
  filter(ISO6393 %in% tmp$ISO6393) %>%
  select(ISO6393, latitude, longitude, macroarea) %>%
  distinct() %>%
  arrange(macroarea)
no_bilabials %>% kable()

ISO6393	latitude	longitude	macroarea
kuj	-1.50636	34.50490	Africa
bvi	7.41311	27.69560	Africa
kam	-1.60827	37.95320	Africa
ndh	-9.88948	33.61180	Africa
wic	35.06650	-98.18310	North America
one	43.43874	-75.70811	North America
chr	35.46640	-83.16300	North America
eya	60.42320	-144.76200	North America
wya	NA	NA	North America
tcb	63.40460	-143.33800	North America
unk	-12.43020	-58.98020	South America
mch	4.70705	-64.38770	South America
waw	1.50881	-59.14170	South America
trr	-3.22497	-75.56030	South America
opy	-22.27800	-53.72270	South America

They are mainly found in the Americas.

ggplot(data = no_bilabials, aes(x = longitude, y = latitude)) +
  borders("world", colour = "gray50", fill = "gray50") +
  geom_point()

Languages lacking native bilabial plosives /p/ and /b/ are extremely rare in the PHOIBLE sample overall, i.e. 14 observations out of 2092 languages (0.007%). They include North American languages like Cherokee and Eyak that lack labials except the nasal /m/, which is reportedly rare or only occur in loanwords:

And Wichita with a cross-linguistically unusual phonology that lacks pure labials (e.g., /p/, /b/, and /m/), although it has the voiced labial-velar approximant /w/ and the labiovelar /kʷ/.

https://phoible.org/inventories/view/611#tipa

In East Africa, the languages include Kikamba, Kuria, Chindali, respectively:

Kikamba and Kuria both have a phonemic voiced bilabial fricative /β/ and bilabial nasal /m/. Chindali has /m/, but lacks the voiceless and voiced plosive – although it has the rare labiodental approximant /ʋ/.

The languages reported in South America, Enawené-Nawé, Yekwana, Waiwai, Taushiro, Ofayé, all contain /w/, and /m/, /β/, or /kʷ/ to various extents.

Thus, even when languages lack pure /p/ and /b/, there tends to me to some extent the phonological feature labial still present in the phonological inventory.

What about languages with no labial sounds at all? First let’s get all the languages with labials.

labials <- phoible %>%
  select(ISO6393, Phoneme) %>%
  filter(grepl("p|b|m|ɸ|β|ʙ", Phoneme)) %>%
  distinct()

phoible_by_iso_no_labials <- phoible_by_iso %>%
  filter(!(ISO6393 %in% labials$ISO6393)) %>%
  select(ISO6393) %>%
  distinct() %>%
  arrange(ISO6393)

There are five languages in the total sample that purportedly have no kind of labial, /w/ notwithstanding.

phoible_by_iso_no_labials %>% kable()

ISO6393
one
opy
trr
wic
wya

Of the five languages (and six doculects) listed above, only Oneida does not list the voiced labial-velar approximant /w/ as contrastive. Oneida is noted as being exceptional because it lacks bilabial consonants and labiodental fricatives. However, Oneida reportedly has /w/ and labialized /kw/ (Michelson 1990; Abbott 2006) and even in the source above in phoible (Lounsbury 1953), it is noted that many speakers use the voiceless bilabial fricative or other bilabial or labiodental articulations instead of the voiced velar approximant.

The feature labial

How common is the phonological feature labial in the phonological inventories of the world’s documented languages? The segment data in PHOIBLE contain information about their phonetic properties. For example the bilabial consonants:

phoible %>%
  select(Phoneme, SegmentClass, consonantal, labial, periodicGlottalSource, delayedRelease, nasal) %>%
  filter(Phoneme %in% c("p", "b", "m", "ɸ", "β", "ʙ")) %>%
  distinct() %>%
  kable()

Phoneme	SegmentClass	consonantal	labial	periodicGlottalSource	delayedRelease	nasal
m	consonant	+	+	+	0	+
p	consonant	+	+	-	-	-
b	consonant	+	+	+	-	-
β	consonant	+	+	+	+	-
ɸ	consonant	+	+	-	+	-
ʙ	consonant	+	+	+	0	-

A phonological inventory can be described in terms of these contrastive phonetic features by identifying which features are needed to contrastively encode all of the segments within the language. For example in the table above, if a language contains /p/ and /b/, we can consider periodicGlottalSource necessary for encoding the phonological distinction between the words ‘pad’ and ‘bad’, which only differ in terms of the voicing, i.e., vocal chord vibration.

In other work, we have developed an algorithm for identifying which phonological features are needed to encode each language’s segments.

load("new_answers.RData")

# The data is a list of matrices in R, so we flatten those and then summarize their presence and plot the results.
features <- lapply(new_answers, function(x) unique(as.vector((x))))
features <- features %>%
  enframe() %>%
  unnest(value)
features <- unnest(features, value) %>% arrange(name, value) # save this for later processing
cross_features <- features %>%
  group_by(name, value) %>%
  summarize(total = n())
cross_features <- cross_features %>%
  group_by(value) %>%
  summarize(total = n()) %>%
  arrange(desc(total))
head(cross_features)

## # A tibble: 6 × 2
##   value                 total
##   <chr>                 <int>
## 1 labial                 2728
## 2 continuant             2720
## 3 periodicGlottalSource  2652
## 4 front                  2640
## 5 syllabic               2572
## 6 high                   2511

There are this many inventories represented in the data set:

length(new_answers)

## [1] 2759

This plot shows that the features labial and continuant (i.e. the set of sounds that are not stops or affricates) are the most common in languages cross-linguistically. The feature periodic glottal source, aka voice, is third.

ggplot(cross_features, aes(reorder(x = value, -total), y = total)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Feature", y = "Nr. of languages") +
  ggtitle("Feature frequency by language")

Most languages employ the phonological feature labial to encode contrastive sounds in their language.

Because we want to know which languages employ the phonological feature labial in our phylogenetic analysis below, we first create a table with those values.

labial_features <- features %>% filter(value == "labial")
traits <- phoible %>%
  select(InventoryID, Glottocode) %>%
  distinct()
traits <- left_join(traits, labial_features, by = c("InventoryID" = "name"))

traits <- traits %>% mutate(value = replace(value, is.na(value), "N"))
traits <- traits %>% mutate(value = replace(value, value == "labial", "Y"))

traits <- traits %>% select(-InventoryID)
traits <- traits %>% rename(taxa = Glottocode, has_labial = value)

# Remove the NA glottocodes in phoible
traits <- traits %>% filter(!is.na(taxa))

# There are a handful of dimensionality reduction results that do not agree
# tmp <- traits %>% group_by(taxa) %>% summarize(labials = paste(labial, collapse = ','))
# tmp %>% filter(grepl("T,F", labials))
# tmp %>% filter(grepl("F,T", labials))

# For now we just take the first result and discard the rest
traits <- traits %>%
  group_by(taxa) %>%
  slice_head() %>%
  ungroup()

# Sometimes... just (a)R(rgh)
traits <- as.data.frame(traits)
rownames(traits) <- NULL
rownames(traits) <- traits[, 1]

Labials in ancient and reconstructed languages

First let’s have a look at data that exist for ancient and reconstructed languages. For example, we have seen that missing (bi)labials include areal features, for example, in Africa. What percentage of languages lack /p/ but contain /b/?

phoible_by_iso_with_b <- phoible %>% filter(grepl("b", Phoneme))
phoible_by_iso_with_b <- phoible_by_iso_with_b %>% filter(!(ISO6393 == "mis"))
phoible_by_iso_with_b <- phoible_by_iso_with_b %>% filter(!grepl("ɡb", Phoneme)) # Let's drop labial velars
phoible_by_iso_with_b <- phoible_by_iso_with_b %>%
  select(ISO6393) %>%
  distinct() %>%
  arrange(ISO6393)

with_p <- phoible_by_iso_with_p %>%
  select(ISO6393) %>%
  distinct()
with_p$has_p <- TRUE

with_b <- phoible_by_iso_with_b %>%
  select(ISO6393) %>%
  distinct()
with_b$has_b <- TRUE

results <- full_join(with_p, with_b)
results <- results %>% filter(is.na(has_p) & has_b)

So, roughly 6% of languages in the PHOIBLE sample contain a voiced bilabial plosive, but lack its voiceless counterpart.

nrow(results) / num_languages

## [1] 0.06118547

Where are they spoken?

b_but_no_p <- phoible %>%
  filter(ISO6393 %in% results$ISO6393) %>%
  select(ISO6393, latitude, longitude, macroarea) %>%
  distinct() %>%
  arrange(macroarea)

ggplot(data = b_but_no_p, aes(x = longitude, y = latitude)) +
  borders("world", colour = "gray50", fill = "gray50") +
  geom_point()

Mainly in Africa, in which this phenomenon has also been noted as a feature north of the equator and in the Arabian peninsula. It is also known that Arabic lost its /p/ in prehistoric times, but it is unclear whether the lack of /p/ in these areas is due to Arabic’s influence as a prestige language or whether the effect itself is even more ancient.

Hence, one interesting area to investigate the cross-linguistic frequency of labial sounds is in ancient and reconstruction languages of the world. BDPROTO is a database of phonological inventories from ancient and reconstructed languages (Marsico et al. 2018; Moran, Grossman, and Verkerk 2020). We can evaluate these (proto) languages for the presence or absence of bilabials in ancient times.

bdproto <- read_csv(url("https://raw.githubusercontent.com/bdproto/bdproto/master/bdproto.csv"))
num_languages_bdproto <- bdproto %>%
  select(BdprotoID) %>%
  distinct()

We note that by using BDPROTO phonological inventory IDs we count different reconstructions of the same proto-language in several cases and that some of the proto-languages are embedded within higher order language families, e.g. Germanic within Indo-European. For issues regarding so-called temporal bias, refer to Moran, Grossman, and Verkerk (2020) and Moran, Lester, and Grossman (2021).

bdproto <- bdproto %>% filter(!(is.na(BdprotoID)))

bdproto_with_b <- bdproto %>% filter(grepl("b", Phoneme))
bdproto_with_b <- bdproto_with_b %>% filter(!grepl("ɡb|mb", Phoneme))
bdproto_with_b %>%
  select(Phoneme) %>%
  distinct()

## # A tibble: 9 × 1
##   Phoneme
##   <chr>  
## 1 b      
## 2 bʰ     
## 3 ˀb     
## 4 b̥      
## 5 bː     
## 6 b̤ʰ     
## 7 bʲ     
## 8 bʷ     
## 9 bʼ

bdproto_with_b <- bdproto_with_b %>%
  select(BdprotoID) %>%
  distinct() %>%
  arrange(BdprotoID)
bdproto_with_b$has_b <- TRUE

bdproto_with_p <- bdproto %>% filter(grepl("p", Phoneme))
bdproto_with_p <- bdproto_with_p %>% filter(!grepl("kp|mp", Phoneme))
bdproto_with_p %>%
  select(Phoneme) %>%
  distinct()

## # A tibble: 13 × 1
##    Phoneme
##    <chr>  
##  1 p      
##  2 pʼ     
##  3 pʰ     
##  4 pː     
##  5 pʷ     
##  6 pːʷ    
##  7 pˀ     
##  8 pʲ     
##  9 ʰp     
## 10 pl     
## 11 pr     
## 12 p̰      
## 13 ˀp

bdproto_with_p <- bdproto_with_p %>%
  select(BdprotoID) %>%
  distinct() %>%
  arrange(BdprotoID)
bdproto_with_p$has_p <- TRUE

bdproto_results <- full_join(bdproto_with_p, bdproto_with_b)

Which ancient and reconstructed languages lack labial plosives /p/ and /b/? None.

bdproto_results %>% filter(is.na(has_p) & is.na(has_b))

## # A tibble: 0 × 3
## # … with 3 variables: BdprotoID <dbl>, has_p <lgl>, has_b <lgl>

Which lack a /p/? 15 out of 253 data points, so around 8%.

bdproto_results %>% filter(is.na(has_p))

## # A tibble: 19 × 3
##    BdprotoID has_p has_b
##        <dbl> <lgl> <lgl>
##  1         5 NA    TRUE 
##  2         8 NA    TRUE 
##  3        31 NA    TRUE 
##  4        71 NA    TRUE 
##  5       159 NA    TRUE 
##  6       166 NA    TRUE 
##  7       184 NA    TRUE 
##  8       185 NA    TRUE 
##  9       186 NA    TRUE 
## 10       190 NA    TRUE 
## 11      1014 NA    TRUE 
## 12      1023 NA    TRUE 
## 13      1025 NA    TRUE 
## 14      1032 NA    TRUE 
## 15      1039 NA    TRUE 
## 16      1059 NA    TRUE 
## 17      1081 NA    TRUE 
## 18      2002 NA    TRUE 
## 19      2015 NA    TRUE

Which lack a /b/? Quite a few more – 94 out of 253 data points, so around 37%.

bdproto_results %>% filter(is.na(has_b))

## # A tibble: 93 × 3
##    BdprotoID has_p has_b
##        <dbl> <lgl> <lgl>
##  1        10 TRUE  NA   
##  2        20 TRUE  NA   
##  3        21 TRUE  NA   
##  4        22 TRUE  NA   
##  5        23 TRUE  NA   
##  6        24 TRUE  NA   
##  7        25 TRUE  NA   
##  8        26 TRUE  NA   
##  9        27 TRUE  NA   
## 10        30 TRUE  NA   
## # … with 83 more rows

Which lack a /p/ but not /b/?

bdproto_results %>% filter(is.na(has_p) & has_b)

## # A tibble: 19 × 3
##    BdprotoID has_p has_b
##        <dbl> <lgl> <lgl>
##  1         5 NA    TRUE 
##  2         8 NA    TRUE 
##  3        31 NA    TRUE 
##  4        71 NA    TRUE 
##  5       159 NA    TRUE 
##  6       166 NA    TRUE 
##  7       184 NA    TRUE 
##  8       185 NA    TRUE 
##  9       186 NA    TRUE 
## 10       190 NA    TRUE 
## 11      1014 NA    TRUE 
## 12      1023 NA    TRUE 
## 13      1025 NA    TRUE 
## 14      1032 NA    TRUE 
## 15      1039 NA    TRUE 
## 16      1059 NA    TRUE 
## 17      1081 NA    TRUE 
## 18      2002 NA    TRUE 
## 19      2015 NA    TRUE

Which lack a /b/ but not /p/?

bdproto_results %>% filter(is.na(has_b) & has_p)

## # A tibble: 93 × 3
##    BdprotoID has_p has_b
##        <dbl> <lgl> <lgl>
##  1        10 TRUE  NA   
##  2        20 TRUE  NA   
##  3        21 TRUE  NA   
##  4        22 TRUE  NA   
##  5        23 TRUE  NA   
##  6        24 TRUE  NA   
##  7        25 TRUE  NA   
##  8        26 TRUE  NA   
##  9        27 TRUE  NA   
## 10        30 TRUE  NA   
## # … with 83 more rows

Interestingly, no ancient or reconstructed languages in the BDPROTO lack both /p/ and /b/. And the general tendency, if either one or the other is missing, is to favor /p/.

What about bilabial fricatives. About 9 percent of the data points reported in BDPROTO have them.

bdproto_bialbials <- bdproto %>% filter(grepl("ɸ|β", Phoneme))
lgs_with_bdproto_bialbials <- bdproto_bialbials %>%
  select(BdprotoID) %>%
  distinct()
nrow(lgs_with_bdproto_bialbials) / nrow(bdproto %>% select(BdprotoID) %>% distinct())

## [1] 0.0858209

How many data points do not have Glottocodes – a marker of how many languages / dialects are reported in BDPROTO – 214 out of 272.

nrow(bdproto %>% select(BdprotoID) %>% distinct())

## [1] 268

nrow(bdproto %>% select(Glottocode) %>% filter(!is.na(Glottocode)) %>% distinct())

## [1] 212

Does the frequency of the presence of bilabial fricatives change if we subset the data on Glottocodes? It actually goes up to about 10%.

bdproto_bialbials <- bdproto %>%
  filter(!is.na(Glottocode)) %>%
  filter(grepl("ɸ|β", Phoneme))
lgs_with_bdproto_bialbials <- bdproto_bialbials %>%
  select(Glottocode) %>%
  distinct()
nrow(lgs_with_bdproto_bialbials) / nrow(bdproto %>% select(Glottocode) %>% filter(!is.na(Glottocode)) %>% distinct())

## [1] 0.09433962

When compared to PHOIBLE is this prevalence of languages with contrastive (or reconstructed) bilabial fricatives greater or less today than in the past? We might expect the percentage to go down, as sounds shift to, for example, labiodentals (Blasi et al. 2019; Moran, Lester, and Grossman 2021), which have a greater intensity of noise and greater amplitude. (Maddieson (2005) notes, however, that the differences between bilabial fricatives and labiodental fricatives in the few languages that contrast them may not be subtle to speakers of those languages.)

Let’s calculate phoible by ISO 639-3 language codes. PHOIBLE actually has a higher percentage of bilabial fricatives than in the BDPROTO study at nearly 17%.

phoible_bialbials <- phoible %>%
  filter(!is.na(ISO6393)) %>%
  filter(grepl("ɸ|β", Phoneme))
lgs_with_phoible_bialbials <- phoible_bialbials %>%
  select(ISO6393) %>%
  distinct()
nrow(lgs_with_phoible_bialbials) / nrow(phoible %>% select(ISO6393) %>% filter(!is.na(ISO6393)) %>% distinct())

## [1] 0.167304

Labials from a diachronic perspective

Next we undertake a phylogenetic analysis of the feature labial. That is, we would like to know whether within certain language families (for which there exits high resolution phylogenies) if the feature labial is present or not present during the evolution of languages within particular families. However, what we find is that the feature labial is present throughout all of the daughter languages for which we have data and high resolution phylogenies.

We define some convenience functions for pruning the phylogenies with the PHOIBLE data.

PruneTraits <- function(traits, tip.labels) {
  traits.cut <- subset(traits, traits$taxa %in% tip.labels)
  return(traits.cut)
}

PruneSummaryTree <- function(nexus.file, codes, which = c("LanguageName", "ISO", "Glottocode")) {
  # Trees have tip labels like "Ache<ache1246|guq>" with language name and Glottolog codes. Take Glottolog code. Return tree.
  tree <- read.nexus(nexus.file)
  switch(which,
    Glottocode = {
      tree$tip.label <- gsub("(.*)(<)(.*)(\\|)(.*)(>)", "\\3", tree$tip.label)
    },
    ISO = {
      tree$tip.label <- gsub("(.*)(<)(.*)(\\|)(.*)(>)", "\\5", tree$tip.label)
    },
    LanguageName = {
      tree$tip.label <- gsub("(.*)(<)(.*)(\\|)(.*)(>)", "\\1", tree$tip.label)
    }
  )
  # Drop tips missing in traits
  tree <- drop.tip(tree, setdiff(tree$tip.label, codes))
  # Remove any remaining duplicates.
  if (any(duplicated(tree$tip.label))) {
    index <- which(duplicated(tree$tip.label))
    tree$tip.label[index] <- "remove"
    tree <- drop.tip(tree, "remove")
  }
  return(tree)
}

First we prune the Indo-European phylogeny as published by (Chang et al. 2015), which is available via D-PLACE (Kirby et al. 2016).

# Tree paths
tree <- "trees/ie-c-tree.nex"
pr_sum_tree <- PruneSummaryTree(tree, traits$taxa, "Glottocode")

# Prune the traits data to match the tree tips for analysis
data <- PruneTraits(traits, pr_sum_tree$tip.label)

# Combine them into a list of R data objects for analysis with the BT3 wrapper
pr_sum_tree <- list(data = data, tree = pr_sum_tree)

Here is a convenience function for plotting the phylogeny together with the presence of absence of a discrete variable.

# Define color schema
color.scheme <- c("blue", "red")
names(color.scheme) <- c("Y", "N")

# Function to reverse time in the plot
reverse.time <- function(p) {
  p$data$x <- p$data$x - max(p$data$x)
  return(p)
}

# Create tree and heatmap figure
plot.tree <- function(pr_sum_tree_plot, features_plot) {
  gheatmap(pr_sum_tree_plot, features_plot,
    colnames_position = "top", color = "black",
    colnames_offset_y = 0.1, font.size = 2.5,
    width = 0.4, offset = 8
  ) +
    scale_fill_manual(name = "", values = color.scheme) +
    scale_x_continuous(breaks = c(-6000, -4000, -2000, 0)) +
    scale_y_continuous(expand = c(-0.01, 1)) +
    theme_tree2(axis.text.x = element_text(size = 8)) +
    theme(
      legend.position = "none",
      axis.ticks = element_line(color = "grey")
    )
}

Next we plot the presence of absence of each trait (has or does not have labial as a contrastive phonological feature) on the phylogeny.

traits.print <- pr_sum_tree$data %>% select(has_labial)

p <- reverse.time(ggtree(pr_sum_tree$tree, ladderize = T, right = T)) +
  geom_tiplab(align = T, linesize = .1, size = 2)
plot.tree(p, traits.print)

What we find is that within the Indo-European phylogeny the feature labial is present in all daughter nodes of the family tree (which has been pruned to the languages that we have information about in PHOIBLE). Thus, we cannot generate for example a stochastic character mapping on the Indo-European tree because there is only one dimension, i.e., one categorical value for the input to the model, and that value is always present.

We can only assume that either all languages innovated a labial contrast and the root node (the proto-language) did not have labial contrastive segments – or the that the proto-language used the labial feature and all languages have kept that feature through time. Given the cross-linguistic frequency of contrastive labial segments in phonological inventories and the extreme rarity in which they are absent in a handful of languages in the PHOIBLE sample, we assume that labial has long been a feature of spoken languages.

Next we prune the Sino-Tibetan phylogeny as published by M. Zhang et al. (2019) and also available via D-PLACE (Kirby et al. 2016).

# Tree paths
tree <- "trees/sinotibetan-z-tree.nex"
pr_sum_tree <- PruneSummaryTree(tree, traits$taxa, "Glottocode")

# Prune the traits data to match the tree tips for analysis
data <- PruneTraits(traits, pr_sum_tree$tip.label)

# Combine them into a list of R data objects for analysis with the BT3 wrapper
pr_sum_tree <- list(data = data, tree = pr_sum_tree)

We plot at the distribution of the feature labial in Sino-Tibetan. Again, we see that the feature labial is present in all daughter nodes of the phylogeny, leading us to the same conclusion as in Indo-European – labial is a (near) universal phonologically contrastive feature used by spoken languages.

traits.print <- pr_sum_tree$data %>% select(has_labial)

p <- reverse.time(ggtree(pr_sum_tree$tree, ladderize = T, right = T)) +
  geom_tiplab(align = T, linesize = .1, size = 2)
plot.tree(p, traits.print)

References

Abbott, Clifford. 2006. Oneida Teaching Grammar. University of Wisconsin – Green Bay. https://www.uwgb.edu/UWGBCMS/media/Oneida-Language/files/teaching-grammar-revised4.pdf.

Blasi, Damián E., Steven Moran, Scott R. Moisik, Paul Widmer, Dan Dediu, and Balthasar Bickel. 2019. “Human Sound Systems Are Shaped by Post-Neolithic Changes in Bite Configuration.” Science 363 (6432). https://doi.org/10.1126/science.aav3218.

Chang, Will, Chundra Cathcart, David Hall, and Andrew Garrett. 2015. “Ancestry-Constrained Phylogenetic Analysis Supports Indo-European Steppe Hypothesis.” Language 91: 194–244. https://doi.org/10.1353/lan.2015.0005.

Clements, G. N., and Annie Rialland. 2008. “Africa as a Phonological Area.” In A Linguistic Geography of Africa, edited by Bernd Heine and Derek Nurse, 36–85. Cambridge: Cambridge University Press.

Grossman, Eitan, Elad Eisen, Dmitry Nikolaev, and Steven Moran. 2020. “SegBo: A Database of Borrowed Sounds in the World’s Languages.” In Proceedings of the 12th Language Resources and Evaluation Conference, 5316–22. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.654.

Hammarström, Harald, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2020. Glottolog 4.2.1. Jena: Max Planck Institute for the Science of Human History; Max Planck Institute for the Science of Human History. https://doi.org/10.5281/zenodo.3754591.

Houis, Maurice. 1974. “A Propos de /p/.” Afrique Et Langage 1: 35–38.

Kirby, Kathryn R., Russell D. Gray, Simon J. Greenhill, Fiona M. Jordan, Stephanie Gomes-Ng, Hans-Jörg Bibiko, Damián E. Blasi, et al. 2016. “D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity.” PLoS ONE 11 (7): e0158391.

Lounsbury, Floyd G. 1953. Oneida Verb Morphology. Yale University Publications in Anthropology. New Haven: Yale University Press.

Maddieson, Ian. 1984. Patterns of Sounds. Cambridge: Cambridge University Press.

———. 2005. “Bilabial and Labio-Dental Fricatives in Ewe.” UC Berkeley Phonology Lab Annual Report 1 (1). https://escholarship.org/uc/item/4r49g6qx.

Marsico, Egidio, Sebastien Flavier, Annemarie Verkerk, and Steven Moran. 2018. “BDPROTO: A Database of Phonological Inventories from Ancient and Reconstructed Languages.” In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018).

Michelson, Karin. 1990. “The Oneida Lexicon.” In Proceedings of the Sixteenth Annual Meeting of the Berkeley Linguistics Society: Special Session on General Topics in American Indian, 16:73–84. 2.

Moran, Steven, Eitan Grossman, and Annemarie Verkerk. 2020. “Investigating Diachronic Trends in Phonological Inventories Using BDPROTO.” Language Resources and Evaluation. https://doi.org/https://doi.org/10.1007/s10579-019-09483-3.

Moran, Steven, Nicholas A. Lester, and Eitan Grossman. 2021. “Inferring Recent Evolutionary Changes in Speech Sounds.” Philosophical Transactions of the Royal Society B: Biological Sciences 376 (20200198). https://doi.org/https://doi.org/10.1098/rstb.2020.0198.

Moran, Steven, and Daniel McCloy, eds. 2019. PHOIBLE 2.0. Jena: Max Planck Institute for the Science of Human History. https://phoible.org/.

R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Revell, Liam J. 2012. “Phytools: An r Package for Phylogenetic Comparative Biology (and Other Things).” Methods in Ecology and Evolution 3: 217–23.

Wickham, Hadley. 2011. “Testthat: Get Started with Testing.” The R Journal 3: 5–10. https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Xie, Yihui. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.

Yu, Guangchuang. 2020. “Using Ggtree to Visualize Data on Tree‐Like Structures.” Curr. Protoc. Bioinformatics 69 (1). https://doi.org/10.1002/cpbi.96.

Zhang, Jinlong. 2017. Phylotools: Phylogenetic Tools for Eco-Phylogenetics. https://CRAN.R-project.org/package=phylotools.

Zhang, Menghan, Shi Yan, Wuyun Pan, and Li Jin. 2019. “Phylogenetic Evidence for Sino-Tibetan Origin in Northern China in the Late Neolithic.” Nature 569 (7754): 112–15. https://doi.org/10.1038/s41586-019-1153-z.