For our estimation, we analyse the information stored for three well-studied acoustic cues: voice onset time (VOT, in ms), a cue to voiced-voiceless distinctions (e.g. /b/ versus /p/); central frication frequency; and formant frequencies. We assume that, initially, learners have maximum uncertainty along each cue, following uniform distributions bounded by the limits of perception.
For frequencies, we assume bounds on human hearing of 20–20 000 Hz, which bound the uniform prior distributions. We find that language users store 3 bits of information for voiceless VOT, 5 bits for voiced VOT, 3 bits for central frication frequency and 15 bits for formant frequencies. As these acoustic cues are only a subset of the cues required to identify consonant phonemes, we assume that consonants require three cues, each requiring 5 bits of information.
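The logic of these per-cue figures can be sketched as follows: starting from a uniform prior over the perceptible range of a cue and resolving its value to within one discrimination step leaves log2(range/step) bits. The ranges and resolutions below are illustrative assumptions, not the paper's measured values.

```python
import math

def cue_bits(lo, hi, resolution):
    """Bits gained by resolving a cue from a uniform prior over
    [lo, hi] down to an interval of width `resolution`."""
    return math.log2((hi - lo) / resolution)

# Hypothetical numbers: a ~100 ms VOT range resolved to ~10 ms
# steps gives a little over 3 bits, matching the order of
# magnitude discussed in the text.
print(round(cue_bits(0, 100, 10), 2))
```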
For vowels, we do not adjust the 15 bits of information conveyed by formant frequencies. As a best guess, again attending primarily to the order of magnitude, we assume there are 50 phonemes, each requiring 15 bits, totalling 750 bits of information. For lower and upper estimates, we introduce a factor of two error [375–1500 bits]. Entire dissertations could be, and have been, written on these distinctions. These difficulties are in part why the Fermi approach is so useful: we do not need to make strong theoretical commitments in order to study the problem if we focus on rough estimation of orders of magnitude.
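The arithmetic behind this best guess and its factor-of-two bounds is simply:

```python
# Best guess for the phoneme inventory: 50 phonemes at 15 bits each,
# with a factor-of-two error band for the lower and upper estimates.
n_phonemes = 50
bits_per_phoneme = 15
best = n_phonemes * bits_per_phoneme  # 750 bits
lower, upper = best / 2, best * 2     # 375 and 1500 bits
print(best, lower, upper)
```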
Estimates of the number of words children acquire range on the order of 20 000–80 000 total wordforms [13]. However, when words are grouped into families (e.g. run, runs, running), the counts are considerably lower. Lexical knowledge extends beyond words, too: Jackendoff [17] estimates that the average adult understands around 25 000 idioms, items out of the view of most vocabulary studies.
Our estimates of capacity could, of course, be based on upper bounds on what people could learn, which, to our knowledge, have not been established. The most basic thing each learner must acquire about a word is its phonemic wordform, meaning the sequence of phonemes that make up its phonetic realization. If we assume that wordforms are essentially memorized, then the entropy H[R|D] is zero after learning; that is, learners retain each wordform exactly. The challenge then is to estimate what H[R] is: before learning anything, what uncertainty should learners have? To answer this, we note that H[R] corresponds to the total uncertainty removed by learning. Here, we use a language model to estimate the average negative log probability of the letter sequences that make up words and view this as an estimate of the amount of entropy that has been removed for each word.
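In these terms, the information a learner gains about a wordform is the drop in entropy from before to after learning; with the posterior entropy at zero, it reduces to the prior entropy alone:

```latex
I = H[R] - H[R \mid D] = H[R] \quad \text{when } H[R \mid D] = 0 .
```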
In other words, the average surprisal of a word under a language model provides one way to estimate the amount of uncertainty that learners who know a given word must have removed. We computed the surprisal of each word under 1-phone, 2-phone, 3-phone and 4-phone models (see [19]) trained on the lexicon. This analysis revealed that 43 bits per word on average are required under the 1-phone model, 33 bits per word under the 2-phone, 24 under the 3-phone and 16 under the 4-phone model.
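The pattern of decreasing surprisal with longer contexts can be reproduced with a minimal sketch of an n-phone (n-gram over phonemes) model. The toy lexicon and its transcriptions below are hypothetical; the maximum-likelihood estimator is a simplification of the models cited in the text.

```python
import math
from collections import Counter

def avg_surprisal(words, n):
    """Average per-word surprisal (bits) under a maximum-likelihood
    n-phone model fit on the same lexicon. Words are tuples of
    phoneme symbols; '#' pads the left context."""
    ctx_counts, ngram_counts = Counter(), Counter()
    for w in words:
        padded = ('#',) * (n - 1) + tuple(w)
        for i in range(n - 1, len(padded)):
            gram = padded[i - n + 1:i + 1]
            ngram_counts[gram] += 1
            ctx_counts[gram[:-1]] += 1
    total = 0.0
    for w in words:
        padded = ('#',) * (n - 1) + tuple(w)
        for i in range(n - 1, len(padded)):
            gram = padded[i - n + 1:i + 1]
            total -= math.log2(ngram_counts[gram] / ctx_counts[gram[:-1]])
    return total / len(words)

# Toy lexicon of phoneme strings (hypothetical transcriptions).
lexicon = [('k','ae','t'), ('k','ae','p'), ('b','ae','t'), ('b','ae','g')]
# Longer contexts assign words higher probability, so fewer bits
# per word remain to be learned.
print(avg_surprisal(lexicon, 1) > avg_surprisal(lexicon, 2))  # → True
```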
The information contained in lexical semantics is difficult to evaluate because there are no accepted theories of semantic content, or conceptual content more generally [ 20 ]. However, following Fermi, we can make very simplified assumptions and try to estimate the general magnitude of semantic content.
One way to do this is to imagine that word meanings are distributions in an N-dimensional semantic space. If we assume that the entire space is a Gaussian with standard deviation R and that an individual word meaning is a Gaussian with standard deviation r, then we can compute the information contained in a word meaning as the difference in uncertainty between an N-dimensional Gaussian of scale R and one of scale r.
The general logic is shown in figure 1. The reduction in entropy from a total semantic space of size R (no idea what a word means) to one of size r is what we use to approximate the amount of information that has been learned.
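Assuming isotropic Gaussians, this reduction has a simple closed form: the differential entropy of an N-dimensional Gaussian with standard deviation $\sigma$ is $\tfrac{N}{2}\log_2(2\pi e \sigma^2)$, so the information gained per word meaning is

```latex
\Delta H
= \frac{N}{2}\log_2\!\left(2\pi e R^2\right)
- \frac{N}{2}\log_2\!\left(2\pi e r^2\right)
= N \log_2 \frac{R}{r} .
```

That is, the estimate depends only on the dimensionality N and the ratio R/r.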
Figure 1. The shaded spheres represent uncertainty in semantic space centred around a word (in green). The reduction in uncertainty from R to r reflects the amount of semantic information conveyed by the green word. Equation (2.1) quantifies this reduction; however, the dimensionality of semantic space is considerably larger than the three dimensions shown.
We estimate R and r in several different ways by using WordNet [21] to determine the closeness of each word to its neighbours in semantic space. In particular, we take r to be a characteristic distance to nearby neighbours (e.g. the distance to a word's nearest neighbours).
Note that this assumes the size of a word's Gaussian is about the same as its distance to a neighbour; in reality, this may underestimate the information a word meaning contains, because words could be much more precise than their closest semantic neighbour. The likely values fall within a range close to 1 bit per dimension; for instance, if semantic space were one-dimensional, each word meaning would convey roughly a bit of information. The nearness of these values to 1 means that even continuous semantic dimensions can be viewed as approximately binary in terms of the amount of information they provide about meaning.
Figure 2. These estimates robustly show values on the order of 1 bit per dimension. The dimensionality of semantic space has been studied by [22,23]. Our best guess uses 1 bit per dimension and the dimensionality following [22], for 12 bits; our upper bound uses 2 bits per dimension and a larger dimensionality, for a total of 40 bits.
For our lower bound in this domain, we pursue a completely different technique which, surprisingly, gives a similar order of magnitude to our best guess. In this case, the problem of learning is figuring out which of the 40 000! possible mappings between wordforms and meanings is the correct one. It will take log2(40 000!) ≈ 5.5 × 10^5 bits of information to specify the correct mapping, and we will use this as our lower bound. While this seems like an unmanageable task for the child, it is useful to imagine how much information is conveyed by a single pedagogical learning instance.
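The factorial term is easily evaluated without overflow via the log-gamma function, since log2(n!) = lgamma(n + 1)/ln 2:

```python
import math

# Bits needed to single out one mapping among the 40 000! possible
# pairings of wordforms with meanings (40 000 wordforms assumed above).
n = 40_000
bits = math.lgamma(n + 1) / math.log(2)
print(f"{bits:.2e}")            # on the order of 5.5e+05 bits
print(round(bits / n, 1))       # i.e. roughly 14 bits per word
```

Note that the per-word figure lands near the 12-bit best guess from the dimensional analysis, which is why the two techniques agree in order of magnitude.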
Word frequencies are commonly studied in psychology as factors influencing language processing and acquisition. At one extreme, language users might store only a single bit about word frequency, essentially allowing them to categorize high- versus low-frequency words along a median split. At the other extreme, language users may store information about word frequency with higher fidelity; for instance, 10 bits would allow them to distinguish 2^10 distinct levels of word frequency, as a kind of psychological floating point number.
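The per-word cost under this scheme is just log2 of the number of distinguishable levels, which spans the two extremes described above:

```python
import math

# Bits per word if speakers track one of M distinguishable frequency
# levels: a median split (M = 2) costs 1 bit, while a 10-bit
# "floating point" code distinguishes 2**10 = 1024 levels.
for M in (2, 1024):
    print(M, math.log2(M))
```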
Or perhaps language learners store a full ranking of all 40 000 words in terms of frequency, requiring log2(40 000!) ≈ 5.5 × 10^5 bits. We removed words below the bottom 30th percentile (a frequency count of 1) and words above the upper 99th percentile in word frequency, in order to study the intermediate-frequency majority of the lexicon. Each participant completed a series of trials. This shows, for instance, that people are poor at distinguishing words with very close frequencies (i and j near the red line), as should be expected. Figure 3. Accuracy in frequency discrimination as a function of log word frequency bin, faceted by log reference word frequency bin.
Vertical red lines denote within-bin comparisons. Neglecting the relatively small change in accuracy (and thus fidelity) with a word's absolute frequency, this accuracy can be modelled by imagining that participants store M levels of word frequency. We construct our lower and upper bounds by introducing a factor of two error on this computation (e.g. halving and doubling the estimate). It is important to note that, by assuming objective frequency rankings, our estimate is conservative.

Syntax has traditionally been the battleground for debates about how much information is built-in versus learned.
Indeed, syntactic theories run the gamut from those that formalize a few dozen binary parameters [34,35] to those that consider alternative spaces of infinite models. In the face of massively incompatible and experimentally under-determined syntactic theories, we aim here to study the question in a way that is as independent as possible of the specific syntactic formalism.
In many cases, the sentences of English will share syntactic structure.
However, we can imagine a set s1, s2, …, sn of sentences that share as little syntactic structure as possible between each si and sj. In this case, the bits specifying these parses can be added together to estimate the total information learners know. In general, the number of logically possible parses can be computed as the number of binary trees over si, which is determined only by the length of si. Our upper and lower bounds will take into account uncertainty about the number of distinct sentences si that can be found.
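The number of binary trees over a sentence of k words is the Catalan number C(k−1), so the bits needed to specify one parse can be computed directly:

```python
import math

def parse_bits(k):
    """Bits to specify one binary tree (parse) over a sentence of
    k words: log2 of the Catalan number C(k-1)."""
    catalan = math.comb(2 * (k - 1), k - 1) // k
    return math.log2(catalan)

# A 10-word sentence has C(9) = 4862 possible binary trees,
# so its parse carries a little over 12 bits.
print(round(parse_bits(10), 1))
```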
To estimate the number of such sentences, we use the textbook linguistic examples studied by [ 39 ]. They present sentences that are meant to span the range of interesting linguistic phenomena and were presented independently in [ 40 ].
Our best estimate is therefore on the order of 2 × 10^3 bits, and we take the lower bound to be half of this. For an upper bound, we consider the possibility that the sentences in [39] may not cover the majority of syntactic structures, particularly when compared to more exhaustive grammars like [41]. The upper bound is constructed by imagining that linguists could perhaps construct twice as many sentences with unique structures, meaning that we should double our best-guess estimate.
Notably, these tactics to bound the estimate do not qualitatively change its size: human language requires very little information about syntax, on the order of 10^3 bits. In either case, the number is much smaller than in most other domains. It may seem surprising but, in terms of digital media storage, our knowledge of language fits almost compactly on a floppy disk. The best-guess estimate implies that learners must be remembering roughly 1000–2000 bits per day about their native language, which is a remarkable feat of cognition.
Our lower bound is around a million bits, which implies that learners would remember around 150 bits each day from birth to age 18.
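This daily rate follows directly from the lower-bound total:

```python
# Daily memory rate implied by the lower-bound estimate:
# a million bits spread over 18 years of days.
total_bits = 1_000_000
days = 18 * 365
print(round(total_bits / days))  # → 152
```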
To put our lower estimate in perspective, each day for 18 years a child must wake up and remember, perfectly and for the rest of their life, an amount of information equivalent to that in a sequence of roughly 150 random binary digits. Naturally, the information will be encoded in a different format, presumably one more amenable to the workings of human memory.
But in our view, both the lower-bound and best-guess estimates are explainable only under the assumption that language is grounded in remarkably sophisticated mechanisms for learning, memory and inference. There are several limitations to our methods, which is part of the reason we focus on orders of magnitude rather than precise estimates.