About rare words

 

Although the purpose of this web site is to tell little stories about particular and quirky words, it may be helpful to place those words in the statistical context of Shakespeare's vocabulary as a whole and, more particularly, of his rare vocabulary.

 

The electronic version of The Riverside Shakespeare, on which these calculations are based, contains 45 distinct works by Shakespeare, including 38 plays, the Thomas More fragment, the Sonnets, the two long narrative poems, and three other poems or collections. This corpus consists of about 895,000 word occurrences, some 30,000 distinct word forms, and about 18,500 lemmata, or lexical items that bundle the variant forms of a word.

 

The following table gives some rough information about the distribution of word types and word tokens across the Shakespearean canon:

 

 

Number of plays in which word occurs | Number of word types (lemmata) | Number of word tokens (occurrences) | % of word types | % of word tokens
1     |  8139 |  13695 | 43.39% |  1.53%
2     |  2340 |   6921 | 12.48% |  0.77%
3-4   |  2182 |  11356 | 11.63% |  1.27%
5-8   |  1985 |  18671 | 10.58% |  2.09%
9-16  |  1703 |  32846 |  9.08% |  3.67%
17-32 |  1393 |  73217 |  7.43% |  8.18%
>32   |   833 | 738056 |  4.44% | 82.49%

 

There is nothing special about the distribution of words across Shakespeare's works. The table illustrates clearly the bipolar distribution that is characteristic of all texts. Less than 5% of Shakespeare's vocabulary (the 833 lemmata that occur in more than 32 works) accounts for over 80% of all word tokens. At the other end of the spectrum, almost half of his vocabulary occurs in only one work but accounts for only 1.5% of word occurrences.
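The shares in the table can be recomputed from the raw counts. The following is a minimal Python sketch, not the procedure actually used for this site; the names ROWS and shares are illustrative. Token shares are taken against the sum of the token column, which reproduces the table's figures. The published type percentages appear to rest on a slightly larger total (roughly 18,750 lemmata) than the sum of the seven rows shown, so the type shares computed here come out a fraction of a point higher.

```python
# Rows of the table: (number of works in which the word occurs, lemmata, occurrences)
ROWS = [
    ("1", 8139, 13695),
    ("2", 2340, 6921),
    ("3-4", 2182, 11356),
    ("5-8", 1985, 18671),
    ("9-16", 1703, 32846),
    ("17-32", 1393, 73217),
    (">32", 833, 738056),
]


def shares(rows):
    """Return {band: (type share in %, token share in %)} relative to the column sums."""
    total_types = sum(types for _, types, _ in rows)
    total_tokens = sum(tokens for _, _, tokens in rows)
    return {
        band: (100 * types / total_types, 100 * tokens / total_tokens)
        for band, types, tokens in rows
    }


if __name__ == "__main__":
    for band, (type_pct, token_pct) in shares(ROWS).items():
        print(f"{band:>6}  {type_pct:6.2f}% of types  {token_pct:6.2f}% of tokens")
```

The two poles of the distribution are the first and last rows: the single-work band holds the largest share of types and one of the smallest shares of tokens, and the >32 band reverses that.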

 

How does one define a rare word in the light of these figures? It is tempting to look at rare words in the aggregate, but one needs to be aware that in frequency-based analyses of overlap vocabulary different answers to that question produce quite different results. How many rare "Hamlet words" occur in other plays, and do the figures tell us something useful about the place of Hamlet in the corpus?

 

In the following table I have tried to map the proximity of Hamlet to other Shakespearean works by three different tests. The first test takes words that occur in 2-4 works and expresses their frequency in other plays as a rate per 20,000 words, which is close to the average length of a play. I omit very short works.
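One possible reading of this first test can be sketched in Python. Everything here is an assumption made for illustration: the function name overlap_rate, the data layout, the placeholder lemmata, and the counts and work lengths in the toy data are hypothetical and are not taken from the Riverside corpus.

```python
from collections import Counter

# Hypothetical toy data: lemma counts per work and total word count per work.
COUNTS = {
    "Ham": Counter({"lemma_a": 1, "lemma_b": 2, "common": 100}),
    "Oth": Counter({"lemma_a": 3, "common": 90}),
    "TwN": Counter({"lemma_b": 1, "common": 80}),
    "JuC": Counter({"common": 70}),
    "Mac": Counter({"common": 60}),
}
LENGTHS = {"Ham": 30000, "Oth": 25000, "TwN": 20000, "JuC": 19000, "Mac": 17000}


def overlap_rate(counts, lengths, target="Ham", lo=2, hi=4, per=20_000):
    """Rate per `per` words, in every other work, of the target's lemmata
    that occur in between `lo` and `hi` works overall."""
    # Count the number of works in which each lemma occurs.
    spread = Counter()
    for work_counts in counts.values():
        spread.update(work_counts.keys())

    # Lemmata of the target work that fall in the chosen band.
    shared = {lemma for lemma in counts[target] if lo <= spread[lemma] <= hi}

    rates = {}
    for work, work_counts in counts.items():
        if work == target:
            continue
        hits = sum(n for lemma, n in work_counts.items() if lemma in shared)
        rates[work] = hits * per / lengths[work]  # normalize for work length
    return rates


if __name__ == "__main__":
    for work, rate in overlap_rate(COUNTS, LENGTHS).items():
        print(f"{work}: {rate:.1f} per 20,000 words")
```

Scaling by work length is what lets a short comedy and a long tragedy be compared on the same footing. Calling the function with lo=5, hi=8 or lo=9, hi=16 gives the other frequency bands.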

 

I list the range of frequencies for the bottom and top quartile and list the individual plays in the bottom and top quartiles by ascending "z-scores," which express the number of standard deviations by which a given value departs from the average. Grouping plays in this way makes the different tests comparable.
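The z-score and quartile grouping described above can be sketched as follows. This is an illustration under stated assumptions: the rates are invented, the work labels A to H are placeholders, and the population standard deviation is used because the text does not say whether the population or the sample formula was applied.

```python
from statistics import mean, pstdev

# Hypothetical rates per 20,000 words for eight placeholder works.
RATES = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60, "G": 70, "H": 80}


def z_scores(rates):
    """Number of standard deviations by which each rate departs from the average."""
    values = list(rates.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return {work: (rate - mu) / sigma for work, rate in rates.items()}


def quartile_groups(rates):
    """Return (bottom quartile, top quartile) as lists of works in ascending order of rate."""
    ranked = sorted(rates, key=rates.get)
    k = max(1, len(ranked) // 4)
    return ranked[:k], ranked[-k:]


if __name__ == "__main__":
    z = z_scores(RATES)
    bottom, top = quartile_groups(RATES)
    print("most distant:", [(work, round(z[work], 2)) for work in bottom])
    print("closest:", [(work, round(z[work], 2)) for work in top])
```

Because a z-score is expressed in standard deviations rather than in raw rates, results from bands with very different raw frequencies can be set side by side.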

 

The second and third tests repeat the procedure with words that occur in 5-8 and 9-16 works. Works that score high or low on two of the three tests are bolded:

 

 

Close and distant neighbors of Hamlet as measured by overlap vocabulary in three categories
(bold titles mark plays that appear in more than one test)

 

                         | word occurs in 2-4 works | word occurs in 5-8 works | word occurs in 9-16 works

Lemma frequency per 20,000 words
Range of bottom quartile | 8.6 - 15.9  | 42.91 - 55.6 | 141.1 - 157.3
Range of top quartile    | 27.9 - 47.1 | 67.1 - 93.7  | 189.7 - 269.3

Bottom quartile: 10 plays most distant from Hamlet
z-range -2.0 to -1.5 | 3H6 | JuC |
z-range -1.5 to -1.0 | AYL, He8, 1H6, Tam., Tit., TGV | TGV, Ri2 | Tam., TGV
z-range -1.0 to -0.5 | MAN, Ri3, RoJ | RoJ, Ri3, AWW, MeV, 3H6, Cor., 2H6 | JuC, AYL, Win., MAN, He8, 3H6, Cor., Com.

Top quartile: 10 plays closest to Hamlet
z-range 0 to 0.5   |  | MND, He5 | Tro.
z-range 0.5 to 1.0 | TwN | Tem. | MND, 1H6, Tim.
z-range 1.0 to 1.5 | MeM, 1H4, Tro., Cym., LLL, Ant. | MWW, KiL, LLL | KiJ
z-range 1.5 to 2.0 | Tim., Mac. | TNK, Rape | Son., Tem., Mac.
z-range 2.0 to 2.5 |  | Tro., Mac. | Rape
z-range 2.5 to 3.0 | Oth. |  |
z-range 3.0 to 3.5 |  |  | Venus

 

The only clear inference to be drawn from this table is that purely frequency-based analysis of overlap vocabulary does not point clearly in any particular direction. It is true that on the whole the three tests together do not confuse close and distant neighbors of Hamlet: only 1 Henry VI shows up in the bottom quartile of one test and the top quartile of another. The tests also show more agreement on distant than close neighbors: on the basis of agreement between two tests there are eleven distant, but only six close, neighbors. But beyond these very modest findings, it is not easy to make very much of the results.

 

Perhaps the tests show the greater discriminating power of words that occur in very few works. The 2-4 works test distinguishes much more sharply between close and distant neighbors than the other two. In the 2-4 works category the highest value (47.1) is about five and a half times as great as the lowest value (8.6), and the lowest value of the top quartile (27.9) is about 75% greater than the highest value of the bottom quartile (15.9). The 5-8 and 9-16 categories are much closer to each other than 2-4 is to 5-8: in both, the high:low ratio is around 2 and the top and bottom quartiles are separated by only about 20%. It is not especially surprising to learn that as a word occurs in more works it tends to be more evenly diffused across the corpus. But it is nonetheless helpful to know with some precision just how differently different types of vocabulary behave.
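The ratios in this comparison follow directly from the quartile ranges given in the table. A short Python check, with illustrative names (QUARTILES, spread) that are not part of the original analysis:

```python
# Quartile ranges from the table, as (low, high) rates per 20,000 words.
QUARTILES = {
    "2-4": {"bottom": (8.6, 15.9), "top": (27.9, 47.1)},
    "5-8": {"bottom": (42.91, 55.6), "top": (67.1, 93.7)},
    "9-16": {"bottom": (141.1, 157.3), "top": (189.7, 269.3)},
}


def spread(band):
    """Return (highest value / lowest value, lowest top value / highest bottom value)."""
    lowest, bottom_max = band["bottom"]
    top_min, highest = band["top"]
    return highest / lowest, top_min / bottom_max


if __name__ == "__main__":
    for name, band in QUARTILES.items():
        high_low, gap = spread(band)
        print(f"{name:>4} works: high:low ratio {high_low:.2f}, quartile gap {gap:.2f}")
```

The first figure measures the full range of the test; the second measures how far apart the close and distant groups sit, with 1.0 meaning that they touch.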

 

From this analysis one may draw the tentative conclusion that it is useful to distinguish between a category of very rare words, which occur in 2-4 works, and a category of rare words, which occur in 5-16 works. Such a conclusion is supported by the fact that quirky words are overwhelmingly found in the category of very rare words. But it leads to situations in which philological and statistical analysis part ways. Thus the most interesting quirky words in Hamlet establish a suggestive web of scenic and thematic links to Twelfth Night, Measure for Measure, and most particularly Othello. This shows up in the results for the first test, which separates Othello from all the other plays. But Othello barely ranks in the top half of plays on the 9-16 test. On the other hand, a close reader would be very hard pressed to find much of interest in an analysis of the 9-16 overlap vocabulary with Venus and Adonis, even though the poem has a score that puts it practically off the chart.

 

If I have to choose between philology and statistics, I choose philology.