Although the purpose of this web site is to tell little stories about particular and quirky words, it may be helpful to put quirky words in a statistical context of Shakespearean vocabulary and, more particularly, rare vocabulary.
The electronic version of The Riverside Shakespeare, on which these calculations are based, has 45 distinct works by Shakespeare, including 38 plays, the Thomas More fragment, the Sonnets, the two long narrative poems, and three other poems or collections. This corpus consists of about 895,000 word occurrences, some 30,000 distinct word forms, and about 18,500 lemmata or lexical items that bundle variant forms of a word.
The following table gives some rough information about the distribution of word types and word tokens across the Shakespearean canon:
| Number of plays in which word occurs | Number of word types | Number of word tokens | % of word types | % of word tokens |
|---|---|---|---|---|
| 1 | 8139 | 13695 | 43.39% | 1.53% |
| 2 | 2340 | 6921 | 12.48% | 0.77% |
| 3-4 | 2182 | 11356 | 11.63% | 1.27% |
| 5-8 | 1985 | 18671 | 10.58% | 2.09% |
| 9-16 | 1703 | 32846 | 9.08% | 3.67% |
| 17-32 | 1393 | 73217 | 7.43% | 8.18% |
| >32 | 833 | 738056 | 4.44% | 82.49% |
There is nothing special about the distribution of words across Shakespeare's works. The table illustrates clearly the bipolar distribution that is characteristic of all texts. Less than 5% of Shakespeare's vocabulary accounts for over 80% of all word tokens. On the other end of the spectrum, almost half of his vocabulary occurs in only one work but accounts for only 1.5% of word occurrences.
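The token side of this skew can be rechecked from the counts in the table. The sketch below is only an illustration: it copies the two count columns, recomputes each band's share of all word tokens, and adds the average number of occurrences per word type, a figure the table does not give. The names `BANDS` and `summarize` are my own.

```python
# Counts copied from the table above: (band, word types, word tokens).
BANDS = [
    ("1", 8139, 13695),
    ("2", 2340, 6921),
    ("3-4", 2182, 11356),
    ("5-8", 1985, 18671),
    ("9-16", 1703, 32846),
    ("17-32", 1393, 73217),
    (">32", 833, 738056),
]


def summarize(bands):
    """Return {band: (token share in %, mean occurrences per word type)}."""
    total_tokens = sum(tokens for _, _, tokens in bands)
    return {
        band: (100 * tokens / total_tokens, tokens / types)
        for band, types, tokens in bands
    }


if __name__ == "__main__":
    for band, (share, per_type) in summarize(BANDS).items():
        print(f"{band:>6}: {share:6.2f}% of tokens, "
              f"{per_type:8.2f} occurrences per type")
```

A word confined to a single work occurs on average fewer than two times, while a word found in more than 32 works occurs on average close to 900 times.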
How does one define a rare word in the light of these figures? It is tempting to look at rare words in the aggregate, but one needs to be aware that, in frequency-based analyses of overlap vocabulary, different answers to that question produce quite different results. How many rare "Hamlet words" occur in other plays, and do the figures tell us something useful about the place of Hamlet in the corpus?
In the following table I have tried to map the proximity of Hamlet to other Shakespearean works by three different tests. The first test takes words that occur in 2-4 works and expresses their frequency in other plays as a rate per 20,000 words, which is close to the average length of a play. I omit very short works.
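The normalization to a rate per 20,000 words is simple arithmetic, and a small sketch may make it concrete. The function name `rate_per_20k` and the sample figures are hypothetical and are not taken from the Riverside data.

```python
def rate_per_20k(occurrences, work_length, base=20_000):
    """Scale a raw count to occurrences per `base` words of text."""
    if work_length <= 0:
        raise ValueError("work_length must be positive")
    return occurrences * base / work_length


# Hypothetical figures, for illustration only: the same 30 occurrences
# of shared rare words in a 25,000-word play and in a 15,000-word play.
long_play = rate_per_20k(30, 25_000)   # 24.0
short_play = rate_per_20k(30, 15_000)  # 40.0
```

The same raw count weighs more heavily in the shorter play, which is why raw counts cannot be compared across works of different lengths.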
I list the range of frequencies for the bottom and top quartile and list the individual plays in the bottom and top quartiles by ascending "z-scores," which express the number of standard deviations by which a given value departs from the average. Grouping plays in this way makes the different tests comparable.
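For readers who want to see the mechanics, here is a minimal sketch of the z-score and quartile grouping. It rests on assumptions of mine: the rates are invented, the works are labelled A to H rather than by title, and I use the population standard deviation (`statistics.pstdev`), since the choice between population and sample deviation is not stated above.

```python
from statistics import mean, pstdev


def z_scores(rates):
    """Map each work to (rate - mean) / standard deviation."""
    mu = mean(rates.values())
    sigma = pstdev(rates.values())
    if sigma == 0:
        raise ValueError("all rates are identical; z-scores are undefined")
    return {work: (rate - mu) / sigma for work, rate in rates.items()}


def quartile_groups(rates):
    """Return (bottom quartile, top quartile), each ascending by z-score."""
    z = z_scores(rates)
    ranked = sorted(z, key=z.get)
    k = len(ranked) // 4
    return ranked[:k], ranked[len(ranked) - k:]


# Invented rates per 20,000 words for eight hypothetical works.
RATES = {"A": 10.0, "B": 14.0, "C": 18.0, "D": 20.0,
         "E": 22.0, "F": 26.0, "G": 30.0, "H": 44.0}

bottom, top = quartile_groups(RATES)
```

Because a z-score is measured in standard deviations rather than in raw rates, the groupings from tests with very different absolute frequencies can be set side by side.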
The second and third tests repeat the procedure with words that occur in 5-8 and 9-16 works. Works that score high or low on two of the three tests are bolded:
Close and distant neighbors of Hamlet as measured by overlap vocabulary in three categories (bold titles mark plays that appear in more than one test)

| | Word occurs in 2-4 works | Word occurs in 5-8 works | Word occurs in 9-16 works |
|---|---|---|---|
| Lemma frequency per 20,000 words | | | |
| Range of bottom quartile | 8.6 - 15.9 | 42.91 - 55.6 | 141.1 - 157.3 |
| Range of top quartile | 27.9 - 47.1 | 67.1 - 93.7 | 189.7 - 269.3 |
| Bottom quartile: 10 plays most distant from Hamlet | | | |
| z-range -2.0 to -1.5 | 3H6 | JuC | |
| z-range -1.5 to -1.0 | AYL, He8, 1H6, Tam., Tit., TGV | TGV, Ri2 | Tam., TGV |
| z-range -1.0 to -0.5 | MAN, Ri3, RoJ | RoJ, Ri3, AWW, MeV, 3H6, Cor., 2H6 | JuC, AYL, Win., MAN, He8, 3H6, Cor., Com. |
| Top quartile: 10 plays closest to Hamlet | | | |
| z-range 0 to 0.5 | | MND, He5 | Tro. |
| z-range 0.5 to 1.0 | TwN | Tem. | MND, 1H6, Tim. |
| z-range 1.0 to 1.5 | MeM, 1H4, Tro., Cym., LLL, Ant. | MWW, KiL, LLL | KiJ |
| z-range 1.5 to 2.0 | Tim., Mac. | TNK, Rape | Son., Tem., Mac. |
| z-range 2.0 to 2.5 | | Tro., Mac. | Rape |
| z-range 2.5 to 3.0 | Oth. | | |
| z-range 3.0 to 3.5 | | | Venus |
The only clear inference to be drawn from this table is that purely frequency-based analysis of overlap vocabulary does not take you clearly in a particular direction. It is true that on the whole the three tests together do not confuse close and distant neighbors of Hamlet: only 1 Henry VI shows up in the bottom quartile of one test and the top quartile of another. The tests also show more agreement on distant than close neighbors: on the basis of agreement between two tests there are eleven distant, but only six close, neighbors. But beyond these very modest findings, it is not easy to make very much of the results.
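The observation about 1 Henry VI can be checked mechanically from the quartile lists in the table above. The sketch below copies the abbreviations from the table, dropping trailing periods so that the same work always has the same label, and looks for works that land in a bottom quartile on one test and a top quartile on another. The variable and function names are my own.

```python
# Quartile membership copied from the table above, keyed by test.
BOTTOM = {
    "2-4": {"3H6", "AYL", "He8", "1H6", "Tam", "Tit", "TGV", "MAN", "Ri3", "RoJ"},
    "5-8": {"JuC", "TGV", "Ri2", "RoJ", "Ri3", "AWW", "MeV", "3H6", "Cor", "2H6"},
    "9-16": {"Tam", "TGV", "JuC", "AYL", "Win", "MAN", "He8", "3H6", "Cor", "Com"},
}
TOP = {
    "2-4": {"TwN", "MeM", "1H4", "Tro", "Cym", "LLL", "Ant", "Tim", "Mac", "Oth"},
    "5-8": {"MND", "He5", "Tem", "MWW", "KiL", "LLL", "TNK", "Rape", "Tro", "Mac"},
    "9-16": {"Tro", "MND", "1H6", "Tim", "KiJ", "Son", "Tem", "Mac", "Rape", "Venus"},
}


def works_in(groups):
    """Union of all works that appear in any test of one quartile."""
    return set().union(*groups.values())


# Works that are "distant" on one test and "close" on another.
conflicts = works_in(BOTTOM) & works_in(TOP)

if __name__ == "__main__":
    print(sorted(conflicts))
```

On the lists as transcribed here, 1H6 is the only work in the intersection.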
Perhaps the tests show the greater discriminating power of words that occur in very few works. The 2-4 works test distinguishes much more sharply between close and distant neighbors than the other two. For words in the 2-4 works category, the highest value (47.1) is about five and a half times as great as the lowest value (8.6), and the lowest value of the top quartile (27.9) is about one and three-quarter times as great as the highest value of the bottom quartile (15.9). The 5-8 and 9-16 categories are much closer to each other than 2-4 is to 5-8: in both, the high:low ratio is around 2, and the top and bottom quartiles are separated by only about 20%. It is not especially surprising to learn that as a word occurs in more works it tends to be more evenly diffused across the corpus. But it is nonetheless helpful to know with some precision just how differently different types of vocabulary behave.
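The ratios in this paragraph follow directly from the quartile ranges in the table. As a rough check, the sketch below copies the four boundary values for each test and computes both ratios; `RANGES` and `spread` are names of my own choosing.

```python
# (lowest value, top of bottom quartile, bottom of top quartile, highest value)
# copied from the "Range of bottom/top quartile" rows of the table.
RANGES = {
    "2-4": (8.6, 15.9, 27.9, 47.1),
    "5-8": (42.91, 55.6, 67.1, 93.7),
    "9-16": (141.1, 157.3, 189.7, 269.3),
}


def spread(low, bottom_q_max, top_q_min, high):
    """Return (high:low ratio, top-quartile floor / bottom-quartile ceiling)."""
    return high / low, top_q_min / bottom_q_max


if __name__ == "__main__":
    for test, bounds in RANGES.items():
        high_low, gap = spread(*bounds)
        print(f"{test:>5} works: high:low = {high_low:.2f}, "
              f"quartile gap = {gap:.2f}")
```

The 2-4 test gives a high:low ratio of roughly 5.5, against roughly 2 for the other two tests, and a quartile gap of roughly 1.75 against roughly 1.2.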
From this analysis one may draw the tentative conclusion that it is useful to distinguish between a category of very rare words, which occur in 2-4 works, and a category of rare words, which occur in 5-16 works. Such a conclusion is supported by the fact that quirky words are overwhelmingly found in the category of very rare words. But it leads to situations in which philological and statistical analysis part ways. Thus the most interesting quirky words in Hamlet establish a suggestive web of scenic and thematic links to Twelfth Night, Measure for Measure, and most particularly Othello. This shows up in the results for the first test, which separates Othello from all the other plays. But Othello barely ranks in the top half of plays on the 9-16 test. On the other hand, a close reader would be very hard pressed to find much of interest in an analysis of the 9-16 overlap vocabulary with Venus and Adonis, even though the poem has a score that puts it practically off the chart.
If I have to choose between philology and statistics, I choose philology.