disp_tdm {tlda} | R Documentation |
Calculate parts-based dispersion measures for a term-document matrix
Description
This function calculates a number of parts-based dispersion measures and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.
Usage
disp_tdm(
tdm,
row_partsize = "first",
directionality = "conventional",
freq_adjust = FALSE,
freq_adjust_method = "even",
unit_interval = TRUE,
digits = NULL,
verbose = TRUE,
print_scores = TRUE,
suppress_warning = FALSE
)
Arguments
tdm |
A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix) |
row_partsize |
Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last" |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_scores |
Logical. Whether the dispersion scores should be printed to the console; default is TRUE |
suppress_warning |
Logical. Whether warning messages should be suppressed; default is FALSE |
Details
This function takes as input a term-document matrix and returns, for each item (i.e. each row) a variety of dispersion measures. The rows in the matrix represent the items, and the columns the corpus parts. Importantly, the term-document matrix must include an additional row that records the size of the corpus parts. For a proper term-document matrix, which includes all items that appear in the corpus, this can be added as a column margin, which sums the frequencies in each column. If the matrix only includes a selection of items drawn from the corpus, this information cannot be derived from the matrix and must be provided as a separate row.
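The two situations described above can be sketched in base R. All item names, part names, and counts below are made up for illustration, and the row name word_count is an arbitrary choice, not something the function is known to require.

```r
# Toy term-document matrix: 3 items (rows) x 4 corpus parts (columns).
# All names and counts are invented for illustration.
counts <- matrix(
  c(2, 0, 6, 12,
    5, 5, 5,  5,
    0, 0, 0,  9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("item_a", "item_b", "item_c"),
                  c("part1", "part2", "part3", "part4"))
)

# Case 1: the matrix holds every item in the corpus, so the part sizes
# equal the column sums and can be added as a first row.
tdm_full <- rbind(word_count = colSums(counts), counts)

# Case 2: the matrix holds only a selection of items, so the part sizes
# must come from elsewhere (hypothetical values here).
part_sizes <- c(part1 = 1000, part2 = 2000, part3 = 3000, part4 = 4000)
tdm_selection <- rbind(word_count = part_sizes, counts)

tdm_full
tdm_selection
```

Either matrix has the part sizes in its first row, which corresponds to row_partsize = "first".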
Directionality: The scores for all measures range from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; this is implemented by the value "gries".
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0, and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. The method used by Gries (2022, 2024) relies on a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
- To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
- To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even").
The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp() and vignette("frequency-adjustment").
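The min-max step itself can be sketched as follows. This is an illustration only: adjust_minmax() is a hypothetical helper, not a function exported by the package, and the three scores passed to it are invented. How the package derives the two extremes is described in the vignette and is not reproduced here.

```r
# Min-max transformation under conventional scaling (higher = more even).
# `adjust_minmax` is a hypothetical helper written for this sketch.
adjust_minmax <- function(observed, disp_min, disp_max, unit_interval = TRUE) {
  # Express the observed score relative to the attainable range
  adjusted <- (observed - disp_min) / (disp_max - disp_min)
  if (unit_interval) {
    # Replace scores outside the unit interval by 0 or 1
    adjusted <- pmin(pmax(adjusted, 0), 1)
  }
  adjusted
}

# Invented numbers: observed score 0.60; 0.20 and 0.90 stand in for the
# lowest and highest scores attainable at the item's corpus frequency.
adjust_minmax(observed = 0.60, disp_min = 0.20, disp_max = 0.90)
```

With unit_interval = FALSE the helper would return values below 0 or above 1 whenever the observed score falls outside the constructed extremes.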
The following measures are computed, listed in chronological order (see details below):
- R_{rel} (Keniston 1920)
- D (Juilland & Chang-Rodriguez 1964)
- D_2 (Carroll 1970)
- S (Rosengren 1971)
- D_P (Gries 2008; modification: Egbert et al. 2020)
- D_A (Burch et al. 2017)
- D_{KL} (Gries 2024)
In the formulas given below, the following notation is used:
- k: the number of corpus parts
- T_i: the absolute subfrequency in part i
- t_i: a proportional quantity; the subfrequency in part i divided by the total number of occurrences of the item in the corpus (i.e. the sum of all subfrequencies)
- W_i: the absolute size of corpus part i
- w_i: a proportional quantity; the size of corpus part i divided by the size of the corpus (i.e. the sum of the part sizes)
- R_i: the normalized subfrequency in part i, i.e. the subfrequency divided by the size of the corpus part
- r_i: a proportional quantity; the normalized subfrequency in part i divided by the sum of all normalized subfrequencies
- N: corpus frequency, i.e. the total number of occurrences of the item in the corpus
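To make the notation concrete, the quantities can be computed for a single item in base R. The subfrequencies and part sizes are made up; the variable names simply mirror the symbols above.

```r
# Made-up data for one item in a corpus with four parts
T_i <- c(2, 0, 6, 12)              # absolute subfrequencies
W_i <- c(1000, 2000, 3000, 4000)   # absolute part sizes

k   <- length(T_i)      # number of corpus parts
N   <- sum(T_i)         # corpus frequency of the item
t_i <- T_i / N          # proportional subfrequencies (sum to 1)
w_i <- W_i / sum(W_i)   # proportional part sizes (sum to 1)
R_i <- T_i / W_i        # normalized subfrequencies
r_i <- R_i / sum(R_i)   # proportional normalized subfrequencies (sum to 1)
```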
Note that the formulas cited below differ in their scaling, i.e. whether 1 reflects an even or an uneven distribution. In the current function, this behavior is overridden by the argument directionality. The specific scaling used in the formulas below is therefore irrelevant.
R_{rel} refers to the relative range, i.e. the proportion of corpus parts containing at least one occurrence of the item.
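As a minimal sketch with made-up subfrequencies, the relative range is the share of parts with a nonzero count:

```r
T_i <- c(2, 0, 6, 12)    # made-up subfrequencies across four parts
R_rel <- mean(T_i > 0)   # 3 of the 4 parts contain the item
R_rel
```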
D denotes Juilland's D and is calculated as follows (this formula uses conventional scaling); \bar{R_i} denotes the average over the normalized subfrequencies:
1 - \sqrt{\frac{\sum_{i = 1}^k (R_i - \bar{R_i})^2}{k}} \times \frac{1}{\bar{R_i} \sqrt{k - 1}}
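The formula can be traced step by step in base R. The helper juilland_d() is hypothetical and written only for this sketch; the data are made up.

```r
# Juilland's D from subfrequencies and part sizes (conventional scaling).
# `juilland_d` is a hypothetical helper, not part of the package.
juilland_d <- function(T_i, W_i) {
  k      <- length(T_i)
  R_i    <- T_i / W_i                        # normalized subfrequencies
  R_bar  <- mean(R_i)
  sd_pop <- sqrt(sum((R_i - R_bar)^2) / k)   # divisor k, not k - 1
  1 - sd_pop / (R_bar * sqrt(k - 1))
}

W_i <- c(1000, 2000, 3000, 4000)
juilland_d(c(2, 0, 6, 12), W_i)   # uneven toy distribution
juilland_d(c(1, 2, 3, 4), W_i)    # proportional to part size
juilland_d(c(5, 0, 0, 0), W_i)    # all occurrences in one part
```

Under this formula, subfrequencies proportional to part size give the maximum of 1, and a distribution confined to a single part gives 0.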
D_2 denotes the index proposed by Carroll (1970); the following formula uses conventional scaling:
\frac{\sum_i^k r_i \log_2{\frac{1}{r_i}}}{\log_2{k}}
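A minimal base-R sketch of this normalized entropy follows. The helper carroll_d2() is hypothetical, and it assumes the usual entropy convention that parts with r_i = 0 contribute nothing to the sum; the package may handle zeros differently.

```r
# Carroll's D_2: entropy of the r_i values divided by its maximum, log2(k).
# `carroll_d2` is a hypothetical helper written for this sketch.
carroll_d2 <- function(T_i, W_i) {
  R_i <- T_i / W_i
  r_i <- R_i / sum(R_i)
  nz  <- r_i > 0   # assumed convention: 0 * log2(1/0) counts as 0
  sum(r_i[nz] * log2(1 / r_i[nz])) / log2(length(T_i))
}

W_i <- c(1000, 2000, 3000, 4000)
carroll_d2(c(2, 0, 6, 12), W_i)   # uneven toy distribution
carroll_d2(c(1, 2, 3, 4), W_i)    # proportional to part size
```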
S is the dispersion measure proposed by Rosengren (1971); the formula uses conventional scaling:
\frac{(\sum_i^k \sqrt{w_i T_i})^2}{N}
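A minimal sketch of Rosengren's S, assuming its usual form: the squared sum of \sqrt{w_i T_i} over all parts, divided by N. The helper rosengren_s() is hypothetical and the data are made up.

```r
# Rosengren's S (conventional scaling), assuming the form
# (sum of sqrt(w_i * T_i))^2 / N.
# `rosengren_s` is a hypothetical helper, not part of the package.
rosengren_s <- function(T_i, W_i) {
  w_i <- W_i / sum(W_i)   # proportional part sizes
  sum(sqrt(w_i * T_i))^2 / sum(T_i)
}

W_i <- c(1000, 2000, 3000, 4000)
rosengren_s(c(2, 0, 6, 12), W_i)   # uneven toy distribution
rosengren_s(c(1, 2, 3, 4), W_i)    # proportional to part size
rosengren_s(c(5, 0, 0, 0), W_i)    # all occurrences in the smallest part
```

In this form, the score for an item confined to a single part equals that part's share of the corpus, so the minimum is reached in the smallest part.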
D_P represents Gries's deviation of proportions; the following formula is the modified version suggested by Egbert et al. (2020: 99); it implements conventional scaling (0 = uneven, 1 = even). The notation min\{w_i: t_i > 0\} refers to the smallest w_i value among those corpus parts that include at least one occurrence of the item.
1 - \frac{\sum_i^k |t_i - w_i|}{2} \times \frac{1}{1 - min\{w_i: t_i > 0\}}
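The modified D_P can be sketched in base R as follows. The helper dp_modified() is hypothetical and the data are made up.

```r
# Modified D_P (conventional scaling).
# `dp_modified` is a hypothetical helper written for this sketch.
dp_modified <- function(T_i, W_i) {
  t_i   <- T_i / sum(T_i)      # proportional subfrequencies
  w_i   <- W_i / sum(W_i)      # proportional part sizes
  w_min <- min(w_i[t_i > 0])   # smallest part that contains the item
  1 - (sum(abs(t_i - w_i)) / 2) / (1 - w_min)
}

W_i <- c(1000, 2000, 3000, 4000)
dp_modified(c(2, 0, 6, 12), W_i)   # uneven toy distribution
dp_modified(c(1, 2, 3, 4), W_i)    # proportional to part size
dp_modified(c(5, 0, 0, 0), W_i)    # all occurrences in the smallest part
```

The division by 1 - min\{w_i: t_i > 0\} is what lets the score reach 0 when all occurrences sit in a single part.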
D_A is a measure introduced into dispersion analysis by Burch et al. (2017). The following formula is the one used by Egbert et al. (2020: 98); it relies on normalized frequencies and therefore works with corpus parts of different size. The formula represents conventional scaling (0 = uneven, 1 = even):
1 - \frac{\sum_{i = 1}^{k-1} \sum_{j = i+1}^{k} |R_i - R_j|}{\frac{k(k-1)}{2}} \times \frac{1}{2\frac{\sum_i^k R_i}{k}}
The current function uses a different version of the same formula, which relies on the proportional r_i values instead of the normalized subfrequencies R_i. This version yields the identical result:
1 - \frac{\sum_{i = 1}^{k-1} \sum_{j = i+1}^{k} |r_i - r_j|}{k-1}
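The equivalence of the two versions can be checked numerically. The sketch below uses made-up data, and pair_diff_sum() is a hypothetical helper that sums absolute differences over all pairs i < j.

```r
# Made-up data for one item
T_i <- c(2, 0, 6, 12)
W_i <- c(1000, 2000, 3000, 4000)

k   <- length(T_i)
R_i <- T_i / W_i        # normalized subfrequencies
r_i <- R_i / sum(R_i)   # proportional normalized subfrequencies

# Sum of |x_i - x_j| over all pairs with i < j (hypothetical helper)
pair_diff_sum <- function(x) {
  d <- abs(outer(x, x, "-"))
  sum(d[upper.tri(d)])
}

# Version based on R_i: mean pairwise difference divided by twice the mean
D_A_norm <- 1 - (pair_diff_sum(R_i) / (k * (k - 1) / 2)) / (2 * mean(R_i))

# Version based on r_i
D_A_prop <- 1 - pair_diff_sum(r_i) / (k - 1)

all.equal(D_A_norm, D_A_prop)
```

The two agree because each R_i is r_i times the sum of all R_i, so that sum cancels against the mean in the denominator.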
D_{KL} denotes a measure proposed by Gries (2020, 2021); for standardization, it uses the odds-to-probability transformation (Gries 2024: 90) and represents Gries scaling (0 = even, 1 = uneven):
\frac{\sum_i^k t_i \log_2{\frac{t_i}{w_i}}}{1 + \sum_i^k t_i \log_2{\frac{t_i}{w_i}}}
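A minimal base-R sketch of the formula as printed follows. The helper dkl_std() is hypothetical, and it assumes that parts with t_i = 0 contribute nothing to the sum, which is the usual convention for the Kullback-Leibler divergence.

```r
# D_KL standardized with the odds-to-probability transformation
# (Gries scaling: 0 = even, 1 = uneven).
# `dkl_std` is a hypothetical helper written for this sketch.
dkl_std <- function(T_i, W_i) {
  t_i <- T_i / sum(T_i)
  w_i <- W_i / sum(W_i)
  nz  <- t_i > 0   # assumed convention: terms with t_i = 0 count as 0
  kl  <- sum(t_i[nz] * log2(t_i[nz] / w_i[nz]))
  kl / (1 + kl)
}

W_i <- c(1000, 2000, 3000, 4000)
dkl_std(c(2, 0, 6, 12), W_i)   # uneven toy distribution
dkl_std(c(1, 2, 3, 4), W_i)    # proportional to part size
```

Because the divergence is unbounded above, the transformation keeps the score below 1 for any finite divergence.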
Value
A numeric matrix with one row per item and seven columns
Author(s)
Lukas Soenning
References
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi:10.1558/jrds.33066
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi:10.1075/ijcl.18010.egb
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.), A practical handbook of corpus linguistics, 99–118. New York: Springer. doi:10.1007/978-3-030-46216-1_5
Gries, Stefan Th. 2021. A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics 9(2). 1–33. doi:10.32714/ricl.09.02.02
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Keniston, Hayward. 1920. Common words in Spanish. Hispania 3(2). 85–96. doi:10.2307/331305
Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics 17(1). 147–149. doi:10.1075/ijcl.17.1.08lij
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
See Also
For finer control over the calculation of several dispersion measures:
- disp_R_tdm() for Range
- disp_DP_tdm() for D_P
- disp_DA_tdm() for D_A
- disp_DKL_tdm() for D_{KL}
Examples
disp_tdm(
tdm = biber150_spokenBNC2014[1:20,],
row_partsize = "first",
directionality = "conventional",
freq_adjust = FALSE)