find_max_disp {tlda} | R Documentation |
Find the maximally dispersed distribution of an item across corpus parts
Description
This function returns the (hypothetical) distribution of subfrequencies that represents the highest possible level of dispersion for a given item across a particular set of corpus parts. It requires a vector of subfrequencies and a vector of corpus part sizes. This distribution is required for the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208) to obtain frequency-adjusted dispersion scores.
Usage
find_max_disp(subfreq, partsize, freq_adjust_method = "even")
Arguments
subfreq |
A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
partsize |
A numeric vector specifying the size of the corpus parts |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
Details
This function creates a hypothetical distribution of the total number of occurrences of the item (i.e. the sum of its subfrequencies) across corpus parts. The argument freq_adjust_method allows the user to choose which of two distributional features defines the highest possible level of dispersion: pervasiveness ("pervasive") or evenness ("even"). With "pervasive", the occurrences are spread as broadly across corpus parts as possible; with "even", they are allocated to corpus parts in proportion to their size. The choice between these methods is particularly relevant if corpus parts differ considerably in size. For details and explanations, see vignette("frequency-adjustment"). Since the dispersion of an item that occurs only once in the corpus (a hapax) cannot be sensibly measured or manipulated, such items are disregarded; the function returns their observed subfrequencies.
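The two allocation strategies can be illustrated with a minimal base-R sketch. It is not the package's implementation: the helper names sketch_even, sketch_pervasive and sketch_max_disp are hypothetical, and the rounding rule (largest remainder), the preference for larger parts, and the proportional handling of surplus occurrences are assumptions made for illustration only.

```r
# Hypothetical helpers for illustration; NOT the tlda implementation.

# "even": allocate `total` occurrences in proportion to part size.
# Assumption: integer counts are obtained by largest-remainder rounding.
sketch_even <- function(total, partsize) {
  ideal <- total * partsize / sum(partsize)  # fractional shares
  alloc <- floor(ideal)
  leftover <- total - sum(alloc)             # occurrences still to place
  if (leftover > 0) {
    idx <- order(ideal - alloc, decreasing = TRUE)[seq_len(leftover)]
    alloc[idx] <- alloc[idx] + 1
  }
  as.integer(alloc)
}

# "pervasive": cover as many parts as possible with one occurrence each.
# Assumptions: larger parts are filled first, and any occurrences left
# after every part is covered are spread in proportion to part size.
sketch_pervasive <- function(total, partsize) {
  k <- length(partsize)
  alloc <- integer(k)
  n_single <- min(total, k)
  idx <- order(partsize, decreasing = TRUE)[seq_len(n_single)]
  alloc[idx] <- 1L
  alloc + sketch_even(total - n_single, partsize)
}

# Wrapper mirroring the documented behaviour for hapaxes:
# an item occurring only once keeps its observed subfrequencies.
sketch_max_disp <- function(subfreq, partsize,
                            method = c("even", "pervasive")) {
  method <- match.arg(method)
  total <- sum(subfreq)
  if (total == 1) return(as.integer(subfreq))
  if (method == "even") {
    sketch_even(total, partsize)
  } else {
    sketch_pervasive(total, partsize)
  }
}

# Output of the sketch (not necessarily that of find_max_disp):
sketch_max_disp(c(0, 0, 1, 2, 5), c(100, 100, 100, 500, 1000), "pervasive")
# 1 1 1 2 3
```

In this sketch, the eight occurrences first cover all five parts, and the remaining three go to the two largest parts. Under "even", the small parts may instead receive no occurrences at all, which is why the choice of method matters most when part sizes differ considerably.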
Value
An integer vector the same length as partsize
Author(s)
Lukas Soenning
References
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Examples
find_max_disp(
  subfreq = c(0, 0, 1, 2, 5),
  partsize = c(100, 100, 100, 500, 1000),
  freq_adjust_method = "pervasive")