disp {tlda}    R Documentation

Calculate parts-based dispersion measures

Description

This function calculates a number of parts-based dispersion measures and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.

Usage

disp(
  subfreq,
  partsize,
  directionality = "conventional",
  freq_adjust = FALSE,
  freq_adjust_method = "even",
  unit_interval = TRUE,
  digits = NULL,
  verbose = TRUE,
  print_score = TRUE,
  suppress_warning = FALSE
)

Arguments

subfreq

A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part

partsize

A numeric vector specifying the size of the corpus parts

directionality

Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries"

freq_adjust

Logical. Whether dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE

freq_adjust_method

Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive"

unit_interval

Logical. Whether frequency-adjusted scores that fall outside the unit interval should be set to the nearest limit (0 or 1); default is TRUE

digits

Rounding: Integer value specifying the number of decimal places to retain (default: no rounding)

verbose

Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE

print_score

Logical. Whether the dispersion score should be printed to the console; default is TRUE

suppress_warning

Logical. Whether warning messages should be suppressed; default is FALSE

Details

This function calculates dispersion measures based on two vectors: a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens).

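The following measures are computed, listed in chronological order (see details below):

R_{rel}: relative range (Keniston 1920)
D: Juilland's D (Juilland & Chang-Rodríguez 1964)
D_2: Carroll's D_2 (Carroll 1970)
S: Rosengren's S (Rosengren 1971)
D_P: Gries's deviation of proportions (Gries 2008; modification: Egbert et al. 2020)
D_A: the measure introduced by Burch et al. (2017)
D_{KL}: the measure based on the Kullback-Leibler divergence (Gries 2020, 2021)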

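In the formulas given below, the following notation is used:

k: the number of corpus parts
T_i: the absolute subfrequency in part i
t_i: a proportional quantity; the subfrequency in part i divided by the total number of occurrences of the item in the corpus
W_i: the absolute size of corpus part i
w_i: a proportional quantity; the size of corpus part i divided by the size of the corpus
R_i: the normalized subfrequency in part i, i.e. the subfrequency divided by the size of the corpus part
r_i: a proportional quantity; the normalized subfrequency in part i divided by the sum of all normalized subfrequencies
N: the corpus frequency, i.e. the total number of occurrences of the item in the corpus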
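As an illustration, the quantities that appear in the formulas can be derived from the two input vectors in base R. This is only a sketch: the variable names are chosen here for readability, the readings of the symbols are inferred from how the formulas use them, and the data are the values from the Examples section.

```r
# Example data (same values as in the Examples section)
subfreq  <- c(0, 0, 1, 2, 5)     # T_i: occurrences of the item per corpus part
partsize <- rep(1000, 5)         # W_i: word tokens per corpus part

k   <- length(subfreq)           # number of corpus parts
N   <- sum(subfreq)              # corpus frequency of the item
t_i <- subfreq / N               # share of the item's occurrences found in part i
w_i <- partsize / sum(partsize)  # share of the corpus made up by part i
R_i <- subfreq / partsize        # normalized subfrequency in part i
r_i <- R_i / sum(R_i)            # R_i rescaled so that the values sum to 1
```

With equally sized parts, as here, t_i and r_i coincide; they differ as soon as the part sizes differ.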

Note that the formulas cited below differ in their scaling, i.e. whether 1 reflects an even or an uneven distribution. In the current function, this behavior is overridden by the argument directionality. The specific scaling used in the formulas below is therefore irrelevant.
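A minimal sketch of what switching the scaling amounts to, assuming that the two directions are mirror images of each other on the unit interval. The helper flip_scaling() is hypothetical and not part of the package, and the score is an illustrative value.

```r
# Hypothetical helper: move a score from one scaling to the other,
# assuming both scalings live on the unit interval
flip_scaling <- function(score) 1 - score

dp_conventional <- 0.40625                  # conventional: 0 = uneven, 1 = even
dp_gries <- flip_scaling(dp_conventional)   # Gries: 0 = even, 1 = uneven
dp_gries                                    # 0.59375
```

Applying the helper twice returns the original score, which is why the scaling of any individual formula does not matter for the result.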

R_{rel} refers to the relative range, i.e. the proportion of corpus parts containing at least one occurrence of the item.
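In base R, this proportion can be sketched in one line (the variable name Rrel is illustrative; the data are taken from the Examples section):

```r
subfreq <- c(0, 0, 1, 2, 5)

# Proportion of corpus parts with at least one occurrence of the item
Rrel <- mean(subfreq > 0)
Rrel   # 0.6 (3 of the 5 parts contain the item)
```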

D denotes Juilland's D and is calculated as follows (this formula uses conventional scaling); \bar{R_i} refers to the average over the normalized subfrequencies:

1 - \sqrt{\frac{\sum_{i = 1}^k (R_i - \bar{R_i})^2}{k}} \times \frac{1}{\bar{R_i} \sqrt{k - 1}}
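The formula can be transcribed into base R as follows. This is a sketch for illustration rather than the package's internal code, and sketch_D() is a hypothetical name.

```r
# Hypothetical helper: Juilland's D with conventional scaling
sketch_D <- function(subfreq, partsize) {
  k     <- length(subfreq)
  R_i   <- subfreq / partsize              # normalized subfrequencies
  R_bar <- mean(R_i)                       # their average
  sd_R  <- sqrt(sum((R_i - R_bar)^2) / k)  # standard deviation with divisor k
  1 - sd_R / (R_bar * sqrt(k - 1))
}

D_example <- sketch_D(c(0, 0, 1, 2, 5), rep(1000, 5))
D_even    <- sketch_D(c(2, 2, 2, 2, 2), rep(1000, 5))  # perfectly even: 1
D_single  <- sketch_D(c(8, 0, 0, 0, 0), rep(1000, 5))  # all in one part: 0
```

Note that the standard deviation uses the divisor k, so R's sd() function, which divides by k - 1, would not reproduce the formula directly.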

D_2 denotes the index proposed by Carroll (1970); the following formula uses conventional scaling:

\frac{\sum_i^k r_i \log_2{\frac{1}{r_i}}}{\log_2{k}}
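A base R sketch of this formula is given below; sketch_D2() is a hypothetical name. Parts in which the item does not occur are skipped, on the assumption that they contribute nothing to the sum (0 times log2(1/0) would otherwise evaluate to NaN in R).

```r
# Hypothetical helper: Carroll's D_2 with conventional scaling
sketch_D2 <- function(subfreq, partsize) {
  k   <- length(subfreq)
  R_i <- subfreq / partsize
  r_i <- R_i / sum(R_i)
  pos <- r_i > 0                  # drop empty parts to avoid 0 * Inf
  sum(r_i[pos] * log2(1 / r_i[pos])) / log2(k)
}

D2_example <- sketch_D2(c(0, 0, 1, 2, 5), rep(1000, 5))
D2_even    <- sketch_D2(c(2, 2, 2, 2, 2), rep(1000, 5))  # perfectly even: 1
D2_single  <- sketch_D2(c(8, 0, 0, 0, 0), rep(1000, 5))  # all in one part: 0
```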

S is the dispersion measure proposed by Rosengren (1971); the formula uses conventional scaling:

\frac{(\sum_i^k \sqrt{w_i T_i})^2}{N}
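The sketch below computes S in base R as the squared sum of sqrt(w_i T_i), divided by N. That reading is an assumption, adopted because it yields 1 for a perfectly even distribution; sketch_S() is a hypothetical name.

```r
# Hypothetical helper: Rosengren's S with conventional scaling
sketch_S <- function(subfreq, partsize) {
  N   <- sum(subfreq)                  # corpus frequency
  w_i <- partsize / sum(partsize)      # relative part sizes
  sum(sqrt(w_i * subfreq))^2 / N
}

S_example <- sketch_S(c(0, 0, 1, 2, 5), rep(1000, 5))
S_even    <- sketch_S(c(2, 2, 2, 2, 2), rep(1000, 5))  # perfectly even: 1
```

Unlike the other measures, the lower limit of S is not 0: if all occurrences fall into a single part, S equals the relative size of that part.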

D_P represents Gries's deviation of proportions; the following formula is the modified version suggested by Egbert et al. (2020: 99); it implements conventional scaling (0 = uneven, 1 = even) and the notation min\{w_i: t_i > 0\} refers to the smallest w_i value among those corpus parts that include at least one occurrence of the item.

1 - \frac{\sum_i^k |t_i - w_i|}{2} \times \frac{1}{1 - min\{w_i: t_i > 0\}}
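A base R transcription of the modified formula might look as follows; sketch_DP() is a hypothetical name and the code is meant as an illustration, not as the package's implementation.

```r
# Hypothetical helper: D_P (Egbert et al. modification), conventional scaling
sketch_DP <- function(subfreq, partsize) {
  t_i   <- subfreq / sum(subfreq)      # share of occurrences per part
  w_i   <- partsize / sum(partsize)    # share of corpus size per part
  w_min <- min(w_i[t_i > 0])           # smallest part that contains the item
  1 - (sum(abs(t_i - w_i)) / 2) / (1 - w_min)
}

DP_example <- sketch_DP(c(0, 0, 1, 2, 5), rep(1000, 5))
DP_example   # 0.40625
```

For the example data, the absolute differences sum to 0.95 and the smallest relevant w_i is 0.2, so the score is 1 - 0.475 / 0.8 = 0.40625.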

D_A is a measure introduced into dispersion analysis by Burch et al. (2017). The following formula is the one used by Egbert et al. (2020: 98); it relies on normalized frequencies and therefore works with corpus parts of different size. The formula represents conventional scaling (0 = uneven, 1 = even):

1 - \frac{\sum_{i = 1}^{k-1} \sum_{j = i+1}^{k} |R_i - R_j|}{\frac{k(k-1)}{2}} \times \frac{1}{2\frac{\sum_i^k R_i}{k}}

The current function uses a different version of the same formula, which relies on the proportional r_i values instead of the normalized subfrequencies R_i. This version yields the identical result:

1 - \frac{\sum_{i = 1}^{k-1} \sum_{j = i+1}^{k} |r_i - r_j|}{k-1}
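The equivalence of the two versions can be checked numerically. The following base R sketch computes both for the data from the Examples section; pair_sum() is a hypothetical helper, not a package function.

```r
subfreq  <- c(0, 0, 1, 2, 5)
partsize <- rep(1000, 5)

k   <- length(subfreq)
R_i <- subfreq / partsize        # normalized subfrequencies
r_i <- R_i / sum(R_i)            # proportional version

# Hypothetical helper: sum of |x_i - x_j| over all pairs with i < j
pair_sum <- function(x) {
  d <- abs(outer(x, x, "-"))     # k x k matrix of absolute differences
  sum(d[upper.tri(d)])           # keep each pair once
}

DA_normalized   <- 1 - (pair_sum(R_i) / (k * (k - 1) / 2)) / (2 * mean(R_i))
DA_proportional <- 1 - pair_sum(r_i) / (k - 1)

all.equal(DA_normalized, DA_proportional)   # TRUE
```

Both versions give 0.25 here. The second one is shorter because the mean of the r_i values is always 1/k, which cancels against the number of pairs.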

D_{KL} refers to a measure proposed by Gries (2020, 2021); for standardization, it uses the odds-to-probability transformation (Gries 2024: 90) and represents Gries scaling (0 = even, 1 = uneven):

\frac{\sum_i^k t_i \log_2{\frac{t_i}{w_i}}}{1 + \sum_i^k t_i \log_2{\frac{t_i}{w_i}}}
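A base R sketch of this computation is shown below; sketch_DKL() is a hypothetical name. Parts without any occurrence are skipped, on the assumption that they add 0 to the sum (in R, 0 times log2(0) would evaluate to NaN).

```r
# Hypothetical helper: D_KL with Gries scaling (0 = even, 1 = uneven)
sketch_DKL <- function(subfreq, partsize) {
  t_i <- subfreq / sum(subfreq)
  w_i <- partsize / sum(partsize)
  pos <- t_i > 0                               # drop empty parts
  kl  <- sum(t_i[pos] * log2(t_i[pos] / w_i[pos]))  # Kullback-Leibler divergence
  kl / (1 + kl)                                # odds-to-probability transformation
}

DKL_example <- sketch_DKL(c(0, 0, 1, 2, 5), rep(1000, 5))
DKL_even    <- sketch_DKL(c(2, 2, 2, 2, 2), rep(1000, 5))  # perfectly even: 0
```

The transformation maps the divergence, which has no upper bound, onto the interval from 0 to 1, so the score approaches but never reaches 1.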

Value

A numeric vector of seven dispersion scores

Author(s)

Lukas Soenning

References

Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi:10.1558/jrds.33066

Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x

Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi:10.1075/ijcl.18010.egb

Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri

Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.), A practical handbook of corpus linguistics, 99–118. New York: Springer. doi:10.1007/978-3-030-46216-1_5

Gries, Stefan Th. 2021. A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics 9(2). 1–33. doi:10.32714/ricl.09.02.02

Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri

Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115

Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467

Keniston, Hayward. 1920. Common words in Spanish. Hispania 3(2). 85–96. doi:10.2307/331305

Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics 17(1). 147–149. doi:10.1075/ijcl.17.1.08lij

Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.

See Also

For finer control over the calculation of several dispersion measures:

Examples

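# Compute all dispersion measures for an item that occurs
# 0, 0, 1, 2 and 5 times in five corpus parts of 1,000 words each
disp(
  subfreq = c(0, 0, 1, 2, 5),
  partsize = rep(1000, 5),
  directionality = "conventional",
  freq_adjust = FALSE)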


[Package tlda version 0.1.0 Index]