stri_locate_ith {tinycodet}    R Documentation

Locate i^{th} Pattern Occurrence or Text Boundary

Description

The stri_locate_ith() function locates the i^{th} occurrence of a pattern in each string of some character vector.

The stri_locate_ith_boundaries() function locates the i^{th} text boundary (like character, word, line, or sentence boundaries).

Usage

stri_locate_ith(str, i, ..., regex, fixed, coll, charclass)

stri_locate_ith_boundaries(str, i, ..., type = "character")

Arguments

str

a string or character vector.

i

a single number, or a numeric vector of the same length as str.
Positive numbers count from the left; negative numbers count from the right. For example:

  • stri_locate_ith(str, i=1, ...)
    gives the position (range) of the first occurrence of a pattern.

  • stri_locate_ith(str, i=-1, ...)
    gives the position (range) of the last occurrence of a pattern.

  • stri_locate_ith(str, i=2, ...)
    gives the position (range) of the second occurrence of a pattern.

  • stri_locate_ith(str, i=-2, ...)
    gives the position (range) of the second-last occurrence of a pattern.

If abs(i) is larger than the number of instances, the first (if i < 0) or last (if i > 0) instance will be given.
For example: suppose a string has 3 instances of some pattern;
then if i >= 3 the third instance will be located,
and if i <= -3 the first instance will be located.
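
The clamping rule above can be illustrated with a minimal sketch built on stringi::stri_locate_all_regex(). The helper locate_ith_sketch() is hypothetical and is not tinycodet's implementation of stri_locate_ith(); it only mirrors the indexing behaviour described here, for non-zero i.

```r
library(stringi)

# Hypothetical helper (not part of tinycodet): pick the i-th match per string,
# clamping i to the number of matches that were found.
locate_ith_sketch <- function(str, i, pattern) {
  all_loc <- stri_locate_all_regex(str, pattern, omit_no_match = TRUE)
  i <- rep_len(i, length(str))
  out <- matrix(NA_integer_, nrow = length(str), ncol = 2,
                dimnames = list(NULL, c("start", "end")))
  for (k in seq_along(str)) {
    m <- all_loc[[k]]
    n <- nrow(m)
    if (n == 0L) next  # no match: keep NA, NA
    # positive i counts from the left, negative i from the right
    idx <- if (i[k] > 0) min(i[k], n) else max(n + i[k] + 1L, 1L)
    out[k, ] <- m[idx, ]
  }
  out
}

# "a1b2c3" has 3 digits, at positions 2, 4, and 6
x <- rep("a1b2c3", 4)
res <- locate_ith_sketch(x, c(1, -1, 5, -5), "[0-9]")
print(res)
# start positions: 2, 6, 6, 2 (i = 5 is clamped to the last match,
# i = -5 to the first match)
```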

...

more arguments to be supplied to stri_locate or stri_locate_all_boundaries.
Do not supply the arguments omit_no_match, get_length, or pattern, as they are already specified internally. Supplying these arguments anyway will result in an error.

regex, fixed, coll, charclass

a character vector of search patterns, as in stri_locate.

type

single string; either the break iterator type, one of character, line_break, sentence, word, or a custom set of ICU break iteration rules. Defaults to "character".
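
What counts as the i^{th} boundary segment depends on the break iterator. The sketch below calls stringi::stri_locate_all_boundaries() directly to show that, with type = "word", the whitespace between words is a segment of its own unless skip_word_none = TRUE is set. That stri_locate_ith_boundaries() forwards skip_word_none through ... is an assumption based on the description of the ... argument above.

```r
library(stringi)

x <- "good morning"

# Default word iteration: the space between the words is its own segment.
seg_all <- stri_locate_all_boundaries(x, type = "word")[[1]]
print(seg_all)
# segments (start, end): (1, 4), (5, 5), (6, 12)

# skip_word_none = TRUE drops segments that are not words.
seg_words <- stri_locate_all_boundaries(x, type = "word", skip_word_none = TRUE)[[1]]
print(seg_words)
# segments (start, end): (1, 4), (6, 12)
```

With the default settings, the third segment of "good morning" is therefore "morning", not a third word.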

Details

Special note regarding charclass
The stri_locate_ith() function is based on stri_locate_all. This generally gives results consistent with stri_locate_first or stri_locate_last; the exception is when a charclass pattern is used.
Whereas stri_locate_first and stri_locate_last give the location of the first or last single character matching the charclass (respectively), stri_locate_all gives the start and end of each run of consecutive matching characters.
In this respect stri_locate_ith() follows stri_locate_all: it gives the i^{th} run of consecutive matching characters.
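
The difference can be seen with stringi alone. A small sketch, using an arbitrary example string and the charclass "[0-9]":

```r
library(stringi)

x <- "abc123def456"

# First single character in the class: position 4 only.
first_loc <- stri_locate_first_charclass(x, "[0-9]")
print(first_loc)

# Runs of consecutive matching characters: 4 to 6 and 10 to 12.
all_loc <- stri_locate_all_charclass(x, "[0-9]")[[1]]
print(all_loc)
```

Following the note above, stri_locate_ith(x, 1, charclass = "[0-9]") would be expected to report the first run (4 to 6) rather than the single position 4.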

Value

The stri_locate_ith() function returns an integer matrix with two columns, giving the start and end positions of the i^{th} matches. A row contains two NAs if no match is found, and also if the corresponding element of str is NA.
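
A short sketch of the return shape, assuming tinycodet is installed and attached. The input vector is an arbitrary illustration, and the comments restate the behaviour documented above rather than captured output.

```r
library(tinycodet)

x <- c("abc", NA, "xyz")
loc <- stri_locate_ith(x, i = 1, regex = "b")

dim(loc)         # one row per element of x, two columns
is.na(loc[, 1])  # rows for the NA string and for the string without a match
```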

See Also

tinycodet_strings()

Examples


#############################################################################

# practical example with regex & fixed ====

# input character vector:
x <- c(paste0(letters[1:13], collapse=""), paste0(letters[14:26], collapse=""))
print(x)

# report ith (second and second-last) vowel locations:
p <- rep("A|E|I|O|U", 2) # vowels
loc <- stri_locate_ith(x, c(2, -2), regex=p, case_insensitive=TRUE)
print(loc)

# extract ith vowels:
extr <- stringi::stri_sub(x, from=loc)
print(extr)

# replace ith vowels with numbers:
repl <- stringi::stri_replace_all(
  extr, fixed = c("a", "e", "i", "o", "u"), replacement = 1:5, vectorize_all = FALSE
)
x <- stringi::stri_sub_replace(x, loc, replacement=repl)
print(x)


#############################################################################

# practical example with boundaries ====

# input character vector:
x <- c("good morning and good night",
       "hello ladies and gentlemen")
print(x)

# report ith word locations:
loc <- stri_locate_ith_boundaries(x, c(-3, 3), type = "word")
print(loc)

# extract ith words:
extr <- stringi::stri_sub(x, from=loc)
print(extr)

# transform and replace words:
tf <- chartr(old = "a-zA-Z", new = "A-Za-z", x = extr) # swap upper and lower case
x <- stringi::stri_sub_replace(x, loc, replacement=tf)
print(x)


#############################################################################

# find pattern ====

extr <- stringi::stri_sub(x, from=loc)
repl <- chartr(old = "a-zA-Z", new = "A-Za-z", x = extr) # swap upper and lower case
stringi::stri_sub_replace(x, loc, replacement=repl)

# simple pattern ====

x <- rep(paste0(1:10, collapse=""), 10)
print(x)
out <- stri_locate_ith(x, 1:10, regex = as.character(1:10))
cbind(1:10, out)


x <- c(paste0(letters[1:13], collapse=""), paste0(letters[14:26], collapse=""))
print(x)
p <- rep("a|e|i|o|u",2)
out <- stri_locate_ith(x, c(-1, 1), regex=p)
print(out)
substr(x, out[,1], out[,2])


#############################################################################

# ignore case pattern ====


x <- c(paste0(letters[1:13], collapse=""), paste0(letters[14:26], collapse=""))
print(x)
p <- rep("A|E|I|O|U", 2)
out <- stri_locate_ith(x, c(1, -1), regex=p, case_insensitive=TRUE)
substr(x, out[,1], out[,2])


#############################################################################

# multi-character pattern ====

x <- c(paste0(letters[1:13], collapse=""), paste0(letters[14:26], collapse=""))
print(x)
# multi-character pattern:
p <- rep("AB", 2)
out <- stri_locate_ith(x, c(1, -1), regex=p, case_insensitive=TRUE)
print(out)
substr(x, out[,1], out[,2])



#############################################################################

# Replacement transformation using stringi ====

x <- c("hello world", "goodbye world")
loc <- stri_locate_ith(x, c(1, -1), regex="a|e|i|o|u")
extr <- stringi::stri_sub(x, from=loc)
repl <- chartr(old = "a-zA-Z", new = "A-Za-z", x = extr) # swap upper and lower case
stringi::stri_sub_replace(x, loc, replacement=repl)


#############################################################################

# Boundaries ====

test <- c(
  paste0("The\u00a0above-mentioned    features are very useful. ",
         "Spam, spam, eggs, bacon, and spam. 123 456 789"),
  "good morning, good evening, and good night"
)
loc <- stri_locate_ith_boundaries(test, i = c(1, -1), type = "word")
stringi::stri_sub(test, from=loc)


[Package tinycodet version 0.3.0 Index]