module RMMSeg::Algorithm
An algorithm can segment a piece of text into an array of words. This module is the common operations shared by SimpleAlgorithm
and ComplexAlgorithm
.
Constants
- NONWORD_CHAR_RE
Determine whether a character can be part of a basic latin word.
Public Class Methods
new(text, token=Token)
click to toggle source
Initialize a new instance of Algorithm
, the text
will then be segmented by this instance. token
is the class which will be used to construct the result token.
# File lib/rmmseg/algorithm.rb, line 15 def initialize(text, token=Token) @text = text @chars = text.each_char @index = 0 @byte_index = 0 @token = token end
Public Instance Methods
basic_latin?(char)
click to toggle source
Determine whether a character is a basic latin character.
# File lib/rmmseg/algorithm.rb, line 127 def basic_latin?(char) char.length == 1 end
find_match_words(index)
click to toggle source
Find all words occuring in the dictionary starting from index
. The maximum word length is determined by Config.max_word_length
.
# File lib/rmmseg/algorithm.rb, line 89 def find_match_words(index) for i, w in @match_cache if i == index return w end end dic = Dictionary.instance str = String.new strlen = 0 words = Array.new i = index while i < @chars.length && !basic_latin?(@chars[i]) && strlen < Config.max_word_length str << @chars[i] strlen += 1 if dic.has_word?(str) words << dic.get_word(str) end i += 1 end if words.empty? words << Word.new(@chars[index], Word::TYPES[:unrecognized]) end @match_cache[@match_cache_idx] = [index, words] @match_cache_idx += 1 @match_cache_idx = 0 if @match_cache_idx == MATCH_CACHE_MAX_LENGTH words end
get_basic_latin_word()
click to toggle source
Skip whitespaces and punctuation to extract a basic latin word.
# File lib/rmmseg/algorithm.rb, line 56 def get_basic_latin_word start_pos = nil end_pos = nil i = @index while i < @chars.length && basic_latin?(@chars[i]) && nonword_char?(@chars[i]) i += 1 end start_pos = @byte_index + i - @index while i < @chars.length && basic_latin?(@chars[i]) break if nonword_char?(@chars[i]) i += 1 end end_pos = @byte_index + i - @index while i < @chars.length && basic_latin?(@chars[i]) && nonword_char?(@chars[i]) i += 1 end @byte_index += i - @index @index = i return @token.new(@text[start_pos...end_pos], start_pos, end_pos) end
next_token()
click to toggle source
Get the next Token
recognized.
# File lib/rmmseg/algorithm.rb, line 24 def next_token return nil if @index >= @chars.length if basic_latin?(@chars[@index]) token = get_basic_latin_word else token = get_cjk_word end if token.start == token.end # empty return next_token else return token end end
nonword_char?(char)
click to toggle source
# File lib/rmmseg/algorithm.rb, line 134 def nonword_char?(char) NONWORD_CHAR_RE =~ char end
segment()
click to toggle source
Segment the string in text
into an array of words.
# File lib/rmmseg/algorithm.rb, line 42 def segment words = Array.new token = next_token until token.nil? words << token.text token = next_token end words end