Chapter 1. Introduction
RMMSeg is an implementation of the MMSEG Chinese word segmentation algorithm, which is based on two variants of maximum matching. Two algorithms are available:
- The simple algorithm, which uses only forward maximum matching.
- The complex algorithm, which uses three-word chunk maximum matching and three additional rules to resolve ambiguities.
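As a rough illustration of forward maximum matching (the idea behind the simple algorithm), here is a minimal sketch. It is not RMMSeg's actual code: the toy dictionary, the constant names, and the simple_segment method are assumptions made up for this example.

```ruby
require 'set'

# Hypothetical toy dictionary; RMMSeg ships its own, much larger word lists.
TOY_DICT = Set.new(%w[我们 都 喜欢 用 研究 研究生 生命 起源]).freeze
MAX_WORD_LENGTH = 4 # longest candidate to try, in characters

# Segment +text+ by repeatedly taking the longest dictionary word that
# starts at the current position, falling back to a single character.
def simple_segment(text, dict = TOY_DICT, max_len = MAX_WORD_LENGTH)
  chars = text.chars
  words = []
  pos = 0
  while pos < chars.length
    longest = [max_len, chars.length - pos].min
    # Try the longest candidate first; a single character always matches.
    len = longest.downto(1).find { |n| n == 1 || dict.include?(chars[pos, n].join) }
    words << chars[pos, len].join
    pos += len
  end
  words
end

puts simple_segment("研究生命起源").join(" ")
# prints "研究生 命 起源"
```

On this input the greedy choice 研究生 leaves 命 stranded, which is the kind of ambiguity the complex algorithm is meant to resolve.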
For more information about the algorithm, see the following articles:
- http://technology.chtsai.org/mmseg/
- http://pluskid.lifegoo.com/?p=261
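The three-word chunk idea behind the complex algorithm can be sketched as follows. This is only an illustration under stated assumptions, not RMMSeg's implementation: the toy dictionary and all method names are hypothetical, and only three rules are applied (maximum chunk length, largest average word length, smallest variance of word lengths). The frequency-based rule is left out because it needs dictionary frequency data.

```ruby
require 'set'

# Hypothetical toy dictionary, for illustration only.
CHUNK_DICT = Set.new(%w[研究 研究生 生命 起源]).freeze
CHUNK_MAX_WORD_LENGTH = 4

# Lengths of all candidate words starting at +pos+: dictionary words,
# plus the single character as a fallback.
def candidate_lengths(chars, pos, dict, max_len)
  limit = [max_len, chars.length - pos].min
  (1..limit).select { |n| n == 1 || dict.include?(chars[pos, n].join) }
end

# Enumerate chunks of up to three words starting at +pos+.
# Each chunk is represented as an array of word lengths.
def chunks_at(chars, pos, dict, max_len, depth = 3)
  return [[]] if depth.zero? || pos >= chars.length
  candidate_lengths(chars, pos, dict, max_len).flat_map do |n|
    chunks_at(chars, pos + n, dict, max_len, depth - 1).map { |rest| [n] + rest }
  end
end

# Population variance of the word lengths in a chunk.
def variance(lengths)
  mean = lengths.sum.to_f / lengths.size
  lengths.sum { |n| (n - mean)**2 } / lengths.size
end

# Pick the best chunk at each position and emit only its first word.
def complex_segment(text, dict = CHUNK_DICT, max_len = CHUNK_MAX_WORD_LENGTH)
  chars = text.chars
  words = []
  pos = 0
  while pos < chars.length
    # Arrays compare element by element, so this applies the rules in order:
    # total length, then average word length, then smallest variance.
    best = chunks_at(chars, pos, dict, max_len).max_by do |lengths|
      [lengths.sum, lengths.sum.to_f / lengths.size, -variance(lengths)]
    end
    words << chars[pos, best.first].join
    pos += best.first
  end
  words
end

puts complex_segment("研究生命起源").join(" ")
# prints "研究 生命 起源"
```

Here the chunks 研究生/命/起源 and 研究/生命/起源 tie on total and average length, and the variance rule prefers the evenly sized one.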
RMMSeg can be used either as a standalone program or as an analyzer for Ferret.
Chapter 2. Setup
2.1 Requirements
2.2 Installation
2.2.1 Using RubyGems
2.2.2 From Subversion
You can always get the latest source code from the Subversion repository hosted at RubyForge.
Note 1. The latest code might be unstable.
svn checkout http://rmmseg.rubyforge.org/svn/trunk/ rmmseg
Then you can run
rake gem
to build the gem file.
Chapter 3. Usage
3.1 Standalone rmmseg
RMMSeg comes with a script, rmmseg. To see its basic usage, run it with the -h option:
rmmseg -h
It reads from STDIN and prints the result to STDOUT. Here is a real example:
$ echo "我们都喜欢用 Ruby" | rmmseg
我们 都 喜欢 用 Ruby
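Because the script is a plain STDIN-to-STDOUT filter, it can also be driven from Ruby code. The following is a minimal sketch using the standard library's Open3. The method name segment_with_cli is hypothetical, and the sketch assumes the command writes whitespace-separated words, as in the example above.

```ruby
require 'open3'

# Pipe +text+ through a segmenter command and return its tokens.
# Assumes the command reads STDIN and writes whitespace-separated
# words to STDOUT, as the rmmseg script does in the example above.
def segment_with_cli(text, command = "rmmseg")
  output, status = Open3.capture2(command, stdin_data: text)
  raise "#{command} exited with status #{status.exitstatus}" unless status.success?
  output.split
end

# Example call (requires the rmmseg script to be on your PATH):
#   words = segment_with_cli("我们都喜欢用 Ruby")
```

Passing the command as a parameter keeps the helper easy to try out with any other filter program.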
3.2 Analyzer for Ferret
RMMSeg includes an analyzer for Ferret that is ready to use: just require it and pass it to Ferret. Here is a complete example:
#!/usr/bin/env ruby
require 'rubygems'
require 'rmmseg'
require 'rmmseg/ferret'

# Build the analyzer; the block wraps the RMMSeg tokenizer in a lower-case filter.
analyzer = RMMSeg::Ferret::Analyzer.new { |tokenizer|
  Ferret::Analysis::LowerCaseFilter.new(tokenizer)
}

$index = Ferret::Index::Index.new(:analyzer => analyzer)

# Add a few sample documents (Chinese and English) to the index.
$index << {
  :title => "分词",
  :content => "中文分词比较困难,不像英文那样,直接在空格和标点符号的地方断开就可以了。"
}
$index << {
  :title => "RMMSeg",
  :content => "RMMSeg 我近日做的一个 Ruby 中文分词实现,下一步是和 Ferret 进行集成。"
}
$index << {
  :title => "Ruby 1.9",
  :content => "Ruby 1.9.0 已经发布了,1.9 的一个重大改进就是对 Unicode 的支持。"
}
$index << {
  :title => "Ferret",
  :content => <<END
Ferret is a high-performance, full-featured text search engine library
written for Ruby. It is inspired by Apache Lucene Java project. With
the introduction of Ferret, Ruby users now have one of the fastest and
most flexible search libraries available. And it is surprisingly easy
to use.
END
}

# Search the content field for +key+ and print each hit with the
# matching terms wrapped in ANSI color codes.
def highlight_search(key)
  $index.search_each(%Q!content:"#{key}"!) do |id, score|
    puts "*** Document \"#{$index[id][:title]}\" found with a score of #{score}"
    puts "-"*40
    highlights = $index.highlight("content:#{key}", id,
                                  :field => :content,
                                  :pre_tag => "\033[36m",
                                  :post_tag => "\033[m")
    puts "#{highlights}"
    puts ""
  end
end

# Treat every command-line argument as a search keyword.
ARGV.each { |key|
  puts "\033[33mSearching for #{key}...\033[m"
  puts ""
  highlight_search(key)
}

# Local Variables:
# coding: utf-8
# End:
Running it with the following keywords:
$ ruby ferret_example.rb Ruby 中文
will generate the following results:
Searching for Ruby...
*** Document "RMMSeg" found with a score of 0.21875
----------------------------------------
RMMSeg 我近日做的一个 Ruby 中文分词实现,下一步是和 Ferret 进行集成。
*** Document "Ruby 1.9" found with a score of 0.21875
----------------------------------------
Ruby 1.9.0 已经发布了,1.9 的一个重大改进就是对 Unicode 的支持。
*** Document "Ferret" found with a score of 0.176776692271233
----------------------------------------
Ferret is a high-performance, full-featured text search engine library
written for Ruby. It is inspired by Apache Lucene Java project. With
the introduction of Ferret, Ruby users now have one of the fastest and
most flexible search libraries available. And it is surprisingly easy
to use.
Searching for 中文...
*** Document "分词" found with a score of 0.281680464744568
----------------------------------------
中文分词比较困难,不像英文那样,直接在空格和标点符号的地方断开就可以了。
*** Document "RMMSeg" found with a score of 0.281680464744568
----------------------------------------
RMMSeg 我近日做的一个 Ruby 中文分词实现,下一步是和 Ferret 进行集成。
If you run the example in a terminal, the matched terms are highlighted in the results, as shown in Figure 1.
Figure 1. Ferret Example Screenshot
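The coloring in the screenshot comes from the ANSI escape sequences passed as :pre_tag and :post_tag in the example. The sketch below shows the effect on its own, without Ferret. The method wrap_matches and the constant names are hypothetical, and it uses plain substring matching rather than Ferret's term-based highlighter.

```ruby
CYAN  = "\033[36m" # same value as :pre_tag in the example
RESET = "\033[m"   # same value as :post_tag in the example

# Wrap every literal occurrence of +term+ in +text+ with the given tags.
def wrap_matches(text, term, pre_tag = CYAN, post_tag = RESET)
  text.gsub(term) { |match| "#{pre_tag}#{match}#{post_tag}" }
end

# In an ANSI-capable terminal, "中文" is shown in cyan.
puts wrap_matches("RMMSeg 我近日做的一个 Ruby 中文分词实现", "中文")
```

Swapping the tags for markup such as <b> and </b> would give HTML-style highlighting instead.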

3.3 Customization
RMMSeg can be customized through RMMSeg::Config. For example, to use your own dictionaries, set them before starting segmentation:
RMMSeg::Config.dictionaries = [["dict1.dic", true],   # with frequency info
                               ["dict2.dic", false],  # without
                               ["dict3.dic", false]]
RMMSeg::Config.max_word_length = 6
Or, to use the simple algorithm for faster (but less accurate) segmentation:
RMMSeg::Config.algorithm = :simple
For more information on customization, please refer to the RDoc of RMMSeg::Config.
Chapter 4. Resources
- Project Home: The Project page at RubyForge.
- RDoc of RMMSeg: The auto-generated RDoc of RMMSeg.
- A Screencast: Demo of Ferret, RMMSeg, and acts_as_ferret.
- Implementation Details: My blog post about the implementation details of RMMSeg (in Chinese).
- Author’s Email: Contact me if you have any problems.