class LogStash::Filters::Grok

Parse arbitrary text and structure it.

Grok is currently the best way in logstash to parse crappy unstructured log data into something structured and queryable.

This tool is perfect for syslog logs, apache and other webserver logs, mysql logs, and in general, any log format that is generally written for humans and not computer consumption.

Logstash ships with about 120 patterns by default. You can find them here: <https://github.com/logstash/logstash/tree/v%VERSION%/patterns>. You can add your own trivially. (See the `patterns_dir` setting.)

If you need help building patterns to match your logs, you will find the <https://grokdebug.herokuapp.com> tool quite useful!

#### Grok Basics

Grok works by combining text patterns into something that matches your logs.

The syntax for a grok pattern is `%{SYNTAX:SEMANTIC}`

The `SYNTAX` is the name of the pattern that will match your text. For example, “3.44” will be matched by the NUMBER pattern and “55.3.244.1” will be matched by the IP pattern. The syntax is how you match.

The `SEMANTIC` is the identifier you give to the piece of text being matched. For example, “3.44” could be the duration of an event, so you could call it simply 'duration'. Further, a string “55.3.244.1” might identify the 'client' making a request.

Optionally you can add a data type conversion to your grok pattern. By default all semantics are saved as strings. If you wish to convert a semantic's data type, for example to change a string to an integer, suffix it with the target data type: `%{NUMBER:num:int}` converts the 'num' semantic from a string to an integer. Currently the only supported conversions are `int` and `float`.
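
As a sketch of what that coercion amounts to, here is the equivalent in plain Ruby (not the plugin's actual code; the `coerce` helper name is invented for illustration):

```ruby
# Illustrative helper (not part of the plugin): coerce a captured string
# the way the :int and :float suffixes do.
def coerce(value, type)
  case type
  when "int"   then value.to_i
  when "float" then value.to_f
  else value # no (or unknown) type suffix: the value stays a string
  end
end

coerce("15824", "int")   # => 15824
coerce("0.043", "float") # => 0.043
```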

#### Example

With that idea of a syntax and semantic, we can pull out useful fields from a sample log like this fictional http request log:

55.3.244.1 GET /index.html 15824 0.043

The pattern for this could be:

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

A more realistic example, let's read these logs from a file:

input {
  file {
    path => "/var/log/http.log"
  }
}
filter {
  grok {
    match => [ "message", "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" ]
  }
}

After the grok filter, the event will have a few extra fields in it:

* client: 55.3.244.1
* method: GET
* request: /index.html
* bytes: 15824
* duration: 0.043

#### Regular Expressions

Grok sits on top of regular expressions, so any regular expressions are valid in grok as well. The regular expression library is Oniguruma, and you can see the full supported regexp syntax [on the Oniguruma site](http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt).

#### Custom Patterns

Sometimes logstash doesn't have a pattern you need. For this, you have a few options.

First, you can use the Oniguruma syntax for 'named capture', which lets you match a piece of text and save it as a field:

(?<field_name>the pattern here)

For example, postfix logs have a 'queue id' that is a 10- or 11-character hexadecimal value. I can capture that easily like this:

(?<queue_id>[0-9A-F]{10,11})
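
In plain Ruby, outside of grok entirely, the same named-capture syntax works directly with `String#match`, which can be handy for testing a capture before putting it in a pattern:

```ruby
# Try the named capture against a sample postfix line (plain Ruby regex,
# same (?<name>...) syntax grok accepts).
line = "Jan  1 06:25:43 mailserver14 postfix/cleanup[21403]: " \
       "BEF25A72965: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>"
m = line.match(/(?<queue_id>[0-9A-F]{10,11})/)
m[:queue_id]  # => "BEF25A72965"
```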

Alternately, you can create a custom patterns file.

For example, doing the postfix queue id example as above:

# in ./patterns/postfix 
POSTFIX_QUEUEID [0-9A-F]{10,11}

Then use the `patterns_dir` setting in this plugin to tell logstash where your custom patterns directory is. Here's a full example with a sample log:

Jan  1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>

filter {
  grok {
    patterns_dir => "./patterns"
    match => [ "message", "%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}" ]
  }
}

The above will match and result in the following fields:

* timestamp: Jan  1 06:25:43
* logsource: mailserver14
* program: postfix/cleanup
* pid: 21403
* queue_id: BEF25A72965
* syslog_message: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>

The `timestamp`, `logsource`, `program`, and `pid` fields come from the SYSLOGBASE pattern which itself is defined by other patterns.

#### Public Class Methods

new(params)
Calls superclass method LogStash::Filters::Base::new

# File lib/logstash/filters/grok.rb, line 224
def initialize(params)
  super(params)
  @match["message"] ||= []
  @match["message"] += @pattern if @pattern # the config 'pattern' value (array)
  # a cache of capture name handler methods.
  @handlers = {}
end

#### Public Instance Methods

filter(event)

# File lib/logstash/filters/grok.rb, line 282
def filter(event)
  return unless filter?(event)

  matched = false

  @logger.debug? and @logger.debug("Running grok filter", :event => event)
  @patterns.each do |field, grok|
    if match(grok, field, event)
      matched = true
      break if @break_on_match
    end
  end # @patterns.each

  if matched
    filter_matched(event)
  else
    # Tag this event if we can't parse it. We can use this later to
    # reparse+reindex logs if we improve the patterns given.
    @tag_on_failure.each do |tag|
      event["tags"] ||= []
      event["tags"] << tag unless event["tags"].include?(tag)
    end
  end

  @logger.debug? and @logger.debug("Event now: ", :event => event)
end
register()

# File lib/logstash/filters/grok.rb, line 233
def register
  require "grok-pure" # rubygem 'jls-grok'

  @patternfiles = []

  # Have @@patterns_path show first. Last-in pattern definitions win; this
  # will let folks redefine built-in patterns at runtime.
  @patterns_dir = @@patterns_path.to_a + @patterns_dir
  @logger.info? and @logger.info("Grok patterns path", :patterns_dir => @patterns_dir)
  @patterns_dir.each do |path|
    # Can't read relative paths from jars, try to normalize away '../'
    while path =~ /file:\/.*\.jar!.*\/\.\.\//
      # replace /foo/bar/../baz => /foo/baz
      path = path.gsub(/[^\/]+\/\.\.\//, "")
      @logger.debug? and @logger.debug("In-jar path to read", :path => path)
    end

    if File.directory?(path)
      path = File.join(path, "*")
    end

    Dir.glob(path).each do |file|
      @logger.info? and @logger.info("Grok loading patterns from file", :path => file)
      @patternfiles << file
    end
  end

  @patterns = Hash.new { |h,k| h[k] = [] }

  @logger.info? and @logger.info("Match data", :match => @match)

  @match.each do |field, patterns|
    patterns = [patterns] if patterns.is_a?(String)

    if !@patterns.include?(field)
      @patterns[field] = Grok::Pile.new
      #@patterns[field].logger = @logger

      add_patterns_from_files(@patternfiles, @patterns[field])
    end
    @logger.info? and @logger.info("Grok compile", :field => field, :patterns => patterns)
    patterns.each do |pattern|
      @logger.debug? and @logger.debug("regexp: #{@type}/#{field}", :pattern => pattern)
      @patterns[field].compile(pattern)
    end
  end # @match.each
end

#### Private Instance Methods

add_patterns_from_file(path, pile)

# File lib/logstash/filters/grok.rb, line 407
def add_patterns_from_file(path, pile)
  # Check if the file path is a jar, if so, we'll have to read it ourselves
  # since libgrok won't know what to do with it.
  if path =~ /file:\/.*\.jar!.*/
    File.new(path).each do |line|
      next if line =~ /^(?:\s*#|\s*$)/
      # In some cases I have seen 'file.each' yield lines with newlines at
      # the end. I don't know if this is a bug or intentional, but we need
      # to chomp it.
      name, pattern = line.chomp.split(/\s+/, 2)
      @logger.debug? and @logger.debug("Adding pattern from file", :name => name,
                                       :pattern => pattern, :path => path)
      pile.add_pattern(name, pattern)
    end
  else
    pile.add_patterns_from_file(path)
  end
end
add_patterns_from_files(paths, pile)

# File lib/logstash/filters/grok.rb, line 402
def add_patterns_from_files(paths, pile)
  paths.each { |path| add_patterns_from_file(path, pile) }
end
compile_capture_handler(capture)

# File lib/logstash/filters/grok.rb, line 349
def compile_capture_handler(capture)
  # SYNTAX:SEMANTIC:TYPE
  syntax, semantic, coerce = capture.split(":")

  # each_capture do |fullname, value|
  #   capture_handlers[fullname].call(value, event)
  # end

  code = []
  code << "# for capture #{capture}"
  code << "lambda do |value, event|"
  #code << "  p :value => value, :event => event"
  if semantic.nil?
    if @named_captures_only 
      # Abort early if we are only keeping named (semantic) captures
      # and this capture has no semantic name.
      code << "  return"
    else
      field = syntax
    end
  else
    field = semantic
  end
  code << "  return if value.nil? || value.empty?" unless @keep_empty_captures
  if coerce
    case coerce
      when "int"; code << "  value = value.to_i"
      when "float"; code << "  value = value.to_f"
    end
  end

  code << "  # field: #{field}"
  if @overwrite.include?(field)
    code << "  event[field] = value"
  else
    code << "  v = event[field]"
    code << "  if v.nil?"
    code << "    event[field] = value"
    code << "  elsif v.is_a?(Array)"
    code << "    event[field] << value"
    code << "  elsif v.is_a?(String)"
    # Promote to array since we aren't overwriting.
    code << "    event[field] = [v, value]"
    code << "  end"
  end
  code << "  return"
  code << "end"

  #puts code
  return eval(code.join("\n"), binding, "<grok capture #{capture}>")
end
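
For reference, the generated lambda for a capture like `NUMBER:bytes:int` behaves roughly like the hand-written equivalent below (a sketch, with `@named_captures_only`, `@keep_empty_captures`, and `@overwrite` at their defaults; the real handler is generated as a string and eval'd):

```ruby
# Hand-written equivalent of the generated handler for "NUMBER:bytes:int"
# (illustrative only, not the plugin's code).
handler = lambda do |value, event|
  return if value.nil? || value.empty?
  value = value.to_i
  v = event["bytes"]
  if v.nil?
    event["bytes"] = value
  elsif v.is_a?(Array)
    event["bytes"] << value
  elsif v.is_a?(String)
    event["bytes"] = [v, value] # promote to array rather than overwrite
  end
end

event = {}
handler.call("15824", event)
event  # => {"bytes" => 15824}
```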
handle(capture, value, event)

# File lib/logstash/filters/grok.rb, line 343
def handle(capture, value, event)
  handler = @handlers[capture] ||= compile_capture_handler(capture)
  return handler.call(value, event)
end
match(grok, field, event)

# File lib/logstash/filters/grok.rb, line 312
def match(grok, field, event)
  input = event[field]
  if input.is_a?(Array)
    success = true
    input.each do |input|
      grok, match = grok.match(input)
      if match
        match.each_capture do |capture, value|
          handle(capture, value, event)
        end
      else
        success = false
      end
    end
    return success
  #elsif input.is_a?(String)
  else
    # Convert anything else to string (number, hash, etc)
    grok, match = grok.match(input.to_s)
    return false if !match

    match.each_capture do |capture, value|
      handle(capture, value, event)
    end
    return true
  end
rescue StandardError => e
  @logger.warn("Grok regexp threw exception", :exception => e.message)
end