class Cacofonix::Normaliser
A standalone class that normalises ONIX files into a standardised form. If you accept ONIX files from a wide range of suppliers, you're guaranteed to receive all sorts of dialects.
This will create a new file that:
- is UTF-8 encoded
- uses reference tags, not short tags
- has no named entities (ndash, etc.) other than &amp;, &lt; and &gt;
Usage:
Cacofonix::Normaliser.process("oldfile.xml", "newfile.xml")
Dependencies:
At this stage the class depends on several external apps, all commonly available on *nix systems: xsltproc, isutf8, iconv and sed. Note that the constructor listed under Public Class Methods only checks for xsltproc and tr.
Public Class Methods
NB: The newfile argument is deprecated.
# File lib/cacofonix/utils/normaliser.rb, line 41
def initialize(oldfile, newfile = nil)
  raise ArgumentError, "#{oldfile} does not exist" unless File.file?(oldfile)
  raise "xsltproc app not found" unless app_available?("xsltproc")
  raise "tr app not found" unless app_available?("tr")
  @oldfile = oldfile
  @newfile = newfile
  @curfile = next_tempfile
  FileUtils.cp(@oldfile, @curfile)
  @head = File.open(@oldfile, "r") { |f| f.read(1024) }
end
Normalises oldfile and saves the result as newfile. oldfile is left untouched.
# File lib/cacofonix/utils/normaliser.rb, line 34
def process(oldfile, newfile)
  self.new(oldfile).normalise_to_path(newfile)
end
Public Instance Methods
Checks whether the specified app is available on the system.
# File lib/cacofonix/utils/normaliser.rb, line 87
def app_available?(app)
  `which #{app}`.strip == "" ? false : true
end
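The check above shells out to `which`, so it needs that command to be present itself. As a minimal sketch of the same idea in plain Ruby, the helper below walks the PATH entries directly. The name `executable_on_path?` is hypothetical and not part of Cacofonix.

```ruby
# Hypothetical helper, not part of Cacofonix: look for an executable on PATH
# without spawning a `which` subprocess.
def executable_on_path?(app, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, app)
    # A match must be a regular file that the current user may execute.
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Report the apps that the constructor requires but that cannot be found.
missing = %w[xsltproc tr].reject { |app| executable_on_path?(app) }
warn "missing external apps: #{missing.join(', ')}" unless missing.empty?
```

Passing the search path as a second argument keeps the helper easy to exercise against a scratch directory.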
Generates a temp filename.
# File lib/cacofonix/utils/normaliser.rb, line 93
def next_tempfile
  p = nil
  Tempfile.open("onix") do |tf|
    p = tf.path
    tf.close!
  end
  p
end
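Because `close!` deletes the tempfile, the method above returns a path to a file that no longer exists, and another process could in principle claim that name before it is reused. One possible alternative, sketched here under the assumption that a private scratch directory is acceptable, hands out paths inside a directory created with `Dir.mktmpdir`. The `ScratchPaths` class is hypothetical and not part of Cacofonix.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical alternative to next_tempfile: hand out paths inside a private
# scratch directory, so names cannot collide with other processes.
class ScratchPaths
  def initialize(prefix = "onix")
    @dir = Dir.mktmpdir(prefix)
    @count = 0
  end

  # Returns a path that does not exist yet and is unique within the directory.
  def next_path
    @count += 1
    File.join(@dir, "step-#{@count}.xml")
  end

  # Removes the scratch directory and every intermediate file in it.
  def cleanup
    FileUtils.remove_entry(@dir)
  end
end
```

A caller would ask for one path per processing step and call `cleanup` once the final result has been copied elsewhere.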
# File lib/cacofonix/utils/normaliser.rb, line 58
def normalise_to_path(newfile)
  raise ArgumentError, "#{newfile} already exists" if File.file?(newfile)
  @curfile = normalise_to_tempfile
  FileUtils.cp(@curfile, newfile)
end
Processes oldfile and puts the normalised result in a tempfile, returning the path to that tempfile.
# File lib/cacofonix/utils/normaliser.rb, line 67
def normalise_to_tempfile
  src = @curfile

  # remove short tags
  if @head.include?("ONIXmessage")
    dest = next_tempfile
    to_reference_tags(src, dest)
    src = dest
  end

  # remove control chars
  dest = next_tempfile
  remove_control_chars(src, dest)
  dest
end
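The short-tag test relies on the root element name: as far as I know, ONIX 2.1 short-tag files use `ONIXmessage` (lower-case m) while reference-tag files use `ONIXMessage`. A minimal sketch of that detection as a standalone helper follows. The name `short_tags?` is hypothetical, and the 1024-byte window mirrors the read in `initialize`.

```ruby
# Hypothetical helper, not part of Cacofonix: decide whether a file uses
# ONIX short tags by inspecting only the first chunk of the file.
def short_tags?(path, bytes = 1024)
  # IO#read returns nil for an empty file, so fall back to an empty string.
  head = File.open(path, "r") { |f| f.read(bytes) } || ""
  # The comparison is case sensitive, so "ONIXMessage" does not match.
  head.include?("ONIXmessage")
end
```

Reading a fixed-size head keeps the check cheap for large files, at the cost of missing a root element that appears after an unusually long prologue.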
XML files shouldn't contain low ASCII control chars. Strip them.
# File lib/cacofonix/utils/normaliser.rb, line 117
def remove_control_chars(src, dest)
  inpath = File.expand_path(src)
  outpath = File.expand_path(dest)
  `cat #{inpath} | tr -d "\\000-\\010\\013\\014\\016-\\037" > #{outpath}`
end
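The `tr` call deletes bytes 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F, which leaves tab, line feed and carriage return in place. As a rough pure-Ruby sketch of the same filter, the helper below uses `String#delete` on the raw bytes, which also sidesteps shell quoting problems with paths that contain spaces. The name `strip_control_chars` is hypothetical, and the sketch reads the whole file into memory, which may not suit very large files.

```ruby
# Same byte set as the tr command: 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F.
CONTROL_CHARS = "\x00-\x08\x0B\x0C\x0E-\x1F"

# Hypothetical helper, not part of Cacofonix: copy src to dest with the
# low ASCII control characters removed.
def strip_control_chars(src, dest)
  # Work on raw bytes so that the file's text encoding does not matter.
  data = File.binread(src)
  File.binwrite(dest, data.delete(CONTROL_CHARS))
end
```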
This is deprecated. Use normalise_to_path with a path instead.
# File lib/cacofonix/utils/normaliser.rb, line 54
def run
  normalise_to_path(@newfile)
end