class PDF::Reader::ObjectHash

This monkey-patches pdf-reader so that it can read PDFs that have junk characters in the file before the start of the PDF stream. The junk is quite commonly an HTML head block; I suspect a bug in the Adobe or other software used to serve the bills.

The patch has been contributed back to the pdf-reader project (github.com/yob/pdf-reader/pull/54) and has already been merged into master. Once it shows up in a release of the pdf-reader gem, we can trash this patch.

Public Instance Methods

extract_io_from(input)
# File lib/pdf/reader/patch/object_hash.rb, line 12
def extract_io_from(input)
  if input.respond_to?(:seek) && input.respond_to?(:read)
    input
  elsif File.file?(input.to_s)
    read_with_quirks(input)
  else
    raise ArgumentError, "input must be an IO-like object or a filename"
  end
end

Private Instance Methods

pdf_offset(stream)

Returns the offset of the PDF document in the stream. Scans the first 50 characters of the file for the leading '%'; returns nil if no PDF stream is detected.

# File lib/pdf/reader/patch/object_hash.rb, line 37
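def pdf_offset(stream)
  stream.rewind
  ofs = stream.pos
  # readchar returns a String on Ruby 1.9+ and an Integer (37 == '%') on 1.8
  until (c = stream.readchar) == '%' || c == 37 || ofs > 50
    ofs += 1
  end
  ofs < 50 ? ofs : nil
rescue EOFError
  # the stream ended before a '%' was found
  nil
end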
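
As a rough illustration of the scan described above, the following standalone sketch runs the same idea against an in-memory stream. The helper name demo_pdf_offset, its limit parameter and the sample bytes are hypothetical and not part of the patch; it uses getc instead of readchar so that a short stream yields nil rather than raising.

```ruby
require "stringio"

# Hypothetical standalone version of the scan, for illustration only.
# Returns the index of the first '%' within the first `limit` characters,
# or nil if none is found or the stream ends first.
def demo_pdf_offset(stream, limit = 50)
  stream.rewind
  limit.times do |ofs|
    c = stream.getc            # nil at end of stream
    return nil if c.nil?
    return ofs if c == "%"
  end
  nil
end

junk = "<html><head></head>"   # 19 characters of leading junk
io   = StringIO.new(junk + "%PDF-1.4\n1 0 obj\n")

ofs = demo_pdf_offset(io)      # 19, the length of the junk prefix
io.seek(ofs)
clean = io.read                # starts at "%PDF-1.4"
```

Note that, like the patch, this sketch only looks for a '%' character, so junk that itself contains a '%' would be detected at the wrong offset.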

read_with_quirks(input)

Loads the file as a StringIO stream, skipping any additional characters that precede the %PDF marker at the start of the file. Raises ArgumentError if no PDF start is detected.

# File lib/pdf/reader/patch/object_hash.rb, line 24
def read_with_quirks(input)
  # the block form closes the file handle once the contents are in memory
  File.open(input.to_s, "rb") do |stream|
    if ofs = pdf_offset(stream)
      stream.seek(ofs)
      StringIO.new(stream.read)
    else
      raise ArgumentError, "invalid file format"
    end
  end
end
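
To show the file-loading behaviour end to end without the gem installed, here is a minimal sketch under stated assumptions. demo_read_with_quirks is a hypothetical stand-in for the patched method, and it swaps the character-by-character scan for a single read of the first 50 bytes followed by String#index; the temporary files and their contents are made up for the example.

```ruby
require "stringio"
require "tempfile"

# Hypothetical equivalent of the patched file loading, for illustration only.
def demo_read_with_quirks(path, limit = 50)
  File.open(path, "rb") do |f|
    head = f.read(limit) || ""   # read(n) returns nil on an empty file
    ofs  = head.index("%")
    raise ArgumentError, "invalid file format" if ofs.nil?
    f.seek(ofs)
    StringIO.new(f.read)         # everything from the '%' onwards
  end
end

# A file with an HTML head block in front of the PDF data.
good = Tempfile.create("quirky") do |tmp|
  tmp.binmode
  tmp.write("<html><head></head>%PDF-1.4\n%%EOF\n")
  tmp.flush
  demo_read_with_quirks(tmp.path)
end
first_line = good.gets           # the junk prefix has been dropped

# A file with no '%' near the start is rejected.
error = begin
  Tempfile.create("notpdf") do |tmp|
    tmp.write("plain text, no marker")
    tmp.flush
    demo_read_with_quirks(tmp.path)
  end
  nil
rescue ArgumentError => e
  e.message
end
```

Because the result is a StringIO, it responds to both seek and read, so it passes the IO-like check at the top of extract_io_from.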