module FormatParser::ZIPParser::OfficeFormats

Based on an unscientific sample of 63 documents I could find on my hard drive, all docx/pptx/xlsx files contain, at a minimum, the following files:

[Content_Types].xml
_rels/.rels
docProps/core.xml
docProps/app.xml

Additionally, depending on the file type, they contain one of the following (for docx, xlsx and pptx respectively):

word/document.xml
xl/workbook.xml
ppt/presentation.xml

These are sufficient to say with certainty that a ZIP is in fact an Office document. That unscientific sample also revealed that I have come to dislike MS Office so much that I only have 63 documents on my entire workstation.

We do not perform the actual decoding of the Office documents here, because to read their contents we need to inflate the compressed entries and parse the XML inside them, both of which expose us to real threats (decompression bombs, XML parser exploits) and require adequate mitigation. For our purposes the token detection of specific filenames should be enough to say with certainty that a document is an Office document, and not just a ZIP.
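
As a rough illustration of detection by filename alone, the sketch below collects entry names from a ZIP central directory without inflating any entry data. It is not this parser's actual implementation: the method name filenames_from_central_directory is hypothetical, and the caller is assumed to have already located the central directory bytes (normally via the end-of-central-directory record). Zip64 and non-UTF-8 entry names are ignored for brevity.

```ruby
require 'set'

CENTRAL_HEADER_SIGNATURE = 0x02014b50 # "PK\x01\x02", little-endian
CENTRAL_HEADER_SIZE = 46              # fixed part of each central directory record

# Hypothetical helper: walks the central directory (a binary String) and
# returns the entry names as a Set. No compressed data is read or inflated.
def filenames_from_central_directory(bytes)
  names = Set.new
  offset = 0
  while offset + CENTRAL_HEADER_SIZE <= bytes.bytesize
    header = bytes.byteslice(offset, CENTRAL_HEADER_SIZE)
    # Stop at the first record that does not start with the expected signature
    break unless header.unpack1('V') == CENTRAL_HEADER_SIGNATURE

    # Filename, extra field and comment lengths sit at byte offsets 28, 30 and 32
    name_len, extra_len, comment_len = header.byteslice(28, 6).unpack('vvv')
    name = bytes.byteslice(offset + CENTRAL_HEADER_SIZE, name_len)
    # Assumption: entry names are ASCII or UTF-8
    names << name.force_encoding(Encoding::UTF_8)

    offset += CENTRAL_HEADER_SIZE + name_len + extra_len + comment_len
  end
  names
end
```

The resulting Set can be passed straight to the methods documented below.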

Constants

OFFICE_MARKER_FILES
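
The value of the constant is not shown on this page. A plausible definition, assuming it simply holds the four common files listed in the module description (spelled as in the Open Packaging Conventions) as a Set, might look like this:

```ruby
require 'set'

# Assumed definition, reconstructed from the prose above rather than
# copied from the library source.
OFFICE_MARKER_FILES = Set.new([
  '[Content_Types].xml',
  '_rels/.rels',
  'docProps/core.xml',
  'docProps/app.xml',
]).freeze
```

Holding the names in a Set is what makes the subset? check in office_document? possible.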

Public Instance Methods

office_document?(filenames_set)
# File lib/parsers/zip_parser/office_formats.rb, line 36
def office_document?(filenames_set)
  OFFICE_MARKER_FILES.subset?(filenames_set)
end

office_file_format_and_mime_type_from_entry_set(filenames_set)
# File lib/parsers/zip_parser/office_formats.rb, line 40
def office_file_format_and_mime_type_from_entry_set(filenames_set)
  if filenames_set.include?('word/document.xml')
    [:docx, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document']
  elsif filenames_set.include?('xl/workbook.xml')
    [:xlsx, 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet']
  elsif filenames_set.include?('ppt/presentation.xml')
    [:pptx, 'application/vnd.openxmlformats-officedocument.presentationml.presentation']
  else
    [:unknown, 'application/zip']
  end
end
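
To show how the two methods might be combined, here is a minimal, self-contained sketch. OfficeFormatsSketch is a stand-in for the real module with an assumed OFFICE_MARKER_FILES value, and EntryNameClassifier, its classify method and the [:zip, 'application/zip'] fallback for non-Office archives are all hypothetical names chosen for illustration.

```ruby
require 'set'

# Stand-in for the real module, reduced to what is shown on this page
module OfficeFormatsSketch
  # Assumed value; the real constant is not shown in this documentation
  OFFICE_MARKER_FILES = Set.new([
    '[Content_Types].xml',
    '_rels/.rels',
    'docProps/core.xml',
    'docProps/app.xml',
  ])

  def office_document?(filenames_set)
    OFFICE_MARKER_FILES.subset?(filenames_set)
  end

  def office_file_format_and_mime_type_from_entry_set(filenames_set)
    if filenames_set.include?('word/document.xml')
      [:docx, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document']
    elsif filenames_set.include?('xl/workbook.xml')
      [:xlsx, 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet']
    elsif filenames_set.include?('ppt/presentation.xml')
      [:pptx, 'application/vnd.openxmlformats-officedocument.presentationml.presentation']
    else
      [:unknown, 'application/zip']
    end
  end
end

# Hypothetical caller that classifies an archive from its entry names only
class EntryNameClassifier
  include OfficeFormatsSketch

  def classify(entry_names)
    # Convert to a Set first, since Set#subset? only accepts another Set
    names = entry_names.to_set
    return [:zip, 'application/zip'] unless office_document?(names)

    office_file_format_and_mime_type_from_entry_set(names)
  end
end
```

Note that an archive carrying all four marker files but none of the three type-specific parts still passes office_document? and ends up as [:unknown, 'application/zip'].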