class RemoteTable
Open Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, and fixed-width files.
Constants
- DEFAULT
- EXTERNAL_ENCODING
- EXTERNAL_ENCODING_ICONV
- GOOGLE_DOCS_SPREADSHEET
- OLD_SETTING_NAMES
- SINGLE_SPACE
- VALID
- VERSION
- WHITESPACE
Attributes
@private A cache of rows, created unless :streaming is enabled. @return [Array<Hash,Array>]
The CSS selector used to find columns in HTML or XML. @return [String]
The XPath used to find columns in HTML or XML. @return [String]
The compression type. Guessed from URL if not provided. :gz, :zip, :bz2, and :exe (treated as :zip) are supported. @return [Symbol]
Use a range of rows in a plaintext file.
@return [Range]
@example Only take rows 21 through 37
RemoteTable.new("http://www.eia.gov/emeu/cbecs/cbecs2003/detailed_tables_2003/2003set10/2003excel/C17.xls", :headers => false, :select => proc { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) }, :crop => (21..37))
Pick specific columns out of a plaintext file using an argument to the UNIX [cut utility](en.wikipedia.org/wiki/Cut_%28Unix%29).
@return [String]
@example Pick ALMOST out of ABCDEFGHIJKLMNOPQRSTUVWXYZ
# $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZ | cut -c '1,12,13,15,19,20'
# ALMOST
RemoteTable.new 'file:///atoz.txt', :cut => '1,12,13,15,19,20'
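To see which characters a :cut argument selects without shelling out, the position list can be imitated in plain Ruby. This is only a sketch: cut_chars is a hypothetical helper, not part of RemoteTable, and it handles a plain comma-separated list of 1-based positions only (no ranges such as 3-5).

```ruby
# Hypothetical helper, not part of RemoteTable: pick the characters at the
# given 1-based positions, as `cut -c` does for a plain list of positions.
def cut_chars(line, positions)
  positions.split(',').map { |pos| line[pos.to_i - 1] }.join
end

puts cut_chars('ABCDEFGHIJKLMNOPQRSTUVWXYZ', '1,12,13,15,19,20') # prints ALMOST
```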
The delimiter, a.k.a. column separator. Passed to Ruby CSV as :col_sep. Default is ‘,’. @return [String]
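Since the delimiter ends up as the :col_sep option of Ruby's CSV library, its effect can be tried directly against CSV. The sketch below uses only the standard csv library; the sample data is made up, and how RemoteTable itself invokes CSV is not shown here.

```ruby
require 'csv'

# Made-up tab-separated sample with a header row.
tsv = "name\tiso\nFrance\tFR\nJapan\tJP\n"

# :col_sep is the CSV option that the delimiter setting is passed through as.
rows = CSV.parse(tsv, :col_sep => "\t", :headers => true).map(&:to_h)

rows.each { |row| puts "#{row['iso']}: #{row['name']}" }
```

With RemoteTable the equivalent setting would presumably be :delimiter => "\t".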
@private How many times this file has been downloaded. RemoteTable will emit a warning if you download it more than once. @return [Integer]
The original encoding of the source file. Default is UTF-8. @return [String]
An object that responds to rejects?(row) and correct!(row). Applied after creating row_hash.
- rejects?(row) - if row should be treated like it doesn’t exist
- correct!(row) - destructively update a row to fix something
See the Errata library at github.com/seamusabshere/errata for an example implementation.
@return [Hash]
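A minimal errata object might look like the sketch below. Only the two method names come from the description above; the class name, the column name and the correction rule are hypothetical, and plain hashes stand in for rows.

```ruby
# Hypothetical errata object: only rejects?(row) and correct!(row) are
# required by the interface described above.
class CountryErrata
  # Treat rows without a name as if they did not exist.
  def rejects?(row)
    row['name'].to_s.strip.empty?
  end

  # Destructively fix a known misspelling.
  def correct!(row)
    row['name'] = 'United States' if row['name'] == 'Untied States'
    row
  end
end

rows = [{ 'name' => 'Untied States' }, { 'name' => ' ' }, { 'name' => 'France' }]
errata = CountryErrata.new

# Same order as in RemoteTable#each: reject first, then correct.
cleaned = rows.reject { |row| errata.rejects?(row) }
cleaned.each { |row| errata.correct!(row) }

cleaned.each { |row| puts row['name'] }
```

Such an object would be passed as :errata => CountryErrata.new.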
The filename, which can be used to pick a file out of an archive.
@return [String]
@example Specify the filename to get out of a ZIP file
RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :filename => '2008_FE_guide_ALL_rel_dates_-no sales-for DOE-5-1-08.csv'
Form data to POST in the download request. It should probably be in application/x-www-form-urlencoded. @return [String]
The format of the source file. Can be :xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :json. @return [Symbol]
The glob used to pick a file out of an archive.
@return [String]
@example Pick out the only CSV in a ZIP file
RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :glob => '/*.csv'
Headers specified by the user: :first_row (the default), false, or a list of headers. @return [:first_row,false,Array<String>]
Whether to keep blank rows. Default is false. @return [true,false]
@private Used internally to access a downloaded copy of the file. @return [RemoteTable::LocalCopy]
Options passed by the user that may be passed through to the underlying parsing library. @return [Hash]
The packing type. Guessed from URL if not provided. Only :tar is supported. @return [Symbol]
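A proc that decides whether to exclude a row; rows for which it returns a truthy value are skipped. Previously passed as :reject. @return [Proc]
A proc that decides whether to include a row; rows for which it returns a falsy value are skipped. Previously passed as :select. @return [Proc]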
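The two procs work in opposite directions: in RemoteTable#each a row is skipped when pre_select returns a falsy value or when pre_reject returns a truthy one. The sketch below replays that logic on plain hashes; the column name is borrowed from the ZIP example in this document and the sample rows are made up.

```ruby
# Keep only rows whose 'Vehicle Age' is a whole number...
pre_select = proc { |row| row['Vehicle Age'].to_s.strip =~ /\A\d+\z/ }
# ...and then drop the rows where that number is zero.
pre_reject = proc { |row| row['Vehicle Age'].to_s.strip == '0' }

rows = [
  { 'Vehicle Age' => ' 0 ' },
  { 'Vehicle Age' => '12' },
  { 'Vehicle Age' => 'Total' }
]

# Same order as in RemoteTable#each: pre_select first, then pre_reject.
kept = rows.select { |row| pre_select.call(row) }
           .reject { |row| pre_reject.call(row) }

kept.each { |row| puts row['Vehicle Age'] } # prints 12
```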
Quote character for delimited files.
Defaults to double quotes.
@return [String]
The root node of the json document. Specified as a string.
Default: nil; no root node.
@return [String]
The CSS selector used to find rows in HTML or XML. @return [String]
The XPath used to find rows in HTML or XML. @return [String]
The fixed-width schema, given as a multi-dimensional array.
@return [Array<Array{String,Integer,Hash}>]
@example From the tests
RemoteTable.new('http://cloud.github.com/downloads/seamusabshere/remote_table/test2.fixed_width.txt', :format => :fixed_width, :skip => 1, :schema => [[ 'header4', 10, { :type => :string } ], [ 'spacer', 1 ], [ 'header5', 10, { :type => :string } ], [ 'spacer', 12 ], [ 'header6', 10, { :type => :string } ]])
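To make the shape of the schema concrete, here is a sketch that slices one line by the widths in such an array. It is an illustration, not the library's parser: parse_fixed_width is a hypothetical helper, the sample line is made up, and dropping the 'spacer' columns and stripping padding are assumptions.

```ruby
SCHEMA = [
  ['header4', 10, { :type => :string }],
  ['spacer', 1],
  ['header5', 10, { :type => :string }],
  ['spacer', 12],
  ['header6', 10, { :type => :string }]
]

# Hypothetical helper: walk the schema left to right, consuming `width`
# characters per entry and keeping everything except spacers.
def parse_fixed_width(line, schema)
  offset = 0
  schema.each_with_object({}) do |(name, width, _options), row|
    field = line[offset, width].to_s
    offset += width
    next if name == 'spacer'
    row[name] = field.strip
  end
end

line = 'aaa'.ljust(10) + ' ' + 'bbb'.ljust(10) + (' ' * 12) + 'ccc'.ljust(10)
row = parse_fixed_width(line, SCHEMA)
puts row['header4'], row['header5'], row['header6']
```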
If you somehow already defined a fixed-width schema (so you can re-use it?), specify it here. @return [String,Symbol]
The sheet specified by the user as a number or a string. @return [Integer,String]
How many rows to skip at the beginning of the file or table. Default is 0. @return [Integer]
When to trim untitled headers. Set this to 100 to prevent more than 100 untitled headers being created; the rest will be silently discarded.
Note: This is effectively a right trim… the counting starts from the left.
Default: false, don’t try
@return [Integer]
Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false. @return [true,false]
The URL of the local or remote file.
@example Local
file:///Users/myuser/Desktop/holidays.csv
@example Local using an absolute path
/Users/myuser/Desktop/holidays.csv
@example Remote
http://data.brighterplanet.com/countries.csv
@return [String]
Whether to warn the user on multiple downloads. Defaults to true. @return [true,false]
Public Class Methods
Given a Google Docs spreadsheet URL, make sure it uses CSV output. @return [String]
# File lib/remote_table.rb, line 102
def google_spreadsheet_csv_url(url)
  uri = ::URI.parse url
  params = uri.query.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end
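The rewrite can be tried in isolation with the standard uri library. The sketch below is a hypothetical variant, csv_output_url, not the library method: it adds to_s on the query because URI#query is nil for a URL without a query string, in which case the split call in the listing above would raise NoMethodError. The example.com URLs are placeholders.

```ruby
require 'uri'

# Hypothetical variant of google_spreadsheet_csv_url that also accepts
# URLs with no query string.
def csv_output_url(url)
  uri = URI.parse(url)
  params = uri.query.to_s.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end

puts csv_output_url('https://example.com/pub?key=abc&output=html') # https://example.com/pub?key=abc&output=csv
puts csv_output_url('https://example.com/pub')                      # https://example.com/pub?output=csv
```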
Guess compression based on URL. Used internally. @return [Symbol,nil]
# File lib/remote_table.rb, line 51
def guess_compression(url)
  extname = extname(url).downcase
  case extname
  when /gz/, /gunzip/
    :gz
  when /zip/
    :zip
  when /bz2/, /bunzip2/
    :bz2
  when /exe/
    :exe
  end
end
Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can’t be called until a file is downloaded. @return [Symbol,nil]
# File lib/remote_table.rb, line 76
def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/
    :ods
  when /xlsx\z/, /excelx\z/
    :xlsx
  when /xls\z/, /excel\z/
    :xls
  when /csv\z/, /tsv\z/, /delimited\z/
    # note that there is no RemoteTable::Csv class - it's normalized to :delimited
    :delimited
  when /fixed_?width\z/
    :fixed_width
  when /html?\z/
    :html
  when /xml\z/
    :xml
  when /yaml\z/, /yml\z/
    :yaml
  when /json\z/
    :json
  end
end
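A condensed restatement of these patterns makes it easy to check how particular names are classified. FORMAT_PATTERNS and guess_format_sketch are hypothetical names used for experimentation only; the patterns themselves are copied from the listing above.

```ruby
# Same patterns and order as guess_format, restated as a lookup table.
FORMAT_PATTERNS = {
  /ods\z|open_?office\z/    => :ods,
  /xlsx\z|excelx\z/         => :xlsx,
  /xls\z|excel\z/           => :xls,
  /csv\z|tsv\z|delimited\z/ => :delimited,
  /fixed_?width\z/          => :fixed_width,
  /html?\z/                 => :html,
  /xml\z/                   => :xml,
  /yaml\z|yml\z/            => :yaml,
  /json\z/                  => :json
}

def guess_format_sketch(basename)
  name = basename.to_s.downcase.strip
  match = FORMAT_PATTERNS.find { |pattern, _format| pattern =~ name }
  match && match.last
end

p guess_format_sketch('Countries.CSV ')        # :delimited
p guess_format_sketch(:csv)                    # :delimited
p guess_format_sketch('report.xlsx')           # :xlsx
p guess_format_sketch('test2.fixed_width.txt') # nil
```

The last call returns nil because only the end of the name is matched, which is why the fixed-width example in this document passes :format => :fixed_width explicitly. Since initialize also runs the user's :format setting through guess_format, a value such as :csv is normalized to :delimited.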
Guess packing from URL. Used internally. @return [Symbol,nil]
# File lib/remote_table.rb, line 67
def guess_packing(url)
  basename = basename(url).downcase
  if basename.include?('.tar') or basename.include?('.tgz')
    :tar
  end
end
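Compression and packing are guessed independently, so a single URL can yield both. The sketch below restates the two guessers together with the path helper so it runs on its own; the *_sketch and url_path names are hypothetical and the example.com URLs are placeholders.

```ruby
require 'uri'

# Path portion of a URL, or the expanded path of a local file name.
def url_path(url)
  url.include?('://') ? URI.parse(url).path : File.expand_path(url)
end

def guess_compression_sketch(url)
  case File.extname(url_path(url)).downcase
  when /gz/, /gunzip/   then :gz
  when /zip/            then :zip
  when /bz2/, /bunzip2/ then :bz2
  when /exe/            then :exe
  end
end

def guess_packing_sketch(url)
  basename = File.basename(url_path(url)).downcase
  :tar if basename.include?('.tar') || basename.include?('.tgz')
end

%w[
  http://example.com/data.tar.gz
  http://example.com/data.tgz
  http://www.fueleconomy.gov/FEG/epadata/08data.zip
  /tmp/holidays.csv
].each do |url|
  puts "#{url}: compression=#{guess_compression_sketch(url).inspect} packing=#{guess_packing_sketch(url).inspect}"
end
```

Note that .tgz is recognized as both :gz compression and :tar packing, because the /gz/ pattern matches anywhere in the extension.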
Create a new RemoteTable, which is an Enumerable.
Options are set at creation using any of the attributes listed… RDoc will say they’re “read-only” because you can’t set/change them after creation.
Does not immediately download/parse… it’s lazy-loading.
@overload initialize(settings)
@param [Hash] settings Settings including +:url+.
@overload initialize(url, settings)
@param [String] url The URL to the local or remote file.
@param [Hash] settings Settings.
@example Open an XLSX
RemoteTable.new('http://www.customerreferenceprogram.org/uploads/CRP_RFP_template.xlsx')
@example Open a CSV inside a ZIP file
RemoteTable.new 'http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip', :filename => 'Annex Tables/Annex 3/Table A-93.csv', :skip => 1, :pre_select => proc { |row| row['Vehicle Age'].strip =~ /^\d+$/ }
# File lib/remote_table.rb, line 405
def initialize(*args)
  @download_count_mutex = ::Mutex.new
  @extend_bang_mutex = ::Mutex.new
  @cache = []
  @download_count = 0
  settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}
  @url = if args.first.is_a? ::String
    args.first
  else
    grab settings, :url
  end
  @format = RemoteTable.guess_format grab(settings, :format)
  if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
    @url = RemoteTable.google_spreadsheet_csv_url url
    @format = :delimited
  end
  @headers = grab settings, :headers
  if headers.is_a?(::Array) and headers.any?(&:blank?)
    raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
  end
  @quote_char = grab settings, :quote_char
  @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
  @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)
  @streaming = grab settings, :streaming
  @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
  @delimiter = grab settings, :delimiter
  @sheet = grab settings, :sheet
  @keep_blank_rows = grab settings, :keep_blank_rows
  @form_data = grab settings, :form_data
  @skip = grab settings, :skip
  @encoding = grab settings, :encoding
  @row_xpath = grab settings, :row_xpath
  @column_xpath = grab settings, :column_xpath
  @row_css = grab settings, :row_css
  @column_css = grab settings, :column_css
  @glob = grab settings, :glob
  @filename = grab settings, :filename
  @cut = grab settings, :cut
  @crop = grab settings, :crop
  @schema = grab settings, :schema
  @schema_name = grab settings, :schema_name
  @pre_select = grab settings, :pre_select
  @pre_reject = grab settings, :pre_reject
  @errata = grab settings, :errata
  @root_node = grab settings, :root_node
  @parser = grab settings, :parser
  @stop_after_untitled_headers = grab settings, :stop_after_untitled_headers
  @other_options = settings
  @local_copy = LocalCopy.new self
  extend!
end
# File lib/remote_table.rb, line 111
def normalize_whitespace(v)
  v = v.to_s.dup
  v.gsub! WHITESPACE, SINGLE_SPACE
  v.strip!
  v
end
Transpose two columns into a mapping from one to the other.
# File lib/remote_table.rb, line 42
def transpose(url, key_key, value_key, options = {})
  new(url, options).inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end
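The effect of transpose can be reproduced on an in-memory array of row hashes, which stands in here for the rows a RemoteTable would yield. The column names and sample rows are made up.

```ruby
# Stand-in for the rows of a table with 'iso' and 'name' columns.
rows = [
  { 'iso' => 'FR', 'name' => 'France' },
  { 'iso' => 'JP', 'name' => 'Japan' }
]

# Same fold as in transpose: key column => value column.
mapping = rows.inject({}) do |memo, row|
  memo[row['iso']] = row['name']
  memo
end

puts mapping['JP'] # prints Japan
```

With a real table the call would take the form RemoteTable.transpose(url, 'iso', 'name'). If the key column contains duplicates, later rows overwrite earlier ones.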
Private Class Methods
# File lib/remote_table.rb, line 120
def basename(url)
  ::File.basename path(url)
end

# File lib/remote_table.rb, line 124
def extname(url)
  ::File.extname path(url)
end

# File lib/remote_table.rb, line 128
def path(url)
  if url.include?('://')
    ::URI.parse(url).path
  else
    File.expand_path url
  end
end
Public Instance Methods
Get a row by row number. Zero-based.
@return [Hash,Array]
# File lib/remote_table.rb, line 519
def [](row_number)
  if fully_cached?
    cache[row_number]
  else
    to_a[row_number]
  end
end
Yield each row.
@return [nil]
@yield [Hash,Array] A hash or an array depending on whether the RemoteTable has named headers (column names).
# File lib/remote_table.rb, line 470
def each
  if fully_cached?
    cache.each do |row|
      yield row
    end
  else
    mark_download!
    preprocess!
    memo = _each do |row|
      parser.call(row).each do |virtual_row|
        virtual_row.row_hash = ::HashDigest.digest3 row
        if errata
          next if errata.rejects? virtual_row
          errata.correct! virtual_row
        end
        next if pre_select and !pre_select.call(virtual_row)
        next if pre_reject and pre_reject.call(virtual_row)
        unless streaming
          cache.push virtual_row
        end
        yield virtual_row
      end
    end
    unless streaming
      fully_cached!
    end
    memo
  end
  nil
end
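The caching behaviour of each can be illustrated with a small stand-in class: the first full enumeration reads the source and fills the cache, later enumerations replay the cache, and with streaming enabled the source is read every time. CachingRows is hypothetical and leaves out downloading, parsing, errata and the select/reject procs.

```ruby
# Hypothetical stand-in for a table; counts how often the source is read.
class CachingRows
  include Enumerable
  attr_reader :loads

  def initialize(source, streaming = false)
    @source = source
    @streaming = streaming
    @cache = []
    @fully_cached = false
    @loads = 0
  end

  def each
    if @fully_cached
      @cache.each { |row| yield row }
    else
      @loads += 1 # corresponds to mark_download! in the listing above
      @source.each do |row|
        @cache.push row unless @streaming
        yield row
      end
      @fully_cached = true unless @streaming
    end
    nil
  end
end

data = [{ 'n' => 1 }, { 'n' => 2 }]
cached = CachingRows.new(data)
streamed = CachingRows.new(data, true)
2.times { cached.to_a }
2.times { streamed.to_a }

puts cached.loads   # prints 1
puts streamed.loads # prints 2
```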
Clear the row cache in case it helps your GC.
@return [nil]
# File lib/remote_table.rb, line 530
def free
  @fully_cached = false
  cache.clear
  nil
end
An object that responds to call(row) and returns an array of one or more rows.
@return [#call]
# File lib/remote_table.rb, line 376
def parser
  @final_parser ||= (@parser || NullParser.new)
end
@return [Array<Hash,Array>] All rows.
# File lib/remote_table.rb, line 505
def to_a
  if fully_cached?
    cache.dup
  else
    map { |row| row }
  end
end
Private Instance Methods
# File lib/remote_table.rb, line 565
def assume_utf8(str)
  if str.is_a?(::String) and ::RUBY_VERSION >= '1.9'
    str.encode! EXTERNAL_ENCODING
  else
    str
  end
end

# File lib/remote_table.rb, line 593
def extend!
  return if @extend_bang
  @extend_bang_mutex.synchronize do
    return if @extend_bang
    @extend_bang = true
    format_module = if format
      RemoteTable.const_get format.to_s.camelcase
    elsif format = RemoteTable.guess_format(local_copy.path)
      @format = format
      RemoteTable.const_get format.to_s.camelcase
    else
      Delimited
    end
    extend format_module
    after_extend if respond_to?(:after_extend)
  end
end

# File lib/remote_table.rb, line 551
def fully_cached!
  @fully_cached = true
end

# File lib/remote_table.rb, line 555
def fully_cached?
  !!@fully_cached
end

# File lib/remote_table.rb, line 573
def grab(settings, k)
  user_specified = false
  memo = nil
  if (old_names = OLD_SETTING_NAMES[k]) and old_names.any? { |old_k| settings.has_key?(old_k) }
    user_specified = true
    memo = old_names.map { |old_k| settings.delete(old_k) }.compact.first
  end
  if settings.has_key?(k)
    user_specified = true
    memo = settings.delete k
  end
  if not user_specified and DEFAULT.has_key?(k)
    memo = DEFAULT[k]
  end
  if memo and (valid = VALID[k]) and not valid.include?(memo.to_sym)
    raise ::ArgumentError, %{[remote_table] #{k.inspect} => #{memo.inspect} is not a valid setting. Valid settings are #{valid.inspect}.}
  end
  memo
end

# File lib/remote_table.rb, line 542
def mark_download!
  @download_count_mutex.synchronize do
    @download_count += 1
  end
  if warn_on_multiple_downloads and download_count > 1
    ::Kernel.warn "[remote_table] #{url} has been downloaded #{download_count} times."
  end
end

# File lib/remote_table.rb, line 538
def preprocess!
  # noop, overridden sometimes
end

# File lib/remote_table.rb, line 559
def transliterate_to_utf8(str)
  if str.is_a?(::String)
    ::ActiveSupport::Inflector.transliterate str
  end
end