class FormatParser::RemoteIO

Wraps a given URL in an IO-like object that you can read from and seek in. Uses Faraday under the hood to perform fetches, so if you apply Faraday configuration tweaks using `Faraday.default_connection = …`, these will take effect for RemoteIO objects as well.

Public Class Methods

new(uri, headers: {})

@param uri [URI, String] the remote URL to obtain
@param headers [Hash] (optional) the HTTP headers to be used in the HTTP request

# File lib/remote_io.rb, line 28
def initialize(uri, headers: {})
  require 'faraday'
  require 'faraday_middleware/response/follow_redirects'
  @headers = headers
  @uri = uri
  @pos = 0
  @remote_size = false
end

Public Instance Methods

pos()

Emulates IO#pos

# File lib/remote_io.rb, line 44
def pos
  @pos
end

read(n_bytes)

Emulates IO#read, but requires the number of bytes to read. The read is limited to the size of the remote resource relative to the current offset in the IO: if you are at offset 0 in an IO of size 10, a `read(20)` returns only 10 bytes and does not raise any exceptions.

@param n_bytes [Integer] how many bytes to read (required; the implementation shown below computes `@pos + n_bytes - 1`, so `nil` is not accepted)
@return [String, nil] the bytes read, or `nil` when reading past the end of the remote resource

# File lib/remote_io.rb, line 64
def read(n_bytes)
  http_range = (@pos..(@pos + n_bytes - 1))
  maybe_size, maybe_body = Measurometer.instrument('format_parser.RemoteIO.read') { request_range(http_range) }
  if maybe_size && maybe_body
    @remote_size = maybe_size
    @pos += maybe_body.bytesize
    maybe_body.force_encoding(Encoding::ASCII_8BIT)
  end
end
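
As a minimal sketch of how the offset and byte count above turn into an HTTP `Range` header: both ends of an HTTP byte range are inclusive, which is why the source subtracts 1 from the end. The helper name `range_header_for` is hypothetical and not part of FormatParser.

```ruby
# Hypothetical helper for illustration; not part of FormatParser.
# Builds the Range header value for reading n_bytes starting at pos.
def range_header_for(pos, n_bytes)
  first_byte = pos
  last_byte = pos + n_bytes - 1 # HTTP byte ranges include the last byte
  format('bytes=%d-%d', first_byte, last_byte)
end

range_header_for(0, 10)  # "bytes=0-9"
range_header_for(100, 1) # "bytes=100-100"
```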

seek(offset)

Emulates IO#seek

# File lib/remote_io.rb, line 38
def seek(offset)
  @pos = offset
  0 # always return 0
end

size()

Emulates IO#size.

@return [Integer] the size of the remote resource

# File lib/remote_io.rb, line 51
def size
  raise 'Remote size not yet obtained, need to perform at least one read() to retrieve it' unless @remote_size
  @remote_size
end
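
The public methods above can be tried out without a network using an in-memory stand-in. The sketch below is an assumption-laden illustration, not the library's code: the class `InMemoryRangeIO` is hypothetical, performs no HTTP requests, and only mirrors the documented behaviour (reads are clamped to the available bytes, a read at or past the end returns `nil`, and `size` raises until one read has succeeded).

```ruby
# Hypothetical in-memory stand-in for illustration only; it is not part of
# FormatParser and performs no HTTP requests.
class InMemoryRangeIO
  attr_reader :pos

  def initialize(data)
    @data = data.b       # binary copy, comparable to the ASCII-8BIT bodies
    @pos = 0
    @known_size = false  # mirrors @remote_size: unknown until a read succeeds
  end

  def seek(offset)
    @pos = offset
    0
  end

  def size
    raise 'size not known before the first successful read' unless @known_size
    @known_size
  end

  def read(n_bytes)
    # Reading at or past the end yields nil, like the 416 branch of request_range
    return nil if @pos >= @data.bytesize

    chunk = @data.byteslice(@pos, n_bytes) # returns fewer bytes near the end
    @known_size = @data.bytesize
    @pos += chunk.bytesize
    chunk
  end
end

io = InMemoryRangeIO.new('0123456789')
first = io.read(20) # only 10 bytes exist, so 10 bytes come back
tail = io.read(1)   # the offset is now 10, so this is nil
io.seek(7)
last3 = io.read(3)  # "789"
```

Callers of the real class would follow the same pattern: perform at least one `read` before asking for `size`, and treat a `nil` result as end of data.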

Protected Instance Methods

request_range(range)

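Only used internally when reading the remote file.

@param range [Range] the HTTP byte range of data to fetch from the remote
@return [Array, nil] the total size of the remote resource and the response body of the ranged request, or `nil` if the server replied with 416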

# File lib/remote_io.rb, line 80
def request_range(range)
  # We use a GET and not a HEAD request followed by a GET because
  # S3 does not allow HEAD requests if you only presigned your URL for GETs, so we
  # combine the first GET of a segment and retrieving the size of the resource
  conn = Faraday.new(headers: @headers) do |faraday|
    faraday.use FaradayMiddleware::FollowRedirects
    # we still need the default adapter, more details: https://blog.thecodewhisperer.com/permalink/losing-time-to-faraday
    faraday.adapter Faraday.default_adapter
  end
  response = conn.get(@uri, nil, range: 'bytes=%d-%d' % [range.begin, range.end])

  case response.status
  when 200
    # S3 returns 200 when you request a Range that is fully satisfied by the entire object,
    # we take that into account here. Also, for very tiny responses (and also for empty responses)
    # the responses are going to be 200 which does not mean we cannot proceed
    # To have a good check for both of these conditions we need to know whether the ranges overlap fully
    response_size = response.body.bytesize
    requested_range_size = range.end - range.begin + 1
    if response_size > requested_range_size
      error_message = [
        "We requested #{requested_range_size} bytes, but the server sent us more",
        "(#{response_size} bytes) - it likely has no `Range:` support.",
        "The error occurred when talking to #{@uri}"
      ]
      raise InvalidRequest.new(response.status, error_message.join("\n"))
    end
    [response_size, response.body]
  when 206
    # Figure out if the server supports content ranges; if it doesn't, we have no
    # business working with that server
    range_header = response.headers['Content-Range']
    raise InvalidRequest.new(response.status, "The server replied with 206 status but no Content-Range at #{@uri}") unless range_header

    # "Content-Range: bytes 0-0/307404381" is how the response header is structured
    size = range_header[/\/(\d+)$/, 1].to_i

    # If we request a _larger_ range than what can be satisfied by the server,
    # the response is going to only contain what _can_ be sent and the status is also going
    # to be 206
    return [size, response.body]
  when 416
    # We return `nil` if we tried to read past the end of the IO,
    # which satisfies the Ruby IO convention. The caller should deal with `nil` being the result of a read()
    # S3 will also handily _not_ supply us with the Content-Range of the actual resource, so we
    # cannot hint size with this response - at least not when working with S3
    return
  when 500..599
    Measurometer.increment_counter('format_parser.RemoteIO.upstream50x_errors', 1)
    raise IntermittentFailure.new(response.status, "Server at #{@uri} replied with a #{response.status} and we might want to retry")
  else
    Measurometer.increment_counter('format_parser.RemoteIO.invalid_request_errors', 1)
    raise InvalidRequest.new(response.status, "Server at #{@uri} replied with a #{response.status} and refused our request")
  end
end
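
The 206 branch above reads the total size of the resource from the number after the slash in `Content-Range`. A minimal sketch of that extraction follows, using the same regular expression as the source; the helper name `total_size_from_content_range` is hypothetical. One deliberate difference: the source applies `.to_i` directly to the match, so a header whose total is not numeric (the HTTP range specification allows `*` for an unknown length) would produce 0 there, while this sketch returns `nil` instead.

```ruby
# Hypothetical helper for illustration; not part of FormatParser.
# Extracts the total resource size from a Content-Range header value.
def total_size_from_content_range(header)
  digits = header[/\/(\d+)$/, 1] # nil when the total is not numeric
  digits ? digits.to_i : nil
end

total_size_from_content_range('bytes 0-0/307404381') # 307404381
total_size_from_content_range('bytes 0-0/*')         # nil
```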