class Crawler::Index
Attributes
base_uri[RW]
Public Class Methods
new(base_uri)
Creates a new Index to record paths for a given domain.
# File lib/crawler/index.rb, line 13
def initialize(base_uri)
  @base_uri = base_uri
  clear_stored_results
end
Public Instance Methods
consume_document(path, document)
Ingests a Crawler::Document and stores all relevant data in Redis. Updates the set of pages that still need to be visited as well as the set of pages that have already been visited.
# File lib/crawler/index.rb, line 21
def consume_document(path, document)
  path = normalize_path path
  new_links = document.domain_specific_paths.map { |path| normalize_path path }

  store_path path
  store_path_visited path
  store_path_assets path, document.static_assets
  store_path_links_to path, new_links
  store_paths_to_visit(new_links - get_paths_visited)
  remove_path_from_queue path
  update_paths_linked_to_from_path(document)
end
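The visited/to-visit bookkeeping in consume_document can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: SketchIndex, its consume method, and this particular normalize_path are hypothetical names and behavior, and plain Ruby Sets stand in for the Redis-backed store_* helpers, whose implementations are not shown in the source.

```ruby
require 'set'

# Hypothetical in-memory stand-in for the Redis-backed index (not the real class).
class SketchIndex
  attr_reader :visited, :to_visit, :links_to

  def initialize
    @visited  = Set.new # paths already crawled
    @to_visit = Set.new # paths queued for crawling
    @links_to = {}      # path => outgoing paths found on that page
  end

  # Assumed normalization: trim whitespace, drop one trailing slash,
  # and treat an empty result as the root path.
  def normalize_path(path)
    stripped = path.to_s.strip.chomp('/')
    stripped.empty? ? '/' : stripped
  end

  # Mirrors the order of operations in consume_document:
  # mark visited, record outgoing links, queue unseen links, dequeue self.
  def consume(path, linked_paths)
    path = normalize_path(path)
    new_links = linked_paths.map { |link| normalize_path(link) }

    @visited << path
    @links_to[path] = new_links.uniq
    @to_visit.merge(new_links.reject { |link| @visited.include?(link) })
    @to_visit.delete(path)
  end
end

index = SketchIndex.new
index.consume('/', ['/about/', '/contact', '/'])
index.consume('/about', ['/', '/team'])
```

Because the current path is marked as visited before the new links are queued, a page that links to itself is never added back to the queue.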
results()
Returns the data associated with an indexed domain
# File lib/crawler/index.rb, line 42
def results
  get_domain_data
end
Private Instance Methods
update_paths_linked_to_from_path(document)
Records incoming links for pages. The path of the current document is used as the incoming link and is recorded on every link found in that document. Links that point back to the current document itself are skipped.
# File lib/crawler/index.rb, line 52
def update_paths_linked_to_from_path(document)
  document.domain_specific_paths.each do |url|
    link_uri_path     = normalize_path Addressable::URI.parse(url.strip).path
    document_uri_path = normalize_path document.uri.path
    next if link_uri_path == document_uri_path

    store_path link_uri_path
    store_path_linked_to_from(link_uri_path, [document_uri_path])
  end
end
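The incoming-link recording above can be sketched without Redis or Addressable. This is a rough illustration under stated assumptions: record_incoming_links is a hypothetical helper, Ruby's standard URI library is swapped in for Addressable::URI, a Hash of Sets replaces store_path_linked_to_from, and normalization is reduced to mapping an empty path to the root.

```ruby
require 'uri'
require 'set'

# Hypothetical helper: records the document's path as an incoming link
# on every path the document links to, skipping links to itself.
def record_incoming_links(incoming, document_url, links)
  document_path = URI.parse(document_url).path
  document_path = '/' if document_path.empty? # assumed normalization

  links.each do |link|
    link_path = URI.parse(link.strip).path
    link_path = '/' if link_path.empty?
    next if link_path == document_path # a page is not its own incoming link

    incoming[link_path] << document_path
  end
  incoming
end

# Each linked path maps to the set of paths that link to it.
incoming = Hash.new { |hash, key| hash[key] = Set.new }
record_incoming_links(incoming, 'http://example.com/about', ['/team', ' /about ', '/'])
record_incoming_links(incoming, 'http://example.com/', ['/team'])
```

Using a Set per path means a link that appears several times in one document is recorded only once as an incoming link.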