# Batch Processing

Processing CSV data in batches (chunks) allows you to parallelize the workload of importing the data. This comes in handy when you want to speed up the import of large files.

The `:chunk_size` option sets the maximum batch size.
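For example, a minimal sketch of parallelizing the per-chunk work with plain Ruby threads could look like this (the file path and the `import_row` helper are illustrative assumptions, not part of SmarterCSV):

    require 'smarter_csv'

    threads = []
    SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 100}) do |chunk|
      # fan each chunk out to its own thread; import_row is a hypothetical
      # per-row handler standing in for your real import logic
      threads << Thread.new(chunk) do |rows|
        rows.each { |row| import_row(row) }
      end
    end
    threads.each(&:join)   # wait for all chunks to finish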

Example 1: How SmarterCSV processes CSV files as chunks, returning arrays of hashes:

Please note how the returned array contains two sub-arrays, one per chunk that was read, each containing 2 hashes. If the number of rows is not evenly divisible by `:chunk_size`, the last chunk contains fewer hashes.

     > pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
       => [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>"2"}, {:first=>"Lucy", :last=>"Laweless", :cats=>"5"} ],
            [ {:first=>"Miles", :last=>"O'Brian", :fish=>"21"}, {:first=>"Nancy", :last=>"Homes", :dogs=>"2", :birds=>"1"} ]
          ]
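As an aside: `:key_mapping` renames the header-derived keys before the hashes are built, and mapping a key to `nil`, as Example 3 below does, drops that column from the results entirely.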

Example 2: How SmarterCSV processes CSV files as chunks, and passes arrays of hashes to a given block:

Please note how the given block receives each chunk's data as its parameter (an array of hashes), and how the `process` method returns the number of chunks when called with a block.

     > total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
         chunk.each do |h|   # you can post-process the data from each row to your heart's content, and also create virtual attributes:
           h[:full_name] = [h[:first],h[:last]].join(' ')  # create a virtual attribute
           h.delete(:first) ; h.delete(:last)              # remove two keys
         end
         puts chunk.inspect   # we could at this point pass the chunk to a Resque worker (see the sketch below)
       end

       [{:dogs=>"2", :full_name=>"Dan McAllister"}, {:cats=>"5", :full_name=>"Lucy Laweless"}]
       [{:fish=>"21", :full_name=>"Miles O'Brian"}, {:dogs=>"2", :birds=>"1", :full_name=>"Nancy Homes"}]
        => 2
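As the comment above suggests, each chunk can be handed off to a background worker instead of being processed inline. A sketch of that pattern with Resque might look like this; `PetsImportWorker` and the `Pet` model are hypothetical, and note that Resque serializes job arguments as JSON, so the hash keys arrive in the worker as strings:

    # hypothetical worker class; assumes Resque/Redis are configured elsewhere
    class PetsImportWorker
      @queue = :csv_import

      def self.perform(rows)
        # keys are strings here after the JSON round-trip through Redis
        rows.each { |row| Pet.create!(row) }
      end
    end

    SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2}) do |chunk|
      Resque.enqueue(PetsImportWorker, chunk)   # one background job per chunk
    end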

Example 3: Populate a MongoDB database in chunks of 100 records with SmarterCSV:

    # using chunks:
    filename = '/tmp/some.csv'
    options = {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
    n = SmarterCSV.process(filename, options) do |chunk|
          # we're passing a block in, to process each resulting hash / row (block takes array of hashes)
          # when chunking is enabled, there are up to :chunk_size hashes in each chunk
          MyModel.insert_all( chunk )   # insert up to 100 records at a time
    end

     # => returns the number of chunks processed
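A note on the bulk insert, depending on your ORM: with ActiveRecord (Rails 6+), `insert_all` performs a single bulk insert that skips validations and callbacks, which is usually what you want for imports; with Mongoid, the equivalent bulk write typically goes through the driver instead, e.g. `MyModel.collection.insert_many(chunk)`.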

PREVIOUS: The Basic API | NEXT: Configuration Options