Ruby – Uploading files greater than 5GB to Amazon S3

November 15, 2013

I was struggling to find a way to upload files larger than 5GB to S3 with the Amazon aws-s3 gem (5GB is the ceiling for a single PUT, so anything bigger has to go through S3's multipart upload API), and then I came across this great post by Gavin.

I’m adding the code here in case the site eventually goes down. All credit to Gavin!

#!/usr/bin/env ruby
#
# Testing multipart uploads into s3 with threads
# Tested with Ruby 1.8 and 1.9
 
# This is proof of concept code, it works, but is not suitable for production, and may even have nasty bugs in the
# threading section
 
# Refs:
# http://docs.amazonwebservices.com/AmazonS3/latest/API/index.html?mpUploadInitiate.html
# http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?using-query-api.html <-- Query API auth
#
 
require 'rubygems'
require 'fog'
require 'digest/md5'
require 'base64'
require 'fileutils'
 
 
 
# Credentials
key = 'AAAA'
secret = 'BBBB'
bucket = 'some-bucket'
region = 'eu-west-1'
 
# Setup connection
stor = Fog::Storage.new(
  :provider => 'AWS',
  :aws_access_key_id => key,
  :aws_secret_access_key => secret,
  :region => region
)
 
 
# Don't want to get caught out with any time errors
stor.sync_clock
 
 
# Take a test file and split it up, remove the initial / to use the filename and path as the key
#object_to_upload = '/tmp/linux-2.6.38.2.tar.bz2'
object_to_upload = '/tmp/ubuntu-10.04.2-server-amd64.iso'
object_key = object_to_upload[1..-1]
 
# Area to place the split file into
workdir = "/tmp/work/#{File.basename(object_to_upload)}/"
FileUtils.mkdir_p(workdir)
 
# Split the file into chunks, the chunks are 000, 001, etc
#`split -C 10M -a 3 -d #{object_to_upload} #{workdir}`
`split -C 100M -a 3 -d #{object_to_upload} #{workdir}`
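# Not from the original post: -C packs at most 100MB of "lines" into each chunk, which
# works fine for a binary image; `split -b 100M` splits on exact byte boundaries and is
# arguably the clearer choice here, but either produces chunks no larger than 100MB.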
 
 
# Map of the file_part => md5
parts = {}
 
# Get the Base64 encoded MD5 of each file
Dir.entries(workdir).each do |file|
  next if file =~ /\.\./
  next if file =~ /\./
 
  full_path = "#{workdir}#{file}"

  # Base64-encoded MD5 of this part; S3 uses it to verify the part on upload
  md5 = Base64.encode64(Digest::MD5.file(full_path).digest).chomp

  parts[full_path] = md5
end
 
 
### Now ready to perform the actual upload
 
# Initiate the upload and get the uploadid
multi_part_up = stor.initiate_multipart_upload(bucket, object_key, { 'x-amz-acl' => 'private' } )
upload_id = multi_part_up.body["UploadId"]
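# Not from the original post: S3 requires every part except the last to be at least
# 5MB, caps an upload at 10,000 parts, and returns an ETag for each uploaded part that
# has to be supplied, in part-number order, when the upload is completed further down.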
 
# Lists for the threads and tags
tags = []
threads = []
 
 
sorted_parts = parts.sort_by do |d|
  d[0].split('/').last.to_i
end
 
sorted_parts.each_with_index do |entry, idx|
  # Part numbers need to start at 1
  part_number = idx + 1
 
  # Reload to stop the connection timing out, useful when uploading large chunks
  stor.reload
 
  # Create a new thread for each part we are wanting to upload.
  threads << Thread.new(entry) do |e|
    print "DEBUG: Starting on File: #{e[0]} with MD5: #{e[1]} - this is part #{part_number} \n"
 
    # Pass fog a file object to upload
    File.open(e[0]) do |file_part|
 
      # part_number and entry are fresh locals on each iteration, and the entry is handed
      # to the thread explicitly via Thread.new, so each thread works on its own copy of them.
      part_upload = stor.upload_part(bucket, object_key, upload_id, part_number, file_part, { 'Content-MD5' => e[1] } )
 
      # You need to make sure the tags array has the tags in the correct order, else the upload won't complete
      tags[idx] = part_upload.headers["ETag"]
 
      print "#{part_upload.inspect} \n" # This will return when the part has uploaded
    end
  end
end
 
# Make sure all of our threads have finished before we continue.
# Thread#join re-raises any exception that killed the thread.
threads.each do |t|
  begin
    t.join
  rescue StandardError => e
    puts "Part upload failed: #{e.message}"
  end
end
 
# Might want a stor.reload here...
 
completed_upload = stor.complete_multipart_upload(bucket, object_key, upload_id, tags)
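# Not from the original post: if the upload fails part-way, S3 keeps (and charges for)
# the parts already uploaded until the upload is aborted or completed. Cleaning up is
# a one-liner, e.g.:
#   stor.abort_multipart_upload(bucket, object_key, upload_id)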

Basically, you shouldn't try to do multipart uploads with the aws-s3 gem as I was attempting; fog already speaks S3's multipart upload API, so use that instead!
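For what it's worth, later fog releases (and the fog-aws gem that succeeded fog's bundled AWS support) can also do the chunking and part bookkeeping for you through the model layer, via the file's multipart_chunk_size attribute. Here's a minimal sketch, assuming a fog version new enough to support that attribute; the credentials, key, bucket, and chunk size are just illustrative:

require 'fog'

stor = Fog::Storage.new(
  :provider              => 'AWS',
  :aws_access_key_id     => 'AAAA',
  :aws_secret_access_key => 'BBBB',
  :region                => 'eu-west-1'
)

# Grab the bucket and let fog drive the multipart upload itself
directory = stor.directories.get('some-bucket')

File.open('/tmp/ubuntu-10.04.2-server-amd64.iso') do |f|
  directory.files.create(
    :key                  => 'tmp/ubuntu-10.04.2-server-amd64.iso',
    :body                 => f,
    :multipart_chunk_size => 100 * 1024 * 1024,  # upload in 100MB parts
    :public               => false
  )
end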
