Working with gzipped data in S3

Recently I had to deal with a dataset of hundreds of .tar.gz files dumped in an IBM Cloud Object Storage (IBM CleverSafe) bucket accessible via an S3 API. Since I had to process only a subset of the files contained in each archive, I wanted to avoid unpacking everything (and I did not have enough storage capacity to do that anyway, as the dataset was really massive).

I came up with a few handy Python functions that allow me to inspect individual archives and extract specific files in memory, and I decided to share them in this post.

First and foremost, to access the S3 storage I use Boto – a Python interface to AWS.

Connecting to the S3 storage is fairly trivial:

import tarfile
from io import BytesIO
from itertools import islice

import boto
import boto.s3.connection
from boto.s3.key import Key

import pandas as pd
access_key = "my_access_key"
secret_key = "my_secret_key"
bucket = "bucket_name"
host = "host_name"

conn = boto.connect_s3(
        aws_access_key_id = access_key,
        aws_secret_access_key = secret_key,
        host = host,
        calling_format = boto.s3.connection.OrdinaryCallingFormat(),
        )

# Set up the credentials and endpoint for the IBM Cloud Object Storage connector
# (only needed if you also read the data through Spark; sc is the notebook's SparkContext)
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint", "https://" + host)
hconf.set("fs.s3d.service.access.key", access_key)
hconf.set("fs.s3d.service.secret.key", secret_key)

b = conn.get_bucket(bucket)

Once we have the bucket, we can confirm that we can read files from it by printing the names of, say, the first 10 files.

for key in islice(b.list(), 10):
    print(key.name)
data/logdump_2017-05-01.tsv.gz
data/logdump_2017-05-02.tsv.gz
data/logdump_2017-05-03.tsv.gz
data/logdump_2017-05-04.tsv.gz
data/logdump_2017-05-05.tsv.gz
data/logdump_2017-05-06.tsv.gz
data/logdump_2017-05-07.tsv.gz
data/logdump_2017-05-08.tsv.gz
data/logdump_2017-05-09.tsv.gz
data/logdump_2017-05-10.tsv.gz
...

Now that we have access to the bucket, we can define a simple helper function that returns a BytesIO object for a file in the bucket. This allows us to manipulate the file in an in-memory buffer.

# Gets a file object from the S3 bucket
def get_file_object(bucket, file_name):
    # Get a file from the bucket
    # and return its contents as BytesIO object
    k = Key(bucket)
    k.key = file_name
    fileobj = BytesIO(k.get_contents_as_string())
    return fileobj

We can then use the BytesIO object to read the archive and show its contents without unpacking it. This is done by a second helper function:

# Lists all files stored in an archive
def list_files_in_archive(fileobj):
    # Rewind the buffer in case it has already been read
    fileobj.seek(0)
    # Open the archive and list its members
    tarf = tarfile.open(fileobj=fileobj)
    names = tarf.getnames()
    return names

Combining the two functions allows us to peek inside an archive and see the names of the files contained within.

fileobj = get_file_object(b, "data/logdump_2017-05-01.tsv.gz")
list_files_in_archive(fileobj)
['monitoring.tsv',
 'sources.tsv',
 'events.tsv',
 'messages.tsv',
 'aux_data.tsv',
 ...
 'applied_policies.tsv']

We can easily put together a third function that receives an archive (as a BytesIO object) and the name of a file contained inside it. The function then loads the data from that file, doing all the extraction in memory.

# Reads a lookup file from an archive (fileobj)
def get_from_archive(fileobj, compressed_file):
    # Rewind the buffer in case it has already been read
    fileobj.seek(0)
    # Open the archive
    tarf = tarfile.open(fileobj=fileobj)

    # Get the file of interest
    compressed = tarf.extractfile(compressed_file)

    # Parse as TSV and return the results
    data = pd.read_csv(compressed, sep="\t")
    return data

get_from_archive(fileobj, "events.tsv")

Using combinations of these three functions, we can easily loop over all of the .tar.gz files and load the data from the .tsv files of interest on the fly, without wasting any storage space.
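
For example, a minimal sketch of such a loop might look like this (the data/ prefix and the events.tsv target are just placeholders taken from the examples above; adjust them to your own data):

# Pull one table of interest out of every archive in the bucket
# and concatenate the results into a single DataFrame
frames = []
for key in b.list(prefix="data/"):
    if not key.name.endswith(".gz"):
        continue
    archive = get_file_object(b, key.name)
    frames.append(get_from_archive(archive, "events.tsv"))

all_events = pd.concat(frames, ignore_index=True)

Only one archive is held in memory at a time, so the storage footprint stays small no matter how many archives the bucket contains.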