Serializing structured data into Avro using Python

It is impossible to ignore avro at work - it is our data serialization format of choice (and rightly so), whether for storing data in Kafka or in our document database, Espresso. Recently, I needed to read avro data serialized by a Java application, and I looked into how I might use Python to do that.

The avro python library uses schemas and can store data in a compact binary format, with support for both deflate and snappy compression. I wanted to test how compact the serialized data is compared to, say, CSV.

I set up a Python virtual environment using the nifty virtualenvwrapper and installed the python avro and snappy libraries. On my Ubuntu system, the python snappy library is available from the package repository, but I used the PyPI one anyway. I did have to install the C development package (libsnappy-dev) as a prerequisite, though.

$ mkvirtualenv avro

$ workon avro

(avro) $ pip install avro

# Remember that it is 'python-snappy', and not just 'snappy', which is a
# completely different library
(avro) $ pip install python-snappy

And I was ready for my code. First, I decided on a simple data set: a weather data set that is part of the pandas cookbook.

Based on this, I created this simple schema.

{
    "namespace": "net.sandipb.avro.example.weather",
    "type": "record",
    "name": "Reading",
    "doc": "Weather reading at a point in time",
    "fields": [
        {"name": "time", "type": "int", "doc": "Seconds since epoch"},
        {"name": "temp", "type": "float", "doc": "Temperature in Celsius"},
        {"name": "dew_point_temp", "type": "float", "doc": "Dew point temperature in Celsius"},
        {"name": "humidity", "type": "int", "doc": "Relative humidity %"},
        {"name": "wind_speed", "type": "float", "doc": "Wind speed in km/h"},
        {"name": "visibility", "type": "float", "doc": "Visibility in km"},
        {"name": "pressure", "type": "float", "doc": "Atmospheric pressure in kPa"},
        {"name": "weather", "type": "string", "doc": "Weather summary"}
    ]
}
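
To make the mapping concrete: each CSV row needs to become a Python dict keyed by the field names above before it can be appended to the container. The values below are purely illustrative, not taken from the actual data set.

reading = {
    "time": 1325376000,      # 2012-01-01 00:00:00 UTC, as seconds since epoch
    "temp": -1.8,            # Celsius
    "dew_point_temp": -3.9,  # Celsius
    "humidity": 86,          # percent
    "wind_speed": 4.0,       # km/h
    "visibility": 8.0,       # km
    "pressure": 101.24,      # kPa
    "weather": "Fog",
}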

Now, to actually process the file, I created a Python script that reads the CSV file and writes three avro files, each with a different codec - null (no compression), deflate, and snappy.

Here are some relevant parts of the code:

# Load the avro schema used to validate and serialize the data
schema = avro.schema.parse(open("weather.avsc").read())

# Open an avro container file for writing, providing the schema and the
# compression codec to use
writer_deflate = DataFileWriter(open("weather_data_deflate.avro", "wb"),
                                DatumWriter(), schema, codec="deflate")

# Write a record as a dict of field names to values
writer_deflate.append(row)

# Close the container
writer_deflate.close()
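
For completeness, here is a minimal sketch of how the full script could be put together. The CSV column names and the timestamp format are assumptions based on the weather_2012.csv file from the pandas cookbook, so adjust them if your copy of the data differs.

import calendar
import csv
import time

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

schema = avro.schema.parse(open("weather.avsc").read())

# One container file per codec; "null" means no compression
writers = {
    codec: DataFileWriter(open("weather_data_%s.avro" % codec, "wb"),
                          DatumWriter(), schema, codec=codec)
    for codec in ("null", "deflate", "snappy")
}

with open("weather_2012.csv") as csvfile:
    for line in csv.DictReader(csvfile):
        # Column names and date format below are assumed from the pandas
        # cookbook CSV, not taken from the original script
        row = {
            "time": calendar.timegm(
                time.strptime(line["Date/Time"], "%Y-%m-%d %H:%M:%S")),
            "temp": float(line["Temp (C)"]),
            "dew_point_temp": float(line["Dew Point Temp (C)"]),
            "humidity": int(line["Rel Hum (%)"]),
            "wind_speed": float(line["Wind Spd (km/h)"]),
            "visibility": float(line["Visibility (km)"]),
            "pressure": float(line["Stn Press (kPa)"]),
            "weather": line["Weather"],
        }
        for writer in writers.values():
            writer.append(row)

for writer in writers.values():
    writer.close()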

The result of this was … unexpected. The deflate codec compressed the data better than snappy did.

$ ls -lSh *.avro *.csv
-rw-rw-r-- 1 sandipb sandipb 492K May 20 01:34 weather_2012.csv
-rw-rw-r-- 1 sandipb sandipb 317K May 20 02:29 weather_data_null.avro
-rw-rw-r-- 1 sandipb sandipb 178K May 20 02:29 weather_data_snappy.avro
-rw-rw-r-- 1 sandipb sandipb 121K May 20 02:29 weather_data_deflate.avro
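
As a quick sanity check, the files can be read back with the same library; the codec does not have to be specified when reading, since it is recorded in the container header. A minimal sketch:

from avro.datafile import DataFileReader
from avro.io import DatumReader

# Each record comes back as a plain dict matching the schema
reader = DataFileReader(open("weather_data_snappy.avro", "rb"), DatumReader())
for record in reader:
    print(record)
reader.close()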

It is possible that a larger dataset could show better results for snappy. It also appears that snappy is designed more for speed than for compression ratio, which would explain the difference. This is the first time I have used this codec, and I need to read up a bit more about it. I will try this experiment with a larger dataset and update this post in the future.
