Data Transformation with Python, Docker, and Kubernetes

by Andrew Backes and Allen Fung

We recently created an application to transform data files for a given day, concatenate them, and upload them to an FTP server. Specifically, here’s what the application does:

  1. Downloads the individual GZ files from S3 for a given day.
  2. Unzips them.
  3. Transforms the JSON into pipe-delimited format.
  4. Concatenates all of the data for an entire day into one file.
  5. Compresses that file.
  6. Uploads it via FTP.
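
Steps 2 through 5 can be sketched with the Python standard library alone. This is a minimal illustration and not our production code: it assumes each input file holds one JSON object per line, and the field names in FIELDS are hypothetical placeholders. The S3 download and FTP upload are left out.

```python
import gzip
import json

# Hypothetical column order; the real field names are not given in this post.
FIELDS = ["timestamp", "user_id", "event"]


def to_pipe_delimited(json_line, fields=FIELDS):
    """Turn one JSON object (as text) into a pipe-delimited row."""
    record = json.loads(json_line)
    # Missing fields become empty strings so every row has the same width.
    return "|".join(str(record.get(name, "")) for name in fields)


def build_daily_file(gz_paths, output_path, fields=FIELDS):
    """Unzip each input, transform every line, and write one compressed file."""
    rows_written = 0
    with gzip.open(output_path, "wt", encoding="utf-8") as out:
        for gz_path in sorted(gz_paths):
            with gzip.open(gz_path, "rt", encoding="utf-8") as src:
                for line in src:
                    if line.strip():  # skip blank lines
                        out.write(to_pipe_delimited(line, fields) + "\n")
                        rows_written += 1
    return rows_written
```

Reading and writing line by line keeps memory use flat even when a full day of data is large. Note that this sketch does not escape pipe characters that appear inside field values.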

There are a couple of interesting things about this application. First, while we have traditionally used shell scripts for these types of applications, we chose Python this time. Python makes the application more maintainable because it makes unit testing easy. It also supports a more functional style of programming, which we are starting to use more often throughout the organization.
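
To illustrate the testing point, here is a small sketch of a pure function with a unit test written against the standard unittest module. The function to_row and its test cases are hypothetical examples, not code from our application.

```python
import unittest


def to_row(record, fields):
    """Pure function: builds a pipe-delimited row without touching any state."""
    return "|".join(str(record.get(name, "")) for name in fields)


class ToRowTest(unittest.TestCase):
    def test_fields_are_joined_in_order(self):
        self.assertEqual(to_row({"b": 2, "a": 1}, ["a", "b"]), "1|2")

    def test_missing_field_becomes_empty(self):
        self.assertEqual(to_row({"a": 1}, ["a", "b"]), "1|")
```

Saved in a file such as test_transform.py (a hypothetical name), the tests run with "python -m unittest test_transform". Because to_row depends only on its arguments, the tests need no files, network access, or setup.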

Another interesting thing about the application is that it runs in a Docker container. An important benefit of using Docker is that our development environment is exactly the same as our production environment. As a result, we didn’t run into issues such as wrong Python versions or missing Python modules during deployment. This helped us complete the initial deployment to production in a few minutes, compared with the days it would have taken in our legacy environment.

To deploy the application into a cluster of servers, we used Kubernetes. Specifically, we used the “kubectl create” command. The input to this command is a YAML file, which specifies the Docker image and the CPU and memory resources to use. While this application is designed to run in a single container, Kubernetes makes it very easy to scale to multiple containers in seconds. Here’s an example of how to do this.
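
For readers who have not seen such a file, the sketch below builds a comparable definition in Python and prints it as JSON, which “kubectl create” accepts as well as YAML. It assumes a ReplicationController, since the scale command addresses an “rc”. The image name and the CPU and memory values are hypothetical, and our actual file may differ.

```python
import json


def build_manifest(name, image, cpu, memory, replicas=1):
    """Return a ReplicationController definition as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "v1",
        "kind": "ReplicationController",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": labels,  # must match the pod template labels
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {
                                "limits": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image and resource values, for illustration only.
    manifest = build_manifest(
        "my-application",
        "registry.example.com/my-application:1.0",
        cpu="500m",
        memory="256Mi",
    )
    print(json.dumps(manifest, indent=2))
```

Redirecting the output to a file, for example my-application.json, gives something that can be passed to “kubectl create -f”.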

$ kubectl scale rc my-application --replicas=5

Because the replicas option is set to five, this command scales the application to five containers. Note that we’ve also built tooling on top of Kubernetes to help manage multiple clusters. With this tooling, we specify the location to deploy to, and the system automatically finds the correct environment variables for that location.

As you’ve seen, Python, Docker, and Kubernetes help us write cleaner code and iterate faster, which leads to better products. In the future, we hope to transition more legacy applications to this environment. Stay tuned for blog posts about this.

If you're interested in solving problems like this, we would love to have you join our team!