Monitoring Large Scale Data Pipelines

ShareThis is growing rapidly!  In the last 4 years that I’ve worked here, we have grown from a one room company to a many room company.  We have also started getting all the typical fancy start-up perks: free food, fancy office, company outings, etc.

Fun things happen when we create value, but there’s a cost for engineering in terms of complexity.  We now generate terabytes of data per day in real time.  We also hope to ingest even more data in the next year or two; doubling or tripling our data flow.  It’s gratifying when I talk to colleagues to tell them what we’re doing and they respond with “wow”.

The problem is not ingestion though – tools like Kafka, Cassandra, Aerospike, BigQuery, etc, make scaling up an easier problem than it was 4 years ago.  The difficulty comes when someone asks “how do you know the data in there is right?” or “how do you know all the data is there?”

“It just is.”  –  Not good enough.

To help us answer questions like this we have doubled down on some pretty gratuitous monitoring through all of data pipelines.  Thankfully, our code is built upon a common framework so that when we add monitoring to one library it is dynamically added to all of the applications that use that library.  Pretty cool.

Perhaps a small breakdown of the tech will help those who are trying to do this in the future:

Graphite: Graphite is a “a Django-based web application that renders graphs and dashboards.”  It is built on top of two other projects, Carbon and Whisper (both projects are part of Graphite).  Carbon acts as an aggregation and cache layer which makes the UI responsive.  Whisper is a “fixed-size database, similar in design and purpose to RRD (round-robin-database). It provides fast, reliable storage of numeric data over time. Whisper allows for higher resolution (seconds per point) of recent data to degrade into lower resolutions for long-term retention of historical data.”

Codahale: Codahale’s library actually has connectors for a variety of different services (like Ganglia).  For our purposes we use the Java Graphite connectors and we’re good to go.  The metrics objects can be created dynamically and will allow you to log basic stats, histograms, and timers.  We’ve set the library to log on a per server per minute basis so that the calls are batched and traffic stays low.

Seyren:  We use a modified version of this docker project internally with our components bundled inside.  We point it at graphite and slack using system environment variables.  Once in the UI, we create checks with Graphite metrics (often using some of the well documented functionality) to alert to our Slack channel.

Hope this helps!  Happy hacking :-)