Monthly Archives: November 2016

Easy Time Series Analysis With NoSQL, Python, Pandas and Jupyter

I was really honored to speak at All Things Open (allthingsopen.org) this year. All Things Open is an absolutely amazing gathering of over 2400 open source practitioners in Raleigh, NC (which just happens to be blessed with amazing barbecue and a really nice conference center). This year’s conference was packed with high quality presentations, great attendees, and some awesome social events including the conference ending soiree at the Boxcar Bar and Arcade (theboxcarbar.com/raleigh/). Of course I was also happily surprised at the large number of people who turned out for my presentation: “Easy Time Series Analysis With NoSQL, Python, Pandas & Jupyter”. As it turns out I had a packed room for my talk about putting together a cheap (free!) and cheerful set of tools to do time series analysis.

Up until just recently, doing time series analysis at scale was expensive and almost exclusively the domain of large enterprises. What made time series a hard/expensive problem to tackle? Until the advent of NoSQL databases, scaling up to meet increasing velocity and volumes of data generally meant scaling hardware vertically by adding CPUs, memory, or additional hard drives. When combined with database licensing models that charged per processor core the cost of scaling was simply out of reach for most.

Fortunately the open source community is democratising large scale data analysis rapidly and I am lucky enough to work at Basho which is making contributions in this space. In my talk I introduced the audience to Basho’s open source time series database Riak TS (http://docs.basho.com/riak/ts/) and demonstrated how to use it in conjunction with three other open source tools (Python, Pandas, and Jupyter) to build a completely open source time series analysis platform in next to no time at all.

I think that Riak TS is a particularly exciting addition to the open source world of databases for a couple of reasons. To start, you would be hard pressed to find a time series database that can scale from one to over one hundred nodes on commodity hardware with so little effort in the ops department. Riak TS automatically handles the distribution of data around your cluster of nodes, replicates your data three times to ensure high availability, and has a host of other automated features that are designed specifically to maximize uptime while making it easy to grow your cluster to meet your scaling needs.

Developing applications on top of Riak TS is just as easy (whether you work with Java, Python, Ruby, GO, Node.js, PHP, .Net, or Erlang) as installing and running the database. One of the coolest features for developers is Riak TS’s use of ANSI compliant SQL. While SQL may not be the coolest, latest thing in the world of big data it certainly makes Riak TS accessible to a wide range of developers and, maybe even more importantly, business/data analysts.

My talk started off with an introduction to Riak TS, a key-value database optimized to store and retrieve time series data while being able to scale to meet truly massive data sets. During the “academic” portion of the talk I covered the architecture of Riak TS, its feature set, and some of the unique things that set it apart from other time series databases currently available. I also covered some example Riak TS use cases and how that the use case affects the way that you go about modeling data.

In the “practical” portion of my talk we covered the of getting started with Riak TS:

  • Installation – where to get Riak TS, how to install it, and how to scale it up as the size of your data problem grows;
  • How to get started interacting with Riak TS using the built in riak-shell and Python using the Riak Python Client;
  • How to create a new table in Riak TS and verify that it was created;
  • And how to query Riak TS using both the riak-shell and Python;

During the practical portion of the walk through we also loaded over three hundred and fifty-thousand records from the Bay Area Bike Share open data set (http://www.bayareabikeshare.com/open-data) to demonstrate how fast Riak TS is at both reading and writing data.

Having mastered the basics of using Riak TS we moved on to the “advanced” portion of talk where we introduced the Python Data Analysis Library and Jupyter (these two open source tools should be staples of any Python programmers chest of data analysis tools). After a brief introduction to Pandas and Jupyter we ran through some data analysis examples where we demonstrated the kind of insight we can gain using the tools and the Bay Area Bike Share data we loaded earlier on. We also covered how to use Python within Jupyter to:

  • Query Riak TS;
  • Convert a Riak TS resultset into a Pandas DataFrame;
  • Demonstrate some of the built in data analysis features of Pandas;
  • And finally we used the matplotlib library to demonstrate how to create data visualizations.

if you are feeling particularly motivated to start analyzing time series data you can grab all of my example code (which is open source of course) from the following repository on GitHub: https://github.com/cvitter/ATO2016.

Note: An early version of this blog post appeared on opensource.com before All Things Open: https://opensource.com/life/16/9/time-series-analysis-riak-ts.