Easy Time Series Analysis With NoSQL, Python, Pandas and Jupyter

I was really honored to speak at All Things Open (allthingsopen.org) this year. All Things Open is an absolutely amazing gathering of over 2,400 open source practitioners in Raleigh, NC (which just happens to be blessed with amazing barbecue and a really nice conference center). This year’s conference was packed with high-quality presentations, great attendees, and some awesome social events, including the conference-ending soiree at the Boxcar Bar and Arcade (theboxcarbar.com/raleigh/). I was also happily surprised at the large number of people who turned out for my presentation, “Easy Time Series Analysis With NoSQL, Python, Pandas & Jupyter”. As it turns out, I had a packed room for my talk about putting together a cheap (free!) and cheerful set of tools to do time series analysis.

Up until just recently, doing time series analysis at scale was expensive and almost exclusively the domain of large enterprises. What made time series a hard and expensive problem to tackle? Until the advent of NoSQL databases, scaling up to meet increasing data velocity and volume generally meant scaling hardware vertically by adding CPUs, memory, or additional hard drives. When combined with database licensing models that charged per processor core, the cost of scaling was simply out of reach for most.

Fortunately, the open source community is rapidly democratizing large-scale data analysis, and I am lucky enough to work at Basho, which is making contributions in this space. In my talk I introduced the audience to Basho’s open source time series database Riak TS (http://docs.basho.com/riak/ts/) and demonstrated how to use it in conjunction with three other open source tools (Python, Pandas, and Jupyter) to build a completely open source time series analysis platform in next to no time at all.

I think that Riak TS is a particularly exciting addition to the open source world of databases for a couple of reasons. To start, you would be hard-pressed to find a time series database that can scale from one to over one hundred nodes on commodity hardware with so little effort in the ops department. Riak TS automatically handles the distribution of data around your cluster of nodes, replicates your data three times to ensure high availability, and has a host of other automated features that are designed specifically to maximize uptime while making it easy to grow your cluster to meet your scaling needs.

Developing applications on top of Riak TS (whether you work with Java, Python, Ruby, Go, Node.js, PHP, .NET, or Erlang) is just as easy as installing and running the database. One of the coolest features for developers is Riak TS’s use of ANSI-compliant SQL. While SQL may not be the coolest, latest thing in the world of big data, it certainly makes Riak TS accessible to a wide range of developers and, maybe even more importantly, business and data analysts.

My talk started off with an introduction to Riak TS, a key-value database optimized to store and retrieve time series data while being able to scale to meet truly massive data sets. During the “academic” portion of the talk I covered the architecture of Riak TS, its feature set, and some of the unique things that set it apart from other time series databases currently available. I also covered some example Riak TS use cases and how the use case affects the way that you go about modeling data.

In the “practical” portion of my talk we covered the basics of getting started with Riak TS:

  • Installation – where to get Riak TS, how to install it, and how to scale it up as the size of your data problem grows;
  • How to get started interacting with Riak TS using the built-in riak-shell and, from Python, the Riak Python Client;
  • How to create a new table in Riak TS and verify that it was created;
  • And how to query Riak TS using both the riak-shell and Python.

During the practical portion of the walkthrough we also loaded over 350,000 records from the Bay Area Bike Share open data set (http://www.bayareabikeshare.com/open-data) to demonstrate how fast Riak TS is at both reading and writing data.

Having mastered the basics of using Riak TS, we moved on to the “advanced” portion of the talk, where we introduced Pandas (the Python Data Analysis Library) and Jupyter (these two open source tools should be staples of any Python programmer’s data analysis toolkit). After a brief introduction to Pandas and Jupyter, we ran through some data analysis examples that demonstrated the kind of insight we can gain using these tools and the Bay Area Bike Share data we loaded earlier. We also covered how to use Python within Jupyter to:

  • Query Riak TS;
  • Convert a Riak TS result set into a Pandas DataFrame;
  • Use some of the built-in data analysis features of Pandas;
  • And finally, create data visualizations with the matplotlib library.

If you are feeling particularly motivated to start analyzing time series data, you can grab all of my example code (which is open source, of course) from the following repository on GitHub: https://github.com/cvitter/ATO2016.

Note: An early version of this blog post appeared on opensource.com before All Things Open: https://opensource.com/life/16/9/time-series-analysis-riak-ts.

NoSQL Riak TS Gets JDBC Driver Inspired by SQL

When Basho’s engineering team released Riak TS 1.0 back in December 2015, one of the features that I found most exciting was its use of standard SQL. I know that there aren’t a lot of people who get excited by SQL in this era of NoSQL databases, but SQL isn’t dead just yet. In the 30+ years that SQL has been in use, it has found its way into the vast majority of databases and reporting tools used by enterprises. Essentially, SQL has become the lingua franca of data analysis, and by making SQL the query language of Riak TS, Basho made the database accessible to a wider range of potential users.

As cool as that is, as a developer, I realized that the use of SQL also made it possible to build a JDBC (Java Database Connectivity) driver for Riak TS. If you aren’t already familiar with the JDBC API, it provides Java applications with standardized methods to connect to, query, and update data in any database (almost exclusively relational databases) that provides a JDBC driver. As an official part of the Java platform since 1997, JDBC has been widely adopted by developers. If you use a reporting tool like those available from Cognos, MicroStrategy, BusinessObjects, or Jaspersoft, then you can connect to any data source that provides a JDBC driver.
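To make the “standardized methods” part concrete, here is a minimal sketch of the standard java.sql calls an application would use against any JDBC driver. It is not taken from the Riak TS driver: the class name JdbcSketch, its two helper methods, and the jdbc:example:// URL used below are all made up for illustration, and a real connection URL would have to come from the driver’s own documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; only the java.sql types are real API.
final class JdbcSketch {

    // Runs a query on an open connection and copies every row into a
    // column-label -> value map, so callers never touch the ResultSet.
    static List<Map<String, Object>> query(Connection conn, String sql)
            throws SQLException {
        List<Map<String, Object>> rows = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                // JDBC column indexes start at 1, not 0.
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.put(meta.getColumnLabel(i), rs.getObject(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }

    // Tries to open a connection for the URL and reports what happened.
    // DriverManager picks whichever registered driver accepts the URL.
    static String describeConnection(String url) {
        try (Connection conn = DriverManager.getConnection(url)) {
            return "connected via " + conn.getMetaData().getDriverName();
        } catch (SQLException e) {
            return "not connected: " + e.getMessage();
        }
    }
}
```

With a driver on the classpath, JdbcSketch.query(conn, sql) would return one map per row. Without one, JdbcSketch.describeConnection("jdbc:example://localhost/demo") simply reports the SQLException raised by DriverManager, which is a quick way to check whether a driver has been picked up.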

Once I realized how important a JDBC driver would be for Riak TS, I was compelled to write one. When I started down the path of writing a JDBC driver for Riak TS, my goal was simply to use it as a learning opportunity; I wasn’t really convinced that I would have the time or ability to produce something that would be generally useful. As I worked on the driver, the learning exercise became a viable project, so I’ve decided to open source it and share my work with the community:

https://github.com/basho-labs/Riak-TS-JDBC-Driver

There are two main reasons why you would want to use the JDBC Driver:

  1. You are a Java application developer familiar with the JDBC API and want to integrate Riak TS into an application;
  2. You use reporting tools like BusinessObjects, Cognos, or Jaspersoft that allow you to connect to databases using JDBC drivers.

If either of the preceding uses applies to you, check out the ReadMe at https://github.com/basho-labs/Riak-TS-JDBC-Driver/tree/master/riakts.jdbc.driver for full details on the driver’s capabilities and how to get started using it. And of course, if you do use the driver, please leave feedback, submit issues, or submit pull requests.

Presentation: Visualizing MongoDB Objects in Concept and Practice

This afternoon I gave a presentation titled Visualizing MongoDB Objects in Concept and Practice at MongoDB Washington DC. The slides for the presentation can be found in PDF format here:
Visualizing MongoDB Objects in Concept and Practice Slides

And the example code is available online via my GitHub repository here:
https://github.com/cvitter/ikanow.mongodc2013.presentation

MongoDC 2013

I am pretty excited to announce that I will be speaking at this year’s MongoDC on March 11th. This will be my third year attending MongoDC and my second year as a speaker. My presentation this time is tentatively titled Visualizing MongoDB Objects in Concept and Practice and will cover open source tools and technologies for visualizing data using JSON and JavaScript. Hopefully it will be both entertaining and educational.

More information on MongoDC 2013 can be found here: http://www.10gen.com/events/mongodc-2013

 

De/serializing MongoDB IDs and Dates with GSON

I recently ran into a need to serialize and deserialize MongoDB ObjectIds and dates due to the manner in which the application I am working on is using Google’s GSON library to convert data retrieved from MongoDB into POJOs.

If you rely on the built-in type adapters that come with the GSON library for serialization, the library will convert ObjectIds from their JSON representation of {"$oid" : "4c2209f9f3924d31102bd84a"} into a plain old string (i.e. "4c2209f9f3924d31102bd84a") when what you probably want is to serialize the value as a BSON ObjectId. The GSON library also does a poor job of serializing MongoDB’s "yyyy-MM-dd'T'HH:mm:sss'Z'" date format. Fortunately, this behavior can be overridden through the use of custom serializers and deserializers. Unfortunately, I could not find any good examples of how to write custom serialization code for MongoDB online, so I spent a good deal of time figuring it out through trial and error (and some help from my boss).

Below is a sample of how to serialize and deserialize the ObjectId:

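// Serialize an ObjectId in MongoDB's extended JSON form: {"$oid": "<hex string>"}.
@Override
public JsonElement serialize(ObjectId id, Type typeOfT,
   JsonSerializationContext context)
{
   JsonObject jo = new JsonObject();
   jo.addProperty("$oid", id.toStringMongod());
   return jo;
}

// Rebuild the ObjectId from the "$oid" property. Any failure (missing
// property, malformed hex string) is swallowed and null is returned.
@Override
public ObjectId deserialize(JsonElement json, Type typeOfT,
   JsonDeserializationContext context) throws JsonParseException
{
   try {
      return new ObjectId(json.getAsJsonObject()
          .get("$oid").getAsString());
   }
   catch (Exception e) { return null; }
}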
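
The snippet above covers only the ObjectId half of the problem. For dates, a rough sketch of the string conversion that a custom Gson JsonSerializer<Date> and JsonDeserializer<Date> could delegate to is shown below. This is not the code from my adapter: the class name MongoDateFormat is hypothetical, and it assumes second-precision UTC timestamps using the pattern yyyy-MM-dd'T'HH:mm:ss'Z', so adjust the pattern to match the date strings in your own data.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper; only uses the standard library.
final class MongoDateFormat {

    // Assumed pattern: second precision with a literal trailing 'Z' for UTC.
    private static final String PATTERN = "yyyy-MM-dd'T'HH:mm:ss'Z'";

    // SimpleDateFormat is not thread-safe, so build a fresh one per call.
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ROOT);
        // The 'Z' is only a literal, so the time zone must be pinned to UTC.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        return format;
    }

    // Date -> string, for use on the serializing side.
    static String format(Date date) {
        return newFormat().format(date);
    }

    // String -> Date, for use on the deserializing side. Bad input raises
    // an unchecked exception instead of silently returning null.
    static Date parse(String text) {
        try {
            return newFormat().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a UTC timestamp: " + text, e);
        }
    }
}
```

Pinning the time zone matters here: without it, SimpleDateFormat would format and parse in the JVM’s default zone while still printing the literal Z, which quietly shifts every timestamp.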

Note: The full source of the GsonTypeAdapter class can be found here: GsonTypeAdapter.txt. Please note that this code handles both MongoDB ObjectIds and dates, but it has not been optimized yet. Use it at your own risk and feel free to leave comments/critiques attached to this post.

Note 2: I wrote this code as part of my day job at IKANOW where we are doing some very cool things in the knowledge discovery and analysis space.

MetroMinder Update

MetroMinder is a web-based application that I started as part of my initial efforts to learn PHP. The current iteration of MetroMinder has the following features:

  • Lists all Metro stations by line, based on the user’s selection (Blue, Green, Orange, Red, Yellow);
  • Provides real-time arrival estimates for trains at user-selectable stations;
  • Provides a link to Google Maps location information for each station;
  • Lists Metrorail service alerts.

The application gets its data from the Metro Transparent Data Sets API (using the RESTful services) and the Metrorail Service Disruptions RSS feed.

Today I finished making some tweaks to the layout to optimize the display on my Motorola Droid (the target platform is smartphones like the Droid). The one area of the design that I am not happy about is the graphic I created for the Stations function (see screenshot below).

MetroMinder Screenshot

If you have the time to check out MetroMinder, I would appreciate any feedback you might have, or offers of assistance in the form of a better station graphic.