Open Source 101 – Building Community in Raleigh, NC

On Saturday, February 4th, I was fortunate enough to spend my day attending and speaking at Open Source 101 (http://opensource101.com/) in Raleigh, NC. Open Source 101 is a one-day conference for developers who are new to Open Source and is run by the awesome team behind All Things Open (https://allthingsopen.org/). The focus of the conference is to help attendees learn how to participate in and benefit from Open Source software. Looking around the room during the opening keynote talks, it was hard not to be impressed by the more than 500 attendees who had chosen to spend their entire Saturday at this inaugural event. I was honored to be included alongside such a highly respected group of speakers.

My presentation, cheekily entitled Create your first open source project the “right way”, was designed to provoke conversation around the minimal things that developers can (and should) do when open sourcing their code. Although I have spoken many times about different hard skills in software development, this was the first time I had tackled the softer skills related to building community around Open Source software, and I relished (and was honestly a bit terrified of) the challenge.

Having worked in and around Open Source for the last seven years, I have developed some thoughts on what it takes to create a good Open Source project (beyond simply writing solid code). For the presentation I boiled these thoughts down to the following eight ideas or recommendations:

  • Pick the right license for your project
  • Pick the right host for your project (GitHub, Bitbucket, CodePlex, etc.) that has the tools you need to build a community around your project
  • Write good, solid documentation that explains clearly the two most important things about your code: what it is and how to use it
  • Keep living documentation of your code through issues whether it be bug reports, ideas for enhancements, questions, or requests for help
  • Write unit tests for your code (striving to maximize coverage) and keep the unit tests current as you write new code
  • “Release early and release often” – even if all you are releasing is a script, package your code and documentation up, put a version number on it and release it
  • If you want people to use your code, build community around your code, or just learn something from feedback, then ask for help
  • And of course, be gracious in accepting help when it is offered

Of course, it takes more than simply following these eight guidelines to ensure a project is a “success”; however, neglecting too many of them nearly guarantees failure.

When I first got word from the conference organizers that my presentation had been accepted, I immediately set to work on filling in the blanks of the abstract I had submitted. Not long into a first draft of an outline, I started thinking about the slides and how I would convey my story visually. And then, I had a brilliant idea (one I freely admit to stealing) — I would hand draw my slides! Over the past several years I have seen a handful of presenters who used hand drawn slides and each time, I appreciated their effort and remembered it more than the standard “corporate” slide deck.

“How hard could it be to hand draw twenty slides?” I thought. It turns out that it was actually pretty hard. The biggest struggle I experienced was simply not trying to make each slide a “work of art.”

In the end, I gave my presentation to a friendly and engaged crowd. There were good questions about the material and some really nice compliments on the slides. One woman even told me that my “bug” was now the background image on her phone. Hopefully everyone left with some new ideas on how to make their future Open Source efforts more fruitful.

If you are looking for Open Source conferences to attend or speak at, I can’t recommend highly enough All Things Open and Open Source 101. Todd Lewis (https://twitter.com/toddlew) and his team do an amazing job at putting together events that have an embarrassing richness of content in a welcoming and inclusive environment.

Note: My presentation Create your first open source project the “right way” is Open Source material released under the Apache 2.0 License. You can find the full presentation on GitHub at: https://github.com/cvitter/Open-Source-101-Doing-It-Right.

 

Please reach out via twitter with any feedback.
Craig Vitter
@craigvitter

Easy Time Series Analysis With NoSQL, Python, Pandas and Jupyter

I was really honored to speak at All Things Open (allthingsopen.org) this year. All Things Open is an absolutely amazing gathering of over 2400 open source practitioners in Raleigh, NC (which just happens to be blessed with amazing barbecue and a really nice conference center). This year’s conference was packed with high quality presentations, great attendees, and some awesome social events including the conference ending soiree at the Boxcar Bar and Arcade (theboxcarbar.com/raleigh/). Of course I was also happily surprised at the large number of people who turned out for my presentation: “Easy Time Series Analysis With NoSQL, Python, Pandas & Jupyter”. As it turns out I had a packed room for my talk about putting together a cheap (free!) and cheerful set of tools to do time series analysis.

Up until just recently, doing time series analysis at scale was expensive and almost exclusively the domain of large enterprises. What made time series a hard and expensive problem to tackle? Until the advent of NoSQL databases, scaling up to meet increasing velocity and volumes of data generally meant scaling hardware vertically by adding CPUs, memory, or additional hard drives. When combined with database licensing models that charged per processor core, the cost of scaling was simply out of reach for most.

Fortunately the open source community is democratising large scale data analysis rapidly and I am lucky enough to work at Basho which is making contributions in this space. In my talk I introduced the audience to Basho’s open source time series database Riak TS (http://docs.basho.com/riak/ts/) and demonstrated how to use it in conjunction with three other open source tools (Python, Pandas, and Jupyter) to build a completely open source time series analysis platform in next to no time at all.

I think that Riak TS is a particularly exciting addition to the open source world of databases for a couple of reasons. To start, you would be hard pressed to find a time series database that can scale from one to over one hundred nodes on commodity hardware with so little effort in the ops department. Riak TS automatically handles the distribution of data around your cluster of nodes, replicates your data three times to ensure high availability, and has a host of other automated features that are designed specifically to maximize uptime while making it easy to grow your cluster to meet your scaling needs.

Developing applications on top of Riak TS is just as easy as installing and running the database, whether you work with Java, Python, Ruby, Go, Node.js, PHP, .NET, or Erlang. One of the coolest features for developers is Riak TS’s use of ANSI compliant SQL. While SQL may not be the coolest, latest thing in the world of big data, it certainly makes Riak TS accessible to a wide range of developers and, maybe even more importantly, business/data analysts.

My talk started off with an introduction to Riak TS, a key-value database optimized to store and retrieve time series data while being able to scale to meet truly massive data sets. During the “academic” portion of the talk I covered the architecture of Riak TS, its feature set, and some of the unique things that set it apart from other time series databases currently available. I also covered some example Riak TS use cases and how the use case affects the way that you go about modeling data.

In the “practical” portion of my talk we covered the basics of getting started with Riak TS:

  • Installation – where to get Riak TS, how to install it, and how to scale it up as the size of your data problem grows;
  • How to get started interacting with Riak TS using the built in riak-shell and Python using the Riak Python Client;
  • How to create a new table in Riak TS and verify that it was created;
  • And how to query Riak TS using both the riak-shell and Python.
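
The table creation and query steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the talk: the table name `BikeShare`, its columns, and the helper functions are hypothetical, and the sketch assumes a client object exposing a `ts_query(table, query)` method (as the Riak Python Client does) talking to a Riak TS version that accepts CREATE TABLE through the query interface. A small stand-in client is used so the sketch runs without a cluster.

```python
# Hypothetical table and columns, chosen for illustration only.
TABLE = "BikeShare"

CREATE_TABLE = """
CREATE TABLE BikeShare (
    city VARCHAR NOT NULL,
    station VARCHAR NOT NULL,
    time TIMESTAMP NOT NULL,
    duration SINT64,
    PRIMARY KEY ((city, station, QUANTUM(time, 1, 'd')), city, station, time)
)
"""


def range_query(table, city, station, start_ms, end_ms):
    """Build a SELECT that names every key column and bounds the time range.

    Timestamps are epoch milliseconds. Values are interpolated naively,
    which is acceptable for a sketch but not for untrusted input.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE city = '{city}' AND station = '{station}' "
        f"AND time >= {start_ms} AND time < {end_ms}"
    )


class StubClient:
    """Stand-in for a real client; it only records the queries it receives."""

    def __init__(self):
        self.queries = []

    def ts_query(self, table, query):
        # Collapse whitespace so multi-line statements are easy to inspect.
        self.queries.append((table, " ".join(query.split())))
        return []


def create_and_query(client):
    """Create the table, then query one day of data for one station."""
    client.ts_query(TABLE, CREATE_TABLE)
    query = range_query(
        TABLE, "San Francisco", "Market at 4th", 1420070400000, 1420156800000
    )
    return client.ts_query(TABLE, query)


client = StubClient()
rows = create_and_query(client)
```

With a real cluster, `client` would be a connected Riak client rather than `StubClient`, and the second call would return the matching rows.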

During the practical portion of the walkthrough we also loaded over 350,000 records from the Bay Area Bike Share open data set (http://www.bayareabikeshare.com/open-data) to demonstrate how fast Riak TS is at both reading and writing data.
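
Loading a few hundred thousand records is usually done in batches rather than one row at a time. The following is a minimal sketch of that idea, not the loader used in the talk: the CSV layout and the `store_batch` callback are hypothetical, and with a real database the callback would wrap whatever write call your client library provides.

```python
import csv
import io
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_rows(csv_file, store_batch, batch_size=100):
    """Parse CSV rows and hand them to `store_batch` in fixed-size batches.

    `store_batch` is a caller-supplied function (hypothetical here).
    Returns the number of rows handed over.
    """
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    total = 0
    for batch in batched(reader, batch_size):
        store_batch(batch)
        total += len(batch)
    return total


# Tiny in-memory sample standing in for the real data file.
sample = io.StringIO("station,time,duration\nA,1,10\nB,2,20\nC,3,30\n")
batches = []
loaded = load_rows(sample, batches.append, batch_size=2)
```

Here the callback simply collects each batch in a list. The right batch size depends on row width and cluster capacity, so treat the default of 100 as a placeholder.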

Having mastered the basics of using Riak TS, we moved on to the “advanced” portion of the talk, where we introduced Pandas (the Python Data Analysis Library) and Jupyter (these two open source tools should be staples of any Python programmer’s chest of data analysis tools). After a brief introduction to Pandas and Jupyter we ran through some data analysis examples where we demonstrated the kind of insight we can gain using the tools and the Bay Area Bike Share data we loaded earlier on. We also covered how to use Python within Jupyter to:

  • Query Riak TS;
  • Convert a Riak TS resultset into a Pandas DataFrame;
  • Demonstrate some of the built in data analysis features of Pandas;
  • And finally we used the matplotlib library to demonstrate how to create data visualizations.
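
The conversion and analysis steps above might look something like the sketch below. It is not the notebook from the talk: the column names and sample rows are hypothetical stand-ins for a query result, and it assumes timestamps arrive as epoch milliseconds.

```python
import pandas as pd

# Hypothetical column names and rows standing in for a query result.
COLUMNS = ["station", "time", "duration"]
ROWS = [
    ["Market at 4th", 1420099200000, 540],
    ["Market at 4th", 1420102800000, 300],
    ["Townsend at 7th", 1420099800000, 900],
]


def rows_to_frame(rows, columns):
    """Turn a list of result rows into a DataFrame indexed by timestamp."""
    frame = pd.DataFrame(rows, columns=columns)
    # Assumption: the time column holds epoch milliseconds.
    frame["time"] = pd.to_datetime(frame["time"], unit="ms")
    return frame.set_index("time").sort_index()


frame = rows_to_frame(ROWS, COLUMNS)

# Built-in analysis: number of trips and mean duration per station.
summary = frame.groupby("station")["duration"].agg(["count", "mean"])

# Time series view: total ride time grouped by hour of day.
hourly = frame.groupby(frame.index.hour)["duration"].sum()
```

In a Jupyter notebook with matplotlib installed, something like `hourly.plot(kind="bar")` would then render the hourly totals as a chart.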

If you are feeling particularly motivated to start analyzing time series data, you can grab all of my example code (which is open source, of course) from the following repository on GitHub: https://github.com/cvitter/ATO2016.

Note: An early version of this blog post appeared on opensource.com before All Things Open: https://opensource.com/life/16/9/time-series-analysis-riak-ts.

NoSQL Riak TS Gets JDBC Driver Inspired by SQL

When Basho’s engineering team released Riak TS 1.0 back in December 2015, one of the features that I found most exciting was its use of standard SQL. I know that there aren’t a lot of people who get excited by SQL in this era of NoSQL databases, but SQL isn’t dead just yet. In the 30+ years that SQL has been in use, it has been integrated into the vast majority of databases and reporting tools used by enterprises. Essentially, SQL has become the lingua franca of data analysis, and by making SQL the query language of Riak TS, Basho made the database accessible to a wider range of potential users.

As cool as that is, as a developer, I realized that the use of SQL also made it possible to build a JDBC (Java Database Connectivity) driver for Riak TS. If you aren’t already familiar with the JDBC API, it provides Java applications standardized methods to connect to, query, and update data in any database (almost exclusively relational databases) that provides a JDBC driver. As an official part of the Java platform since 1997, JDBC has been widely adopted by developers. If you use a reporting tool like those available from Cognos, MicroStrategy, BusinessObjects, or Jaspersoft, then you can connect to any data source that provides a JDBC driver.

Once I realized how important a JDBC driver would be for Riak TS, I was compelled to write one. When I started down the path of writing a JDBC driver for Riak TS, my goal was simply to use it as a learning opportunity; I wasn’t really convinced that I would have the time or ability to produce something that would be generally useful. As I worked on the driver, the learning exercise became a viable project, so I’ve decided to open source it and share my work with the community:

https://github.com/basho-labs/Riak-TS-JDBC-Driver

There are two main reasons why you would want to use the JDBC Driver:

  1. You are a Java application developer familiar with the JDBC API and want to integrate Riak TS into an application;
  2. You use reporting tools like BusinessObjects, Cognos, or Jaspersoft that allow you to connect to databases using JDBC drivers.

If either of the preceding uses applies to you, check out the ReadMe at https://github.com/basho-labs/Riak-TS-JDBC-Driver/tree/master/riakts.jdbc.driver for full details on the driver’s capabilities and how to get started using it. And of course, if you do use the driver, please leave feedback, submit issues, or submit pull requests.