Histograms and High Level Languages at StrangeLoop

September 14, 2016

This year’s StrangeLoop conference is less than a week away and I’m psyched. This meeting with an odd name lies at the intersection of an odd blend of topics, including distributed systems, languages, and data science. It would be a natural place for me to talk about PFA, which covers all three, but instead I decided to talk about something new: a language of histogram aggregation called Histo·grammar.

Histo·grammar arose from trying to fit together two conflicting philosophies of how to aggregate data. Histograms are the bread and butter of my first field of study, high energy physics, and high energy physics software views histograms as objects to be filled, like lists in LISP or dictionaries in Python. Non-physics analysis software views histograms as the statistical abstractions they technically are: approximations of a dataset’s distribution. Physics code is infinitely scalable because histograms can forever accumulate data in place, but it is cumbersome to use in a functional paradigm like Apache Spark. Non-physics histogram APIs are restrictive in how they let you add or access the aggregated data. The key to getting the best of both is to keep the idea of a histogram as a container, but make it a functional container that knows how to fill itself.
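To make the "container that knows how to fill itself" idea concrete, here is a minimal sketch in plain Python. The `Histogram` class, its constructor arguments, and the `make` helper are hypothetical names invented for illustration; this is not Histo·grammar's actual API. The point it tries to show is that if the container carries its own fill rule (a function from a datum to the quantity being binned) and supports merging with `+`, it fits a fold-then-combine pattern of the kind functional frameworks expect.

```python
from functools import reduce

class Histogram:
    """Hypothetical fillable container (illustration only, not the real library)."""

    def __init__(self, num, low, high, quantity):
        self.num, self.low, self.high = num, low, high
        self.quantity = quantity        # fill rule: extracts the value to bin from a datum
        self.counts = [0] * num

    def fill(self, datum):
        # Accumulate in place, as physics-style histograms do.
        x = self.quantity(datum)
        if self.low <= x < self.high:   # values outside the range are dropped in this sketch
            index = int((x - self.low) / (self.high - self.low) * self.num)
            self.counts[index] += 1
        return self                     # returning self lets fill() act as a fold step

    def __add__(self, other):
        # Merging two partial results yields a new container and leaves both inputs alone.
        merged = Histogram(self.num, self.low, self.high, self.quantity)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

def make():
    # Each partition starts from its own empty container with the same fill rule.
    return Histogram(4, 0.0, 4.0, lambda d: d["x"])

# Two "partitions" of a dataset, standing in for data spread across workers.
partitions = [
    [{"x": 0.5}, {"x": 1.5}],
    [{"x": 1.2}, {"x": 3.9}, {"x": 7.0}],
]

# Fold each partition into a partial histogram, then combine the partials.
partials = [reduce(lambda h, d: h.fill(d), part, make()) for part in partitions]
total = reduce(lambda a, b: a + b, partials)
print(total.counts)
```

Because the fill step and the merge step are both ordinary functions of the container, the same two pieces could in principle be handed to a distributed fold; the sketch only simulates that with nested lists.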

To non-physicists, my focus on histograms might seem narrow: after all, isn’t a histogram just one type of chart? According to the statistician’s definition, yes, but the ways physicists have used (abused?) histogram-filling software over the past forty years have led to much, much more. Histo·grammar makes this generality explicit by splitting the histogram into its constituent atoms: composable primitives of data aggregation that can be used to build a statistician’s histogram and many other aggregate structures.
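As a rough illustration of what "composable primitives" could look like, the sketch below builds two different aggregates from the same binning primitive. The class names `Count`, `Average`, and `Bin`, and the `make_value` parameter, are assumptions made up for this example and should not be read as the library's real interface. Binning with a counter in each bin gives a statistician's histogram; binning with a running average in each bin gives what physicists call a profile plot.

```python
class Count:
    """Hypothetical primitive: counts the data it is given."""
    def __init__(self):
        self.entries = 0

    def fill(self, datum):
        self.entries += 1

class Average:
    """Hypothetical primitive: running mean of some quantity."""
    def __init__(self, quantity):
        self.quantity = quantity
        self.entries = 0
        self.mean = 0.0

    def fill(self, datum):
        self.entries += 1
        # Incremental mean update, so no list of values is stored.
        self.mean += (self.quantity(datum) - self.mean) / self.entries

class Bin:
    """Hypothetical primitive: routes each datum to a sub-aggregator by bin."""
    def __init__(self, num, low, high, quantity, make_value=Count):
        self.num, self.low, self.high = num, low, high
        self.quantity = quantity
        # One independent sub-aggregator per bin; make_value is any zero-argument factory.
        self.values = [make_value() for _ in range(num)]

    def fill(self, datum):
        x = self.quantity(datum)
        if self.low <= x < self.high:   # out-of-range data are ignored in this sketch
            index = int((x - self.low) / (self.high - self.low) * self.num)
            self.values[index].fill(datum)

data = [{"x": 0.5, "y": 2.0}, {"x": 0.7, "y": 4.0}, {"x": 1.5, "y": 10.0}]

# Bin + Count: an ordinary histogram of x.
histogram = Bin(2, 0.0, 2.0, lambda d: d["x"])

# Bin + Average: the mean of y in each bin of x (a profile plot).
profile = Bin(2, 0.0, 2.0, lambda d: d["x"], lambda: Average(lambda d: d["y"]))

for datum in data:
    histogram.fill(datum)
    profile.fill(datum)

print([v.entries for v in histogram.values])
print([v.mean for v in profile.values])
```

The only thing that changes between the two aggregates is which primitive sits inside each bin, which is the sense in which the histogram has been split into reusable parts.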

As datasets get larger in all fields, having a way to summarize them with complex aggregations will be increasingly important. I’ll show how the same declarative language can slice and dice data in HDFS, can be JIT-compiled for blazing speed, and can even be parallelized across vector devices like GPUs.

Around the time I was developing PFA, someone asked me if it was a big transition from particle physics to data science. I said no, because particle physics is the most industrial field in academia and data science is the most academic field in industry. Conferences like StrangeLoop prove this point, in that philosophical musings on some esoteric language can be followed by the next big software stack. If you’ll be there, I’m the guy with the long, scraggly beard (non-unique identifier?) and would love to hear your latest great idea.

A link to an overview of my talk can be found here.

Written by Jim Pivarski

Tagged: data science, Jim Pivarski, PFA