Histograms and High Level Languages at StrangeLoop

This year’s StrangeLoop conference is less than a week away and I’m psyched. This meeting with an odd name lies at the intersection of an odd blend of topics, including distributed systems, languages, and data science. It would be a natural place for me to talk about PFA, which covers all three, but instead I decided to talk about something new: a language of histogram aggregation called Histo·grammar.

Histo·grammar arose from trying to reconcile two conflicting philosophies of how to aggregate data. Histograms are the bread and butter of my first field of study, high energy physics, and high energy physics software views histograms as objects to be filled, like lists in LISP or dictionaries in Python. Non-physics analysis software views histograms as the statistical abstractions they technically are: approximations of a dataset's distribution. Physics code scales without bound because histograms can forever accumulate data in place, but that mutable state is cumbersome in a functional paradigm like Apache Spark. Non-physics histogram APIs, meanwhile, are restrictive in how they let you add or access the aggregated data. The key to getting the best of both is to keep the idea of a histogram as a container, but make it a functional container that knows how to fill itself.
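To make the idea concrete, here is a minimal sketch in plain Python (the class and method names are illustrative, not the Histo·grammar API): a histogram that fills itself in place, physics-style, but also supports combining partial results, which is what makes it usable in a map-reduce setting like Spark.

```python
# Conceptual sketch: a histogram as a container that knows how to fill itself
# and how to combine with other partial aggregations. Not the Histo·grammar API.
class Bin:
    def __init__(self, num, low, high, quantity):
        self.num, self.low, self.high = num, low, high
        self.quantity = quantity          # rule for extracting a value from a datum
        self.values = [0.0] * num         # in-place accumulators, physics-style

    def fill(self, datum, weight=1.0):
        x = self.quantity(datum)
        if self.low <= x < self.high:
            i = int(self.num * (x - self.low) / (self.high - self.low))
            self.values[i] += weight

    def __add__(self, other):
        # combining partial aggregations is what makes this functional/scalable
        combined = Bin(self.num, self.low, self.high, self.quantity)
        combined.values = [a + b for a, b in zip(self.values, other.values)]
        return combined

# fill two partial histograms independently (e.g. on two workers), then merge
h1 = Bin(4, 0.0, 4.0, lambda d: d["x"])
h2 = Bin(4, 0.0, 4.0, lambda d: d["x"])
for v in [0.5, 1.5, 1.7]:
    h1.fill({"x": v})
for v in [2.5, 3.5]:
    h2.fill({"x": v})
total = h1 + h2
print(total.values)  # [1.0, 2.0, 1.0, 1.0]
```

Because `+` is associative, the partial histograms can be merged in any order, which is exactly the shape a reducer needs.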

To non-physicists, my focus on histograms might seem narrow: after all, isn't a histogram just one type of chart? According to the statistician's definition, yes, but the ways physicists have used (abused?) histogram-filling software over the past forty years have led to much, much more. Histo·grammar makes this generality explicit by splitting the histogram into its constituent atoms: composable primitives of data aggregation that can be used to build a statistician's histogram and many other aggregate structures.
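A rough sketch of what "composable primitives" means (again with hypothetical names, not the Histo·grammar API): a binning primitive does not have to hold counts; it can hold any other primitive, so the same few atoms build both an ordinary histogram and, say, a profile plot of per-bin averages.

```python
# Conceptual sketch of composable aggregation primitives. Each primitive knows
# how to fill itself, and a binning primitive holds one sub-aggregator per bin.
class Count:
    def __init__(self):
        self.entries = 0.0
    def fill(self, datum, weight=1.0):
        self.entries += weight

class Average:
    def __init__(self, quantity):
        self.quantity = quantity
        self.entries = 0.0
        self.mean = 0.0
    def fill(self, datum, weight=1.0):
        self.entries += weight
        self.mean += (self.quantity(datum) - self.mean) * weight / self.entries

class Bin:
    def __init__(self, num, low, high, quantity, value):
        self.num, self.low, self.high, self.quantity = num, low, high, quantity
        self.values = [value() for _ in range(num)]   # any primitive per bin
    def fill(self, datum, weight=1.0):
        x = self.quantity(datum)
        if self.low <= x < self.high:
            i = int(self.num * (x - self.low) / (self.high - self.low))
            self.values[i].fill(datum, weight)

# a statistician's histogram: bins of counts
hist = Bin(4, 0.0, 4.0, lambda d: d["x"], Count)
# a "profile plot": bins of averages of a second variable
prof = Bin(4, 0.0, 4.0, lambda d: d["x"], lambda: Average(lambda d: d["y"]))
for d in [{"x": 0.5, "y": 10.0}, {"x": 0.6, "y": 20.0}, {"x": 2.5, "y": 5.0}]:
    hist.fill(d)
    prof.fill(d)
print([c.entries for c in hist.values])   # [2.0, 0.0, 1.0, 0.0]
print([a.mean for a in prof.values])      # [15.0, 0.0, 5.0, 0.0]
```

Swapping `Count` for `Average` (or nesting another `Bin`) is the same kind of substitution throughout, which is what makes many aggregate structures fall out of a few atoms.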

As datasets get larger in all fields, having a way to summarize them with complex aggregations will be increasingly important. I’ll show how the same declarative language can slice and dice data in HDFS, can be JIT-compiled for blazing speed, and can even be parallelized across vector devices like GPUs.

Around the time I was developing PFA, someone asked me if it was a big transition from particle physics to data science. I said no, because particle physics is the most industrial field in academia and data science is the most academic field in industry. Conferences like StrangeLoop prove this point, in that philosophical musings on some esoteric language can be followed by the next big software stack. If you’ll be there, I’m the guy with the long, scraggly beard (non-unique identifier?) and would love to hear your latest great idea.

An overview of my talk can be found here.

Written by Jim Pivarski

7/20 Meet-Up: Model Deployment with Bob Grossman


Last Wednesday, Open Data Group had the opportunity to co-host a data science meet-up with DataScope, which manages the Data Science Chicago Meet-Up. We thoroughly enjoyed the experience and appreciate all the folks who came out for discussion and pizza. Bob Grossman, Open Data’s founder and Chief Data Scientist, introduced the concept of AnalyticOps (read CTO Stu Bailey’s posts on the same topic here) and the emerging core competency of deploying models. Bob was joined by Robert Nendorf from Allstate, who shared his views on a similar topic: DevOps for Data Science.

AnalyticOps is an organizational function that fills the gap often found between modelers and developers. Oftentimes, these two groups use different specialized languages and services, which leads to significant effort and delay when moving models from the modeling environment to the deployment environment. AnalyticOps acts as a catalyst to deploy models and then monitor and update them.

At the heart of AnalyticOps is an Analytic Engine: a component that is integrated once into products or enterprise IT and then runs the new and updated analytic models that are deployed to it in operational workflows. Although integrated into a system only once, the engine lets applications update models as fast as it can read a model file.
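The "integrate once, update by reading a model file" idea can be sketched in a few lines. Everything here is a hypothetical illustration (the class name, the JSON model-file format, and the linear model are assumptions, not any particular product's API); a real engine would accept a portable model format such as PFA.

```python
# Hypothetical sketch of an analytic engine's core contract: the application
# integrates the engine once, and models are swapped by loading new files.
import json

class AnalyticEngine:
    def __init__(self):
        self.model = None

    def load(self, path):
        # a "model file" here is just JSON coefficients for a linear model
        with open(path) as f:
            self.model = json.load(f)

    def score(self, features):
        # score = dot(coefficients, features) + intercept
        return sum(c * x for c, x in zip(self.model["coef"], features)) \
            + self.model["intercept"]

engine = AnalyticEngine()

with open("model_v1.json", "w") as f:
    json.dump({"coef": [2.0, -1.0], "intercept": 0.5}, f)
engine.load("model_v1.json")       # deploy version 1
print(engine.score([1.0, 1.0]))    # 1.5

with open("model_v2.json", "w") as f:
    json.dump({"coef": [1.0, 1.0], "intercept": 0.0}, f)
engine.load("model_v2.json")       # update the model; no application redeploy
print(engine.score([1.0, 1.0]))    # 2.0
```

The application code around `score` never changes between model versions, which is the point: model updates become a data operation, not a software release.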

Ultimately, AnalyticOps is implemented within an organization to create and maintain a culture where building, validating, deploying, and running analytic models happen in a rapid, repeatable, and reliable system.

To learn more about AnalyticOps and how it assists the growing role of deploying models within a business, feel free to look through Bob's slide deck from the night. Of course, if you have any questions, we'd be more than happy to discuss the topic with you; contact us at info@opendatagroup.com.

Find Bob Grossman's slides here.

Genomic Data Commons led by Robert Grossman Launches

(Photo credit: Robert Kozloff) 

The GDC, a platform aimed at giving researchers unprecedented access to cancer research data, launched June 6th as part of Vice President Joe Biden's Cancer Moonshot Initiative.

Vice President Joe Biden visited and toured the GDC operations center at the University of Chicago on June 6th in preparation for the public launch of the Genomic Data Commons. The GDC is a platform aimed at giving researchers unprecedented access to cancer research data for analysis and sharing. It is part of Biden's work with the National Cancer Institute on the Cancer Moonshot Initiative, which aims to accelerate cancer research while making therapies and early detection more readily available.

The GDC's principal investigator is our founder, Bob Grossman, who is also the Director of the University of Chicago's Center for Data Intensive Science. Bob has worked to help centralize data from previous National Cancer Institute programs into one data set under the GDC. The GDC data set, which is over 4 petabytes in size, includes clinical and genomic information from thousands of patients and continues to grow as more patient information is catalogued. Integrating clinical and genomic data lets researchers view cancer screening and imaging, information on the molecular profiles of tumors, and treatment response in one location. All of this creates a direct line for the analysis of cancer research, detection, and treatment response, and should eventually enable better treatment and detection for patients.

Open Data Group is excited to congratulate our founder and chief data scientist, Robert Grossman, on his ongoing dedication and work to help create a comprehensive network of knowledge for the prevention and treatment of cancer!