ASA Datafest

ODG is happy to announce our sponsorship of and participation in two ASA Datafest events this spring. ODG will work with the Ohio State University in Columbus, OH and Loyola University in Chicago, IL to help students get the most from their weekends. Our relationship with Datafest started last year, when we both mentored students and judged at the 2016 event at Loyola University. It was a blast, and we were happy to see students really digging into the problem and finding unique solutions.

This year we will again be at Loyola University, judging and mentoring, and we will provide support at OSU as well. We are happy to have a few members of our technical staff join the students again and help them learn the art of data science and analytics. Best of luck to all the participants.

Analytic Deployment Stacks and Frameworks (Part 1): Motivation

This is part 1 in a multi-part series discussing an approach to effective deployment of analytic models at scale.  

It’s 2017. Your organization has been collecting valuable data for several years. It is somewhere on the spectrum of analytic maturity from “we just hired our first data scientist” to “we are in the credit scoring business and have been developing critical analytics for decades”.

No matter where the organization is on the analytic maturity journey, now is the time to ensure you have an analytic deployment technology stack and organizational competency that is future-proofed for growth and anticipates an increased reliance on analytics and data science to drive top-line organizational success. Let’s start by clarifying our topic and dispelling some myths.

Truth:  Analytic Deployment is not Big Data.

Big Data is both the general idea that organizations should collect and store as much unique and valuable data as possible and the reality that storing this data results in data piles that cannot fit on a single machine, or even in the combined memory of many machines. When the amount of data an organization stores is so large that many disks attached to many computers are needed to hold it, querying such a data set for analysis is generally considered a Big Data problem. These data sets require techniques such as map-reduce and technology stacks like Apache Hadoop to help process them. These technology stacks cleverly mix distributed computing with distributed data storage to allow certain kinds of numerically intensive distributed queries to be run on very large (in fact Big) data sets. Arguably no one is better at Big Data than Google.

It should be noted that most organizations have data sets that appeared to be Big Data piles when the data was first obtained, but a few years later the data for that use-case has not grown at an astounding rate. Furthermore, the exponential price drop in storage, memory, and CPU power might mean that the use-case is no longer a Big Data problem for the organization, except that the data is locked in a legacy data warehouse or database. In other words, if your meaningful data sets are not growing exponentially every couple of years, it is possible they are no longer in the class of Big Data problems. Genomic research and consumer click data for behemoth online retailers are good examples of Big Data today. Many important analytic data sets in insurance, manufacturing, and healthcare are not growing faster than available memory or storage, even as they become increasingly enriched and ripe for value-added analytics.

So what? What does all this blabbing about Big Data have to do with our topic? I wanted to highlight that Big Data, the technologies around Big Data, and the question of whether a particular business problem requires Big Data are almost totally independent of the general issues and considerations surrounding Analytic Deployment. Analytic Deployment is not Big Data.

Truth:  Analytic Deployment is what happens after you design and build models on historical data.

Once your organization has targeted business use-cases, models must be designed and trained on the available historical data that your organization has captured (Big Data or not). Example use-cases include predicting paid loss for insurance claims (actual vs. predicted); identifying product demand patterns such as lost, leaking, stable, or growing over various time periods in retail; predicting likelihood to give to charity in the not-for-profit world; or predicting probability of default for loans.

The mathematical “design” of analytics to target such use-cases is primarily a data science endeavor. Questions such as what “factors” to use, the applicability of analytic techniques such as building GBMs or random forests, how to balance bias and variance, and more must be considered by the data scientists/analytic designers. Many large and successful organizations, which are not Google but rather are in the insurance, credit scoring, retail, and customer engagement industries, have been building models like these for decades.

There are a number of software tools to help data science/analytic teams create valuable analytics on historical data once the approach has been designed. Such tools include, but are not limited to, proprietary and mature products such as SAS, SPSS, and MATLAB, and open source tools such as R and a variety of open source data science packages for Python such as NumPy, pandas, and scikit-learn. There are also “big data centric” approaches like Spark and MLlib, and of course newer software companies such as H2O and Nutonian. The list of potential tools and techniques to use for analytic design seems to go on and on.

However, no matter what tools the “analytic design and creation team” uses to design, train, and test analytic models on historic data, the model is not actually useful until it is “deployed” into a business process.  Analytic Deployment is what happens after you design and build models on historical data.

Truth:  Analytic Deployment frameworks must be agnostic to analytic design tools and data engineering and storage solutions.

Hopefully it is obvious that a “model” or an “analytic” which was designed and trained to solve a specific business problem should be, in some sense, an asset separate from the design/training tool used to build it, and separate again from the method of storing the data used to train or score that model. It would be ideal to have an abstraction for models that gives them some independence from, and portability between, various modeling tools and the myriad of data management solutions.

To achieve this abstraction, we need to think of models as independent from languages, systems, and platforms. Organizations should demand that truth for future scalability. Models are primarily mathematical entities that have been created by the examination of data, specifically your organization’s proprietary data. Often they are complex mathematical functions that have been custom designed to fit the data owned by the organization. When deployed, analytics are expected to provide valuable insights as new data is captured; a credit card fraud detection model, for example, assesses whether a given pending or recent transaction is likely to be fraudulent. Since they are designed and trained on your organization’s proprietary data, these models embody proprietary and often competitively differentiating value. These models are critical assets.

When they accurately enhance (or enable) a business process like improving customer retention, these models are extremely valuable. In the case of the credit scoring and insurance industries, their accuracy and applicability have a dramatic effect on an organization’s top-line performance. Therefore, the faster a model can be deployed into a business process, the more an organization can improve the Time to Value on the sunk cost of designing and training the model itself.

Furthermore, given their importance, it stands to reason that once models are designed, trained, and tested on historical data, it is critical that they can be deployed, updated, maintained, and managed through a lifecycle independent from the tools that created them, the specific data management solutions that housed the historical data at the time of creation or update, and even the data scientists who designed and built the models.

To achieve rapid deployment and model-system-modeler independence, analytic deployment frameworks must be agnostic to analytic design tools and data engineering and storage solutions. To reach this agnosticism without adding complexity and dependence, or reducing flexibility or expressiveness, the right abstractions must be adopted.

By Stu Bailey

Analytic Deployment Stacks and Frameworks (Part 2): Models Abstraction

This is part 2 in a series discussing an approach to effective deployment of analytic models at scale.  You can find part 1 here.  

Our first abstraction, intended to aid coordination between analytic designers and analytic deployment, is the model. As an abstract entity, a model has four main components.

  • input (factors)
  • output (scores and other data)
  • state (including initial state, usually trained or fitted to historical data)
  • the math (sometimes called the “scoring function”)

The math can be written or automatically generated in any language: Python, R, Java, etc. The math is math, which by definition is language neutral. Generally the structure of the math (for example, GBMs or random forests) is templated in a language, and the specific structures and numeric values are filled in by a “fitting tool” like R or Spark, or even by hand by the data scientist. The math is like a black box: it takes the input (sometimes called factors) and the current state and generates the output (sometimes called y-hat, score, prediction, outcome, or insight) and potentially some metadata. This output data may then be passed to another stage in an analytics chain or used to monitor and report on the process.

We call the application of the math to the input “scoring”, but that’s just technical jargon for taking a given input and a “current state” and applying the “action” of the math to produce an “output” for that particular input and state combination.
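
As a minimal sketch of that idea, scoring can be written as a function of the current state and one input. The names State and score, the linear form of the math, and the numeric values are all hypothetical, chosen only to keep the illustration short:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class State:
    """Hypothetical fitted state for a tiny linear model."""
    weights: List[float]
    bias: float


def score(state: State, datum: List[float]) -> float:
    """The math: combine one input with the current state to produce one output."""
    return sum(w * x for w, x in zip(state.weights, datum)) + state.bias


# Initial state, as if it had been fitted to historical data (values are made up).
state = State(weights=[0.5, -1.0], bias=2.0)

# Scoring: apply the math to one input and the current state.
output = score(state, [4.0, 1.0])
print(output)  # 0.5*4.0 - 1.0*1.0 + 2.0 = 3.0
```

The input, output, state, and math are each visible as a separate piece here, which is the point of the abstraction.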

So let’s look at two implementations of a trained neural net model that fit this abstraction. One is articulated in Python and one in R. First let’s look at the Python model:

# input: array-double
# output: double

# A Neural Net model
# y = f(W_2 f(W_1 x + b_1) + b_2) where
# x = input vector, y = output vector
# f is the activator function, W_1 and W_2 are the weights,
# and b_1, b_2 are the bias vectors
# In this example, the neural net computes XOR of the inputs

import numpy as np
import math

# The Initial state

def begin():
    global W_1, W_2, b_1, b_2, f
    f = np.vectorize(activator)
    W_1 = [[-6.0, -8.0], [-25.0, -30.0]]
    b_1 = [4.0, 50.0]
    W_2 = [[-12.0, 30.0]]
    b_2 = -25.0

# The math: datum has the “input” type and the output is produced with “yield”
def action(datum):
    x = np.array(datum)
    y = f(np.dot(W_2, f(np.dot(W_1, x) + b_1 ) ) + b_2)
    yield y[0] # the dot product returns a numpy array with one element

# Supporting functions
# Here we use a sigmoid logistic activator function
# but you can define your own
def activator(x):
    return 1 / (1 + math.exp(-x))
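
One way to check the XOR claim is to evaluate the same weights on all four binary inputs. The sketch below restates the weights from begin() in plain Python (without NumPy, so it runs on its own); the helper names sigmoid and predict are assumptions for this illustration, not part of the model above:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic activator, same formula as activator() in the model."""
    return 1 / (1 + math.exp(-x))


# Weights and biases copied from begin() in the model.
W_1 = [[-6.0, -8.0], [-25.0, -30.0]]
b_1 = [4.0, 50.0]
W_2 = [-12.0, 30.0]
b_2 = -25.0


def predict(x):
    """Evaluate y = f(W_2 f(W_1 x + b_1) + b_2) for a two-element input."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_1, b_1)]
    return sigmoid(sum(w * h for w, h in zip(W_2, hidden)) + b_2)


for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    # Rounding the score to 0 or 1 should reproduce XOR of the two inputs.
    print(x, round(predict(x)))
```

The raw scores are close to, but not exactly, 0 or 1, which is why the check rounds them.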

And here is the R model:

# input: array-double
# output: double

# A Neural Net model
# y = f(W_2 f(W_1 x + b_1) + b_2) where
# x = input vector, y = output vector
# f is the activator function, W_1 and W_2 are the weights,
# and b_1, b_2 are the bias vectors
# In this example, the neural net computes XOR of the inputs

# The Initial state

begin <- function(){
  W_1 <<- matrix(c(-6, -8, -25, -30), nc = 2, byrow = T)
  W_2 <<- matrix(c(-12, 30), nc = 1)
  b_1 <<- c(4, 50)
  b_2 <<- -25
}


# Supporting functions

activator <- function(x){
  y <- 1/(1+exp(-x))
  return(y)
}

# The math: datum has the “input” type and the output is produced with “emit”

action <- function(datum){
  x <- matrix(datum, nc = 1)
  y <- activator(t(W_2) %*% activator((W_1 %*% x) + b_1) + b_2)
  emit(y[[1]])
}

The input “schema” or “type system” can be described with a language-neutral system like Avro Schema:

{"type" : "array", "items" : "double"}

The output typing can be similarly described:

{"type" : "double"}
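
To make the role of these two schemas concrete, here is a deliberately strict, hand-rolled check in Python. It is only an illustration of what typed input and output mean for this model; it does not use an Avro library, and the function names are hypothetical:

```python
from typing import Any


def is_double(value: Any) -> bool:
    """Treat a Python float as a "double" (bool, int, and str are rejected)."""
    return isinstance(value, float)


def matches_input_schema(datum: Any) -> bool:
    """Check a datum against {"type": "array", "items": "double"}."""
    return isinstance(datum, list) and all(is_double(item) for item in datum)


def matches_output_schema(value: Any) -> bool:
    """Check a value against {"type": "double"}."""
    return is_double(value)


print(matches_input_schema([0.0, 1.0]))    # a well-formed input
print(matches_input_schema([0.0, "one"]))  # a string is not a double
print(matches_output_schema(0.97))         # a well-formed output
```

A real deployment would more likely hand the schema to an Avro implementation rather than re-implement the checks by hand.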

In fact, requiring typed inputs and outputs is essential to achieving data science competency at scale. You can also notice the regularity of structure and code points in both models:

  • begin for the initial state
  • action(datum), where the math is defined over datum, which conforms to the “input schema” describing an array of doubles
  • some supporting functions for code organization, in this case the activator function of the neural net

It turns out most analytic models can easily be made to conform to this structure. Moreover, doing so allows them to be executed by an agnostic engine which abstracts some of the messy DevOps and Data Engineering details away from this fairly clean math. We’ll deal with the “streams” abstraction in our next blog post.
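
To illustrate what such an engine relies on, here is a toy sketch in Python. It assumes only that a model exposes begin and action(datum) as described above; the DoublingModel class and the run function are hypothetical and stand in for a real model and a real engine:

```python
from typing import Any, Iterable, List


class DoublingModel:
    """Hypothetical model following the begin/action convention."""

    def begin(self) -> None:
        # Initial state; a real model would load fitted parameters here.
        self.factor = 2.0

    def action(self, datum: float):
        # The math: one input in, one output yielded.
        yield self.factor * datum


def run(model: Any, inputs: Iterable[Any]) -> List[Any]:
    """Toy engine: initialize the state once, then score every input."""
    model.begin()
    outputs = []
    for datum in inputs:
        # action is a generator, so collect everything it yields for this datum.
        outputs.extend(model.action(datum))
    return outputs


print(run(DoublingModel(), [1.0, 2.5]))
```

The run function never looks inside the math. It depends only on the begin/action structure, which is what lets the same driver execute very different models.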

By Stu Bailey