This is part 1 in a multi-part series discussing an approach to effective deployment of analytic models at scale.
It’s 2017. Your organization has been collecting valuable data for several years. The organization you work for is somewhere on the spectrum of analytic maturity from “we just hired our first data scientist” to “we are in the credit scoring business and have been developing critical analytics for decades”.
No matter where the organization is on the analytic maturity journey, now is the time to ensure you have an analytic deployment technology stack and an organizational competency that are future-proofed for growth and that anticipate an increased reliance on analytics and data science to drive top-line organizational success. Let’s start by clarifying our topic and dispelling some myths.
Truth: Analytic Deployment is not Big Data.
Big Data is both the general idea that organizations should collect and store as much unique and valuable data as possible and the reality that storing this data results in data piles that cannot fit on a single machine or even in the memory of lots and lots of machines. When the amount of data that an organization is storing is so large that lots of disks attached to lots of computers are needed to store it, querying such a data set for analysis is generally considered a Big Data problem. These data sets require techniques such as map-reduce and technology stacks like Apache Hadoop to help process them. These technology stacks cleverly mix distributed computing with distributed data storage to allow certain kinds of numerically intensive distributed queries to be run on very large (in fact Big) data sets. Clearly no one is better at Big Data than Google.
It should be noted that most organizations have data sets that might have appeared to be Big Data piles when the data was first obtained, but a few years later the data for that use-case has not grown at an astounding rate. Furthermore, the exponential price drop in storage, memory, and CPU power might mean that the use-case is no longer a Big Data problem for the organization, except that the data is locked in a legacy data warehouse or database. In other words, if your meaningful data sets are not growing exponentially every couple of years, it is possible they are not really in the class of Big Data problems anymore. Genomic research and consumer click data for behemoth online retailers are good examples of Big Data today. Many important analytic data sets in insurance, manufacturing, and healthcare are not growing faster than available memory or storage, even as they become increasingly enriched and ripe for value-added analytics.
So what? What does all the blabbing about Big Data have to do with our topic? I wanted to highlight that Big Data, the technologies around Big Data, and the question of whether a particular business problem requires Big Data are almost totally independent of the general issues and considerations surrounding Analytic Deployment. Analytic Deployment is not Big Data.
Truth: Analytic Deployment is what happens after you design and build models on historical data.
Once your organization has targeted business use-cases, models must be designed and trained on the available historical data that your organization has captured (Big Data or not). Example use-cases include predicting actual vs. predicted paid loss for insurance claims; classifying product demand patterns such as lost, leaking, stable, or growing over various time periods in retail; predicting likelihood to give to charity in the not-for-profit world; or predicting probability of default for loans.
The mathematical “design” of analytics to target such use-cases is primarily a data science endeavor. Questions such as what “factors” to use, the applicability of analytic techniques such as building GBMs or random forests, how to balance bias and variance, and more must be considered by the data scientists/analytic designers. Many large and successful organizations, which are not Google but rather are in the insurance, credit scoring, retail, and customer engagement industries, have been building models like these for decades.
There are a number of software tools to help data science/analytic teams create valuable analytics on historical data once the approach has been designed. Such tools include but are not limited to proprietary and mature products such as SAS, SPSS, and MATLAB; open source tools such as R; and a variety of open source data science packages for Python such as NumPy, pandas, and scikit-learn. There are also “big data centric” approaches like Spark and MLlib and of course newer software companies such as H2O and Nutonian. The list of potential tools and techniques to use for analytic design seems to go on and on.
However, no matter what tools the “analytic design and creation team” uses to design, train, and test analytic models on historical data, the model is not actually useful until it is “deployed” into a business process. Analytic Deployment is what happens after you design and build models on historical data.
Truth: Analytic Deployment frameworks must be agnostic to analytic design tools and data engineering and storage solutions.
Hopefully it is obvious that a “model” or an “analytic” which was designed and trained to solve a specific business problem should be, in some sense, an asset separate from the design/training tool used to build it, and separate again from the method of storing the data used to train or score that model. It would be ideal to have an abstraction for models that gives them some independence from, and portability between, various modeling tools and the myriad of data management solutions.
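As a rough illustration of what such an abstraction might look like, here is a minimal Python sketch. It is not a real deployment framework: the names Model, FunctionModel, deploy, and toy_fraud_rule are hypothetical and exist only for this example, and the “model” is a toy rule standing in for something trained in any of the tools mentioned above. The point is only that the code playing the role of the business process depends on a small scoring contract, not on the tool that built the model or the system that stored its training data.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Mapping, Protocol


class Model(Protocol):
    """Hypothetical tool-agnostic contract: one record in, one score out."""

    def score(self, record: Mapping[str, Any]) -> float: ...


@dataclass
class FunctionModel:
    """Adapter that wraps any scoring callable, whatever tool produced it."""

    name: str
    fn: Callable[[Mapping[str, Any]], float]

    def score(self, record: Mapping[str, Any]) -> float:
        return float(self.fn(record))


def deploy(model: Model, records: Iterable[Mapping[str, Any]]) -> List[float]:
    """Stand-in for a business process: it relies only on the Model contract."""
    return [model.score(record) for record in records]


def toy_fraud_rule(record: Mapping[str, Any]) -> float:
    """Toy rule standing in for a trained fraud model (illustrative only)."""
    return 0.9 if record["amount"] > 1000 else 0.1


fraud_model = FunctionModel("toy_fraud", toy_fraud_rule)
scores = deploy(fraud_model, [{"amount": 25.0}, {"amount": 5000.0}])
print(scores)
```

Swapping in a model built with a different tool would, under these assumptions, mean writing another small adapter that satisfies the same score contract; deploy itself would not change.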
To achieve this abstraction, we need to think of models as independent from languages, systems, and platforms. Organizations should demand that truth for future scalability. Models are primarily mathematical entities that have been created by the examination of data, specifically your organization’s proprietary data. Often they are complex mathematical functions that have been custom designed to fit the data owned by the organization. When deployed, analytics are expected to provide valuable insights as new data is captured; a credit card fraud detection model, for example, assesses whether a given pending or recent transaction is likely to be fraudulent. Since they are designed and trained on your organization’s proprietary data, these models embody proprietary and often competitively differentiating value. These models are critical assets.
When they accurately enhance (or enable) a business process like improving customer retention, these models are extremely valuable. In the case of the credit scoring and insurance industries, their accuracy and applicability have a dramatic effect on an organization’s top-line performance. Therefore, the faster a model can be deployed into a business process, the more an organization can improve the Time to Value on the sunk cost of designing and training the model itself.
Furthermore, given their importance, it stands to reason that once models are designed, trained, and tested on historical data, it is critical that they can be deployed, updated, maintained, and managed through a lifecycle independent from the tools that created them, the specific data management solutions that housed the historical data at the time of creation or update, and even the data scientists who designed and built the models.
To achieve rapid deployment and model-system-modeler independence, analytic deployment frameworks must be agnostic to analytic design tools and data engineering and storage solutions. To reach this agnosticism without adding complexity and dependence, or reducing flexibility or expressiveness, the right abstractions must be adopted.
By Stu Bailey