Modern businesses leverage analytics to gain insights in a multitude of areas, from evaluating business performance to predicting future behaviors. In many industries, these insights are quantified numerically as “scores,” and the process of applying an analytic model to transform a collection of data into scores is called “scoring.”
Often, the more data that’s available to train the model, the more accurate and valuable the model will be. In today’s world, businesses have more data than ever before. While more data typically leads to more valuable models, an increasing quantity of data can strain many traditional end-to-end model scoring processes. In this blog post, we’ll look at how to overcome this problem by building scalable, big data-ready model deployment and scoring architectures.
Before we discuss how to build scalable, big data-ready model deployment and scoring architectures, we need to understand exactly what scoring is. The best example to keep in mind is a credit score: the model takes in an individual’s demographic information and credit history and produces a number that represents how safe it is to lend them money. In different industries, scores can represent different things, but in every case the process is more complicated than “multiplying numbers and sending them off to be scored.”
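To make the idea concrete, here is a minimal sketch of scoring a single record with a logistic model. The feature names, weights, and intercept are hypothetical values chosen only for illustration; they are not taken from any real credit model.

```python
import math

# Hypothetical coefficients, for illustration only.
# A real model would learn these from training data.
WEIGHTS = {"years_of_history": 0.30, "late_payments": -0.80, "utilization": -1.50}
INTERCEPT = 0.5


def score(record):
    """Return a score in (0, 1) for one record using a logistic model."""
    z = INTERCEPT + sum(WEIGHTS[name] * record[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


# One made-up applicant record.
applicant = {"years_of_history": 6, "late_payments": 1, "utilization": 0.4}
applicant_score = score(applicant)
```

Here a higher score would be read as "safer to lend to," but that interpretation is a modeling choice rather than something the arithmetic itself guarantees.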
Now that we understand what “scoring” is, let’s go through a few of the challenges that a company often faces when scoring with big data. One of the biggest obstacles in working with big data arises in a model’s training phase. If the training dataset is too large to fit into memory, then the modeler has to get creative about how to train the model. For example, the modeler can train the model on subsets of the data and merge the results. Depending on the type of model, this may not even be possible.
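As a rough sketch of the "train on subsets and merge" idea, consider a one-variable least-squares fit. It works here because the fit depends only on a handful of running sums, which can be computed per chunk and added together; many other model types have no such mergeable summary. The `Stats` class and `chunks` helper are hypothetical names, and the in-memory list stands in for data that would really be read from disk one piece at a time.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Running sums that are sufficient to fit y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def update(self, chunk):
        # Accumulate the sums for one chunk of (x, y) pairs.
        for x, y in chunk:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y
        return self

    def merge(self, other):
        # Merging two partial results is just adding their sums.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)

    def fit(self):
        # Closed-form least-squares solution from the accumulated sums.
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


def chunks(rows, size):
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Toy data on the line y = 2x + 1, processed three rows at a time.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
merged = Stats()
for chunk in chunks(data, 3):
    merged = merged.merge(Stats().update(chunk))

slope, intercept = merged.fit()
full_slope, full_intercept = Stats().update(data).fit()
```

The chunked fit and the all-at-once fit agree because only the sums matter, not how the rows were grouped.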
Once the model has been trained and is ready for deployment, the next big data challenge arises. The modeler has to ask themselves, “What if the model needs to score a large quantity of data?”
The best-case scenario at this stage is that the model can be deployed in a “streaming” fashion. That is, mathematically, the model satisfies a couple of properties:
- The model produces scores “one record at a time”, which means that the model can be run on arbitrarily small subsets of the input data and the resulting scores are still valid, and
- Ideally, the score produced for each individual record is independent of the scores of other records in the set and of the order in which the inputs are received.
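The two properties above can be illustrated with a small sketch. The model here is a hypothetical two-feature logistic scorer with made-up weights; `score_record` and `score_stream` are illustrative names, not part of any particular scoring engine.

```python
import math

# Hypothetical model parameters, for illustration only.
WEIGHTS = (0.8, -1.2)
BIAS = 0.1


def score_record(record):
    """Score one record; the result depends on that record alone."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, record))
    return 1.0 / (1.0 + math.exp(-z))


def score_stream(records):
    """Lazily score an iterable, holding one record in memory at a time."""
    for record in records:
        yield score_record(record)


records = [(0.5, 1.0), (2.0, 0.3), (-1.0, 0.7), (0.0, 0.0)]

# Scoring everything in one pass.
batch_scores = list(score_stream(records))

# Property 1: scoring arbitrary subsets separately gives the same scores.
split_scores = list(score_stream(records[:1])) + list(score_stream(records[1:]))

# Property 2: scoring in a different order gives the same score per record.
reversed_scores = list(score_stream(reversed(records)))[::-1]
```

Because `score_stream` is a generator, the input could just as well be a file or a message queue that is too large to load at once.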
Both of these properties hold for many common types of models, such as gradient boosting machines, logistic and linear regression models, and (non-recurrent) neural nets. If a model satisfies both of these properties, then the burden of scaling the model across large input datasets falls squarely on the infrastructure and deployment engine, rather than on the model itself. Multiple instances of the model can then be run concurrently to handle high data throughputs. In this setup, the models do not have a significant memory footprint, because only small subsets of the input data are ever loaded into memory.
At Open Data Group, we understand the importance of scoring big datasets in a way that is both fast and scalable, which is why we created FastScore. FastScore supports running models in high-throughput enterprise streaming data platforms (such as Kafka), and provides push-button concurrency and scaling. The FastScore scoring engine also supports advanced state sharing and state management functionality, which allows even typically single-threaded R models to be easily run concurrently and in a streaming manner. Because FastScore is packaged as a Docker container, the engine (and your models) can be deployed anywhere: on your physical hardware, in the cloud, or any combination thereof, and scaled with demand. To learn more about how to use FastScore to overcome your big data analytic challenges, click here!