Event Based Data Mining Process Models
Robert L. Grossman
Open Data Partners
September, 2004
Introduction
Since about 1996, there has been broad consensus within
the data mining community about the essential steps in a data mining
process model. The CRISP-DM Process Model (CPM), for example, is a
description of this consensus approach.
During the past three to four years, several essential modifications
to this model have emerged:
- The essential difference between transactional or event data,
such as purchases, credit card transactions, or insurance claims, and
aggregated or summary data, such as a customer master file,
account-level information, or member data, has been highlighted.
The former are usually called events or transactions, while the
latter are called summaries or feature vectors.
- The importance of a) data transformations and data
aggregations and b) derived attributes for building features
has been highlighted; these are no longer treated as simply a step
in data preparation.
- Data mining and statistical standards, and standards-based
deployment environments, have become important requirements for
many applications.
- The importance of data quality has been highlighted.
- The practical importance of a data mart dedicated to data mining
has emerged.
Event Based Data Mining Process Model
The term Event-Based Process Model (EBPM) is sometimes used to
describe a data mining process model incorporating the components
above. Variants of the EBPM are being standardized by the
vendor-supported Data Mining Group. Open Data Partners has been very
active in the development of standards for event-based data mining
process models and the deployment of systems using these process
models.
Essential Steps in the Event-Based Data Mining Process Model
In this section, we list some of the main steps in an event-based
data mining process model.
Step 1. Problem Identification and Project Design.
Deliverable: statement of problem, including metric for measuring success.
- define metric for evaluating model or rules
- determine initial reporting requirements
- identify data set for project
- get data dictionary and small sample of data for sanity check
- initial selection of data mining approach & algorithms
- initial selection of deployment architecture
- select environment for data mart
Step 2. Build Data Mining Data Mart
Deliverable: data mart populated with event data. Feature vectors,
which are created later in the process, and scores are also usually
stored in the data mart.
- identify extracts
- load data mart with raw data
- clean data sufficiently to begin data exploration in the
next step
Step 3. Data Exploration and Data Quality Assessment
Deliverable: short report containing statistical overview of
data.
- exploratory analysis of data
- identify important sub-populations for different models
- data quality assessment
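
As an illustration of the data quality assessment in this step, the following is a minimal sketch, not a prescribed method. It assumes event records are available as Python dictionaries; the function name quality_report, the field names, and the sample records are all hypothetical and chosen only for the example.

```python
def quality_report(rows, fields):
    """Count missing and distinct values for each field in a list of records."""
    report = {}
    for field in fields:
        values = [row.get(field) for row in rows]
        # Treat None and the empty string as missing (an assumption of this sketch).
        present = [v for v in values if v is not None and v != ""]
        report[field] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


# Hypothetical event records, for illustration only.
events = [
    {"account": "A1", "amount": 25.0, "merchant": "grocer"},
    {"account": "A1", "amount": None, "merchant": "fuel"},
    {"account": "A2", "amount": 310.5, "merchant": ""},
    {"account": "A3", "amount": 12.0, "merchant": "grocer"},
]

report = quality_report(events, ["account", "amount", "merchant"])
for field, stats in report.items():
    print(field, stats)
```

Counts like these are one possible input to the short statistical overview that this step delivers.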
Step 4. Preparing Feature Vectors for Modeling
Deliverable: DXML or other description of process used to define
feature vector from one or more attributes in data tables.
- initial transformations of attributes
- create derived attributes or features
- do quality assurance for derived attributes
- data mart at this point contains data and feature tables
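
The step above, in which event data are aggregated into one feature vector per entity, can be sketched as follows. This is a minimal sketch under stated assumptions: events are dictionaries with hypothetical "account" and "amount" fields, and the derived attributes (count, total, mean, maximum) are examples only, not the features any particular project would use.

```python
from collections import defaultdict


def build_feature_vectors(events):
    """Aggregate event records into one feature vector per account."""
    amounts_by_account = defaultdict(list)
    for event in events:
        amounts_by_account[event["account"]].append(event["amount"])

    features = {}
    for account, amounts in amounts_by_account.items():
        total = sum(amounts)
        features[account] = {
            "n_events": len(amounts),             # aggregation
            "total_amount": total,                # aggregation
            "mean_amount": total / len(amounts),  # derived attribute
            "max_amount": max(amounts),           # derived attribute
        }
    return features


# Hypothetical event table, for illustration only.
events = [
    {"account": "A1", "amount": 20.0},
    {"account": "A1", "amount": 40.0},
    {"account": "A2", "amount": 300.0},
]

features = build_feature_vectors(events)

# Simple quality assurance check: every event is counted exactly once.
assert sum(f["n_events"] for f in features.values()) == len(events)
```

The final assertion is one simple form of quality assurance for derived attributes: the aggregated counts must reconcile with the underlying event table.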
Step 5. Prepare Data Sets for Modeling
Deliverable: short report describing how learning and validation
data sets will be prepared.
- define sub-populations to model
- define data sets for modeling and validation
- create modeling and validation data sets (flat files) by dumping data from data mart
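
One way to create the modeling and validation data sets is a deterministic split on the entity key, so that the same account always lands in the same set when the flat files are regenerated. The sketch below assumes this approach; the source does not prescribe a splitting method, and the names assign_split and to_flat_file, the 30 percent validation fraction, and the sample rows are hypothetical.

```python
import csv
import hashlib
import io


def assign_split(key, validation_fraction=0.3):
    """Map a key to 'modeling' or 'validation' using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "validation" if bucket < validation_fraction else "modeling"


def to_flat_file(rows, fieldnames):
    """Render rows as CSV text, standing in for a dump from the data mart."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical feature table, for illustration only.
feature_rows = [{"account": f"A{i}", "n_events": i % 5 + 1} for i in range(100)]

splits = {"modeling": [], "validation": []}
for row in feature_rows:
    splits[assign_split(row["account"])].append(row)

modeling_csv = to_flat_file(splits["modeling"], ["account", "n_events"])
validation_csv = to_flat_file(splits["validation"], ["account", "n_events"])
```

Hashing the key rather than drawing random numbers keeps the split reproducible without storing a seed or an assignment table, which can make the short report for this step easier to write.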
Step 6. Statistical Modeling
Deliverable: PMML file and/or rule sets describing statistical or
data mining model.
- separate feature vectors into appropriate segments for separate modeling
- create baseline model
- iteratively refine features
- create Mark 1 model and use as champion
- rank order features based upon Mark 1 model
Step 7. Model Validation
Deliverable: evaluation of the quality of the model (output: lift of
model over random using the agreed-upon metric).
- create lift table for model or other agreed upon report using validation set
- obtain agreement on how model will be evaluated
- create Mark 2 model and compare to champion
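
The lift table mentioned above can be sketched as follows. This sketch assumes a binary outcome and defines lift as the response rate within a score-ranked bin divided by the overall response rate, which is a common convention; the agreed-upon metric for a given project may differ. The function name lift_table and the scores and outcomes are hypothetical.

```python
def lift_table(scores, outcomes, n_bins=5):
    """Rank records by score (highest first) and compute lift per bin."""
    if len(scores) != len(outcomes) or len(scores) < n_bins:
        raise ValueError("need equal-length inputs with at least n_bins records")
    overall_rate = sum(outcomes) / len(outcomes)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no positive outcomes")

    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    bin_size = len(ranked) // n_bins
    table = []
    for b in range(n_bins):
        start = b * bin_size
        # The last bin absorbs any remainder.
        end = len(ranked) if b == n_bins - 1 else start + bin_size
        chunk = ranked[start:end]
        rate = sum(outcome for _, outcome in chunk) / len(chunk)
        table.append({"bin": b + 1, "n": len(chunk),
                      "rate": rate, "lift": rate / overall_rate})
    return table


# Hypothetical validation-set scores and outcomes, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

table = lift_table(scores, outcomes)
for row in table:
    print(row)
```

The same function can be applied to the champion and to a challenger such as the Mark 2 model on the same validation set, so that the two are compared bin by bin on identical data.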
Step 8. Deployment
Deliverable: PMML file, rule set, or other agreed upon deployment
mechanism for operational system.
- design operational deployment environment
- set up deployment test environment
Step 9. Reporting, Monitoring, and Quality Assessment
Deliverable: regular, periodic report on quality of model and its
effectiveness.
- study effectiveness and quality of model as it is deployed and used
- model impact in terms of relevant business metric(s)
- understand decay of model over time
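
Monitoring the decay of a model over time can be as simple as tracking the agreed-upon metric per reporting period and flagging periods that fall too far below the value measured at validation. The sketch below assumes that approach; the function name monitor_decay, the 80 percent tolerance, and the monthly figures are hypothetical and are not results from any deployed model.

```python
def monitor_decay(period_metrics, baseline, tolerance=0.8):
    """Return the periods whose metric is below tolerance * baseline."""
    threshold = tolerance * baseline
    return [period for period, value in period_metrics if value < threshold]


# Hypothetical top-bin lift per month, for illustration only.
monthly_lift = [
    ("2004-06", 2.4),
    ("2004-07", 2.3),
    ("2004-08", 1.9),
    ("2004-09", 1.7),
]

# Baseline is the lift measured at validation time (assumed here to be 2.5).
alerts = monitor_decay(monthly_lift, baseline=2.5)
print(alerts)
```

A list of flagged periods like this could feed the regular report for this step and serve as one trigger for rebuilding the model.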