Question: What is data mining?
Data mining is the semi-automatic extraction of patterns, changes,
associations, anomalies, and other statistically significant
structures from large data sets.
Question: Why is data mining important?
There is more and more digital data being collected, processed,
managed and archived every day. Algorithms, software tools, and systems
to mine it are critical to a wide variety of problems in business,
science, national defense, engineering, and health care.
Question: What are some commercial success stories in data
mining?
Data mining has been applied successfully in a number of
different fields, including:
a) for detecting credit card fraud by HNC, which was recently
acquired by FICO;
b) in credit card acquisition and risk management by American
Express;
and c) for product recommendations by Amazon.
Question: What are the historical roots of data mining?
From a business perspective, data mining's roots are in direct
marketing and financial services which have used statistical modeling
for at least the past two decades. From a technical perspective, data
mining is beginning to emerge as a separate discipline with roots in a)
statistics, b) machine learning, c) databases, and d) high performance
computing.
Q. What are some of the different techniques
used in data mining?
There are several different types of data mining, including:
- Predictive models. These types of models predict how
likely an event is. Usually, the higher a score, the more
likely the event is. For example, how likely a credit card
transaction is to be fraudulent, or how likely an airline
passenger is to be a terrorist, or how likely a company
is to go bankrupt.
- Summary models. These models summarize data. For
example, a cluster model can be used to divide credit
card transactions or airline passengers into different
groups depending upon their characteristics.
- Network models. These types of models uncover certain
structures in data represented by nodes and links.
For example, a credit card fraud ring may surreptitiously
collect credit card numbers at a pawn shop and then use
them for online computer purchases. Here the nodes
are consumers and merchants and the links
are credit card transactions. Similarly a network model
for a terrorist cell might use nodes representing individuals
and links representing meetings.
- Association models. Sometimes certain events occur
frequently together. For example, purchases of certain items,
such as beer and pretzels, or a sequence of events associated
with component failure. Association models are used to find and
characterize these co-occurrences.
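As a minimal sketch of the counting that underlies association models,
the Python below computes how often each pair of items appears in the
same purchase. The basket contents and the function name pair_support
are made up for illustration; production systems use more scalable
algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Count each unordered pair at most once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

# Hypothetical purchase data, for illustration only.
baskets = [
    ["beer", "pretzels", "chips"],
    ["beer", "pretzels"],
    ["milk", "bread"],
    ["beer", "chips"],
]
support = pair_support(baskets)
```

Pairs with high support, such as beer and pretzels here, are the
candidate co-occurrences an association model would go on to
characterize.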
Q. What are the major steps in data mining?
- Data cleaning. The first step is to clean and prepare the
data for data mining and statistical modeling. This is usually
the most challenging step.
- Data mart. The next step is to create a data mart
containing the cleaned and prepared data.
- Derived attributes. It is rare for a model to be built
using only the attributes present in the cleaned data; rather,
additional attributes called derived attributes are usually defined.
As a simple example, a stock in the S&P 500 has a price and
earnings associated with it, but the ratio of the price to the
earnings is more important for many applications than either
attribute considered by itself. The construction of the data
attributes and derived attributes from the raw data is sometimes called shaping
the data. Standards, such as the Data Extraction and Transformation
Markup Language (DXML), are beginning to emerge for defining the
common data shaping operations needed in data mining.
- Modeling. Once the data is prepared and the data mart is
created, one or more statistical or data mining models are
built. Today, statistical and data mining models can be described in
an application and platform independent XML interchange format called
the Predictive Model Markup Language or PMML.
- Post-processing. It is common to normalize the outputs
of data mining models and to apply business rules to the inputs and
the outputs of the models. This is to ensure that the scores and
other outputs of the models are consistent with the overall business
processes the models are supporting.
- Deployment. Once a statistical or data mining model has
been produced by the steps above, the next phase is to deploy
the model in operational systems. Deployment usually consists of
three different activities. First, data is scored using the
statistical or data mining model, either on a periodic basis (daily,
weekly, or monthly) or on a real-time or event-driven
basis. Second, these scores are deployed into operational systems and
also used as the basis for various reports. Third, on a periodic
basis, say monthly, a new model is built and compared to the existing
model. If required, the old model is replaced by the new model.
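Two of the steps above, defining a derived attribute and
post-processing a model output, can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
record layout, the ticker XYZ, and the function names are hypothetical,
and the 1 to 1000 score range is just one common convention.

```python
def add_price_earnings(record):
    """Derived attribute: price divided by earnings (None if undefined)."""
    shaped = dict(record)  # copy so the cleaned record is left untouched
    earnings = record["earnings"]
    shaped["pe_ratio"] = record["price"] / earnings if earnings > 0 else None
    return shaped

def normalize_score(raw, lo, hi):
    """Post-processing: map a raw model output in [lo, hi] onto 1..1000."""
    clipped = min(max(raw, lo), hi)  # out-of-range outputs are clipped
    return round(1 + 999 * (clipped - lo) / (hi - lo))

# Hypothetical cleaned record for a single stock.
stock = {"ticker": "XYZ", "price": 50.0, "earnings": 2.5}
shaped = add_price_earnings(stock)
```

Keeping shaping and post-processing as small separate functions makes
it easier to apply exactly the same operations when the model is built
and when it is later used for scoring.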
Q. What are the differences between predictive models,
business rules, and score cards?
Predictive models use historical data to predict future
events, for example the likelihood that a credit card transaction
is fraudulent or that an airline passenger will commit a
terrorist act. Business rules ensure that business processes
follow agreed upon procedures. For example, business procedures
may dictate that a predictive model can use only the first three
digits of a zip code, not all five digits. Score cards check
certain conditions, and if these conditions are
met, points are added to an overall score. For example, a score
card for a credit card fraud model might add 28 points if
a $1 transaction occurs at a gas station. The higher the score,
the more likely the credit card transaction is fraudulent. The
best practice is to use both rules and scores. Rules ensure that
business processes are being followed and predictive models
ensure that historical data is being used most effectively.
Score cards are typically used for very basic systems that
use just a few simple rules, or for historical reasons. For
example, the credit scoring industry has used score cards for many
years, although these score cards use statistical models to
determine the conditions and corresponding scores.
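A score card of the kind described above can be sketched as a list of
(condition, points) pairs. In the Python below only the first rule
comes from the example in the text; the second rule, the field names,
and the function name score_card are hypothetical.

```python
def score_card(transaction, rules):
    """Sum the points of every rule whose condition the transaction meets."""
    return sum(points for condition, points in rules if condition(transaction))

# Each rule pairs a condition with the points it adds to the score.
RULES = [
    # From the text: a $1 transaction at a gas station adds 28 points.
    (lambda t: t["amount"] == 1.00 and t["merchant_type"] == "gas station", 28),
    # Hypothetical second rule, for illustration only.
    (lambda t: t["amount"] > 5000, 40),
]

suspicious = {"amount": 1.00, "merchant_type": "gas station"}
ordinary = {"amount": 42.17, "merchant_type": "grocery"}
```

Calling score_card(suspicious, RULES) adds the 28 points from the first
rule, while the ordinary transaction meets no condition and scores 0.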
Q. What determines the accuracy of predictive models?
The accuracy of a predictive model is influenced most strongly
by the quality of the data and the freshness of the model.
Without good data, it is simply wishful thinking to expect a
good model. Without updating the model frequently, the model's
performance will decay over time.
Accuracy is measured in two basic ways. Models have false
positive rates and false negative rates. For example, consider a
model predicting credit card fraud. A false positive means that
the model predicted fraud when no fraud was present. A false
negative means that the model predicted that the transaction was
ok when in fact it was fraudulent. In practice, false positive
and false negative rates can be relatively high. The role of a
good model is to improve a business process by a significant
degree, not to make flawless predictions. Only journalists and
pundits make flawless predictions.
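To make the two rates concrete, here is a small Python sketch that
computes them from predicted and actual labels. It assumes the usual
definitions: the false positive rate is the fraction of legitimate
transactions flagged as fraud, and the false negative rate is the
fraction of fraudulent transactions that were missed. The sample labels
are invented.

```python
def error_rates(predicted, actual):
    """Return (false_positive_rate, false_negative_rate).

    predicted and actual are equal-length sequences of booleans,
    where True means "fraud".
    """
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    negatives = sum(not a for a in actual)    # truly legitimate cases
    positives = sum(bool(a) for a in actual)  # truly fraudulent cases
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

# Hypothetical model output and ground truth for six transactions.
predicted = [True, False, True, False, False, True]
actual    = [True, False, False, False, True, True]
fpr, fnr = error_rates(predicted, actual)
```

Here one of the three legitimate transactions is flagged and one of the
three fraudulent transactions is missed, so both rates are one third.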
Best practice uses separate, specialized software applications for
building models (the model producer) and for scoring models (the model
consumer). The Predictive Model Markup Language or PMML is the
industry standard for describing a model in XML so that it can
be moved easily between a model producer and a model consumer.
Good accuracy requires fresh models on fresh data, which means
updating the model consumer as frequently as the data demands.
Q. What are the major types of predictive models?
Although there are quite a large number of different types of
predictive models, the majority of applications use one of the
following types of models.
- Linear models. For many years, especially before
the advent of personal computers, these were the most common
types of models due to their simplicity. They divide data into
two different cells using a line in two dimensions and a plane
in higher dimensions. Quadratic models are similar but
use a curve instead of a line to divide the data.
- Logistic models. Logistic models are used when
the predicted variable is zero or one, for example predicting
that a credit card transaction is fraudulent or not. Logistic
models assume that one of the internal components of the model
is linear. Computing the weights that characterize
a logistic model is difficult by hand, but simple with a computer.
- Neural Networks. Neural networks are a type of nonlinear
model broadly motivated ("inspired by" is the phrase Hollywood uses)
by neurons in brains.
- Trees. Trees are a type of nonlinear model which uses
a series of lines or planes to divide the data into different
cells. Trees consist of a sequence of if...then.. rules.
Because of this, it is easier to interpret trees than other types
of nonlinear models such as neural networks.
- Hybrid Models. It is common to combine two or more
of the four models above to produce a more powerful model.
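Two of the model types above are easy to sketch in Python. The first
function shows a logistic model: a linear combination of the inputs
passed through the logistic function. The second shows a tree written
out as if...then rules. The weights, thresholds, and field names are
hypothetical and are not fitted to any data.

```python
import math

def logistic_score(features, weights, bias):
    """Logistic model: squash a linear combination of inputs into (0, 1)."""
    linear = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-linear))

def tree_label(transaction):
    """A hand-written two-level tree expressed as if...then rules."""
    if transaction["amount"] <= 1.00:
        if transaction["merchant_type"] == "gas station":
            return "high"
        return "medium"
    return "low"
```

In practice the weights of the logistic model and the splits of the
tree are computed from historical data by statistical software rather
than written by hand.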
Q. What is the difference between a linear and nonlinear
model?
A model can be thought of as a function, which takes inputs,
performs a computation, and produces an output. The output is
often a score, say from 1 to 1000, or a label, such as high,
medium, or low. A very simple type of model, called a linear
model, uses the n input features to split the space of features
into two parts. This is done using an (n-1)-dimensional hyperplane.
For example, a space of 2 features is split by a line, a space of 3
features by a plane, etc. Most data is not so simple. Any model which
is not linear is called a nonlinear model. Logistic models, tree
based models and neural networks are common examples of nonlinear
models.
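As a rough sketch, a linear model with n features can be written as a
weighted sum compared against zero; the sign says on which side of the
dividing line or plane a point falls. The weights and bias below are
hypothetical and chosen only to make the boundary easy to picture.

```python
def linear_side(features, weights, bias):
    """Return 1 or 0 according to which side of the boundary a point lies."""
    value = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if value >= 0 else 0

# With two features (x, y) these hypothetical weights give the line x + y = 1.
weights, bias = [1.0, 1.0], -1.0
```

With these values the point (2, 2) falls on one side of the line and
the point (0, 0) on the other. A nonlinear model is needed when no
single line or plane separates the data well.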
Q. What are some of the differences between the
various types of predictive models?
First, there is no one best model. Different data requires different
types of models. The accuracy of a model depends more on the quality of
the data, how well it is prepared, and how fresh the model is than on
the type of model used. On the other hand, there are some important
differences between different types of models. Nonlinear models are
generally more accurate than linear models. Linear models were more
common in the past because they were easier to compute. Today this is
no longer relevant given the proliferation of computers and good quality
statistical and data mining software. Neural networks were very popular
in the 80's and early 90's because they were quite successful for
several different types of applications and because they had a cool
name. Today, they are being replaced by tree-based methods, which are
generally considered easier to build, easier to interpret, and more
scalable.
Q. I hear the phrase "empirically derived and statistically valid"
applied to models. What does that mean?
Decisions based upon models derived from data are usually
expected to be empirically derived and statistically valid. That
is, first, they must be derived from the data itself, and not the
biases of the person building the model. Second, they must be
based upon generally acceptable statistical procedures. For
example, the arbitrary exclusion of data can result in models
that are biased in some fashion.
Q. What are some of the major components in a data mining
system?
Assume that the function of the data mining system is to assign
scores to various profiles. For example, profiles may be maintained
about companies and the scores used to indicate the likelihood that the
company will go bankrupt. Alternatively, the profiles may be maintained
for customer accounts and the scores indicate the likelihood that the
account is being used fraudulently. A typical data mining system
processes raw transactional data, consisting of what are called events,
to produce the profiles. To continue the examples above, the events may
consist of survey data about the companies, or purchases by the
customer.
First, a data mart is used to store the event and profile
data used to build the predictive models. For large data sets, the
data mart must be designed for efficiently computing statistics on
columns, rather than for simple counting and summaries, like a
conventional data warehouse, or for safe updating of rows, like a
conventional database.
Second, a data mining system takes data
from the data mart and applies statistical or data mining
algorithms to produce a model. More precisely, the data mining system
takes a learning set of profiles and produces a statistical model.
Third, an operational data store or operational database is used to
store profiles. A profile is a statistical summary of the entity being
modeled and typically contains dozens to hundreds of features. A
relational database is generally used for the operational data
store.
Fourth, the scoring software takes a model
produced by the data mining system and a profile from the operational
data store and produces one or more scores. These scores can either
be used to produce reports or deployed into operational systems.
Fifth, the reports generated are generally made available through a
reporting system.
For smaller applications, a database can be used for the data mart
and operational data store, and the reports can be produced in HTML and
made available through a web server.
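The flow from events to profiles to scores can be sketched in Python as
follows. This is a toy illustration, not a description of any
particular system: the event layout, the profile features, and the
stand-in model are all hypothetical.

```python
from collections import defaultdict

def build_profiles(events):
    """Summarize raw events into one profile (feature dict) per account."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0, "max": 0.0})
    for account, amount in events:
        p = totals[account]
        p["count"] += 1
        p["total"] += amount
        p["max"] = max(p["max"], amount)
    return {
        account: {"count": p["count"],
                  "mean": p["total"] / p["count"],
                  "max": p["max"]}
        for account, p in totals.items()
    }

def score_profile(profile, model):
    """Scoring step: apply a model (here just a callable) to a profile."""
    return model(profile)

# Hypothetical events (account id, purchase amount) and a toy model.
events = [("A", 10.0), ("A", 30.0), ("B", 500.0)]
toy_model = lambda profile: 1 if profile["max"] > 100 else 0
profiles = build_profiles(events)
scores = {acct: score_profile(p, toy_model) for acct, p in profiles.items()}
```

In a real system the profiles would live in the operational data store,
the model would come from the data mining system, and the scores would
feed reports or operational systems.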
Q. Who is the author of this FAQ?
This FAQ is maintained by Robert L. Grossman.
Copyright Robert L. Grossman, 1999-2005,
revised January 2, 2005.