Question: What is data mining?
Data mining is the semi-automatic discovery of patterns, changes,
associations, anomalies, and other statistically significant
structures from large data sets.
Question: Why is data mining important?
There is more and more digital data being collected, processed,
managed and archived every day. Algorithms, software tools, and systems
to mine it are critical to a wide variety of problems in business,
science, national defense, engineering, and health care.
Question: What are some commercial success stories in data
mining?
Data mining has been applied successfully in a number of
different fields, including:
a) for detecting credit card fraud by HNC, which is now part
of FICO;
b) in credit card acquisition and risk management by American
Express;
c) for product recommendations by Amazon;
and d) for improving search and for placing online ads by Google.
Question: What are the historical roots of data mining?
From a business perspective, data mining's roots are in direct
marketing and financial services which have used statistical modeling
for at least the past two decades. From a technical perspective, data
mining has emerged as a separate discipline from various fields,
including a) statistics, b) machine learning, c) databases, and d)
high performance computing.
Q. What are some of the different techniques
used in data mining?
There are several different types of data mining, including:
- Predictive models. These types of models predict how
likely an event is. Usually, the higher a score, the more
likely the event is. For example, how likely a credit card
transaction is to be fraudulent, or how likely an airline
passenger is to be a terrorist, or how likely a company
is to go bankrupt.
- Summary models. These models summarize data. For
example, a cluster model can be used to divide credit
card transactions or airline passengers into different
groups depending upon their characteristics.
- Network models. These types of models uncover certain
structures in data represented by nodes and links. As an example, in
a network model describing Facebook friends, nodes might be
individuals and directed edges with weights might represent the
likelihood that one friend will contact another friend in the next 72
hours. As another example, a credit card fraud ring may
surreptitiously collect credit card numbers at a pawn shop and then
use them for online computer purchases. Here the nodes are credit
card accounts and merchants and the links are credit card
transactions.
- Association models. Sometimes certain events occur frequently
together. For example, purchases of certain items, such as beer and
pretzels, or a sequence of events associated with the failure of a
component in a device. Association models are used to find and
to characterize these co-occurrences.
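As a rough illustration of an association model, the sketch below counts how
often pairs of items appear together in the same purchase. The function name
pair_support and the sample baskets are assumptions made up for this example;
production association mining uses more scalable algorithms.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicate items and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical purchase data, invented for illustration only
baskets = [
    ["beer", "pretzels", "chips"],
    ["beer", "pretzels"],
    ["milk", "bread"],
    ["beer", "chips"],
]
support = pair_support(baskets)
print(support[("beer", "pretzels")])  # fraction of baskets with both items
```

Pairs with high support are candidates for the co-occurrences described above.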
Q. What are the major steps in data mining?
- Data cleaning. The first step is
to clean and to prepare the data for data mining and statistical
modeling. This is usually the most challenging step.
- Data mart. The next step is to create a data mart
containing the cleaned and prepared data.
- Derived attributes. It is rare for a model to be built
using only the attributes present in the cleaned data; rather,
additional attributes called derived attributes are usually defined.
As a simple example, a stock in the S&P 500 has a price and
earnings associated with it, but the ratio of the price divided by the
earnings is more important for many applications than either
attribute considered by itself. The construction of the derived and
data attributes from the raw data is sometimes called shaping
the data.
- Modeling. Once the data is prepared and the data mart is
created, one or more statistical or data mining models are built.
- Post-processing. It is common to normalize the outputs
of data mining models and to apply business rules to the inputs and
the outputs of the models. This is to ensure that the scores and
other outputs of the models are consistent with the overall business
processes the models are supporting.
- Deployment. Once a statistical or data mining model has
been produced by the steps above, the next phase is to deploy
the model in operational systems. Deployment usually consists of
three different activities. First, data is scored using the
statistical or data mining model, either on a periodic basis (daily,
weekly, or monthly) or on a real-time or event-driven
basis. Second, these scores are deployed into operational systems and
also used as the basis for various reports. Third, on a periodic
basis, say monthly, a new model is built and compared to the existing
model. If required, the old model is replaced by the new model.
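The derived-attribute step above can be sketched in a few lines. This is a
minimal example, assuming records are plain dictionaries; the field names
(price, earnings_per_share, pe_ratio) and the sample values are hypothetical.

```python
def add_price_to_earnings(records):
    """Return copies of the records with a derived 'pe_ratio' attribute."""
    shaped = []
    for rec in records:
        earnings = rec["earnings_per_share"]
        # The ratio is undefined for zero earnings, so store None rather than guess
        pe = rec["price"] / earnings if earnings != 0 else None
        shaped.append({**rec, "pe_ratio": pe})
    return shaped


# Hypothetical cleaned records, invented for illustration only
stocks = [
    {"ticker": "AAA", "price": 50.0, "earnings_per_share": 2.5},
    {"ticker": "BBB", "price": 30.0, "earnings_per_share": 0.0},
]
shaped = add_price_to_earnings(stocks)
for rec in shaped:
    print(rec["ticker"], rec["pe_ratio"])
```

Leaving the input records unmodified keeps the cleaned data in the data mart
separate from the shaped data used for modeling.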
Q. Are there standards for data mining?
Yes, there are several standards used in data mining. The most
widely used standard is the Predictive Model Markup Language, or PMML,
which is developed by a consortium of vendors called the Data Mining
Group. PMML can be used for describing statistical and data mining
models as well as many of the transformations required to prepare and
shape data attributes to create the inputs to the models. PMML can
be used for importing models, exporting models, and serializing models
for passing models between applications.
Q. What are the differences between predictive models,
business rules, and score cards?
Predictive models use historical data to predict future
events, for example the likelihood that a credit card transaction
is fraudulent or that an airline passenger is likely to commit a
terrorist act. Business rules ensure that business processes
follow agreed upon procedures. For example, business procedures
may dictate that a predictive model can use only the first three
digits of a zip code, not all five digits. Score cards check
certain conditions, and if these conditions are
met, points are added to an overall score. For example, a score
card for a credit card fraud model might add 28 points if
a $1 transaction occurs at a gas station. The higher the score,
the more likely the credit card transaction is fraudulent. The
best practice is to use both rules and scores. Rules ensure that
business processes are being followed and predictive models
ensure that historical data is being used most effectively.
Score cards are typically used for very basic systems which
use just a few simple rules or for historical reasons. For
example, the credit scoring industry has used score cards for many
years, although these score cards use statistical models to
determine the conditions and corresponding scores.
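A score card can be sketched as a list of (condition, points) pairs. The
first rule below is the gas station example from the answer above; the second
rule, the field names, and the sample transactions are hypothetical and exist
only to show how points accumulate.

```python
def score_transaction(txn, rules):
    """Sum the points of every score card rule whose condition the transaction meets."""
    return sum(points for condition, points in rules if condition(txn))


RULES = [
    # From the example above: a $1 transaction at a gas station adds 28 points
    (lambda t: t["amount"] == 1.00 and t["merchant_type"] == "gas_station", 28),
    # Hypothetical second rule, invented for illustration only
    (lambda t: t["amount"] > 1000.00, 15),
]

probe = {"amount": 1.00, "merchant_type": "gas_station"}
large = {"amount": 1500.00, "merchant_type": "electronics"}
routine = {"amount": 20.00, "merchant_type": "grocery"}

for txn in (probe, large, routine):
    print(txn["merchant_type"], score_transaction(txn, RULES))
```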
Q. What determines the accuracy of predictive models?
The accuracy of a predictive model is influenced most strongly
by the quality of the data and the freshness of the model.
Without good data, it is simply wishful thinking to expect a
good model. Without updating the model frequently, the model's
performance will decay over time.
Accuracy is measured in two basic ways. Models have false
positive rates and false negative rates. For example, consider a
model predicting credit card fraud. A false positive means that
the model predicted fraud when no fraud was present. A false
negative means that the model predicted that the transaction was
ok when in fact it was fraudulent. In practice, false positive
and false negative rates can be relatively high. The role of a
good model is to improve a business process by a significant
degree, not to make flawless predictions. Only journalists and
pundits make flawless predictions.
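As a small worked example, the two rates can be computed from actual and
predicted labels as follows. The definitions used here are the conventional
ones (false positives divided by all actual negatives, false negatives
divided by all actual positives); the function name and the sample labels are
made up for illustration.

```python
def error_rates(actual, predicted):
    """Return (false positive rate, false negative rate) for boolean fraud labels."""
    pairs = list(zip(actual, predicted))
    false_pos = sum(1 for a, p in pairs if not a and p)  # predicted fraud, none present
    false_neg = sum(1 for a, p in pairs if a and not p)  # predicted ok, was fraud
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    # Guard against division by zero when one class is absent
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


# Hypothetical labels: True means fraud
actual = [True, True, False, False, False, False, True, False]
predicted = [True, False, False, True, False, False, True, False]
fpr, fnr = error_rates(actual, predicted)
print(fpr, fnr)
```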
Best practice uses separate, specialized software applications for
building models (the model producer) and for scoring models (the model
consumer). The Predictive Model Markup Language or PMML is the
industry standard for describing a model in XML so that it can
be moved easily between a model producer and a model consumer.
Good accuracy requires fresh models on fresh data, which means
updating the model consumer as frequently as the data demands.
Q. What are the major types of predictive models?
Although there are quite a large number of different types of
predictive models, the majority of applications use one of the
following types of models.
- Linear models. For many years, especially before
the advent of personal computers, these were the most common
types of models due to their simplicity. They divide data into
two different cells using a line in two dimensions and a plane
in higher dimensions. Quadratic models are similar but
use a curve instead of a line to divide the data.
- Logistic models. Logistic models are used when
the predicted variable is zero or one, for example predicting
that a credit card transaction is fraudulent or not. Logistic
models assume that one of the internal components of the model
is linear. Computing the weights that characterize
a logistic model is difficult by hand, but simple with a computer.
- Trees. Trees are a type of nonlinear model that uses
a series of lines or planes to divide the data into different
cells. Trees consist of a sequence of if...then... rules.
Because of this, it is easier to interpret trees than other types
of nonlinear models such as neural networks.
- Neural Networks. Neural networks are a type of nonlinear
model broadly motivated ("inspired by" is the phrase Hollywood uses)
by neurons in brains.
- Support Vector Machines. Support vector machines use what are
called kernel functions to separate data into two classes. Using
kernel functions, a nonlinear classifier can be found by computing a
hyperplane in a higher dimensional linear space that separates the two
classes. The higher dimensional linear space is a transformation of
the original space.
- Hybrid Models. It is common to combine two or more
of the models above to produce a more powerful model.
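Two of the model types above can be sketched briefly. The first function
scores a logistic model by passing a linear combination of the inputs through
the logistic function; the second writes a tiny tree as if...then rules. The
weights, thresholds, and feature names are hypothetical; in practice they are
estimated from data by modeling software.

```python
import math


def logistic_score(features, weights, bias):
    """Score between 0 and 1: a linear component passed through the logistic function."""
    linear = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-linear))


def tree_label(amount, is_foreign):
    """A hand-written two-level tree; each branch is an if...then rule."""
    if amount > 500:
        if is_foreign:
            return "high"
        return "medium"
    return "low"


# Hypothetical weights and inputs, invented for illustration only
print(logistic_score([2.0, 1.0], [0.5, -1.0], 0.0))
print(tree_label(600, True), tree_label(600, False), tree_label(10, True))
```

The tree can be read directly as rules, which is the interpretability
advantage mentioned above.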
Q. What is the difference between a linear and nonlinear
model?
Models can be thought of as a function, which takes inputs,
performs a computation, and produces an output. The output is
often a score, say from 1 to 1000, or a label, such as high,
medium, or low. A very simple type of model, called a linear
model, uses the n input features to split the space of features
into two parts. This is done using an (n-1)-dimensional plane.
For example, 2 features can be separated with a line, 3 features
with a plane, etc. Most data is not so simple. Any model which
is not linear is called a nonlinear model. Logistic models, tree
based models and neural networks are common examples of nonlinear
models.
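A minimal sketch of a linear model with two features: the line
x + y - 1 = 0 splits the plane into two parts, and the model reports which
side a point falls on. The weights and bias are assumptions chosen for
illustration, not values fitted to any data.

```python
def linear_side(point, weights, bias):
    """Return 1 if the point lies on the positive side of the separating plane, else 0."""
    value = bias + sum(w * x for w, x in zip(weights, point))
    return 1 if value > 0 else 0


# Hypothetical separating line x + y - 1 = 0 in two dimensions
WEIGHTS = [1.0, 1.0]
BIAS = -1.0

print(linear_side([2.0, 2.0], WEIGHTS, BIAS))
print(linear_side([0.0, 0.0], WEIGHTS, BIAS))
```

With n features the same function applies unchanged; only the length of the
weight list grows.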
Q. What are some of the differences between the
various types of predictive models?
First, there is no one best model. Different data requires
different types of models. The accuracy of a model depends more on
the quality of the data, how well it is prepared, and how fresh the
model is than on the type of model used. On the other hand, there are
some important differences between different types of models.
Nonlinear models are generally more accurate than linear models.
Linear models were more common in the past because they were easier to
compute. Today this is no longer relevant given the proliferation of
computers and good quality statistical and data mining software.
Neural networks were very popular in the 80's and early 90's because
they were quite successful for several different types of applications
and because they had a cool name. Today, a variety of other methods
are also commonly used, including tree-based methods and support
vector machines. For example, tree-based methods are generally
considered easier to build, easier to interpret, and more scalable
than neural networks.
Q. I hear the phrase "empirically derived and statistically
valid" applied to models. What does that mean?
Decisions based upon models derived from data are usually
expected to be empirically derived and statistically sound. That
is, first, they must be derived from the data itself, and not the
biases of the person building the model. Second, they must be
based upon generally acceptable statistical procedures. For
example, the arbitrary exclusion of data can result in models
that are biased in some fashion.
Q. What are some of the major components in a data mining
system?
Assume that the function of the data mining system is to assign
scores to various profiles. For example, profiles may be maintained
about companies and the scores used to indicate the likelihood that the
company will go bankrupt. Alternatively, the profiles may be maintained
for customer accounts and the scores indicate the likelihood that the
account is being used fraudulently. A typical data mining system
processes raw transactional data, consisting of what are called events,
to produce the profiles. To continue the examples above, the events may
consist of survey data about the companies, or purchases by the
customer.
First, a data mart is used to store the event and profile
data which is used to build
the predictive models. For large data sets, the data mart must be
designed for efficient statistics on columns rather than simple counting
and summaries like a conventional data warehouse, or safe updating of
rows, like a conventional database.
Second, a data mining system takes data
from the data mart and applies statistical or data mining
algorithms to produce a model. More precisely, the data mining system
takes a learning set of profiles and produces a statistical model.
Third, an operational data store or operational database is used to
store profiles. A profile is a statistical summary of the entity being
modeled and typically contains dozens to hundreds of features. A
relational database is generally used for the operational data
store.
Fourth, the scoring software takes a model
produced by the data mining system and a profile from the operational
data store and produces one or more scores. These scores can either
be used to produce reports or deployed into operational systems.
Fifth, the reports generated are generally made available through a
reporting system.
For smaller applications, a database can be used for the data mart
and operational data store, and the reports can be produced in HTML and
made available through a web server.
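The flow from events to profiles to scores can be sketched as follows. This
is only an illustration under simple assumptions: profiles hold three summary
features, and score_profile is a hypothetical stand-in for a model produced
by the data mining system, not a real fraud model.

```python
from collections import defaultdict


def build_profiles(events):
    """Summarize raw events into one profile (a few statistics) per account."""
    profiles = defaultdict(lambda: {"count": 0, "total": 0.0, "max_amount": 0.0})
    for event in events:
        profile = profiles[event["account"]]
        profile["count"] += 1
        profile["total"] += event["amount"]
        profile["max_amount"] = max(profile["max_amount"], event["amount"])
    return dict(profiles)


def score_profile(profile):
    """Hypothetical stand-in for a model: largest purchase relative to the mean."""
    mean = profile["total"] / profile["count"]
    return profile["max_amount"] / mean


# Hypothetical purchase events, invented for illustration only
events = [
    {"account": "A", "amount": 10.0},
    {"account": "A", "amount": 30.0},
    {"account": "B", "amount": 100.0},
    {"account": "A", "amount": 20.0},
]
profiles = build_profiles(events)
scores = {account: score_profile(p) for account, p in profiles.items()}
print(scores)
```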
Q. Who is the author of this FAQ?
This FAQ is maintained by Robert L. Grossman.
Copyright Robert L. Grossman, 1999-2008,
revised December 31, 2008.