A Data Mining Glossary
Robert Grossman
Open Data Partners
April 20, 2004
Aggregation. The process of combining one or more data
vectors to create a feature vector. The attributes of a feature vector
with a key k are derived in part from all the data vectors with
foreign key k. The aggregation process may be described by the Data
Transformation and Extraction Markup Language (DXML), through
transformation libraries, or related methods. As a simple example,
the total dollar volume of the credit card transactions in a
one-hour period is obtained by aggregating the transactions
for that period.
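The credit card example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (account ID as the foreign key, a timestamp, a dollar amount), the sample values, and the function name hourly_dollar_volume are all hypothetical, and the glossary itself describes aggregation through DXML or transformation libraries rather than hand-written code.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical data vectors: (foreign key = account ID, timestamp, amount in dollars).
transactions = [
    ("acct-1", datetime(2004, 4, 20, 9, 5), 25.00),
    ("acct-1", datetime(2004, 4, 20, 9, 40), 100.00),
    ("acct-2", datetime(2004, 4, 20, 9, 15), 12.50),
    ("acct-1", datetime(2004, 4, 20, 10, 2), 40.00),
]

def hourly_dollar_volume(data_vectors):
    """Sum transaction amounts for each (account, hour) pair."""
    totals = defaultdict(float)
    for account, timestamp, amount in data_vectors:
        # Truncate the timestamp to the start of its one-hour period.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[(account, hour)] += amount
    return dict(totals)

volume = hourly_dollar_volume(transactions)
```

Each entry of volume is one aggregated attribute that could be placed in the feature vector whose key is the account ID.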
Cluster (statistics). In many circumstances there is a
natural way to define the distance between two feature vectors in a
learning set. In this case, feature vectors may be grouped into
clusters with the property that every point in a cluster is closer to
the other points in the cluster than it is to points in other
clusters. There are many algorithms to group feature vectors into
clusters; each algorithm has advantages and disadvantages.
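As a rough illustration of grouping by distance, the sketch below uses Euclidean distance and assigns each feature vector to the nearest of a set of given cluster centers. Both choices are assumptions made for the example; the glossary does not prescribe a distance or an algorithm, and a full clustering algorithm would also have to choose the centers.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_nearest(vectors, centers):
    """Group vectors by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for v in vectors:
        nearest = min(range(len(centers)), key=lambda i: euclidean(v, centers[i]))
        clusters[nearest].append(v)
    return clusters

# Hypothetical two-dimensional feature vectors and cluster centers.
vectors = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.5)]
centers = [(1.0, 1.0), (8.0, 8.0)]
clusters = assign_to_nearest(vectors, centers)
```

With these sample values the first two vectors fall in one cluster and the last two in the other.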
Cluster (demographics). A group of individuals sharing some
characteristics in common. For example, a cluster can be defined by
including those individuals with an income between $40,000 and
$60,000, who own a home, live in a city, are married with
children, and have attended college.
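A demographic cluster of this kind amounts to a membership test on an individual's attributes. The sketch below encodes the example above; the field names and the sample records are hypothetical.

```python
def in_cluster(person):
    """True if the person meets every criterion of the example cluster."""
    return (
        40_000 <= person["income"] <= 60_000
        and person["owns_home"]
        and person["lives_in_city"]
        and person["married_with_children"]
        and person["attended_college"]
    )

# Hypothetical individuals.
people = [
    {"name": "A", "income": 52_000, "owns_home": True, "lives_in_city": True,
     "married_with_children": True, "attended_college": True},
    {"name": "B", "income": 75_000, "owns_home": True, "lives_in_city": True,
     "married_with_children": True, "attended_college": True},
]
members = [p["name"] for p in people if in_cluster(p)]
```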
Cluster (hardware). A cluster is a collection of
workstations with software which allows the workstations to function
as a single computer. Clusters of workstations can provide the same
processing power as supercomputers at a fraction of the cost. The
process of distributing tasks over the cluster is called load
sharing.
Data or Event Vector. Data that is used to build a predictive model.
Each data vector has a key and a foreign key. The foreign key is the
key of the associated feature vector.
Data Mining. The process of taking a learning set and
applying an algorithm to obtain one or more statistical models. More
generally, the semi-automatic extraction of patterns, changes,
associations, anomalies, and other statistically significant
structures from large data sets. Even more generally, the analysis of
data to improve decisions.
Derived Attributes. Attributes in a feature vector that are
derived from one or more attributes from data or event vectors or
collections of them. Derived attributes may be described by the Data
Transformation and Extraction Markup Language (DXML), through
transformation libraries, or related methods.
DXML. An XML language for describing how attributes
are normalized, transformed and aggregated to produce feature
vectors. Currently DXML is part of PMML, but there are proposals
for separating DXML and PMML.
Event or Data Vector. An event is a data record that is
used to create feature vectors. Data records are transformed,
normalized, and aggregated in order to create feature
vectors. Examples of events are credit card transactions
or insurance claims. Both of these can be aggregated to produce
feature vectors associated with accounts or members, respectively.
Feature Vector. The input to a predictive
model. A vector of attributes. Each feature vector has a key or ID.
Key. A unique ID for a data vector or a feature vector
(also called a profile vector).
Learning Set. A data set is usually divided
into two subsets: a learning set and a validation
set. The learning set is used to build the model.
The validation set is used to measure the efficacy of the
model.
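A minimal sketch of such a split follows. The 70/30 proportion, the random shuffle, and the function name split_data_set are assumptions made for illustration; the glossary does not specify how the data set is divided.

```python
import random

def split_data_set(records, learning_fraction=0.7, seed=0):
    """Shuffle the records and split them into (learning set, validation set)."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * learning_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical records, identified here only by an integer.
learning_set, validation_set = split_data_set(range(10))
```

Every record lands in exactly one of the two subsets, so the model is never evaluated on records that were used to build it.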
Predictive Models or Models. Predictive models
are used for scoring. Their input is a feature vector
and the output is a score. Tree-based classifiers,
logistic regression and neural networks are examples
of predictive models. Predictive models are described
in the Predictive Model Markup Language (PMML).
We usually do not distinguish between a predictive model
and a more general model such as a rule based model and
refer to both as predictive models or simply models. Both
can be expressed in PMML.
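To make the input and output concrete, the sketch below scores a feature vector with a logistic regression model. The coefficients and attribute names are hypothetical, and the model is hard-coded rather than read from a PMML file; in practice the coefficients would come from fitting the model on a learning set.

```python
import math

# Hypothetical coefficients, not fitted to any data.
INTERCEPT = -1.0
WEIGHTS = {"hourly_dollar_volume": 0.01, "transaction_count": 0.5}

def score(feature_vector):
    """Return a logistic regression score in (0, 1) for a feature vector."""
    z = INTERCEPT + sum(WEIGHTS[name] * feature_vector[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

# A feature vector with a key and two attributes.
feature_vector = {"key": "acct-1", "hourly_dollar_volume": 100.0, "transaction_count": 0}
s = score(feature_vector)
```

The key identifies the feature vector but plays no part in the score; only the attributes named in WEIGHTS are used.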
Predictive Model Markup Language (PMML). PMML is an
XML language developed by the Data Mining Group used for
describing statistical models. PMML models can be used by any
PMML-compliant product. Translators exist which translate some
proprietary formats into PMML. Currently, PMML is also used to
describe transformations and aggregations of attributes to create
derived attributes. There are proposals to create a separate
markup language called DXML to describe these types of
transformations.
Shaping. In our context, shaping can be thought of as the
process of combining one or more data vectors to create a feature
vector. The attributes of a feature vector with a key k are
derived in part from all the data vectors with foreign key k.
Shaping may be described by the Data Transformation and
Extraction Markup Language (DXML), through transformation
libraries, or related methods.
Supermodel. Any fashion model whose hourly wage is more
than 10x that of a software guru. Predictive models and fashion
models are fundamentally different and it is important not to confuse
them, which is becoming more difficult as fashion models become
spokespersons for technology companies.
Scoring and Scoring Engine. The process which takes a
profile vector and produces a score using a predictive model.
The semantics of scoring is defined by a PMML file. A scoring
engine is an application used for scoring.
Updating. The process of updating feature vectors
using data from one or more data vectors. The rules for
updating are described by a DXML file and any associated
files. The term also refers to the process of creating
new feature vectors from one or more data vectors.
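Both senses of updating can be sketched together: an incoming data vector either updates the feature vector that its foreign key points to or, if none exists yet, creates one. The update rule here (a running count and total) is hard-coded and hypothetical, standing in for rules that the glossary says would be described by a DXML file.

```python
def update_feature_vectors(feature_vectors, data_vector):
    """Apply one data vector to the feature vector named by its foreign key."""
    key = data_vector["foreign_key"]
    # Create a new feature vector the first time a key is seen.
    fv = feature_vectors.setdefault(
        key, {"key": key, "transaction_count": 0, "total_amount": 0.0}
    )
    fv["transaction_count"] += 1
    fv["total_amount"] += data_vector["amount"]
    return fv

# Hypothetical stream of data vectors.
events = [
    {"key": "t-1", "foreign_key": "acct-1", "amount": 25.0},
    {"key": "t-2", "foreign_key": "acct-2", "amount": 12.5},
    {"key": "t-3", "foreign_key": "acct-1", "amount": 100.0},
]
feature_vectors = {}
for event in events:
    update_feature_vectors(feature_vectors, event)
```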
Validation Set. A data set is usually divided into
two subsets: a learning set and a validation set. The learning
set, which is also called the training set, is used to build the
model. The validation set is used to measure the efficacy of the
model.
Copyright Robert L. Grossman, 1998-2004,
revised April 20, 2004.