A Data Mining Glossary

Robert Grossman
Open Data Partners

April 20, 2004


Aggregation. The process of combining one or more data vectors to create a feature vector. The attributes of a feature vector with a key k are derived in part from all the data vectors with foreign key k. The aggregation process may be described by the Data Transformation and Extraction Markup Language (DXML), through transformation libraries, or related methods. As a simple example, the total dollar volume of the credit card transactions in a one-hour period is obtained by aggregating the transactions for this period.
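
The credit card example can be sketched in a few lines of Python. This is only an illustration, not DXML: the record layout (key, foreign key, timestamp, amount), the account IDs, and the function name hourly_volume are all assumptions made for the sketch.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical data vectors: (key, foreign_key, timestamp, amount).
transactions = [
    ("t1", "acct-1", datetime(2004, 4, 20, 9, 5), 25.00),
    ("t2", "acct-1", datetime(2004, 4, 20, 9, 40), 100.00),
    ("t3", "acct-1", datetime(2004, 4, 20, 10, 15), 40.00),
    ("t4", "acct-2", datetime(2004, 4, 20, 9, 30), 60.00),
]

def hourly_volume(records, start, end):
    """Sum transaction amounts per foreign key for start <= timestamp < end."""
    totals = defaultdict(float)
    for _key, foreign_key, timestamp, amount in records:
        if start <= timestamp < end:
            totals[foreign_key] += amount
    return dict(totals)

# Total dollar volume per account between 9:00 and 10:00.
volume = hourly_volume(
    transactions, datetime(2004, 4, 20, 9), datetime(2004, 4, 20, 10)
)
```

Each entry of volume would become one attribute of the feature vector whose key matches the foreign key.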

Cluster (statistics). In many circumstances there is a natural way to define the distance between two feature vectors in a learning set. In this case, feature vectors may be grouped into clusters with the property that every point in a cluster is closer to the other points in the cluster than it is to points in other clusters. There are many algorithms to group feature vectors into clusters; each algorithm has advantages and disadvantages.
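
As one illustration among the many algorithms, here is a minimal k-means sketch using Euclidean distance. The sample points and starting centroids are invented for the example, and k-means does not in general guarantee the closeness property described above; it simply assigns each point to its nearest centroid.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical two-attribute feature vectors forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The choice of distance function and of starting centroids both affect the result, which is one reason different algorithms suit different data.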

Cluster (demographics). A group of individuals sharing some characteristics in common. For example, a cluster can be defined by including those individuals who have an income between $40,000 and $60,000, own a home, live in a city, are married with children, and have attended college.

Cluster (hardware). A collection of workstations with software that allows the workstations to function as a single computer. Clusters of workstations can provide the same processing power as supercomputers at a fraction of the cost. The process of distributing tasks over the cluster is called load sharing.

Data or Event Vector. Data that is used to build a predictive model. Each data vector has a key and a foreign key. The foreign key is the key of the associated feature vector.

Data Mining. The process of taking a learning set and applying an algorithm to obtain one or more statistical models. More generally, the semi-automatic extraction of patterns, changes, associations, anomalies, and other statistically significant structures from large data sets. Even more generally, the analysis of data to improve decisions.

Derived Attributes. Attributes in a feature vector that are derived from one or more attributes from data or event vectors or collections of them. Derived attributes may be described by the Data Transformation and Extraction Markup Language (DXML), through transformation libraries, or related methods.
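
A small sketch of one kind of derived attribute: rescaling a raw attribute to the range 0 to 1. The function name min_max_normalize and the sample values are assumptions for illustration; the glossary does not prescribe any particular transformation.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw attribute, e.g. transaction amounts in dollars.
normalized = min_max_normalize([20.0, 40.0, 100.0])
```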

DXML. An XML language for describing how attributes are normalized, transformed and aggregated to produce feature vectors. Currently DXML is part of PMML, but there are proposals for separating DXML and PMML.

Event or Data Vector. An event is a data record that is used to create feature vectors. Data records are transformed, normalized, and aggregated in order to create feature vectors. Examples of events are credit card transactions or insurance claims. Both of these can be aggregated to produce feature vectors associated with accounts or members, respectively.

Feature Vector. The input to a predictive model. A vector of attributes. Each feature vector has a key or ID.

Key. A unique ID for a data vector or a feature vector (also called a profile vector).
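
The relationship between keys and foreign keys can be sketched with two record types. The class names, field names, and sample IDs below are hypothetical and chosen only to mirror the definitions in this glossary.

```python
from dataclasses import dataclass, field

@dataclass
class DataVector:
    key: str          # unique ID of this event
    foreign_key: str  # key of the associated feature vector
    attributes: dict = field(default_factory=dict)

@dataclass
class FeatureVector:
    key: str          # unique ID, for example an account number
    attributes: dict = field(default_factory=dict)

account = FeatureVector(key="acct-1")
events = [
    DataVector(key="t1", foreign_key="acct-1", attributes={"amount": 25.0}),
    DataVector(key="t2", foreign_key="acct-2", attributes={"amount": 60.0}),
]

# The data vectors that contribute to a feature vector are those
# whose foreign key equals the feature vector's key.
related = [e for e in events if e.foreign_key == account.key]
```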

Learning Set. A data set is usually divided into two subsets: a learning set and a validation set. The learning set is used to build the model. The validation set is used to measure the efficacy of the model.
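
A minimal sketch of such a split, assuming a random 70/30 division. The fraction, the fixed seed, and the function name split_data are illustrative choices, not part of the definition.

```python
import random

def split_data(records, learning_fraction=0.7, seed=0):
    """Shuffle the records and split them into (learning_set, validation_set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * learning_fraction)
    return shuffled[:cut], shuffled[cut:]

learning_set, validation_set = split_data(range(10))
```

Keeping the two subsets disjoint is the point of the exercise: the model is never evaluated on records it was built from.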

Predictive Models or Models. Predictive models are used for scoring. Their input is a feature vector and their output is a score. Tree-based classifiers, logistic regression, and neural networks are examples of predictive models. Predictive models are described in the Predictive Model Markup Language (PMML). We usually do not distinguish between a predictive model and a more general model, such as a rule-based model, and refer to both as predictive models or simply models. Both can be expressed in PMML.

Predictive Model Markup Language (PMML). PMML is an XML language developed by the Data Mining Group for describing statistical models. PMML models can be used by any PMML-compliant product. Translators exist which translate some proprietary formats into PMML. Currently, PMML is also used to describe transformations and aggregations of attributes to create derived attributes. There are proposals to create a separate markup language called DXML to describe these types of transformations.

Shaping. In our context, it can be thought of as the process of combining one or more data vectors to create a feature vector. The attributes of a feature vector with a key k are derived in part from all the data vectors with foreign key k. Shaping may be described by the Data Transformation and Extraction Markup Language (DXML), through transformation libraries, or related methods.

Supermodel. Any fashion model whose hourly wage is more than 10x that of a software guru. Predictive models and fashion models are fundamentally different and it is important not to confuse them, which is becoming more difficult as fashion models become spokespersons for technology companies.

Scoring and Scoring Engine. Scoring is the process which takes a feature vector (also called a profile vector) and produces a score using a predictive model. The semantics of scoring are defined by a PMML file. A scoring engine is an application used for scoring.
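
To make the idea concrete, here is a toy scoring function for a logistic regression model. The MODEL dictionary, its coefficients, and the attribute names are made up for illustration; they stand in for parameters that a real scoring engine would read from a PMML file, and no PMML parsing is shown.

```python
import math

# Hypothetical model parameters (not taken from any real PMML file).
MODEL = {
    "intercept": -1.0,
    "coefficients": {"hourly_volume": 0.5, "num_transactions": 0.25},
}

def score(feature_vector, model=MODEL):
    """Return a logistic regression score in (0, 1) for one feature vector."""
    z = model["intercept"] + sum(
        weight * feature_vector.get(name, 0.0)  # missing attributes count as 0
        for name, weight in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

s = score({"hourly_volume": 2.0, "num_transactions": 0.0})
```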

Updating. The process of updating feature vectors using data from one or more data vectors. The rules for updating are described by a DXML file and any associated files. The term also refers to the process of creating new feature vectors from one or more data vectors.
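
Both senses of updating, modifying existing feature vectors and creating new ones, appear in the sketch below. The dictionary layout, attribute names, and account IDs are assumptions for illustration, and the update rule (a running count and total) is hard-coded rather than read from a DXML file.

```python
def update(feature_vectors, data_vectors):
    """Fold new events into per-account feature vectors, creating vectors as needed."""
    for event in data_vectors:
        # Look up the feature vector by the event's foreign key,
        # creating an empty one if this key has not been seen before.
        fv = feature_vectors.setdefault(
            event["foreign_key"], {"num_transactions": 0, "total_amount": 0.0}
        )
        fv["num_transactions"] += 1
        fv["total_amount"] += event["amount"]
    return feature_vectors

features = {"acct-1": {"num_transactions": 2, "total_amount": 125.0}}
new_events = [
    {"key": "t5", "foreign_key": "acct-1", "amount": 10.0},
    {"key": "t6", "foreign_key": "acct-3", "amount": 7.5},
]
update(features, new_events)
```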

Validation Set. A data set is usually divided into two subsets: a learning set and a validation set. The learning set, which is also called the training set, is used to build the model. The validation set is used to measure the efficacy of the model.
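
One simple way to measure efficacy is accuracy: the fraction of validation records the model classifies correctly. The glossary does not say which measure to use, so this is only one possibility, and the toy model, threshold, and labeled records below are invented for the example.

```python
def accuracy(model, validation_set, threshold=0.5):
    """Fraction of (feature_vector, label) pairs where the thresholded score matches the label."""
    correct = sum(
        (model(features) >= threshold) == label
        for features, label in validation_set
    )
    return correct / len(validation_set)

def toy_model(feature_vector):
    # Hypothetical stand-in for a real predictive model.
    return 0.9 if feature_vector["total_amount"] > 100 else 0.1

validation_set = [
    ({"total_amount": 150.0}, True),
    ({"total_amount": 20.0}, False),
    ({"total_amount": 120.0}, False),
    ({"total_amount": 10.0}, False),
]
result = accuracy(toy_model, validation_set)
```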


Copyright Robert L. Grossman, 1998-2004, revised April 20, 2004.