Standards, Services and Platforms for Data Mining:
An Introduction

Robert Grossman
Open Data Group

June, 2005



This is based in part on Robert L. Grossman, Standards, Services and Platforms for Data Mining: A Quick Introduction, Proceedings of DM-SSP 2003. Although a bit outdated now, it may still be useful as an introduction to the subject.


1. Introduction

Today, most data mining takes place in one of two ways. In the first way, a client server or 3-tier based data mining application accesses and analyzes local data. In the second way, data mining is embedded in another application, either explicitly or implicitly. For example, today data mining is embedded into the databases marketed and sold by IBM, Microsoft and Oracle. Data mining is also commonly embedded into a variety of applications, for example in CRM applications and financial risk applications.

During the past several years, web services have matured to the point that it is now becoming practical to create distributed data mining infrastructures and platforms based upon web services. In this paper, we briefly survey this area.

Web services have the potential to change in a fundamental way the infrastructure used to analyze data. Consider the following: today, many people find it quicker to locate a preprint by using Google than to search for it on their own local disk. On the other hand, almost all data analysis is done using local data. As bandwidth becomes a commodity, accessing remote data and remote services will become easier, and one day it may be as easy to work with remote data as it is to work with local data.

This paper is a preliminary version of a paper by the same name. Sections 5, 6 and 7 are based in part on [Grossman:03].

2. Background

It is convenient to think of data mining systems which were developed during the past decade as comprising three generations: 1) client-server systems; 2) component and agent based systems; and 3) systems based upon web services.

Today, most data mining takes place using first generation data mining systems. The data is local data and the architecture is either a client server architecture or a 3-tier based architecture. With these systems, a client front end is used to access a server (possibly on the same machine) hosting the data mining application. With a client server model, the server also manages the data; with a 3-tier model, the data is accessed from another host using ODBC, JDBC, or a related protocol. The most common commercial systems of this type include SAS(tm), SPSS(tm), and SPlus(tm). There is also R, an open source implementation of the S language on which SPlus is based.

The next generation of data mining systems to be developed was component based. The components could be local, relying on Microsoft's COM or DCOM platforms for example, or global, relying on Sun's J2EE platform for example. Angoss [Angoss] is an example of the former and Kensington [Inforsense] is an example of the latter.

More or less at the same time, various experimental agent based data mining systems were developed. The basic assumption in these systems is that the data is distributed and agents are used to either move the data, move the models produced by a local data mining system, or move the results of a local data mining computation. Today, very few agent based systems are used in practice. This is probably because no agent based infrastructure, over which agent based data mining systems could be built, was ever widely adopted. Examples of agent based distributed data mining systems include JAM [Stolfo:97], Papyrus [Grossman:99], and BODHI [Kargupta:97].

Somewhat later, the next generation of service based data mining systems began to emerge. These generally are built using W3C's standardization of web services. Examples include DataSpace [Grossman:02a] and data mining systems developed by IBM, Microsoft and SAS which employ the XML for Analysis standard [XMLA].

More general service based infrastructures, such as grids or data grids [Foster:99], are also used for data mining, especially when large computational resources are required. A data grid uses Globus or an equivalent infrastructure to provide security and resource management so that distributed computing resources can be used. In addition, Globus provides a high performance data transport mechanism called GridFTP. Recently, the Grid community has begun an effort called the Open Grid Service Architecture or OGSA, which provides web service based access to some grid services [OGSA]. The term knowledge grid is sometimes used for data mining services deployed using grids or data grid services.

Although grids have been used for some data mining applications, their use has been limited, since the critical path for many data mining applications is not a lack of computational resources but rather the time and effort required to deploy data mining into operational systems and the time and effort required to prepare data for data mining. We address these issues in the next section.

3. From Data Mining Systems to Data Mining Middleware

Producing Models. At the core, a data mining system produces one or more statistical or data mining models. Today these models are generally described using an XML markup language called the Predictive Model Markup Language or PMML [DMG]. These models may be descriptive (e.g. a cluster model or association rules) or predictive (e.g. a regression model or neural network). From this point of view, a data mining system takes a data set as input (the learning set) and produces a PMML model as the output. Sometimes these types of applications are called PMML producers.
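As a rough illustration of a PMML producer, the following sketch fits a one-variable linear regression to a small learning set and writes the result as an XML document. The element and attribute names are modeled loosely on PMML regression models and are not validated against any PMML schema; the function names, field names and data are hypothetical.

```python
import xml.etree.ElementTree as ET


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def to_pmml_like(intercept, slope, input_name="x", target_name="y"):
    """Serialize the fitted line as a PMML-style document (not schema checked)."""
    root = ET.Element("PMML", version="2.1")
    dictionary = ET.SubElement(root, "DataDictionary")
    for name in (input_name, target_name):
        ET.SubElement(dictionary, "DataField", name=name, optype="continuous")
    model = ET.SubElement(root, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", name=input_name, usageType="active")
    ET.SubElement(schema, "MiningField", name=target_name, usageType="predicted")
    table = ET.SubElement(model, "RegressionTable", intercept=repr(intercept))
    ET.SubElement(table, "NumericPredictor", name=input_name,
                  coefficient=repr(slope))
    return ET.tostring(root, encoding="unicode")


# Hypothetical learning set: the input is a data set, the output is a model.
learning_x = [1.0, 2.0, 3.0, 4.0]
learning_y = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(learning_x, learning_y)
pmml_text = to_pmml_like(intercept, slope)
print(pmml_text)
```

The point of the sketch is only the shape of the interface: data in, an XML model description out, so that a separate application can consume the model without access to the learning set.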

Producing Scores. Initially, the majority of data mining systems were designed as stand alone applications and were generally difficult to integrate with operational systems. In an operational system, the role of data mining is often simply to take a data record and produce a score. For this reason, specialized scoring engines began to be developed. Today there are several data mining scoring engines which take a PMML model as input and then score one or more records. For example, a scoring engine would take a data record as an input and produce as output the result of applying the model to the data record. Scoring engines have taken a while to mature. The first work in this area began in 1997. Today scoring engines are produced by a variety of vendors including IBM, SPSS, SAS, and Magnify. Sometimes these types of applications are called PMML consumers.
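A scoring engine of this kind can be sketched in a few lines. The example below reads a small PMML-style regression model and applies it to one data record. The model text, field names and function names are hypothetical, and the XML is only modeled loosely on PMML; a real scoring engine would support many model types and validate its input.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML-style model, written inline to keep the sketch self-contained.
MODEL_TEXT = """
<PMML version="2.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_model(pmml_text):
    """Extract the intercept and per-field coefficients from the model."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Apply the regression model to one data record and return the score."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_model(MODEL_TEXT)
result = score(model, {"age": 40.0, "income": 8.0})
print(result)
```

Because the consumer needs only the model file and the record, it can be embedded in an operational system without the software that built the model.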

Preparing Data. More recently, there has been an effort, most notably among members of the Data Mining Group, to standardize the data transformations, aggregations, normalizations and other functions required to prepare data for data mining. Although this work is still immature, PMML version 2.0 already includes many common data mining transformations, but perhaps not yet in a format which makes them easy to integrate into data mining systems [DMG]. The typical input for a system for preparing data consists of one or more tables of data records. The output consists of a table containing the data records produced by the transformations. The transformations may be described by a PMML file, for example. Today, data preparation is generally done using a database or a data mining system, such as SAS or R.
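The following sketch shows a table of records going in and a table of transformed records coming out, using three of the transformation types mentioned above: normalization, discretization and aggregation. The field names, interval boundaries and helper functions are hypothetical, and the normalization assumes the field takes at least two distinct values.

```python
def normalize(value, low, high):
    """Linear normalization of a continuous value onto the interval [0, 1]."""
    return (value - low) / (high - low)


def discretize(value, boundaries, labels):
    """Map a continuous value to the label of the interval containing it.

    labels must have one more entry than boundaries; the last label covers
    every value at or above the final boundary.
    """
    for upper, label in zip(boundaries, labels):
        if value < upper:
            return label
    return labels[-1]


def average_by(records, key, field):
    """Aggregate: the mean of `field` for each distinct value of `key`."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r[field])
    return {k: sum(v) / len(v) for k, v in groups.items()}


def prepare(records):
    """Input: a table of records. Output: a table of derived fields."""
    ages = [r["age"] for r in records]
    low, high = min(ages), max(ages)
    return [
        {
            "age_norm": normalize(r["age"], low, high),
            "income_band": discretize(r["income"], [30000, 80000],
                                      ["low", "medium", "high"]),
        }
        for r in records
    ]


table = [
    {"region": "east", "age": 20, "income": 25000},
    {"region": "west", "age": 40, "income": 50000},
    {"region": "east", "age": 60, "income": 90000},
]
prepared = prepare(table)
income_by_region = average_by(table, "region", "income")
```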

Accessing Data. It is probably still the case that most data that is mined and analyzed is made available in ASCII files whose fields are delimited by commas, tabs, vertical bars or some other special character and whose records are delimited by carriage returns. On the other hand, more and more business data is being stored in relational databases, which data mining applications can access through ODBC or JDBC. More recently, web service based protocols such as the DataSpace Transfer Protocol or DSTP have provided a simple mechanism for accessing remote and distributed data and direct support for some of the more common operations for accessing data, such as the ability to retrieve metadata, to select rows and columns, and to sample data [Grossman:02a].
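The common access operations listed above (retrieving metadata, selecting rows and columns, and sampling) can be illustrated on a small delimited file. The sketch below uses Python's standard csv module on an in-memory, vertical bar delimited table; it does not implement DSTP, and the data and function names are hypothetical.

```python
import csv
import io
import random

# Hypothetical delimited data; in practice this would be a file or a remote source.
RAW = "id|region|amount\n1|east|10.5\n2|west|7.0\n3|east|3.25\n4|north|8.0\n"


def read_delimited(text, delimiter):
    """Return (field names, rows); the field names serve as simple metadata."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    fields = list(reader.fieldnames)
    return fields, list(reader)


def select(rows, columns, predicate=lambda row: True):
    """Keep the rows satisfying predicate, restricted to the given columns."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def sample(rows, k, seed=0):
    """Draw k rows without replacement, reproducibly for a fixed seed."""
    return random.Random(seed).sample(rows, k)


fields, rows = read_delimited(RAW, "|")
east = select(rows, ["id", "amount"], lambda r: r["region"] == "east")
subset = sample(rows, 2)
```

All values are returned as strings here; converting them to typed values is part of data preparation rather than data access.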

We use the term data mining middleware to refer to systems or services which are used to produce models, produce scores, prepare data, or access data. Standards have at least two important roles in data mining middleware [Grossman:02b]:

  1. Standards are used to specify the inputs, outputs and interfaces to the various data mining middleware services described above. For example, PMML is used to specify the output of a service producing models. As another example, the Web Service Description Language (WSDL) [W3C:WS] can be used to describe a data mining web service.
  2. Standards are used to specify the APIs to other languages and systems. There are standard data mining APIs for Java and SQL for example. Using the appropriate API, an application can build a classification tree using data in a SQL database.

4. Using XML to Define Inputs, Outputs and Interfaces

The Predictive Model Markup Language (PMML) is being developed by the Data Mining Group, a vendor led consortium which currently includes over a dozen vendors of statistical and data mining software [DMG]. PMML can be used to specify the inputs and outputs of data mining consumers and producers. PMML can also be used to specify the transformations used to prepare data for data mining.

PMML consists of the following components:

  1. Data Dictionary. The data dictionary defines the fields which are the inputs to models and specifies the type and value range for each field.
  2. Mining Schema. Each model contains one mining schema which lists the fields used in the model. These fields are a subset of the fields in the Data Dictionary. The mining schema contains information that is specific to a certain model, while the data dictionary contains data definitions which do not vary with the model. For example, the Mining Schema specifies the usage type of an attribute, which may be active (an input of the model), predicted (an output of the model), or supplementary (holding descriptive information and ignored by the model).
  3. Transformation Dictionary. The Transformation Dictionary defines derived fields. Derived fields may be defined by normalization, which maps continuous or discrete values to numbers; by discretization, which maps continuous values to discrete values; by value mapping, which maps discrete values to discrete values; or by aggregation, which summarizes or collects groups of values, for example by computing averages.
  4. Model Statistics. The Model Statistics component contains basic univariate statistics about the model, such as the minimum, maximum, mean, standard deviation, median, etc. of numerical attributes.
  5. Model Parameters. PMML also specifies the actual parameters defining the statistical and data mining models per se. Models in PMML Version 2.1 include regression models, cluster models, trees, neural networks, Bayesian models, association rules, and sequence models.
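The relationship between the Data Dictionary and the Mining Schema described above can be made concrete with a small consistency check: every field in a mining schema should appear in the data dictionary, and each field should carry one of the three usage types. This is a minimal sketch with hypothetical class and function names, not a PMML validator.

```python
from dataclasses import dataclass

# The three usage types described in the text.
USAGE_TYPES = {"active", "predicted", "supplementary"}


@dataclass(frozen=True)
class DataField:
    """A data dictionary entry: independent of any particular model."""
    name: str
    optype: str


@dataclass(frozen=True)
class MiningField:
    """A mining schema entry: specific to one model."""
    name: str
    usage_type: str = "active"


def validate_schema(data_dictionary, mining_schema):
    """Return a list of problems; an empty list means the schema is consistent."""
    known = {f.name for f in data_dictionary}
    problems = []
    for field in mining_schema:
        if field.name not in known:
            problems.append(f"{field.name}: not in data dictionary")
        if field.usage_type not in USAGE_TYPES:
            problems.append(f"{field.name}: unknown usage type {field.usage_type}")
    return problems


dictionary = [
    DataField("age", "continuous"),
    DataField("income", "continuous"),
    DataField("customer_id", "categorical"),
    DataField("churn", "categorical"),
]
good_schema = [
    MiningField("age"),
    MiningField("income"),
    MiningField("customer_id", "supplementary"),
    MiningField("churn", "predicted"),
]
bad_schema = [MiningField("height"), MiningField("churn", "output")]

good_problems = validate_schema(dictionary, good_schema)
bad_problems = validate_schema(dictionary, bad_schema)
```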

5. Standards for Data Mining APIs

Developing standards based data mining applications and services is facilitated by defining standard APIs for common languages and environments such as Java, SQL, and Microsoft's OLE DB.

The data mining extensions in SQL are part of the SQL Multimedia and Application Packages standard or SQL/MM. The particular specification, called SQL/MM Part 6: Data Mining, specifies a SQL interface to data mining packages.

The Java Specification Request 73 (JSR-73), known as Java Data Mining (JDM), defines a pure Java (tm) API to support data mining operations. These operations include model building, scoring data using models, as well as the creation, storage, access and maintenance of data and metadata supporting data mining results [JSR-73]. It also includes selected data transformations and provides a framework so that new mining algorithms can be introduced.

Microsoft's OLE DB for Data Mining (OLE DB for DM) defines a data mining API to Microsoft's OLE DB environment [Microsoft:OLEDB]. OLE DB for DM doesn't introduce any new OLE DB interfaces, but rather uses a SQL-like query language and a specialized data structure called a rowset so that data mining consumers can communicate with data mining producers using OLE DB. In 2002, OLE DB for DM was subsumed by Microsoft's Analysis Services for SQL Server 2000 [Microsoft:SQL]. Microsoft's Analysis Services provide APIs to Microsoft's SQL Server 2000 services which support data transformations, data mining and OLAP operations.

6. Data Mining as a Web Service

The W3C has led the development of standards for web services. Although there are several variants, a web service can be defined using the Web Service Description Language (WSDL), and XML data can be transported using the Simple Object Access Protocol or SOAP. Finally, web services can be discovered using the Universal Description, Discovery and Integration or UDDI service. For details see [W3C:WS].
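To make the transport step concrete, the following sketch builds a SOAP 1.1 envelope carrying a request to score one data record. Only the SOAP envelope namespace is standard; the service namespace, the ScoreRecord operation, the Field element and the model name are hypothetical, and in practice the message layout would be fixed by the service's WSDL description.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:scoring"  # hypothetical namespace for the scoring service


def build_score_request(model_name, record):
    """Wrap one data record in a SOAP envelope for a hypothetical scoring service."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{SERVICE_NS}}}ScoreRecord", model=model_name)
    for name, value in record.items():
        field = ET.SubElement(request, f"{{{SERVICE_NS}}}Field", name=name)
        field.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


request_text = build_score_request("churn-model", {"age": 40, "income": 52000})
print(request_text)
```

The resulting text would typically be sent as the body of an HTTP POST to the service endpoint; the sketch stops short of that step because the endpoint is service specific.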

Work in this area includes:

7. Conclusions and Future Directions

Given the growing amount of web accessible data, the declining cost of bandwidth, and the maturing of web services, it will become more and more common to analyze and to mine data with data mining services.

One of the obstacles to the wider adoption of data mining has been the difficulty of preparing data for data mining. The incorporation of data mining as an embedded database application and the emergence of web services to prepare and transform data may facilitate the broader use of data mining.

References

[Angoss] Angoss, KnowledgeStudio, retrieved from www.angoss.com on August 5, 2003.

[Cannataro:2002] Mario Cannataro, Domenico Talia, and Paolo Trunfio, The Knowledge Grid: Towards an Architecture for Knowledge Discovery on the Grid, to appear.

[DMG] Predictive Model Markup Language (PMML), Data Mining Group, retrieved from http://www.dmg.org on August 5, 2003.

[JSR-73] Java Specification Request 73. Retrieved from http://jcp.org/jsr/detail/073.jsp on March 8, 2002.

[Foster:99] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, California, 1999.

[Grossman:99] Robert L. Grossman, Stuart Bailey, A. Ramu, Balinder Malhi, Harinath Sivakumar, Andrei Turinsky, Papyrus: A System for Data Mining over Local and Wide Area Clusters and Super-Clusters, Proceedings of SC 1999, 1999.

[Grossman:02a] Robert Grossman, and Marco Mazzucco, DataSpace - A Web Infrastructure for the Exploratory Analysis and Mining of Data, IEEE Computing in Science and Engineering, July/August, 2002, pages 44-51.

[Grossman:02b] Robert Grossman, Mark Hornick, and Gregor Meyer, Data Mining Standards Initiatives, Communications of the ACM, Volume 45-8, 2002, pages 59-61.

[Grossman:03] Robert Grossman, Mark Hornick, and Gregor Meyer, Emerging Standards and Interfaces in Data Mining, Handbook of Data Mining, Nong Ye, editor, Kluwer Academic Publishers.

[Kargupta:97] H. Kargupta, I. Hamzaoglu and B. Stafford, Scalable, Distributed Data Mining Using an Agent Based Architecture, KDD97, pages 211-214.

[Microsoft:OLEDB] Microsoft OLE DB for Data Mining Specification 1.0, retrieved from www.microsoft.com/data/oledb/default.htm on March 8, 2002.

[Microsoft:SQL] Microsoft SQL Server 2000 Analysis Services, retrieved from www.microsoft.com/SQL/techinfo/bi/analysis.asp on March 8, 2002.

[OGSA] The Globus Project, Towards Globus Toolkit 3.0: Open Grid Services Architecture, retrieved from www.globus.org/ogsa/ on January 10, 2003.

[Stolfo:97] S. Stolfo, A. L. Prodromidis and P. K. Chan, JAM: Java Agents for Meta-Learning over Distributed Databases, KDD97.

[W3C:SW] World Wide Web Consortium (W3C), Semantic Web, retrieved from www.w3c.org/2001/sw on March 8, 2002.

[W3C:WS] World Wide Web Consortium (W3C), Web Services, retrieved from http://www.w3.org/2002/ws/ on August 5, 2003.

[XMLA] XML for Analysis Consortium, XML for Analysis, retrieved from http://www.xmla.org.

For More Information

For more information, please contact Open Data Partners at www.opendatagroup.com.

About the Author

Robert Grossman is the Managing Partner of Open Data Partners, which provides consulting services, outsourced data services, and litigation support services related to data. He is also the Director of the Laboratory for Advanced Computing at the University of Illinois at Chicago, which develops internet-based technologies. He has written over 100 papers and edited four books in data mining, business intelligence, direct marketing, e-business, high performance computing, and related areas. He has a Ph.D. from Princeton and an A.B. from Harvard.

Copyright

Copyright 2003-2005 Robert L. Grossman. All rights reserved.