Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD),
an interdisciplinary subfield of
computer science, is the computational process of discovering patterns in large
data sets involving methods at the intersection of
artificial intelligence,
machine learning,
statistics, and
database systems.
The overall goal of the data mining process is to extract information
from a data set and transform it into an understandable structure for
further use.
Aside from the raw analysis step, it involves database and
data management aspects,
data preprocessing,
model and
inference considerations, interestingness metrics,
complexity considerations, post-processing of discovered structures,
visualization, and
online updating.
The term is a
buzzword, and is frequently misused to mean any form of large-scale data or information processing (
collection,
extraction,
warehousing,
analysis, and statistics) but is also generalized to any kind of
computer decision support system, including
artificial intelligence,
machine learning, and
business intelligence. In the proper use of the word, the key term is
discovery,
commonly defined as "detecting something new". Even the popular book
"Data mining: Practical machine learning tools and techniques with Java"
(which covers mostly
machine learning
material) was originally to be named just "Practical machine learning",
and the term "data mining" was only added for marketing reasons.
Often the more general terms "(large scale)
data analysis", or "
analytics" – or when referring to actual methods,
artificial intelligence and
machine learning – are more appropriate.
The actual data mining task is the automatic or semi-automatic
analysis of large quantities of data to extract previously unknown
interesting patterns such as groups of data records (
cluster analysis), unusual records (
anomaly detection) and dependencies (
association rule mining). This usually involves using database techniques such as
spatial indices.
These patterns can then be seen as a kind of summary of the input data,
and may be used in further analysis or, for example, in
machine learning and
predictive analytics.
For example, the data mining step might identify multiple groups in the
data, which can then be used to obtain more accurate prediction results
by a
decision support system.
Neither the data collection, data preparation, nor result
interpretation and reporting are part of the data mining step, but do
belong to the overall KDD process as additional steps.
The related terms
data dredging,
data fishing, and
data snooping
refer to the use of data mining methods to sample parts of a larger
population data set that are (or may be) too small for reliable
statistical inferences to be made about the validity of any patterns
discovered. These methods can, however, be used in creating new
hypotheses to test against the larger data populations.
Data mining uses information from past data to analyze the outcome of
a particular problem or situation that may arise. It is typically applied
to data stored in data warehouses, and that data may come from all parts
of a business, from production to management. Managers also use data
mining to decide upon marketing strategies for their products and to
compare and contrast themselves with competitors. Data mining turns this
data into analysis that can be used to increase sales, promote new
products, or discontinue products that add no value to the company.
Process
The
Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
- (1) Selection
- (2) Pre-processing
- (3) Transformation
- (4) Data Mining
- (5) Interpretation/Evaluation.
Many variations on this theme exist, however, such as the
Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:
- (1) Business Understanding
- (2) Data Understanding
- (3) Data Preparation
- (4) Modeling
- (5) Evaluation
- (6) Deployment
or a simplified process such as (1) pre-processing, (2) data mining, and (3) results validation.
Polls conducted in 2002, 2004, and 2007 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was
SEMMA.
However, 3-4 times as many people reported using CRISP-DM. Several
teams of researchers have published reviews of data mining process
models,
and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.
Pre-processing
Before data mining algorithms can be used, a target data set must be
assembled. As data mining can only uncover patterns actually present in
the data, the target data set must be large enough to contain these
patterns while remaining concise enough to be mined within an acceptable
time limit. A common source for data is a
data mart or
data warehouse. Pre-processing is essential for analyzing
multivariate data sets before data mining. The target set is then cleaned.
Data cleaning removes the observations containing
noise and those with
missing data.
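As an illustration only, a minimal pre-processing sketch in Python (using pandas; the file name and the three-standard-deviation noise rule below are assumptions, not part of any standard) could drop observations with missing data and filter crude noise:

```python
import pandas as pd

# Hypothetical target data set; the file name is illustrative only.
df = pd.read_csv("target_data.csv")

# Remove observations with missing data.
df = df.dropna()

# Remove noisy observations, here crudely defined as rows with any numeric
# value more than three standard deviations from that column's mean.
numeric = df.select_dtypes(include="number")
z_scores = (numeric - numeric.mean()) / numeric.std()
df = df[(z_scores.abs() <= 3).all(axis=1)]
```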
Data mining
Data mining involves six common classes of tasks (a brief sketch of two of them follows the list):
- Anomaly detection
(Outlier/change/deviation detection) – The identification of unusual
data records that might be interesting, or of data errors that require
further investigation.
- Association rule learning
(Dependency modeling) – Searches for relationships between variables.
For example, a supermarket might gather data on customer purchasing
habits. Using association rule learning, the supermarket can determine
which products are frequently bought together and use this information
for marketing purposes. This is sometimes referred to as market basket
analysis.
- Clustering
– is the task of discovering groups and structures in the data that are
in some way or another "similar", without using known structures in the
data.
- Classification
– is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as
"legitimate" or as "spam".
- Regression – Attempts to find a function which models the data with the least error.
- Summarization – providing a more compact representation of the data set, including visualization and report generation.
- Sequential pattern mining
– Finds sets of data items that occur together frequently in some
sequences. Sequential pattern mining, which extracts frequent
subsequences from a sequence database, has attracted a great deal of
interest in recent data mining research because it is the basis of many
applications, such as web user analysis, stock trend prediction, DNA
sequence analysis, finding language or linguistic patterns in natural
language texts, and using the history of symptoms to predict certain
kinds of disease.
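As promised above, the following Python sketch (using scikit-learn on synthetic data; none of the names or parameters come from a real application) illustrates two of these tasks, clustering and classification:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Synthetic records purely for illustration.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Clustering: discover groups without using the known labels.
clusters = KMeans(n_clusters=3, random_state=0).fit_predict(X)

# Classification: generalize known structure (the labels y) to new data.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```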
Results validation
The final step of knowledge discovery from data is to verify that the
patterns produced by the data mining algorithms occur in the wider data
set. Not all patterns found by the data mining algorithms are
necessarily valid. It is common for the data mining algorithms to find
patterns in the training set which are not present in the general data
set. This is called
overfitting. To overcome this, the evaluation uses a
test set
of data on which the data mining algorithm was not trained. The learned
patterns are applied to this test set and the resulting output is
compared to the desired output. For example, a data mining algorithm
trying to distinguish "spam" from "legitimate" emails would be trained
on a
training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had
not
been trained. The accuracy of the patterns can then be measured from
how many e-mails they correctly classify. A number of statistical
methods may be used to evaluate the algorithm, such as
ROC curves.
If the learned patterns do not meet the desired standards, then it is
necessary to re-evaluate and change the pre-processing and data mining
steps. If the learned patterns do meet the desired standards, then the
final step is to interpret the learned patterns and turn them into
knowledge.
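A minimal sketch of this validation step, assuming scikit-learn and a synthetic stand-in for the spam example (nothing here describes a specific spam filter), might look like:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic "spam vs. legitimate" labels for illustration only.
X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a test set on which the algorithm is never trained.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Compare predictions on the unseen test set with the desired output.
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```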
Notable uses
Business
Data mining is the analysis of historical business activities, stored
as static data in data warehouse databases, to reveal hidden patterns
and trends. Data mining software uses advanced pattern recognition
algorithms to sift through large amounts of data to assist in
discovering previously unknown strategic business information. Examples
of what businesses use data mining for include performing market
analysis to identify new product bundles, finding the root cause of
manufacturing problems, preventing customer attrition, acquiring new
customers, cross-selling to existing customers, and profiling customers
with greater accuracy.
In today’s world, raw data is being collected by companies at an
exploding rate. For example, Walmart processes over 20 million
point-of-sale transactions every day. This information is stored in a
centralized database, but would be useless without some type of data
mining software to analyze it. If Walmart analyzed its point-of-sale
data with data mining techniques, it would be able to determine sales
trends, develop marketing campaigns, and more accurately predict
customer loyalty.
Every time we use a credit card or a store loyalty card, or fill out a
warranty card, data is being collected about our purchasing behavior.
Many people find the amount of information that companies such as
Google, Facebook, and Amazon store about us disturbing and are
concerned about privacy. Although there is the potential for our
personal data to be used in harmful or unwanted ways, it is also being
used to make our lives better. For example, Ford and Audi hope to one
day collect information about customer driving patterns so they can
recommend safer routes and warn drivers about dangerous road conditions.
Data mining in
customer relationship management applications can contribute significantly to the bottom line.
Rather than randomly contacting a prospect or customer through a call
center or sending mail, a company can concentrate its efforts on
prospects that are predicted to have a high likelihood of responding to
an offer. More sophisticated methods may be used to optimize resources
across campaigns so that one may predict to which channel and to which
offer an individual is most likely to respond (across all potential
offers). Additionally, sophisticated applications could be used to
automate mailing. Once the results from data mining (potential
prospect/customer and channel/offer) are determined, this "sophisticated
application" can either automatically send an e-mail or a regular mail.
Finally, in cases where many people will take an action without an
offer, "
uplift modeling"
can be used to determine which people have the greatest increase in
response if given an offer. Uplift modeling thereby enables marketers to
focus mailings and offers on persuadable people, and not to send offers
to people who will buy the product without an offer.
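One way such a system might be implemented is the two-model approach sketched below; this is an assumption about a common technique, not a description of any particular marketing product, and all names are hypothetical:

```python
from sklearn.linear_model import LogisticRegression

def uplift_scores(X_treated, y_treated, X_control, y_control, X_new):
    """Two-model uplift sketch: predicted response probability if offered
    minus predicted response probability if not offered."""
    m_treated = LogisticRegression(max_iter=1000).fit(X_treated, y_treated)
    m_control = LogisticRegression(max_iter=1000).fit(X_control, y_control)
    return (m_treated.predict_proba(X_new)[:, 1]
            - m_control.predict_proba(X_new)[:, 1])

# Customers with the highest scores are the "persuadables" worth mailing;
# customers who would buy anyway score close to zero.
```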
Data clustering can also be used to automatically discover the segments or groups within a customer data set.
Businesses employing data mining may see a return on investment, but
they also recognize that the number of predictive models can quickly
become very large. Rather than using one model to predict how many
customers will
churn,
a business could build a separate model for each region and customer
type. Then, instead of sending an offer to all people that are likely to
churn, it may only want to send offers to loyal customers. Finally, the
business may want to determine which customers are going to be
profitable over a certain window in time, and only send the offers to
those that are likely to be profitable. To maintain this quantity of
models, the business needs to manage model versions and move toward
automated data mining.
Data mining can also be helpful to human resources (HR) departments
in identifying the characteristics of their most successful employees.
Information obtained – such as universities attended by highly
successful employees – can help HR focus recruiting efforts accordingly.
Additionally, Strategic Enterprise Management applications help a
company translate corporate-level goals, such as profit and margin share
targets, into operational decisions, such as production plans and
workforce levels.
Another example of data mining, often called
market basket analysis,
relates to its use in retail sales. If a clothing store records the
purchases of customers, a data mining system could identify those
customers who favor silk shirts over cotton ones. Although some
explanations of such relationships may be difficult to find, taking
advantage of them is easier. The example deals with
association rules within transaction-based data. Not all data are
transaction based, and logical or inexact rules may also be present
within a database.
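As a small self-contained sketch of the underlying idea (the transactions below are invented, and real systems use dedicated algorithms such as Apriori), the support and confidence of candidate rules can be computed directly:

```python
from collections import Counter
from itertools import combinations

# Invented transactions for illustration.
transactions = [
    {"silk shirt", "tie"},
    {"silk shirt", "tie", "belt"},
    {"cotton shirt", "belt"},
    {"silk shirt", "tie"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(transactions)
for (a, b), count in pair_counts.items():
    support = count / n                  # fraction of baskets with both items
    confidence = count / item_counts[a]  # confidence of the rule a -> b
    if support >= 0.5:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```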
Market basket analysis has also been used to identify the purchase patterns of the
Alpha Consumer.
Alpha Consumers are people who play a key role in connecting with the
concept behind a product, then adopting that product, and finally
validating it for the rest of society. Analyzing the data collected on
this type of user has allowed companies to predict future buying trends
and forecast supply demands.
Data mining is a highly effective tool in the catalog marketing industry.
Catalogers have a rich history of customer transactions for millions of
customers dating back a number of years.
Data mining tools can identify patterns among customers and help
identify the most likely customers to respond to upcoming mailing
campaigns.
Data mining for business applications is a component that needs to be
integrated into a complex modeling and decision making process.
Reactive business intelligence (RBI) advocates a "holistic" approach that integrates data mining,
modeling, and
interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning.
In the area of
decision making, the
RBI
approach has been used to mine knowledge that is progressively acquired
from the decision maker, and then self-tune the decision method
accordingly.
An example of data mining related to an integrated-circuit (IC)
production line is described in the paper "Mining IC Test Data to
Optimize VLSI Testing."
In this paper, the application of data mining and decision analysis to
the problem of die-level functional testing is described. Experiments
mentioned demonstrate the ability to apply a system of mining historical
die-test data to create a probabilistic model of patterns of die
failure. These patterns are then utilized to decide, in real time, which
die to test next and when to stop testing. This system has been shown,
based on experiments with historical test data, to have the potential to
improve profits on mature IC products.
Science and engineering
In recent years, data mining has been used widely in the areas of science and engineering, such as
bioinformatics,
genetics,
medicine,
education and
electrical power engineering.
In the study of human genetics,
sequence mining helps address the important goal of understanding the mapping relationship between the inter-individual variations in human
DNA
sequence and the variability in disease susceptibility. In simple
terms, it aims to find out how the changes in an individual's DNA
sequence affect the risk of developing common diseases such as
cancer,
which is of great importance to improving methods of diagnosing,
preventing, and treating these diseases. The data mining method that is
used to perform this task is known as
multifactor dimensionality reduction.
In the area of electrical power engineering, data mining methods have been widely used for
condition monitoring
of high voltage electrical equipment. The purpose of condition
monitoring is to obtain valuable information on, for example, the status
of the
insulation (or other important safety-related parameters).
Data clustering techniques – such as the
self-organizing map
(SOM) – have been applied to vibration monitoring and analysis of
transformer on-load tap-changers (OLTCs). Using vibration monitoring, it
can be observed that each tap change operation generates a signal that
contains information about the condition of the tap changer contacts and
the drive mechanisms. Obviously, different tap positions will generate
different signals. However, there was considerable variability amongst
normal condition signals for exactly the same tap position. SOM has been
applied to detect abnormal conditions and to hypothesize about the
nature of the abnormalities.
Data mining methods have also been applied to
dissolved gas analysis (DGA) in
power transformers.
DGA, as a diagnostic for power transformers, has been available for
many years. Methods such as SOM have been applied to analyze the
generated data and to determine trends that are not obvious to standard
DGA ratio methods (such as the Duval Triangle).
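A minimal sketch, assuming the third-party MiniSom library and invented feature vectors (real vibration or DGA pipelines differ considerably), of how a SOM can flag unusual condition signals by their distance to the best-matching unit:

```python
import numpy as np
from minisom import MiniSom  # assumption: the MiniSom package is installed

# Invented feature vectors summarizing condition signals, one row per measurement.
rng = np.random.default_rng(0)
signals = rng.normal(size=(500, 16))

som = MiniSom(8, 8, input_len=16, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(signals, 5000)

# Flag signals whose distance to their best-matching unit is unusually large.
weights = som.get_weights()
errors = np.array([np.linalg.norm(x - weights[som.winner(x)]) for x in signals])
threshold = errors.mean() + 3 * errors.std()
abnormal = np.where(errors > threshold)[0]
print("candidate abnormal signals:", abnormal)
```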
Another example of data mining in science and engineering is found in
educational research, where data mining has been used to study the
factors leading students to choose to engage in behaviors which reduce
their learning,
and to understand factors influencing university student retention.
A similar example of social application of data mining is its use in
expertise finding systems,
whereby descriptors of human expertise are extracted, normalized, and
classified so as to facilitate the finding of experts, particularly in
scientific and technical fields. In this way, data mining can facilitate
institutional memory.
Other examples of application of data mining methods are
biomedical data facilitated by domain
ontologies,
mining clinical trial data,
and
traffic analysis using SOM.
In adverse drug reaction surveillance, the
Uppsala Monitoring Centre
has, since 1998, used data mining methods to routinely screen for
reporting patterns indicative of emerging drug safety issues in the WHO
global database of 4.6 million suspected
adverse drug reaction incidents.
Recently, similar methodology has been developed to mine large collections of
electronic health records for temporal patterns associating drug prescriptions to medical diagnoses.
Data mining has been applied to software artifacts within the realm of
software engineering:
Mining Software Repositories.