Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems)
Data mining, also known as knowledge discovery and data mining (KDD), knowledge discovery in data, or knowledge discovery in databases, refers to sorting through data to identify patterns and relationships in large relational databases in order to extract and use the information they contain. Data mining allows data to be extracted and transformed, stored in a database, accessed, and analyzed by software programs. Additionally, the data retrieved can be presented in various forms, such as graphs or tables. The types of relationships commonly found in data mining include classes, clusters, associations, and sequential patterns.
Data mining focuses on producing a solution that generates useful forecasts through a four-phase process:
- problem identification
- exploration of the data
- pattern discovery
- knowledge deployment, or application of knowledge to new data to forecast or generate predictions.
Problem identification
- the initial phase of data mining
- the problem must be defined, and everyone involved must understand the objectives and requirements of the data mining process they are initiating
Exploration of the data
- begins with exploring and preparing the data for the data mining process
- might include data access, cleansing, sampling, and transformation
- a technique that can be used for data reduction is clustering.
- Clustering groups statistical units into clusters (classes) in order to reduce the overall number of statistical units.
- A cluster is composed of elements that are similar to each other and dissimilar to the elements of other clusters, so clustering is essentially a method of grouping.
- To determine why the groups are different, a different data reduction technique, factor analysis, must be used.
- the goal is to identify the relevant or important variables and determine their nature
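The clustering idea described above can be sketched in a few lines of Python. This is a minimal illustration of k-means, one common clustering method, and not necessarily the technique the text has in mind. The function name `k_means`, the toy `points`, and the choice of evenly spaced starting centroids are all assumptions made for the example.

```python
from math import dist
from statistics import fmean


def k_means(points, k, iterations=20):
    """Group points into k clusters and return (centroids, labels)."""
    # Assumption: start from k evenly spaced points in the input order.
    # This keeps the example deterministic; libraries usually pick smarter starts.
    centroids = [tuple(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(fmean(axis) for axis in zip(*members))
    return centroids, labels


# Hypothetical toy data: six statistical units forming two visible groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
centroids, labels = k_means(points, k=2)

# Data reduction: six units are now summarized by two cluster centers.
print(labels)
print(centroids)
```

Here the six original units are replaced by two cluster centers, which is the sense in which clustering reduces the number of statistical units. The sketch says nothing about why the groups differ; that question is left to other techniques such as factor analysis.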
Pattern discovery
- model building/pattern identification
- a complex phase of data mining
- different models are applied to the same data to choose the best model for the data set being analyzed
- the model chosen should identify the patterns in the data that will support the best predictions
- model must be tested, evaluated, and interpreted
- this phase ends with a highly predictive, consistent pattern-identifying model
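The step of applying different models to the same data and keeping the best one might look like the following sketch. The two candidate models (a constant mean predictor and a least-squares line), the toy data, and the use of mean squared error on held-out records are assumptions chosen for illustration; real projects would compare whichever models and metrics suit the problem.

```python
from statistics import fmean


def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: least-squares straight line y = intercept + slope * x."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and observed values."""
    return fmean((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data with a roughly linear pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# Hold out every second record so each model is tested on data it has not seen.
train_x, train_y = xs[::2], ys[::2]
test_x, test_y = xs[1::2], ys[1::2]

candidates = {"mean": fit_mean, "line": fit_line}
scores = {
    name: mean_squared_error(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best_name = min(scores, key=scores.get)
print(scores, best_name)
```

Scoring on held-out records rather than on the training data is what lets the comparison say something about predictive quality, which is the stated goal of this phase.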
Knowledge deployment/application of knowledge
- takes the pattern and model identified in the pattern discovery phase and applies them to new data to test whether they can achieve the desired outcome.
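As a rough sketch of this phase, the snippet below applies an already chosen model to new records and checks whether its predictions are good enough. The model parameters `SLOPE` and `INTERCEPT`, the new records, and the acceptance threshold `MAX_ACCEPTABLE_ERROR` are all hypothetical values invented for the example.

```python
# Hypothetical parameters of a model selected during pattern discovery.
SLOPE, INTERCEPT = 2.0, 0.0

# Assumed acceptance criterion for the "desired outcome".
MAX_ACCEPTABLE_ERROR = 0.5


def predict(x):
    """Apply the previously identified pattern to one new input."""
    return INTERCEPT + SLOPE * x


# New data that was not available when the model was built,
# together with the outcomes that were observed afterwards.
new_x = [9, 10, 11]
new_y = [18.2, 19.7, 22.1]

predictions = [predict(x) for x in new_x]
mean_abs_error = sum(abs(p - y) for p, y in zip(predictions, new_y)) / len(new_y)

# The deployment succeeds only if the error stays within the agreed limit.
meets_goal = mean_abs_error <= MAX_ACCEPTABLE_ERROR
print(predictions, mean_abs_error, meets_goal)
```

If `meets_goal` were false, the natural next step would be to return to the earlier phases and revisit the data or the model.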