Data mining

Students KA-81: Yana, Yaroslav

2011

Table of Contents

Abstract

Introduction

1. What is Data Mining

2. Developmental History of Data Mining and Knowledge Discovery

3. Theoretical Principles

4. Technological Elements of Data Mining

5. Steps in Knowledge Discovery

5.1 Step 1: Task Discovery

5.2 Step 2: Data Discovery

5.3 Step 3: Data Transformation

5.4 Step 4: Data Reduction

5.5 Step 5: Discovering Patterns (aka Data Mining)

5.6 Step 6: Result Interpretation and Visualization

5.7 Step 7: Putting the Knowledge to Use

6. Data Mining Methods

6.1 Classification

6.2 Regression

6.3 Clustering

6.4 Summarization

6.5 Change and Deviation Detection

7. Related Disciplines: Information Retrieval and Text Mining

7.1 Information Retrieval (IR)

7.2 IR Contributions to Data Mining

7.3 Data Mining Contributions to IR

8. Text Mining

Abstract

Data mining or knowledge discovery refers to the process of finding interesting information in large repositories of data. The term data mining also refers to the step in the knowledge discovery process in which special algorithms are employed in hopes of identifying interesting patterns in the data. These patterns are then analyzed, yielding knowledge. The desired outcome of data mining activities is to discover knowledge that is not explicit in the data, and to put that knowledge to use.

Those involved in digital libraries are already benefiting from data mining techniques as they explore ways to automatically classify information and explore new approaches for subject clustering (MetaCombine Project). As the field grows, new applications for libraries are likely to evolve, and it will be important for library administrators to have a basic understanding of the technology.

A wide variety of data mining techniques are also employed by industry and government. Many of these activities pose threats to personal privacy. As professionals ethically bound to safeguard individual privacy, librarians should monitor data mining activities and keep them on their radar.

This paper is written for information professionals who would like a better understanding of knowledge discovery and data mining techniques. It explains the historical development of this new discipline, explains specific data mining methods, and concludes that future development should focus on tools and techniques that yield useful knowledge without invading individual privacy.

Introduction

Data mining is an ambiguous term that has been used to refer to the process of finding interesting information in large repositories of data. More precisely, the term refers to the application of special algorithms in a process built upon sound principles from numerous disciplines including statistics, artificial intelligence, machine learning, database science, and information retrieval (Han & Kamber, 2001).

Data mining algorithms are utilized in pursuits variously called data mining, knowledge mining, data driven discovery, and deductive learning (Dunham, 2003). Data mining techniques can be performed on a wide variety of data types including databases, text, spatial data, temporal data, images, and other complex data (Frawley, Piatetsky-Shapiro, & Matheus, 1991; Hearst, 1999; Roddick & Spiliopoulou, 1999; Zaïane, Han, Li, & Hou, 1998).

Some areas of specialty have a name, such as KDD (knowledge discovery in databases), text mining, and Web mining. Most of these specialties utilize the same basic toolset, follow the same basic process, and (hopefully) yield the same product: useful knowledge that was not explicitly part of the original data set (Benoît, 2002; Han & Kamber, 2001; Fayyad, Piatetsky-Shapiro, & Smyth, 1996).

1. What is Data Mining

Data mining refers to the process of finding interesting patterns in data that are not explicitly part of the data (Witten & Frank, 2005, p. xxiii). The interesting patterns can be used to tell us something new and to make predictions. The process of data mining is composed of several steps including selecting data to analyze, preparing the data, applying the data mining algorithms, and then interpreting and evaluating the results. Sometimes the term data mining refers only to the step in which the data mining algorithms are applied, which has created a fair amount of confusion in the literature. More often, however, the term is used to refer to the entire process of finding and using interesting patterns in data (Benoît, 2002).

Data mining techniques were first applied to databases. A better term for this process is KDD (Knowledge Discovery in Databases). Benoît (2002) offers this definition of KDD (which he refers to as data mining):

Data mining (DM) is a multistaged process of extracting previously unanticipated knowledge from large databases, and applying the results to decision making. Data mining tools detect patterns from the data and infer associations and rules from them. The extracted information may then be applied to prediction or classification models by identifying relations within the data records or between databases. Those patterns and rules can then guide decision making and forecast the effects of those decisions.

Today, data mining usually refers to the process broadly described by Benoît (2002) but without the restriction to databases. It is a multidisciplinary field drawing work from areas including database technology, artificial intelligence, machine learning, neural networks, statistics, pattern recognition, knowledge-based systems, knowledge acquisition, information retrieval, high-performance computing, and data visualization (Han & Kamber, 2001, p. xix).

Data mining techniques can be applied to a wide variety of data repositories including databases, data warehouses, spatial data, multimedia data, Internet or Web-based data, and complex objects. A more appropriate term for describing the entire process would be knowledge discovery, but unfortunately the term data mining is what has caught on (Andrássyová & Paralič, 1999).
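The steps named in this section (selecting data, preparing it, applying an algorithm, and interpreting the results) can be illustrated with a small sketch. The basket data, the function names, and the thresholds below are all hypothetical, and the algorithm is a naive pairwise support and confidence count chosen for brevity. It is not a method taken from any of the cited authors.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: one shopping basket each; None marks a missing value.
RAW_BASKETS = [
    ["bread", "milk", None],
    ["bread", "butter"],
    ["milk", "butter", "bread"],
    ["milk"],
    [],
]


def prepare(baskets):
    """Data preparation: drop missing values, duplicates, and empty records."""
    cleaned = [sorted({item for item in b if item is not None}) for b in baskets]
    return [b for b in cleaned if b]


def mine_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Pattern discovery: find rules 'lhs -> rhs' between pairs of items.

    support    = share of baskets that contain both items
    confidence = share of baskets containing lhs that also contain rhs
    """
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(pair for b in baskets for pair in combinations(b, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


def interpret(rules):
    """Result interpretation: render each rule as a readable statement."""
    return [
        f"{lhs} -> {rhs} (support {s:.2f}, confidence {c:.2f})"
        for lhs, rhs, s, c in rules
    ]


if __name__ == "__main__":
    for line in interpret(mine_rules(prepare(RAW_BASKETS))):
        print(line)
```

The rule "butter -> bread" is an example of knowledge that is not explicit in any single record: it only emerges when the baskets are examined together. Whether such a rule is interesting remains a judgment for the analyst, which is why interpretation is treated as a separate step.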

2. Developmental History of Data Mining and Knowledge Discovery

The building blocks of today's data mining techniques date back to the 1950s, when the work of mathematicians, logicians, and computer scientists combined to create artificial intelligence (AI) and machine learning (Buchanan, 2006).

In the 1960s, AI and statistics practitioners developed new algorithms such as regression analysis, maximum likelihood estimates, neural networks, bias reduction, and linear models of classification (Dunham, 2003, p. 13). The term data mining was coined during this decade, but it was used pejoratively to describe the practice of wading through data and finding patterns that had no statistical significance (Fayyad et al., 1996, p. 40).

Also in the 1960s, the field of information retrieval (IR) made its contribution in the form of clustering techniques and similarity measures. At the time these techniques were applied to text documents, but they would later be utilized when mining data in databases and other large, distributed data sets (Dunham, 2003, p. 13). Database systems focus on query and transaction processing of structured data, whereas information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents (Han & Kamber, 2001, p. 428). By the end of the 1960s, information retrieval and database systems were developing in parallel.

In 1971, Gerard Salton published his groundbreaking work on the SMART Information Retrieval System. This represented a new approach to information retrieval which utilized the algebra-based vector space model (VSM). Vector space models would prove to be a key ingredient in the data mining toolkit (Dunham, 2003, p. 13).

Throughout the 1970s, 1980s, and 1990s, the confluence of disciplines (AI, IR, statistics, and database systems) plus the availability of fast microcomputers opened up a world of possibilities for retrieving and analyzing data. During this time new programming languages and new computing techniques were developed, including genetic algorithms, EM algorithms, K-Means clustering, and decision tree algorithms (Dunham, 2003, p. 13).

By the start of the 1990s, the term Knowledge Discovery in Databases (KDD) had been coined and the first KDD workshop held (Fayyad, Piatetsky-Shapiro, & Smyth, 1996, p. 40). The huge volume of data available created the need for new techniques for handling massive quantities of information, much of which was located in huge databases.

The 1990s saw the development of data warehouses, a term used to describe a large database (composed of a single schema) created from the consolidation of operational and transactional database data. Along with the development of data warehouses came online analytical processing (OLAP), decision support systems, data scrubbing/staging (transformation), and association rule algorithms (Dunham, 2003, p. 13, 35-39; Han & Kamber, 2001, p. 3).

In the 1990s, data mining changed from being an interesting new technology to becoming part of standard business practice. This occurred because the cost of computer disk storage went down, pr