Jelena MAMČENKO

DATA MINING TECHNOLOGIES FOR DISTRIBUTED SERVERS’ EFFICIENCY

Summary of Doctoral Dissertation
Technological Sciences, Informatics Engineering (07T)

Vilnius 2008

VILNIUS GEDIMINAS TECHNICAL UNIVERSITY

The doctoral dissertation was prepared at Vilnius Gediminas Technical University in 2002–2008. The dissertation is defended as an external work.

Scientific Consultant: Prof Dr Habil Genadijus KULVIETIS (Vilnius Gediminas Technical University, Technological Sciences, Informatics Engineering – 07T).

The dissertation is being defended at the Council of the Scientific Field of Informatics Engineering at Vilnius Gediminas Technical University:

Chairman: Prof Dr Habil Petras Gailutis ADOMĖNAS (Vilnius Gediminas Technical University, Technological Sciences, Informatics Engineering – 07T).
VILNIUS GEDIMINAS TECHNICAL UNIVERSITY

Jelena MAMČENKO

APPLICATION OF DATA MINING TECHNOLOGIES TO IMPROVE THE PERFORMANCE OF DISTRIBUTED SERVERS

Summary of Doctoral Dissertation
Technological Sciences, Informatics Engineering (07T)
General Characteristic of the Dissertation

Topicality of the problem. The huge amount of data collected in recent years has become a major problem all over the world. The average quantity of data stored in organizations amounts to 3–10 terabytes. Data has to be analyzed in order to extract meaningful and useful information. Traditional analytical tools can only be used for superficial statistical analysis; they cannot provide complete information on sales, client needs or consumer behavior in certain areas. Comprehensive data analysis can only be performed using modern data mining technology. When analyzing data, problems may arise due to the structuring, collection and transformation of various types of data.

The collection of large data quantities became possible after the introduction of new computer technologies, namely the invention of significantly smaller and more powerful processors as well as the capabilities of parallel processing. The interaction between these two factors and the unprecedented growth of the World Wide Web has encouraged many business organizations and academic institutions to pay attention to data mining technologies.

However, academic institutions, in Lithuania as well as worldwide, have not progressed much in analyzing the data in their possession. Literature and the internet abound with examples of various companies that have successfully utilized data mining methods. On the other hand, it is rather difficult to determine what usability or advantage a university would enjoy by applying data mining methods for teaching or other purposes.

Data mining technology in Lithuania is still in its infancy. Despite the fact that it is widely used worldwide, particularly in the United States of America and West European countries, Lithuanian commercial entities and organizations are not familiar with the opportunities provided by the technology. The main reason is the high cost of the technology.
According to global practice, these technologies pay off in only six months. To achieve a valuable and meaningful result, human intervention is necessary: interpretation of the results has to be performed by an expert; otherwise the use of this technology would be ineffective. Success depends mostly on the interpretation of the results and their application in business processes and decision making. This dissertation concentrates on the analysis of distributed log file data, web data collection, data structuring and distributed servers.

In contrast to the business world, data mining technology is well known to and widely analyzed by scientists.
Well known in Lithuania and recognized abroad, active scientists such as G. Dzemyda, A. Žilinskas, Š. Raudys and others have greatly influenced the field of data mining, the algorithms used and their improvements.

Aim and tasks of the work. The main goal of this work is to carry out research on data mining technologies and their applications and the possibilities for their use, to compare currently used software, and to integrate the data of a document database into a data mining system for analysis. To achieve the formulated goals, the following problems have to be resolved:

1. To perform a literature analysis and to evaluate the data mining research methods in use, taking into account the size, variety and structure of data.
2. To analyze the development of data mining technologies.
3. To design and implement a data mining system for data collection and analysis.
4. To analyze telecommunications data, solving an actual problem of fraudulent activities.
5. To perform an analysis of the effectiveness of distributed servers using the newly created transformation method and data mining techniques.

Scientific novelty. The designed model is realized in a study environment. The created method transforms data in real-time mode from document databases (document-based model) into a data warehouse using agent technology. The methodology of the research includes a document database model, agent technology and data mining methods.

Practical value. The collected log files of distributed servers are used for the Master degree course “Data Mining Technologies” (FMITM03035) at the Information Technologies Department. The results achieved have been used for server reconfiguration and data replication. The work completed has been approved and appraised in the NorFa (Nordisk Forskerutdanningsakademi) International WIM (Wireless Information Network) project during 2002–2005.

Defended propositions

1. A new methodology for the transformation of document databases, dedicated to the application of data mining technologies.
2. A data mining based assessment of the e-learning system’s users and their behavior, the results of which will be used in the enhancement of the functioning of distributed servers.

The scope of the scientific work. The work consists of the general description, five chapters, general conclusions, the list of references (141 items), the list of publications and an appendix. The total volume of the dissertation is 127 pages, with 47 figures and 29 tables.

The first chapter covers an analysis of existing scientific publications related to the problems of the thesis. The general methods used in data mining technologies are presented, together with their benefits and limitations and a comparison of data mining and statistical analysis. At the end of the chapter the conclusions and final tasks of the thesis are concretized.

In the second chapter the historical view, sources and components of data mining technologies are investigated. The technology of the data warehouse, as an important part of data mining, is reviewed. Additionally, this chapter compares the data mining tools whose vendors are market leaders in data analysis.

The third chapter concentrates on fraud detection in telecoms using data mining techniques. The problem is solved and some solutions to avoid such situations are proposed.

The fourth chapter covers the investigation of the application and implementation of data mining to a document-based database management system, which is described. Data transformation and integration from heterogeneous sources into a data warehouse using agent technology are presented in the middle of the chapter. A model of the laboratory environment realized in the study framework was also created.

In the fifth chapter a data mining application for distributed servers, which are based on a document database distributed system, is presented.
Clusterization, association and classification techniques helped to discover some interesting patterns and to solve the main aim and tasks of the work.

In the general conclusions the main theoretical and practical results of the work and their significance are formulated.

1 Mathematical Methods in Data Mining

Data mining technology is widespread in the world, while in Lithuania it is best known in scientific circles, not in business and other fields.
The work of the Lithuanian scientists J. Mockus (MII), G. Dzemyda (MII), A. Žilinskas (MII), Š. Raudys (MII), E. Zavadskas (VGTU), S. Turskienė (ŠU) and H. Pranevičius (KTU) reflects the interest in these technologies and their application not only in science but also in everyday life. The Lithuanian scientists noted above are active in the field of data analysis: they solve optimization problems (J. Mockus) and the problems of visualizing multidimensional data (G. Dzemyda, A. Žilinskas). Much attention is also paid to the methods used for data mining: neural networks, genetic and other algorithms.

Association rules. The process of finding association rules is also known as market basket analysis. As an example of an association rule we can consider the case of a supermarket. An association rule might then be of the following form: “If a customer buys shampoo, he buys a conditioner as well”. In other words, it is of the form X → Y, where X and Y are items or sets of items from the supermarket’s database. In the case of Vilnius Gediminas Technical University, an example of an association rule might be that if a student wishes to do business studies, then with a probability of 90% he chooses VGTU. The most used algorithms are the Apriori, Partition and Eclat algorithms.

Clustering. Clustering can be defined as the task of grouping together similar items in a dataset. As an example, we can consider a bank’s database. This database is huge in size and it would be much better for an executive to have a picture of the customers divided into smaller groups; for instance, those who pay their balance on time and those who do not. The problem, however, is that we cannot determine in advance the characteristics we would like each group to have; the clustering is done first and then we examine the clusters to see what their members have in common.
Another example, coming again from the university domain: once we perform the clustering we might end up observing a cluster full of students coming from a certain postcode; it might be interesting to investigate this further. The most commonly used method is the k-means method. It is based on iteration: data points are assigned to clusters according to their distance from each cluster until cluster membership no longer changes between two consecutive iterations.

Decision trees. This method can be applied to the solution of classification tasks only, which limits its applicability in many fields. For example, in financial applications the most common problem is the task of predicting the values of some numerical variable. As a result of applying this method to a training set, a hierarchical structure of classifying rules of the type “IF...THEN...” is created. This structure has the form of a tree. In order to decide to which class an object or a situation should be assigned, one has to answer the questions located at the tree nodes, starting from the root.
These questions are of the form “Is the value of variable A greater than x?”. If the answer is yes, one follows the right branch of the tree to a node of the next level; if the answer is no, the left branch. Then the question at that node should be answered, and so on. Following this procedure one eventually comes to one of the final nodes (called leaves), where a conclusion is found as to which class the considered object should be assigned. The most used algorithms are CART, C4.5 and CHAID.

Neural networks. This is a large class of diverse systems whose architecture to some extent imitates the structure of live neural tissue built from separate neurons. One of the most widespread architectures, the multilayered perceptron with back propagation of errors, emulates the work of neurons incorporated in a hierarchical network, where the input of each neuron of the next (narrower) layer is connected with the outputs of all neurons of the previous (wider) layer. The analyzed data are treated as neuron excitation parameters and are fed to the inputs of the first layer. These excitations of the lower-layer neurons are propagated to the next-layer neurons, being amplified or weakened according to the weights (numerical coefficients) ascribed to the corresponding interneural connections. As the final result of this process, the single neuron comprising the topmost neuron layer acquires some value (excitation strength), which is considered to be the prediction: the reaction of the whole network to the processed data. In order to make meaningful predictions, a neural network first has to be trained on data describing previous situations for which both the input parameters and the correct reactions to them are known. Training consists of selecting the weights ascribed to interneural connections that provide the maximal closeness of the reactions produced by the network to the known correct reactions.
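The forward propagation just described can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's implementation; the network shape, weights and input values below are hypothetical, chosen only to show how the outputs of one layer feed every neuron of the next:

```python
import math

def sigmoid(x):
    # Squashing activation: the propagated excitation is mapped into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights):
    # weights[j][i] is the coefficient of the connection from input i
    # to neuron j of the next layer; each neuron sums all weighted inputs.
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)))
            for neuron_w in weights]

# A small 3-2-1 perceptron: 3 inputs, a hidden layer of 2 neurons,
# and a single topmost neuron whose value is the prediction.
hidden_weights = [[0.5, -0.6, 0.1],
                  [0.8, 0.4, -0.2]]
output_weights = [[1.2, -0.7]]

x = [0.3, 0.9, 0.5]                    # excitation parameters (first layer)
hidden = layer_forward(x, hidden_weights)
prediction = layer_forward(hidden, output_weights)[0]
print(prediction)                      # the network's reaction to the data
```

Training by back propagation would then adjust `hidden_weights` and `output_weights` to bring `prediction` as close as possible to the known correct reactions.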
Genetic algorithms. The idea behind genetic algorithms is more or less the same as that behind neural networks: we want to mimic the way nature works. In nature we find a number of different combinations of genes; they “interact” with each other in three ways – crossover, mutation and selection – and the fittest survive to the next generation. This is the idea behind evolutionary development, by which organisms survive by adapting to their environment. In computing, of course, the word “fittest” has a different meaning and usually refers to an optimization function.

Regression analysis. Regression analysis is the mathematical process by which we try to find the line of “best fit” between two variables: X, which usually denotes the independent variable, i.e. its values do not depend on Y, and Y, which usually denotes the dependent variable whose values we want to predict; for these only a set of data is available.
We might use it to predict the server load with respect to the files that we are going to transfer over that server.

Summarizing this part, it can be claimed that the aim of the scientific research is to perform data mining technology analysis and evaluation, to create and apply an automated tool to transfer data from a document database into a relational database, to create a data warehouse to collect and store data, to create a data mining system and integrate it into an e-learning system, and to perform an analysis of the functioning of distributed servers.

2 The Development of Data Mining Technologies

Huge data storage gave rise to data mining technology. Data mining is a part of knowledge discovery in databases: a semi-automatic process of pattern recognition and/or finding relations inside very large databases. In various literature sources we can find many definitions of data mining and synonyms for the technology, such as Knowledge Extraction, Data/Pattern Analysis, Data Archaeology or Data Dredging. The roots of data mining technology are in machine learning, artificial intelligence algorithms and statistics (Fig 1). With powerful computers, networks and database systems it is now possible to handle vast data storage and analyze it.
Fig 1. A historical view of data mining

Data mining has existed for less than 15 years, and its origins can be traced to the early developments in artificial intelligence in the 1950s. During this period, developments in pattern recognition and rule-based reasoning were providing the fundamental building blocks on which data mining was to be based.