137
pages
English
Documents
2009
Published on
1 January 2009
Language
English
File size
1 MB
VILNIUS GEDIMINAS TECHNICAL UNIVERSITY
INSTITUTE OF MATHEMATICS AND INFORMATICS
ENHANCEMENTS OF PRE-PROCESSING,
ANALYSIS AND PRESENTATION
TECHNIQUES IN WEB LOG MINING
DOCTORAL DISSERTATION
TECHNOLOGICAL SCIENCES,
INFORMATICS ENGINEERING (07T)
Vilnius 2009
The doctoral dissertation was prepared at the Institute of Mathematics and Informatics
in 2003–2009.
Scientific Supervisor
Prof. habil. dr. (Institute of Mathematics and Informatics,
Technological Sciences, Informatics Engineering – 07T).
http://leidykla.vgtu.lt
VGTU leidyklos TECHNIKA 1620-M
ISBN 978-9955-28-429-1
© , ., 2009
© Vilniaus Gedimino technikos universitetas, 2009
VILNIAUS GEDIMINO TECHNIKOS UNIVERSITETAS
MATEMATIKOS IR INFORMATIKOS INSTITUTAS
NAUDOTOJUI TOBULINIMAS
DOCTORAL DISSERTATION
TECHNOLOGICAL SCIENCES,
INFORMATICS ENGINEERING (07T)
Vilnius 2009
The dissertation was prepared at the Institute of Mathematics and Informatics in 2003–2009.
Scientific supervisor
Prof. habil. dr. ,
technological sciences, informatics engineering – 07T).
Abstract
As the Internet becomes an important part of our lives, more attention is paid
to the quality of information and to how it is displayed to the user. The
research area of this work is web data analysis and the methods used to process
this data. Knowledge about users can be extracted from data gathered by web
servers: log files, in which users' navigational patterns are recorded.
The research object of the dissertation is the web log data mining process.
General topics related to this object are web log data preparation methods,
data mining algorithms for prediction and classification tasks, and web text
mining. The key aim of the thesis is to develop methods that improve the
knowledge discovery steps in web log mining and thereby reveal new
opportunities to the data analyst.
While performing web log analysis, it was discovered that insufficient
attention has been paid to the web log data cleaning process. Reducing the
number of redundant records makes the data mining process more effective and
faster. Therefore, a new, original cleaning framework was introduced that
retains only the records corresponding to real user clicks.
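To make the idea of cleaning concrete, the following is a minimal sketch of a generic heuristic filter. It is not the cleaning framework proposed in the dissertation. It assumes log lines in the Common Log Format and treats a record as a probable user click only if it is a successful GET request for something other than an embedded resource. The extension list, the function names and the sample lines are illustrative assumptions.

```python
import re

# Common Log Format: host ident user [time] "METHOD path PROTO" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Assumed heuristic: these extensions are embedded resources, not user clicks.
RESOURCE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def is_probable_click(line: str) -> bool:
    """Return True if a log line looks like a page request made by a user."""
    match = LOG_LINE.match(line)
    if match is None:
        return False  # malformed line, drop it
    # Ignore the query string when checking the file extension.
    path = match["path"].split("?", 1)[0].lower()
    return (
        match["method"] == "GET"
        and match["status"] == "200"
        and not path.endswith(RESOURCE_EXTENSIONS)
    )


def clean_log(lines):
    """Keep only the lines that pass the click heuristic."""
    return [line for line in lines if is_probable_click(line)]


# Hypothetical sample records, for illustration only.
sample = [
    '10.0.0.1 - - [01/Jan/2009:10:00:00 +0200] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.1 - - [01/Jan/2009:10:00:01 +0200] "GET /img/logo.gif HTTP/1.1" 200 900',
    '10.0.0.1 - - [01/Jan/2009:10:00:02 +0200] "GET /missing.html HTTP/1.1" 404 0',
]
cleaned = clean_log(sample)
```

A real cleaning step would likely also need to handle requests from crawlers and frames, which this sketch does not attempt.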
People tend to understand technical information better when it resembles
human language. It is therefore advantageous to use decision trees for mining
web log data, as they generate web usage patterns in the form of rules that are
understandable to humans. However, it was discovered that the length of users'
browsing histories varies, so specific data preparation is needed to compose
the fixed-length data vectors required by the algorithm. The necessary data
preparation steps are described, and classification and prediction tasks were
then applied to generate web usage models that could contribute to web site
refinement.
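One common way to obtain fixed-length vectors from browsing histories of different lengths is a sliding window with padding, sketched below. This illustrates the general problem and is not necessarily the preparation procedure used in the dissertation. The window size, the `PAD` placeholder, the function name and the page paths are assumptions made for the example.

```python
PAD = "<none>"  # hypothetical placeholder for missing history


def session_to_vectors(pages, window=3):
    """Turn one variable-length session into fixed-length (history, next_page) rows."""
    rows = []
    for i in range(1, len(pages)):
        # Take at most `window` pages visited before position i.
        history = pages[max(0, i - window):i]
        # Left-pad so that every history has exactly `window` items.
        history = [PAD] * (window - len(history)) + history
        rows.append((history, pages[i]))
    return rows


# Hypothetical session of five page views.
rows = session_to_vectors(
    ["/home", "/news", "/news/1", "/contact", "/about"], window=3
)
```

Each row pairs a history of exactly `window` pages with the page visited next, so rows from sessions of any length can be stacked into one table and, after encoding the page identifiers, passed to a decision tree learner.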
Finally, it was shown that a specific part of the text can be a valuable source
of information: the hyperlink text. A method was suggested, with the steps
required, for using hyperlink text together with other features. Experiments
demonstrated that user behaviour is described more accurately when text is used
as an additional feature. In addition, hyperlink text can be used in the
results presentation step, as it represents the actual text that users see
when clicking hyperlinks.
The main results of this dissertation were presented in 5 scientific
publications: two articles in periodical scientific publications from the
Master Journal List of the Institute for Scientific Information (Thomson ISI
Web of Science), one in a refereed journal published by IOS Press, and two
scientific papers presented and published at international refereed
conferences.
Summary
[...] is presented. [...] and [...]. This requires [...], in which [...]
information is recorded [...] mining, and [...] topics: [...] for prediction
and classification tasks [...]. The main aim of the dissertation – [...] and
methodologies.
The research [...]. It is shown that [...] a new [...] method was created [...].
During the research it was found that [...] – once fixed-length vectors have
been formed, it is expedient [...] needs.
[...] the use of [...] text. It is shown that [...] more accurate results can
be achieved. An improvement of the [...] stage, in which [...] are presented in
a more understandable form.
[...] in 5 scientific publications: 3 articles published: two – [...]
Information [...] Thomson ISI Web of Science [...], one – in a refereed IOS
Press publication, two – at [...] conferences.
Contents
INTRODUCTION
Problem under Investigation
Topicality of the Research Work
Research Object
The Aim of the Work
Tasks of the Work
Applied Methods
Scientific Novelty and its Importance
Practical Value of the Work Results
Statements Presented for Defence
Approval of the Work Results
The Scope of the Scientific Work
1. WEB DATA MINING ANALYSIS
1.1. Knowledge Discovery from Huge Databases
1.2. Knowledge Discovery from the Web
1.3. Web Mining Taxonomy
1.3.1. Web Structure Mining
1.3.2. Web Usage Mining
1.3.3. Web Content Mining
1.4. Web Data Collection Sources
1.4.1. Marketing Data
1.4.2. Web Server Data
1.4.3. Topological Information
1.4.4. Unstructured Web Data
1.4.5. Semi-Structured Web Data
1.4.6. Structured Data
1.5. KDD Steps Using Web Log Data
1.6. Web Log Data Collection
1.7. Web Log Data Pre-Processing Steps
1.7.1. Feature Selection
1.7.2. Data Cleaning
1.7.3. Unique User Identification ...