*********************************************************************
**************************** Special Date *****************************
*********************************************************************
Seminar
Department of Systems Engineering and Engineering Management
The
---------------------------------------------------------------------------------------------
Title: Graph-Theoretic Methods for Web Document Categorization
Speaker: Dr. Mark Last
Department of Information Systems Engineering
Date : December 15th, 2006 (Friday)
Time : 4:30 p.m. - 5:30 p.m.
Venue : Room 513
CUHK
---------------------------------------------------------------------------------------------
Abstract:
Most web document categorization methods are based on the
vector-space model of information retrieval. However, this popular
method of document representation does not capture important
structural information, such as the order and the proximity of
word occurrence or the location of a word within a document.
It also makes no use of the mark-up information that can easily
be extracted from the web document HTML tags.
In this talk, we will present recently developed graph-based web
document representation models, which preserve this structural
information. With the k-Nearest Neighbor (k-NN) classification
algorithm, the new models are shown to outperform the traditional
vector representation. Since eager (model-based) classifiers
cannot work with the graph-based representation directly, we have
also developed a hybrid approach to web document representation,
built upon both graph and vector space models, thus preserving the
benefits and overcoming the limitations of each. The hybrid methods
have been compared to vector-based models using the C4.5 decision-tree
and the probabilistic Naive Bayes classifiers on several benchmark
web document collections. The results demonstrate that the hybrid
methods outperform, in most cases, existing approaches in terms of
classification accuracy, and in addition, achieve a significant
reduction in the classification time. If time permits, we will also
discuss a cyber intelligence application of the proposed methodology.
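
As a rough illustration of the idea in the abstract, the sketch below
contrasts a bag-of-words vector with a simple word graph and runs k-NN
over graph distances. It is only a minimal sketch under assumptions: the
graph construction (one node per unique word, a directed edge between
adjacent words), the distance (one minus the shared nodes and edges
divided by the size of the larger graph), and all function names are
illustrative choices, not the models presented in the talk. HTML mark-up
information is not modeled here.

```python
from collections import Counter


def bag_of_words(text):
    """Vector-space view: word counts only, word order is lost."""
    return Counter(text.lower().split())


def word_graph(text):
    """Assumed graph view: nodes are unique words, and a directed
    edge (a, b) records that b directly follows a in the text."""
    words = text.lower().split()
    nodes = set(words)
    edges = set(zip(words, words[1:]))
    return nodes, edges


def graph_distance(g1, g2):
    """Assumed distance: 1 - |shared nodes and edges| / |larger graph|."""
    n1, e1 = g1
    n2, e2 = g2
    common = len(n1 & n2) + len(e1 & e2)
    largest = max(len(n1) + len(e1), len(n2) + len(e2))
    return 1.0 - common / largest if largest else 0.0


def knn_classify(query, labeled_docs, k=3):
    """Majority vote among the k training documents whose word graphs
    are closest to the query's word graph."""
    g = word_graph(query)
    ranked = sorted(labeled_docs,
                    key=lambda doc: graph_distance(g, word_graph(doc[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    a, b = "dog bites man", "man bites dog"
    # Same bag of words, but the word graphs differ in their edges.
    print(bag_of_words(a) == bag_of_words(b))
    print(graph_distance(word_graph(a), word_graph(b)))

    training = [("dog bites man", "news"),
                ("man walks dog", "pets"),
                ("dog bites boy", "news")]
    print(knn_classify("dog bites girl", training, k=3))
```

The two example sentences have identical word counts, so a vector-space
model cannot tell them apart, while their word graphs share no edges and
therefore have a non-zero distance.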
-------------------------------------------------------------------------------------------
Biography:
Mark Last received his M.Sc. (1990) and Ph.D. (2000) degrees in
Industrial Engineering from
currently a Senior Lecturer at the Department of Information Systems
Engineering,
that, he was a Visiting Research Scholar at the National Institute
for Systems Test and Productivity,
Professor at the Department of Computer Science and Engineering,
in Industrial Engineering and Computing (1994-1998), and the Head
of Production Control Department at
AVX
Mark Last has published over 100 papers and chapters in journals,
books, and conferences. He is a co-author of the books
"Knowledge Discovery and Data Mining – The Info-Fuzzy Network
(IFN) Methodology" (Kluwer 2000) and "Graph-Theoretic Techniques
for Web Content Mining" (World Scientific, 2005) and a co-editor
of five books including "Fighting Terror in Cyberspace"
(World Scientific, 2005). His current research interests include
data mining, software assurance, and cyber intelligence.
Mark Last is an Associate Editor of IEEE Transactions on Systems,
Man, and Cybernetics - Part C (since February 2004) and a Senior
Member of the IEEE (since September 2006).
*********************** ALL ARE WELCOME ************************
Host : Professor Yang, Christopher Chuen Chi
Tel : (852) 2609-8239
Email : yang@se.cuhk.edu.hk
Enquiries: Peixiang Zhao or Jeffrey Xu Yu,
Department of Systems Engineering and Engineering Management
CUHK
Website: http://www.se.cuhk.edu.hk/~seg5810
Email: seg5810@se.cuhk.edu.hk
******************************************************************