*********************************************************************

****************************  Special Date *****************************

*********************************************************************

 

                                                           Seminar

             Department of Systems Engineering and Engineering Management

                                 The Chinese University of Hong Kong

 

---------------------------------------------------------------------------------------------

 

Title:  Graph-Theoretic Methods for Web Document Categorization

 

Speaker:  Dr. Mark Last

                Department of Information Systems Engineering

                Ben-Gurion University of the Negev

                Beer-Sheva, Israel

Date     :   December 15th, 2006 (Friday)

Time    :   4:30 p.m. - 5:30 p.m.

Venue  :   Room 513

                MMW Engineering Building (Engineering Building Complex Phase 2)

                CUHK

 

---------------------------------------------------------------------------------------------

 

Abstract:

 

Most web document categorization methods are based on the

vector-space model of information retrieval. However, this popular

method of document representation does not capture important

structural information, such as the order and the proximity of

word occurrence or the location of a word within a document.

It also makes no use of the mark-up information that can easily

be extracted from the web document HTML tags.

 

In this talk, we will present recently developed graph-based

web document representation models, which preserve this structural

information. The new models are shown to outperform the traditional

vector representation using the k-Nearest Neighbor (k-NN)

classification algorithm. Since eager (model-based) classifiers

cannot work with the graph-based representation directly, we have

also developed a hybrid approach to web document representation,

built upon both graph and vector space models, thus preserving the

benefits and overcoming the limitations of each. The hybrid methods

have been compared to vector-based models using the C4.5 decision-tree

and the probabilistic Naive Bayes classifiers on several benchmark

web document collections.  The results demonstrate that the hybrid

methods outperform, in most cases, existing approaches in terms of

classification accuracy, and in addition, achieve a significant

reduction in the classification time.  If time permits, we will also

discuss a cyber intelligence application of the proposed methodology.

 

-------------------------------------------------------------------------------------------

 

Biography:

 

Mark Last received his M.Sc. (1990) and Ph.D. (2000) degrees in

Industrial Engineering from Tel Aviv University, Israel.  He is

currently a Senior Lecturer at the Department of Information Systems

Engineering, Ben-Gurion University of the Negev, Israel. Prior to

that, he was a Visiting Research Scholar at the National Institute

for Systems Test and Productivity, University of South Florida,

USA (Summer 2001, Summer 2002, Summer 2003), a Visiting Assistant

Professor at the Department of Computer Science and Engineering,

University of South Florida, USA (1999-2001), a Senior Consultant

in Industrial Engineering and Computing (1994-1998), and the Head

of the Production Control Department at AVX Israel (1989-1994).

 

Mark Last has published over 100 papers and chapters in journals,

books, and conferences. He is a co-author of the books

"Knowledge Discovery and Data Mining - The Info-Fuzzy Network

(IFN) Methodology" (Kluwer, 2000) and "Graph-Theoretic Techniques

for Web Content Mining" (World Scientific, 2005) and a co-editor

of five books including "Fighting Terror in Cyberspace"

(World Scientific, 2005). His current research interests include

data mining, software assurance, and cyber intelligence.

Mark Last is an Associate Editor of IEEE Transactions on Systems,

Man, and Cybernetics - Part C (since February 2004) and a Senior

Member of the IEEE (since September 2006).

 

***********************  ALL ARE WELCOME  ************************

 

Host     : Professor Yang, Christopher Chuen Chi

Tel      : (852) 2609-8239

Email    : yang@se.cuhk.edu.hk

 

Enquiries: Peixiang Zhao or Jeffrey Xu Yu,

                 Department of Systems Engineering and Engineering Management

                 CUHK

Website:   http://www.se.cuhk.edu.hk/~seg5810

Email:       seg5810@se.cuhk.edu.hk

 

******************************************************************