Mohamed Quafafou Professor of Computer Science : (Networked) Data Mining, and Web Intelligence
[Aix-Marseille University] Research Lab [LSIS], Engineering School [ESIL]

Web Data/Information: Extraction, Fusion and Mining

Key words : Version Space, Boolean Function, Linguistic Negation, Feature Reduction, Inductive Logic Programming, Vagueness and Uncertainty, Rough Sets.

>> Web Information Extraction

With the growth of the Web, many on-line sources have appeared, such as on-line address books, real estate sites and e-commerce sites. However, these data sources are designed to be accessed and viewed by human users. Although rich in content, their pages are in a presentational format, which makes automated machine access difficult. Providing such machine access opens the door to many applications, such as allowing intelligent agents to make use of Web sources or including Web sources in data mediation systems. To provide this access, two major problems need to be resolved. First, it must be possible to extract the information contained in the result documents and put it into a machine-understandable format. Second, the machine must know how to access the source, i.e. how to build queries the source will understand, where to post the queries, how to navigate through the result pages, etc.

To resolve the first problem, we propose a method in which the user specifies the information to extract by giving example instances of it. The contexts in which these instances occur are searched for in the result pages and generalized, which makes it possible to extract unseen instances. Compared with methods from the literature, this method extracts the desired information precisely without requiring example pages to be fully labeled by hand.

We also propose a solution to the second problem, allowing the machine to access a source. By studying multiple on-line sources, we put forward a set of recurrent operators whose parameter settings and combination make it possible to access a source. We propose a language, WETDL, to describe the operators and their combination. We also give algorithms for executing such a description and thereby carrying out an extraction task.
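The example-based extraction idea described above can be illustrated with a minimal Python sketch. It is not the method of the cited work: the generalization step used here (longest common suffix of the left contexts, longest common prefix of the right contexts) is a deliberately simple assumption, and the function names `learn_pattern` and `extract`, the `window` parameter and the sample page are all hypothetical.

```python
import re
from os.path import commonprefix  # character-wise common prefix of a list of strings


def common_suffix(strings):
    """Longest suffix shared by all strings."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def learn_pattern(pages, examples, window=20):
    """Generalize the contexts around known example instances into a regex.

    Assumption: the text immediately before and after each instance
    (up to `window` characters) acts as a delimiter.
    """
    lefts, rights = [], []
    for page in pages:
        for example in examples:
            for match in re.finditer(re.escape(example), page):
                lefts.append(page[max(0, match.start() - window):match.start()])
                rights.append(page[match.end():match.end() + window])
    if not lefts:
        raise ValueError("no example instance occurs in the pages")
    left = common_suffix(lefts)   # generalized left context
    right = commonprefix(rights)  # generalized right context
    if not left or not right:
        raise ValueError("the contexts share no common delimiter")
    return re.compile(re.escape(left) + r"(.+?)" + re.escape(right))


def extract(pattern, page):
    """Return every string that appears in the learned context."""
    return pattern.findall(page)


# Hypothetical result page from an on-line address book.
page = (
    "<ul><li><b>Alice Martin</b> 0491</li>"
    "<li><b>Bob Durand</b> 0492</li>"
    "<li><b>Chloe Petit</b> 0493</li></ul>"
)
pattern = learn_pattern([page], ["Alice Martin", "Bob Durand"])
print(extract(pattern, page))
```

Two example names are enough here for the learned context to also match the third, unseen name. A real wrapper would need a more tolerant generalization than exact common substrings, since the right context learned in this sketch also absorbs the shared digits of the phone numbers.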

>> Web Data Fusion

>> MashUp

>> Profiling and Recommendation

Internet User Behavior (pdf "TSMC2006.pdf"): With an ever-increasing emphasis on human activity (idea exchange, shopping, gaming, etc.) being mediated through the data network, the understanding of Internet users’ behavior has become a rising challenge. Research dealing with the analysis and modeling of Internet user behavior can be roughly split into two main approaches. The first is based on sociocognitive observation of users’ practices in a standardized context. The second approach focuses on the analysis of productions and the traces of users’ activity. This paper relates to the latter approach and presents a comparative analysis of Internet navigation traces (URLs versus keywords) to characterize the behavior of individual users or groups of users when accessing the Web. The proposed models are based on the study of access redundancy, seen both as a global static parameter and from the angle of its evolution over time. We also study the use of these models, in particular to categorize a population of users into communities of interest. This study enables us to draw some conclusions on the compared performance of the two kinds of trace exploitation, as raw information, as well as on the self-similar properties of the models.

Recommendation System Based on Categorical Clusters (pdf "KES2003.pdf"): We propose in this paper a recommendation system based on a new cluster discovery method that allows a user to belong to several clusters, in order to capture his different centres of interest. Our system takes advantage of both content-based and collaborative recommendation approaches. The system is evaluated using proxy server logs, and encouraging results were obtained.
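The idea of letting one user belong to several clusters can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm of the paper: it assumes each page has a known category (the content-based part), puts a user in every category cluster that holds at least `min_share` of his accesses, and scores unseen pages by how many fellow cluster members visited them (the collaborative part). The names `overlapping_clusters`, `recommend`, `min_share` and the toy logs are hypothetical.

```python
from collections import Counter, defaultdict


def overlapping_clusters(logs, categories, min_share=0.3):
    """Put each user in every category holding at least min_share of their accesses."""
    clusters = defaultdict(set)
    for user, pages in logs.items():
        counts = Counter(categories[page] for page in pages)
        total = sum(counts.values())
        for category, n in counts.items():
            if n / total >= min_share:
                clusters[category].add(user)
    return clusters


def recommend(user, logs, categories, clusters, top_n=3):
    """Recommend unseen pages visited by other members of the user's clusters."""
    seen = set(logs[user])
    scores = Counter()
    for category, members in clusters.items():
        if user not in members:
            continue
        for other in members - {user}:
            for page in set(logs[other]):
                # Only count pages that match the cluster's category.
                if page not in seen and categories[page] == category:
                    scores[page] += 1
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda page: (-scores[page], page))[:top_n]


# Hypothetical proxy-log extract: user -> visited pages, page -> category.
categories = {"a1": "sport", "a2": "sport", "a3": "sport",
              "b1": "music", "b2": "music"}
logs = {
    "u1": ["a1", "b1"],        # two centres of interest
    "u2": ["a1", "a2"],
    "u3": ["b1", "b2", "b2"],
    "u4": ["a2", "a3"],
}
clusters = overlapping_clusters(logs, categories)
print(recommend("u1", logs, categories, clusters))
```

In this toy data, user `u1` falls in both the sport and the music cluster, so the recommendations draw on both groups of neighbours instead of only the dominant one.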