This work investigates approximation and emergence in data mining.
We propose different approximation operators based on a generalized rough set theory (the classical operators are sketched after the list below) and develop algorithms to approximate
version spaces,
graph dependencies in text mining,
data cubes,
Boolean functions,
linguistic negation, and
formal concepts.
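To make the rough set machinery concrete, here is a minimal Python sketch of the two classical approximation operators over a finite universe partitioned into indiscernibility classes; the generalized operators studied in this work are variations on this idea and are not reproduced here. The partition and target set are illustrative.

```python
# Minimal sketch of the classical rough set approximation operators,
# assuming the universe is partitioned into indiscernibility classes.

def lower_approximation(classes, X):
    """Union of the indiscernibility classes entirely contained in X."""
    return {x for c in classes if c <= X for x in c}

def upper_approximation(classes, X):
    """Union of the indiscernibility classes that intersect X."""
    return {x for c in classes if c & X for x in c}

# Illustrative universe {1..6} partitioned by some attribute, target set X
classes = [{1, 2}, {3, 4}, {5, 6}]
X = {1, 2, 3}
print(lower_approximation(classes, X))  # {1, 2}: certainly in X
print(upper_approximation(classes, X))  # {1, 2, 3, 4}: possibly in X
```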
We are currently working on the approximation of minimal transversals of hypergraphs and of borders of frequent and/or emergent patterns.
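For illustration, the sketch below enumerates the minimal transversals of a small hypergraph by brute force. It is exponential and only meant to fix the definition; it is not one of the approximation algorithms under development.

```python
from itertools import combinations

def minimal_transversals(vertices, edges):
    """Brute-force enumeration of the minimal transversals (minimal
    hitting sets) of a hypergraph: vertex sets that meet every hyperedge
    and are inclusion-minimal. Only usable on small instances."""
    vertices = sorted(vertices)
    hitting = [set(c) for r in range(1, len(vertices) + 1)
               for c in combinations(vertices, r)
               if all(set(c) & e for e in edges)]
    # Keep only the inclusion-minimal hitting sets
    return [t for t in hitting if not any(s < t for s in hitting)]

# Triangle hypergraph: every pair of vertices is a minimal transversal
edges = [{1, 2}, {2, 3}, {3, 1}]
print(minimal_transversals({1, 2, 3}, edges))
# [{1, 2}, {1, 3}, {2, 3}]
```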
Experts play a crucial role in supervised learning: they annotate the training datasets that are then fed to the learning process. The basic assumption behind this paradigm is that experts estimate the true labels correctly. Unfortunately, establishing the ground truth remains a hard problem, as the average expert of a specific domain has many zones of ignorance. Furthermore, experts systematically annotate all the observed data, even when they hesitate or do not know how to annotate the instance at hand. This problem becomes crucial in the context of learning from multiple annotators. The key to successful supervised learning, especially in domains based on dynamic knowledge, is therefore how to effectively exploit the ignorance of experts to improve the learning process.
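As a toy illustration of exploiting ignorance rather than forcing annotations, the sketch below aggregates labels from multiple annotators while letting an expert abstain (None). The aggregation rule, a majority vote with abstention-aware confidence, is an assumption of this sketch, not the method developed in this work.

```python
from collections import Counter

def aggregate_labels(annotations):
    """Aggregate labels from multiple annotators, letting experts abstain
    (None) instead of forcing a guess. Returns (label, confidence) per
    instance; abstentions lower the confidence."""
    aggregated = []
    for votes in annotations:              # one list of labels per instance
        cast = [v for v in votes if v is not None]
        if not cast:                       # everyone abstained: leave unlabeled
            aggregated.append((None, 0.0))
            continue
        label, count = Counter(cast).most_common(1)[0]
        aggregated.append((label, count / len(votes)))
    return aggregated

# Three experts; the second abstains on the second instance
print(aggregate_labels([["spam", "spam", "ham"], ["ham", None, "ham"]]))
# both instances labeled with confidence 2/3
```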
With the growth of the Web, many on-line sources such as on-line address books, real estate sites, e-commerce sites, etc. have appeared. These data sources, however, are intended to be accessed and viewed by human users: while content-rich, their pages are in a presentational format, which makes automated machine access difficult. Providing such access opens the door to many applications, such as letting intelligent agents make use of Web sources or including Web sources in data mediation systems. To provide this access, two major problems need to be solved. First, the information contained in the result documents must be extracted and put into a machine-understandable format. Second, the machine must know how to access the source, i.e. how to build queries the source will understand, where to post the queries, how to navigate through the result pages, etc.
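The two problems can be illustrated by a toy wrapper; the URL, query parameters, and markup pattern below are invented for illustration and do not correspond to any real source.

```python
import re
from urllib.parse import urlencode

def build_query_url(city, max_price):
    # Second problem: know where to post and how to encode a query
    # (hypothetical endpoint and parameter names)
    params = urlencode({"city": city, "price_max": max_price})
    return f"https://realestate.example.com/search?{params}"

def extract_listings(html):
    # First problem: turn presentational markup into machine-readable records
    pattern = re.compile(
        r'<div class="listing">\s*<h2>(?P<title>.*?)</h2>\s*'
        r'<span class="price">(?P<price>[\d,]+)</span>', re.S)
    return [m.groupdict() for m in pattern.finditer(html)]

page = '<div class="listing"><h2>2-room flat</h2><span class="price">850</span></div>'
print(build_query_url("Paris", 900))
print(extract_listings(page))  # [{'title': '2-room flat', 'price': '850'}]
```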
This allows us to build peer-to-peer semantically structured communities called cSON (Community Semantic Overlay Network).
This raises many questions concerning how communities are built and how they operate to improve performance (response time, number of messages, precision, and recall). To build communities, we study two alternatives: (1) semantic mediation: community building is based on the semantic links between super-peers and the confidence they have in one another; and (2) clustering: a clustering algorithm, based on the analysis of the queries processed by the super-peers, drives community building. We then propose two methods to compute the characterizations of communities, drawn from two research fields: (1) data mining: we characterize each community using knowledge extracted from the queries processed by the super-peers of that community, called CK (Community Knowledge); and (2) hypergraphs: unlike the previous method, the goal here is to characterize the communities collectively; we formalize this problem as the search for MCS (Minimal Covering Shortcuts), i.e. a minimal set of shortcuts between super-peers that covers all the communities. Finally, we develop two query-routing methods, CK-routing and MCS-routing, which use the community knowledge and the MCS respectively to identify the super-peers that may process a given query.
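As a toy illustration of the CK-routing idea, under the assumption that a community's knowledge is represented as a set of terms extracted from the past queries of its super-peers, the sketch below routes a query to the community whose knowledge best overlaps the query's terms. The similarity measure and community contents are illustrative, not the thesis' exact algorithm.

```python
def jaccard(a, b):
    """Overlap between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def ck_route(query_terms, community_knowledge):
    """Route a query to the community whose extracted knowledge (CK)
    best matches the query's terms."""
    return max(community_knowledge,
               key=lambda c: jaccard(query_terms, community_knowledge[c]))

# Hypothetical CK built from the past queries of each community's super-peers
ck = {
    "music":  {"mp3", "album", "artist", "lyrics"},
    "movies": {"film", "actor", "trailer", "subtitles"},
}
print(ck_route({"album", "artist", "cover"}, ck))  # music
```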