# Approximation and Emergence in Data Mining

Key words : Version Space, Boolean Function, Linguistic Negation, Feature Reduction, Inductive Logic Programming, Vagueness and Uncertainty, Rough Sets.

* >> Approximation of (presumed) formal concepts*

**Algebraic Operators for Querying Pattern Bases [pdf]**: Theobjectivesofthisresearchworkwhichisintimatelyrelatedtopat- tern discovery and management are threefold: (i) handle the problem of pattern manipulation by defining operations on patterns, (ii) study the problem of enrich- ing and updating a pattern set (e.g., concepts, rules) when changes occur in the user’s needs and the input data (e.g., object/attribute insertion or elimination, tax- onomy utilization), and (iii) approximate a “presumed” concept using a related pattern space so that patterns can augment data with knowledge. To conduct our work, we use formal concept analysis (FCA) as a framework for pattern discov- ery and management and we take a joint database-FCA perspective by defining operators similar in spirit to relational algebra operators, investigating approxi- mation in concept lattices and exploiting existing work related to operations on contexts and lattices to formalize such operators.

* >> Version Space Approximation*

**Concept Learning with Approximation: Rough Version Spaces [pdf]**: The concept learning problem is a general framework forlearningconceptconsistentwithavailabledata.VersionSpaces theoryandmethodsarebuildinthisframework.However,itis notdesignatedtohandlenoisy(possiblyinconsistent)data.Inthis paper,weuseroughsettheorytoimprovethisframework.Firstly,we introducearoughconsistency.Secondly,wedefineanapproximative conceptlearningproblem.Thirdly,wepresentaRoughVersionSpace theory and related methods to address the approximative concept learningproblem.Usingadidacticexample,weputthesemethodsinto use.Anoverviewofpossibleextensionofthisworkconcludesthisarticle.

**Text Mining (Graph Structure Discovery and Visualization) [pdf]**: we define Rough Version Spaces (RVS), by building approximation operators, thus leading to a general framework for symbolic machine learning. Introducing approximative consistency induces use of concepts almost consistant with data. The practical effect is to enlarge concept interpretation while learning, and to handle inconsistant data using data clusters. Then, this new learning method is applied to word neighbourhood study in text corpus. This symbolic approach and RVS robustness make it possible to use data and prior linguistic knowledge (a WordNet graph) directly, without statistical reformulation. This allows us to extract and abstract a target word context in form a concept subgraph.

**Web Mining [pdf]**: This paper presents a challenging project which aims to ex- tend the current features of search and browsing engines. Different meth- ods are integrated to meet the following requirements : (1) Integration of incremental and focused dynamic crawling with meta-search; (2) Free the user from sifting through the long list of documents returned by the search engines; (3) Extract comprehensive patterns and useful knowledge from the documents; (4) Visual-based support to browse dynamic doc- ument collections. Finally, a new paradigm is proposed combining the mining and the visualization methods used for search and exploration.

* >> Boolean function approximation*

**Query Generation [pdf]**: In this article, we introduce a new method for query gen- eration. This method uses only a logical approach and does not need a statistical process or a natural language process- ing. The main interest of this new method is the abstraction. In a second part, we discuss a method for learning a text classifier and query generation for this classifier. The two problems are resolved in a complementary approach using our query generation method and SVM as text classifiers. We use this approach for studying words polysemy. Our method generates queries in order to retrieve documents about a specific sense of the word and in the same time learning the associated text classifier. Our method have good results.

# >> Linguistic Negation Approximation

**Positive interpretations [pdf]**: Linguistic negation processing is a challenging problem studied by a large number of researchers from different communities, i.e. logic, linguistics, etc. We are interested in finding the positive interpretations of a negative sentence represented as ”x is not A”. In this paper, we do not focus on the single set of trans- lations but on two approximation sets. The first one called pessimistic corresponds to the positive translations of the negative sentence that we can consider as sure. The second one called optimistic contains all the sentences that can be viewed as possible translations of the negative sentence. These approximation sets are computed according to the rough sets framework and based on a neighbourhood relation defined on the space of properties. Finally, we apply an original strategy of choice upon the two approximation sets which allows us to select the suitable translations of the initial negative sentence. It appears that we obtain results in good accordance with the ones linguistically expected. 2002

# >> learning flexible concepts from vague and uncertain data [pdf]

Inductive learning systems in a logical framework are prone to difficulties when dealing with huge amount of information. In particular, the learning cost is greatly increased, and it becomes difficult to find descriptions of concepts in a reasonable time. n this work, we develop a learning approach based on Rough Set Theory (RST), and more especially on its basic notion of concept approximation. In accordance with RST, a learning process is splitted into three steps, namely (1) partitioning of knowledge, (2) approximation of the target concept, and finally (3) induction of a logical description of this concept. The second step of approximation reduces the volume of the learning data, by computing well-chosen portions of the background knowledge which represent approximations of the concept to learn. Then, only one of these portions is used during the induction of the description, which allows for reducing the learning cost. In the first part of this paper, we report how RSTs basic notions namely indiscernibility, as well as lower and upper approximations of a concept have been adapted in order to cope with a logical framework. Some empirical results obtained with a concrete implementation of the approach, i.e., the EAGLE system, are given. These results show the relevance of the approach, in terms of learning cost gain, on a learning problem related to the document understanding.# >> Feature reduction

**Instance Selection [pdf]**: This paper introduces a new method for instances selection. The conceptual framework and the basic notions used by this method are those of an extended rough set theory, called α-rough set theory. In this context we formalize a notion of conflicting data, which is at the basis of a conflict normalization method used for instances selection. Extensive experiments are performed to show the efficiency and the accuracy of models built from the reduced datasets. The selection methodology and its results are discussed.

**Feature Selection [pdf]**: In this paper, we address the problem of feature subset se- lection using rough set theory. We propose a scalable algorithm to find a set of reducts based on discernibilityfunction, which is an alternative so- lution for the exhaustive approach. Our study shows that our algorithm improves the classical one from three points of view: computation time, reducts size and the accuracy of induced model. The problem of features subset selection can be defined as the selec- tion of a relevant subset of features which allows a leanaing algorithm to induce small high-accuracy concepts. To achieve this goal, two main approaches have been developed, the first one is algorithm-independent (filter approach) which considers only the data, when the second approach takes into account both the data and a given learning algorithm (wrapper approach). Recent work were developed to study the interest of rough set theory and more particularly its notions of reducts and core to deal with the problem of features subset selection. Different methods were proposed to select features using both the core and the reduct concepts, whereas other researches show that useful features subset does not necessarily con- tain all features in cores. In this paper, we underline the fact that rough set theory is concerned with deterministic analysis of attribute dependen- cies which are at basis of the two notions of reduct and core. We extend the notion of dependency which allows us to find both deterministic and non-deterministic dependencies. A new notion of strong reducts is then introduced and leads to the definition of strong feature subsets (SFS). The interest of SFS is illustrated by the improvement of the accuracy of our learning system, called Alpha, on real-world datasets.

* >> Closed and Emerging Patterns*

**Closed Emerging Patterns based Clustering : [pdf]**: This paper presents a data mining effort to discover emerging overlapping clusters with respect to the stage of fibrosis from a medical database collected at the Chiba University Hospital, Japan. Such clusters, described by examinations on patients, may lead to relevant factors for estimating the stage of fibrosis. We used and combined two recent methods: a soft-clustering approach (ECCLAT) and a method to soundly and completely mine patterns under a constraint specified by the user (MUSIC) in order to characterize classes. This new system produces a set of emerging overlapping clusters of patients, based on closed emerging patterns, which summarizes the data set by taking into account the different classes. Results point out the role of some medical examinations.

**Closed Patterns based Clustering : [pdf]**: In this paper we present a new approach for the discovery of meaningful clusters from large categorical data (which is an usual situation, e.g., web data analysis). Our method called ECCLAT (for Extraction of Clusters from Concepts LATtice) extracts a subset of concepts from the frequent closed patterns lattice, using an evaluation measure. ECCLAT is generic because it allows to build approximate clustering and discover meaningful clusters with slight overlapping. The approach is illustrated on a classical data set and on web data analysis.

**Data Visualization : [pdf]**: Minimal Hypergraph transversals based Meta-Clustering for Data Visualization (pdf "ISMIS2006.pdf") : Information visualization is gaining importance in data mining and transactional data has long been an important target for data miners. We propose a novel approach for visualizing transactional data using multiple clustering results for knowledge discovery. This scheme necessitates us to relate different clustering results in a comprehensive manner. Thus we have invented a method for attributing colors to clusters of different clustering results based on minimal transversals. The effectiveness of our method VisuMClust has been confirmed with experiments using artificial and real-world data sets.