Hamid Mirisaee - Mining Representative Items and Itemsets with Binary Matrix Factorization and Instance Selection

12:00

Wednesday

Sep

2015

Thesis defence

Organized by:

Hamid Mirisaee

Speaker:

Hamid Mirisaee

The PhD defence session will be held at the Touring room in CE4 (allée de Palestine, 38610 Gières)

Jury :

* Pascal Poncelet (Reviewer) - Prof. Univ. de Montpellier

* Céline Robardet (Reviewer) - Prof. INSA de Lyon

* Arno Siebes (Examiner) - Prof. Universiteit Utrecht

* Christel Vrain (Examiner) - Prof. Univ. d'Orléans

* Eric Gaussier (Director) - Prof. Univ de Grenoble

* Alexandre Termier (Co-Director) - Prof. Univ. de Rennes

This thesis focuses on mining representative items and itemsets using Binary Matrix Factorization (BMF) and instance selection. To accomplish this task, we first consider the BMF problem by studying the literature on matrix decomposition techniques and the state-of-the-art algorithms. Then, we establish a connection between BMF problem and Unconstrained Binary Quadratic Programming (UBQP) problem in order to use UBQP's algorithms and heuristics, available in the literature, to improve BMF solutions. Next, we propose a new, efficient heuristic which flips 1 bit at the time in order to improve the solutions of BMF. Using the established link between BMF and UBQP, we compare the proposed technique, called 1-opt-BMF with that of UBQP, called 1opt-UBQP as well as with the standard approach, called 1-opt-Standard. We then show, theoretically and experimentally, the efficiency of 1-opt-BMF on a wide range of publicly available datasets. Then, we explore addressing the problem of finding representative itemsets via BMF. To do that, we first consider the theoretical relation between the frequent itemset mining problem and BMF; once established, we propose a new technique called Decomposition Itemset Miner (DIM). We then design a set of experiments to show the efficiency of DIM and the quality of its results.

Finally, we consider the problem of finding representative objects (instances) in big, high-dimensional datasets. These objects helps us to find the instances which provide a global, top-view of the data and which are very important in the data analysis process. We first study the available methods for finding representative objects and discuss the pros and cons of each. We then formally define the Instance Selection Problem (ISP), provide three variants of that and examine their complexities before providing their solutions. In the experimental section, we show that although the ISP solutions can outperform other methods in some cases, in general it should be considered as a complementary technique in the context of finding representative objects.