Hesam Amoualian - Scaling Latent Topic-Class Models to Big Data Collections and Streams

Organized by: 
Hesam Amoualian


Jury:

  • Marie-Francine Moens, Professor at KU Leuven, reviewer
  • Julien Velcin, Associate Professor (HDR) at Université de Lyon 2, reviewer
  • Wei Lu, Assistant Professor at Singapore University of Technology and Design, examiner
  • Eric Gaussier, Professor at Université Grenoble Alpes, thesis supervisor
  • Massih-Reza Amini, Professor at Université Grenoble Alpes, thesis co-supervisor
  • Marianne Clausel, Associate Professor at Université Grenoble Alpes, thesis co-supervisor

This thesis focuses on scaling latent topic models to big data collections, especially document streams. Although the main goal of probabilistic topic modeling is to find word topics, an equally interesting objective is to examine how topics evolve and transition. To accomplish this task, we proposed three new models that capture topic and word-topic dependencies between consecutive documents in document streams. The first model is a direct extension of Latent Dirichlet Allocation (LDA) and uses a Dirichlet distribution to balance the influence of the LDA prior parameters against the topic and word-topic distributions of the previous document. The second extension uses copulas, which are a generic tool for modeling dependencies between random variables. Lastly, the third model is a non-parametric extension of the second one, obtained by integrating copulas into the stick-breaking construction of Hierarchical Dirichlet Processes (HDP). Our experiments, conducted on five standard collections that have been used in several studies on topic modeling, show that our proposals outperform previous ones, such as dynamic topic models, temporal LDA and the Evolving Hierarchical Processes, both in terms of perplexity and in tracking similar topics in document streams. Compared to previous proposals, our models offer extra flexibility and can adapt to situations where there are no dependencies between the documents.
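For readers unfamiliar with the stick-breaking construction mentioned above, the following is a minimal sketch of the plain, truncated version for a single Dirichlet process. It does not include the hierarchy or the copula coupling between consecutive documents that the thesis introduces; the function name `stick_breaking_weights` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def stick_breaking_weights(concentration, truncation, rng):
    """Return `truncation` topic weights from a truncated stick-breaking process.

    Each step breaks off a Beta(1, concentration) fraction of the stick that
    is still left; the final weight takes whatever remains so the weights
    sum to 1. This is the standard construction only (an illustrative
    assumption), not the copula-based variant described in the abstract.
    """
    if concentration <= 0:
        raise ValueError("concentration must be positive")
    if truncation < 1:
        raise ValueError("truncation must be at least 1")
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a topic
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # last topic absorbs the leftover mass
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    print(stick_breaking_weights(concentration=2.0, truncation=10, rng=rng))
```

A smaller `concentration` puts most of the mass on the first few topics, while a larger value spreads it over more topics, which is what lets a non-parametric model adapt the effective number of topics to the data.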
On the other hand, the exchangeability assumption in topic models like LDA often leads to inconsistent topics being inferred for the words of text spans such as noun phrases, which are usually expected to be topically coherent. To address this issue, we proposed copulaLDA (copLDA), which extends LDA by integrating part of the text structure into the model and relaxes the conditional independence assumption between the word-specific latent topics given the per-document topic distributions. To this end, we assume that the words of text spans such as noun phrases are topically bound, and we model this dependence with copulas. We demonstrated empirically the effectiveness of copLDA on both intrinsic and extrinsic evaluation tasks on several publicly available corpora. To complement copLDA, we presented an LDA-based model that generates topically coherent segments within documents by jointly segmenting documents and assigning topics to their words. The coherence between topics is ensured through a copula that binds the topics associated with the words of a segment. In addition, this model relies on both document-specific and segment-specific topic distributions so as to capture fine-grained differences in topic assignments. We showed that the proposed model naturally encompasses other state-of-the-art LDA-based models designed for similar tasks.
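The idea of binding the topics of a text span with a copula can be illustrated with a short sketch. It is not the thesis's implementation: the abstract does not state which copula family is used, so a one-factor Gaussian copula is swapped in here, and the function name `sample_bound_topics`, the parameter `rho` and the example distribution are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate
from statistics import NormalDist


def sample_bound_topics(theta, span_length, rho, rng):
    """Draw one topic per word of a text span, coupled by a Gaussian copula.

    theta: per-document topic distribution (non-negative, sums to 1).
    rho:   correlation in [0, 1] (assumed parameter); 0 gives independent
           draws as in plain LDA, 1 puts every word of the span on the
           same topic.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    cdf = list(accumulate(theta))
    std_normal = NormalDist()
    shared = rng.gauss(0.0, 1.0)  # latent factor shared by the whole span
    topics = []
    for _ in range(span_length):
        noise = rng.gauss(0.0, 1.0)
        # z is standard normal; any two words of the span have correlation rho.
        z = (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * noise
        u = std_normal.cdf(z)  # uniform marginal on (0, 1)
        # Inverse CDF of the categorical distribution theta, clamped to guard
        # against floating-point rounding at the upper end.
        topics.append(min(bisect_right(cdf, u), len(theta) - 1))
    return topics


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.5, 0.3, 0.2]
    print(sample_bound_topics(theta, span_length=4, rho=0.0, rng=rng))
    print(sample_bound_topics(theta, span_length=4, rho=1.0, rng=rng))
```

Because each uniform variable is pushed through the inverse CDF of `theta`, every word still follows the document's topic distribution marginally; only the dependence between the words of a span changes with `rho`.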