Jérémy Ferrero - Cross-Lingual Semantic Textual Similarity: towards Automatic Cross-Language Plagiarism Detection

Organized by: 
Jeremy Ferrero


Jury members:

  • Ms. Isabelle Tellier, Professor, Université Paris 3 - Sorbonne Nouvelle, examiner
  • Mr. Emmanuel Morin, Professor, Université de Nantes, reviewer
  • Mr. Juan-Manuel Torres-Moreno, Associate Professor, HDR, Université d'Avignon et des Pays de Vaucluse, École Polytechnique de Montréal - DGIGL, reviewer
  • Mr. Frédéric Agnès, R&D engineer, Compilatio, examiner
  • Mr. Laurent Besacier, Professor, Université Grenoble Alpes, thesis supervisor
  • Mr. Didier Schwab, Associate Professor, Université Grenoble Alpes, thesis co-supervisor

The massive number of documents available through the Internet (e.g. web pages, data warehouses, and digital or transcribed texts) makes it easier to recycle ideas. Unfortunately, this phenomenon is accompanied by an increase in plagiarism cases. Indeed, claiming ownership of content without the consent of its author and without crediting its source, and presenting it as new and original, is considered plagiarism. In addition, the expansion of the Internet, which facilitates access to documents throughout the world (written in foreign languages), together with increasingly efficient (and freely available) machine translation tools, contributes to the spread of a new kind of plagiarism: cross-language plagiarism. Cross-language plagiarism means plagiarism by translation, i.e. a text is plagiarized by being translated (manually or automatically) from its original language into the language of the document in which the plagiarist wishes to include it. While plagiarism prevention is an active field of research and development, it mostly covers monolingual comparison techniques. This thesis is a joint work between an academic laboratory (LIG) and Compilatio (a software company that publishes plagiarism detection solutions), and proposes cross-lingual semantic textual similarity measures, an important sub-task of cross-language plagiarism detection.
After defining plagiarism and the different concepts discussed in this thesis, we present a survey of the state of the art in cross-language plagiarism detection approaches. We also present the preexisting corpora for cross-language plagiarism detection and show their limitations. We then describe how we gathered and built a new dataset that avoids most of the limitations of the preexisting corpora. Using this new dataset, we conduct a rigorous evaluation of several state-of-the-art methods and find that they behave differently depending on certain characteristics of the texts on which they operate. Next, we present new methods for measuring cross-lingual semantic textual similarity based on word embeddings. We also propose a morphosyntactic and frequency-based weighting of words, which can be used both within a vector and within a bag-of-words, and we show that introducing it into the new methods improves their respective performance. We then test different fusion systems (mostly based on linear regression). Our experiments show that we obtain better results than the state of the art on all the sub-corpora studied. We conclude by presenting and discussing the results these methods obtained during our participation in the cross-lingual Semantic Textual Similarity (STS) task of SemEval-2017, where we ranked first on the sub-task that best corresponds to Compilatio's use-case scenario.
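
To make the idea of weighting words inside an embedding-based similarity measure more concrete, here is a minimal Python sketch. It is not the method of the thesis: the tiny shared embedding space, the part-of-speech weights, the IDF values, and the function names (`weighted_sentence_vector`, `cosine`, `similarity`) are all assumptions made up for illustration. It only shows how a morphosyntactic weight and a frequency weight could scale each word vector before two sentences in different languages are compared.

```python
import math

# Toy bilingual embeddings in a shared space (made-up values, illustration only).
EMBEDDINGS = {
    "the": [0.30, 0.30, 0.30],
    "cat": [0.90, 0.10, 0.00],
    "eats": [0.10, 0.90, 0.10],
    "le": [0.31, 0.29, 0.30],
    "chat": [0.88, 0.12, 0.02],
    "mange": [0.12, 0.85, 0.10],
    "voiture": [0.00, 0.20, 0.90],
}

# Hypothetical part-of-speech weights: content words count more than function words.
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.8, "DET": 0.1}


def weighted_sentence_vector(tokens, idf):
    """Sum the word vectors, each scaled by (POS weight * IDF weight).

    `tokens` is a list of (word, pos_tag) pairs; unknown words are skipped.
    """
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    for word, pos in tokens:
        vec = EMBEDDINGS.get(word)
        if vec is None:
            continue
        weight = POS_WEIGHTS.get(pos, 0.5) * idf.get(word, 1.0)
        for i, value in enumerate(vec):
            total[i] += weight * value
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(source_tokens, target_tokens, idf):
    """Cross-lingual similarity of two POS-tagged sentences."""
    return cosine(
        weighted_sentence_vector(source_tokens, idf),
        weighted_sentence_vector(target_tokens, idf),
    )


# Assumed IDF values: very frequent words get a low weight.
IDF = {"the": 0.2, "le": 0.2}

english = [("the", "DET"), ("cat", "NOUN"), ("eats", "VERB")]
french_translation = [("le", "DET"), ("chat", "NOUN"), ("mange", "VERB")]
french_unrelated = [("le", "DET"), ("voiture", "NOUN")]

score_translation = similarity(english, french_translation, IDF)
score_unrelated = similarity(english, french_unrelated, IDF)
```

With these toy values, the translated pair scores much higher than the unrelated pair, because the determiners contribute almost nothing and the nouns and verbs dominate each sentence vector. Several such scores could then be combined by a fusion step such as the linear regression mentioned above, which this sketch does not cover.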