Ritesh Shah - SUFT-1, a system for helping understand spontaneous multilingual and code-switching tweets in foreign languages: experimentation and evaluation on Indian and Japanese tweets

12:00

Friday

Oct

2017

Thesis defence

Place:

Seminar room 2 - Ground floor - IMAG Building

Organized by:

Ritesh Shah

Speaker:

Ritesh Shah

Teams:

GETALP

Jury:

Georges Antoniadis - Professor - Université Grenoble-Alpes - President
Patrick Paroubek - Research Engineer - LIMSI-CNRS - Reviewer
Mathieu Lafourcade - Associate Professor - Université Montpellier 2 - Reviewer
Violaine Prince - Professor - Université Montpellier 2 - Examiner
Clément Levallois - Associate Professor - EM-Lyon - Examiner
Christian Boitet - Professor Emeritus - Université Grenoble-Alpes - Thesis supervisor
Pushpak Bhattacharyya - Professor - IIT Bombay and IIT Patna - Thesis co-supervisor
Mathieu Mangeot - Associate Professor - Université Savoie Mont Blanc - Thesis co-advisor

As Twitter evolves into a ubiquitous information dissemination tool, understanding tweets in foreign languages becomes an important and difficult problem. Because of the inherently code-mixed, disfluent and noisy nature of tweets, state-of-the-art Machine Translation (MT) is not a viable option (Farzindar & Inkpen, 2015). Indeed, at least for Hindi and Japanese, we observe that the percentage of "understandable" tweets falls from 80% for natives to below 30% for target (English or French) readers using Google Translate or Yandex. Our starting hypothesis is that it should be possible to build generic tools that enable foreigners to make sense of at least 70% of "native tweets" using a versatile "active reading" (AR) interface. At the same time, we aim to determine the percentage of understandable tweets below which intended users would deem such a system useless.
We have thus specified a generic "SUFT" (System for helping Understand Foreign Tweets) and implemented SUFT-1, an interactive multi-layout system based on AR that is easily configured by adding dictionaries, morphological modules, and MT plugins. It can access multiple dictionaries for each source language and provides an evaluation interface. For evaluations, we introduce a task-related measure that induces a negligible cost, and a methodology aimed at enabling "continuous evaluation on open data", as opposed to classical measures based on test sets related to closed learning sets. We propose to combine understandability ratio and understandability decision time as a two-pronged quality measure, one subjective and the other objective, and we experimentally ascertain that a dictionary-based active reading presentation can indeed help readers understand tweets better than available MT systems.
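The two-pronged measure can be illustrated with a minimal sketch. It assumes that each judgment records whether a reader marked a tweet as understandable and how many seconds that decision took, and that the ratio is simply the fraction of positive judgments. The `Judgment` record, its field names, and the sample values are hypothetical and are not taken from SUFT-1.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Judgment:
    """One reader's verdict on one tweet (hypothetical record layout)."""
    tweet_id: str
    understandable: bool      # subjective: did the reader make sense of it?
    decision_seconds: float   # objective: time taken to reach that verdict


def understandability_ratio(judgments):
    """Fraction of judgments in which the tweet was marked understandable."""
    if not judgments:
        raise ValueError("no judgments to evaluate")
    return sum(j.understandable for j in judgments) / len(judgments)


def mean_decision_time(judgments):
    """Average time, in seconds, readers needed to reach a verdict."""
    if not judgments:
        raise ValueError("no judgments to evaluate")
    return mean(j.decision_seconds for j in judgments)


def quality(judgments):
    """Return the two components together: (ratio, mean decision time)."""
    return understandability_ratio(judgments), mean_decision_time(judgments)


# Invented sample, for illustration only.
sample = [
    Judgment("t1", True, 12.0),
    Judgment("t2", False, 30.0),
    Judgment("t3", True, 18.0),
    Judgment("t4", True, 20.0),
]
ratio, seconds = quality(sample)
print(f"understandability ratio: {ratio:.2f}, mean decision time: {seconds:.1f} s")
```

Keeping the two values separate, rather than folding them into one score, mirrors the idea of one subjective and one objective component; how the thesis actually aggregates them is not stated here.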
In addition to gathering various lexical resources, we constructed a large resource of "word-forms" appearing in Indian tweets, together with their morphological analyses: 163,221 Hindi word-forms from 68,788 lemmas and 72,312 Marathi word-forms from 6,026 lemmas. This resource serves to create a multilingual morphological analyzer specialized for tweets, which can handle code-mixed tweets, compute unified features, and present a tweet with an attached AR graph from which foreign readers can intuitively extract a plausible meaning, if any.
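As a rough illustration of how a word-form resource could support code-mixed tweets, the sketch below looks up each token in several per-language lexicons at once. It is a toy under stated assumptions: the `LEXICONS` structure, the language codes, and the two entries are invented for illustration, tokenization is plain whitespace splitting, and nothing here reflects the actual analyzer or its unified feature set.

```python
# Toy per-language lexicons mapping a surface word-form to its analyses,
# each analysis being a (lemma, features) pair. All entries are invented.
LEXICONS = {
    "hi": {"लड़के": [("लड़का", "noun, masculine, direct plural")]},
    "en": {"playing": [("play", "verb, present participle")]},
}


def analyse(tweet):
    """Return (token, hits) pairs, where hits maps language -> analyses."""
    result = []
    for token in tweet.split():
        # A token may be known in zero, one, or several languages.
        hits = {
            lang: forms[token]
            for lang, forms in LEXICONS.items()
            if token in forms
        }
        result.append((token, hits))
    return result


for token, hits in analyse("लड़के cricket playing"):
    print(token, hits or "unknown")
```

Tokens found in no lexicon come back with an empty mapping, so a reading interface could still display them untouched instead of dropping them.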

— हिन्दी—

ट्विटर के रूप में एक सर्वव्यापी सूचना प्रसार उपकरण के विकसित होते ही विदेशी भाषाओं में ट्वीट्स को समझने की समस्या एक महत्वपूर्ण और कठिन चुनौती बनके सामने आती है। ट्वीट्स में निहित कोड-मिक्सिंग, विसंगत वाक्य रचना एवं सामान्यतः अशुद्ध लेखन की वजह से, अत्याधुनिक मशीन ट्रांसलेशन (एमटी) एक व्यवहार्य विकल्प नहीं है (फारज़ींदर एंड इंकपेन, 2015)। वास्तव में, कम से कम हिंदी और जापानी के लिए, हम देखते हैं कि गूगल ट्रांस्लेट या यैंडेक्स का उपयोग करते हुए, "समझने योग्य" ट्वीट्स का प्रतिशत किसी मूल निवासी के लिए 80% से गिरकर किसी अंग्रेज़ी या फ्रेंच वाचक के लिए 30% से भी कम हो जाता है। हमारी प्रारंभिक अवधारणा यह है कि एक बहुमुखी "एक्टिव रीडिंग" (एआर) इंटरफ़ेस का उपयोग करते हुए विदेशियों को कम से कम 70% "देशी ट्वीट्स" का अर्थ समझने में सक्षम कर सके ऐसे एक व्यापक उपकरण बनाने की निश्चित रूप से संभावना है। साथ ही साथ हम ये भी सुनिश्चित करते हैं कि कम से कम कितने प्रतिशत ट्वीट्स न समझ आने पर ये उपकरण प्रयोक्ताओं द्वारा बेकार माना जाएगा।
इस अवधारणा के आधार पर हमने एक व्यापक "एसयूएफटी" (विदेशी ट्वीट्स को समझने में मदद करनेवाला सिस्टम) निर्दिष्ट किया, एवं तत्पश्चात "एक्टिव रीडिंग" पर आधारित एक इंटरैक्टिव मल्टी-लेआउट सिस्टम (उपकरण) SUFT-1 का कार्यान्वयन किया। इस उपकरण का प्रारूप आसानी से शब्दकोश, रूपिकी या शब्द साधन मॉड्यूल और मशीनी अनुवाद के प्लगइन्स जोड़कर बदला जा सकता है। यह प्रत्येक भाषा के लिए एकाधिक शब्दकोशों का उपयोग करने एवं एक मूल्यांकन इंटरफ़ेस प्रदान करने में सक्षम है। मूल्यांकन के लिए, हम एक कार्य-संबंधित माप और एक कार्यप्रणाली का प्रस्ताव रखते हैं जो नगण्य लागत से "ओपन डाटा पर निरंतर मूल्यांकन" करने में सक्षम है एवं उन शास्त्रीय उपायों से अलग है जो "क्लोस्ड लर्निंग सेट्स" पर आधारित हैं।
हम 'अंडरस्टँडेबिलिटी रेशियो' एवं 'अंडरस्टँडेबिलिटी डिसीज़न टाइम' को व्यक्तिपरक और वस्तुपरक माप की दृष्टि से दो-तरफा गुणवत्ता वाले एक माप के रूप में जोड़ते हैं। साथ ही साथ प्रयोगात्मक रूप से यह पता लगाते हैं कि क्या एक शब्दकोश-आधारित सक्रिय रीडिंग प्रस्तुति वास्तव में उपलब्ध एमटी सिस्टमों की अपेक्षा ट्वीट्स को बेहतर समझने में सहायक हो सकती है। विभिन्न शब्दार्थिक संसाधनों को इकट्ठा करने के अलावा, हमने भारतीय ट्वीट्स में निहित "वर्ड फाॅर्म्स" का उनके रूपात्मक विश्लेषण के साथ एक बड़ा संसाधन निर्मित किया है जिसमें (68788 लेम्माज़ से 163221 हिंदी वर्ड फाॅर्म्स और 6026 लेम्माज़ से 72312 मराठी वर्ड फाॅर्म्स) हैं। यह एक बहुभाषी रूपात्मक विश्लेषक बनाने के लिए है, जो कि कोड-मिश्रित ट्वीट्स को संभाल सकता है, एकीकृत वैशिष्ट्यों की गणना कर सकता है और एक्टिव रीडिंग ग्राफ के साथ एक ट्वीट प्रस्तुत कर सकता है जिससे विदेशी पाठक सहजता से संभाव्य अर्थ निकाल सकें।