Lorraine Goeuriot website

Research Interests

My research activities are centered around the processing of medical information. During my PhD, I studied the compilation of multilingual medical resources to build terminologies and lexical resources. That involved linguistic processing, as well as data mining techniques such as machine learning. At Nanyang Technological University, my research focused on opinion mining in the medical domain, which involved compiling resources, corpus analysis and linguistic processing. I am now working on the Khresmoi project, which aims to create a multilingual and multimedia platform for accessing biomedical information. DCU's role in this project mainly involves leading the evaluation of the system (both empirical and user-centered), supporting multilingual information access, designing collaborative functionalities and designing results summarisation.

My research interests include:

Medical Informatics
Natural Language Processing
Data Mining
Information Retrieval
Sentiment Analysis
Terminology
Corpus Linguistics
Multilingual Information Processing
Evaluation

Research activities

1. Postdoc at Dublin City University

The Khresmoi project aims to build systems for multilingual, multimodal search and access to biomedical information sources. Our research group contributes to three work packages (biomedical text mining and search; user interface and search system; multilingual resources and information delivery) and leads the work package in charge of evaluating the project and the system developed. My work on all of these work packages consists of conducting research and development, managing progress, organising meetings and teleconferences, and writing deliverables and research papers. During my first year on the project, we published two papers in refereed workshops: one on the development of collaborative functionalities for medical information systems (LREC workshop) and one on the development of a large-scale user-centred evaluation within Khresmoi (CLEF eHealth workshop).

2. Postdoc at Nanyang Technological University

Social media are commonly used to express opinions on a wide range of subjects. Our objective is to develop an effective method for sentiment analysis and summarization of social media content, especially in the health and medical fields. As a target domain, we focus on drugs. We aim to build a web-based system that provides a summarized view of public opinions. A sentence-based system has been built to annotate sentences semantically, based on medical thesaurus semantic types (e.g. Chemical and drugs, Symptom), and then to predict sentiments toward various aspects of a drug (e.g. side effects, cost) using machine learning and linguistic approaches. This project has led to two publications in international refereed conferences and one journal paper.

3. PhD

Characterization and compilation of specialized comparable corpora

Supervised by Béatrice Daille and Emmanuel Morin

Comparable corpora are sets of texts written in different languages that are not translations of each other but that share common characteristics. Their main advantage is that they are fully representative of the linguistic and cultural specificities of their respective languages. The Web could theoretically be considered a source of comparable corpora. However, the quality of the corpora and of the resources extracted from them depends on the preliminary definition of the corpora and on the care taken in their compilation (i.e. the definition of common features in comparable corpora). In this thesis, we focus on the compilation of specialized comparable corpora in French and Japanese whose documents are extracted from the Web. We propose a definition of these corpora and a set of common features: a specialized domain, a topic and a type of discourse (science or popular science). Our goal is to create a tool to assist comparable corpora compilation. First, we present the automatic recognition of common features. Topics can be easily identified with the keywords used in Web searches. In contrast, detecting the type of discourse requires a wide stylistic analysis. This analysis is performed over a learning corpus and leads to the creation of a bilingual typology based on three levels of analysis: structural, modal and lexical. Second, we use this typology to learn a classification model with SVMlight and C4.5. This classification model is tested over an evaluation corpus. Our test results indicate that more than 70% of the documents are correctly classified. Finally, the classifier is integrated into a comparable corpora compilation assistant tool developed on the UIMA framework.

Student supervision

2016-2019: Seydou Doumbia (PhD) - Information retrieval for Mali
2017: Nayanika Dogra (Master 2) - Cultural Microblog retrieval
2017: Mujtaba Asif (Master 1) - Opinion mining in tweets
2017: Ghadeer Mohannad (Master 1) - Tool for IR evaluation
2016: Julie Budaher (Master 2) - Health information retrieval
2016: Sanjay Kamath (Master 2) - Contextual recommendation of touristic activities