Mining and Enriching Multilingual Scientific Text Collections: Challenges and Opportunities

January 25, 2019 at 12:00 am by Dr. Horacio Saggion

Place: Large Lecture Room

Abstract:

Scientists worldwide are confronted with exponential growth in the number of scientific documents being made available. In this scenario of scientific information overload, natural language processing has a key role to play.

Over the past few years, a number of tools and methods have been developed for analyzing the structure of scientific documents (e.g. transforming PDF to XML), extracting keywords, or classifying sentences into argumentative categories. However, deep analysis of scientific documents, such as finding key claims, assessing the argumentative quality and strength of the research, or summarizing the key contributions of a piece of work, is less common. Moreover, most research in scientific text processing is carried out for the English language, neglecting both the share of scientific information available in other languages and the fact that scientific publications are often bilingual.

In this talk, I will present work carried out in our laboratory towards the development of a system for “deep” analysis and annotation of scientific text collections. Originally developed for English, the system has now been adapted to Spanish. After a brief overview of the system and its main components, I will present our recent work on a bilingual (Spanish and English), fully annotated text resource in the field of natural language processing, which we created with our system, together with a faceted-search and visualization system for exploring the resource.

The talk will be preceded by an overview of the research activities and projects developed at the Natural Language Processing Group (TALN) at Universitat Pompeu Fabra.

Short Bio:

Horacio Saggion is an Associate Professor at the Department of Information and Communication Technologies, Universitat Pompeu Fabra (UPF), Barcelona. He is the head of the Large Scale Text Understanding Systems Lab, associated with the Natural Language Processing group (TALN), where he works on automatic text summarization, text simplification, information extraction, sentiment analysis and related topics. Horacio obtained his PhD in Computer Science from Université de Montréal, Canada, in 2000. He obtained his BSc in Computer Science from Universidad de Buenos Aires in Argentina, and his MSc in Computer Science from UNICAMP in Brazil. He was the Principal Investigator for UPF in the EU projects Dr Inventor and Able-to-Include and is currently Principal Investigator of the national project TUNER and the Maria de Maeztu project Mining the Knowledge of Scientific Publications. Horacio has published over 150 works in leading scientific journals, conferences, and books in the field of human language technology. He organized four international workshops in the areas of text summarization and information extraction and was scientific Co-chair of STIL 2009 and scientific Chair of SEPLN 2014. He is a regular programme committee member for international conferences such as ACL, EACL, COLING, EMNLP, IJCNLP, and IJCAI, and is an active reviewer for international journals in computer science, information processing, and human language technology. Horacio has given courses, tutorials, and invited talks at a number of international events including LREC, ESSLLI, IJCNLP, NLDB, and RuSSIR.

