Since its founding, Syllabs has been working on terminology and keyword extraction from unstructured data in several languages (French, English, Spanish, Italian, German, Portuguese, Polish and Chinese). The aim of automatic terminology extraction is to produce a list of terms from a corpus. Terms are words or phrases that are specific to a specialized domain.
In the European-funded TTC project (Terminology, Translation and Comparable Corpora), Syllabs develops tools for monolingual terminology extraction in 7 languages using two different approaches: a rule-based approach and an unsupervised one. One of the goals of the project is to compare the two methods, and more generally knowledge-rich and knowledge-poor strategies. To evaluate both methods and compare them with other tools available from TTC partners, TTC participants have compiled reference term lists (RTLs) in two specialized domains: wind energy and mobile technologies. The usefulness of these lists is twofold: on the one hand, they are used in the project to assess results and compare different tools and approaches; on the other hand, they are released to the NLP and terminology community, which can benefit from TTC’s output and reuse the lists for other research purposes. They will soon be available for download on the website of the Université de Nantes.
Compiling terminologies is not a trivial task. The work carried out to create these reference term lists is described in the paper “Reference Lists for the Evaluation of Term Extraction Tools”, which the Université de Nantes will present at the TKE conference in Madrid on Thursday, 21 June. Syllabs and the Universität Stuttgart (IMS) are co-authors of the paper.
In this paper, we discuss the practical and methodological issues of creating reference term lists (RTLs) for the evaluation of term candidate extraction from comparable corpora, in the domains of wind energy and mobile technology. These reference term lists are intended to serve as a “gold standard” for the qualitative and quantitative evaluation of automatic term extraction tools. We present the preliminary results of the evaluation of monolingual term extraction. Using the manually collected RTLs, we evaluate monolingual term candidate lists that are automatically extracted from Spanish texts in the domain of wind energy. If you would like to know more about our work, you can request the paper here.
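To give a rough idea of what evaluating a term candidate list against a reference term list can look like, here is a minimal Python sketch. It is not the evaluation protocol used in TTC or in the paper: it assumes exact matching after lowercasing and whitespace normalization, and the function names (`normalize`, `evaluate`) and the Spanish example terms are invented for illustration only.

```python
def normalize(term: str) -> str:
    """Lowercase a term and collapse runs of whitespace (an assumed, very simple normalization)."""
    return " ".join(term.lower().split())


def evaluate(candidates, reference):
    """Compare extracted term candidates with a reference term list using exact matching."""
    cand = {normalize(t) for t in candidates}
    ref = {normalize(t) for t in reference}
    matched = cand & ref

    # Precision: share of extracted candidates that appear in the reference list.
    precision = len(matched) / len(cand) if cand else 0.0
    # Recall: share of reference terms that the extractor found.
    recall = len(matched) / len(ref) if ref else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "matched": sorted(matched),
    }


# Hypothetical data: a tiny reference list and the output of an imaginary extractor.
reference_terms = ["aerogenerador", "energía eólica", "parque eólico", "pala"]
candidate_terms = ["Energía  eólica", "aerogenerador", "viento fuerte", "parque eólico", "año"]

scores = evaluate(candidate_terms, reference_terms)
print(scores)
```

Exact matching is the strictest choice. A real evaluation would also have to decide how to treat spelling variants, inflected forms and partial overlaps between multi-word terms, which this sketch ignores.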