In the European-funded TTC project (Terminology, Translation Tools and Comparable Corpora), Syllabs has carried out research on knowledge-poor and knowledge-rich approaches to terminology extraction. Automatic terminology extraction is the task of suggesting term candidates found in a corpus. Term candidates are words or expressions specific to a specialized domain, in this case the domain covered by the corpus. Our tools rank the candidate terms by a specificity score, that is, the frequency of occurrence in the specialized corpus relative to the frequency of occurrence in a large general corpus.
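To make the idea of a specificity score concrete, here is a minimal Python sketch. It is an illustration only, not the code of the Syllabs or University of Nantes tools: the function names, the whitespace tokenization in the usage example, and the add-one smoothing for terms missing from the general corpus are all assumptions made for this example.

```python
from collections import Counter


def specificity_scores(domain_tokens, general_tokens, smoothing=1.0):
    """Score each domain token by relative frequency in the domain corpus
    divided by its (smoothed) relative frequency in the general corpus.

    The smoothing scheme is an assumption of this sketch; it only keeps the
    ratio finite for terms that never occur in the general corpus.
    """
    domain_counts = Counter(domain_tokens)
    general_counts = Counter(general_tokens)
    domain_total = len(domain_tokens)
    general_total = len(general_tokens)
    if domain_total == 0:
        return {}
    vocab_size = len(set(domain_counts) | set(general_counts))

    scores = {}
    for term, count in domain_counts.items():
        rel_domain = count / domain_total
        rel_general = (general_counts[term] + smoothing) / (
            general_total + smoothing * vocab_size
        )
        scores[term] = rel_domain / rel_general
    return scores


def rank_candidates(domain_tokens, general_tokens):
    """Return candidates sorted by decreasing specificity (ties: alphabetical)."""
    scores = specificity_scores(domain_tokens, general_tokens)
    return sorted(scores, key=lambda term: (-scores[term], term))


# Toy usage: "turbine" is frequent in the domain text and absent from the
# general text, so it ranks above a function word such as "the".
domain = "the turbine blade of the wind turbine".split()
general = "the cat sat on the mat of the house".split()
ranked = rank_candidates(domain, general)
```

A real extractor would work on lemmatized candidates, including multi-word terms, and on a much larger general corpus; the sketch shows only the ranking principle.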
One of the aims of the project is to compare results depending on the amount of linguistic knowledge used. During this 3-year project, Syllabs has developed tools for monolingual terminology extraction in the seven languages of the project (German, French, English, Spanish, Latvian, Russian and Chinese) using two different approaches: a rule-based approach and a probabilistic one. The University of Nantes has also developed a multilingual framework for monolingual terminology extraction.
In the paper “Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Extraction”, which will be presented by the Université de Nantes at the CICLing 2013 conference in Samos (Greece) on Thursday, 28 March, we discuss the results of both approaches and compare them across several scenarios. First, we present the two terminology extraction tools, which let us compare a knowledge-poor and a knowledge-rich approach. Both tools process single-word terms (SWT) and multi-word terms (MWT), and both are designed to handle multiple languages. We run an evaluation on 6 languages and 2 different domains using crawled comparable corpora and hand-crafted reference term lists (RTL). We then discuss the 3 main results obtained for terminology extraction. The first two evaluation scenarios concern the knowledge-rich framework. Scenario 1 compares performance for each language depending on the ranking applied: specificity score vs. number of occurrences. Scenario 2 examines whether term variant identification improves the precision of the ranking in each language. Scenario 3 compares both tools and shows that a probabilistic term extraction approach, developed with minimal effort, achieves satisfactory results when compared to a rule-based method.
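As a rough illustration of how a ranked candidate list can be checked against a reference term list, the sketch below computes precision over the top k candidates. The exact evaluation measures used in the paper are not described in this post, so the function name, the case-insensitive exact matching, and the toy data are assumptions made for this example.

```python
def precision_at_k(ranked_candidates, reference_terms, k):
    """Fraction of the top-k ranked candidates that appear in the reference list.

    Matching is case-insensitive exact string match, an assumption of this
    sketch; a real evaluation may also match term variants.
    """
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    reference = {term.lower() for term in reference_terms}
    hits = sum(1 for term in top if term.lower() in reference)
    return hits / len(top)


# Toy comparison of two rankings of the same candidates (hypothetical data).
reference_list = {"turbine", "blade", "rotor"}
by_specificity = ["turbine", "blade", "wind", "the"]
by_frequency = ["the", "turbine", "of", "blade"]

p_specificity = precision_at_k(by_specificity, reference_list, 2)
p_frequency = precision_at_k(by_frequency, reference_list, 2)
```

Computing the same measure for two rankings of the same candidates is one simple way to compare a specificity-based ranking with a frequency-based one, in the spirit of Scenario 1.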
More information:
- Schedule: http://www.cicling.org/2013/Poster-session.htm
- Poster: http://fr.slideshare.net/hblanca/cicling-2013poster