SenDiS

   
 

Sectoral Operational Programme “Increase of Economic Competitiveness”

„Investments for your future”

<<General Word Sense Disambiguation System applied to Romanian and English Languages / SenDiS>>

 „Project co-financed by the European Regional Development Fund”

The content of this material does not necessarily represent the official position of the European Union or of the Romanian Government    

The R&D Department of S.C. SOFTWIN SRL continues its research in Natural Language Processing (NLP) by implementing the project << General Word Sense Disambiguation System applied to Romanian and English Languages / SenDiS >>, co-financed by the European Regional Development Fund through the Sectoral Operational Programme “Increase of Economic Competitiveness”, Priority Axis 2 – Research, Technological Development and Innovation for Competitiveness, Operation 2.1.2: “Complex research projects fostering the participation of high-level international experts”.

The goal of the SenDiS project is the conception, design and implementation of a general disambiguation system (usable for any natural language), resulting in an API with a high degree of applicability and real chances of being successfully exploited in commercial applications. To this end, the project will cover the entire research and development process: developing new methodologies, improving existing techniques, design, experimental development (prototype) and testing/validation. Given the vast experience of the host enterprise, SOFTWIN, a market leader in Romania in the field of IT solutions, the results will be applied directly in the economy, answering a diversified market demand. Applying the system to the Romanian and English languages, through the creation of disambiguation knowledge bases, will prove its functionality and set the working parameters of the proposed algorithms. SOFTWIN will exploit this component on the market by integrating it into a set of Natural Language Processing applications, among which the automated translation system will play an important role. By achieving the goal of the SenDiS project and then integrating this module with other NLP components developed by the research team of the host enterprise, with the purpose of obtaining high-quality products, we expect to lay the groundwork for new quality and performance standards in the industry of applications that offer computational linguistic services, translation, and even information search and retrieval.
 

Through the specific objectives pursued over its 3 years of research and development, the project fits thematic area no. 6, “Communication and Information Technology”, of Operation 2.1.2:

  • defining a group of methods and techniques for the disambiguation of natural languages, based mainly on exploiting and processing the information inferred from a lexicon.
  • developing tools for creating, maintaining and using linguistic resources for disambiguation, independent of the language of the linguistic knowledge.
  • developing modules (API) that allow disambiguation routines to be called and executed from various natural language processing applications.
  • developing complex disambiguation knowledge bases for the Romanian and English languages.
  • demonstrating the viability of the developed disambiguation system by applying it to the experimental model for the two languages.

The SenDiS system is based on the GRAALAN system (see S. Diaconescu: Creation of linguistic resources with the help of a specialized language, in Workshop on Linguistic resources and tools for Romanian language processing, Iasi, 2006) and on a series of aspects specific to disambiguation.

  1. The GRAALAN system gives the SenDiS system a theoretical base (GDG – "Generative Dependency Grammars", DT – "Dependency Tree", and AVT – "Attribute Value Tree") and a practical one, through linguistic knowledge bases that have already been created.
  2. The aspects specific to disambiguation concern a general approach that involves:
    • the lexicon structure as a web of meanings;
    • ways of representing the lexicon as a network, taking into account its volume and/or the need for fast exploitation (about 150,000 - 250,000 meanings and 2,000,000 - 3,000,000 relationships between meanings);
    • ways of ordering the lexicon network to reach an optimum with respect to several criteria: the number of levels in the ordered web, the number of primitives (meanings whose definitions contain no other meanings and which are accepted without definition), and the number of universals (meanings that do not contribute to any other meaning);
    • establishing the definition set of each meaning (the set of meanings that contribute directly or indirectly to the definition of the current meaning);
    • establishing the competence set of each meaning (the set of meanings to whose definition the current meaning contributes directly or indirectly);
    • obtaining the disambiguation by making use of all the above information (with operations on the definition and competence sets).

These problems require in-depth study of their properties and of possible optimizations, so that the proposed algorithms can be implemented efficiently both in the tools used to create and process the knowledge required for disambiguation and in the disambiguation applications.
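The notions listed above (definition sets, competence sets, primitives, universals and levels) can be illustrated on a toy lexicon web. The following Python sketch is only an illustration under stated assumptions: the data, the function names and the choice of a plain dictionary of sets are hypothetical and are not taken from SenDiS or GRAALAN. It also assumes the web has already been ordered, so that it contains no cycles.

```python
# Hypothetical toy lexicon web (an assumption, not SenDiS data):
# each meaning maps to the meanings used directly in its definition.
DEFINED_BY = {
    "entity": set(),
    "living": {"entity"},
    "animal": {"living", "entity"},
    "dog": {"animal"},
    "bark_sound": {"dog"},
}


def invert(defined_by):
    """Reverse relation: meaning -> meanings whose definition uses it directly."""
    contributes_to = {meaning: set() for meaning in defined_by}
    for meaning, parts in defined_by.items():
        for part in parts:
            contributes_to.setdefault(part, set()).add(meaning)
    return contributes_to


def closure(relation, start):
    """All meanings reachable from `start` by following `relation`."""
    seen = set()
    stack = list(relation.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(relation.get(node, ()))
    return seen


def definition_set(defined_by, meaning):
    """Meanings that contribute directly or indirectly to `meaning`."""
    return closure(defined_by, meaning)


def competence_set(defined_by, meaning):
    """Meanings to whose definition `meaning` contributes, directly or not."""
    return closure(invert(defined_by), meaning)


def primitives(defined_by):
    """Meanings whose definition uses no other meaning."""
    return {m for m, parts in defined_by.items() if not parts}


def universals(defined_by):
    """Meanings that contribute to no other meaning."""
    return {m for m, users in invert(defined_by).items() if not users}


def levels(defined_by):
    """Level 0 for primitives, else 1 + the highest level among the
    meanings used in the definition. Assumes an acyclic (ordered) web."""
    memo = {}

    def level(meaning):
        if meaning not in memo:
            parts = defined_by.get(meaning, ())
            memo[meaning] = 1 + max(map(level, parts)) if parts else 0
        return memo[meaning]

    return {meaning: level(meaning) for meaning in defined_by}
```

On a web of the size mentioned above (hundreds of thousands of meanings, millions of relationships), recomputing a closure per query would likely be too slow. Precomputing and compactly storing the sets is one possible design choice, which this sketch does not attempt.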


System architecture

The SenDiS system contains three subsystems, described in the chapter Description of the project's components:

  • The subsystem for creating the DLKB (Disambiguation Linguistic Knowledge Base). It receives as input a lexicon containing definitions for each word and produces as output, with the assistance of specialist linguists, an unordered lexicon web with certain relationships between the meanings it contains.
  • The lexicon web processing subsystem. It receives as input an unordered lexicon web (an almost complete graph) and produces as output an ordered lexicon web that respects the input parameters regarding the number and types of permitted links.
  • The actual disambiguation system. This is an API that can be integrated into a disambiguation application which, given an arbitrary text and an ordered lexicon web as input, generates as output a disambiguated text in which each word is associated with exactly one meaning from the meaning dictionary.
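As an illustration of the input/output contract of the third subsystem, the following Python sketch assigns one meaning to each word by comparing precomputed definition sets. It is a sketch under stated assumptions, not the SenDiS algorithm: the overlap scoring is a simple Lesk-style heuristic chosen for this example, and the function name, the data layout and the toy data are all hypothetical. Unlike the system described above, it returns None for words that are missing from the lexicon.

```python
# Hypothetical toy data (an assumption, not a SenDiS knowledge base).
SENSES = {
    "bank": ["bank/finance", "bank/river"],
    "money": ["money/currency"],
    "deposit": ["deposit/put_money"],
}
DEFINITION_SETS = {
    "bank/finance": {"institution", "money/currency"},
    "bank/river": {"land", "water"},
    "money/currency": {"value", "exchange"},
    "deposit/put_money": {"money/currency", "institution"},
}


def disambiguate(tokens, senses, definition_sets):
    """Return (word, meaning) pairs, one meaning per known word.

    tokens: list of words in the text
    senses: word -> list of candidate meaning ids
    definition_sets: meaning id -> set of contributing meaning ids
    """
    result = []
    for i, word in enumerate(tokens):
        candidates = senses.get(word, [])
        if not candidates:
            # Word absent from the lexicon: no meaning can be assigned.
            result.append((word, None))
            continue
        # Context: every candidate meaning of the other words,
        # together with its definition set.
        context = set()
        for j, other in enumerate(tokens):
            if j != i:
                for sense in senses.get(other, []):
                    context |= definition_sets.get(sense, set()) | {sense}

        def overlap(sense):
            # Score a candidate by how much of it is shared with the context.
            own = definition_sets.get(sense, set()) | {sense}
            return len(own & context)

        # On ties, max() keeps the first candidate listed for the word.
        result.append((word, max(candidates, key=overlap)))
    return result
```

For the toy input ["deposit", "money", "bank"], the financial meaning of "bank" shares "institution" and "money/currency" with its context, while the river meaning shares nothing, so the financial meaning is selected.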

The operating principle of these subsystems is represented in the following scheme:


The project will be divided into 6 stages of research and development. At the end of each stage, status reports will be produced and, if needed, the planning of the following stages will be adjusted. Each stage is defined by its inputs from the previous stage and by the outputs of its activities, which are used in the following stages. The first stage will take as input the project proposal and the business plan, and will output the project plan, a detailed study of the disambiguation methods used in existing applications, and the preliminary specification of the disambiguation system. In stages 2 and 3, the algorithms will be specified and experimental models will be implemented for the following components: the tool for creating disambiguation knowledge bases, the methods for ordering the meanings network, and the algorithms for the disambiguation application. In stages 4 and 5, we will use the tool from stage 2 to describe the experimental models, the minimal disambiguation bases for the Romanian and English languages, and we will assess and adjust the network optimization and disambiguation algorithms from stage 3 according to the results obtained on the two knowledge bases. Stage 6 will take the optimized algorithms from stages 4 and 5 and integrate them into an API module, which will then be integrated into a prototype disambiguation system applied to a complete disambiguation base of the Romanian language. This prototype will represent the final objective of the research and development phase of the SenDiS project and its practical verification.

The project involves both industrial research and technological development activities and lasts 36 months; the total budget is estimated at 500,000 EUR. The non-reimbursable financial assistance obtained in 2010, when the financing contract was signed, amounted to 935,413 Lei, of which 776,392.79 Lei came from the European Regional Development Fund (FEDR) and 159,020.21 Lei from the National Budget.

For detailed information about other programs co-financed by the European Union, please visit www.fonduri-ue.ro.

 

 
