Analysing Information from Large Documentary Bases - The ILC Project

by Yannick Toussaint and Jean Royaute

The ILC Project (Infométrie, Langage et Connaissance) is a collaboration between the DIALOGUE Team of the INRIA-Lorraine & CRIN-CNRS Laboratory in Nancy and the Infometry Research Program of the INIST-CNRS Laboratory. It aims at partly modelling and structuring the knowledge written in large documentary bases. This modelling will facilitate information analysis. The project is part of the ILIAD project, supported by the French National Cognitive Science Program (GIS 'Science de la Cognition').

The tools and methods currently being developed in the ILC project should enable a human operator to gather the information content of a text without reading it sequentially. Information analysis is the step that follows information retrieval. It is based on informetric methods that use statistical data analysis techniques, combined with approaches from large-corpus linguistics for identifying term structures and locating them, and the relations between them, in the texts. Techniques from artificial intelligence are used to collect and organise the knowledge that emerges from these linguistic and statistical processes.

We assume that the major part of the information in technical texts is located in noun phrases. Therefore, our strategy for analysing information relies on robust, partial linguistic processing based on term and noun phrase identification. Combining statistical and linguistic methods, we search the texts for the conceptual links that exist between terms in the domain. We pay special attention to the identification of a set of linguistic connectors and to certain domain-specific predicative structures.

We divided the project into two phases. The first, which is now near completion, consists in building an automatic process that recognises thesaurus terms in texts and classifies these texts according to term co-occurrence criteria.
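As a rough illustration of this first phase, the sketch below matches thesaurus terms in short texts and counts how often two terms occur in the same text. It is a minimal Python sketch under our own assumptions, not the ILC implementation: the sample thesaurus, the sample texts and the function names (find_terms, cooccurrence_counts) are hypothetical, and plain lower-cased whole-word matching stands in for the lemmatisation and term-variant handling that real term recognition requires.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical thesaurus: lower-cased terms, single or multi-word.
THESAURUS = {"gene expression", "protein", "cell wall"}


def find_terms(text, thesaurus):
    """Return the set of thesaurus terms that occur in the text.

    Matching is done on whole words of the lower-cased text; no
    lemmatisation is attempted, so inflected variants are missed.
    """
    lowered = text.lower()
    found = set()
    for term in thesaurus:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, lowered):
            found.add(term)
    return found


def cooccurrence_counts(texts, thesaurus):
    """Count, for each pair of terms, the number of texts containing both."""
    counts = Counter()
    for text in texts:
        # Sorting gives each pair a single canonical order.
        terms = sorted(find_terms(text, thesaurus))
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts


# Hypothetical sample texts, used only to exercise the functions.
texts = [
    "Gene expression controls which protein is produced.",
    "The protein is anchored in the cell wall.",
    "Gene expression and protein levels were measured.",
]
counts = cooccurrence_counts(texts, THESAURUS)
```

Pair counts of this kind could then feed a classification step that groups texts sharing the same co-occurring terms; the choice of classification method is left open here.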

The second phase aims at identifying structures in the texts, predicative or not, that could reveal a conceptual link between two terms. This should lead to the construction of a knowledge base containing the terms and their conceptual relations, with the initial thesaurus as its main structure.
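One simple way to picture this second phase is to take the words that lie between two recognised terms in a sentence as a candidate connector. The Python sketch below does only that; it is an assumption made for illustration, not the project's method. The function name candidate_links, the sample sentence and the term list are hypothetical, and a real system would filter the candidates against a list of linguistic connectors and domain-specific predicative structures.

```python
import re


def candidate_links(sentence, terms):
    """Return (term_a, connector, term_b) triples for adjacent terms.

    The connector is the raw lower-cased text between two consecutive
    term occurrences; only the first occurrence of each term is used.
    """
    lowered = sentence.lower()
    spans = []
    for term in terms:
        match = re.search(r"\b" + re.escape(term) + r"\b", lowered)
        if match:
            spans.append((match.start(), match.end(), term))
    spans.sort()  # order the terms by position in the sentence

    triples = []
    for (start_a, end_a, term_a), (start_b, end_b, term_b) in zip(spans, spans[1:]):
        if end_a <= start_b:  # skip overlapping matches
            connector = lowered[end_a:start_b].strip()
            triples.append((term_a, connector, term_b))
    return triples


# Hypothetical example sentence and terms.
links = candidate_links(
    "The protein is anchored in the cell wall.",
    ["protein", "cell wall"],
)
# links == [("protein", "is anchored in the", "cell wall")]
```

Each triple is only a candidate: whether the connector expresses a genuine conceptual relation still has to be decided by further linguistic analysis.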

Searching terms in corpora and classifying them

The first phase of the project combines three different stages.

Conclusion and future work

In order to integrate these three stages, we had to develop robust linguistic tools such as a lemmatiser for French. Equivalent tools were developed for English, and the results of experiments on the same domain are very close to those for French.

The second phase of the project starts next month, and we will then focus on three points.

