A conventional search often returns a large number of hits. This seemingly extensive list can obscure the fact that a considerable number of additional relevant hits are missing, because conventional searches often fail to find the (morphological) variants of the search terms.
This is where the linguistic knowledge offered by EXTRAKT comes into play. EXTRAKT automatically finds the other variants of a given search term and adds them to the search. These include the inflected forms of a word (for example plural forms), different spellings (British and American English, Swiss German spelling, the German spelling reforms, etc.) and variation in the use of umlauts and accents.
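As a rough illustration of this kind of variant expansion, the following Python sketch looks each search term up in a small variant table and ORs the variants together. The table contents and the names `VARIANTS`, `expand` and `build_query` are assumptions made for this example; they are not part of EXTRAKT, whose dictionaries are far larger.

```python
# Toy variant table. Every entry is illustrative and not taken from EXTRAKT.
VARIANTS = {
    "treaty": {"treaties"},                       # inflected (plural) form
    "colour": {"colours", "color", "colors"},     # British/American spelling
    "strasse": {"straße", "strassen", "straßen"}, # Swiss German spelling
}

def expand(term):
    """Return the term plus all known variants, sorted for stable output."""
    key = term.lower()
    return sorted({key} | VARIANTS.get(key, set()))

def build_query(terms):
    """AND the search terms together, OR each term with its variants."""
    groups = ["(" + " OR ".join(expand(t)) + ")" for t in terms]
    return " AND ".join(groups)

if __name__ == "__main__":
    print(build_query(["treaty", "colour"]))
```

A term that is missing from the table is simply searched unaltered, which corresponds to the behaviour of a conventional search.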
Consider an example in which the search terms are "Vertrag" and "international". If the search is done without the linguistic component, both terms are searched unaltered and their variant forms are not considered. The search is therefore only for: international
and for: Vertrag
This obviously leads to incomplete results, which truncation cannot satisfactorily improve. With EXTRAKT the different variant forms are automatically recognized and searched. Likewise, German compound (composite) words are separated into their component parts, so that here too significantly better searching is possible. The old and new German spellings can also be searched simultaneously. EXTRAKT further improves searching with its multilingual component, which translates the search term into various languages.
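The separation of German compound words mentioned above can be sketched with a simple dictionary-based splitter. This is only a minimal illustration: the lexicon, the list of linking elements and the function `split_compound` are hypothetical, and the text does not say how EXTRAKT performs this step.

```python
# Hypothetical mini-lexicon; a real system would use a very large dictionary.
LEXICON = {"vertrag", "recht", "staat", "international"}
# Linking elements that may appear between the parts of a German compound.
LINKERS = ("", "s", "es")

def split_compound(word, lexicon=LEXICON):
    """Split a compound into known parts; return [] if no split is found."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest possible first part before shorter ones.
    for i in range(len(word) - 1, 0, -1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = split_compound(rest[len(linker):], lexicon)
            if tail:
                return [head] + tail
    return []

if __name__ == "__main__":
    print(split_compound("Vertragsrecht"))
```

With this lexicon, "Vertragsrecht" is split into "vertrag" and "recht", so a search for "Vertrag" could also match documents that contain only the compound.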
The standard languages in FILERO are German, French and English. So in our example the following translations and their variant forms are also searched:
- accord + accords
- contrat + contrats
- convention + conventions
- traité + traités
- international + internationaux
- agreement + agreements
- convention + conventions
- treaty + treaties
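The list above can be reproduced with a small sketch that first translates a term and then adds the variant forms of each translation. The word pairs are taken from the list. Their assignment to French ("fr") and English ("en"), the English entry for "international", and the names `TRANSLATIONS`, `VARIANTS` and `expand_multilingual` are assumptions made for this example.

```python
# Translations of the German search terms, grouped by target language.
TRANSLATIONS = {
    "Vertrag": {
        "fr": ["accord", "contrat", "convention", "traité"],
        "en": ["agreement", "convention", "treaty"],
    },
    "international": {
        "fr": ["international"],
        "en": ["international"],  # assumed; the list gives only French forms
    },
}

# Variant forms, keyed by (language, word) because "convention" occurs twice.
VARIANTS = {
    ("fr", "accord"): ["accords"],
    ("fr", "contrat"): ["contrats"],
    ("fr", "convention"): ["conventions"],
    ("fr", "traité"): ["traités"],
    ("fr", "international"): ["internationaux"],
    ("en", "agreement"): ["agreements"],
    ("en", "convention"): ["conventions"],
    ("en", "treaty"): ["treaties"],
}

def expand_multilingual(term, languages):
    """Return translations of term plus their variants, without duplicates."""
    forms = []
    for lang in languages:
        for word in TRANSLATIONS.get(term, {}).get(lang, []):
            for form in [word] + VARIANTS.get((lang, word), []):
                if form not in forms:
                    forms.append(form)
    return forms

if __name__ == "__main__":
    print(expand_multilingual("Vertrag", ["fr", "en"]))
```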
It is obvious that the ability to search synonyms broadens the scope of results yet further.
In addition to German, English and French, other languages available are Italian, Spanish and Latin.
Additional languages (for example Dutch) are envisaged in the short term.
The foundations of EXTRAKT are extremely large (unabridged) dictionaries in several different languages, as well as bilingual dictionaries. The sizes of these dictionaries range from a few thousand entries (for example the dictionary for the new German spellings, with ca. 6,000 entries) to nearly a million entries (the German compound dictionary contains around 940,000 entries). The bilingual dictionaries contain between 20,000 and 130,000 entries. These dictionaries are heavily compressed by a special process and can therefore be held in the computer's RAM. This makes analysis and translation especially fast.
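The text does not describe the compression process that EXTRAKT uses. Purely to illustrate why large word lists compress well, the sketch below applies front coding, a common textbook technique in which each word in a sorted list is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names and sample words are assumptions for this example.

```python
import os

def front_encode(sorted_words):
    """Encode a sorted word list as (shared_prefix_length, suffix) pairs."""
    encoded = []
    prev = ""
    for word in sorted_words:
        shared = len(os.path.commonprefix([prev, word]))
        encoded.append((shared, word[shared:]))
        prev = word
    return encoded

def front_decode(encoded):
    """Rebuild the original word list from the (length, suffix) pairs."""
    words = []
    prev = ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        words.append(prev)
    return words

if __name__ == "__main__":
    words = ["vertrag", "vertrags", "vertragsrecht", "vertreter"]
    print(front_encode(words))
```

Because inflected forms and compounds share long prefixes, most entries shrink to a number and a short suffix.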
Alongside these dictionaries, other computational-linguistic processes are used, for example the recognition of multi-word terms (e.g. the French "pomme de terre"). Special dictionaries, synonym dictionaries, thesauri and "private" (customised) dictionaries can easily be added to the system.
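One simple way to recognise multi-word terms is a longest-match lookup over the word sequence, sketched below. This is an illustrative approach under stated assumptions, not a description of the EXTRAKT implementation; the term list contains only the example from the text, and `MULTIWORD_TERMS` and `tokenize` are hypothetical names.

```python
# Known multi-word terms, stored as tuples of lower-case words.
MULTIWORD_TERMS = {("pomme", "de", "terre")}

def tokenize(text, terms=MULTIWORD_TERMS):
    """Split text into tokens, keeping known multi-word terms together."""
    words = text.lower().split()
    max_len = max((len(t) for t in terms), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in terms:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No multi-word term starts here, so keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens

if __name__ == "__main__":
    print(tokenize("une pomme de terre cuite"))
```

Treating "pomme de terre" as one token lets it be looked up in a dictionary as a unit rather than as three unrelated words.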
EXTRAKT is based on the linguistic server EXTRAKT and is available for Windows NT, Windows 2000, Windows XP and Linux. It is already in use in several different applications, for example the internet search engine Scoutmaster.
EXTRAKT was developed in part with support from the European Union as part of the ESPRIT and LIBRARIES programmes.
The basic principles of the linguistic analysis of German used in EXTRAKT are described in: Stegentritt, Erwin (Ed.): German Analysis, Morpho-Syntax within the free-text retrieval project EMIR. (Sprachwissenschaft - Computerlinguistik. Linguistics - Computational Linguistics vol. 15). Saarbrücken 1993.
You can find further information about EXTRAKT at www.textec.de.