Iñaki Arantzabal
Why processing Basque automatically is so difficult
Basque is a highly agglutinative language. This complicates automatic processing, because good results are only possible if the software "knows" the structure of the language.
Suppose we want to search for the word euskara ('Basque language') in a group of Basque texts. In the text shown in the illustration below, "euskara" occurs four times, yet an ordinary search system such as those used in many languages would not find them. The reason is that the word euskara does not appear in precisely that form in the text but in various declined forms, whereas the ordinary system doesn't know anything about Basque declension endings.
Now suppose we are not familiar with one of the words occurring in this same text and wish to look it up in an automatic dictionary. The word we want to look up is egokituta, which means 'adapted'. This word is not given as such in an ordinary dictionary, but only in the base form egokitu 'to adapt'. Once again, to be able to process the word it is necessary to know that the word's headform is egokitu.

Figure 1
So unlike many other languages, in the case of Basque it is practically a requirement that language technology be used in order to get decent results from automatic processes.
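The two examples above can be sketched in a few lines of Python. This is only a toy illustration: the list of endings is an assumption chosen to cover a handful of forms of euskara plus egokituta, and a real system would need a full morphological analyser rather than simple suffix stripping.

```python
# Toy list of endings, chosen only to cover the forms below (an assumption,
# not a description of Basque morphology as a whole).
TOY_ENDINGS = ["ren", "ri", "ta", "z", "k"]

def toy_headform(word):
    """Strip one known ending from the word, if any."""
    lowered = word.lower()
    for ending in TOY_ENDINGS:
        # Keep at least four characters so very short words are left alone.
        if lowered.endswith(ending) and len(lowered) - len(ending) >= 4:
            return lowered[: -len(ending)]
    return lowered

def exact_search(words, query):
    """What an ordinary search does: match the query string exactly."""
    return [w for w in words if w.lower() == query]

def headform_search(words, query):
    """Match on the (toy) headform instead of the surface form."""
    return [w for w in words if toy_headform(w) == query]

WORDS = ["euskaraz", "euskararen", "euskarak", "euskarari", "egokituta"]

print(exact_search(WORDS, "euskara"))     # []
print(headform_search(WORDS, "euskara"))  # the four declined forms
print(toy_headform("egokituta"))          # egokitu
```

The exact search finds nothing, while the search on headforms finds all four declined forms. The same reduction step is what allows egokituta to be looked up under egokitu.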
Basque's small corpus
A corpus is a set of texts, usually in an electronic format. According to Beñat Oihartzabal, director of the research section of Euskaltzaindia (the official Academy of the Basque Language), "a corpus is a collection of language data that is used either for the description and analysis of a language or as a data source usable and accessible to electronic resources."
The development of many applications depends on having a large corpus of examples available. Whether they are used for statistical methods, for training a system or for other purposes, corpora are highly useful for improving the performance of automatic processes. In the case of Basque, suitable collections of documents in digital format are lacking in many areas of human knowledge.
The lack of language normalisation
As in many other areas, deficiencies in the normalisation of the Basque language create various additional complications for the application of new technology.
The absence of cognate languages
Basque is a language isolate, with no known relatives. Thus it is not possible to reuse directly the resources developed for other languages. Because of the wide gap between Basque and other languages, we cannot, for instance, apply the materials developed for French or Spanish. As a matter of fact, work done on Hungarian is far more useful to us as a basis for developing a processing system for Basque than anything developed for Basque's closer neighbours.
The work of the University of the Basque Country's IXA group (http://ixa.si.ehu.es/Ixa) over recent years has made important contributions towards overcoming these obstacles. Most of the developments that will be referred to in this article have made use of the important resources that have come out of IXA's groundbreaking work.
What resources exist, and what remains to be done?
Some of the tools and applications that will be mentioned in this article have already been developed and are available for downloading from the websites of the Basque Government and the Basque Summer University (UEU).
How can new technology help us to write, work and live in Basque? I will try to answer this question by means of the examples on the following pages. Some of those mentioned represent materials that have already been developed, while other projects are still waiting for funding. Still others already exist in neighbouring languages. But all of these tools are attainable through a concerted effort, and would be of great interest for the Basque language.
Terminology standardisation
- A national terminology bank called Euskalterm is available to the public on the Internet (www.euskara.euskadi.net/euskalterm). Euskalterm collects together most of the work that has been done on terminology in the Basque Country, and is updated every three months. Another resource available is Euskaltzaindia's Unified Dictionary (Hiztegi Batua).
- At the individual level, spell checkers can contribute much to the language standardisation effort. The only Basque spell checker that has been developed so far is XUXEN (http://www.euskara.euskadi.net/euskara_soft). The next step along these lines, which has already commenced with support from the Basque Government, is the development of a grammar checker that will work within Word.

Figure 2
- But neither Euskaltzaindia nor Euskalterm can meet all the demands of Basque organisations for terminological assistance. Institutions and companies have the option of employing centralised correctors. Such tools make it possible to create a unified database for use throughout an organisation. A language office is in charge of entering new terms into the bank to ensure terminological coherence. When a word is misspelt or employed incorrectly by any employee or member of the organisation, the spell checker proposes an alternative. Moreover, tools that can automatically catalogue the terminology used within an organisation's documents have been developed for Basque. This greatly facilitates the task of loading the terms used by a given organisation into its spell checker. The following figure illustrates how the system could be used to correct the use of Dinamarka to Danimarka ("Denmark"):

Figure 3
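The idea of a centralised corrector can be sketched as a shared term bank that maps non-preferred forms to the forms an organisation has agreed on. The names TERM_BANK and suggest_corrections are hypothetical, and the bank contains only the Dinamarka/Danimarka pair mentioned above. A real tool would also have to recognise declined forms, which this sketch does not attempt.

```python
import re

# Hypothetical organisation-wide term bank: non-preferred form -> preferred form.
# The language office would be the only party allowed to edit it.
TERM_BANK = {"dinamarka": "Danimarka"}

def suggest_corrections(text, term_bank=TERM_BANK):
    """Return (word_found, suggestion) pairs for words listed in the term bank."""
    suggestions = []
    for word in re.findall(r"\w+", text):
        preferred = term_bank.get(word.lower())
        if preferred is not None and word != preferred:
            suggestions.append((word, preferred))
    return suggestions

print(suggest_corrections("Dinamarka eta Danimarka"))
# [('Dinamarka', 'Danimarka')]
```

Because every user's checker reads the same bank, a term entered once by the language office is proposed consistently throughout the organisation.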
Systems for resolving doubts over language issues
"Environment" may be translated into Basque as either ingurugiro or inguramen, so which is the right word to use in a given context? There exist several Internet services whose function it is to resolve such uncertainties as this, in which users may check the answers given to questions by other users, and also formulate their own questions:

Figure 4
- Euskaltzaindia's JAGONET (http://www.euskaltzaindia.org/jagonet/) is a free consultancy service that aims to promote good Basque usage. A user who cannot find the answer to a question already on the site may write to the Academy of the Basque Language to ask for advice. However, JAGONET is not the place to ask about terminology or etymologies, to request theoretical explanations, or to obtain translations.
- IVAP, the Basque institute of public administration, offers its own service called Duda-Muda (http://www.ivap.euskadi.net/r61-2347/eu/contenidos/informacion/dudamuda/eu_3803/dudamuda_e.html), a free service for citizens and public entities in the Basque Autonomous Community. Its function is to resolve doubts about correct Basque usage in the area of administrative or legal terminology. It is not a terminological database, however, but a service offered by a team of researchers.
- Elhuyar offers another such service for technical Basque (http://www.zientzia.net/galdera_bidali.asp).
Many institutions and companies have Basque language officers whose jobs include resolving uncertainties about language use, and special software has been developed to facilitate their work by channelling communication and storing information. A well-known example is the University of the Basque Country's Ehulku service (http://www.ehu.es/ehulku/), which is oriented to the needs of university staff and aims to support good Basque usage in the university. The organisers emphasise that Ehulku is neither a language school nor a forum for the discussion of language research issues, but rather a service to help find solutions to language problems arising in the university's provision of information and services in Basque.
Other entities using such software applications for the resolution of institutional Basque language issues include the Basque Government, the Elhuyar Foundation, and the provincial government of Bizkaia.
Dictionaries in new formats
Gone are the days when dictionaries had to be printed on paper. Today it is quite usual to consult dictionaries on the Net. Thanks to Basque government support, the main dictionaries in use are now available in electronic format on the Internet. These are updated two or three times a year, so the information they contain is more up-to-date than that provided in the latest print versions.
On-line Basque-Spanish and Basque-French dictionaries can be used as tools aiding web browsing in Basque. Unfortunately, many Basque speakers choose to surf the Internet in Spanish or French even when on websites available in Basque because they feel they may miss out on information if they don't understand a word here or there. To help solve this problem, new technology is able to provide quick dictionary help whenever a user activates a difficult word.
Suppose a user is looking for information on a bank's website in Basque and comes across the "strange" word onuren. By clicking on this word and touching a key (F12, for instance), the dictionary function is called up. This system first determines the headword, then looks this up in its internal dictionary, and finally displays possible translation equivalents in a window on the screen. The user then carries on browsing. This avoids the need for the user to switch over to a different language version of the site. This may seem like a very small step towards language normalisation, but we should not forget that the more people use the Internet in Basque, the more content will be made available in Basque. It is a known fact that most of the organisations that place Basque language matter on their websites regularly check on the number of users who read the Basque pages.

Figure 5
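The three steps just described (determine the headword, look it up, display the equivalents) can be sketched as follows. The two tables are illustrative stand-ins for a morphological analyser and a bilingual dictionary. The entry for egokituta comes from the example earlier in this article, while taking onura ('benefit') as the headword of onuren is my own assumption for the purposes of the sketch.

```python
# Illustrative data only. A real system would compute headwords with a
# morphological analyser instead of listing inflected forms one by one.
HEADWORDS = {
    "onuren": "onura",        # assumed headword for the example word
    "egokituta": "egokitu",
}
DICTIONARY = {
    "onura": ["benefit"],
    "egokitu": ["to adapt"],
}

def look_up(surface_form):
    """Step 1: find the headword. Step 2: fetch its translation equivalents."""
    form = surface_form.lower()
    headword = HEADWORDS.get(form, form)       # fall back to the form itself
    translations = DICTIONARY.get(headword, [])
    return headword, translations

# Step 3: show the result to the user (here simply printed).
headword, translations = look_up("onuren")
print(headword, "->", ", ".join(translations))
```

An unknown word simply yields an empty list of equivalents, so the user can carry on browsing either way.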
The same kind of system may be used to look up words in other languages when web-surfing in English, French or Spanish, for example. This option would permit us to look up words we don't know in these languages and see their Basque equivalents.
Such dictionaries can be incorporated into Word so that we can look up words while either writing or reading in the word-processor: the translation of any word we need is only a mouse click away. Software of this kind now available includes Elhuyar's Basque-Spanish and Basque-French dictionaries and UZEI's dictionary of synonyms.
Dictionaries in the work place: There exist other tools for use by staff in organisations and businesses that allow a user to look up words in dictionaries while in a word processor, on a website or simply on the computer's desktop. Many such dictionary plug-ins are now available, and Babylon (www.babylon.com) already offers over thirteen languages.
A dictionary in the hand: Hand-held computers are becoming popular these days. They may contain a diary, word-processor, Internet access, GPS, e-mail and many other resources.

Figure 6
New technology can provide dictionaries for hand-held computers that can translate between Spanish, English, French and Basque. This is a new area we will need to work on.

Figure 7
Corpus systems
Often it is not enough just to look up a word in a dictionary. What we need to know is how to use the word correctly. For this, it helps to be able to see the word used in a sentence, in context. Very often, seeing the whole sentence can provide us with more semantic information than a mere dictionary definition. Today it is possible to see examples of a word or expression as used in works of twentieth-century literature, translated textbooks for occupational training, or newspaper and magazine archives.
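Seeing a word in context is usually done with a keyword-in-context (concordance) listing. The sketch below is a minimal version under simplifying assumptions: the text is already split into words, matching is exact, and the sample sentence is an English placeholder rather than real corpus data. For Basque, the matching step would have to work on headforms so that declined forms are also found.

```python
def concordance(tokens, keyword, width=3):
    """Return one line per hit, with `width` words of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Placeholder text, not taken from any of the corpora mentioned here.
sample = "the grant was awarded and the grant was renewed".split()
for line in concordance(sample, "grant", width=2):
    print(line)
# the [grant] was awarded
# and the [grant] was renewed
```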
The twentieth-century corpus (http://www.euskaracorpusa.net/) is a 4,658,036-word corpus of twentieth-century texts. This can be consulted as a repository of ordinary language usage. Note that it is not necessarily intended as a guide to correct usage.
The exemplary corpus of contemporary prose (http://www.ehu.es/euskara-orria/euskara/ereduzkoa/), on the other hand, aims to serve as a tool for writers who, even at the university level, often face doubts about the correct forms of words, the best expressions or the right syntax. This corpus, provided by the Basque language service of the University of the Basque Country, collects together contemporary texts by the very best present-day writers of Basque, and comes with a powerful, user-friendly search engine to enhance the resource's usefulness. It is hoped that this corpus will be used to resolve writers' doubts through comparison with examples of the best existing usage.
Lanbide Ekimena (http://www.jakinbai.com/) will offer a corpus of translated textbooks for occupational training covering a wide range of subjects. The organisation LANEKI and the Lanbide Ekimena project were started by the associations HETEL and IKASLAN with the long-term goal of translating textbooks and providing them to users in electronic format via the Internet.
The Elhuyar Foundation's Science and Technology Corpus is a specialised corpus that brings together texts written and published in Basque between 1990 and 2002 in fields of science and technology. It includes both material originally written in Basque and Basque translations from other languages.
The publisher Susa maintains a literature website called Armiarma (http://www.armiarma.com/) which contains a large number of works.
In other areas, significant collections of documents in Basque are still lacking.
Information searches
An incredible amount of information is available today. While the advantages of this are undeniable, there is an increasing need to combine criteria of quantity and quality in the processes of accessing this information. Without quality tools for accessing information, the usefulness of the information may be substantially reduced no matter how much of it there is. This problem makes itself felt when we want to find some specific information on the Internet but what we are looking for is swamped by a mass of other information that has nothing, or very little, to do with what we need.
The following illustration shows what happens when we do a search on the portal www.euskadi.net for the Basque word beka ("grant" or "scholarship"):

Figure 8
This example illustrates the importance of employing language technology in information searches in Basque.
Software localisation
It is quite important that the office applications we use on a daily basis (word-processor, e-mail programme, operating system etc.) should be in our own language. This is not easily achieved, however, because of the frequency of version updates in such programmes and the large investment required to keep producing Basque translations of them. But there are some applications that can be used in Basque, most of which may be found on the two websites mentioned earlier.
Digitalisation of information
Optical Character Recognition (OCR): When we scan a text from a printed page, we merely obtain an image of that text. This cannot be manipulated in a word-processor, modified or processed as text. For that we need an OCR application, which takes the scanned image of the text as input and outputs a text composed of characters that the computer can "understand". Thus OCR means recognition of the characters on a written or printed page. It is as if the OCR took the "photograph" of each character and analysed this to recognise it and in this way convert the image of the text into a text document in an ordinary computer-readable character code (such as ASCII).
OCR systems are used daily in a number of areas. For example, libraries and document archives employ OCR to convert and store the content of paper documents in a digital format. Every day millions of periodicals and items of correspondence are sorted using OCR to increase the speed of postal distribution. OCR for use with the Basque language has been developed, so it is possible to scan books and papers in Basque and convert them to digital text format. As in any other language, the results of OCR still need to be corrected manually, since this method cannot ensure 100% accuracy.
Translation
Every day large amounts of information are translated from Spanish into Basque and vice versa. Such translation can be made easier by CAT (Computer Aided Translation) tools. What these systems do is "remember" what people have previously translated, so that when a similar text appears again they can propose a translation to the translator. This makes more efficient use of work the translator has already done. Such an application stores the translator's work in a database called a translation memory. Over the past fifteen years applications for the management of translation memories have become important tools in the field of translation. One of their advantages is that they make it possible to work faster, and another is that they help to ensure the quality of translations. Translation memory is particularly valuable where many documents in a specialised field (such as administration or law) need to be translated, and where very similar sentences or phrases are often repeated.

Figure 9
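The way a translation memory "remembers" earlier work can be sketched with a simple similarity score. This is an illustration only: the class name, the 0.75 threshold and the use of Python's difflib for matching are my assumptions, not a description of how any particular CAT product works, and the stored translation is a placeholder string.

```python
from difflib import SequenceMatcher

class TranslationMemory:
    """Stores (source, target) sentence pairs and proposes the closest match."""

    def __init__(self):
        self.segments = []

    def add(self, source, target):
        self.segments.append((source, target))

    def propose(self, new_source, threshold=0.75):
        """Return (stored_target, score) for the best match, or None."""
        best_target, best_score = None, 0.0
        for source, target in self.segments:
            score = SequenceMatcher(None, new_source.lower(), source.lower()).ratio()
            if score > best_score:
                best_target, best_score = target, score
        if best_target is not None and best_score >= threshold:
            return best_target, best_score
        return None

tm = TranslationMemory()
tm.add("The application deadline is 30 June.", "<translation stored earlier>")

# A very similar sentence: the stored translation is offered for revision.
print(tm.propose("The application deadline is 15 July."))
# A sentence with nothing in common: no proposal is made.
print(tm.propose("zzz"))
```

The translator still has to adapt the proposal (here, the date), which is why the score is returned alongside it.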
The advantages of using CAT tools:
- Speed of translation.
- Improved quality of translations.
- Maintaining consistency: glossaries may be incorporated and the application made to tell us what terms are defined in the glossary.
- Recycling of previous work by the translator or work group.
- Faster integration of new staff members, since parts of the organisation's information and translation guidelines are stored in the translation memory.
Mechanical translation
The newspaper El Periódico de Catalunya comes out every day in Spanish and Catalan editions. Journalists write the paper in Spanish first and then it is translated into Catalan by computer. After a team of editors has revised the computer output, the newspaper is ready for publication in Catalan at the same time as the Spanish language edition. Mechanical translation is also used by many people surfing the Internet. Suppose I wish to take a look at a German website but I don't understand German: by means of mechanical translation I will be able to read the same site in English or Spanish. The translation will not be perfect, but will nevertheless provide basic access to the information I am interested in. These are the two main uses of mechanical translation: to assist translators in their work and to provide access to information.
These days there are many mechanical translation systems available. Systems that translate between English, French, German and Spanish have been around for a long time. The Catalans have several systems and the Galicians have one too, but there is still no such system available for Basque. The first system of mechanical translation from Spanish to Basque will soon be made public. Called OpenTrad (www.opentrad.com), it has drawn heavily on the Catalans' experience in this field as well as work by the IXA group here in the Basque Country. Thus the first step has been taken, although much work remains to be done. The process of gradually improving the results this system can produce will take place in the form of specific projects geared to clients' needs and well-defined goals. The objective of creating a system that translates in the opposite direction, from Basque to Spanish, remains on the agenda, as does that of developing a system that can translate between English and Basque.
Meanwhile, it is possible today to browse the Gipuzkoa provincial government's official website in Catalan if we want to, even though the information has not been provided in this language by the owners of the site, thanks to the wonders of mechanical translation.

Figure 10
Virtual communities
Thanks to new technology we can distribute information to as many users as we wish or communicate with many people all at once. The Internet and new technology allow us to cross geographical, social and political barriers at will. Although some people in Euskal Herria are already taking advantage of these possibilities, there remains much to be done.
It turns out that translators themselves have been pioneers in this respect. The professional electronic mailing list Itzulist began to function several years ago under the aegis of the Basque translators', editors' and interpreters' association EIZIE (www.eizie.org). Any user who is signed on can consult other users on any question and read other users' opinions. This has resulted in the creation of a true virtual community and an effective and user-friendly working tool.
There are other such resources for science workers, such as the www.zientzia.net portal which contains much information in Basque on scientific matters, or www.BasqueResearch.com which provides information on dissertations published in the Basque Country.
Not only translators and scientists, but also computer scientists, pharmacists, doctors, lawyers and so on would all benefit from the support of such communities to help maintain the quality of their work in Basque.
Other projects such as Sustatu (www.sustatu.org) and Erabili (www.erabili.com) also represent important efforts supporting the spread of information and participation in the Basque speaking community.
Today practically anyone can easily create a space on the Internet to publish information and receive other people's input, so it is not difficult to create a virtual community. A good example is the rise of the blogging phenomenon. A blog is a sort of diary or bulletin maintained by an individual or group in the form of a website consisting of a series of pages ordered from most recent to oldest and updated as and when it suits the author to do so. For further information in Basque see http://www.uztarria.com/blogak/, http://www.berria.info/blogak/index.php and http://www.eibar.org/blogak/.
We should also mention some especially useful Internet resources such as language learning resources (BaiBai-Didaktiker or HABE's e-learning system for studying Basque), the recently developed technical Basque engine www.euskadi.net/aditu and the electronic editions of various Basque daily newspapers and periodicals.
Thus there are numerous resources available in different fields, and it is worth making an effort to become acquainted with them and use them. In many other areas tasks remain pending, and both the administration and private companies need to increase their efforts if they don't want to get left behind.
Incorporating language technology into company processes may be seen as a medium-term goal. We need to develop computer tools that support the use of Basque at work, integrated into the applications used there every day and covering the areas of language required.
As in the case of company computerisation, there must also be support for the incorporation of such resources within the framework of language normalisation, promoting options for developing a single basic instrument that brings together all the tools and assistance needed to be able to work in Basque.
In the past, language training and human resources have claimed the attention of language normalisation movements. In the future, once processes have attained a certain level, incorporation of appropriate technology should also be promoted.
For institutions, businesses and individuals the tasks to concentrate on are becoming acquainted with existing resources, using them, spreading their use, updating them and creating new resources. It is essential that year by year investment by the administration, companies and ordinary users in new technology should increase.