NLA - Language issues
Identification of Language
Predefined Text Banks
Spelling
Hyphenation
Keywords
Object Names
The natural language of textual data must be identified in an NLA. The specifications must include the language and the country. The following tasks will use the national language attributes:
- Hyphenation
- Spellchecking
- Defining standard text (such as names of months or days)
- Comparison of strings
- Sorting and searching (including a default method for each language; the default must not change automatically when the language is switched)
Certain documents or files may contain words from several languages. Consider, for example, a German document containing a long citation in French. Hence a mechanism for switching between languages must be part of an NLA.
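The switching mechanism can be illustrated with a minimal sketch. It assumes that a switch is written as a directive line of the form `.assign LANGUAGE, <name>`, following the formatter notation quoted under «Identification of Language»; the function name `tag_lines` and the identifiers GERMAN and FRENCH are hypothetical and serve only as illustration.

```python
def tag_lines(lines, default):
    """Yield (language, text) pairs, switching language at each directive.

    Assumption: a switch is a line starting with '.assign LANGUAGE,'.
    """
    current = default
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(".assign LANGUAGE,"):
            # Everything after the first comma names the new language.
            current = stripped.split(",", 1)[1].strip()
            continue
        yield current, line


# A German document containing a citation in French (hypothetical content).
document = [
    "Der Autor schreibt:",
    ".assign LANGUAGE, FRENCH",
    "Une longue citation.",
    ".assign LANGUAGE, GERMAN",
    "Ende des Zitats.",
]
tagged = list(tag_lines(document, default="GERMAN"))
```

Each text line then carries the language in force at that point, so that hyphenation or spellchecking can select the matching rules line by line.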
Identification of Language
The identification of the language must be universal. Text files are processed not only by special applications like formatters or text processing systems, but also by very general-purpose tools such as program editors. Hence the human reader of such a file must be able to recognize the specification. This could be achieved by a notation using only «printable characters» (see syntactic character set), as shown in the following example from a formatter input specification [Daube-3]:
.assign LANGUAGE, SWGERMAN
to specify the German language as used in Switzerland. The identification must include both the language itself and the country.
Attributes of a language are:
- language (e.g. German)
- country (e.g. Switzerland redefines language to Swiss-German)
- sorting scheme (default)
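The three attributes could be held in a small record that is looked up by the printable identifier. The following is a sketch only: the class `LanguageSpec`, the registry `LANGUAGES`, the identifier GERMAN and the sorting scheme value "dictionary" are assumptions made for illustration; only SWGERMAN is taken from the formatter example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageSpec:
    """Hypothetical record of the attributes of a language."""
    language: str        # e.g. "German"
    country: str         # e.g. "Switzerland"
    sorting_scheme: str  # default sorting scheme (value is an assumption)


# Hypothetical registry keyed by a printable identifier.
LANGUAGES = {
    "SWGERMAN": LanguageSpec("German", "Switzerland", "dictionary"),
    "GERMAN": LanguageSpec("German", "Germany", "dictionary"),
}


def resolve(identifier):
    """Return the attributes for an identifier, ignoring letter case."""
    try:
        return LANGUAGES[identifier.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown language identifier: {identifier!r}") from None


spec = resolve("SWGERMAN")
```

Because language and country are separate fields, two identifiers can share a language and still select different text banks or dictionaries.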
It is, for example, insufficient to have text elements (keywords, prompts, messages) available in a particular language (for example German), because terminology and usage very often vary between countries (e.g. Switzerland, Austria and Germany). For the end-user this is an important issue.
Predefined Text Banks
Predefined text like the names of months or days should not be a system constant, since the spelling may vary between countries. The «text bank» must be switched whenever the language and/or country is switched. The implementation of these elements should also be sufficiently flexible to allow installations to add definitions like the name of the installation or a copyright phrase:
German: Oerlikon-Bührle Rechenzentrum AG
English: Oerlikon-Bührle Computing Centre Ltd.
French: Centre de Calcul Oerlikon-Bührle SA
These text examples show the need to identify both the language and the code used (French, but not the code page of France). Flexibility is needed for variations due to company terminology.
It is essential that these texts use the «richest possible form» (upper and lower case, accents). Text may be converted to a less legible format (possibly all upper case with no accents) only at presentation time.
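A text bank of this kind might be sketched as a table keyed by language and country, with the conversion to a poorer form applied only on output. The names `TEXT_BANK`, `lookup` and `degrade_for_device`, the key "installation" and the choice of Switzerland as the country for all three entries are assumptions for illustration. Dropping the accent marks is a simplification; a real system might prefer language-specific fallbacks.

```python
import unicodedata

# Hypothetical text bank; entries are stored in the richest possible form.
TEXT_BANK = {
    ("German", "Switzerland"): {"installation": "Oerlikon-Bührle Rechenzentrum AG"},
    ("English", "Switzerland"): {"installation": "Oerlikon-Bührle Computing Centre Ltd."},
    ("French", "Switzerland"): {"installation": "Centre de Calcul Oerlikon-Bührle SA"},
}


def lookup(language, country, key):
    """Fetch a predefined text for the current language and country."""
    return TEXT_BANK[(language, country)][key]


def degrade_for_device(text):
    """Presentation-time conversion: drop accent marks, then upper-case."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.upper()


rich = lookup("French", "Switzerland", "installation")
poor = degrade_for_device(rich)
```

The stored text is never altered; only the copy sent to a limited device loses case and accents.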
Spelling
The architecture must allow for general use of basic text operations like checking the spelling of words. That is, the spelling function must be available anywhere - not just in special applications like formatters or text processing systems. Even «common» applications should be able to check spelling.
Spell checking can be performed by dictionary lookup combined with an algorithm for each language that handles conjugations and declensions. Since usage may differ between countries (like German in Germany and Switzerland), the architecture must handle language differences between countries that use the same language and should not assume a fixed dictionary for a certain language.
Dictionary Considerations
However, a dictionary should be stored in the richest possible way for a given language (use accented lower case and upper case letters in French, use umlauts and sharp-s in German, and so on). The algorithm used should nevertheless be able to handle country dependencies or character set limitations at processing time.
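The two requirements, country-dependent dictionaries and tolerance of character set limitations at processing time, can be illustrated as follows. This is a sketch under stated assumptions: the word lists are tiny illustrations, the names `DICTIONARIES`, `FALLBACK`, `fold` and `is_known` are hypothetical, and inflected forms are not handled at all.

```python
# Illustrative word lists only; each is stored in its richest form.
DICTIONARIES = {
    ("German", "Germany"): {"Straße", "Größe", "Haus"},
    ("German", "Switzerland"): {"Strasse", "Grösse", "Haus"},
}

# Assumed replacement spellings for text from a character set
# that has neither umlauts nor sharp s.
FALLBACK = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def fold(word):
    """Reduce a richly spelled word to the limited character set."""
    return word.translate(FALLBACK)


def is_known(word, language, country, limited_charset=False):
    """Check a word against the dictionary of one language and country."""
    words = DICTIONARIES[(language, country)]
    if limited_charset:
        # Fold the dictionary at processing time, not in storage.
        return word in {fold(entry) for entry in words}
    return word in words
```

With these lists, `is_known("Straße", "German", "Switzerland")` is false while the same word is accepted for Germany, and `is_known("Groesse", "German", "Germany", limited_charset=True)` is true although the dictionary itself stores only «Größe».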
Hyphenation
The architecture must allow for general use of basic text operations such as hyphenation. That is, hyphenation should be available to other applications besides formatters or text processing systems. Even «common» applications should be able to hyphenate text (presentation services!).
Hyphenation depends on
- an algorithm
- a dictionary
- a combination of both
Use of both dictionaries and algorithms must be independent of a code page or coding scheme used by the dictionary and the text itself.
Since usage may differ between countries (like German in Germany and Switzerland), the architecture also must provide country-dependent dictionaries rather than assuming a fixed dictionary for a certain language (see dictionary considerations under «spelling»).
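The combination of a dictionary and an algorithm might be sketched as a country-dependent exception dictionary that is consulted first, with an algorithm as the fallback. Everything named here is hypothetical: `EXCEPTIONS`, `toy_rule` and `hyphenate` are illustrative names, and the rule in `toy_rule` is a deliberately naive placeholder, not a correct hyphenation algorithm for any language.

```python
VOWELS = set("aeiouäöüAEIOUÄÖÜ")

# Hypothetical exception dictionary, keyed by language and country.
EXCEPTIONS = {
    ("German", "Switzerland"): {
        "Rechenzentrum": ["Re", "chen", "zen", "trum"],
    },
}


def toy_rule(word):
    """Placeholder algorithm: break before a single consonant
    that stands between two vowels."""
    parts, start = [], 0
    for i in range(1, len(word) - 1):
        if (word[i - 1] in VOWELS
                and word[i] not in VOWELS
                and word[i + 1] in VOWELS):
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


def hyphenate(word, language, country):
    """Dictionary first, algorithm as fallback."""
    exceptions = EXCEPTIONS.get((language, country), {})
    if word in exceptions:
        return exceptions[word]
    return toy_rule(word)


pieces = hyphenate("Monate", "German", "Switzerland")
```

The sketch operates on character strings rather than on bytes, so neither the dictionary nor the rule depends on the code page in which the text was originally stored.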
Keywords
Keywords used in end-user applications are a special form of predefined text. The selection of the names of keywords should be as flexible as possible. Keywords must be case-insensitive, and they must allow use of the full set of national alphabetic characters.
Also, keywords should depend on language and country.
End-user products and new products in general must use reserved words (keywords) in the national language of the user. These keywords must be translatable from country to country when porting source command files; hence they must be stored internally in a standard way (e.g. enumerated) or in a file that is independent of the natural language used for presentation.
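An enumerated internal form could look like the following sketch. The enumeration `Command`, the table `NAMES`, the two functions and the chosen keywords are hypothetical and serve only as illustration; matching ignores letter case and accepts national characters.

```python
from enum import Enum


class Command(Enum):
    """Language-independent internal form of the keywords."""
    QUIT = 1
    OPEN = 2


# Hypothetical presentation forms per language and country.
NAMES = {
    ("English", "United Kingdom"): {Command.QUIT: "Quit", Command.OPEN: "Open"},
    ("German", "Switzerland"): {Command.QUIT: "Beenden", Command.OPEN: "Öffnen"},
}


def parse_keyword(word, locale):
    """Map a keyword typed by the user to its internal form."""
    lookup = {name.casefold(): cmd for cmd, name in NAMES[locale].items()}
    return lookup[word.casefold()]


def translate_keyword(word, source, target):
    """Translate a keyword when porting a command file between countries."""
    return NAMES[target][parse_keyword(word, source)]


english = ("English", "United Kingdom")
swiss_german = ("German", "Switzerland")
ported = translate_keyword("quit", english, swiss_german)
```

Since only the enumerated value is stored, a command file written with one set of keywords can be presented with another without changing its meaning.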
Since keyword names very often are an issue of specialized jargon, they have to be selected carefully. Abbreviations should be avoided.
Keywords for a certain function in a certain language must be used within all applications following the SAA conventions. Thus in English it is not acceptable to have QUIT, TERMINATE, RETURN or END for the same function in different products. Hence the requirement is for a consistent nomenclature or terminology.
Object Names
Examples of objects as used in this paper are:
- files
- program modules
- data tables
Any operating system conforming to the NLA should not restrict object names to the syntactic character set, but should allow the use of all alphabetic characters necessary for the spelling of names in the user's language. Names of objects must also be allowed in upper and lower case. Restrictions on the use of special characters (like brackets, braces, hyphen, slash) in object names should be minimized.
Since names in other languages are often much longer than in English, file names, keywords and other items must be able to have names longer than 8 characters. A good architecture would provide for variable length fields in every case. A name must be allowed to be as long as required, and the maximum allowed length must be the same for all environments.
This is a requirement of great importance for SAA, where several operating systems are involved. For example, file names must not be converted to abbreviated forms when files are transferred between operating systems (round-trip coherence is required for file names).