NLA - Encoding of character data
Data encoding is an issue of system services and in particular of the operating system and its file handling.
Many of the issues presented in this paper are vital to presentation services, which are a fundamental building block of the Systems Application Architecture (SAA). These services are useless unless the end-user can communicate correctly in his own language.
The architecture should not introduce new coding schemes where existing code standards can be used. Parameters used to specify language and country should use ISO alphabetic standard naming conventions as a minimum (the compact form being specified as an option). For example the code page attribute of a file should be readable using any editor, and should not be available only to specialized system services.
However, specialized applications should hide this information. For example, a text processing system should not show an instruction for switching the code page; instead it should perform the code page transformation to the characters automatically. A special function of an application (reveal) may show a file in its native form.
A text item needs a set of attributes to be processed correctly:
- code page
- the data itself (see Data Type Key)
However, within a particular installation most of these can be treated as defaults, and so need not be present for every single text string or data base item.
An NLA must make more characters available than are specified in the current «country extended» code pages (see CECP). For example, in Central Europe characters belonging to languages such as Turkish, Greek, or the Slavic languages are also needed. Since switching code pages imposes additional problems (such as keeping track of states), the NLA must aim at a multi-byte code for storage and processing. Reduction to a smaller character set may only take place at presentation time.
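The direction argued above can be sketched as follows; the choice of UTF-8 as the multi-byte storage form and of an ISO 8859-1-only output device are illustrative assumptions, not part of the NLA.

```python
# Sketch: store text in a multi-byte coding (UTF-8 here, as an assumed
# stand-in) that covers all required repertoires; reduce to a smaller
# single-byte set only when presenting on a limited device.
text = "Ştefan meets Łukasz"          # Turkish/Romanian Ş, Polish Ł

stored = text.encode("utf-8")         # multi-byte storage form

# Presentation on a device limited to ISO 8859-1: the reduction is
# lossy, so unavailable characters are replaced (here simply with '?').
presented = stored.decode("utf-8").encode("iso8859-1", errors="replace")
print(presented.decode("iso8859-1"))  # ?tefan meets ?ukasz
```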
There is also a need to specify symbols other than those used in spoken languages. These character set requirements are not limited to specific applications like text formatters. As the workstation approach becomes more common, a consistent way of specifying special symbols in a file will become increasingly necessary.
For data base applications, the value of both the character set and code page must be available as attributes of data fields.
All national characters must be supported. Diacritical symbols themselves are also needed, in particular for documentation purposes (for example to state that an ë is an e with the accent ¨).
Accents on their own are also useful in the character set ROM of some devices to allow the creation of all accented characters with limited coding space (for example according to ISO 6937).
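As a small illustration of the non-spacing approach, the sketch below composes a base letter with a separately stored accent; a Unicode combining mark stands in for the ISO 6937 mechanism.

```python
import unicodedata

# The accent is stored as its own (non-spacing) element and then
# combined with the base letter -- the principle behind ISO 6937,
# modelled here with a Unicode combining mark.
base = "e"
diaeresis = "\u0308"                   # COMBINING DIAERESIS

decomposed = base + diaeresis          # two elements: e + ¨
composed = unicodedata.normalize("NFC", decomposed)

print(composed)                        # ë as one precomposed character
print(len(decomposed), len(composed))  # 2 1
```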
Since many of the above mentioned coding schemes specify control characters, their meaning must also be transformable by the functions of the NLA. The goal must be that a «round trip» of text through various operating systems does not change the meaning.
There are several possible ways of handling this issue. Since the best method of transformation depends on the particular situation, all of the following methods must be available:
- Transparency (no conversion at all)
- Single byte standard one-for-one translation (where the control character receives a different code, but retains its meaning). This will imply a 1-byte by 1-byte translation of control functions.
- n-byte control function (where the control function is changed to do the same function on a different device)
- Special user-defined protocol change (to avoid loss of information)
- Eliminate control (for special or old applications)
- Error sequence or text (indicate impossible function at final presentation)
|Note:||Only methods 1 and 2 can be guaranteed to have «round trip» integrity|
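The methods above can be sketched as a per-control-character dispatch; the control codes, the method names, and the error text used here are illustrative assumptions.

```python
# Sketch of the handling methods listed above, applied to one control
# character at a time (codes and names are hypothetical).
def translate_control(byte: int, method: str, table: dict = None) -> bytes:
    if method == "transparent":       # method 1: no conversion at all
        return bytes([byte])
    if method == "one_for_one":       # method 2: new code, same meaning
        return bytes([table[byte]])   # 1-byte by 1-byte table lookup
    if method == "eliminate":         # method 5: drop the control
        return b""
    if method == "error_text":        # method 6: flag impossible function
        return b"<CTL?>"
    raise ValueError("method not covered in this sketch")

# Example: ASCII LF (0x0A) translated one-for-one to EBCDIC NL (0x15).
print(translate_control(0x0A, "one_for_one", {0x0A: 0x15}))  # b'\x15'
```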
Although an implementation of an NLA may primarily use only one particular coding scheme, provisions must be made to store and process data that is coded differently. Hence the coding scheme is an attribute of textual data.
Code page conversion (translation) services are essential for networking tasks. These facilities are also very important for the exchange of data with other computer systems. Conversion services are required at least for the following coding schemes:
- ISO 646-1983
- ISO 7-bit coded character set for information interchange: None of the accents are present in the International Reference Version.
- ISO 6937-2
- Coded character sets for text communication; Latin alphabetic and non-alphabetic graphic characters: using non-spacing diacritics
- ISO 8859
- A family of 8-bit single byte coded graphic character sets [ISO 8859-x].
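As a sketch of such conversion services, the following uses Python's built-in codecs (a stand-in, not an NLA interface) to move text between two members of the ISO 8859 family.

```python
# Decode with the source code page, then re-encode with the target one;
# the built-in codecs stand in for NLA conversion services.
text = "Grüße"

latin1 = text.encode("iso8859-1")
latin2 = latin1.decode("iso8859-1").encode("iso8859-2")

# ü and ß exist in both character sets, so this particular round trip
# is lossless; in general a target set may lack some characters.
print(latin2.decode("iso8859-2"))  # Grüße
```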
Code pages which combine arbitrary groups of symbols (characters and special symbols, such as PC code pages 437 and 850) must be avoided in future developments. Different code pages should be developed for different purposes (or different character sets). For compatibility with existing code pages, code page switching is required.
The coding scheme chosen for the implementation of the proposed NLA must fulfill the following requirements:
- All symbols from the above mentioned coded character sets must be included.
- A large character set including all characters based on the Latin alphabet and the Greek and Cyrillic alphabets must be included. Although this paper does not speak for other scripts, their requirements must also be satisfied (for example Arabic, Hebrew, and ideographic scripts).
- A rich set of mathematical and other symbols must be defined. This set must be considered open.
- There must be space for user-defined symbols.
- There must be space for future extensions.
Both ISO 10646 and Unicode satisfy these requirements. However, there is a strong feeling in the STWG that ISO 10646 introduces new problems while proposing to solve old ones. The structure of this coding scheme relies too much on obsolete restrictions in communication channels of the early 1960s. Also, the possibility to define subsets and «compressed code» with embedded controls will again introduce ambiguity.
On the other hand there is strong agreement in the STWG on the [Unicode] approach. This scheme is much more compelling than ISO 10646, although it is not supported by an international standards organisation. Notwithstanding the political implications of this issue, from the technical point of view we believe that Unicode is superior to ISO 10646 draft 2.
In the long term we recommend following ISO 10646, but pragmatically, in the short term, Unicode cannot be ignored.
For this section, it is assumed that the period of transition (from the current environment to a fully implemented NLA) will take a long time, and that for its duration we will have to live with single byte code pages which can only hold 256 code points. Although a universal multi-byte coding scheme is on the horizon, the NLA and its implementation cannot wait for it, since the NLA is needed now.
Additional Code Pages
If the NLA sticks to the 1-byte equals 1-character relationship, a multitude of code pages (beyond the 11 Country Extended Code Pages (CECP) on which IBM is standardizing) are required to specify all the characters for languages based on the Latin alphabet even without considering graphics such as mathematical symbols.
If this direction is chosen, then these additional code pages must be defined as soon as possible, because otherwise many installations in Europe will have to make their own assumptions based on existing standards (ISO 8859, Latin Alphabets number 1 to 5, and so on).
User defined Characters and Code Points
The architecture must allow for installation specific character sets and coding. This is necessary because in the past many installations have created their own «special variants» of EBCDIC to solve specific problems. Both Unicode and ISO 10646 provide such capabilities.
Where industry-specific standards exist they should be supported.
Code Page Switching
To use characters from more than one code page (even without the additional characters from non-western languages), a mechanism for switching between single-byte code pages must be part of an NLA.
Switching the code page in the middle of a file is potentially dangerous, because processing may start anywhere in the file. Hence it must become easy to split files into portions of different code pages. These mechanisms must become common to all input functions.
Updating files may require different equipment or software because portions of the file use different coding schemes. The switching also may be required because of the unavailability of certain characters in the code page of the major portion of the file (for example writing a Czech name correctly within English text).
The difficulties in maintaining character integrity can only be avoided with multi-byte coding.
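One way to make such split files safe to process is to tag each portion with its code page; the structure below is a hypothetical illustration, not an existing NLA format.

```python
# Each portion carries its own code page attribute, so processing can
# start at any portion boundary (hypothetical structure; the Czech ř
# is unavailable in cp850, hence the switch to cp852).
portions = [
    ("cp850", "English text with a Czech name: ".encode("cp850")),
    ("cp852", "Dvořák".encode("cp852")),
]

decoded = "".join(data.decode(page) for page, data in portions)
print(decoded)  # English text with a Czech name: Dvořák
```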
Existing Code Pages
Since it is impractical to convert most customers' data to a «universal» code page (such as 500 or 850), the NLA must support the current environment with country specific and country extended code pages (CECP).
To manage the tasks mentioned in this section, the following functions must be provided by an NLA:
- translate text data from one code page to another
- translate coding schemes: This is of particular interest for translation from ISO 6937 (with flying accents) to a single byte coding scheme (and back).
- query device capabilities
- query and set code page of file or data base
- query and set character set of file or data base
Code Page Translation
This is the only function that does not assume the current coding scheme and code page. Parameters of this function are:
- input coding
- The code page in which the input text data is coded.
- output coding
- The code page to which the input text data is to be translated.
- Input: String to be converted to another coding scheme and code page.
- Output: Converted string.
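A minimal sketch of this function, with Python codec names standing in for the code page parameters:

```python
# Parameters mirror the list above: input coding, output coding, and
# the input string; the return value is the converted string.
def translate(data: bytes, input_coding: str, output_coding: str) -> bytes:
    # Decode with the input code page, then encode with the output one.
    return data.decode(input_coding).encode(output_coding)

# Example: EBCDIC code page 500 to PC code page 850.
ebcdic = "Straße".encode("cp500")
pc = translate(ebcdic, "cp500", "cp850")
print(pc.decode("cp850"))  # Straße
```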
Classes of Symbols
For parsing, pattern matching, and other tasks, the properties of symbols in a coded character set must be known. Therefore, for every code page, the relevant information must be specified for the following classes:
- upper case
- lower case
- decimal digit
- hexadecimal digit
- accented variant of base character x
- special symbol
Only default classes can be assigned to code pages. Different applications call for different classifications. Hence this scheme must be extensible to allow for the specification of regular expressions. These definitions are necessary, for example, to support query functions like isalpha in the C programming language. See [POSIX-2] chapter 2, table.
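A sketch of such default classes for one code page, queried in the style of C's isalpha; a real NLA would hold one (extensible) table per code page, and the accent mapping shown is an assumed illustration of the «accented variant of base character x» class.

```python
import string

# Default symbol classes for an assumed ASCII-based code page.
classes = {
    "upper case":        set(string.ascii_uppercase),
    "lower case":        set(string.ascii_lowercase),
    "decimal digit":     set(string.digits),
    "hexadecimal digit": set(string.hexdigits),
}
base_of = {"é": "e", "ë": "e", "â": "a"}   # accented variant -> base

def is_class(ch: str, cls: str) -> bool:
    return ch in classes[cls]

print(is_class("F", "hexadecimal digit"))  # True
print(is_class("G", "hexadecimal digit"))  # False
print(base_of["ë"])                        # e
```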
Case Conversion
Parameters of this function are:
- lower case to upper case
- upper case to lower case
- Input: String to be converted to specific case
- Output: Converted string
|Note:||It turns out that this function can only be achieved correctly if the text data is stored in the «richest possible way» (see Data Type Key). This function should only be used in presentation services, not to modify stored data (for example it may be used for searching and comparisons). Also, characters which do not have the opposite case character available in the code must be handled correctly (for example ÿ -> Y -> ÿ)|
As an example, consider the German word Straße in its upper case form STRASSE. The automatic reversal is possible here, but not for other words like Masse (mass), because there is also another word Maße (measures).
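The Straße example can be checked directly; the sketch below uses Python's built-in case mapping, which exhibits exactly the irreversibility described.

```python
# Upper-casing maps ß to "SS"; lowering the result cannot restore ß,
# so a round trip through upper case loses information.
word = "Straße"
upper = word.upper()

print(upper)          # STRASSE
print(upper.lower())  # strasse -- not the original spelling
```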
Query Device Capabilities
An NLA must specify which device capabilities must be queryable to support the various functions specified. An application must know about the character set and code page of an output device to achieve the necessary translations or fall back procedures.
To deal with input correctly (code page, character set), these device «constants» must be known. Even in one installation, these may be different, because secretaries prefer national keyboards whereas programmers may prefer specialized keyboards.
When input is «spooled» the sensed attributes are those of the spool-file. For output via spool «rerouting» to inappropriate devices must be avoided. For example, output created for an IPDS printer must not be forwarded to a line-printer.
An NLA must provide options for output when a specified character cannot be produced. This task should not be left to an installation or a programmer, because especially in these areas «standards» are needed to reduce ambiguities in the interpretation of the output. Options include:
- replace with an «error graphic» (check character)
- replace with unaccented character
- substitute with a graphic from an alternative set (for example present the Greek β instead of ß)
- replace umlauts with ae, oe, ue in German
As an example, if no umlauts can be produced, it is very common in German to represent the umlauts ä, ö and ü with the two-character forms ae, oe and ue. However, that technique is not necessarily valid for all languages and characters. For example, what is an adequate representation of an â?
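The German fallback can be sketched as a simple substitution table; the mapping is language-specific and, as noted above, would not be adequate for other languages.

```python
# Replace umlauts (and ß) with their customary German two-letter forms
# when the output device cannot produce them; valid for German only.
GERMAN_FALLBACK = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
}

def fall_back(text: str) -> str:
    return "".join(GERMAN_FALLBACK.get(ch, ch) for ch in text)

print(fall_back("Müller grüßt"))  # Mueller gruesst
```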
Translation of ISO 6937 Text
Since all graphics based on the Latin alphabet can be written in the ISO 6937 character set, it provides an unambiguous coding for most left-to-right written languages. Hence it is vital to have a code translation from this set to the single byte code pages and vice versa.
PTT communication services make great use of this character set. Therefore many applications will need such a translation.
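The translation from flying accents to precomposed characters can be sketched as follows. The accent byte values are assumptions based on the ISO 6937 layout (0xC2 for the acute, 0xC8 for the diaeresis), and the function is an illustration, not a full implementation.

```python
import unicodedata

# In ISO 6937 a flying (non-spacing) accent byte precedes its base
# letter; each such pair is combined into one precomposed character.
ACCENTS = {0xC2: "\u0301", 0xC8: "\u0308"}  # acute, diaeresis (assumed)

def from_6937(data: bytes) -> str:
    out, accent = [], None
    for b in data:
        if b in ACCENTS:
            accent = ACCENTS[b]             # remember the flying accent
        elif accent is not None:
            out.append(unicodedata.normalize("NFC", chr(b) + accent))
            accent = None
        else:
            out.append(chr(b))
    return "".join(out)

print(from_6937(b"caf\xc2e"))  # café
```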