Encoding script-specific writing rules (short)

Encoding script-specific writing rules based on the Unicode character set

(short version)

Malek Boualem,  Mark Leisher,  Bill Ogden

Computing Research Laboratory (CRL), New Mexico State University,
Box 30001, Dept 3CRL, Las Cruces, NM 88003, USA

E-mail: {malek,mleisher,ogden}@crl.nmsu.edu



The World Wide Web is now the primary means of information interchange, and most of that information is represented as text. However, programs that create and display these texts generally do not adequately support non-Latin scripts, particularly right-to-left scripts. Unicode, as a universal character set, solves the encoding problems of multilingual texts. It provides abstract character codes but does not offer methods for rendering text on screen or paper. An abstract character such as « ARABIC LETTER BEH », which has the code value U+0628, can have different visual representations (called shapes or glyphs) on screen or paper, depending on context. The scripts covered by Unicode can have different rules for rendering glyphs, composite characters, ligatures, and other script-specific features. In this paper we present a general approach to encoding script-specific rendering rules, based on the Unicode character set and using finite state transducers. The proposed formalism for character classification and writing rules is modular and easy for users to read and modify. The associated program is written in Java, which makes it portable to many environments. The approach is demonstrated with writing rules for some languages that use the Arabic script and a short example that renders certain Hindi words.
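To make the idea of context-dependent glyph selection concrete, the following is a minimal sketch, not the formalism or program described in this paper. It assumes a hypothetical two-entry character table (ARABIC LETTER BEH, U+0628, and ARABIC LETTER ALEF, U+0627), hypothetical joining classes named DUAL, RIGHT and NONE, and the standard Unicode Arabic Presentation Forms-B code points as output glyphs. The class and method names are illustrative only. The loop behaves like a very small transducer: its single bit of state records whether the previous character can join to the one that follows it.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: names and table contents are assumptions,
// not the rule formalism presented in the paper.
final class ContextualShapingSketch {
    // Hypothetical character classes: DUAL joins on both sides,
    // RIGHT joins only to the preceding letter, NONE never joins.
    enum Joining { DUAL, RIGHT, NONE }

    private static final Map<Character, Joining> CLASSES = new HashMap<>();
    // Glyphs per character in the order: isolated, final, initial, medial.
    private static final Map<Character, char[]> FORMS = new HashMap<>();

    private static void define(char c, Joining j, char iso, char fin, char ini, char med) {
        CLASSES.put(c, j);
        FORMS.put(c, new char[] { iso, fin, ini, med });
    }

    static {
        // ARABIC LETTER BEH (U+0628) and its four presentation forms.
        define('\u0628', Joining.DUAL, '\uFE8F', '\uFE90', '\uFE91', '\uFE92');
        // ARABIC LETTER ALEF (U+0627) has only isolated and final forms.
        define('\u0627', Joining.RIGHT, '\uFE8D', '\uFE8E', '\uFE8D', '\uFE8E');
    }

    private static Joining classOf(char c) {
        return CLASSES.getOrDefault(c, Joining.NONE);
    }

    // Maps abstract characters (in logical order) to contextual glyph codes.
    static String shape(String logical) {
        StringBuilder out = new StringBuilder(logical.length());
        boolean prevJoinsForward = false; // transducer state
        for (int i = 0; i < logical.length(); i++) {
            char c = logical.charAt(i);
            Joining cls = classOf(c);
            if (cls == Joining.NONE) {
                out.append(c);            // unknown characters pass through
                prevJoinsForward = false;
                continue;
            }
            boolean joinsPrev = prevJoinsForward;
            boolean joinsNext = cls == Joining.DUAL
                    && i + 1 < logical.length()
                    && classOf(logical.charAt(i + 1)) != Joining.NONE;
            int form = joinsPrev ? (joinsNext ? 3 : 1) : (joinsNext ? 2 : 0);
            out.append(FORMS.get(c)[form]);
            prevJoinsForward = cls == Joining.DUAL;
        }
        return out.toString();
    }
}
```

Under these assumptions, calling `ContextualShapingSketch.shape` on three consecutive BEH characters yields the initial, medial and final forms in that order, while a lone BEH yields the isolated form. A real rule set would cover many more characters, ligatures and composite characters, which is what a declarative, user-editable rule formalism is intended to handle.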