<!-- $Header: /home/abel/cvs/doc/Cyrillic-HOWTO/Cyrillic-HOWTO.sgml,v 1.55 1998/01/22 21:59:39 abel Exp $ -->
<!doctype linuxdoc system>
<article>
<title>The Linux Cyrillic HOWTO
<author> Alexander L. Belikoff, (<tt/abel@bfr.co.il/), Berger
Financial Research Ltd.
<date>v4.0, 23 January 1998
<abstract>
This document describes how to set up your Linux box to typeset, view
and print the documents in the Russian language.
</abstract>
<toc>
<sect>Administrativia
<p>
<sect1>Introduction
<p>
This document covers the things you need to successfully work with
information containing Cyrillic text (mostly Russian) under
Linux. Although this document assumes you are using Linux as an operating
system, most of the information presented is equally applicable to many
other Unix flavors. I shall try to keep the distinction as visible as
possible.
There are a number of popular Linux distributions. As an example
system I describe the RedHat 4.1 Linux (Vanderbilt) - the one I am
personally using. Nevertheless, I shall try to highlight the
differences, if they exist, in other popular distributions, such as
Debian GNU/Linux and Slackware Linux.
Since such a setup directly modifies and extends the operating system,
you should understand what you are doing. Even though I tried to keep
things as easy as possible, having some experience with a given piece
of software is an advantage. I am not going to describe what the X
Window System is, how to typeset documents with TeX and LaTeX,
or how to install a printer in Linux. Those issues are covered in other
documents.
For the same reason, in most cases I describe a system-wide setup, by
default requiring <em/root/ privileges. Still, if there is a
possibility for user-level setup, I'll try to mention it.
<bf/NOTE:/ The X Window System, TeX and other Linux components are complex
systems with a sophisticated configuration. If you do something wrong,
you may not only fail to set up Russian support, but also break the component
itself, if not the entire system. This is not to scare you off, but
merely to make you understand the seriousness of the process and be
careful. Preliminary backup of the config files is <bf/highly/
recommended. Having a guru around is also advantageous.
<sect1>Availability and feedback
<p>
This document is available at <htmlurl
url="http://sunsite.unc.edu/LDP" name="sunsite.unc.edu"> or
<htmlurl url="ftp://tsx-11.mit.edu/pub/linux" name="tsx-11.mit.edu">
as a part of the <em/Linux Documentation Project/. Also, it may be
available at various FTP sites containing Linux. Moreover, it may be
included as a part of a Linux distribution.
If you have any suggestions or corrections regarding this document,
please don't hesitate to contact me at <htmlurl
url="mailto:abel@bfr.co.il" name="abel@bfr.co.il">. Any new
and useful information about Cyrillic support in various Unices is
<em/highly appreciated/. Remember, it will help others.
<sect1>Acknowledgments and copyrights
<p>
Many people helped me (and not only me) with valuable information and
suggestions. Even more people contributed software to the public
community. I am sorry if I forgot to mention somebody.
So, here they go:
<itemize>
<item>Bas V. de Bakker
<item>David Daves
<item>Serge Vakulenko
<item>Sergei O. Naoumov
<item>Winfried Truemper
<item>Ilya K. Orehov
<item>Michael Van Canneyt
<item>Alex Bogdanov
<item>...and the countless helpful people from the
<htmlurl url="news:relcom.fido.ru.unix" name="relcom.fido.ru.unix">
and <htmlurl url="news:relcom.fido.ru.linux" name="relcom.fido.ru.linux">
Usenet newsgroups.
</itemize>
This document is Copyright (C) 1995,1997 by Alexander L. Belikoff. It
may be used and distributed under the usual Linux HOWTO terms
described below.
The following is a Linux HOWTO copyright notice:
<quote>
<it>Unless otherwise stated, Linux HOWTO documents are copyrighted by their
respective authors. Linux HOWTO documents may be reproduced and distributed
in whole or in part, in any medium physical or electronic, as long as
this copyright notice is retained on all copies. Commercial redistribution
is allowed and encouraged; however, the author would like to be notified of
any such distributions.</it>
</quote>
<quote>
<it>All translations, derivative works, or aggregate works incorporating
any Linux HOWTO documents must be covered under this copyright notice.
That is, you may not produce a derivative work from a HOWTO and impose
additional restrictions on its distribution. Exceptions to these rules
may be granted under certain conditions; please contact the Linux HOWTO
coordinator at the address given below.</it>
</quote>
<quote>
<it>In short, we wish to promote dissemination of this information through as
many channels as possible. However, we do wish to retain copyright on the
HOWTO documents, and would like to be notified of any plans to redistribute
the HOWTOs.</it>
</quote>
If you have questions, please contact Tim Bynum, the Linux HOWTO
coordinator, at <htmlurl url="mailto:linux-howto@sunsite.unc.edu"
name="linux-howto@sunsite.unc.edu">. You may finger this address for phone
number and additional contact information.
Unix is a technology trademark of the X/Open Ltd.; MS-DOS, Windows,
Windows 95, and Windows NT are trademarks of the Microsoft Corp.; The
X Window System is a trademark of The X Consortium Inc. Other
trademarks belong to the appropriate holders.
<sect>Theoretical background
<p>
<sect1>Characters and codesets
<p>
In order to understand and print characters of various languages, the
system and software should be able to distinguish them from other
characters. That is, each unique character must have a unique
representation inside the operating system, or the particular software
package. Such a collection of all the unique characters that the system is
able to represent at once is called a <em/codeset/.
At the time most operating systems were created, nobody cared
about software being multilingual. Therefore, the most popular codeset
was (and actually is) <em/ASCII/ (American Standard Code for
Information Interchange).
The <em/standard ASCII/ (aka 7-bit ASCII) comprises 128 unique
codes. Some of them ASCII defines as real printable characters, and
some are so-called <em/control characters/, which had special meanings
in the old communication protocols. Each element of the set is
identified by an integer <em/character code/ (0-127). The subset of
printable characters represents those found on the typewriter's
keyboard with some minor additions. Each character occupies the 7 least
significant bits of a byte, whereas the most significant one was used
for control purposes (say, transmission control in old communication
packages).
The 7-bit ASCII concept was extended by 8-bit ASCII (aka <em/extended
ASCII/). In this codeset, the character codes range from 0 to 255. The
lower half (0-127) is pure ASCII, whereas the upper one (128-255) contains 128
more characters. Since this codeset is backward compatible with
ASCII (a character still occupies 8 bits, and the codes 0-127 correspond to the old
ASCII), it gained wide popularity.
The 8-bit ASCII doesn't define the contents of the upper half of the
codeset. Therefore the ISO organization took the responsibility of
defining a family of standards known as <em/ISO 8859-X/ family. It is
a collection of 8-bit codesets, where the lower half of each codeset
(characters with codes 0-127) matches the ASCII and the upper parts
define characters for various languages. For example, the following
codesets are defined:
<itemize>
<item><tt/8859-1/ - Europe, Latin America (also known as <em/Latin 1/)
<item><tt/8859-2/ - Eastern Europe
<item><tt/8859-5/ - Cyrillic
<item><tt/8859-8/ - Hebrew
</itemize>
In Latin 1, the upper half of the table defines
various characters which are not part of the English alphabet, but are
present in various European languages (German umlauts, French accents
etc).
Another popular extended ASCII implementation is the so-called <em/IBM
codepage/ (named after some computer company that developed this
codeset for its infamous personal computers). This one contains
pseudo-graphic characters in the upper half.
Software that doesn't make any assumptions about the eighth bit of the
ASCII data is called <em/8-bit clean/. Some older programs, designed
with 7-bit ASCII in mind, are not 8-bit clean and may work incorrectly
with your extended ASCII data. Most packages, however, are able to
deal with extended ASCII by default, or require some very basic
setup. <bf/NOTE:/ before posting the question <em>"I did all setup
right, but I cannot enter/view Cyrillic characters!"</em>, please
consult the section <ref id="shells"> for notes on the
program you are using.
For information about making your software 8-bit clean, see section
<ref id="locale-programming">.
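
A quick way to get a feeling for whether some filter is 8-bit clean is to
count the bytes with the eighth bit set before and after the data passes
through it. The following is only a sketch: the function name
<tt/count_8bit/ is made up for this illustration, <tt/cat/ merely stands in
for the program you want to check, and a <tt/tr/ that accepts octal escapes
and byte ranges in the C locale is assumed.

```shell
# count_8bit: count the bytes on standard input that have the eighth bit set.
# (The name count_8bit is hypothetical; it is not a standard utility.)
count_8bit() {
    # Delete every 7-bit byte, then count what is left.
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# Six KOI8-R bytes go in; if fewer than six 8-bit bytes come out,
# the program in the middle (here cat) is not 8-bit clean.
printf '\320\322\311\327\305\324\n' | cat | count_8bit
```

If the number printed is smaller than the number of 8-bit bytes you fed in,
the program in the middle of the pipeline is a good candidate for the setup
described in the following sections.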
Since on most systems a character occupies 8 bits, there is no way to
keep extending ASCII with more and more characters. The way to implement new symbols in
ASCII-based codesets is to create other extended ASCII
implementations. This is the way the Cyrillic ASCII set is
implemented.
We already mentioned the <em/ISO 8859-5/ standard as the one defining the
Cyrillic codeset. But as often happens with standards, this one
was developed without taking into account the real practices in the
former USSR. Therefore, the one thing that standard really achieved was
another degree of confusion. I wouldn't say that <em/ISO 8859-5/ is
widely used anywhere.
Other standards for Cyrillic include the so-called <em/Alt/
codeset and the <em/Microsoft CP1251/ codepage. The former
was developed by (who?) for MS-DOS quite a while ago. Back then, there was
not much buzz yet about internetworking, so the intention was to make
it as compatible as possible with the IBM standard. Therefore the Alt
codeset is effectively the same IBM codepage, where all the specific
European characters in the upper half were replaced with Cyrillic
ones, leaving the pseudographic ones in place. Therefore, it didn't screw up the
text windowing facilities and provided Cyrillic characters as well.
The <em/Alt/ standard is still alive and extremely popular in MS-DOS.
The <em/Microsoft CP1251 codepage/ is just an attempt by Microsoft to come
up with a new standard for the Cyrillic codeset in Windows. As far as I
know, it is not compatible with anything else (not very surprising,
huh?)
And finally there is <em/KOI8-R/. This one is also quite old, but it
was designed wisely, and nowadays its design points look really
useful.
Again, it is compatible with ASCII, and the Cyrillic characters are
located in the upper half. But the main design point of <em/KOI8-R/ is
that the Cyrillic characters' positions must correspond to the English
characters with the same phonetics. Namely, if we set the eighth bit
of the English character <tt/'a'/, we'll get the Cyrillic <tt/'a'/.
This means that, given the Cyrillic text written in KOI8-R, we can
strip the eighth bit of each character <em/and we still get a readable
text, although written with English characters!/ This is very
important now, since there are many mailers on the Internet, that just
strip the eighth bit silently, being sure that every single soul on
the face of the Earth speaks English.
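
The effect of losing the eighth bit is easy to reproduce from the shell.
The following is only a sketch: the function name <tt/strip8/ is made up for
this illustration, and it assumes a <tt/tr/ that accepts octal escapes and
byte ranges in the C locale.

```shell
# strip8: clear the eighth bit of every byte read from standard input.
# (The name strip8 is hypothetical; it is not a standard utility.)
strip8() {
    # Map bytes 0200-0377 onto 0000-0177; LC_ALL=C makes tr work on bytes.
    LC_ALL=C tr '\200-\377' '\000-\177'
}

# The Russian word "privet" (hello) in lowercase, as KOI8-R bytes in octal.
printf '\320\322\311\327\305\324\n' | strip8
```

With the bytes shown, the output is <tt/PRIWET/: still readable, although
lowercase Cyrillic letters come out as uppercase Latin ones (and vice
versa), because KOI8-R stores the two cases the other way round.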
Not surprisingly, <em/KOI8-R/ quickly became a de-facto standard for
Cyrillic on the Internet. <htmlurl url="http://www.nagual.ru/~ache"
name="Andrew A. Chernov"> did a tremendous amount of work to make a
standard in this area. He is the author of <htmlurl
url="file://ds.internic.net/rfc/rfc1489.txt" name="RFC 1489">
(<em/"Registration of a Cyrillic Character Set"/).
The Alt and KOI8-R standards differ only in the positions of the Cyrillic
characters in the table (that is, in the Cyrillic character codes).
The principal difference is that the Alt codeset is used by MS-DOS
users only, whereas KOI8-R is used in Unix, as well as in MS-DOS
(though in the latter KOI8-R is much less popular). Since we are doing
the right thing (namely working in the Unix operating system), we
shall focus mostly on KOI8-R.
As for the ISO standard, it is more popular in Europe and the US as a
standard for Cyrillic. The leader in Russia is definitely KOI8-R.
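
Because the Alt and KOI8-R codesets differ only in where the Cyrillic
characters sit in the table, text can be converted between them
mechanically. The sketch below assumes that your system has the <tt/iconv/
utility, that it knows the charset names <tt/CP866/ and <tt/KOI8-R/, and
that CP866 is an acceptable stand-in for the Alt codeset. The function
names are invented for this example.

```shell
# alt2koi / koi2alt: convert standard input between the two codesets.
# Both function names are hypothetical; iconv does the real work.
alt2koi() { iconv -f CP866 -t KOI8-R; }
koi2alt() { iconv -f KOI8-R -t CP866; }

# "privet" in the Alt codeset (octal bytes), dumped in octal after conversion.
printf '\257\340\250\242\245\342' | alt2koi | od -An -t o1
```

The octal dump makes it easy to compare the result with a KOI8-R table
without depending on the font currently loaded.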
There are other standards, which are different from ASCII and much
more flexible. <em/Unicode/ is the best known. However, they are not
implemented as well as the basic ones in Unix in general and Linux in
particular. Therefore, I am not describing them here.
<sect>Preparing your environment
<p>
Before we start customizing various parts of the system functionality,
we have to set up a couple of basic things. Most of the tools described below
assume that there are Cyrillic fonts available and that the user is able to
input Cyrillic characters. To make this true we have to configure the
environment to provide both fonts and an input facility for Cyrillic.
There are effectively two interface models supported by Linux. One is
the text mode, and the other one is the graphic mode, provided by the
X Window System. Both require different setup, which will be described
below.
<sect1>Text mode setup
<p>
Generally, the text mode setup is the easiest way to show and input
Cyrillic characters. There is one significant complication, however:
the text mode fonts and keyboard layout manipulations depend on
terminal driver implementation. Therefore, there is no portable way to
achieve the goal across different systems.
Right now, I describe the way to deal with the Linux console
driver. Thus, if you have another system, don't expect it to work for
you. Instead, consult your terminal driver manual. Nevertheless, send
me any information you find, so I'll be able to include it in further
versions of this document.
<sect2>Linux Console<label id="linux-console">
<p>
The Linux console driver is quite a flexible piece of software. It is
capable of changing fonts as well as keyboard layouts. To achieve it,
you'll need the <htmlurl
url="http://sunsite.unc.edu/pub/Linux/system/Keyboards/" name="kbd">
package. Both RedHat and Slackware install kbd as part of a system.
The kbd package contains keyboard control utilities as well as a big
collection of fonts and keyboard layouts.
Cyrillic setup with <bf/kbd/ usually involves two things:
<enum>
<item>Screen font setup. This is performed by the
<tt/setfont/ program. The font files are located in
<tt>/usr/lib/kbd/consolefonts</tt>.
<bf/NOTE:/ Never run the <tt/setfont/ program under X because it will hang
your system. This is because it works with low-level video card calls
which X doesn't like.
<item>Load the appropriate keyboard layout with the <tt/loadkeys/
program.
</enum>
NOTE: In RedHat 3.0.3, <tt>/usr/bin/loadkeys</tt> has too restrictive
access permissions, namely 700 (<tt/rwx------/). There are no reasons
for that, since everyone may compile his own copy and execute it (the
appropriate system calls are not root-only). Thus, just ask your
sysadmin to set more reasonable permissions for it (for example, 755).
The following is an excerpt from my <tt/cyrload/ script, which sets
up the Cyrillic mode for Linux console:
<verb>
if [ -n "$DISPLAY" ]; then
    echo "`basename $0`: cannot run under X"
    exit 1
fi
loadkeys /usr/lib/kbd/keytables/ru.map          # Russian keyboard layout
setfont /usr/lib/kbd/consolefonts/Cyr_a8x16     # font in the Alt codeset
mapscrn /usr/lib/kbd/consoletrans/koi2alt       # KOI8-R to Alt screen map
echo -ne "\033(K"       # the magic sequence
echo "Use the right Ctrl key to switch the mode..."
</verb>
Let me explain it a bit. You load the appropriate keyboard
mapping. Then you load a font corresponding to the <em/Alt/
codeset. Then, in order to be able to display text in <em/KOI8-R/
correctly, you load a <it/screen translation table/. What it does is
translate <em/some/ characters from the upper half of the codeset
to the <em/Alt/ encoding. The word 'some' is crucial here - not all
characters get translated, therefore some of them, like the IBM
pseudographic characters, reach the screen unmodified and display
correctly, since they are compatible with the <em/Alt/ codeset, as
opposed to <em/KOI8-R/. To verify this, run <bf/mc/ and pretend you
are back to MS-DOS 3.3...
Finally, the magic sequence is important but I have no idea what on
Earth it does. I stole/borrowed/learned it from the German HOWTO back
in 1994, when it was just about the only national language oriented
HOWTO. <em/If you have any idea about this magic sequence, please tell
me/.
For those purists who don't want to give the <em/Alt/
codeset a chance, I'm attaching yet another version of the script
above, using native <em/KOI8-R/ fonts.
<verb>
if [ -n "$DISPLAY" ]; then
    echo "`basename $0`: cannot run under X"
    exit 1
fi
loadkeys /usr/lib/kbd/keytables/ru.map          # Russian keyboard layout
setfont /usr/lib/kbd/consolefonts/koi-8x16      # native KOI8-R font
echo "Use the right Ctrl key to switch the mode..."
</verb>
However, don't expect nice borders in your text mode-based windowing
applications.
Now you probably want to test it. Do the appropriate bash or tcsh
setup, rerun the script, then press the right <tt/Control/ key and make sure
you are getting the Cyrillic characters right. The '<tt/q/' key must
produce the Russian "<tt/short i/" character, '<tt/w/' generates
"<tt/ts/", etc.
If you've screwed something up, the very best thing to do is to reset
to the original (that is, US) settings. Execute the following
commands:
<verb>
loadkeys /usr/lib/kbd/keytables/defkeymap.map
setfont /usr/lib/kbd/consolefonts/default8x16
</verb>
<bf/NOTE:/ unfortunately enough, the console driver is not able to
preserve its state (at least easily enough) while running the X
Window System. Therefore, after you leave X (or switch from it to
a console), you have to reload the console Russian font.
<sect2>FreeBSD Console
<p>
I am not using FreeBSD so I couldn't test the following information.
All data in this section should be treated as just pointers to begin
with. <htmlurl url="http://www.freebsd.org" name="The FreeBSD project
homepage"> may have some information on the subject. Another good
source is the <htmlurl url="news:relcom.fido.ru.unix"
name="relcom.fido.ru.unix"> newsgroup. Also, check the resources
listed in section <ref id="resources">.
Anyway, this is what <htmlurl url="mailto:elias@artx.ru" name="Ilya
K. Orehov"> suggests to do in order to make FreeBSD console speak
Russian:
<enum>
<item>In <tt>/etc/sysconfig</tt> add:
<verb>
keymap=ru.koi8-r
keyrate=fast
# NOTE: '^[' below is a single control character
keychange="61 ^[[K"
cursor=destructive
scrnmap=koi8-r2cp866
font8x16=cp866b-8x16
font8x14=cp866-8x14
font8x8=cp866-8x8
</verb>
<item>In <tt>/etc/csh.login</tt>:
<verb>
setenv ENABLE_STARTUP_LOCALE
setenv LANG ru_SU.KOI8-R
setenv LESSCHARSET latin1
</verb>
<item>Make analogous changes in <tt>/etc/profile</tt>
</enum>
<sect1>The X Window System
<p>
Like the console mode, the X environment also requires some
setup. This involves setting up the input mode and the X fonts. Both
are discussed below.
<sect2>The X fonts.<label id="xfonts">
<p>
First of all, you have to obtain the fonts having the
Cyrillic glyphs at the appropriate positions.
If you are using the most recent X (or XFree86) distribution, chances
are that you already have such fonts. In late 1995, the X Window
System incorporated a set of Cyrillic fonts, created by <htmlurl
url="http://www.cronyx.ru" name="Cronyx">. Ask your system
administrator, or, if <em/you/ are the one, check your system, namely:
<enum>
<item>Run '<tt/xlsfonts | grep koi8/'. If there are fonts listed, your
X server is already aware of the fonts.
<item>Otherwise, run
<verb>
find / -name 'crox*.pcf*'
</verb>
to find the location of the Cyrillic fonts in the system. You'll have
to <tt/enable/ those fonts to the X server, as I explain below.
</enum>
If you haven't found such fonts installed, you'll have to do it
yourself.
There is some ambiguity with the fonts. XFree86 docs claim that the
Russian fonts collection included in the distribution was developed by
Cronyx. Nevertheless, you may find another set of Cronyx Cyrillic
fonts on the net (eg. on <htmlurl
url="ftp://ftp.kiae.su/cyrillic/x11/fonts/xrus-2.1.1-src.tgz"
name="ftp.kiae.su">), known as the <bf/xrus/ package (don't confuse it
with the <tt/xrus/ program, which is used to set up a Cyrillic keyboard
layout. Fortunately, the latter was renamed to <bf/xruskb/
recently). <bf/Xrus/ has fewer fonts than the collection in XFree86
(38 vs 68), but the latter didn't get along with my <ref
id="netscape" name="Netscape"> setup - it gave me some really huge
font in the menubar. The <bf/xrus/ package doesn't have this problem.
I would suggest that you download and try both of them. Pick the one
you like more. Also, I'm going to create RPM packages soon for
both collections and upload them to <htmlurl
url="ftp://ftp.redhat.com/pub/contrib/i386/" name="ftp.redhat.com">.
There is also older stuff, for example the <bf/vakufonts/ package,
created by <htmlurl url="mailto:vak@cronyx.ru" name="Serge Vakulenko">,
which was the base for the one in the X distribution. There are also a
number of others. The important point is that the font names in the
old collection did not strictly conform to the standard. This
is fine in general, but sometimes it may cause various weird
errors. For example, I had a bad experience with Maple V for Linux,
which crashed mysteriously with the <bf/vakufonts/ package, but ran
smoothly with the "standard" ones.
So, let's start with the fonts:
<enum>
<item>Download the appropriate fonts collection. The package for
XFree86 may be found at any FTP site, containing the X distribution,
for example, directly from the <htmlurl url="http://www.xfree86.org"
name="XFree86 FTP site">. The <bf/xrus/ package may be found on
<htmlurl url="ftp://ftp.kiae.su/cyrillic/x11/fonts/xrus-2.1.1-src.tgz"
name="ftp.kiae.su">
<item>Now that you have the fonts, create a directory for
them. It is generally a bad idea to put new fonts into an already
existing font directory. So, place them in, say,
<tt>/usr/lib/X11/fonts/cyrillic</tt> for a system-wide setup, or just
create a private directory for personal use.
<item>If the new fonts are in BDF format (<tt/*.bdf/ files), you have to
compile them. For each font do:
<verb>
bdftopcf -o <font>.pcf <font>.bdf
</verb>
If your server supports compressed fonts, do it, using the
<em/compress/ program:
<verb>
compress *.pcf
</verb>
Also, if you do want to put the new fonts into an already existing font
directory, you have to concatenate the old and the new files named
<tt/fonts.alias/ in case both of them exist.
<item>Each font directory in the X must contain a list of fonts in it. This
list is stored in the file <tt/fonts.dir/. You don't have to create this
list manually. Instead, do:
<verb>
cd <new font directory>
mkfontdir .
</verb>
<item>Now you have to make this font directory known to the X
server. Here, you have a number of options:
<itemize>
<item>System-wide setup for XFree86. If you are running this version of
X, then append the new directory to the list of directories in the
file <tt/XF86Config/. To find the location of this file, see output of
<tt/startx/. Also, see <bf>XF86Config(4/5)</bf> for details.
<item>System-wide setup through <tt/xinit/. Add the new directory to
the <tt/xinit/ startup file. See <bf/xinit(1x)/ and the next option
for details.
<item>Personal setup. You have a special start-up file for the X -
<tt>~/.xinitrc</tt> (or <tt>~/.Xclients</tt>, or <tt>~/.xsession</tt>
for the RedHat users). Add the following commands to it:
</itemize>
<verb>
xset +fp <new font directory>
xset fp rehash
</verb>
It is important to note that '<tt/+fp/' means that the new fonts will
be added to the head of the font path list. That is, if an application
requests say a <tt/fixed/ font, it'll be given the one with Cyrillic
characters, which is definitely what we are trying to achieve.
There are problems, though. The <tt/fixed/ font in the Cyrillic fonts
distribution doesn't have its bold and italic counterparts. My font
of choice is <tt/6x13/, so, since it also lacks bold and italic
typefaces, I cannot use Emacs/XEmacs faces in their full
glory. Hopefully somebody will ultimately create those fonts and the
situation will change.
<item>Now restart your X. If you have done everything right, the tests
in the beginning of the section will be successful. Also, play with
<bf/xfontsel(1x)/ to make sure you are able to select the cyrillic fonts.
</enum>
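
The font installation steps above can be collected in a small script. The
sketch below only prints the commands it would run, so it is safe to try.
The function name <tt/plan_font_install/ is made up here, the directory is
whatever you chose for the new fonts, and directory names without spaces
are assumed.

```shell
# plan_font_install DIR: print the commands that would compile every BDF
# font in DIR, rebuild fonts.dir and add DIR to the head of the font path.
# (Hypothetical helper; it prints the commands instead of running them.)
plan_font_install() {
    dir=$1
    for bdf in "$dir"/*.bdf; do
        [ -e "$bdf" ] || continue       # no BDF files: nothing to compile
        echo "bdftopcf -o ${bdf%.bdf}.pcf $bdf"
    done
    echo "mkfontdir $dir"
    echo "xset +fp $dir"
    echo "xset fp rehash"
}

plan_font_install /usr/lib/X11/fonts/cyrillic
```

Once the printed list looks right, you can pipe it to <tt/sh/ or type the
commands by hand. Remember that the <tt/xset/ calls affect only the running
X server, so they still belong in your X start-up file.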
In order to make the X clients use the Cyrillic fonts, you have to set
up the appropriate X resources. For example, I make the russian font
the default one in my <tt>~/.Xdefaults</tt>:
<verb>
*font: 6x13
</verb>
Since my Cyrillic fonts are first in the font path (see the output of
'<tt/xset q/'), the font above is taken from the "cyrillic" directory.
This is just a simple case. If you want to set a particular part of
an X client to a Cyrillic font, you have to figure out the name of
the resource (eg. using <bf/editres(1x)/) and specify it either in
the resource database, or on the command line. Here are some examples:
<verb>
$ xterm -font '-cronyx-*-bold-*-*-*-19-*-*-*-*-*-*-*'
</verb>
...will run xterm with some ugly font; and
<verb>
$ xfontsel -xrm '*quitButton.font: -*-times-*-*-*-*-13-*-*-*-*-*-koi8-*'
</verb>
...will set a Cyrillic Times font for the <bf/Quit/ button in
<tt/xfontsel/.
<sect2>The input translation
<p>
In the newest X releases (X11R6.1 and higher) there are two "standard"
input methods: the original one, working through the <bf/xmodmap/
utility, and the new one called <em/Xkb/ (X KeyBoard). The very first
thing you have to do is <bf/to disable the Xkb method!/ Don't get
charmed by its ability to set up a "russian keyboard". It looks like
this method is using the Cyrillic keysyms defined in
<tt/keysymdef.h/. This file defines keysyms for many languages. The
only problem is that those definitions have nothing to do with the
extended ASCII codeset - the only one most programs are able to
operate with! I hardly know of any programs able to grok the
<tt/keysymdef.h/ keysyms that differ from 8-bit ASCII. Our goal, however,
is to get the KOI8-R support to work.
To disable the <tt/Xkb/ support, browse through the <tt/Keyboard/
section of your <tt/XF86Config/ file and comment all lines starting
with <em/Xkb/ (case doesn't matter). Instead, put the following line:
<verb>
XkbDisable
</verb>
The <tt/xmodmap/ program allows customization of the codes emitted by
various keys and their combinations. It sets things up based
on a file containing the translation table.
In the previous versions of this document I used to describe the
<tt/xmodmap/-based setup in great detail. This proved to be almost
useless. The <tt/xmodmap/-based input translation method is well known
for being non-portable, inflexible, and incomplete. Your
configuration may work with one XFree86 version and fail with a
different one. Even worse, sometimes things differ across different
servers in the same distribution.
I strongly suggest that you not play with <tt/xmodmap/, at least
for now. Apart from headache and disappointment you'll gain nothing.
Instead, I recommend installing the <htmlurl
url="ftp://ftp.relcom.ru/pub/x11/cyrillic/" name="xruskb"> package,
which allows you to configure most of the input translation parameters
without having to know about <tt/xmodmap/. Again, the RedHat Linux
users are free to download and install an <htmlurl
url="ftp://ftp.redhat.com/pub/contrib/i386/xruskb-1.5.1-1.i386.rpm"
name="RPM"> package.
<sect1>First steps - Cyrillic in shells<label id="shells">
<p>
<sect1>bash
<p>
Three variables should be set in order to make <tt/bash/ understand
8-bit characters. The best place is the <tt>~/.inputrc</tt>
file. The following should be set:
<verb>
set meta-flag on
set convert-meta off
set output-meta on
</verb>
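
If you want to apply these settings from a script, something like the
following may do. It is a sketch only: the function name
<tt/add_inputrc_settings/ is invented, and it simply appends each line to the
given file when that exact line is not already there.

```shell
# add_inputrc_settings FILE: append the three readline settings to FILE,
# skipping any line that is already present. (Hypothetical helper.)
add_inputrc_settings() {
    file=$1
    touch "$file"
    for line in 'set meta-flag on' 'set convert-meta off' 'set output-meta on'; do
        # -x: match the whole line, -F: fixed string, -q: no output
        grep -qxF "$line" "$file" || echo "$line" >> "$file"
    done
}
```

Call it as <tt>add_inputrc_settings ~/.inputrc</tt>. Running it twice is
harmless, since lines already present are not added again. The new settings
take effect in shells started afterwards.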
<sect1>csh/tcsh<label id="csh">
<p>
The following should be set in <tt/.cshrc/:
<verb>
setenv LC_CTYPE iso_8859_5
stty pass8
</verb>
If you don't have the POSIX <tt/stty/ (impossible for Linux), then
replace the last call with the following:
<verb>
stty -istrip cs8
</verb>
<sect1>ksh
<p>
As for the public domain <tt/ksh/ implementation - <tt/pdksh 5.1.3/,
you can input 8 bit characters only in <tt/vi/ input mode. Use:
<verb>
set -o vi
</verb>
<sect1>less
<p>
So far, <tt/less/ doesn't support the KOI8-R character set, but the
following environment variable will do the job:
<verb>
LESSCHARSET=latin1; export LESSCHARSET
</verb>
<sect1>mc (The Midnight Commander)
<p>
To display Cyrillic text correctly, select the <em/full 8 bits/ item
in the <bf>Options/Display</bf> menu.
If your problem is the ugly window borders, consult the <ref
id="linux-console"> section.
As an aside, if you want to make <bf/mc/ use color in an
<tt/Xterm/ window, set the variable <tt/COLORTERM/:
<verb>
COLORTERM= ; export COLORTERM
</verb>
<sect1>rlogin
<p>
Make sure that the shell on the destination site is properly set
up. Then, if your <tt/rlogin/ doesn't work by default, use '<tt/rlogin
-8/'.
<sect1>zsh
<p>
Use the same way as with <tt/csh/ (see section <ref id="csh"
name="csh">). The startup files in this case are <tt/.zshrc/ or
<tt>/etc/zshrc</tt>.
<sect>Editing text
<p>
In this section I'll describe how to customize various text editors to
work with Cyrillic text. This doesn't cover the <em/word processors/,
which will be described later (see section <ref id="word-processing">).
<sect1>Emacs and XEmacs<label id="emacs">
<p>
There are two versions of the Emacs editor - <bf/GNU Emacs/ and
<bf/XEmacs/. While they provide more or less the same functionality, some
implementation details are significantly different. Cyrillic setup
requires some low-level (in the Emacs Lisp sense) tweaking, and it differs
a bit for those two versions.
<bf/NOTE:/ Apart from the setup described here, there is an
alternative way to configure both versions of emacs - use <bf/MULE/
(MULtilanguage Emacs support). The latter way is fairly complicated
and (to the best of my knowledge) rarely used, so I don't discuss it
here.
The minimal cyrillic support in <bf/GNU emacs/ (you don't have to do
it for the <bf/XEmacs/) is done by adding the following calls to one's
<tt/.emacs/ (provided that the Cyrillic character set support is
installed for console or X respectively):
<verb>
(standard-display-european t)
(set-input-mode (car (current-input-mode))
(nth 1 (current-input-mode))
0)
</verb>
This allows the user to view and input documents in Russian.
However, it isn't enough. Emacs doesn't know yet that Cyrillic
characters may constitute a word, let alone the upper/lower case
conversion rules. In order to teach Emacs to do that, you have to
modify the syntax and case tables of Emacs:
<verb>
(require 'case-table)
(let* ((ruc "\341\342\367\347\344\345\263\366\372\351\352\353\354\355\356\357\360\362\363\364\365\346\350\343\376\373\375\370\371\377\374\340\361")
       (rlc "\301\302\327\307\304\305\243\326\332\311\312\313\314\315\316\317\320\322\323\324\325\306\310\303\336\333\335\330\331\337\334\300\321")
       (i 0)
       (len (length ruc)))
  (while (< i len)
    (modify-syntax-entry (elt ruc i) "w ")
    (modify-syntax-entry (elt rlc i) "w ")
    (set-case-syntax-pair (elt ruc i) (elt rlc i) (standard-case-table))
    (setq i (+ i 1))))
</verb>
For this purpose I created a <tt/rusup.el/ file which does this, and
provides a couple of handy functions as well. You have to load it in your
<tt>~/.emacs</tt>.
Finally, the <url url="http://www.math.uga.edu/~valery/russian.el"
name="russian.el"> package by Valery Alexeev
(<tt/valery@math.uga.edu/) allows the user to switch between cyrillic
and regular input mode and to translate the contents of a buffer from
one Cyrillic coding standard to another (which is especially useful
while reading the texts imported from MS-DOS or Windows).
<sect1>Using vi
<p>
The <bf/vi/ editor (at least its clone <bf/vim/, available in most
Linux distributions) is aware of 8-bit characters. It will allow you
to enter Cyrillic characters and will be able to recognize the word
boundaries correctly. I don't know about the upper-/lower-case
conversion rules, since I don't use <bf/vi/ much. <em/If you know
something about it, please inform me/.
<sect1>Editing text with joe
<p>
<bf/Joe/ requires a special <tt/-asis/ option to recognize 8-bit
characters. You may either specify this option on the command line or
put it in the <tt>~/.joerc</tt> file for personal use (or in
<tt>/usr/lib/joerc</tt> for a system-wide setup).
If your program doesn't understand the <tt/-asis/ option, you have to
upgrade to a newer version.
However, <bf/joe/ doesn't seem to recognize Cyrillic word
boundaries correctly. I assume that the same applies to the case
conversion rules.
<sect1>Spell-checking Russian
<p>
The program I use to spell-check text is the <bf/GNU ispell/. It is
very flexible and extensible, so it is possible to use it to
spell-check text in languages, other than English, by adding new
<em/spell dictionaries/.
Constantine Knizhnik has created a very good Russian dictionary for
<bf/ispell/. You may find it at his <htmlurl
url="http://www.ispras.ru/~knizhnik" name="homepage">. The
distribution includes a handy incremental spelling script for
<bf/emacs/.
Ideally, if you already have an <bf/ispell/ properly installed, you
have to just step into the newly-created directory and generate the
dictionary, using the commands provided in the <tt/Makefile/. However,
chances are quite high, that you'll see a lot of complaints about the
<bf/ispell/'s unawareness of the 8-bit data. This is because in most
distributions, <bf/ispell/ is compiled without 8-bit data support. In
this case, you cannot avoid recompiling the <bf/ispell/ package.
Again, RedHat users will be delighted to know that I've rebuilt the
<bf/ispell/ package with both Russian and German dictionaries. As
usual, you may grab it from the <htmlurl
url="ftp://ftp.redhat.com/pub/contrib/i386/ispell-3.1.20-6.i386.rpm"
name="RedHat FTP site">.
Once you have everything installed, you may invoke Russian
spell-check, by supplying <tt/'-d russian'/ option to <bf/ispell/.
Now, if you use <bf/Emacs/, you may want to add a menu item for a
Russian dictionary. I sent a proposed menu entry to the <tt/ispell.el/
maintainer and he kindly agreed to include it in the next public
release of the file. Meanwhile, you may do it by adding the following
code to your <tt>~/.emacs</tt> (or to
<tt>/usr/share/emacs/site-lisp/site-start.el</tt> for a system-wide
setup):
<verb>
(setq ispell-dictionary-alist
      (append ispell-dictionary-alist
              '(("russian"
"[\341\342\367\347\344\345\263\366\372\351\352\353\354\355\356\357\360\362\363\364\365\346\350\343\376\373\375\370\371\377\374\340\361\301\302\327\307\304\305\243\326\332\311\312\313\314\315\316\317\320\322\323\324\325\306\310\303\336\333\335\330\331\337\334\300\321]"
"[^\341\342\367\347\344\345\263\366\372\351\352\353\354\355\356\357\360\362\363\364\365\346\350\343\376\373\375\370\371\377\374\340\361\301\302\327\307\304\305\243\326\332\311\312\313\314\315\316\317\320\322\323\324\325\306\310\303\336\333\335\330\331\337\334\300\321]"
                 "[']" t ("-C" "-d" "russian") "~latin1"))))
(define-key-after ispell-menu-map [ispell-select-russian]
  '("Select Russian (KOI-8)" . (lambda ()
                                 (interactive)
                                 (ispell-change-dictionary "russian")))
  'british)
</verb>
Unfortunately, it won't work for the <bf/XEmacs/. I'll try to solve
this problem later.
<sect>Using Cyrillic with mail and news
<p>
Setting up your mail and news software to recognize Cyrillic text is
not very difficult, although you need some knowledge of the
principles by which mail and news work.
Internet electronic mail software generally consists of two parts:
<bf/MUA/ (Mail User Agent) and <bf/MTA/ (Mail Transfer Agent). The MUA is
the program you use to read, compose, and send mail. However, the MUA
doesn't transfer mail messages by itself. Instead, it calls the MTA,
which is responsible for sending the message to the appropriate
destination using an appropriate protocol. For example, your MUA may
be <bf/Pine/ and your MTA <bf/qmail/.
Until quite recently, neither MTAs nor MUAs were 8-bit clean by
default. Therefore, whenever you sent a message from, say, America to
Russia, you could never be sure that some intermediate MTA wouldn't strip
the 8th bit from each character of your message. To deal with this, a set of
protocols was developed which allows encoding various kinds of data
using only printable characters from 7-bit ASCII. This family of
protocols is called <bf/MIME/ (Multipurpose Internet Mail Extensions).
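The difference between stripping the 8th bit and encoding the data can
be seen in a few lines of Python. This is only an illustrative sketch:
the sample word is chosen arbitrarily, and base64 and quoted-printable
are used here as two examples of the transfer encodings that MIME
defines.

```python
import base64
import quopri

text = "Привет"                  # sample Russian text, chosen for illustration
raw = text.encode("koi8_r")      # 8-bit KOI8-R bytes

# What a non-8-bit-clean MTA would do: drop the top bit of every byte.
stripped = bytes(b % 128 for b in raw)
print(stripped.decode("ascii"))

# Two MIME transfer encodings that use only 7-bit printable characters.
b64 = base64.b64encode(raw)
qp = quopri.encodestring(raw)
print(b64.decode("ascii"))
print(qp.decode("ascii"))

# Both encodings are reversible, unlike stripping the 8th bit.
restored_b64 = base64.b64decode(b64)
restored_qp = quopri.decodestring(qp)
print(restored_b64.decode("koi8_r"), restored_qp.decode("koi8_r"))
```

The stripped text is no longer valid KOI8-R and cannot be restored
reliably, while both encoded forms decode back to the original bytes.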
Since MIME is usually pre-configured to reasonable defaults, we won't
describe it here. We will talk more about MIME when we provide a
backward compatibility with other Cyrillic encodings (section <ref
id="mime">).
Meanwhile, we start with the MUA setup, because it is usually up to the
end user. Then, we will describe the basic principles of the MTA
configuration for Cyrillic.
<sect1>Setting up Mail User Agents
<p>
<sect2>Emacs-based mail readers
<p>
Basically, you don't need any special setup for Emacs-based readers,
given that you've already configured Emacs itself (see section
<ref id="emacs">).
<sect2>pine
<p>
Set the following directive in <tt>~/.pinerc</tt> for personal
configuration, or in <tt>/usr/lib/pine.conf</tt> for a global one:
<verb>
character-set=ISO-8859-5
</verb>
<sect1>Configuring your MTA
<p>
There are a number of MTAs available now. These include <bf/sendmail/,
<bf/qmail/, <bf/smail/, <bf/exim/, and others.
<sect2>sendmail
<p>
So far, <bf/sendmail/ is much more popular than other MTAs, because of
its long history and widespread use. Personally, I hate this program
- it is a perfect example of a completely moronic design and even its
"improvements" with the passage of time show that this approach is
not going to cease. Any system administrator shudders when he hears
the ominous "<tt/sendmail.cf/" name...
As of now, <bf/sendmail/ doesn't strip the 8th bit anymore. However,
it may <em/encode/ the 8-bit data using a special <em/base64/
encoding. Although most MUAs are supposed to recognize it and decode
it back to regular data, you may want to start by sending raw
8-bit text to make sure everything works.
As of version 8, <bf/sendmail/ handles 8-bit data correctly by
default. If it doesn't do it for you, check the <tt/EightBitMode/
option and option <tt/7/ given to mailers in your
<tt>/etc/sendmail.cf</tt>. See <em/"Sendmail. Operation and
Installation Guide"/ for details.
<sect2>Other MTAs
<p>
I don't know much about other MTAs. If you know something, which may
be important for Cyrillic setup, please inform me.
<sect>Browsing the Cyrillic Web
<p>
Unlike e-mail and news, there is no definitive standard for Cyrillic
encoding for the Web. This is primarily because Microsoft offers Web
authoring tools, which only allow <em/cp1251/ codeset for Cyrillic,
completely ignoring the fact that any other standards may already
exist.
The setup described here is very basic. It will allow you to view
pages in the <em/KOI8-R/ codeset. If the situation improves, I'll add
more information.
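To see why the codeset matters, consider what happens when a page
written in <em/cp1251/ is shown by a browser set up for <em/KOI8-R/.
The following Python sketch is only an illustration with an arbitrary
sample word; it is not something the browsers described below do for
you.

```python
# A page authored in cp1251 but displayed as KOI8-R turns into nonsense.
page_bytes = "Привет".encode("cp1251")   # bytes as a cp1251 page would send them

misread = page_bytes.decode("koi8_r")    # what a KOI8-R-only setup shows
print(misread)

# Recoding: interpret the bytes in their real codeset, then write KOI8-R.
recoded = page_bytes.decode("cp1251").encode("koi8_r")
print(recoded.decode("koi8_r"))
```

The first line printed is still Cyrillic, but it is a different,
meaningless string; the second one is the original word.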
<sect1>lynx
<p>
As of version 2.6, you may select the appropriate encoding for the
<tt/display Character set/ option.
<sect1>Netscape Navigator<label id="netscape">
<p>
Make sure you are using <tt/Netscape/ version higher than 3. If your
<tt/Netscape/ is older, download a new one from <htmlurl
url="http://www.netscape.com" name="www.netscape.com">.
<sect2>Basic setup
<p>
To be able to see Cyrillic text in most parts of the HTML document, do
the following:
<itemize>
<item>In menu <bf>Options/Document Encoding</bf> select
<bf/Cyrillic(KOI-8)/.
<item>In menu <bf>Options/General Preferences/Fonts</bf> select
<bf/Cyrillic (KOI-8)/ encoding, <bf/Times(Cronyx)/ as a proportional
font and <bf/Courier(Cronyx)/ as a fixed one.
<item>Save the options.
</itemize>
<bf/NOTE:/ This setup will work with most parts of the
document. However, you won't be able to display Cyrillic text in the
window header, menus and some controls. Attempts to fix this follow.
<sect2>Cyrillic text in frames and input areas
<p>
To fix this, it is usually enough to:
<enum>
<item>Copy the Netscape properties database (usually <tt/Netscape.ad/)
to <tt>~/Netscape</tt>.
<item>In the latter file, set the following property:
<verb>
*documentFonts.charset*iso8859-1: koi8-r
</verb>
</enum>
This will force all frame and input elements to use the fonts with
<em/koi8-r/ encoding instead of the default ones, therefore you have
to make sure you have installed such fonts (see section <ref
id="xfonts">).
The bad news about the trick above is that if you load a document
which is supposed to be displayed in <tt/iso-8859-1/ fonts, it will be
displayed using the <tt/koi8/ fonts instead. Sometimes such documents
will look worse.
<sect2>Advanced setup
<p>
Andrew A. Chernov is the one who knows more than others about KOI-8
in general and Netscape in particular. Visit his excellent <htmlurl
url="http://www.nagual.ru/~ache/koi8.html" name="KOI-8 page"> and
download a patch for the Netscape resource file, making Netscape speak
Russian as much as it is able to.
<sect>Cyrillic wordprocessing<label id="word-processors">
<p>
<sect1>TeX-based environments<label id="tex">
<p>
In this section I'll describe several ways to make TeX and LaTeX
typeset Cyrillic texts. They differ in setup
sophistication and usage convenience. For example, one possibility is
to start without any preliminary setup and use the <em/Washington
AMSTeX Cyrillic fonts/. On the other hand, you may install a LaTeX
package providing a very high degree of Cyrillic support. I have
experience with two such packages. One is the <tt/cmcyralt/ package by
Vadim V. Zhytnikov (<tt/vvzhy@phy.ncu.edu.tw/) and Alexander Harin
(<tt/harin@lourie.und.ac.za/), and the other one is the <tt/LH/
package by the <em/CyrTUG/ group with styles and hyphenation for
LaTeX2e by Sergei O. Naoumov (<tt/serge@astro.unc.edu/). I'll describe
both.
Note that there are two versions of LaTeX available - 2.09 is the old
one, while 2e is a new pre-3.0 release. If you are using LaTeX 2.09,
then switch quickly to 2e. The latter retains compatibility with
the old one, but has many more features. Hopefully, version 3 will be
released soon. I describe a LaTeX 2e setup.
Also, both of these packages require the Cyrillic text to be typeset
using the <em/Alt/ codeset, not <em/KOI8-R/! This is caused by
historical reasons, since the creators of these packages used to work
with <tt/EmTeX/ - the MS-DOG version of TeX (they didn't know about
Linux yet :-). Switching to <em/KOI8-R/ requires some effort and is
expected to be done soon. So far, use some utility to convert
your Russian text from <em/KOI8-R/ to <em/Alt/. See section <ref
id="user-tools">.
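If you have no such utility at hand, the conversion itself is simple.
Here is a minimal Python sketch, under the assumption that the
<em/Alt/ codeset corresponds to what Python calls <tt/cp866/; the
function name is made up for this example.

```python
def koi8r_to_alt(data):
    """Recode KOI8-R bytes to the Alt codeset (assumed here to be cp866).

    Characters that have no counterpart in cp866 are replaced by '?'.
    """
    return data.decode("koi8_r").encode("cp866", errors="replace")


sample = "Привет, мир".encode("koi8_r")   # sample text, chosen for illustration
converted = koi8r_to_alt(sample)
print(converted.decode("cp866"))
```

Plain ASCII characters, including TeX commands, pass through unchanged,
so the same function can be applied to a whole LaTeX source file read
in binary mode.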
<sect2>Using the Washington Cyrillic
<p>
This package was created for the American Mathematical Society to
provide documents with Russian references. Therefore, the authors were
not very careful and the fonts look quite clumsy. This package is
usually referred to as a <tt/"really bad cyrillic package for TeX"/.
Nevertheless, we'll discuss it, because it is very easy to use and
doesn't require any setup: this collection is supplied with most
TeX distributions.
Of course, you won't be able to use such luxury as automatic
hyphenation, but anyway...
1. Prepend your document with the following directives:
<verb>
\input cyracc.def
\font\tencyr=wncyr10
\def\cyr{\tencyr\cyracc}
</verb>
2. Now to type a cyrillic letter, you enter
<verb>
\cyr
</verb>
and use a corresponding latin letter or a TeX command. Thus, the lower
case of the Russian alphabet is expressed by the following codes:
<verb>
a b v g d e \"e zh z i {\u i} k l m n o p r s t u f kh c ch sh shch
{\cprime} y {\cdprime} \`e yu ya
</verb>
It is extremely inconvenient to convert your Russian texts to such
encoding, but you can automate the process. The translit program
(section <ref id="user-tools">) supports a TeX output option.
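To give an idea of what such an automated conversion involves, here is
a rough Python sketch. It is not how the translit program works, just
an illustration. It assumes that the codes listed above follow the
letter order used elsewhere in this document (ending in the soft sign,
yeru, the hard sign, e, yu, ya), so that <tt/{\cprime}/ stands for the
soft sign and <tt/{\cdprime}/ for the hard sign. The names
<tt/LETTERS/, <tt/CODES/ and <tt/to_wncyr/ are made up for this
example.

```python
# Letter-to-code table for the Washington Cyrillic fonts.
# Assumption: the code list is in the order ... щ ь ы ъ э ю я.
LETTERS = "абвгдеёжзийклмнопрстуфхцчшщьыъэюя"
CODES = ["a", "b", "v", "g", "d", "e", r'\"e', "zh", "z", "i", r"{\u i}",
         "k", "l", "m", "n", "o", "p", "r", "s", "t", "u", "f", "kh", "c",
         "ch", "sh", "shch", r"{\cprime}", "y", r"{\cdprime}", r"\`e",
         "yu", "ya"]
TABLE = dict(zip(LETTERS, CODES))


def to_wncyr(text):
    """Transliterate lower-case Russian text; other characters pass through."""
    return "".join(TABLE.get(ch, ch) for ch in text)


print(r"{\cyr " + to_wncyr("привет, мир") + "}")
```

The sketch handles lower-case letters only and ignores ambiguous
letter combinations: for instance, "sh" followed by "ch" produces the
same output as "shch". A real converter has to take care of such cases.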
<sect2> KOI-8 package for teTeX
<p>
There is a new <htmlurl
url="ftp://xray.sai.msu.su/pub/outgoing/teTeX-rus/" name="teTeX-rus
package">. It is reported to support the KOI-8 character set and to have all
the basic stuff required for TeX and LaTeX. I personally haven't tried it
yet, although I have heard about its successful use.
<bf/NOTE:/ This package requires you to reconfigure and rebuild some
parts of your <bf/teTeX/ package (for example the precompiled LaTeX
macros). <bf>Unless you know what you are doing, you shouldn't try it
without the necessary care. Otherwise, you may be better off borrowing
the precompiled parts from somebody on the net.</bf>
<sect2>Using the cmcyralt package for LaTeX
<p>
The <tt/cmcyralt/ package can be found on any CTAN (Comprehensive TeX
Archive Network) site like <tt/ftp.dante.de/. You should obtain two
pieces: the fonts collection from <tt>fonts/cmcyralt</tt> and the
styles and hyphenation rules from
<tt>macros/latex/contrib/others/cmcyralt</tt>.
<bf/Note:/ Make sure you have the <tt/Sauter/ package installed, since
<tt/cmcyralt/ requires some fonts from it. You can get this package
from CTAN site as well.
Now you should do the following:
<enum>
<item>Put the new fonts into the TeX fonts tree. On my system (Slackware
2.2) I created a <tt/cmcyralt/ directory in
<tt>/usr/lib/texmf/fonts/cm/</tt>. Create the <tt/src/, <tt/tfm/, and
<tt/vf/ subdirectories in it and put the <tt/.mf/, <tt/.tfm/, and
<tt/.vf/ files there respectively.
<item>Put the font driver files (<tt/*.fd/) from the styles archive to the
appropriate place (in my case it was
<tt>/usr/lib/texmf/tex/latex/fd</tt>).
<item>Put the style files (<tt/*.sty/) to the appropriate LaTeX styles
directory (in my case <tt>/usr/lib/texmf/tex/latex/sty</tt>).
</enum>
Now the hyphenation setup. This requires rebuilding the LaTeX base
file.
<enum>
<item>The file <tt/hyphen.cfg/ contains the directives for both
English and Russian hyphenation. Extract the one for Russian and place
it to the LaTeX hyphenation config file <tt/lthyphen.ltx/. In my case,
that file was in <tt>/usr/lib/texmf/tex/latex/latex-base</tt>.
<item>Put the <tt/rhyphen.tex/ to the same directory. It is needed for
making the new base file. Later, you can remove it.
<item>Do '<tt/make/' in that directory. Don't forget to make a link
from <tt/Makefile/ to <tt/Makefile.unx/. During the make process check
the output. There should be a message:
<verb>
Loading hyphenation patterns for Russian.
</verb>
If everything goes OK, you will get the new <tt/latex.fmt/ in that
directory. Put it to the appropriate place, where the previous one was
(like <tt>/usr/lib/texmf/ini/</tt>). <bf/Don't forget to save the
previous one!/.
</enum>
This is it. The installation is complete. Try processing the examples
found in the styles archive. If you are able to create the PostScript files
without any problems, then everything is OK. Now, to use Cyrillic in
LaTeX, prepend your document with the following directive:
<verb>
\usepackage{cmcyralt}
</verb>
For more details, see the <tt/README/ file in the <tt/cmcyralt/ styles
archive.
<bf/Note:/ if you do have problems with the examples, provided you
have installed the things right, then probably your TeX system hasn't
been installed correctly. For example, during my first try, every
attempt to create the <tt/.pk/ files for the russian fonts failed
(<tt/MakeTeXPK/ stage). A substantial investigation discovered some
implicit conflict between the <it/localfont/ and <it/ljfour/
<tt/METAFONT/ configurations. It used to work before, but kept
crashing after the <tt/cmcyralt/ installation. Contact your local TeX
guru - TeX is very (sometimes too much) complicated to reconfigure it
without any prior knowledge.
<sect2>Using the CyrTUG package
<p>
You can obtain the CyrTUG package from the <htmlurl
url="ftp://sunsite.unc.edu/pub/academic/russian-studies/Software"
name="SunSite archive">. Get the files <tt/CyrTUGfonts.tar.gz/,
<tt/CyrTUGmacro.tar.gz/, and <tt/hyphen.tar.Z/.
The process of installation doesn't differ too much from the previous
one.
<!--
<sect1>The ApplixWare suite
<p>
As far as I know, <bf/ApplixWare/ allows
-->
<sect1>The StarOffice suite
<p>
Youri Kovalenko (<htmlurl url="http://www.inp.nsk.su/~kovalenko">) has
compiled a concise summary on StarOffice russification. It is located
at <htmlurl
url="ftp://sky.inp.nsk.su/archives_src/linux/StarOffice/russification.txt">.
I never had a chance to try it, so I cannot say anything about its
correctness.
Another source of information on the subject is compiled by Eugene
Demidov (<htmlurl url="mailto:jack@gpi.ru">) and is located at
<htmlurl url="ftp://ftp.kapella.gpi.ru/pub/cyrillic/psfonts/README">.
<sect>Printing and PostScript
<p>
<sect1>Text to PostScript conversion
<p>
Sometimes you just have a plain KOI8-R text and you want to
get it on paper. One of the easiest ways to achieve
that is to use a program that converts text to PostScript.
There are a number of programs doing such conversion. I personally
prefer <htmlurl url="http://www-inf.enst.fr/~demaille/a2ps.html"
name="a2ps">. Originally developed as a simple text-to-PostScript
converter it became a big and highly configurable program with many
options and allows you to manage various page layouts, syntax
highlighting etc. Another tool (now available as a part of the
<em/GNU/ project) is <htmlurl url="ftp://prep.ai.mit.edu/pub/gnu"
name="enscript">.
<sect2>An a2ps converter
<p>
This text-to-PostScript converter has been around for a while and is one
of the most versatile printing tools. The author proved to be very
open to suggestions, so since release 4.9.8 <bf/a2ps/ supports
Cyrillic right off the shelf. All you need is a PostScript printer.
The command I use is:
<verb>
a2ps -X koi8r --print-anyway