Fwd: Note: the ISO 8859 to UTF-8 migration has started

Rene Scholz mrz at informatik.uni-jena.de
Fri Dec 6 20:38:29 CET 2002


From: n02W49+mgk25 at cl.cam.ac.uk (Markus Kuhn)
Newsgroups: comp.os.linux.announce
Subject: Note: the ISO 8859 to UTF-8 migration has started
Date: Thu, 5 Dec 2002 17:23:14 CST
Organization: University of Cambridge, England
Message-ID: <asj7ke$7je$1 at pegasus.csx.cam.ac.uk>


UTF-8 -- The new common character set for GNU/Linux
---------------------------------------------------

Red Hat 8.0 is the first major Linux distribution that configures
its default installation to use the UTF-8 character encoding
(ISO 10646, Unicode). UTF-8 now replaces the various, quite
restricted, old ISO 8859 8-bit character sets that we have used
for the past 10 years. Other Linux distributors are expected to
follow shortly.

UTF-8 is the encoding variant of the Unicode character set
that was specifically designed to replace ASCII at all levels
in a backwards compatible way on Unix-style operating systems.
It was first introduced by the fathers of Unix around a
decade ago on AT&T's Plan 9 operating system and has since
become a formal international and Internet standard and
gradually found its way into the Unix and GNU/Linux world.
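
The backwards compatibility with ASCII can be checked directly.
The following is a minimal Python sketch, not part of the original
announcement; the helper name non_ascii_bytes_are_high and the
sample strings are made up for illustration.

```python
# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_text = "ls -l /usr/bin"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))


def non_ascii_bytes_are_high(text):
    """Return True if every byte of every non-ASCII character is >= 0x80.

    This is the property that keeps bytes such as '/' or NUL from
    appearing by accident inside a multi-byte character.
    """
    for ch in text:
        if ord(ch) >= 0x80 and any(b < 0x80 for b in ch.encode("utf-8")):
            return False
    return True


sample = "Grüße, Ελληνικά, 日本語"
print(sample.encode("utf-8"))
print(non_ascii_bytes_are_high(sample))
```

Because ASCII bytes keep their meaning, tools that only look for
ASCII delimiters in a byte stream generally keep working unchanged.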

The use of different 8-bit charsets has in the past caused
enormous configuration and data exchange problems, and the only
viable long-term solution for these problems remains to
move eventually to one single character encoding all
over the planet under Linux for filenames, plaintext files,
email and terminal I/O streams. UTF-8 is the only widely
recognized encoding that fulfills the requirements of
all major language communities under Linux and is
destined to become this common global character encoding.
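
The data exchange problem is easy to reproduce: one and the same
byte stands for a different character in each ISO 8859 part, and
nothing in the data says which part was meant. A small Python
sketch (the helper decode_under is hypothetical and only serves
this illustration):

```python
def decode_under(raw, charsets):
    """Decode the same bytes under several charsets; return name -> text."""
    return {name: raw.decode(name) for name in charsets}


# Byte 0xE4 read as Latin-1, Cyrillic and Greek.
print(decode_under(b"\xe4", ["iso-8859-1", "iso-8859-5", "iso-8859-7"]))

# Byte 0xA4 is the currency sign in ISO 8859-1 but the euro sign in 8859-15.
print(decode_under(b"\xa4", ["iso-8859-1", "iso-8859-15"]))

# UTF-8 text misread as ISO 8859-1 turns one character into two.
print("ä".encode("utf-8").decode("iso-8859-1"))
```

With a single encoding everywhere, the receiver no longer has to
guess which table the sender used.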

UTF-8 is a multi-byte encoding. It needs to be treated by
some application programs (in particular all editors and any
program that interacts directly with fonts) slightly
differently from single-byte encodings such as ISO 8859.

The main difference between UTF-8 and single-byte encodings
is that for any text string

  - the number of bytes occupied
  - the number of characters contained
  - the number of terminal cursor positions consumed

are no longer necessarily identical, and any function that
assumed they were needs to be slightly updated.
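
The three quantities can be compared for a few strings with the
Python sketch below. The function display_width is a rough
approximation written for this example (zero columns for combining
marks, two for East Asian wide and fullwidth characters, one
otherwise); it ignores control characters and is not a substitute
for the C library's wcwidth().

```python
import unicodedata


def display_width(text):
    """Approximate the number of terminal columns a string occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width


def measure(text):
    """Return (bytes in UTF-8, characters, approximate columns)."""
    return len(text.encode("utf-8")), len(text), display_width(text)


for s in ("abc", "Grüße", "日本語", "e\u0301"):
    print(repr(s), measure(s))
```

Only for the pure ASCII string do all three numbers agree. Code
that truncates a buffer, pads a column or moves the cursor has to
pick the right one of the three.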

Over the past three years, a handful of enthusiasts have worked
intensively to upgrade the most widely used Linux tools to work
correctly with the UTF-8 encoding. As of late summer 2002,
UTF-8 upgrades have been completed, or are at an advanced stage,
for glibc, gcc, Xlib, xterm, X11 fonts, bash, textutils, emacs,
vim, perl, python and tcl/tk, among others. UTF-8 support
has now reached a level at which its first use in a production
environment can be recommended to a larger community of
experienced Linux users.

Of course, UTF-8 support is far from perfect and quite a
number of glitches can still be expected, especially with
less well maintained packages (just as it was with ISO 8859
10 years ago!). Red Hat had the courage to force this
issue somewhat by initiating a large scale test of
the available UTF-8 support, by setting the default locale
to UTF-8 for users in most countries. Please support them
by testing your own software with UTF-8 and by addressing
related problems and user feedback quickly.

If you are maintaining a Linux package that might be affected
by the move to UTF-8, then please upgrade to a very recent
Linux distribution (e.g., Red Hat 8.0 or SuSE 8.1), pick
a locale such as LC_CTYPE=en_GB.UTF-8 and for xterm one
of the ISO10646-1 fonts, and test the behaviour of non-ASCII
characters.
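
Before testing, it helps to confirm that the program really sees
a UTF-8 locale. The Python sketch below is one way to check this
on a Unix-like system; the function names are made up for this
example, and locale.nl_langinfo is not available on every platform.

```python
import codecs
import locale


def is_utf8_codeset(name):
    """Return True if a codeset name such as 'UTF-8' or 'utf8' means UTF-8."""
    if not name:
        return False
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        return False


def current_codeset():
    """Adopt the locale from the environment and report its codeset.

    Returns None if the requested locale is not installed.
    """
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        return None
    return locale.nl_langinfo(locale.CODESET)


codeset = current_codeset()
print("codeset:", codeset, "UTF-8:", is_utf8_codeset(codeset))
```

Running the script once with LC_CTYPE=en_GB.UTF-8 and once with an
ISO 8859 locale in the environment shows whether the setting takes
effect, provided both locales are installed.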

If you are maintaining a national-language support HOWTO
document, please start to consider UTF-8 as an alternative
encoding for representing the character repertoire of your
respective script and language under Linux.

If you are a journalist, note that the move to UTF-8 is one of
the major forthcoming breakthroughs in the evolution of the
GNU/Linux platform and its progress deserves a fair amount of
public attention.

For more information on UTF-8 under Linux, please have a look at

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

and for expert advice join the linux-utf8 mailing list by
sending a message to

  linux-utf8-request at nl.linux.org

with the subject "subscribe".

A 3-minute demo of UTF-8 under Linux is at

  http://www.cl.cam.ac.uk/~mgk25/ucs/quick-intro.txt

If you are completely unfamiliar with UTF-8, then please read
one of

  - man utf-8
  - http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
  - http://www.ietf.org/rfc/rfc2279.txt

for a description of the encoding, and refer to the
Unicode Standard (Addison-Wesley, 2000) for all the
gory details.

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__




-- 
tlug mailing list
Archive: http://www.tlug.de/archiv/
http://schwarz.thueday.de/mailman/listinfo/tlug_allgemein