CURRENT_MEETING_REPORT_

Reported by Borka Jerman-Blazic/Jozef Stefan Institute

Minutes of the UCS Character Set BOF (UCS)


Introduction

A brief introductory tutorial was given by Borka Jerman-Blazic.  She
described some of the problems which appear on the network due to the
lack of support for the national character sets used for inputting,
outputting, processing and displaying the text written in languages used
all over the world.  She stressed the need for proper maintenance of the
character integrity over the network.  The requirement for processing
and interchanging different character sets correctly is especially
relevant for some Internet services dealing with names of persons or
organizations.


Presentation of the Problems

Peter Svanberg gave a short overview of the level of support for
non-ASCII character sets in different Internet protocols.  Some of the
protocols were identified as hostile to 8-bit characters.  Among them
are:  DNS, SMTP, FTP, NNTP, WAIS, MIME Text/Enhanced, NFS, AFS, Whois,
URN, Gopher, etc.  The more recently developed protocols such as MIME
part 1 and part 2 as well as some currently on-going projects such as
Whois++, as mentioned by Simon Spero, support 16-bit coding and the
repertoires provided by such coding.  He also mentioned that several
IETF groups developing new protocols/services regard proper support of
character sets as an important problem.  The level of
support for extended character sets in some protocols used on the
Internet is included in the Annex below.

The next speaker was Masataka Ohta.  He presented his view regarding the
idea that the International Universal Coding system be recommended for
use over the Internet.  He identified five properties which are required
to be present in the recommended coding system:


  1. Identity for encoding and decoding, which he understands as unique
     mapping between particular graphic character and its code (bit
     combination);

  2. Causality, understood as independence of a processed coded
     character from the other incoming characters in the data stream;

  3. Finite state recognition, state dependence of the code required for
     presentation/display of multi-octet coded data;

  4. Finite resynchronizability, which means that the state of the
     automaton can be determined uniquely by reading a fixed, finite
     number of octets; and

  5. Equality, requirement that a character coded with a different
     coding system can always be recognized as the same character.
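
As an illustration of property 4 only, the following Python sketch uses
UTF-8 as a stand-in example of a multi-octet encoding; it is not the
scheme proposed at the BOF, and the helper name resync is hypothetical.
It shows a reader that starts in the middle of a character and finds the
next character boundary by inspecting a bounded number of octets.

```python
def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos.

    Assumes data is valid UTF-8.  Continuation octets have the bit
    pattern 10xxxxxx, so at most three octets are skipped.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


text = "naïve café"
data = text.encode("utf-8")

# Octet 3 is the second octet of the two-octet sequence for "ï".
start = resync(data, 3)
tail = data[start:].decode("utf-8")
print(start, tail)
```

An encoding that relies on announcements made earlier in the stream
cannot be handled this way, because the reader would have to go back an
unbounded distance to recover the state.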


Masataka looked for the required properties in ISO 10646 and found out
that full ISO 10646 (UCS4) satisfies none of the required properties.
He also pointed out that ISO 10646 level 1 satisfies all of the
required properties for the European languages.

He proposed an extension to the existing UCS code system consisting of
five additional bits which will enable the deficiency of the UCS coding
system to be overcome.  The discussion showed that the proposed solution
is not in the general stream of the development of the standard
character set codes and their applications in the computing systems.
One of the possible solutions to the problems identified by Masataka
could be the use of the whole model of UCS, i.e., the four envisaged
octets which define the cell and row position for a character in the
Basic Multilingual Plane of ISO 10646 and in the additional planes and
groups.
There was a proposal that the required five additional bits be coded as
a private plane in the UCS scheme.  John Klensin noted that such an
approach could clash with the reassignment of such a plane in the
standardization process of ISO JTC1/SC2.  In the discussion the problem
of the handling of bidirectional text was also identified.  Masataka
said that one of the five additional bits in his scheme is intended to
be used for indication of bidirectional text.

Harald Alvestrand pointed out that what is happening now is a sort of
transition period between 8-bit coding and the 16-bit coding provided by
UCS.  Another parallel stream for support of different national character
sets is ``character switching'', which is enabled by use of the code
extension technique of ISO 2022.  It was obvious that this scheme is not
of practical use for the Internet except for special cases, i.e., the
Japanese e-mail solution.


Conclusions

The attendees then discussed possible work items which will result if
the IESG approves the formation of a working group.  The chair
identified several documents which deal with character set problems such
as:  RFC 1345, ``Character Mnemonics & Character Sets,'' the
Internet-Draft, ``X.400 use of extended character sets,'' and the
Internet-Draft, ``Characters and character sets for various languages.''
John Klensin pointed out that special care is needed when recommending
UTF-2 as a data interchange method over the Internet, in view of the
possible assignment of additional coding planes by JTC1/SC2.  He also
recommended the use of a mailing list already operating within the
IETF, ietf-charsets@innosoft.com.  The
mailing list of the RARE working group on character sets could be added
to that mailing list.  Other items were discussed and proposed by the
BOF attendees.  It was decided that the IESG will be asked to consider
the possibility of setting up a working group to produce the following:


   o A document defining how UCS can be used in a uniform way in
     Internet protocols, especially taking into consideration the UTF-2
     encoding of UCS. The document will provide guidance to other
     protocols which have to deal with these items over the Internet.

   o A document identifying the languages and the characters required
     for coding text written in a particular natural language (a sort of
     guideline for services dealing with multilinguality, such as NIR
     services based on plain text).

   o A document defining a tool for coded character set conversion to be
     provided within some services, such as e-mail user agents,
     including fall-back representation of incoming characters that are
     outside the supported character repertoire of the receiver.

   o A proposal for extending the mandatory issues which have to be
     covered in the RFC standardization process to include character set
     consideration and support.
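
The conversion item above can be pictured with a small Python sketch.
It is only one possible reading of ``fall-back representation'': the
function name convert_with_fallback, the two fall-back steps and the
placeholder format are assumptions made for illustration, not something
agreed at the BOF.  The target charset is assumed to be a superset of
ASCII.

```python
import unicodedata


def convert_with_fallback(text: str, target: str = "ascii") -> bytes:
    """Convert text to the target charset, degrading unsupported characters."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
            continue
        except UnicodeEncodeError:
            pass
        # Fall-back 1: drop combining marks and retry with the base letter.
        decomposed = unicodedata.normalize("NFKD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        if base and base != ch:
            try:
                out += base.encode(target)
                continue
            except UnicodeEncodeError:
                pass
        # Fall-back 2: a visible placeholder naming the code point.
        out += f"<U+{ord(ch):04X}>".encode(target)
    return bytes(out)


print(convert_with_fallback("čaj"))
print(convert_with_fallback("Kytölaakso", "latin-1"))
print(convert_with_fallback("日"))
```

Keeping the placeholder visible, rather than silently dropping the
character, lets the receiver see that information was lost.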



Annex



The level of support for extended character sets in some Internet
Standard protocols.

  ___________________________________________________________________________
  | CharSet |                      | CharSet |                              |
  | Support | Protocol             | Support | ``Next Generation'' Protocol |
  |_________|______________________|_________|______________________________|
  |    1    | SMTP                 |    3    | ESMTP                        |
  |    1    | RFC822               |    4    | MIME part 1 + part 2         |
  |    1    | DNS                  |         |                              |
  |    2    | FTP                  |         |                              |
  |    3    | Telnet               |         |                              |
  |    2    | NNTP                 |         |                              |
  |    2    | Finger               |         |                              |
  |    2    | POP3                 |         |                              |
  |    2    | IMAP2                |    3    | IMAP2bis                     |
  |    1    | NFS                  |         |                              |
  |    1    | AFS                  |         |                              |
  |    2    | MIME Text/Enhanced   |         |                              |
  |    ?    | MIME Text/simplemail |         |                              |
  |    3    | STIF                 |         |                              |
  |    2    | Gopher               |    3    | Gopher+                      |
  |    1    | WAIS                 |         |                              |
  |    ?    | Prospero             |         |                              |
  |    2    | HTML                 |         |                              |
  |    2    | Whois                |    3    | Whois++                      |
  |    2    | URL                  |         |                              |
  |    2    | URN                  |         |                              |
  |    3    | URM                  |         |                              |
  |_________|______________________|_________|______________________________|


Legend:

1 -- hostile to 8-bit characters
2 -- no support for different character sets
3 -- some support for different character sets
4 -- well thought-out support for different character sets
5 -- uniform treatment of all characters
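
The legend does not say what ``hostile'' means in practice.  As one
assumed failure mode, the Python sketch below models a 7-bit channel
that clears the high bit of every octet; the function name
strip_high_bit and the sample word are hypothetical.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Model a channel that forwards only the low seven bits of each octet."""
    return bytes(b & 0x7F for b in data)


sent = "café".encode("latin-1")
received = strip_high_bit(sent)

print(sent)
print(received)
```

The accented letter arrives as a different, valid ASCII letter, so the
damage cannot be detected from the received octets alone.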


Attendees

Harald Alvestrand        Harald.Alvestrand@uninett.no
Piet Bovenga             p.bovenga@uci.kun.nl
Maria Dimou-Zacharova    dimou@dxcern.cern.ch
Tim Dixon                dixon@rare.nl
Olle Jarnefors           ojarnef@admin.kth.se
Borka Jerman-Blazic      jerman-blazic@ijs.si
Tomaz Kalin              kalin@rare.nl
John Klensin             Klensin@infoods.unu.edu
Pekka Kytolaakso         pekka.kytolaakso@csc.fi
Thomas Lenggenhager      lenggenhager@switch.ch
Jun Matsukata            jm@eng.isas.ac.jp
Keith Moore              moore@cs.utk.edu
Masataka Ohta            mohta@cc.titech.ac.jp
Geir Pedersen            Geir.Pedersen@usit.uio.no
Luc Rooijakkers          lwj@cs.kun.nl
Rickard Schoultz         schoultz@admin.kth.se
Milan Sova               sova@feld.cvut.cz
Simon Spero              simon_spero@unc.edu
Peter Svanberg           psv@nada.kth.se
Guido van Rossum         guido@cwi.nl


