Grapheme Break Chart

Unicode Version: 5.0.0

Date: 2006-06-13, 23:23:42 GMT

This page illustrates the application of the boundary specifications. The first chart shows where breaks would appear between different sample characters or strings. The sample characters are chosen mechanically to represent the different properties used by the specification. Where properties used in the rules have 'overlaps', the samples are given 'composed' names. For example, SentenceBreak uses GCLF_Sep: Sep is the SentenceBreak property, but it overlaps with the GraphemeClusterBreak property LF.

	Other	CR	LF	Control	Extend	L	V	T	LV	LVT
Other	÷	÷	÷	÷	×	÷	÷	÷	÷	÷
CR	÷	÷	×	÷	÷	÷	÷	÷	÷	÷
LF	÷	÷	÷	÷	÷	÷	÷	÷	÷	÷
Control	÷	÷	÷	÷	÷	÷	÷	÷	÷	÷
Extend	÷	÷	÷	÷	×	÷	÷	÷	÷	÷
L	÷	÷	÷	÷	×	×	×	÷	×	×
V	÷	÷	÷	÷	×	÷	×	×	÷	÷
T	÷	÷	÷	÷	×	÷	÷	×	÷	÷
LV	÷	÷	÷	÷	×	÷	×	×	÷	÷
LVT	÷	÷	÷	÷	×	÷	÷	×	÷	÷
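
As a reading aid, the chart can be treated as a lookup table: the row is the property of the character before a candidate boundary, the column is the property of the character after it, ÷ marks a break and × marks no break. The following is a minimal sketch of that lookup, with the cells copied from the chart above. The names `COLUMNS`, `CHART`, and `chart_allows_break` are illustrative and are not part of any Unicode data file or library.

```python
# Column order as it appears in the chart header.
COLUMNS = "Other CR LF Control Extend L V T LV LVT".split()

# Cells copied from the chart: one line per row (the "before" property).
ROWS = """
Other   ÷ ÷ ÷ ÷ × ÷ ÷ ÷ ÷ ÷
CR      ÷ ÷ × ÷ ÷ ÷ ÷ ÷ ÷ ÷
LF      ÷ ÷ ÷ ÷ ÷ ÷ ÷ ÷ ÷ ÷
Control ÷ ÷ ÷ ÷ ÷ ÷ ÷ ÷ ÷ ÷
Extend  ÷ ÷ ÷ ÷ × ÷ ÷ ÷ ÷ ÷
L       ÷ ÷ ÷ ÷ × × × ÷ × ×
V       ÷ ÷ ÷ ÷ × ÷ × × ÷ ÷
T       ÷ ÷ ÷ ÷ × ÷ ÷ × ÷ ÷
LV      ÷ ÷ ÷ ÷ × ÷ × × ÷ ÷
LVT     ÷ ÷ ÷ ÷ × ÷ ÷ × ÷ ÷
"""

# Build {before: {after: symbol}} from the text above.
CHART = {}
for line in ROWS.strip().splitlines():
    before, *cells = line.split()
    CHART[before] = dict(zip(COLUMNS, cells))


def chart_allows_break(before: str, after: str) -> bool:
    """Return True if the chart shows ÷ between `before` and `after`."""
    return CHART[before][after] == "÷"
```

For example, `chart_allows_break("CR", "LF")` is false because the CR row shows × in the LF column, while `chart_allows_break("LF", "CR")` is true.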

Rules

Due to the way they have been mechanically processed for generation, the following rules do not match the UAX rules precisely. In particular:

The rules are cast into a more regex-like style.
The rules "sot ÷", "÷ eot", and "÷ Any" are added mechanically, and have artificial numbers.
The rules are given decimal numbers, so rules such as 11a are given a number using tenths, such as 11.1.
Where a rule has multiple parts (lines), each one is numbered using hundredths, such as 21.01) × BA, 21.02) × HY,...
Any 'treat as' or 'ignore' rules are handled as discussed in Unicode Standard Annex #29, and are thus reflected in a transformation of the rules that is not visible here.
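
The numbering convention described above can be sketched in a few lines. This is an illustration only: it assumes that letter suffixes a, b, c, ... map to tenths .1, .2, .3, ..., and that the 1-based line within a multi-part rule maps to hundredths. Only the examples 11a → 11.1 and 21.01 come from the text; the general mapping and the function name `decimal_rule_number` are assumptions.

```python
import re
from decimal import Decimal


def decimal_rule_number(uax_label: str, part: int = 0) -> Decimal:
    """Map a UAX-style rule label such as '11a' to a decimal rule number.

    Assumption: suffix letters a..i map to tenths 1..9, and `part`
    (1-based line within a multi-part rule, 0 if none) maps to hundredths.
    """
    match = re.fullmatch(r"(\d+)([a-i]?)", uax_label)
    if match is None:
        raise ValueError(f"unrecognized rule label: {uax_label!r}")
    number, letter = match.groups()
    tenths = ord(letter) - ord("a") + 1 if letter else 0
    # Decimal avoids binary floating-point noise in values like 21.01.
    return Decimal(number) + Decimal(tenths) / 10 + Decimal(part) / 100
```

Under these assumptions, `decimal_rule_number("11a")` equals `Decimal("11.1")` and `decimal_rule_number("21", part=1)` equals `Decimal("21.01")`.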

For the original rules, see the UAX.

0.2) sot ÷
0.3) ÷ eot
3.0) CR × LF
4.0) ( Control | CR | LF ) ÷
5.0) ÷ ( Control | CR | LF )
6.0) L × ( L | V | LV | LVT )
7.0) ( LV | V ) × ( V | T )
8.0) ( LVT | T ) × T
9.0) × Extend
999.0) ÷ Any
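
To show how the rules produce the chart, the following sketch applies rules 3.0 through 999.0 in order to a pair of property values, with the first matching rule deciding the result. It is a minimal illustration and not the generator used for this page. The names `is_break` and `render_chart` are hypothetical. Rules 0.2 and 0.3 concern the start and end of text, so they do not appear in a pairwise function.

```python
PROPS = ["Other", "CR", "LF", "Control", "Extend", "L", "V", "T", "LV", "LVT"]
HARD = {"Control", "CR", "LF"}


def is_break(before: str, after: str) -> bool:
    """Return True for ÷ (break) and False for × (no break)."""
    if before == "CR" and after == "LF":                      # 3.0
        return False
    if before in HARD:                                        # 4.0
        return True
    if after in HARD:                                         # 5.0
        return True
    if before == "L" and after in {"L", "V", "LV", "LVT"}:    # 6.0
        return False
    if before in {"LV", "V"} and after in {"V", "T"}:         # 7.0
        return False
    if before in {"LVT", "T"} and after == "T":               # 8.0
        return False
    if after == "Extend":                                     # 9.0
        return False
    return True                                               # 999.0


def render_chart() -> str:
    """Render a tab-separated chart: rows are `before`, columns `after`."""
    lines = ["\t" + "\t".join(PROPS)]
    for before in PROPS:
        cells = ("÷" if is_break(before, after) else "×" for after in PROPS)
        lines.append(before + "\t" + "\t".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_chart())
```

The order of the checks matters: because rule 4.0 is tested before rule 9.0, the pair CR followed by Extend yields ÷ even though rule 9.0 on its own would give ×.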