CHARSET_INFO
============
A structure containing data for charset+collation pair implementation. 

Virtual functions that use this data are collected into separate
structures, MY_CHARSET_HANDLER and MY_COLLATION_HANDLER.


typedef struct charset_info_st
{
  uint      number;
  uint      primary_number;
  uint      binary_number;
  uint      state;

  const char *csname;
  const char *name;
  const char *comment;

  uchar    *ctype;
  uchar    *to_lower;
  uchar    *to_upper;
  uchar    *sort_order;

  uint16      *tab_to_uni;
  MY_UNI_IDX  *tab_from_uni;

  uchar state_map[256];
  uchar ident_map[256];

  uint      strxfrm_multiply;
  uint      mbminlen;
  uint      mbmaxlen;
  uint16    max_sort_char; /* For LIKE optimization */

  MY_CHARSET_HANDLER *cset;
  MY_COLLATION_HANDLER *coll;

} CHARSET_INFO;


CHARSET_INFO fields description:
===============================


Numbers (identifiers)
---------------------

number - an ID uniquely identifying this charset+collation pair.

primary_number - ID of a charset+collation pair, which consists
of the same character set and the default collation of this
character set. Not really used now. Intended to optimize some
parts of the code where we need to find the default collation
using its non-default counterpart for the given character set.

binary_number - ID of a charset+collation pair, which consists
of the same character set and the binary collation of this
character set. Not really used now.

Names
-----

  csname  - name of the character set for this charset+collation pair.
  name    - name of the collation for this charset+collation pair.
  comment - a text comment, displayed in the "Description" column of
            SHOW CHARACTER SET output.

Conversion tables
-----------------
  
  ctype      - pointer to array[257] of "type of characters"
               bit mask for each character, e.g., whether a
               character is a digit, letter, separator, etc.

               Monty 2004-10-21:
                 If you look at the macros, we use ctype[(char)+1].
                 ctype[0] is traditionally in most ctype libraries
                 reserved for EOF (-1). The idea is that you can use
                 the result from fgetc() directly with ctype[]. As
                 we have to be compatible with external ctype[] versions,
                 it's better to do it the same way as they do...

  to_lower   - pointer to array[256] used in LCASE()
  to_upper   - pointer to array[256] used in UCASE()
  sort_order - pointer to array[256] used for string comparison
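
The ctype[(char)+1] convention and the 256-entry case tables can be
sketched as follows. This is only an illustration for plain ASCII: the
sk_ names and the flag values are hypothetical, and the real bit
assignments and macros are the ones in m_ctype.h.

```c
#include <stddef.h>

/* Hypothetical flag bits; the real bit assignments live in m_ctype.h. */
#define SK_UPPER 0x01
#define SK_LOWER 0x02
#define SK_DIGIT 0x04

/* 257 entries: slot 0 is reserved for EOF (-1), slot c+1 describes byte c. */
static unsigned char sk_ctype[257];
static unsigned char sk_to_upper[256];

/* Fill both tables for plain ASCII; bytes 0x80..0xFF map to themselves. */
void sk_init_ascii_tables(void)
{
  int c;
  sk_ctype[0] = 0;                      /* EOF belongs to no character class */
  for (c = 0; c < 256; c++)
  {
    unsigned char flags = 0;
    if (c >= 'A' && c <= 'Z') flags |= SK_UPPER;
    if (c >= 'a' && c <= 'z') flags |= SK_LOWER;
    if (c >= '0' && c <= '9') flags |= SK_DIGIT;
    sk_ctype[c + 1] = flags;
    sk_to_upper[c] = (unsigned char) ((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
  }
}

/* Accepts -1 (EOF) or 0..255, like the result of fgetc(). */
int sk_isdigit(int c)
{
  return (sk_ctype[c + 1] & SK_DIGIT) != 0;
}

/* Case conversion is a single table lookup per byte. */
int sk_toupper(unsigned char c)
{
  return sk_to_upper[c];
}
```

After sk_init_ascii_tables(), sk_isdigit(-1) is a valid call and reports
"not a digit", which is the point of the extra slot at index 0.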

In all Asian charsets these arrays are set up as follows:

- All bytes in the range 0x80..0xFF are marked as letters in the
  ctype array.

- The to_lower and to_upper arrays map only ASCII letters.
  UPPER() and LOWER() don't really work for multi-byte characters.
  Most of the characters in Asian character sets are ideograms
  anyway and they don't have case mapping. However, there are
  still some characters from European alphabets.
  For example:
  _ujis 0x8FAAF2 - LATIN CAPITAL LETTER Y WITH ACUTE
  _ujis 0x8FABF2 - LATIN SMALL LETTER Y WITH ACUTE

  But they don't map to each other with UPPER and LOWER operations.

- The sort_order array is filled case insensitively for the
  ASCII range 0x00..0x7F, and in "binary" fashion for the multi-byte
  range 0x80..0xFF for these collations:

  cp932_japanese_ci,
  euckr_korean_ci,
  eucjpms_japanese_ci,
  gb2312_chinese_ci,
  sjis_japanese_ci,
  ujis_japanese_ci.

  So multi-byte characters are sorted just according to their codes.


- Two collations are still case insensitive for the ASCII characters,
  but have special sorting order for multi-byte characters
  (something more complex than just according to codes):

  big5_chinese_ci
  gbk_chinese_ci

  So handlers for these collations use only the 0x00..0x7F part
  of their sort_order arrays, and apply special functions
  to multi-byte characters.
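
A rough sketch of the simpler case above (case-insensitive ASCII,
"binary" order for 0x80..0xFF). The sk_ names are hypothetical, and the
real handlers for collations such as sjis_japanese_ci are more involved
than this byte-by-byte loop.

```c
#include <stddef.h>

static unsigned char sk_sort_order[256];

/* ASCII letters share one weight per letter; every other byte,
   including 0x80..0xFF, keeps its own code as its weight. */
void sk_init_sort_order(void)
{
  int c;
  for (c = 0; c < 256; c++)
    sk_sort_order[c] = (unsigned char) ((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
}

/* Compare byte by byte through the weight table.
   If one string is a prefix of the other, the shorter one sorts first. */
int sk_compare(const char *a, size_t alen, const char *b, size_t blen)
{
  size_t i, n = alen < blen ? alen : blen;
  for (i = 0; i < n; i++)
  {
    unsigned char wa = sk_sort_order[(unsigned char) a[i]];
    unsigned char wb = sk_sort_order[(unsigned char) b[i]];
    if (wa != wb)
      return wa < wb ? -1 : 1;
  }
  return (alen > blen) - (alen < blen);
}
```

With this table "abc" and "ABC" compare as equal, while two multi-byte
characters are ordered purely by their byte codes.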

In Unicode character sets we have full support for UPPER/LOWER mapping,
sorting order, and character type detection.
"utf8_general_ci" still has the "old-fashioned" arrays
like to_upper, to_lower, sort_order and ctype, but they are
not really used (maybe only in some rare legacy functions).



Unicode conversion data
-----------------------
For 8-bit character sets:

tab_to_uni  : array[256] of charset->Unicode translation
tab_from_uni: a structure for Unicode->charset translation

Non-8-bit charsets have their own per-charset structures,
hidden in the corresponding ctype-xxx.c file, and don't use
the tab_to_uni and tab_from_uni tables.
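
For an 8-bit character set the two directions can be sketched like this.
The charset below is made up for illustration (ASCII plus byte 0x80 for
the euro sign), the sk_ names are hypothetical, and the range structure
is only an assumption about what a Unicode->charset index may look like.

```c
#include <stddef.h>
#include <stdint.h>

/* One contiguous Unicode range and the charset bytes it maps back to. */
typedef struct
{
  uint16_t from, to;
  const unsigned char *tab;     /* tab[wc - from] is the byte, 0 if unmapped */
} sk_uni_range;

static uint16_t sk_tab_to_uni[256];
static unsigned char sk_ascii_bytes[128];
static const unsigned char sk_euro_byte[1] = { 0x80 };
static sk_uni_range sk_tab_from_uni[2];

/* Must be called once before the lookup functions below. */
void sk_init_unicode_tables(void)
{
  int c;
  for (c = 0; c < 256; c++)
    sk_tab_to_uni[c] = (uint16_t) (c < 128 ? c : 0);   /* 0 = unmapped byte */
  for (c = 0; c < 128; c++)
    sk_ascii_bytes[c] = (unsigned char) c;
  sk_tab_to_uni[0x80] = 0x20AC;                        /* EURO SIGN */

  sk_tab_from_uni[0].from = 0x0000;
  sk_tab_from_uni[0].to = 0x007F;
  sk_tab_from_uni[0].tab = sk_ascii_bytes;
  sk_tab_from_uni[1].from = 0x20AC;
  sk_tab_from_uni[1].to = 0x20AC;
  sk_tab_from_uni[1].tab = sk_euro_byte;
}

/* charset -> Unicode: one array lookup. Returns -1 for an unmapped byte. */
long sk_to_unicode(unsigned char byte)
{
  uint16_t wc = sk_tab_to_uni[byte];
  return (wc == 0 && byte != 0) ? -1 : (long) wc;
}

/* Unicode -> charset: find the range, then index into its byte table.
   Returns -1 if the code point has no byte in this charset. */
int sk_from_unicode(unsigned long wc)
{
  size_t i;
  for (i = 0; i < sizeof(sk_tab_from_uni) / sizeof(sk_tab_from_uni[0]); i++)
  {
    const sk_uni_range *r = &sk_tab_from_uni[i];
    if (wc >= r->from && wc <= r->to)
    {
      unsigned char byte = r->tab[wc - r->from];
      return (byte == 0 && wc != 0) ? -1 : (int) byte;
    }
  }
  return -1;
}
```

The asymmetry is deliberate: 256 possible bytes fit in a flat array, but
the Unicode side is sparse, so it is split into ranges.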


Parser maps
-----------
state_map[]
ident_map[]

These maps are used to quickly identify whether a character is an
identifier part, a digit, a special character, or a part of another
SQL language lexical item.

These maps can probably be combined with the ctype array in the future.
For now, these two arrays are used in the parser, while a separate
ctype[] array is used in other parts of the code, such as fulltext.


Miscellaneous fields
--------------------

  strxfrm_multiply - how many times a sort key (that is, a string
                     that can be passed into memcmp() for comparison)
                     can be longer than the original string.
                     Usually it is 1. For some complex
                     collations it can be bigger. For example,
                     in latin1_german2_ci, a sort key is up to
                     two times longer than the original string,
                     e.g. the letter 'A' with two dots above is
                     substituted with 'AE'.
  mbminlen         - minimum multi-byte sequence length.
                     Now always 1 except for ucs2. For ucs2,
                     it is 2.
  mbmaxlen         - maximum multi-byte sequence length.
                     1 for 8-bit charsets. Can also be 2 or 3.

  max_sort_char    - for LIKE range:
                     in case of 8-bit character sets - native code
                     of maximum character (max_str pad byte);
                     in case of UTF8 and UCS2 - Unicode code of the maximum
                     possible character (usually U+FFFF). This code is
                     converted to multi-byte representation (usually 0xEFBFBF)
                     and then used as a pad sequence for max_str;
                     in case of other multi-byte character sets -
                     max_str pad byte (usually 0xFF).
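
The role of strxfrm_multiply can be sketched with a toy sort-key builder.
It handles only the 'A'/'a' with two dots above -> 'AE' substitution
described above (bytes 0xC4 and 0xE4 in latin1) plus ASCII case folding;
it is not the real latin1_german2_ci implementation, and all sk_ names
are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SK_STRXFRM_MULTIPLY 2   /* a key may be up to twice the source length */

/* Worst-case key size a caller has to reserve for a source string. */
size_t sk_key_capacity(size_t srclen)
{
  return srclen * SK_STRXFRM_MULTIPLY;
}

/* Build a key that can be compared with memcmp(). Returns the key length;
   stops early if the key buffer is too small. */
size_t sk_make_key(const unsigned char *src, size_t srclen,
                   unsigned char *key, size_t keycap)
{
  size_t i, n = 0;
  for (i = 0; i < srclen; i++)
  {
    unsigned char c = src[i];
    if (c == 0xC4 || c == 0xE4)         /* A/a with two dots above: two weights */
    {
      if (n + 2 > keycap) break;
      key[n++] = 'A';
      key[n++] = 'E';
    }
    else
    {
      if (n + 1 > keycap) break;
      key[n++] = (unsigned char) ((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
    }
  }
  return n;
}

/* Compare two 0-terminated strings (at most 32 bytes each in this sketch)
   through their sort keys. */
int sk_key_compare(const char *a, const char *b)
{
  unsigned char ka[64], kb[64];
  size_t la = sk_make_key((const unsigned char *) a, strlen(a), ka, sizeof(ka));
  size_t lb = sk_make_key((const unsigned char *) b, strlen(b), kb, sizeof(kb));
  size_t n = la < lb ? la : lb;
  int cmp = memcmp(ka, kb, n);
  if (cmp != 0)
    return cmp;
  return (la > lb) - (la < lb);
}
```

A three-byte string made only of such letters yields a six-byte key,
which is why the buffer is sized as length * strxfrm_multiply.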

MY_CHARSET_HANDLER
==================

MY_CHARSET_HANDLER is a collection of character-set
related routines. It is defined in m_ctype.h and has the
following set of functions:

Multi-byte routines
-------------------
ismbchar()  - detects whether the given string is a multi-byte sequence
mbcharlen() - returns length of multi-byte sequence starting with
              the given character
numchars()  - returns number of characters in the given string, e.g.
              in SQL function CHAR_LENGTH().
charpos()   - calculates the offset of the given position in the string.
              Used in SQL functions LEFT(), RIGHT(), SUBSTRING(), 
              INSERT()

well_formed_length()
            - finds the length of the correctly formed multi-byte beginning.
              Used in INSERTs to cut a beginning of the given string
              which is
              a) "well formed" according to the given character set,
              b) able to fit into the given data type.
              Terminates the string in the good position, taking into account
              multi-byte character boundaries.

lengthsp()  - returns the length of the given string without trailing spaces.
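
What these routines do can be sketched for UTF-8 sequences of up to
three bytes. The sk_ functions below are hypothetical and do not follow
the real handler signatures; the well-formedness check is simplified
and, for instance, does not reject overlong encodings.

```c
#include <stddef.h>

/* Length of the sequence that starts with this lead byte, 0 if invalid. */
size_t sk_mbcharlen(unsigned char lead)
{
  if (lead < 0x80) return 1;
  if (lead >= 0xC2 && lead <= 0xDF) return 2;
  if (lead >= 0xE0 && lead <= 0xEF) return 3;
  return 0;
}

/* Number of characters, as needed for CHAR_LENGTH(). */
size_t sk_numchars(const char *s, size_t len)
{
  size_t pos = 0, count = 0;
  while (pos < len)
  {
    size_t n = sk_mbcharlen((unsigned char) s[pos]);
    if (n == 0 || n > len - pos)
      n = 1;                            /* count a stray byte as one character */
    pos += n;
    count++;
  }
  return count;
}

/* Byte length of the longest well-formed prefix holding at most
   max_chars characters; never cuts inside a multi-byte sequence. */
size_t sk_well_formed_length(const char *s, size_t len, size_t max_chars)
{
  size_t pos = 0;
  while (pos < len && max_chars > 0)
  {
    size_t i, n = sk_mbcharlen((unsigned char) s[pos]);
    if (n == 0 || n > len - pos)
      break;                            /* bad lead byte or truncated sequence */
    for (i = 1; i < n; i++)
      if (((unsigned char) s[pos + i] & 0xC0) != 0x80)
        return pos;                     /* bad continuation byte */
    pos += n;
    max_chars--;
  }
  return pos;
}

/* Length of the string without trailing spaces. */
size_t sk_lengthsp(const char *s, size_t len)
{
  while (len > 0 && s[len - 1] == ' ')
    len--;
  return len;
}
```

For a column that holds two characters, sk_well_formed_length() on
"a", a two-byte character, "z" returns 3: the cut lands after the
two-byte character, not in the middle of it.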


Unicode conversion routines
---------------------------
mb_wc       - converts the leading multi-byte sequence into its Unicode code.
wc_mb       - converts the given Unicode code into a multi-byte sequence.
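
A minimal sketch of the two directions for UTF-8, limited to the range
U+0000..U+FFFF. The sk_ names and the return conventions (bytes
written or consumed, 0 on failure) are assumptions made for this
example and are not the real handler signatures. The decoder is
simplified and accepts overlong encodings.

```c
#include <stddef.h>

/* Unicode code -> UTF-8 bytes. Returns bytes written, 0 on failure. */
int sk_wc_mb(unsigned long wc, unsigned char *out, size_t cap)
{
  if (wc < 0x80)
  {
    if (cap < 1) return 0;
    out[0] = (unsigned char) wc;
    return 1;
  }
  if (wc < 0x800)
  {
    if (cap < 2) return 0;
    out[0] = (unsigned char) (0xC0 | (wc >> 6));
    out[1] = (unsigned char) (0x80 | (wc & 0x3F));
    return 2;
  }
  if (wc <= 0xFFFF)
  {
    if (cap < 3) return 0;
    out[0] = (unsigned char) (0xE0 | (wc >> 12));
    out[1] = (unsigned char) (0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (wc & 0x3F));
    return 3;
  }
  return 0;                             /* outside the range of this sketch */
}

/* Leading UTF-8 sequence -> Unicode code. Returns bytes consumed, 0 on failure. */
int sk_mb_wc(const unsigned char *s, size_t len, unsigned long *wc)
{
  if (len >= 1 && s[0] < 0x80)
  {
    *wc = s[0];
    return 1;
  }
  if (len >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80)
  {
    *wc = ((unsigned long) (s[0] & 0x1F) << 6) | (unsigned long) (s[1] & 0x3F);
    return 2;
  }
  if (len >= 3 && (s[0] & 0xF0) == 0xE0 &&
      (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80)
  {
    *wc = ((unsigned long) (s[0] & 0x0F) << 12) |
          ((unsigned long) (s[1] & 0x3F) << 6) | (unsigned long) (s[2] & 0x3F);
    return 3;
  }
  return 0;
}

/* Encode, then pack the bytes into one number, e.g. EF BF BF -> 0xEFBFBF. */
unsigned long sk_encoded_bytes(unsigned long wc)
{
  unsigned char buf[3];
  unsigned long packed = 0;
  int i, n = sk_wc_mb(wc, buf, sizeof(buf));
  for (i = 0; i < n; i++)
    packed = (packed << 8) | buf[i];
  return packed;
}

/* Encode and decode again; returns the decoded code or -1 on failure. */
long sk_roundtrip(unsigned long wc)
{
  unsigned char buf[3];
  unsigned long back = 0;
  int n = sk_wc_mb(wc, buf, sizeof(buf));
  if (n == 0 || sk_mb_wc(buf, (size_t) n, &back) != n)
    return -1;
  return (long) back;
}
```

sk_encoded_bytes(0xFFFF) gives 0xEFBFBF, the pad sequence mentioned in
the max_sort_char description.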


Case and sort conversion
------------------------
caseup_str  - converts the given 0-terminated string to uppercase
casedn_str  - converts the given 0-terminated string to lowercase
caseup      - converts the given string to uppercase using length
casedn      - converts the given string to lowercase using length

Number-to-string conversion routines
------------------------------------
snprintf()
long10_to_str()
longlong10_to_str()

The names are pretty self-describing.

String padding routines
-----------------------
fill()     - writes the given Unicode value into the given string
             with the given length. Used to pad the string, usually
             with space character, according to the given charset.
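
Why padding depends on the character set can be sketched as follows.
The function is hypothetical and assumes only two cases: one-byte
units, and two-byte big-endian units as in ucs2, where a space is the
byte pair 0x00 0x20 rather than a single 0x20.

```c
#include <stddef.h>

/* Fill dst with copies of the character wc, using unit bytes per copy
   (1 for 8-bit charsets, 2 for ucs2). Returns the number of bytes
   written; a trailing partial unit is left untouched. */
size_t sk_fill(unsigned char *dst, size_t len, size_t unit, unsigned int wc)
{
  size_t pos = 0;
  if (unit != 1 && unit != 2)
    return 0;                               /* not covered by this sketch */
  while (pos + unit <= len)
  {
    if (unit == 2)
      dst[pos++] = (unsigned char) (wc >> 8);   /* high byte first */
    dst[pos++] = (unsigned char) (wc & 0xFF);
  }
  return pos;
}
```

Padding a five-byte ucs2 buffer with spaces writes 00 20 00 20 and
leaves the last byte alone, since half a character cannot be stored.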

String-to-number conversion routines
------------------------------------
strntol()
strntoul()
strntoll()
strntoull()
strntod()

These functions are almost the same as their standard C library
counterparts, but also:
  - accept a length instead of a 0-terminator
  - are character set dependent
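
The length-based interface can be sketched like this. The function is
hypothetical, handles base 10 only, and assumes an ASCII-compatible
8-bit character set. A character set such as ucs2 would need its own
version that reads two-byte units, which is where the character set
dependency comes from.

```c
#include <limits.h>
#include <stddef.h>

/* Parse a decimal integer from at most len bytes of s; no 0-terminator
   is required. If consumed is not NULL, it receives the number of bytes
   parsed, or 0 if no digits were found. Out-of-range values saturate. */
long sk_strntol10(const char *s, size_t len, size_t *consumed)
{
  size_t pos = 0, first_digit;
  int negative = 0;
  unsigned long acc = 0, limit;

  while (pos < len && s[pos] == ' ')
    pos++;
  if (pos < len && (s[pos] == '+' || s[pos] == '-'))
    negative = (s[pos++] == '-');
  limit = negative ? (unsigned long) LONG_MAX + 1UL : (unsigned long) LONG_MAX;

  first_digit = pos;
  while (pos < len && s[pos] >= '0' && s[pos] <= '9')
  {
    unsigned long d = (unsigned long) (s[pos] - '0');
    if (acc > (limit - d) / 10)
      acc = limit;                      /* would overflow: saturate */
    else
      acc = acc * 10 + d;
    pos++;
  }

  if (pos == first_digit)               /* no digits at all */
  {
    if (consumed) *consumed = 0;
    return 0;
  }
  if (consumed) *consumed = pos;
  if (negative)
    return acc == limit ? LONG_MIN : -(long) acc;
  return (long) acc;
}
```

Passing a length of 3 for the buffer "123456" parses only "123", which
a 0-terminated interface cannot do without copying the data first.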

Simple scanner routines
-----------------------
scan()    - skips leading spaces in the given string.
            Used when a string value is inserted into a numeric field.



MY_COLLATION_HANDLER
====================
strnncoll()   - compares two strings according to the given collation
strnncollsp() - like the above but ignores trailing spaces
strnxfrm()    - makes a sort key suitable for memcmp() corresponding
                to the given string
like_range()  - creates a LIKE range, for optimizer
wildcmp()     - wildcard comparison, for LIKE
strcasecmp()  - 0-terminated string comparison
instr()       - finds the first occurrence of a substring in the string
hash_sort()   - calculates hash value taking into account
                the collation rules, e.g. case-insensitivity,
                accent sensitivity, etc.
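
How comparison and hashing have to agree can be sketched for a toy
collation that is case insensitive for ASCII and ignores trailing
spaces. The sk_ names are hypothetical, and the hash is FNV-1a, chosen
only for brevity; it is not the hash used by the real handlers.

```c
#include <stddef.h>
#include <stdint.h>

/* Weight of one byte: ASCII letters fold to uppercase. */
static unsigned char sk_weight(unsigned char c)
{
  return (unsigned char) ((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
}

/* Compare as if the shorter string were padded with spaces,
   so trailing spaces never make a difference. */
int sk_strnncollsp(const char *a, size_t alen, const char *b, size_t blen)
{
  size_t i, n = alen > blen ? alen : blen;
  for (i = 0; i < n; i++)
  {
    unsigned char wa = sk_weight((unsigned char) (i < alen ? a[i] : ' '));
    unsigned char wb = sk_weight((unsigned char) (i < blen ? b[i] : ' '));
    if (wa != wb)
      return wa < wb ? -1 : 1;
  }
  return 0;
}

/* Hash the weights, not the raw bytes, after dropping trailing spaces:
   strings that compare as equal must produce the same hash value. */
uint32_t sk_hash_sort(const char *s, size_t len)
{
  uint32_t h = 2166136261u;             /* FNV-1a offset basis */
  size_t i;
  while (len > 0 && s[len - 1] == ' ')
    len--;
  for (i = 0; i < len; i++)
  {
    h ^= sk_weight((unsigned char) s[i]);
    h *= 16777619u;                     /* FNV-1a prime */
  }
  return h;
}
```

Hashing raw bytes instead of weights would put "Abc " and "aBC" into
different hash buckets even though the collation treats them as equal.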
