- CLASS ActiveSupport::Multibyte::Unicode::Codepoint
- CLASS ActiveSupport::Multibyte::Unicode::UnicodeDatabase
NORMALIZATION_FORMS | = | [:c, :kc, :d, :kd] |
A list of all available normalization forms. See www.unicode.org/reports/tr15/tr15-29.html for more information about normalization. |
||
UNICODE_VERSION | = | '5.2.0' |
The Unicode version that is supported by the implementation |
||
HANGUL_SBASE | = | 0xAC00 |
Hangul character boundaries and properties |
||
HANGUL_LBASE | = | 0x1100 |
HANGUL_VBASE | = | 0x1161 |
HANGUL_TBASE | = | 0x11A7 |
HANGUL_LCOUNT | = | 19 |
HANGUL_VCOUNT | = | 21 |
HANGUL_TCOUNT | = | 28 |
HANGUL_NCOUNT | = | HANGUL_VCOUNT * HANGUL_TCOUNT |
HANGUL_SCOUNT | = | 11172 |
HANGUL_SLAST | = | HANGUL_SBASE + HANGUL_SCOUNT |
HANGUL_JAMO_FIRST | = | 0x1100 |
HANGUL_JAMO_LAST | = | 0x11FF |
WHITESPACE | = | [
  (0x0009..0x000D).to_a, # White_Space # Cc [5] <control-0009>..<control-000D>
  0x0020, # White_Space # Zs SPACE
  0x0085, # White_Space # Cc <control-0085>
  0x00A0, # White_Space # Zs NO-BREAK SPACE
  0x1680, # White_Space # Zs OGHAM SPACE MARK
  0x180E, # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
  (0x2000..0x200A).to_a, # White_Space # Zs [11] EN QUAD..HAIR SPACE
  0x2028, # White_Space # Zl LINE SEPARATOR
  0x2029, # White_Space # Zp PARAGRAPH SEPARATOR
  0x202F, # White_Space # Zs NARROW NO-BREAK SPACE
  0x205F, # White_Space # Zs MEDIUM MATHEMATICAL SPACE
  0x3000, # White_Space # Zs IDEOGRAPHIC SPACE
].flatten.freeze |
All Unicode whitespace codepoints. |
||
LEADERS_AND_TRAILERS | = | WHITESPACE + [65279] |
The BOM (byte order mark) can also be treated as whitespace: it is a non-rendering character used to distinguish between little-endian and big-endian byte order. Byte order is not an issue in UTF-8, so the BOM must be ignored. |
||
TRAILERS_PAT | = | /(#{codepoints_to_pattern(LEADERS_AND_TRAILERS)})+\Z/u |
LEADERS_PAT | = | /\A(#{codepoints_to_pattern(LEADERS_AND_TRAILERS)})+/u |
[RW] | default_normalization_form | The default normalization used for operations that require normalization. It can be set to any of the normalizations in NORMALIZATION_FORMS. Example: ActiveSupport::Multibyte::Unicode.default_normalization_form = :c |
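As a rough illustration of what LEADERS_PAT and TRAILERS_PAT are for, the sketch below rebuilds comparable patterns in plain Ruby and uses them to strip leading and trailing whitespace and BOMs. The names pattern_for and strip_leaders_and_trailers, and the SKETCH_ constants, are hypothetical local stand-ins; the library's own codepoints_to_pattern is not shown in this section and may be implemented differently.

```ruby
# Local copy of the whitespace codepoints listed in the constants table.
SKETCH_WHITESPACE = [
  (0x0009..0x000D).to_a, 0x0020, 0x0085, 0x00A0, 0x1680, 0x180E,
  (0x2000..0x200A).to_a, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000
].flatten.freeze

# 65279 is U+FEFF, the byte order mark.
SKETCH_LEADERS_AND_TRAILERS = (SKETCH_WHITESPACE + [0xFEFF]).freeze

# Hypothetical stand-in for codepoints_to_pattern: an alternation of the
# UTF-8 encoded characters, escaped for use inside a regular expression.
def pattern_for(codepoints)
  codepoints.map { |cp| Regexp.escape([cp].pack('U')) }.join('|')
end

SKETCH_LEADERS_PAT  = /\A(?:#{pattern_for(SKETCH_LEADERS_AND_TRAILERS)})+/
SKETCH_TRAILERS_PAT = /(?:#{pattern_for(SKETCH_LEADERS_AND_TRAILERS)})+\Z/

# Remove whitespace and BOMs from both ends; inner whitespace is untouched.
def strip_leaders_and_trailers(string)
  string.sub(SKETCH_LEADERS_PAT, '').sub(SKETCH_TRAILERS_PAT, '')
end

puts strip_leaders_and_trailers("\uFEFF \u00A0hello world\u3000\n").inspect
```

Unlike String#strip, this also removes non-ASCII spaces such as NO-BREAK SPACE and IDEOGRAPHIC SPACE.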
Compose decomposed characters to the composed form.
# File activesupport/lib/active_support/multibyte/unicode.rb, line 167
def compose_codepoints(codepoints)
  pos = 0
  eoa = codepoints.length - 1
  starter_pos = 0
  starter_char = codepoints[0]
  previous_combining_class = -1
  while pos < eoa
    pos += 1
    lindex = starter_char - HANGUL_LBASE
    # -- Hangul
    if 0 <= lindex and lindex < HANGUL_LCOUNT
      vindex = codepoints[starter_pos+1] - HANGUL_VBASE rescue vindex = -1
      if 0 <= vindex and vindex < HANGUL_VCOUNT
        tindex = codepoints[starter_pos+2] - HANGUL_TBASE rescue tindex = -1
        if 0 <= tindex and tindex < HANGUL_TCOUNT
          j = starter_pos + 2
          eoa -= 2
        else
          tindex = 0
          j = starter_pos + 1
          eoa -= 1
        end
        codepoints[starter_pos..j] = (lindex * HANGUL_VCOUNT + vindex) * HANGUL_TCOUNT + tindex + HANGUL_SBASE
      end
      starter_pos += 1
      starter_char = codepoints[starter_pos]
    # -- Other characters
    else
      current_char = codepoints[pos]
      current = database.codepoints[current_char]
      if current.combining_class > previous_combining_class
        if ref = database.composition_map[starter_char]
          composition = ref[current_char]
        else
          composition = nil
        end
        unless composition.nil?
          codepoints[starter_pos] = composition
          starter_char = composition
          codepoints.delete_at pos
          eoa -= 1
          pos -= 1
          previous_combining_class = -1
        else
          previous_combining_class = current.combining_class
        end
      else
        previous_combining_class = current.combining_class
      end
      if current.combining_class == 0
        starter_pos = pos
        starter_char = codepoints[pos]
      end
    end
  end
  codepoints
end
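The Hangul branch of compose_codepoints is pure arithmetic on the HANGUL_* constants. A minimal standalone sketch of that formula follows; the module name HangulComposeSketch and its compose method are hypothetical, and the constants are local copies of the documented values.

```ruby
# Sketch only: arithmetic composition of Hangul jamo, mirroring the formula
# used in the Hangul branch of compose_codepoints.
module HangulComposeSketch
  SBASE  = 0xAC00
  LBASE  = 0x1100
  VBASE  = 0x1161
  TBASE  = 0x11A7
  LCOUNT = 19
  VCOUNT = 21
  TCOUNT = 28

  module_function

  # Combine a leading consonant, a vowel and an optional trailing consonant
  # into one precomposed syllable. Returns nil if a jamo is out of range.
  def compose(lead, vowel, trail = nil)
    lindex = lead - LBASE
    vindex = vowel - VBASE
    tindex = trail ? trail - TBASE : 0
    return nil unless (0...LCOUNT).cover?(lindex) &&
                      (0...VCOUNT).cover?(vindex) &&
                      (0...TCOUNT).cover?(tindex)
    (lindex * VCOUNT + vindex) * TCOUNT + tindex + SBASE
  end
end

syllable = HangulComposeSketch.compose(0x1112, 0x1161, 0x11AB)
puts format('U+%04X', syllable) # U+D55C
```

With a leading index of 18, a vowel index of 0 and a trailing index of 4, the result is (18 * 21 + 0) * 28 + 4 + 0xAC00 = 0xD55C.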
Decompose composed characters to the decomposed form.
# File activesupport/lib/active_support/multibyte/unicode.rb, line 146
def decompose_codepoints(type, codepoints)
  codepoints.inject([]) do |decomposed, cp|
    # if it's a hangul syllable starter character
    if HANGUL_SBASE <= cp and cp < HANGUL_SLAST
      sindex = cp - HANGUL_SBASE
      ncp = [] # new codepoints
      ncp << HANGUL_LBASE + sindex / HANGUL_NCOUNT
      ncp << HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT
      tindex = sindex % HANGUL_TCOUNT
      ncp << (HANGUL_TBASE + tindex) unless tindex == 0
      decomposed.concat ncp
    # if the codepoint is decomposable in with the current decomposition type
    elsif (ncp = database.codepoints[cp].decomp_mapping) and (!database.codepoints[cp].decomp_type || type == :compatability)
      decomposed.concat decompose_codepoints(type, ncp.dup)
    else
      decomposed << cp
    end
  end
end
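The Hangul branch of decompose_codepoints needs no database lookup, so it can be tried in isolation. The sketch below repeats that arithmetic on a single codepoint; HangulDecomposeSketch and decompose are hypothetical names, and the constants are local copies of the documented values.

```ruby
# Sketch only: arithmetic decomposition of a precomposed Hangul syllable,
# mirroring the Hangul branch of decompose_codepoints.
module HangulDecomposeSketch
  SBASE  = 0xAC00
  LBASE  = 0x1100
  VBASE  = 0x1161
  TBASE  = 0x11A7
  VCOUNT = 21
  TCOUNT = 28
  NCOUNT = VCOUNT * TCOUNT # 588
  SCOUNT = 11172
  SLAST  = SBASE + SCOUNT

  module_function

  # Returns the jamo codepoints for a syllable, or [cp] for anything else.
  def decompose(cp)
    return [cp] unless SBASE <= cp && cp < SLAST
    sindex = cp - SBASE
    jamo = [LBASE + sindex / NCOUNT,
            VBASE + (sindex % NCOUNT) / TCOUNT]
    tindex = sindex % TCOUNT
    jamo << (TBASE + tindex) unless tindex.zero?
    jamo
  end
end

p HangulDecomposeSketch.decompose(0xD55C).map { |cp| format('U+%04X', cp) }
# ["U+1112", "U+1161", "U+11AB"]
```

A syllable without a trailing consonant, such as U+AC00, yields only two jamo because its trailing index is 0.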
Reverse operation of g_unpack.
Example:
Unicode.g_pack(Unicode.g_unpack('क्षि')) # => 'क्षि'
Unpack the string at grapheme boundaries. Returns a list of character lists.
Example:
Unicode.g_unpack('क्षि') # => [[2325, 2381], [2359], [2367]]
Unicode.g_unpack('Café') # => [[67], [97], [102], [233]]
# File activesupport/lib/active_support/multibyte/unicode.rb, line 91
def g_unpack(string)
  codepoints = u_unpack(string)
  unpacked = []
  pos = 0
  marker = 0
  eoc = codepoints.length
  while(pos < eoc)
    pos += 1
    previous = codepoints[pos-1]
    current = codepoints[pos]
    if (
        # CR X LF
        ( previous == database.boundary[:cr] and current == database.boundary[:lf] ) or
        # L X (L|V|LV|LVT)
        ( database.boundary[:l] === previous and in_char_class?(current, [:l,:v,:lv,:lvt]) ) or
        # (LV|V) X (V|T)
        ( in_char_class?(previous, [:lv,:v]) and in_char_class?(current, [:v,:t]) ) or
        # (LVT|T) X (T)
        ( in_char_class?(previous, [:lvt,:t]) and database.boundary[:t] === current ) or
        # X Extend
        (database.boundary[:extend] === current)
       )
    else
      unpacked << codepoints[marker..pos-1]
      marker = pos
    end
  end
  unpacked
end
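For experimenting with grapheme boundaries without the library's Unicode database, the sketch below swaps in String#grapheme_clusters from Ruby's core library (Ruby 2.5 or later). It is not the implementation above: the built-in method follows newer segmentation rules, so some strings (the Devanagari example, for instance) may be split differently. The method names g_unpack_sketch and g_pack_sketch are hypothetical.

```ruby
# Sketch only: relies on String#grapheme_clusters (Ruby 2.5+) rather than
# the boundary rules implemented by g_unpack.
def g_unpack_sketch(string)
  string.grapheme_clusters.map { |cluster| cluster.unpack('U*') }
end

# Reverse operation: flatten the clusters and re-encode them as UTF-8.
def g_pack_sketch(unpacked)
  unpacked.flatten.pack('U*')
end

decomposed = "Cafe\u0301" # "e" followed by COMBINING ACUTE ACCENT

p g_unpack_sketch(decomposed) # [[67], [97], [102], [101, 769]]
p g_unpack_sketch("a\r\nb")   # [[97], [13, 10], [98]]
p g_pack_sketch(g_unpack_sketch(decomposed)) == decomposed # true
```

The first two results correspond to the "X Extend" and "CR X LF" rules in the source: a combining mark stays with its base character, and a CR LF pair is never split.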
Detect whether the codepoint is in a certain character class. Returns true when it’s in the specified character class and false otherwise. Valid character classes are: :cr, :lf, :l, :v, :lv, :lvt and :t.
Primarily used by the grapheme cluster support.
Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.
- string - The string to perform normalization on.
- form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is ActiveSupport::Multibyte::Unicode.default_normalization_form
# File activesupport/lib/active_support/multibyte/unicode.rb, line 283
def normalize(string, form=nil)
  form ||= @default_normalization_form
  # See http://www.unicode.org/reports/tr15, Table 1
  codepoints = u_unpack(string)
  case form
    when :d
      reorder_characters(decompose_codepoints(:canonical, codepoints))
    when :c
      compose_codepoints(reorder_characters(decompose_codepoints(:canonical, codepoints)))
    when :kd
      reorder_characters(decompose_codepoints(:compatability, codepoints))
    when :kc
      compose_codepoints(reorder_characters(decompose_codepoints(:compatability, codepoints)))
    else
      raise ArgumentError, "#{form} is not a valid normalization variant", caller
  end.pack('U*')
end
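To see how the four forms differ in practice, the sketch below maps the form names accepted by normalize onto Ruby's built-in String#unicode_normalize (Ruby 2.2 or later). This is a substitute for illustration, not the library's implementation, and the names FORM_MAP_SKETCH and normalize_sketch are hypothetical.

```ruby
# Sketch only: delegates to String#unicode_normalize from Ruby's standard
# library instead of the decompose/reorder/compose pipeline shown above.
FORM_MAP_SKETCH = { c: :nfc, kc: :nfkc, d: :nfd, kd: :nfkd }.freeze

def normalize_sketch(string, form = :kc)
  stdlib_form = FORM_MAP_SKETCH.fetch(form) do
    raise ArgumentError, "#{form} is not a valid normalization variant"
  end
  string.unicode_normalize(stdlib_form)
end

decomposed = "e\u0301" # "e" + COMBINING ACUTE ACCENT
ligature   = "\uFB01"  # LATIN SMALL LIGATURE FI

p normalize_sketch(decomposed, :c).unpack('U*') # [233]
p normalize_sketch("\u00E9", :d).unpack('U*')   # [101, 769]
p normalize_sketch(ligature, :c) == ligature    # true
p normalize_sketch(ligature)                    # "fi"
```

The canonical forms (:c, :d) leave the ligature alone, while the compatibility forms (:kc, :kd) replace it with plain letters, which is why a compatibility form is often preferred for comparison and validation.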
Re-order codepoints so the string becomes canonical.
# File activesupport/lib/active_support/multibyte/unicode.rb, line 130
def reorder_characters(codepoints)
  length = codepoints.length - 1
  pos = 0
  while pos < length do
    cp1, cp2 = database.codepoints[codepoints[pos]], database.codepoints[codepoints[pos+1]]
    if (cp1.combining_class > cp2.combining_class) && (cp2.combining_class > 0)
      codepoints[pos..pos+1] = cp2.code, cp1.code
      pos += (pos > 0 ? -1 : 1)
    else
      pos += 1
    end
  end
  codepoints
end
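The reordering step can be tried without the Unicode database by supplying a few combining classes by hand. The sketch below does that with a small, hypothetical lookup table (COMBINING_CLASS_SKETCH) covering three combining marks; every other codepoint is assumed to have class 0. Unlike reorder_characters, it works on a copy of its argument.

```ruby
# Hypothetical, hand-picked subset of canonical combining classes; the
# library reads these values from its Unicode database instead.
COMBINING_CLASS_SKETCH = {
  0x0301 => 230, # COMBINING ACUTE ACCENT (above)
  0x0323 => 220, # COMBINING DOT BELOW (below)
  0x0327 => 202  # COMBINING CEDILLA (attached below)
}.freeze

# Swap adjacent combining marks until each run of marks is sorted by
# combining class. Starters (class 0) are never moved.
def reorder_sketch(codepoints)
  codepoints = codepoints.dup
  pos = 0
  while pos < codepoints.length - 1
    left  = COMBINING_CLASS_SKETCH.fetch(codepoints[pos], 0)
    right = COMBINING_CLASS_SKETCH.fetch(codepoints[pos + 1], 0)
    if left > right && right > 0
      # Swap the out-of-order pair, then step back to re-check the previous pair.
      codepoints[pos], codepoints[pos + 1] = codepoints[pos + 1], codepoints[pos]
      pos -= 1 if pos > 0
    else
      pos += 1
    end
  end
  codepoints
end

input = [0x61, 0x0301, 0x0323] # "a", acute accent, dot below
p reorder_sketch(input).map { |cp| format('U+%04X', cp) }
# ["U+0061", "U+0323", "U+0301"]
```

The dot below (class 220) moves ahead of the acute accent (class 230), so two strings that differ only in the order of these marks end up with the same codepoint sequence.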
Replaces all ISO-8859-1 or CP1252 characters with their UTF-8 equivalents, resulting in a valid UTF-8 string.
Passing true as the force argument forcibly tidies all bytes, assuming that the string’s encoding is entirely CP1252 or ISO-8859-1.
# File activesupport/lib/active_support/multibyte/unicode.rb, line 228
def tidy_bytes(string, force = false)
  if force
    return string.unpack("C*").map do |b|
      tidy_byte(b)
    end.flatten.compact.pack("C*").unpack("U*").pack("U*")
  end

  bytes = string.unpack("C*")
  conts_expected = 0
  last_lead = 0

  bytes.each_index do |i|

    byte = bytes[i]
    is_cont = byte > 127 && byte < 192
    is_lead = byte > 191 && byte < 245
    is_unused = byte > 240
    is_restricted = byte > 244

    # Impossible or highly unlikely byte? Clean it.
    if is_unused || is_restricted
      bytes[i] = tidy_byte(byte)
    elsif is_cont
      # Not expecting continuation byte? Clean up. Otherwise, now expect one less.
      conts_expected == 0 ? bytes[i] = tidy_byte(byte) : conts_expected -= 1
    else
      if conts_expected > 0
        # Expected continuation, but got ASCII or leading? Clean backwards up to
        # the leading byte.
        (1..(i - last_lead)).each {|j| bytes[i - j] = tidy_byte(bytes[i - j])}
        conts_expected = 0
      end
      if is_lead
        # Final byte is leading? Clean it.
        if i == bytes.length - 1
          bytes[i] = tidy_byte(bytes.last)
        else
          # Valid leading byte? Expect continuations determined by position of
          # first zero bit, with max of 3.
          conts_expected = byte < 224 ? 1 : byte < 240 ? 2 : 3
          last_lead = i
        end
      end
    end
  end
  bytes.empty? ? "" : bytes.flatten.compact.pack("C*").unpack("U*").pack("U*")
end
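A much coarser version of the same idea can be written with Ruby's built-in transcoding. The sketch below is not tidy_bytes: instead of repairing individual bytes, it treats the whole string as Windows-1252 whenever the string is not valid UTF-8, so a string that mixes valid UTF-8 with stray CP1252 bytes would be converted differently. The name tidy_sketch is hypothetical.

```ruby
# Sketch only: whole-string fallback using String#encode, as opposed to the
# byte-by-byte repair performed by tidy_bytes.
def tidy_sketch(string)
  utf8 = string.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Reinterpret the raw bytes as Windows-1252 and transcode to UTF-8.
  # Bytes that Windows-1252 leaves undefined become U+FFFD.
  string.dup.force_encoding('Windows-1252')
        .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

latin1_bytes = "caf\xE9".b     # "café" in ISO-8859-1 / CP1252
p tidy_sketch(latin1_bytes)    # "café"
p tidy_sketch("\x93hi\x94".b)  # "“hi”" (CP1252 curly quotes)
p tidy_sketch("caf\u00E9")     # already valid UTF-8, returned unchanged
```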
Unpack the string at codepoint boundaries. Raises an EncodingError when the encoding of the string isn’t valid UTF-8.
Example:
Unicode.u_unpack('Café') # => [67, 97, 102, 233]
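The documented behaviour can be approximated with String#unpack, which raises ArgumentError on malformed UTF-8. The sketch below converts that error into an EncodingError; the name u_unpack_sketch is hypothetical, and the library's actual method body is not shown in this section.

```ruby
# Sketch only: codepoint unpacking with String#unpack('U*'), translating
# the ArgumentError raised for malformed input into an EncodingError.
def u_unpack_sketch(string)
  string.unpack('U*')
rescue ArgumentError
  raise EncodingError, 'malformed UTF-8 character'
end

p u_unpack_sketch("Caf\u00E9") # [67, 97, 102, 233]

begin
  u_unpack_sketch("Caf\xE9")   # a lone ISO-8859-1 byte is not valid UTF-8
rescue EncodingError => e
  puts "EncodingError: #{e.message}"
end
```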