UnicodeUtil (Lucene 4.0.0 API)

org.apache.lucene.util
Class UnicodeUtil

java.lang.Object
  org.apache.lucene.util.UnicodeUtil

public final class UnicodeUtil
extends Object

Class to encode Java's UTF-16 char[] into UTF-8 byte[] without always allocating a new byte[], as String.getBytes("UTF-8") does.

NOTE: This API is for internal purposes only and might change in incompatible ways in the next release.

Field Summary
`static BytesRef`	`BIG_TERM` A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms one would normally encounter, and definitely bigger than any UTF-8 terms.
`static int`	`UNI_REPLACEMENT_CHAR`
`static int`	`UNI_SUR_HIGH_END`
`static int`	`UNI_SUR_HIGH_START`
`static int`	`UNI_SUR_LOW_END`
`static int`	`UNI_SUR_LOW_START`

Method Summary
`static int`	`codePointCount(BytesRef utf8)` Returns the number of code points in this utf8 sequence.
`static String`	`newString(int[] codePoints, int offset, int count)` Cover JDK 1.5 API.
`static String`	`toHexString(String s)`
`static void`	`UTF16toUTF8(char[] source, int offset, int length, BytesRef result)` Encode characters from a char[] source, starting at offset for length chars.
`static void`	`UTF16toUTF8(CharSequence s, int offset, int length, BytesRef result)` Encode characters from this CharSequence, starting at offset for length characters.
`static int`	`UTF16toUTF8WithHash(char[] source, int offset, int length, BytesRef result)` Encode characters from a char[] source, starting at offset for length chars.
`static void`	`UTF8toUTF16(byte[] utf8, int offset, int length, CharsRef chars)` Interprets the given byte array as UTF-8 and converts to UTF-16.
`static void`	`UTF8toUTF16(BytesRef bytesRef, CharsRef chars)` Utility method for `UTF8toUTF16(byte[], int, int, CharsRef)`
`static void`	`UTF8toUTF32(BytesRef utf8, IntsRef utf32)`
`static boolean`	`validUTF16String(char[] s, int size)`
`static boolean`	`validUTF16String(CharSequence s)`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

BIG_TERM

public static final BytesRef BIG_TERM

A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms one would normally encounter, and definitely bigger than any UTF-8 terms.

WARNING: This is not a valid UTF-8 term.
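
To see why a run of 0xff bytes sorts after any UTF-8 term, note that the byte 0xFF never occurs in well-formed UTF-8. The sketch below uses only the JDK and assumes terms are compared byte by byte as unsigned values. The class and method names are hypothetical, and the length of 10 bytes is an arbitrary choice, since this page does not say how many bytes BIG_TERM holds.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BigTermSketch {
    // Compare two byte arrays as sequences of unsigned bytes; a proper prefix sorts first.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Build a term of `count` bytes, each 0xff. The count is illustrative only.
    static byte[] allOnes(int count) {
        byte[] term = new byte[count];
        Arrays.fill(term, (byte) 0xff);
        return term;
    }

    public static void main(String[] args) {
        byte[] big = allOnes(10);
        // U+10FFFF is the largest code point; its UTF-8 form starts with 0xF4.
        byte[] largest = "\uDBFF\uDFFF".getBytes(StandardCharsets.UTF_8);
        System.out.println(compareUnsigned(big, largest) > 0); // prints true
    }
}
```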

UNI_SUR_HIGH_START

public static final int UNI_SUR_HIGH_START

See Also: Constant Field Values

UNI_SUR_HIGH_END

public static final int UNI_SUR_HIGH_END

See Also: Constant Field Values

UNI_SUR_LOW_START

public static final int UNI_SUR_LOW_START

See Also: Constant Field Values

UNI_SUR_LOW_END

public static final int UNI_SUR_LOW_END

See Also: Constant Field Values

UNI_REPLACEMENT_CHAR

public static final int UNI_REPLACEMENT_CHAR

See Also: Constant Field Values
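
This page does not list the constant values. For reference, the Unicode standard reserves U+D800 to U+DBFF for high surrogates and U+DC00 to U+DFFF for low surrogates, and defines U+FFFD as the replacement character. The sketch below assumes, without confirmation from this page, that the five constants above carry those values. The class and its members are hypothetical.

```java
class SurrogateRangesSketch {
    // Unicode surrogate ranges and replacement character
    // (assumed, not confirmed by this page, to match the constants above).
    static final int SUR_HIGH_START = 0xD800;
    static final int SUR_HIGH_END = 0xDBFF;
    static final int SUR_LOW_START = 0xDC00;
    static final int SUR_LOW_END = 0xDFFF;
    static final int REPLACEMENT_CHAR = 0xFFFD;

    static boolean isHigh(int unit) {
        return unit >= SUR_HIGH_START && unit <= SUR_HIGH_END;
    }

    static boolean isLow(int unit) {
        return unit >= SUR_LOW_START && unit <= SUR_LOW_END;
    }

    // Combine a high/low pair into one code point; otherwise return the replacement character.
    static int combine(char high, char low) {
        if (isHigh(high) && isLow(low)) {
            return ((high - SUR_HIGH_START) << 10) + (low - SUR_LOW_START) + 0x10000;
        }
        return REPLACEMENT_CHAR;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(combine('\uD83D', '\uDE00'))); // prints 1f600
    }
}
```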

Method Detail

UTF16toUTF8WithHash

public static int UTF16toUTF8WithHash(char[] source,
                                      int offset,
                                      int length,
                                      BytesRef result)

Encode characters from a char[] source, starting at offset for length chars. Returns a hash of the resulting bytes. After encoding, result.offset will always be 0.

UTF16toUTF8

public static void UTF16toUTF8(char[] source,
                               int offset,
                               int length,
                               BytesRef result)

Encode characters from a char[] source, starting at offset for length chars. After encoding, result.offset will always be 0.
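
The idea of encoding into a reusable buffer, with output always starting at index 0, can be sketched with the JDK alone. This is not Lucene's implementation: the class `Utf16ToUtf8Sketch` and its fields are hypothetical stand-ins for BytesRef, and replacing unpaired surrogates with U+FFFD is an assumption about how malformed input might be handled.

```java
class Utf16ToUtf8Sketch {
    byte[] bytes = new byte[16]; // reusable output buffer, grown only when too small
    int length;                  // number of valid bytes; output always starts at index 0

    void encode(char[] source, int offset, int len) {
        // Each UTF-16 unit needs at most 3 bytes (a surrogate pair is 2 units, 4 bytes).
        int maxLen = len * 3;
        if (bytes.length < maxLen) {
            bytes = new byte[maxLen];
        }
        int upto = 0;
        int end = offset + len;
        for (int i = offset; i < end; i++) {
            int code = source[i];
            if (code < 0x80) {
                bytes[upto++] = (byte) code;
            } else if (code < 0x800) {
                bytes[upto++] = (byte) (0xC0 | (code >> 6));
                bytes[upto++] = (byte) (0x80 | (code & 0x3F));
            } else if (code < 0xD800 || code > 0xDFFF) {
                bytes[upto++] = (byte) (0xE0 | (code >> 12));
                bytes[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
                bytes[upto++] = (byte) (0x80 | (code & 0x3F));
            } else if (code <= 0xDBFF && i + 1 < end
                    && source[i + 1] >= 0xDC00 && source[i + 1] <= 0xDFFF) {
                // Valid surrogate pair: combine into one code point, emit 4 bytes.
                int cp = ((code - 0xD800) << 10) + (source[i + 1] - 0xDC00) + 0x10000;
                i++;
                bytes[upto++] = (byte) (0xF0 | (cp >> 18));
                bytes[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                bytes[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                bytes[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                // Unpaired surrogate (assumed policy): emit U+FFFD as EF BF BD.
                bytes[upto++] = (byte) 0xEF;
                bytes[upto++] = (byte) 0xBF;
                bytes[upto++] = (byte) 0xBD;
            }
        }
        length = upto;
    }
}
```

Calling `encode` repeatedly on one `Utf16ToUtf8Sketch` instance reuses the same array as long as it is large enough, which is the allocation saving the class description refers to.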

UTF16toUTF8

public static void UTF16toUTF8(CharSequence s,
                               int offset,
                               int length,
                               BytesRef result)

Encode characters from this CharSequence, starting at offset for length characters. After encoding, result.offset will always be 0.

validUTF16String

public static boolean validUTF16String(CharSequence s)

validUTF16String

public static boolean validUTF16String(char[] s,
                                       int size)
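
This page gives no description for either overload. A plausible reading, stated here as an assumption, is that a sequence counts as valid UTF-16 when every high surrogate is immediately followed by a low surrogate and no low surrogate appears on its own. A minimal JDK-only sketch of such a check, with a hypothetical class and method name:

```java
class ValidUtf16Sketch {
    // Returns true if every surrogate in s is part of a well-formed high/low pair.
    static boolean isValid(CharSequence s) {
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char ch = s.charAt(i);
            if (Character.isHighSurrogate(ch)) {
                if (i + 1 < n && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of the pair
                } else {
                    return false; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(ch)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("abc"));          // prints true
        System.out.println(isValid("\uD83D\uDE00")); // prints true
        System.out.println(isValid("\uD83D"));       // prints false
    }
}
```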

codePointCount

public static int codePointCount(BytesRef utf8)

Returns the number of code points in this UTF-8 sequence. Behavior is undefined if the UTF-8 sequence is invalid.
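
For well-formed UTF-8, the number of code points equals the number of bytes that are not continuation bytes (those of the form 10xxxxxx). The sketch below counts that way using only the JDK. It is one possible technique, not necessarily the one this method uses, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class CodePointCountSketch {
    // Counts code points by counting bytes that do not match the continuation pattern 10xxxxxx.
    // Assumes the slice holds well-formed UTF-8; the result is meaningless otherwise.
    static int count(byte[] utf8, int offset, int length) {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
        byte[] utf8 = "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(count(utf8, 0, utf8.length)); // prints 4
    }
}
```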

UTF8toUTF32

public static void UTF8toUTF32(BytesRef utf8,
                               IntsRef utf32)
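
This page gives no description for this method. Judging by the name, it presumably decodes UTF-8 bytes into one int per code point. A JDK-only sketch of that conversion follows, assuming well-formed input. The class name, method name, and the plain `int[]` output (standing in for IntsRef) are hypothetical.

```java
class Utf8ToUtf32Sketch {
    // Decodes well-formed UTF-8 into code points, writing them to out from index 0.
    // Returns the number of code points written; out needs room for up to `length` ints.
    static int decode(byte[] utf8, int offset, int length, int[] out) {
        int count = 0;
        int i = offset;
        int end = offset + length;
        while (i < end) {
            int b = utf8[i++] & 0xff;
            int cp;
            int extra; // number of continuation bytes that follow the lead byte
            if (b < 0x80) {
                cp = b;
                extra = 0;
            } else if (b < 0xE0) {
                cp = b & 0x1F;
                extra = 1;
            } else if (b < 0xF0) {
                cp = b & 0x0F;
                extra = 2;
            } else {
                cp = b & 0x07;
                extra = 3;
            }
            for (int k = 0; k < extra; k++) {
                cp = (cp << 6) | (utf8[i++] & 0x3F);
            }
            out[count++] = cp;
        }
        return count;
    }
}
```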

newString

public static String newString(int[] codePoints,
                               int offset,
                               int count)

Cover JDK 1.5 API. Create a String from an array of codePoints.

Parameters:
codePoints - The code point array
offset - The start of the text in the code point array
count - The number of code points
Returns:
a String representing the count code points starting at offset
Throws:
IllegalArgumentException - If an invalid code point is encountered
IndexOutOfBoundsException - If the offset or count are out of bounds.
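
The JDK itself has offered a constructor with the same parameter list, `String(int[] codePoints, int offset, int count)`, since 1.5. The sketch below uses that constructor to show the expected shape of the result and the two exception cases. The wrapper class and method name are hypothetical, and it is an assumption that this method behaves the same way.

```java
class NewStringSketch {
    // Delegates to the JDK constructor that builds a String from a slice of code points.
    static String fromCodePoints(int[] codePoints, int offset, int count) {
        return new String(codePoints, offset, count);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x61, 0x1F600, 0x62};
        // Take two code points starting at index 1: U+1F600 and 'b'.
        String s = fromCodePoints(codePoints, 1, 2);
        // U+1F600 needs a surrogate pair, so the String holds 3 chars for 2 code points.
        System.out.println(s.length());                      // prints 3
        System.out.println(s.codePointCount(0, s.length())); // prints 2
    }
}
```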

toHexString

public static String toHexString(String s)

UTF8toUTF16

public static void UTF8toUTF16(byte[] utf8,
                               int offset,
                               int length,
                               CharsRef chars)

Interprets the given byte array as UTF-8 and converts to UTF-16. The CharsRef will be extended if it doesn't provide enough space to hold the worst case of each byte becoming a UTF-16 code unit.

NOTE: Full characters are read, even if this reads past the length passed (and can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.
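
The worst-case sizing and the lack of validation described above can be illustrated with a JDK-only sketch. This is not Lucene's implementation: the class `Utf8ToUtf16Sketch` and its fields are hypothetical stand-ins for CharsRef. Like the documented method, it reads whole characters based on the lead byte and does not check the input, so a truncated sequence can run off the end of the array.

```java
class Utf8ToUtf16Sketch {
    char[] chars = new char[16]; // reusable output buffer
    int length;                  // number of valid chars, starting at index 0

    void decode(byte[] utf8, int offset, int len) {
        // Worst case is all ASCII: one UTF-16 unit per byte.
        if (chars.length < len) {
            chars = new char[len];
        }
        int out = 0;
        int i = offset;
        int end = offset + len;
        while (i < end) {
            int b = utf8[i++] & 0xff;
            if (b < 0x80) {
                chars[out++] = (char) b;
            } else if (b < 0xE0) {
                chars[out++] = (char) (((b & 0x1F) << 6) | (utf8[i++] & 0x3F));
            } else if (b < 0xF0) {
                chars[out++] = (char) (((b & 0x0F) << 12)
                        | ((utf8[i] & 0x3F) << 6)
                        | (utf8[i + 1] & 0x3F));
                i += 2;
            } else {
                int cp = ((b & 0x07) << 18)
                        | ((utf8[i] & 0x3F) << 12)
                        | ((utf8[i + 1] & 0x3F) << 6)
                        | (utf8[i + 2] & 0x3F);
                i += 3;
                // Split the supplementary code point into a surrogate pair.
                cp -= 0x10000;
                chars[out++] = (char) (0xD800 + (cp >> 10));
                chars[out++] = (char) (0xDC00 + (cp & 0x3FF));
            }
        }
        length = out;
    }
}
```

In this sketch, passing the single byte 0xE2 (the lead byte of a three-byte sequence) makes the decoder index past the end of the input and throw ArrayIndexOutOfBoundsException.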

UTF8toUTF16

public static void UTF8toUTF16(BytesRef bytesRef,
                               CharsRef chars)

Utility method for UTF8toUTF16(byte[], int, int, CharsRef)

See Also: UTF8toUTF16(byte[], int, int, CharsRef)
