org.apache.lucene.util
Class UnicodeUtil

java.lang.Object
  extended by org.apache.lucene.util.UnicodeUtil

public final class UnicodeUtil
extends Object

Encodes Java's UTF16 char[] into UTF8 byte[] without always allocating a new byte[], as String.getBytes("UTF-8") does.

NOTE: This API is for internal purposes only and might change in incompatible ways in the next release.

Nested Class Summary
static class UnicodeUtil.UTF16Result
          Holds decoded UTF16 code units.
static class UnicodeUtil.UTF8Result
          Holds encoded UTF8 code units.
 
Field Summary
static int UNI_REPLACEMENT_CHAR
           
static int UNI_SUR_HIGH_END
           
static int UNI_SUR_HIGH_START
           
static int UNI_SUR_LOW_END
           
static int UNI_SUR_LOW_START
           
 
Method Summary
static String newString(int[] codePoints, int offset, int count)
          Mirrors the JDK 1.5 API: creates a String from an array of code points.
static void UTF16toUTF8(char[] source, int offset, int length, BytesRef result)
          Encode characters from a char[] source, starting at offset for length chars.
static void UTF16toUTF8(char[] source, int offset, int length, UnicodeUtil.UTF8Result result)
          Encode characters from a char[] source, starting at offset for length chars.
static void UTF16toUTF8(char[] source, int offset, UnicodeUtil.UTF8Result result)
          Encode characters from a char[] source, starting at offset and stopping when the character 0xffff is seen.
static void UTF16toUTF8(CharSequence s, int offset, int length, BytesRef result)
          Encode characters from this CharSequence, starting at offset for length characters.
static void UTF16toUTF8(String s, int offset, int length, UnicodeUtil.UTF8Result result)
          Encode characters from this String, starting at offset for length characters.
static int UTF16toUTF8WithHash(char[] source, int offset, int length, BytesRef result)
          Encode characters from a char[] source, starting at offset for length chars.
static void UTF8toUTF16(byte[] utf8, int offset, int length, CharsRef chars)
          Interprets the given byte array as UTF-8 and converts to UTF-16.
static void UTF8toUTF16(byte[] utf8, int offset, int length, UnicodeUtil.UTF16Result result)
          Convert UTF8 bytes into UTF16 characters.
static void UTF8toUTF16(BytesRef bytesRef, CharsRef chars)
          Utility method for UTF8toUTF16(byte[], int, int, CharsRef)
static boolean validUTF16String(CharSequence s)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNI_SUR_HIGH_START

public static final int UNI_SUR_HIGH_START
See Also:
Constant Field Values

UNI_SUR_HIGH_END

public static final int UNI_SUR_HIGH_END
See Also:
Constant Field Values

UNI_SUR_LOW_START

public static final int UNI_SUR_LOW_START
See Also:
Constant Field Values

UNI_SUR_LOW_END

public static final int UNI_SUR_LOW_END
See Also:
Constant Field Values

UNI_REPLACEMENT_CHAR

public static final int UNI_REPLACEMENT_CHAR
See Also:
Constant Field Values
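
The field names suggest the boundaries of the UTF-16 surrogate ranges plus the Unicode replacement character. The following is a minimal sketch of how such constants are typically used; the numeric values and the class and method names here are assumptions taken from the Unicode standard, not values confirmed by this page.

```java
/** Illustrative only: hypothetical constants mirroring the standard Unicode surrogate ranges. */
final class SurrogateRanges {
    // Assumed values; this page links to "Constant Field Values" without listing them.
    static final int SUR_HIGH_START = 0xD800;
    static final int SUR_HIGH_END = 0xDBFF;
    static final int SUR_LOW_START = 0xDC00;
    static final int SUR_LOW_END = 0xDFFF;
    static final int REPLACEMENT_CHAR = 0xFFFD;

    static boolean isHigh(int unit) {
        return unit >= SUR_HIGH_START && unit <= SUR_HIGH_END;
    }

    static boolean isLow(int unit) {
        return unit >= SUR_LOW_START && unit <= SUR_LOW_END;
    }

    /** Combines a surrogate pair into a code point, or returns U+FFFD if the pair is not valid. */
    static int combine(char high, char low) {
        if (!isHigh(high) || !isLow(low)) {
            return REPLACEMENT_CHAR;
        }
        return ((high - SUR_HIGH_START) << 10) + (low - SUR_LOW_START) + 0x10000;
    }
}
```

The arithmetic in combine agrees with the JDK's Character.toCodePoint for valid pairs.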

Method Detail

UTF16toUTF8WithHash

public static int UTF16toUTF8WithHash(char[] source,
                                      int offset,
                                      int length,
                                      BytesRef result)
Encode characters from a char[] source, starting at offset for length chars. Returns a hash of the resulting bytes. After encoding, result.offset will always be 0.
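
To make the contract concrete, here is a hedged sketch of encoding UTF-16 to UTF-8 into a reusable buffer while computing a hash of the output. The Slice class is a hypothetical stand-in for BytesRef, and the 31-multiplier hash and the U+FFFD substitution for unpaired surrogates are assumptions; this page does not specify either.

```java
/** Illustrative only: not the library's implementation. */
final class Utf8HashSketch {
    /** Hypothetical stand-in for BytesRef: a growable byte slice. */
    static final class Slice {
        byte[] bytes = new byte[8];
        int offset;
        int length;
    }

    static int encodeWithHash(char[] source, int offset, int length, Slice result) {
        int end = offset + length;
        // Worst case is 3 bytes per UTF-16 code unit (a surrogate pair is 2 units, 4 bytes).
        if (result.bytes.length < length * 3) {
            result.bytes = new byte[length * 3];
        }
        byte[] out = result.bytes;
        int upto = 0;
        int i = offset;
        while (i < end) {
            char c = source[i++];
            int code = c;
            if (Character.isHighSurrogate(c) && i < end && Character.isLowSurrogate(source[i])) {
                code = Character.toCodePoint(c, source[i++]);
            } else if (Character.isSurrogate(c)) {
                code = 0xFFFD; // assumption: unpaired surrogates become U+FFFD
            }
            if (code < 0x80) {
                out[upto++] = (byte) code;
            } else if (code < 0x800) {
                out[upto++] = (byte) (0xC0 | (code >> 6));
                out[upto++] = (byte) (0x80 | (code & 0x3F));
            } else if (code < 0x10000) {
                out[upto++] = (byte) (0xE0 | (code >> 12));
                out[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
                out[upto++] = (byte) (0x80 | (code & 0x3F));
            } else {
                out[upto++] = (byte) (0xF0 | (code >> 18));
                out[upto++] = (byte) (0x80 | ((code >> 12) & 0x3F));
                out[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
                out[upto++] = (byte) (0x80 | (code & 0x3F));
            }
        }
        // Assumed hash: a simple polynomial over the encoded bytes.
        int hash = 0;
        for (int j = 0; j < upto; j++) {
            hash = 31 * hash + out[j];
        }
        result.offset = 0; // matches the documented guarantee that offset is 0 afterwards
        result.length = upto;
        return hash;
    }
}
```

Because the hash is computed from the encoded bytes, a caller can use the return value as a hash-table key without a second pass over result.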


UTF16toUTF8

public static void UTF16toUTF8(char[] source,
                               int offset,
                               UnicodeUtil.UTF8Result result)
Encode characters from a char[] source, starting at offset and stopping when the character 0xffff is seen. The method returns nothing; the encoded bytes and their count are stored in result.
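
The sentinel convention can be sketched as follows: scan forward from offset until the marker character 0xffff, then encode everything before it. This is a minimal illustration using only the JDK; the class and method names are hypothetical, and unlike the method documented here it allocates a new byte[] on every call.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: shows the 0xffff end-marker convention, not the library's code. */
final class SentinelEncodeSketch {
    static final char END_MARKER = 0xFFFF;

    static byte[] encodeUntilMarker(char[] source, int offset) {
        int end = offset;
        // Throws ArrayIndexOutOfBoundsException if the marker is missing.
        while (source[end] != END_MARKER) {
            end++;
        }
        return new String(source, offset, end - offset).getBytes(StandardCharsets.UTF_8);
    }
}
```

The caller is responsible for placing the marker; without it the scan runs off the end of the array.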


UTF16toUTF8

public static void UTF16toUTF8(char[] source,
                               int offset,
                               int length,
                               UnicodeUtil.UTF8Result result)
Encode characters from a char[] source, starting at offset for length chars. The method returns nothing; the encoded bytes and their count are stored in result.


UTF16toUTF8

public static void UTF16toUTF8(String s,
                               int offset,
                               int length,
                               UnicodeUtil.UTF8Result result)
Encode characters from this String, starting at offset for length characters. The method returns nothing; the encoded bytes and their count are stored in result.


UTF16toUTF8

public static void UTF16toUTF8(CharSequence s,
                               int offset,
                               int length,
                               BytesRef result)
Encode characters from this CharSequence, starting at offset for length characters. After encoding, result.offset will always be 0.
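
The point of passing in a result object is buffer reuse. The sketch below illustrates that pattern with the JDK's CharsetEncoder; ReusableUtf8Buffer is a hypothetical class standing in for BytesRef, and the growth policy (3 bytes per UTF-16 code unit) is an assumption rather than the library's documented behavior.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Illustrative only: a reusable output buffer whose offset is always 0 after encoding. */
final class ReusableUtf8Buffer {
    byte[] bytes = new byte[16];
    int offset;
    int length;

    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

    void encode(CharSequence s, int start, int len) {
        int maxBytes = len * 3; // worst case per UTF-16 code unit
        if (bytes.length < maxBytes) {
            bytes = new byte[maxBytes]; // grow only when the existing array is too small
        }
        ByteBuffer out = ByteBuffer.wrap(bytes);
        encoder.reset();
        encoder.encode(CharBuffer.wrap(s, start, start + len), out, true);
        encoder.flush(out);
        offset = 0;
        length = out.position();
    }
}
```

Calling encode repeatedly on the same instance allocates only when an input exceeds the current capacity, which is the saving over String.getBytes("UTF-8").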


UTF16toUTF8

public static void UTF16toUTF8(char[] source,
                               int offset,
                               int length,
                               BytesRef result)
Encode characters from a char[] source, starting at offset for length chars. After encoding, result.offset will always be 0.


UTF8toUTF16

public static void UTF8toUTF16(byte[] utf8,
                               int offset,
                               int length,
                               UnicodeUtil.UTF16Result result)
Convert UTF8 bytes into UTF16 characters. If offset is non-zero, conversion starts at that starting point in utf8, re-using the results from the previous call up until offset.
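
One way to picture the prefix reuse described above is to remember, for each byte position that starts a code point, the char position it decoded to. The sketch below does that; the class, its field names, and the bookkeeping array are hypothetical, and it assumes well-formed UTF-8 and an offset that falls on a code point boundary seen in the previous call.

```java
import java.util.Arrays;

/** Illustrative only: incremental UTF-8 to UTF-16 decoding that reuses an earlier prefix. */
final class IncrementalUtf16Sketch {
    char[] chars = new char[16];
    int[] charStartAtByte = new int[17]; // char index for each byte that starts a code point
    int length;

    void decode(byte[] utf8, int offset, int len) {
        int end = offset + len;
        if (chars.length < end) {
            chars = Arrays.copyOf(chars, end); // a byte never yields more than one char
        }
        if (charStartAtByte.length < end + 1) {
            charStartAtByte = Arrays.copyOf(charStartAtByte, end + 1);
        }
        // Resume where the previous call left the chars for bytes [0, offset).
        int upto = (offset == 0) ? 0 : charStartAtByte[offset];
        int i = offset;
        while (i < end) {
            charStartAtByte[i] = upto;
            int b = utf8[i] & 0xFF;
            int cp;
            if (b < 0x80) {
                cp = b;
                i += 1;
            } else if (b < 0xE0) {
                cp = ((b & 0x1F) << 6) | (utf8[i + 1] & 0x3F);
                i += 2;
            } else if (b < 0xF0) {
                cp = ((b & 0x0F) << 12) | ((utf8[i + 1] & 0x3F) << 6) | (utf8[i + 2] & 0x3F);
                i += 3;
            } else {
                cp = ((b & 0x07) << 18) | ((utf8[i + 1] & 0x3F) << 12)
                        | ((utf8[i + 2] & 0x3F) << 6) | (utf8[i + 3] & 0x3F);
                i += 4;
            }
            upto += Character.toChars(cp, chars, upto);
        }
        charStartAtByte[end] = upto;
        length = upto;
    }
}
```

When two consecutive inputs share a prefix, the second call can pass the length of that prefix as offset and skip decoding it again.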


newString

public static String newString(int[] codePoints,
                               int offset,
                               int count)
Mirrors the JDK 1.5 API. Creates a String from an array of code points.

Parameters:
codePoints - The array of code points
offset - The start of the text in the code point array
count - The number of code points
Returns:
a String representing the count code points starting at offset
Throws:
IllegalArgumentException - If an invalid code point is encountered
IndexOutOfBoundsException - If the offset or count are out of bounds.
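
For comparison, the JDK has offered a constructor with the same parameter shape since 1.5. The sketch below wraps it in a hypothetical helper to show the expected behavior, including both exception cases; it is not a statement about how newString is implemented.

```java
/** Illustrative only: the JDK constructor with the same contract. */
final class CodePointStringSketch {
    static String fromCodePoints(int[] codePoints, int offset, int count) {
        // Throws IllegalArgumentException for an invalid code point and
        // IndexOutOfBoundsException when offset or count are out of bounds.
        return new String(codePoints, offset, count);
    }
}
```

A supplementary code point such as U+1F600 occupies two chars in the resulting String, so the String's length can exceed count.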

UTF8toUTF16

public static void UTF8toUTF16(byte[] utf8,
                               int offset,
                               int length,
                               CharsRef chars)
Interprets the given byte array as UTF-8 and converts to UTF-16. The CharsRef will be extended if it doesn't provide enough space to hold the worst case of each byte becoming a UTF-16 code unit.

NOTE: Full characters are read, even if this reads past the length passed, which can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is passed. Explicit checks for valid UTF-8 are not performed.
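
Since the method does not validate its input, a caller handling untrusted bytes may want to check them first. One possible pre-check, sketched with the JDK's strict decoder, is shown below; the helper name is hypothetical, and it allocates, so it suits validation at a trust boundary rather than a hot loop.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

/** Illustrative only: a strict well-formedness check using the JDK decoder. */
final class Utf8PrecheckSketch {
    static boolean isWellFormed(byte[] utf8, int offset, int length) {
        try {
            // A new decoder reports malformed input by throwing instead of replacing it.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(utf8, offset, length));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A sequence truncated in the middle of a multi-byte character fails this check, which is exactly the case that could otherwise read past length.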


UTF8toUTF16

public static void UTF8toUTF16(BytesRef bytesRef,
                               CharsRef chars)
Utility method that calls UTF8toUTF16(byte[], int, int, CharsRef) with the contents of bytesRef.

See Also:
UTF8toUTF16(byte[], int, int, CharsRef)

validUTF16String

public static boolean validUTF16String(CharSequence s)
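
This page gives no description for the method. Judging by its name, it presumably checks that every surrogate in the sequence is correctly paired. A minimal sketch of such a check, under that assumption and with a hypothetical class name, might look like this:

```java
/** Illustrative only: checks that all surrogates in a CharSequence are correctly paired. */
final class Utf16ValiditySketch {
    static boolean isValidUtf16(CharSequence s) {
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed immediately by a low surrogate.
                if (i + 1 < n && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of the pair
                } else {
                    return false;
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }
}
```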