Class CharsetICU

  • All Implemented Interfaces:
    java.lang.Comparable<java.nio.charset.Charset>
    Direct Known Subclasses:
    CharsetASCII, CharsetBOCU1, CharsetCompoundText, CharsetHZ, CharsetISCII, CharsetISO2022, CharsetLMBCS, CharsetMBCS, CharsetSCSU, CharsetUTF16, CharsetUTF32, CharsetUTF7, CharsetUTF8

    public abstract class CharsetICU
    extends java.nio.charset.Charset

    A subclass of java.nio.charset.Charset that provides implementations of ICU's charset converters. This API is used to convert codepage or character encoded data to and from UTF-16. You can open a converter with Charset.forName(java.lang.String) or forNameICU(java.lang.String). With that converter, you can get its properties, set options, and convert your data.
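
The conversion to and from UTF-16 described above can be sketched with the standard java.nio.charset.Charset API alone. The class and method names below (CharsetRoundTripSketch, toUtf16, fromUtf16) are hypothetical and exist only for illustration; when the ICU4J charset classes are on the classpath, forNameICU(java.lang.String) could presumably be used in place of Charset.forName(java.lang.String).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Illustrative helper, not part of ICU or the JDK.
class CharsetRoundTripSketch {

    // Decode bytes in the named charset into a Java (UTF-16) string.
    static String toUtf16(byte[] bytes, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return cs.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Encode a Java (UTF-16) string into bytes in the named charset.
    static byte[] fromUtf16(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        ByteBuffer out = cs.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }
}
```

Note that Charset.encode and Charset.decode silently replace malformed or unmappable input; code that needs to detect such input would use newEncoder() or newDecoder() with an explicit error action instead.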

    Since many software programs recognize different converter names for different types of converters, there are other functions in this API to iterate over the converter aliases.
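
As a minimal sketch of iterating over the names a converter is known by, the inherited Charset.name() and Charset.aliases() methods can be combined. The helper class CharsetAliasSketch and its method namesOf are assumptions made for this example; the aliases actually returned depend on the charset provider that resolves the name.

```java
import java.nio.charset.Charset;
import java.util.Set;
import java.util.TreeSet;

// Illustrative helper, not part of ICU or the JDK.
class CharsetAliasSketch {

    // Collect the canonical name and all aliases of a charset.
    // Charset names are not case-sensitive, so the set ignores case.
    static Set<String> namesOf(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        Set<String> names = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        names.add(cs.name());
        names.addAll(cs.aliases());
        return names;
    }
}
```

Any of the names collected this way can be passed back to Charset.forName(java.lang.String) to look up the same charset.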

    Note that Charset.name() cannot always return a unique charset name. Charset documents that, for charsets listed in the IANA Charset Registry, the Charset.name() must be listed there, and it “must be the MIME-preferred name” if there are multiple names.

    However, there are different implementations of many if not most charsets, ICU provides multiple variants for some of them, ICU provides variants of some java.nio-system-supported charsets, and ICU users are free to add more variants. This is so that applications can be compatible with multiple implementations at the same time.

    This is in conflict with the Charset.name() requirements. It is not possible to offer variants of an IANA charset and always use the MIME-preferred name and also have those names be unique.

    Charset.name() returns the MIME-preferred name, or IANA name, so that it can always be used for the charset field in internet protocols.

    Same-name charsets are accessible via Charset.forName(java.lang.String) or forNameICU(java.lang.String) by using unique aliases (e.g., the ICU-canonical names).

    Charset also documents that “Two charsets are equal if, and only if, they have the same canonical names.” This cannot hold when several variants share one canonical name.

    Unfortunately, Charset.equals(java.lang.Object) is final, and Charset.availableCharsets() returns “a sorted map from canonical charset names to charset objects”. Since Charset.name() cannot be unique, Charset.equals(java.lang.Object) cannot work properly in such cases, and Charset.availableCharsets() can only include one variant for a name.
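
The name-based behavior described above can be observed with the standard API. This is only a sketch: the class CharsetEqualitySketch and its methods are hypothetical, and it shows the java.nio.charset.Charset contract rather than anything specific to ICU.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.SortedMap;

// Illustrative helper, not part of ICU or the JDK.
class CharsetEqualitySketch {

    // Charset.equals compares canonical names only, so two lookups that
    // resolve to the same canonical name always compare equal.
    static boolean sameByName(String a, String b) {
        return Charset.forName(a).equals(Charset.forName(b));
    }

    // Charset.availableCharsets() is keyed by canonical name, so it can
    // hold at most one charset object per name.
    static boolean keyedByCanonicalName() {
        SortedMap<String, Charset> all = Charset.availableCharsets();
        for (Map.Entry<String, Charset> e : all.entrySet()) {
            if (!e.getKey().equals(e.getValue().name())) {
                return false;
            }
        }
        return true;
    }
}
```

Two variants that report the same Charset.name() would therefore compare equal and compete for a single map entry, which is why variants need to be requested through unique aliases.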

    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected CharsetICU​(java.lang.String icuCanonicalName, java.lang.String canonicalName, java.lang.String[] aliases)  
    • Method Summary

      Modifier and Type Method Description
      boolean contains​(java.nio.charset.Charset cs)
      Ascertains whether a charset is a subset of this charset. Implements the abstract method of the superclass.
      static java.nio.charset.Charset forNameICU​(java.lang.String charsetName)
      Returns a charset object for the named charset.
      (package private) static java.nio.charset.Charset getCharset​(java.lang.String icuCanonicalName, java.lang.String javaCanonicalName, java.lang.String[] aliases)  
      (package private) static void getCompleteUnicodeSet​(UnicodeSet setFillIn)  
      (package private) static void getNonSurrogateUnicodeSet​(UnicodeSet setFillIn)  
      void getUnicodeSet​(UnicodeSet setFillIn, int which)
      Returns the set of Unicode code points that can be converted by an ICU Converter.
      (package private) abstract void getUnicodeSetImpl​(UnicodeSet setFillIn, int which)
      This follows the ucnv.c function ucnv_detectUnicodeSignature() to detect the start of the stream, for example U+FEFF (the Unicode BOM/signature character), which can be ignored.
      boolean isFixedWidth()
      Returns whether or not the charset of the converter has a fixed number of bytes per charset character.
      (package private) static boolean isSurrogate​(int c)  
      • Methods inherited from class java.nio.charset.Charset

        aliases, availableCharsets, canEncode, compareTo, decode, defaultCharset, displayName, displayName, encode, encode, equals, forName, hashCode, isRegistered, isSupported, name, newDecoder, newEncoder, toString
      • Methods inherited from class java.lang.Object

        clone, finalize, getClass, notify, notifyAll, wait, wait, wait
    • Field Detail

      • icuCanonicalName

        java.lang.String icuCanonicalName
      • options

        int options
      • maxCharsPerByte

        float maxCharsPerByte
      • name

        java.lang.String name
      • codepage

        int codepage
      • platform

        byte platform
      • conversionType

        byte conversionType
      • minBytesPerChar

        int minBytesPerChar
      • maxBytesPerChar

        int maxBytesPerChar
      • subChar

        byte[] subChar
      • subCharLen

        byte subCharLen
      • hasToUnicodeFallback

        byte hasToUnicodeFallback
      • hasFromUnicodeFallback

        byte hasFromUnicodeFallback
      • unicodeMask

        short unicodeMask
      • subChar1

        byte subChar1
      • ROUNDTRIP_SET

        public static final int ROUNDTRIP_SET
        Parameter that selects the set of roundtrippable Unicode code points.
        See Also:
        Constant Field Values
      • ROUNDTRIP_AND_FALLBACK_SET

        @Deprecated
        public static final int ROUNDTRIP_AND_FALLBACK_SET
        Deprecated.
        This API is ICU internal only.
        Select the set of Unicode code points with roundtrip or fallback mappings. Not supported at this point.
        See Also:
        Constant Field Values
      • algorithmicCharsets

        private static final java.util.HashMap<java.lang.String,​java.lang.String> algorithmicCharsets
    • Constructor Detail

      • CharsetICU

        protected CharsetICU​(java.lang.String icuCanonicalName,
                             java.lang.String canonicalName,
                             java.lang.String[] aliases)
        Parameters:
        icuCanonicalName -
        canonicalName -
        aliases -
    • Method Detail

      • contains

        public boolean contains​(java.nio.charset.Charset cs)
        Ascertains whether a charset is a subset of this charset. Implements the abstract method of the superclass.
        Specified by:
        contains in class java.nio.charset.Charset
        Parameters:
        cs - charset to test
        Returns:
        true if the given charset is a subset of this charset
      • getCharset

        static final java.nio.charset.Charset getCharset​(java.lang.String icuCanonicalName,
                                                         java.lang.String javaCanonicalName,
                                                         java.lang.String[] aliases)
      • isSurrogate

        static final boolean isSurrogate​(int c)
      • forNameICU

        public static java.nio.charset.Charset forNameICU​(java.lang.String charsetName)
                                                   throws java.nio.charset.IllegalCharsetNameException,
                                                          java.nio.charset.UnsupportedCharsetException
        Returns a charset object for the named charset. This method guarantees that an ICU charset is returned when one is available. If the ICU charset provider does not support the specified charset, other charset providers are tried, including the standard Java charset provider.
        Parameters:
        charsetName - The name of the requested charset, may be either a canonical name or an alias
        Returns:
        A charset object for the named charset
        Throws:
        java.nio.charset.IllegalCharsetNameException - If the given charset name is illegal
        java.nio.charset.UnsupportedCharsetException - If no support for the named charset is available in this instance of the Java virtual machine
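
A hedged sketch of calling forNameICU follows. It assumes the class lives in the package com.ibm.icu.charset, which this page does not state, and it uses reflection only so that the example still compiles and runs when the ICU4J charset jar is absent. The class IcuLookupSketch and the method lookup are hypothetical names.

```java
import java.lang.reflect.InvocationTargetException;
import java.nio.charset.Charset;

// Illustrative helper, not part of ICU or the JDK.
class IcuLookupSketch {

    // Prefer CharsetICU.forNameICU when ICU4J is present; otherwise fall
    // back to the standard lookup. The package name is an assumption.
    static Charset lookup(String charsetName) {
        try {
            Class<?> icu = Class.forName("com.ibm.icu.charset.CharsetICU");
            return (Charset) icu.getMethod("forNameICU", String.class)
                                .invoke(null, charsetName);
        } catch (InvocationTargetException e) {
            // forNameICU itself failed: rethrow its exception, for example
            // IllegalCharsetNameException or UnsupportedCharsetException.
            Throwable cause = e.getCause();
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause;
            }
            throw new IllegalStateException(cause);
        } catch (ReflectiveOperationException | LinkageError e) {
            // ICU4J charset classes are not available on the classpath.
            return Charset.forName(charsetName);
        }
    }
}
```

With ICU4J on the compile-time classpath, the reflective call would reduce to a direct CharsetICU.forNameICU(charsetName).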
      • getUnicodeSetImpl

        abstract void getUnicodeSetImpl​(UnicodeSet setFillIn,
                                        int which)
        This follows the ucnv.c function ucnv_detectUnicodeSignature() to detect the start of the stream, for example U+FEFF (the Unicode BOM/signature character), which can be ignored. Detects Unicode signature byte sequences at the start of the byte stream and returns the number of bytes of the BOM of the indicated Unicode charset. Returns 0 when no Unicode signature is recognized.
      • getUnicodeSet

        public void getUnicodeSet​(UnicodeSet setFillIn,
                                  int which)
        Returns the set of Unicode code points that can be converted by an ICU Converter.

        The current implementation returns only one kind of set (UCNV_ROUNDTRIP_SET): the set of all Unicode code points that can be roundtrip-converted (converted without any data loss) with the converter. This set does not include code points that have fallback mappings or are only the result of reverse fallback mappings. See UTR #22 "Character Mapping Markup Language" at http://www.unicode.org/reports/tr22/

        In the future, there may be more UConverterUnicodeSet choices to select sets with different properties.

        This is useful for example for

        • checking that a string or document can be roundtrip-converted with a converter, without/before actually performing the conversion
        • testing whether a converter can be used for typical text of a certain locale, by comparing its roundtrip set with the set of ExemplarCharacters from ICU's locale data or other sources
        Parameters:
        setFillIn - A valid UnicodeSet. It will be cleared by this function before the converter's specific set is filled in.
        which - A selector; currently ROUNDTRIP_SET is the only supported value.
        Throws:
        java.lang.IllegalArgumentException - if the parameters do not match.
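
The first use case above, checking a string before converting it, can be approximated without ICU. The sketch below is a stand-in, not getUnicodeSet itself: it uses the standard CharsetEncoder.canEncode(CharSequence), which reports whether the encoder can encode the text but knows nothing about ICU's distinction between roundtrip and fallback mappings. The class EncodabilityCheckSketch and its method canEncodeAll are hypothetical.

```java
import java.nio.charset.Charset;

// Illustrative helper, not part of ICU or the JDK.
class EncodabilityCheckSketch {

    // Returns true if every character of the text can be encoded in the
    // given charset, without performing the actual conversion.
    static boolean canEncodeAll(String text, Charset cs) {
        // Some charsets are decode-only; canEncode() guards newEncoder().
        return cs.canEncode() && cs.newEncoder().canEncode(text);
    }
}
```

With getUnicodeSet, the comparable check would presumably fill a UnicodeSet using ROUNDTRIP_SET and test whether that set contains every code point of the string.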
      • isFixedWidth

        public boolean isFixedWidth()
        Returns whether or not the charset of the converter has a fixed number of bytes per charset character. Examples are converters of the type UCNV_SBCS or UCNV_DBCS. Another example is UTF-32, which always uses 4 bytes per character. A UTF-32 code point may correspond to more than one UTF-8 or UTF-16 code unit, but it always has a size of 4 bytes. Note: This method is not intended to be used to determine whether the charset has a fixed ratio of bytes to Unicode code units for any particular Unicode encoding form.
        Returns:
        true if the converter is fixed-width
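
The UTF-32 remark above can be illustrated by measuring encoded lengths with the standard API. This sketch does not call isFixedWidth(); the class FixedWidthSketch and its method encodedLength are hypothetical, and the availability of a charset named "UTF-32BE" in the running JVM is an assumption (it is not one of the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;

// Illustrative helper, not part of ICU or the JDK.
class FixedWidthSketch {

    // Number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset utf32 = Charset.forName("UTF-32BE"); // assumed to be available
        Charset utf8 = Charset.forName("UTF-8");
        String ascii = "A";                          // U+0041
        String supplementary = "\uD83D\uDE00";       // U+1F600, two UTF-16 code units

        // UTF-32: the same size for both code points.
        System.out.println(encodedLength(ascii, utf32));
        System.out.println(encodedLength(supplementary, utf32));
        // UTF-8: the size varies with the code point.
        System.out.println(encodedLength(ascii, utf8));
        System.out.println(encodedLength(supplementary, utf8));
    }
}
```

The supplementary character occupies two UTF-16 code units in the Java string, which is the distinction the note draws between bytes per charset character and bytes per Unicode code unit.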
      • getNonSurrogateUnicodeSet

        static void getNonSurrogateUnicodeSet​(UnicodeSet setFillIn)
      • getCompleteUnicodeSet

        static void getCompleteUnicodeSet​(UnicodeSet setFillIn)