Class UCharacterName


  • public final class UCharacterName
    extends java.lang.Object
    Internal class to manage character names. Because name data is stored in an array of char, indexes in this class refer to a 2-byte (char) count unless otherwise stated. Where an index refers to a byte count, the index is halved and, depending on whether the index is even or odd, the MSB or LSB of the char at the halved index is returned. For an index into an array of int, the index is multiplied by 2, and the char at the multiplied index together with the following char is returned as an int. UCharacter acts as a public facade for this class. Note: 0 - 0x1F are control characters without names in Unicode 3.0.
    Since:
    nov0700
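    The indexing convention described above can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not part of UCharacterName. The sketch assumes that an even byte index selects the MSB and an odd one the LSB, and that the first char of a pair forms the high half of the int.

```java
// Hypothetical helper, not part of ICU4J: illustrates byte and int
// addressing on top of a char[] as described in the class comment.
final class CharArrayIndexingSketch {

    // Reads the byte at a byte-count index. Assumption: an even index
    // selects the MSB of the char, an odd index selects the LSB.
    static int byteAt(char[] data, int byteIndex) {
        char c = data[byteIndex >> 1];          // halve the index
        return (byteIndex & 1) == 0 ? (c >> 8) : (c & 0xFF);
    }

    // Reads the int at an int-count index. Assumption: the char at the
    // doubled index is the high half, the following char the low half.
    static int intAt(char[] data, int intIndex) {
        int i = intIndex << 1;                  // multiply the index by 2
        return (data[i] << 16) | data[i + 1];
    }
}
```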
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      (package private) static class  UCharacterName.AlgorithmName
      Algorithmic name class
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private UCharacterName()
      Protected constructor for use in UCharacter.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private static void add​(int[] set, char ch)
      Adds a codepoint into a set of ints.
      private static int add​(int[] set, java.lang.String str)
      Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
      private static int add​(int[] set, java.lang.StringBuffer str)
      Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
      private int addAlgorithmName​(int maxlength)
      Adds all algorithmic names into the name set.
      private int addExtendedName​(int maxlength)
      Adds all extended names into the name set.
      private void addGroupName​(int maxlength)
      Adds names of all groups to the argument set.
      private int[] addGroupName​(int offset, int length, byte[] tokenlength, int[] set)
      Adds names of a group to the argument set.
      private static boolean contains​(int[] set, char ch)
      Checks if a codepoint is a part of a set of ints.
      private void convert​(int[] set, UnicodeSet uset)
      Converts the char set (the set argument) into a Unicode set uset.
      private java.lang.String getAlgName​(int ch, int choice)
      Gets the algorithmic name for the argument character
      int getAlgorithmEnd​(int index)
      Gets the end of the range
      int getAlgorithmLength()
      Get the Algorithm range length
      java.lang.String getAlgorithmName​(int index, int codepoint)
      Gets the Algorithmic name of the codepoint
      int getAlgorithmStart​(int index)
      Gets the start of the range
      int getCharFromName​(int choice, java.lang.String name)
      Find a character by its name and return its code point value
      void getCharNameCharacters​(UnicodeSet set)
      Fills set with characters that are used in Unicode character names.
      static int getCodepointMSB​(int codepoint)
      Gets the MSB of the codepoint
      private static int getExtendedChar​(java.lang.String name, int choice)
      Gets the character with an extended name of the form <....>.
      java.lang.String getExtendedName​(int ch)
      Retrieves the extended name
      java.lang.String getExtendedOr10Name​(int ch)
      Gets the extended or Unicode 1.0 name when lookup of the most current Unicode name fails
      int getGroup​(int codepoint)
      Gets the group index for the codepoint, or the group before it.
      private int getGroupChar​(int index, char[] length, java.lang.String name, int choice)
      Compares names and retrieves the character if the name is found within the argument group
      private int getGroupChar​(java.lang.String name, int choice)
      Gets the character with the tokenized argument name
      int getGroupLengths​(int index, char[] offsets, char[] lengths)
      Reads a block of compressed lengths of 32 strings and expands them into offsets and lengths for each string.
      static int getGroupLimit​(int msb)
      Gets the maximum codepoint + 1 of the group
      static int getGroupMin​(int msb)
      Gets the minimum codepoint of the group
      static int getGroupMinFromCodepoint​(int codepoint)
      Gets the minimum codepoint of a group
      int getGroupMSB​(int gindex)
      Gets the MSB from the group index
      java.lang.String getGroupName​(int ch, int choice)
      Gets the group name of the character
      java.lang.String getGroupName​(int index, int length, int choice)
      Gets the name of the argument group index.
      static int getGroupOffset​(int codepoint)
      Gets the offset to a group
      void getISOCommentCharacters​(UnicodeSet set)
      Fills set with characters that are used in ISO comments.
      int getMaxCharNameLength()
      Gets the maximum length of any codepoint name.
      int getMaxISOCommentLength()
      Gets the maximum length of any ISO comment.
      java.lang.String getName​(int ch, int choice)
      Retrieve the name of a Unicode code point.
      private static int getType​(int ch)
      Gets the character extended type
      private boolean initNameSetsLengths()
      Sets up the name sets and the calculation of the maximum lengths.
      (package private) boolean setAlgorithm​(UCharacterName.AlgorithmName[] alg)
      Set the algorithm name information array
      (package private) boolean setGroup​(char[] group, byte[] groupstring)
      Sets the group name data
      (package private) boolean setGroupCountSize​(int count, int size)
      Sets the number of groups and the size of each group in number of char
      (package private) boolean setToken​(char[] token, byte[] tokenstring)
      Sets the token data
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LINES_PER_GROUP_

        public static final int LINES_PER_GROUP_
        Number of lines per group, equal to 1 << GROUP_SHIFT_
        See Also:
        Constant Field Values
      • m_groupcount_

        public int m_groupcount_
        Maximum number of groups
      • m_groupsize_

        int m_groupsize_
        Size of each group
      • m_tokentable_

        private char[] m_tokentable_
        Data used in unames.icu
      • m_tokenstring_

        private byte[] m_tokenstring_
      • m_groupinfo_

        private char[] m_groupinfo_
      • m_groupstring_

        private byte[] m_groupstring_
      • m_groupoffsets_

        private char[] m_groupoffsets_
        Group use. Note - access must be synchronized.
      • m_grouplengths_

        private char[] m_grouplengths_
      • FILE_NAME_

        private static final java.lang.String FILE_NAME_
        Default name of the name datafile
        See Also:
        Constant Field Values
      • GROUP_SHIFT_

        private static final int GROUP_SHIFT_
        Shift count to retrieve group information
        See Also:
        Constant Field Values
      • GROUP_MASK_

        private static final int GROUP_MASK_
        Mask to retrieve the offset for a particular character within a group
        See Also:
        Constant Field Values
      • OFFSET_HIGH_OFFSET_

        private static final int OFFSET_HIGH_OFFSET_
        Position of offsethigh in group information array
        See Also:
        Constant Field Values
      • OFFSET_LOW_OFFSET_

        private static final int OFFSET_LOW_OFFSET_
        Position of offsetlow in group information array
        See Also:
        Constant Field Values
      • SINGLE_NIBBLE_MAX_

        private static final int SINGLE_NIBBLE_MAX_
        Double nibble indicator, any nibble > this number has to be combined with its following nibble
        See Also:
        Constant Field Values
      • m_nameSet_

        private int[] m_nameSet_
        Set of chars used in character names (regular & 1.0). Chars are platform-dependent (can be EBCDIC).
      • m_ISOCommentSet_

        private int[] m_ISOCommentSet_
        Set of chars used in ISO comments. (regular & 1.0). Chars are platform-dependent (can be EBCDIC).
      • m_utilStringBuffer_

        private java.lang.StringBuffer m_utilStringBuffer_
        Utility StringBuffer
      • m_utilIntBuffer_

        private int[] m_utilIntBuffer_
        Utility int buffer
      • m_maxISOCommentLength_

        private int m_maxISOCommentLength_
        Maximum ISO comment length
      • m_maxNameLength_

        private int m_maxNameLength_
        Maximum name length
      • TYPE_NAMES_

        private static final java.lang.String[] TYPE_NAMES_
        Type names used for extended names
      • UNKNOWN_TYPE_NAME_

        private static final java.lang.String UNKNOWN_TYPE_NAME_
        Unknown type name
        See Also:
        Constant Field Values
      • NON_CHARACTER_

        private static final int NON_CHARACTER_
        Not a character type
        See Also:
        Constant Field Values
      • LEAD_SURROGATE_

        private static final int LEAD_SURROGATE_
        Lead surrogate type
        See Also:
        Constant Field Values
      • TRAIL_SURROGATE_

        private static final int TRAIL_SURROGATE_
        Trail surrogate type
        See Also:
        Constant Field Values
      • EXTENDED_CATEGORY_

        static final int EXTENDED_CATEGORY_
        Extended category count
        See Also:
        Constant Field Values
    • Constructor Detail

      • UCharacterName

        private UCharacterName()
                        throws java.io.IOException

        Protected constructor for use in UCharacter.

        Throws:
        java.io.IOException - thrown when data reading fails
    • Method Detail

      • getName

        public java.lang.String getName​(int ch,
                                        int choice)
        Retrieves the name of a Unicode code point. Depending on choice, the returned name is the "modern" name or the name that was defined in Unicode version 1.0. The name contains only "invariant" characters such as A-Z, 0-9, space, and '-'.
        Parameters:
        ch - the code point for which to get the name.
        choice - Selector for which name to get.
        Returns:
        if code point is above 0x1fff, null is returned
      • getCharFromName

        public int getCharFromName​(int choice,
                                   java.lang.String name)
        Find a character by its name and return its code point value
        Parameters:
        choice - selector indicating whether the argument name is a Unicode 1.0 name or a name from the most current version
        name - the name to search for
        Returns:
        code point
      • getGroupLengths

        public int getGroupLengths​(int index,
                                   char[] offsets,
                                   char[] lengths)
        Reads a block of compressed lengths of 32 strings and expands them into offsets and lengths for each string. Lengths are stored with a variable-width encoding in consecutive nibbles: If a nibble<0xc, then it is the length itself (0 = empty string). If a nibble>=0xc, then it forms a length value with the following nibble. The offsets and lengths arrays must be at least 33 (one more) long because there is no check here at the end if the last nibble is still used.
        Parameters:
        index - of group string object in array
        offsets - array to store the value of the string offsets
        lengths - array to store the value of the string length
        Returns:
        next index of the data string immediately after the lengths in terms of byte address
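        The nibble encoding described above can be sketched as follows. This is a standalone illustration with hypothetical names, not the class's actual code. The source states only that a nibble >= 0xc combines with the following nibble; the exact combination used here, ((first - 12) << 4 | second) + 12, is an assumption.

```java
// Hypothetical standalone sketch of the variable-width nibble decoding.
final class NibbleLengthSketch {
    static final int LINES_PER_GROUP = 32;
    static final int SINGLE_NIBBLE_MAX = 11;   // nibbles above this need a second nibble

    // Expands 32 compressed lengths starting at data[start].
    // offsets and lengths must hold at least 33 entries, because a trailing
    // unused nibble may still be written to lengths[32].
    // Returns the byte index immediately after the lengths.
    static int expandLengths(byte[] data, int start, char[] offsets, char[] lengths) {
        int pos = start;
        int pending = -1;                      // high part of a two-nibble length, or -1
        int i = 0;
        offsets[0] = 0;
        while (i < LINES_PER_GROUP) {
            int b = data[pos++] & 0xFF;
            for (int shift = 4; shift >= 0; shift -= 4) {
                int n = (b >> shift) & 0x0F;
                if (pending < 0 && n > SINGLE_NIBBLE_MAX) {
                    // Assumed combination rule: keep the excess over 12 as the high part.
                    pending = (n - 12) << 4;
                } else {
                    lengths[i] = (char) (pending < 0 ? n : (pending | n) + 12);
                    if (i < LINES_PER_GROUP) {
                        offsets[i + 1] = (char) (offsets[i] + lengths[i]);
                    }
                    pending = -1;
                    i++;
                }
            }
        }
        return pos;
    }
}
```

        The guard on offsets[i + 1] reflects the note above: when the 32nd length ends on a high nibble, the low nibble of the same byte is still decoded into slot 32, which is why the arrays need one extra entry.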
      • getGroupName

        public java.lang.String getGroupName​(int index,
                                             int length,
                                             int choice)
        Gets the name of the argument group index. UnicodeData.txt uses ';' as a field separator, so no field can contain ';' as part of its contents. In unames.icu, it is marked as token[';'] == -1 only if the semicolon is used in the data file - which is iff we have Unicode 1.0 names or ISO comments or aliases. So, it will be token[';'] == -1 if we store U1.0 names/ISO comments/aliases although we know that it will never be part of a name. Equivalent to ICU4C's expandName.
        Parameters:
        index - of the group name string in byte count
        length - of the group name string
        choice - of Unicode 1.0 name or the most current name
        Returns:
        name of the group
      • getExtendedName

        public java.lang.String getExtendedName​(int ch)
        Retrieves the extended name
      • getGroup

        public int getGroup​(int codepoint)
        Gets the group index for the codepoint, or the group before it.
        Parameters:
        codepoint - The codepoint index.
        Returns:
        group index containing codepoint or the group before it.
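        A lookup that returns either the containing group or the group before it can be written as a binary search over the groups' MSB values. The sketch below assumes a hypothetical sorted int[] of group MSBs in place of the class's internal group data, and assumes groups span 32 code points, as implied by getGroupLengths.

```java
// Hypothetical sketch: find the group whose MSB matches, or the closest one before it.
final class GroupSearchSketch {
    static final int GROUP_SHIFT = 5;          // assumed: 32 code points per group

    // groupMsbs must be sorted ascending and non-empty.
    static int findGroup(int[] groupMsbs, int codepoint) {
        int msb = codepoint >> GROUP_SHIFT;
        int result = 0;
        int end = groupMsbs.length;
        // Invariant: the answer lies in [result, end).
        while (result < end - 1) {
            int mid = (result + end) >> 1;
            if (msb < groupMsbs[mid]) {
                end = mid;
            } else {
                result = mid;
            }
        }
        return result;
    }
}
```

        For a code point that falls before the first group, this sketch returns index 0, so a caller would still need to compare the group's MSB with the code point's MSB to confirm a real match.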
      • getExtendedOr10Name

        public java.lang.String getExtendedOr10Name​(int ch)
        Gets the extended or Unicode 1.0 name when lookup of the most current Unicode name fails
        Parameters:
        ch - codepoint
        Returns:
        name of codepoint extended or 1.0
      • getGroupMSB

        public int getGroupMSB​(int gindex)
        Gets the MSB from the group index
        Parameters:
        gindex - group index
        Returns:
        the MSB of the group if gindex is valid, -1 otherwise
      • getCodepointMSB

        public static int getCodepointMSB​(int codepoint)
        Gets the MSB of the codepoint
        Parameters:
        codepoint - The codepoint value.
        Returns:
        the MSB of the codepoint
      • getGroupLimit

        public static int getGroupLimit​(int msb)
        Gets the maximum codepoint + 1 of the group
        Parameters:
        msb - most significant byte of the group
        Returns:
        limit codepoint of the group
      • getGroupMin

        public static int getGroupMin​(int msb)
        Gets the minimum codepoint of the group
        Parameters:
        msb - most significant byte of the group
        Returns:
        minimum codepoint of the group
      • getGroupOffset

        public static int getGroupOffset​(int codepoint)
        Gets the offset to a group
        Parameters:
        codepoint - The codepoint value.
        Returns:
        offset to a group
      • getGroupMinFromCodepoint

        public static int getGroupMinFromCodepoint​(int codepoint)
        Gets the minimum codepoint of a group
        Parameters:
        codepoint - The codepoint value.
        Returns:
        minimum codepoint in the group which codepoint belongs to
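        The group helpers above (getCodepointMSB, getGroupLimit, getGroupMin, getGroupOffset, getGroupMinFromCodepoint) are all shift-and-mask arithmetic. The sketch below shows one plausible formulation. The constant values are assumptions derived from "32 strings" per group and LINES_PER_GROUP_ = 1 << GROUP_SHIFT_, and the class name is hypothetical.

```java
// Hypothetical sketch of the group arithmetic; constant values are assumed.
final class GroupArithmeticSketch {
    static final int GROUP_SHIFT = 5;
    static final int LINES_PER_GROUP = 1 << GROUP_SHIFT;   // 32
    static final int GROUP_MASK = LINES_PER_GROUP - 1;     // 0x1F

    // "MSB" here means the bits of the code point above the in-group offset.
    static int codepointMSB(int codepoint) { return codepoint >> GROUP_SHIFT; }

    // First code point of the group.
    static int groupMin(int msb) { return msb << GROUP_SHIFT; }

    // Maximum code point of the group + 1.
    static int groupLimit(int msb) { return (msb << GROUP_SHIFT) + LINES_PER_GROUP; }

    // Position of the code point within its group.
    static int groupOffset(int codepoint) { return codepoint & GROUP_MASK; }

    // First code point of the group that contains codepoint.
    static int groupMinFromCodepoint(int codepoint) { return codepoint & ~GROUP_MASK; }
}
```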
      • getAlgorithmLength

        public int getAlgorithmLength()
        Get the Algorithm range length
        Returns:
        Algorithm range length
      • getAlgorithmStart

        public int getAlgorithmStart​(int index)
        Gets the start of the range
        Parameters:
        index - algorithm index
        Returns:
        algorithm range start
      • getAlgorithmEnd

        public int getAlgorithmEnd​(int index)
        Gets the end of the range
        Parameters:
        index - algorithm index
        Returns:
        algorithm range end
      • getAlgorithmName

        public java.lang.String getAlgorithmName​(int index,
                                                 int codepoint)
        Gets the Algorithmic name of the codepoint
        Parameters:
        index - algorithmic range index
        codepoint - The codepoint value.
        Returns:
        algorithmic name of codepoint
      • getGroupName

        public java.lang.String getGroupName​(int ch,
                                             int choice)
        Gets the group name of the character
        Parameters:
        ch - character to get the group name
        choice - name choice selector to choose a unicode 1.0 or newer name
      • getMaxCharNameLength

        public int getMaxCharNameLength()
        Gets the maximum length of any codepoint name. Equivalent to uprv_getMaxCharNameLength.
        Returns:
        the maximum length of any codepoint name
      • getMaxISOCommentLength

        public int getMaxISOCommentLength()
        Gets the maximum length of any ISO comment. Equivalent to uprv_getMaxISOCommentLength.
        Returns:
        the maximum length of any ISO comment
      • getCharNameCharacters

        public void getCharNameCharacters​(UnicodeSet set)
        Fills set with characters that are used in Unicode character names. Equivalent to uprv_getCharNameCharacters.
        Parameters:
        set - USet to receive characters. Existing contents are deleted.
      • getISOCommentCharacters

        public void getISOCommentCharacters​(UnicodeSet set)
        Fills set with characters that are used in ISO comments. Equivalent to uprv_getISOCommentCharacters.
        Parameters:
        set - USet to receive characters. Existing contents are deleted.
      • setToken

        boolean setToken​(char[] token,
                         byte[] tokenstring)
        Sets the token data
        Parameters:
        token - array of tokens
        tokenstring - array of string values of the tokens
        Returns:
        false if there is a data error
      • setAlgorithm

        boolean setAlgorithm​(UCharacterName.AlgorithmName[] alg)
        Set the algorithm name information array
        Parameters:
        alg - Algorithm information array
        Returns:
        true if the group string offset has been set correctly
      • setGroupCountSize

        boolean setGroupCountSize​(int count,
                                  int size)
        Sets the number of groups and the size of each group in number of char
        Parameters:
        count - number of groups
        size - size of group in char
        Returns:
        true if group size is set correctly
      • setGroup

        boolean setGroup​(char[] group,
                         byte[] groupstring)
        Sets the group name data
        Parameters:
        group - index information array
        groupstring - name information array
        Returns:
        false if there is a data error
      • getAlgName

        private java.lang.String getAlgName​(int ch,
                                            int choice)
        Gets the algorithmic name for the argument character
        Parameters:
        ch - character to determine name for
        choice - name choice
        Returns:
        the algorithmic name or null if not found
      • getGroupChar

        private int getGroupChar​(java.lang.String name,
                                 int choice)
        Gets the character with the tokenized argument name
        Parameters:
        name - of the character
        Returns:
        character with the tokenized argument name or -1 if character is not found
      • getGroupChar

        private int getGroupChar​(int index,
                                 char[] length,
                                 java.lang.String name,
                                 int choice)
        Compares names and retrieves the character if the name is found within the argument group
        Parameters:
        index - index where the set of names reside in the group block
        length - list of lengths of the strings
        name - character name to search for
        choice - of either 1.0 or the most current unicode name
        Returns:
        relative character in the group which matches name, otherwise if not found, -1 will be returned
      • getType

        private static int getType​(int ch)
        Gets the character extended type
        Parameters:
        ch - character to be tested
        Returns:
        extended type it is associated with
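        The extended type adds the noncharacter, lead surrogate, and trail surrogate categories (NON_CHARACTER_, LEAD_SURROGATE_, TRAIL_SURROGATE_) to the ordinary general categories. A rough sketch of such a classification follows. It uses java.lang.Character.getType as a stand-in for the ICU category lookup, and the numeric constants are hypothetical values chosen only so they do not collide with the JDK's category codes.

```java
// Hypothetical sketch; constant values and the JDK fallback are assumptions.
final class ExtendedTypeSketch {
    static final int NON_CHARACTER = 100;
    static final int LEAD_SURROGATE = 101;
    static final int TRAIL_SURROGATE = 102;

    static int extendedType(int cp) {
        // Noncharacters: U+FDD0..U+FDEF and the last two code points of every plane.
        if ((cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF)) {
            return NON_CHARACTER;
        }
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            return LEAD_SURROGATE;
        }
        if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return TRAIL_SURROGATE;
        }
        // Everything else keeps its general category.
        return Character.getType(cp);
    }
}
```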
      • getExtendedChar

        private static int getExtendedChar​(java.lang.String name,
                                           int choice)
        Gets the character with an extended name of the form <....>.
        Parameters:
        name - of the character to be found
        choice - name choice
        Returns:
        character associated with the name, -1 if such character is not found and -2 if we should continue with the search.
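        The three-way return contract above (a code point, -1 for "not found", -2 for "continue with the search") can be sketched with a simplified parser. The sketch assumes extended names look like <type-HEX>, for example <control-0009>. That format and the class name are assumptions, and unlike a full implementation the sketch does not check that the type label matches the code point.

```java
// Hypothetical sketch of the -1 / -2 return convention for extended names.
final class ExtendedNameSketch {
    static final int NOT_FOUND = -1;
    static final int CONTINUE_SEARCH = -2;

    static int parseExtendedName(String name) {
        if (name.isEmpty() || name.charAt(0) != '<') {
            // Not an extended name at all: the caller should try other lookups.
            return CONTINUE_SEARCH;
        }
        int end = name.length() - 1;
        if (name.charAt(end) != '>') {
            return NOT_FOUND;
        }
        int dash = name.lastIndexOf('-');
        if (dash < 0 || dash + 1 >= end) {
            return NOT_FOUND;
        }
        try {
            // The hexadecimal digits sit between the last '-' and the closing '>'.
            return Integer.parseInt(name.substring(dash + 1, end), 16);
        } catch (NumberFormatException e) {
            return NOT_FOUND;
        }
    }
}
```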
      • add

        private static void add​(int[] set,
                                char ch)
        Adds a codepoint into a set of ints. Equivalent to SET_ADD.
        Parameters:
        set - set to add to
        ch - 16 bit char to add
      • contains

        private static boolean contains​(int[] set,
                                        char ch)
        Checks if a codepoint is a part of a set of ints. Equivalent to SET_CONTAINS.
        Parameters:
        set - set to check in
        ch - 16 bit char to check
        Returns:
        true if codepoint is part of the set, false otherwise
      • add

        private static int add​(int[] set,
                               java.lang.String str)
        Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
        Parameters:
        set - set to add all chars of str to
        str - string to add
      • add

        private static int add​(int[] set,
                               java.lang.StringBuffer str)
        Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
        Parameters:
        set - set to add all chars of str to
        str - string to add
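        The add and contains helpers above operate on an int[] that serves as a set of 256 bit flags. A minimal sketch of that representation is shown below, assuming eight 32-bit ints with bit (ch & 0x1F) of element (ch >>> 5) standing for character ch. The class name and the exact bit layout are assumptions.

```java
// Hypothetical sketch of a 256-flag character set stored in an int[8].
final class CharFlagSetSketch {

    static int[] newSet() {
        return new int[8];                     // 8 * 32 = 256 flags
    }

    // Sets the flag for ch. Only chars below 256 fit in this layout;
    // larger values would index past the end of the array.
    static void add(int[] set, char ch) {
        set[ch >>> 5] |= 1 << (ch & 0x1F);
    }

    static boolean contains(int[] set, char ch) {
        return (set[ch >>> 5] & (1 << (ch & 0x1F))) != 0;
    }

    // Adds every char of str and returns the string length, so that a caller
    // can track the maximum name length while building the set.
    static int add(int[] set, String str) {
        for (int i = 0; i < str.length(); i++) {
            add(set, str.charAt(i));
        }
        return str.length();
    }
}
```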
      • addAlgorithmName

        private int addAlgorithmName​(int maxlength)
        Adds all algorithmic names into the name set. Equivalent to part of calcAlgNameSetsLengths.
        Parameters:
        maxlength - length to compare to
        Returns:
        the maximum length of any possible algorithmic name if it is > maxlength, otherwise maxlength is returned.
      • addExtendedName

        private int addExtendedName​(int maxlength)
        Adds all extended names into the name set. Equivalent to part of calcExtNameSetsLengths.
        Parameters:
        maxlength - length to compare to
        Returns:
        the maxlength of any possible extended name.
      • addGroupName

        private int[] addGroupName​(int offset,
                                   int length,
                                   byte[] tokenlength,
                                   int[] set)
        Adds names of a group to the argument set. Equivalent to calcNameSetLength.
        Parameters:
        offset - of the group name string in byte count
        length - of the group name string
        tokenlength - array to store the length of each token
        set - to add to
        Returns:
        the length of the name string and the length of the group string parsed
      • addGroupName

        private void addGroupName​(int maxlength)
        Adds names of all groups to the argument set. Sets the data member m_max*Length_. Method called only once. Equivalent to calcGroupNameSetsLength.
        Parameters:
        maxlength - length to compare to
      • initNameSetsLengths

        private boolean initNameSetsLengths()
        Sets up the name sets and the calculation of the maximum lengths. Equivalent to calcNameSetsLengths.
      • convert

        private void convert​(int[] set,
                             UnicodeSet uset)
        Converts the char set (the set argument) into a Unicode set uset. Equivalent to charSetToUSet.
        Parameters:
        set - Set of 256 bit flags corresponding to a set of chars.
        uset - USet to receive characters. Existing contents are deleted.