Class UCharacterName

java.lang.Object
com.ibm.icu.impl.UCharacterName

public final class UCharacterName extends Object
Internal class to manage character names. Since the name data is stored in an array of char, indexes used in this class refer by default to a 2-byte (char) count, unless otherwise stated. Where an index refers to a byte count, the index is halved and, depending on whether the index is even or odd, the MSB or LSB of the char at the halved index is returned. For indexes into an array of int, the index is multiplied by 2, and the char at the multiplied index and the char following it are returned together as an int. UCharacter acts as a public facade for this class. Note: 0 - 0x1F are control characters without names in Unicode 3.0.
Since:
nov0700
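The indexing rules in the class description can be sketched as follows. This is a minimal illustration, not the class's actual code: the class and method names are hypothetical, and the order in which the two chars are combined into an int (first char as the high half) is an assumption.

```java
/** Sketch of byte-wise and int-wise reads over a char[] data array. */
final class CharArrayIndexing {
    /** Reads one byte: even byte indexes take the MSB, odd ones the LSB. */
    static int byteAt(char[] data, int byteIndex) {
        char c = data[byteIndex >> 1];          // halve the byte index
        return (byteIndex & 1) == 0 ? (c >> 8) & 0xFF : c & 0xFF;
    }

    /** Reads one int from two adjacent chars (assumed: high half first). */
    static int intAt(char[] data, int intIndex) {
        int i = intIndex * 2;                   // double the int index
        return (data[i] << 16) | data[i + 1];
    }

    public static void main(String[] args) {
        char[] data = {0x1234, 0xABCD};
        System.out.printf("%02X %02X %08X%n",
                byteAt(data, 0), byteAt(data, 1), intAt(data, 0));
    }
}
```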
  • Field Details

    • INSTANCE

      public static final UCharacterName INSTANCE
    • LINES_PER_GROUP_

      public static final int LINES_PER_GROUP_
      Number of lines per group, equal to 1 << GROUP_SHIFT_
    • m_groupcount_

      public int m_groupcount_
      Maximum number of groups
    • m_groupsize_

      int m_groupsize_
      Size of each group
    • m_tokentable_

      private char[] m_tokentable_
      Data used in unames.icu
    • m_tokenstring_

      private byte[] m_tokenstring_
    • m_groupinfo_

      private char[] m_groupinfo_
    • m_groupstring_

      private byte[] m_groupstring_
    • m_algorithm_

      private UCharacterName.AlgorithmName[] m_algorithm_
    • m_groupoffsets_

      private char[] m_groupoffsets_
      Group use. Note - access must be synchronized.
    • m_grouplengths_

      private char[] m_grouplengths_
    • FILE_NAME_

      private static final String FILE_NAME_
      Default name of the name datafile
    • GROUP_SHIFT_

      private static final int GROUP_SHIFT_
      Shift count to retrieve group information
    • GROUP_MASK_

      private static final int GROUP_MASK_
      Mask to retrieve the offset for a particular character within a group
    • OFFSET_HIGH_OFFSET_

      private static final int OFFSET_HIGH_OFFSET_
      Position of offsethigh in group information array
    • OFFSET_LOW_OFFSET_

      private static final int OFFSET_LOW_OFFSET_
      Position of offsetlow in group information array
    • SINGLE_NIBBLE_MAX_

      private static final int SINGLE_NIBBLE_MAX_
      Double nibble indicator, any nibble > this number has to be combined with its following nibble
    • m_nameSet_

      private int[] m_nameSet_
      Set of chars used in character names (regular & 1.0). Chars are platform-dependent (can be EBCDIC).
    • m_ISOCommentSet_

      private int[] m_ISOCommentSet_
      Set of chars used in ISO comments (regular & 1.0). Chars are platform-dependent (can be EBCDIC).
    • m_utilStringBuffer_

      private StringBuffer m_utilStringBuffer_
      Utility StringBuffer
    • m_utilIntBuffer_

      private int[] m_utilIntBuffer_
      Utility int buffer
    • m_maxISOCommentLength_

      private int m_maxISOCommentLength_
      Maximum ISO comment length
    • m_maxNameLength_

      private int m_maxNameLength_
      Maximum name length
    • TYPE_NAMES_

      private static final String[] TYPE_NAMES_
      Type names used for extended names
    • UNKNOWN_TYPE_NAME_

      private static final String UNKNOWN_TYPE_NAME_
      Unknown type name
    • NON_CHARACTER_

      private static final int NON_CHARACTER_
      Not a character type
    • LEAD_SURROGATE_

      private static final int LEAD_SURROGATE_
      Lead surrogate type
    • TRAIL_SURROGATE_

      private static final int TRAIL_SURROGATE_
      Trail surrogate type
    • EXTENDED_CATEGORY_

      static final int EXTENDED_CATEGORY_
      Extended category count
  • Constructor Details

    • UCharacterName

      private UCharacterName() throws IOException

      Private constructor. The instance is exposed through INSTANCE for use in UCharacter.

      Throws:
      IOException - thrown when data reading fails
  • Method Details

    • getName

      public String getName(int ch, int choice)
      Retrieves the name of a Unicode code point. Depending on choice, the returned name is the "modern" name or the name that was defined in Unicode version 1.0. The name contains only "invariant" characters like A-Z, 0-9, space, and '-'.
      Parameters:
      ch - the code point for which to get the name.
      choice - Selector for which name to get.
      Returns:
      the name of the code point; null is returned if the code point is outside the Unicode range (above 0x10ffff)
    • getCharFromName

      public int getCharFromName(int choice, String name)
      Finds a character by its name and returns its code point value.
      Parameters:
      choice - selector indicating whether the argument name is a Unicode 1.0 name or a name from the most current version
      name - the name to search for
      Returns:
      code point
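The round trip between a code point and its name that getName and getCharFromName provide can be illustrated without ICU4J on the classpath. The sketch below deliberately swaps in the JDK's own Character.getName and Character.codePointOf (Java 9 or later) as stand-ins. It does not call this internal class, and the class name NameLookupDemo is hypothetical.

```java
/** Name lookup round trip, using JDK stand-ins rather than UCharacterName. */
final class NameLookupDemo {
    /** Code point to name (JDK stand-in for the lookup described above). */
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    /** Name to code point; the JDK method throws if the name is unknown. */
    static int codePointOf(String name) {
        return Character.codePointOf(name);
    }

    public static void main(String[] args) {
        String name = nameOf(0x41);
        System.out.println(name + " -> U+"
                + Integer.toHexString(codePointOf(name)).toUpperCase());
    }
}
```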
    • getGroupLengths

      public int getGroupLengths(int index, char[] offsets, char[] lengths)
      Reads a block of compressed lengths of 32 strings and expands them into offsets and lengths for each string. Lengths are stored with a variable-width encoding in consecutive nibbles: if a nibble < 0xc, it is the length itself (0 = empty string); if a nibble >= 0xc, it forms a length value together with the following nibble. The offsets and lengths arrays must be at least 33 (one more) long because there is no check at the end on whether the last nibble is still used.
      Parameters:
      index - of group string object in array
      offsets - array to store the value of the string offsets
      lengths - array to store the value of the string length
      Returns:
      next index of the data string immediately after the lengths in terms of byte address
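The nibble encoding described for getGroupLengths can be sketched as below. This is an illustration under stated assumptions, not the method itself: the class name, the method name expand, and the flat byte[] input are hypothetical, and the rule for combining two nibbles, ((n - 12) << 4 | next) + 12, is an assumption about how a nibble >= 0xc "forms a length value with the following nibble".

```java
/** Sketch: expand 32 nibble-encoded lengths into offsets and lengths. */
final class NibbleLengths {
    static final int LINES_PER_GROUP = 32;
    static final int SINGLE_NIBBLE_MAX = 11;   // nibbles above this need a partner

    /** Returns the byte index just past the encoded lengths. */
    static int expand(byte[] data, int start, char[] offsets, char[] lengths) {
        int pos = start;
        int pending = -1;                      // high part of a two-nibble length
        int i = 0;
        offsets[0] = 0;
        while (i < LINES_PER_GROUP) {
            int b = data[pos++] & 0xFF;
            for (int shift = 4; shift >= 0; shift -= 4) {
                int n = (b >> shift) & 0x0F;   // high nibble first, then low
                if (pending < 0 && n > SINGLE_NIBBLE_MAX) {
                    pending = (n - 12) << 4;   // assumed combination rule
                } else {
                    lengths[i] = (char) (pending < 0 ? n : (pending | n) + 12);
                    if (i < LINES_PER_GROUP) { // a trailing nibble may land at index 32
                        offsets[i + 1] = (char) (offsets[i] + lengths[i]);
                    }
                    pending = -1;
                    i++;
                }
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        byte[] data = new byte[17];
        java.util.Arrays.fill(data, (byte) 0x11);
        data[0] = (byte) 0xC0;                 // one two-nibble length
        char[] offsets = new char[33];
        char[] lengths = new char[33];
        int next = expand(data, 0, offsets, lengths);
        System.out.println("next=" + next + " first=" + (int) lengths[0]);
    }
}
```

The arrays are sized 33 in the usage above because the low nibble of the last byte is decoded even when the 32 lengths are already complete, which matches the note about the missing end check.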
    • getGroupName

      public String getGroupName(int index, int length, int choice)
      Gets the name of the argument group index. UnicodeData.txt uses ';' as a field separator, so no field can contain ';' as part of its contents. In unames.icu, it is marked as token[';'] == -1 only if the semicolon is used in the data file, which is the case if and only if we have Unicode 1.0 names, ISO comments or aliases. So token[';'] == -1 holds whenever U1.0 names, ISO comments or aliases are stored, although we know that ';' will never be part of a name. Equivalent to ICU4C's expandName.
      Parameters:
      index - of the group name string in byte count
      length - of the group name string
      choice - of Unicode 1.0 name or the most current name
      Returns:
      name of the group
    • getExtendedName

      public String getExtendedName(int ch)
      Retrieves the extended name
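Extended names take an angle-bracket form (the private getExtendedChar entry describes them as "<....>"). The sketch below shows one plausible way to build such a name. The exact layout, a type name followed by '-' and the code point as at least four uppercase hex digits, is an assumption, and the class, method and the type name passed in are hypothetical.

```java
import java.util.Locale;

/** Sketch: format an extended name of the assumed form <type-XXXX>. */
final class ExtendedNameFormat {
    static String extendedName(String typeName, int codepoint) {
        String hex = Integer.toHexString(codepoint).toUpperCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder("<").append(typeName).append('-');
        for (int i = hex.length(); i < 4; i++) {
            sb.append('0');                    // pad to at least four digits
        }
        return sb.append(hex).append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(extendedName("control", 0x9));
    }
}
```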
    • getGroup

      public int getGroup(int codepoint)
      Gets the group index for the codepoint, or the group before it.
      Parameters:
      codepoint - The codepoint index.
      Returns:
      group index containing codepoint or the group before it.
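Returning "the group containing the codepoint, or the group before it" is a floor search over the group MSBs. A minimal sketch, assuming the MSBs are available as a sorted int array (the class name and the array are hypothetical stand-ins for the group information data):

```java
/** Sketch: find the last group whose MSB is <= the codepoint's MSB. */
final class GroupSearch {
    static int floorGroup(int[] sortedGroupMsbs, int msb) {
        int result = 0;
        int end = sortedGroupMsbs.length;
        while (result < end - 1) {
            int mid = (result + end) >> 1;
            if (msb < sortedGroupMsbs[mid]) {
                end = mid;                     // target lies before mid
            } else {
                result = mid;                  // mid is still a candidate
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] msbs = {0, 1, 2, 5, 9};
        System.out.println(floorGroup(msbs, 4));
    }
}
```

Because the search returns a floor rather than an exact match, a caller would still compare the MSB of the returned group with the codepoint's MSB to tell a hit from a miss.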
    • getExtendedOr10Name

      public String getExtendedOr10Name(int ch)
      Gets the extended or 1.0 name when the most current Unicode name is not available
      Parameters:
      ch - codepoint
      Returns:
      name of codepoint extended or 1.0
    • getGroupMSB

      public int getGroupMSB(int gindex)
      Gets the MSB from the group index
      Parameters:
      gindex - group index
      Returns:
      the MSB of the group if gindex is valid, -1 otherwise
    • getCodepointMSB

      public static int getCodepointMSB(int codepoint)
      Gets the MSB of the codepoint
      Parameters:
      codepoint - The codepoint value.
      Returns:
      the MSB of the codepoint
    • getGroupLimit

      public static int getGroupLimit(int msb)
      Gets the maximum codepoint + 1 of the group
      Parameters:
      msb - most significant byte of the group
      Returns:
      limit codepoint of the group
    • getGroupMin

      public static int getGroupMin(int msb)
      Gets the minimum codepoint of the group
      Parameters:
      msb - most significant byte of the group
      Returns:
      minimum codepoint of the group
    • getGroupOffset

      public static int getGroupOffset(int codepoint)
      Gets the offset to a group
      Parameters:
      codepoint - The codepoint value.
      Returns:
      offset to a group
    • getGroupMinFromCodepoint

      public static int getGroupMinFromCodepoint(int codepoint)
      Gets the minimum codepoint of a group
      Parameters:
      codepoint - The codepoint value.
      Returns:
      minimum codepoint in the group which codepoint belongs to
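The group helpers above reduce to shift and mask arithmetic. The sketch below assumes a shift of 5, which follows from 32 lines per group being 1 << GROUP_SHIFT_. The formulas are inferred from the descriptions rather than copied from the class, and the class and method names are hypothetical.

```java
/** Sketch of the group arithmetic, assuming 32 codepoints per group. */
final class GroupMath {
    static final int GROUP_SHIFT = 5;          // assumed: 1 << 5 == 32
    static final int LINES_PER_GROUP = 1 << GROUP_SHIFT;
    static final int GROUP_MASK = LINES_PER_GROUP - 1;

    static int codepointMSB(int cp)          { return cp >> GROUP_SHIFT; }
    static int groupMin(int msb)             { return msb << GROUP_SHIFT; }
    static int groupLimit(int msb)           { return (msb << GROUP_SHIFT) + LINES_PER_GROUP; }
    static int groupOffset(int cp)           { return cp & GROUP_MASK; }
    static int groupMinFromCodepoint(int cp) { return cp & ~GROUP_MASK; }

    public static void main(String[] args) {
        int cp = 0x41;
        int msb = codepointMSB(cp);
        System.out.printf("msb=%d min=%X limit=%X offset=%d%n",
                msb, groupMin(msb), groupLimit(msb), groupOffset(cp));
    }
}
```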
    • getAlgorithmLength

      public int getAlgorithmLength()
      Gets the length of the algorithm range array, that is, the number of algorithmic ranges
      Returns:
      number of algorithmic ranges
    • getAlgorithmStart

      public int getAlgorithmStart(int index)
      Gets the start of the range
      Parameters:
      index - algorithm index
      Returns:
      algorithm range start
    • getAlgorithmEnd

      public int getAlgorithmEnd(int index)
      Gets the end of the range
      Parameters:
      index - algorithm index
      Returns:
      algorithm range end
    • getAlgorithmName

      public String getAlgorithmName(int index, int codepoint)
      Gets the Algorithmic name of the codepoint
      Parameters:
      index - algorithmic range index
      codepoint - The codepoint value.
      Returns:
      algorithmic name of codepoint
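A caller would typically walk the algorithmic ranges by index to find the one that covers a codepoint before asking for its name. The sketch below uses a hypothetical Ranges interface that mirrors the three range accessors documented above, so it runs without ICU4J. Treating the range end as inclusive is an assumption, and the ranges in main are sample values only.

```java
/** Sketch: locate the algorithmic range that covers a codepoint. */
final class AlgorithmRangeLookup {
    /** Hypothetical stand-in mirroring the documented range accessors. */
    interface Ranges {
        int getAlgorithmLength();
        int getAlgorithmStart(int index);
        int getAlgorithmEnd(int index);
    }

    /** Returns the index of the covering range, or -1 if there is none. */
    static int findRange(Ranges ranges, int codepoint) {
        for (int i = 0; i < ranges.getAlgorithmLength(); i++) {
            // Assumption: the end of a range is inclusive.
            if (ranges.getAlgorithmStart(i) <= codepoint
                    && codepoint <= ranges.getAlgorithmEnd(i)) {
                return i;
            }
        }
        return -1;
    }

    /** Array-backed sample implementation: each row is {start, end}. */
    static Ranges of(final int[][] startEnd) {
        return new Ranges() {
            public int getAlgorithmLength() { return startEnd.length; }
            public int getAlgorithmStart(int index) { return startEnd[index][0]; }
            public int getAlgorithmEnd(int index) { return startEnd[index][1]; }
        };
    }

    public static void main(String[] args) {
        Ranges sample = of(new int[][] {{0x4E00, 0x9FA5}, {0xAC00, 0xD7A3}});
        System.out.println(findRange(sample, 0xAC01));
    }
}
```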
    • getGroupName

      public String getGroupName(int ch, int choice)
      Gets the group name of the character
      Parameters:
      ch - character to get the group name
      choice - name choice selector to choose a unicode 1.0 or newer name
    • getMaxCharNameLength

      public int getMaxCharNameLength()
      Gets the maximum length of any codepoint name. Equivalent to uprv_getMaxCharNameLength.
      Returns:
      the maximum length of any codepoint name
    • getMaxISOCommentLength

      public int getMaxISOCommentLength()
      Gets the maximum length of any ISO comment. Equivalent to uprv_getMaxISOCommentLength.
      Returns:
      the maximum length of any ISO comment
    • getCharNameCharacters

      public void getCharNameCharacters(UnicodeSet set)
      Fills set with characters that are used in Unicode character names. Equivalent to uprv_getCharNameCharacters.
      Parameters:
      set - USet to receive characters. Existing contents are deleted.
    • getISOCommentCharacters

      public void getISOCommentCharacters(UnicodeSet set)
      Fills set with characters that are used in ISO comments. Equivalent to uprv_getISOCommentCharacters.
      Parameters:
      set - USet to receive characters. Existing contents are deleted.
    • setToken

      boolean setToken(char[] token, byte[] tokenstring)
      Sets the token data
      Parameters:
      token - array of tokens
      tokenstring - array of string values of the tokens
      Returns:
      false if there is a data error
    • setAlgorithm

      boolean setAlgorithm(UCharacterName.AlgorithmName[] alg)
      Set the algorithm name information array
      Parameters:
      alg - Algorithm information array
      Returns:
      true if the algorithm name information has been set correctly
    • setGroupCountSize

      boolean setGroupCountSize(int count, int size)
      Sets the number of groups and the size of each group, in number of chars
      Parameters:
      count - number of groups
      size - size of group in char
      Returns:
      true if group size is set correctly
    • setGroup

      boolean setGroup(char[] group, byte[] groupstring)
      Sets the group name data
      Parameters:
      group - index information array
      groupstring - name information array
      Returns:
      false if there is a data error
    • getAlgName

      private String getAlgName(int ch, int choice)
      Gets the algorithmic name for the argument character
      Parameters:
      ch - character to determine name for
      choice - name choice
      Returns:
      the algorithmic name or null if not found
    • getGroupChar

      private int getGroupChar(String name, int choice)
      Gets the character with the tokenized argument name
      Parameters:
      name - of the character
      choice - name choice
      Returns:
      character with the tokenized argument name or -1 if character is not found
    • getGroupChar

      private int getGroupChar(int index, char[] length, String name, int choice)
      Compares names and retrieves the character if name is found within the argument group
      Parameters:
      index - index where the set of names reside in the group block
      length - list of lengths of the strings
      name - character name to search for
      choice - of either 1.0 or the most current unicode name
      Returns:
      relative character in the group which matches name, otherwise if not found, -1 will be returned
    • getType

      private static int getType(int ch)
      Gets the character extended type
      Parameters:
      ch - character to be tested
      Returns:
      extended type it is associated with
    • getExtendedChar

      private static int getExtendedChar(String name, int choice)
      Gets the character with an extended name of the form <....>.
      Parameters:
      name - of the character to be found
      choice - name choice
      Returns:
      character associated with the name, -1 if such character is not found and -2 if we should continue with the search.
    • add

      private static void add(int[] set, char ch)
      Adds a codepoint into a set of ints. Equivalent to SET_ADD.
      Parameters:
      set - set to add to
      ch - 16 bit char to add
    • contains

      private static boolean contains(int[] set, char ch)
      Checks if a codepoint is a part of a set of ints. Equivalent to SET_CONTAINS.
      Parameters:
      set - set to check in
      ch - 16 bit char to check
      Returns:
      true if codepoint is part of the set, false otherwise
    • add

      private static int add(int[] set, String str)
      Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
      Parameters:
      set - set to add all chars of str to
      str - string to add
    • add

      private static int add(int[] set, StringBuffer str)
      Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
      Parameters:
      set - set to add all chars of str to
      str - string to add
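The add and contains helpers above treat an int[] as a set of bit flags, one bit per char. A minimal sketch of that idea follows. The bit layout (word index ch >>> 5, bit index ch & 0x1f) is an assumption, the class name is hypothetical, and an int[8] is used because the convert entry describes the set as 256 bit flags, so only chars below 256 fit.

```java
/** Sketch: a 256-flag char set stored in an int[8]. */
final class CharBitSet {
    /** Marks ch as present; ch must be below set.length * 32. */
    static void add(int[] set, char ch) {
        set[ch >>> 5] |= 1 << (ch & 0x1f);
    }

    /** Tests whether ch has been added. */
    static boolean contains(int[] set, char ch) {
        return (set[ch >>> 5] & (1 << (ch & 0x1f))) != 0;
    }

    /** Adds every char of str and returns the string length. */
    static int add(int[] set, String str) {
        for (int i = 0; i < str.length(); i++) {
            add(set, str.charAt(i));
        }
        return str.length();
    }

    public static void main(String[] args) {
        int[] set = new int[8];
        int length = add(set, "LATIN SMALL LETTER A");
        System.out.println(length + " " + contains(set, 'L') + " " + contains(set, 'z'));
    }
}
```

Returning the length from the string overload lets a caller track the maximum name length in the same pass that fills the set.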
    • addAlgorithmName

      private int addAlgorithmName(int maxlength)
      Adds all algorithmic names into the name set. Equivalent to part of calcAlgNameSetsLengths.
      Parameters:
      maxlength - length to compare to
      Returns:
      the maximum length of any possible algorithmic name if it is > maxlength, otherwise maxlength is returned.
    • addExtendedName

      private int addExtendedName(int maxlength)
      Adds all extended names into the name set. Equivalent to part of calcExtNameSetsLengths.
      Parameters:
      maxlength - length to compare to
      Returns:
      the maxlength of any possible extended name.
    • addGroupName

      private int[] addGroupName(int offset, int length, byte[] tokenlength, int[] set)
      Adds names of a group to the argument set. Equivalent to calcNameSetLength.
      Parameters:
      offset - of the group name string in byte count
      length - of the group name string
      tokenlength - array to store the length of each token
      set - to add to
      Returns:
      the length of the name string and the length of the group string parsed
    • addGroupName

      private void addGroupName(int maxlength)
      Adds the names of all groups to the name set. Sets the data member m_max*Length_. Method called only once. Equivalent to calcGroupNameSetsLength.
      Parameters:
      maxlength - length to compare to
    • initNameSetsLengths

      private boolean initNameSetsLengths()
      Sets up the name sets and the calculation of the maximum lengths. Equivalent to calcNameSetsLengths.
    • convert

      private void convert(int[] set, UnicodeSet uset)
      Converts the char set set into a Unicode set uset. Equivalent to charSetToUSet.
      Parameters:
      set - Set of 256 bit flags corresponding to a set of chars.
      uset - USet to receive characters. Existing contents are deleted.