Class CompositeBreakIterator

java.lang.Object
org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator

final class CompositeBreakIterator extends Object
An internal BreakIterator for multilingual text, following the recommendations of UAX #29: Unicode Text Segmentation (http://unicode.org/reports/tr29/).

See http://unicode.org/reports/tr29/#Tailoring for the motivation of this design.

Text is first divided into runs at script boundaries. Each run is then delegated to the break iterator appropriate for that run's script.

This break iterator also allows you to retrieve the ISO 15924 script code associated with a piece of text.
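
As a rough illustration of the first step, the sketch below splits a string into single-script runs. It uses the JDK's Character.UnicodeScript as a stand-in for the ICU UScript codes that this class reports. ScriptRunSketch, Run, and scriptRuns are hypothetical names, and attaching COMMON and INHERITED characters (spaces, digits, combining marks) to the neighbouring run is a simplifying assumption, not a description of the actual implementation.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits text into runs that each cover one Unicode script. */
final class ScriptRunSketch {

    /** One run: [start, end) offsets plus the script that covers it. */
    static final class Run {
        final int start;
        final int end;
        final UnicodeScript script;

        Run(int start, int end, UnicodeScript script) {
            this.start = start;
            this.end = end;
            this.script = script;
        }
    }

    static List<Run> scriptRuns(String text) {
        List<Run> runs = new ArrayList<>();
        int runStart = 0;
        UnicodeScript runScript = null; // null until a non-neutral script is seen
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            // Assumption: neutral characters never start a new run.
            boolean neutral = s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
            if (!neutral) {
                if (runScript != null && s != runScript) {
                    // Script changed: close the current run and open a new one here.
                    runs.add(new Run(runStart, i, runScript));
                    runStart = i;
                }
                runScript = s;
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        if (!text.isEmpty()) {
            // Text made only of neutral characters is reported as COMMON.
            runs.add(new Run(runStart, text.length(),
                    runScript == null ? UnicodeScript.COMMON : runScript));
        }
        return runs;
    }
}
```

For example, scriptRuns("abc \u043f\u0440\u0438\u0432\u0435\u0442") yields a LATIN run covering offsets 0 to 4 (the space stays with the preceding run) and a CYRILLIC run covering offsets 4 to 10.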

See also UAX #29, UTR #24

  • Method Details

    • next

      int next()
      Retrieve the next break position. If the underlying RuleBasedBreakIterator (RBBI) is exhausted within the current script run, advance to the next script run and examine it.
      Returns:
      the next break position or BreakIterator.DONE
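
The fall-through from one script run to the next can be sketched with JDK classes alone. This is a minimal illustration under stated assumptions, not the class's actual code: java.text.BreakIterator stands in for the ICU rule-based iterators, the same word iterator is reused for every script (the real class selects one per script through getBreakIterator(int)), and returned offsets are assumed to be absolute positions in the text. CompositeNextSketch and advanceRun are hypothetical names.

```java
import java.lang.Character.UnicodeScript;
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical sketch of a break iterator that walks script runs one after another. */
final class CompositeNextSketch {
    private final String text;
    private int runStart = 0;       // start of the current script run
    private int runEnd = 0;         // end (exclusive) of the current script run
    private BreakIterator delegate; // iterator over the current run; null before the first run

    CompositeNextSketch(String text) {
        this.text = text;
    }

    /** Returns the next break as an offset into the whole text, or BreakIterator.DONE. */
    int next() {
        int next = (delegate == null) ? BreakIterator.DONE : delegate.next();
        // Delegate exhausted: move to the following script run, if there is one.
        while (next == BreakIterator.DONE && advanceRun()) {
            delegate = BreakIterator.getWordInstance(Locale.ROOT);
            delegate.setText(text.substring(runStart, runEnd));
            next = delegate.next();
        }
        // The delegate reports offsets relative to the run, so shift them back.
        return next == BreakIterator.DONE ? BreakIterator.DONE : runStart + next;
    }

    /** Moves [runStart, runEnd) to the next run; returns false at the end of the text. */
    private boolean advanceRun() {
        if (runEnd >= text.length()) {
            return false;
        }
        runStart = runEnd;
        UnicodeScript script = null;
        int i = runStart;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            // Assumption: COMMON and INHERITED characters stay with the current run.
            boolean neutral = s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
            if (!neutral) {
                if (script != null && s != script) {
                    break; // a different script starts here
                }
                script = s;
            }
            i += Character.charCount(cp);
        }
        runEnd = i;
        return true;
    }
}
```

A caller would loop in the usual BreakIterator style, for (int b = it.next(); b != BreakIterator.DONE; b = it.next()), without needing to know where one script run ends and the next begins.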
    • current

      int current()
      Retrieve the current break position.
      Returns:
      the current break position or BreakIterator.DONE
    • getRuleStatus

      int getRuleStatus()
      Retrieve the rule status code (token type) from the underlying break iterator.
      Returns:
      rule status code (see RuleBasedBreakIterator constants)
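
A status code is typically interpreted by checking which range it falls into. The sketch below is dependency-free: the numeric ranges are copied from ICU's word-break status constants (WORD_NONE, WORD_NUMBER, WORD_LETTER, WORD_KANA, WORD_IDEO) and should be checked against the ICU4J version in use. RuleStatusSketch and describe are hypothetical names, and custom rule sets may define status values with other meanings.

```java
/** Hypothetical helper: maps a word-break rule status to a coarse token category. */
final class RuleStatusSketch {
    // Assumed lower bounds of ICU's word-break status ranges; each range is 100 wide.
    static final int WORD_NONE = 0;
    static final int WORD_NUMBER = 100;
    static final int WORD_LETTER = 200;
    static final int WORD_KANA = 300;
    static final int WORD_IDEO = 400;
    static final int WORD_IDEO_LIMIT = 500;

    static String describe(int status) {
        if (status < WORD_NONE || status >= WORD_IDEO_LIMIT) {
            return "unknown";
        }
        if (status >= WORD_IDEO) {
            return "ideographic";
        }
        if (status >= WORD_KANA) {
            return "kana";
        }
        if (status >= WORD_LETTER) {
            return "letter";
        }
        if (status >= WORD_NUMBER) {
            return "number";
        }
        return "none"; // spaces, punctuation, and other non-word segments
    }
}
```

With ICU4J on the classpath, the library's own constants would replace the local copies above.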
    • getScriptCode

      int getScriptCode()
      Retrieve the UScript script code for the current token. This code can be decoded with UScript into a name or ISO 15924 code.
      Returns:
      UScript script code for the current token.
    • setText

      void setText(char[] text, int start, int length)
      Set a new region of text to be examined by this iterator.
      Parameters:
      text - buffer of text
      start - offset into buffer
      length - maximum length to examine
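
The (text, start, length) triple describes a window into a larger buffer. The sketch below illustrates that windowing with the JDK's java.text.BreakIterator. RegionSketch and breaksInRegion are hypothetical names, and reporting offsets relative to the start of the region is an assumption of this sketch; this page does not state which convention CompositeBreakIterator uses.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: finds word breaks inside a window of a char buffer. */
final class RegionSketch {

    /** Returns break offsets relative to {@code start}, including the first boundary (0). */
    static List<Integer> breaksInRegion(char[] buffer, int start, int length) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        // Only buffer[start .. start + length) is visible to the iterator.
        it.setText(new String(buffer, start, length));
        List<Integer> breaks = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            breaks.add(b);
        }
        return breaks;
    }
}
```

Characters outside the window, such as a prefix left over from an earlier read into the same buffer, do not influence the reported breaks.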
    • getBreakIterator

      private BreakIteratorWrapper getBreakIterator(int scriptCode)