Class CharacterReader

java.lang.Object
org.jsoup.parser.CharacterReader

public final class CharacterReader extends Object
CharacterReader consumes tokens off a string. Used internally by jsoup. API subject to change.
  • Field Details

    • EOF

      static final char EOF
    • maxStringCacheLen

      private static final int maxStringCacheLen
    • maxBufferLen

      static final int maxBufferLen
    • readAheadLimit

      static final int readAheadLimit
    • minReadAheadLen

      private static final int minReadAheadLen
    • charBuf

      private char[] charBuf
    • reader

      private Reader reader
    • bufLength

      private int bufLength
    • bufSplitPoint

      private int bufSplitPoint
    • bufPos

      private int bufPos
    • readerPos

      private int readerPos
    • bufMark

      private int bufMark
    • stringCacheSize

      private static final int stringCacheSize
    • stringCache

      private String[] stringCache
    • newlinePositions

      private ArrayList<Integer> newlinePositions
    • lineNumberOffset

      private int lineNumberOffset
    • readFully

      private boolean readFully
    • lastIcSeq

      private String lastIcSeq
    • lastIcIndex

      private int lastIcIndex
  • Constructor Details

    • CharacterReader

      public CharacterReader(Reader input, int sz)
    • CharacterReader

      public CharacterReader(Reader input)
    • CharacterReader

      public CharacterReader(String input)
  • Method Details

    • close

      public void close()
    • bufferUp

      private void bufferUp()
    • pos

      public int pos()
      Gets the position currently read to in the content. Starts at 0.
      Returns:
      current position
    • readFully

      boolean readFully()
      Tests if the buffer has been fully read.
    • trackNewlines

      public void trackNewlines(boolean track)
      Enables or disables line number tracking. Tracking is off by default. Tracking line numbers improves the legibility of parser error messages, for example. To be of use, tracking should be enabled before any content is read.
      Parameters:
      track - set tracking on|off
      Since:
      1.14.3
    • isTrackNewlines

      public boolean isTrackNewlines()
      Check if the tracking of newlines is enabled.
      Returns:
      the current newline tracking state
      Since:
      1.14.3
    • lineNumber

      public int lineNumber()
      Get the current line number (that the reader has consumed to). Starts at line #1.
      Returns:
      the current line number, or 1 if line tracking is not enabled.
      Since:
      1.14.3
    • lineNumber

      int lineNumber(int pos)
    • columnNumber

      public int columnNumber()
      Get the current column number (that the reader has consumed to). Starts at column #1.
      Returns:
      the current column number
      Since:
      1.14.3
    • columnNumber

      int columnNumber(int pos)
    • posLineCol

      String posLineCol()
      Get a formatted string representing the current line and column positions. E.g. 5:10 indicating line number 5 and column number 10.
      Returns:
      line:col position
      Since:
      1.14.3
    • lineNumIndex

      private int lineNumIndex(int pos)
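
The line and column methods above can be pictured as a lookup into a sorted list of newline offsets. The following is only a minimal sketch under that assumption: the class LineColumnSketch and its helper newlinesBefore are hypothetical names, not jsoup code, and the real reader also has to account for buffering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of jsoup.
final class LineColumnSketch {
    // Offsets of every '\n' in the input, in ascending order.
    private final List<Integer> newlinePositions = new ArrayList<>();

    LineColumnSketch(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') newlinePositions.add(i);
        }
    }

    // Counts the newlines that occur strictly before pos, via binary search.
    private int newlinesBefore(int pos) {
        int i = Collections.binarySearch(newlinePositions, pos);
        return i >= 0 ? i : -i - 1; // a miss returns -(insertionPoint) - 1
    }

    // Lines start at 1.
    int lineNumber(int pos) {
        return newlinesBefore(pos) + 1;
    }

    // Columns start at 1 and restart after each newline.
    int columnNumber(int pos) {
        int n = newlinesBefore(pos);
        if (n == 0) return pos + 1;
        return pos - newlinePositions.get(n - 1);
    }

    String posLineCol(int pos) {
        return lineNumber(pos) + ":" + columnNumber(pos);
    }

    public static void main(String[] args) {
        LineColumnSketch sketch = new LineColumnSketch("ab\ncd\ne");
        System.out.println(sketch.posLineCol(4)); // offset 4 is 'd'; prints 2:2
    }
}
```

Keeping the offsets sorted makes each lookup a binary search rather than a rescan of the input.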
    • scanBufferForNewlines

      private void scanBufferForNewlines()
      Scans the buffer for newline positions, and tracks their locations in newlinePositions.
    • isEmpty

      public boolean isEmpty()
      Tests if all the content has been read.
      Returns:
      true if nothing left to read.
    • isEmptyNoBufferUp

      private boolean isEmptyNoBufferUp()
    • current

      public char current()
      Get the char at the current position.
      Returns:
      char
    • consume

      char consume()
    • unconsume

      void unconsume()
      Unconsume one character (bufPos--). MUST only be called directly after a consume(), and only when there is no chance that a bufferUp has occurred in between.
    • advance

      public void advance()
      Moves the current position by one.
    • mark

      void mark()
    • unmark

      void unmark()
    • rewindToMark

      void rewindToMark()
    • nextIndexOf

      int nextIndexOf(char c)
      Returns the number of characters between the current position and the next instance of the input char.
      Parameters:
      c - scan target
      Returns:
      offset between current position and next instance of target. -1 if not found.
    • nextIndexOf

      int nextIndexOf(CharSequence seq)
      Returns the number of characters between the current position and the next instance of the input sequence.
      Parameters:
      seq - scan target
      Returns:
      offset between current position and next instance of target. -1 if not found.
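
Note that both nextIndexOf variants return an offset relative to the current position, not an absolute index. A minimal sketch of that contract over a plain String follows. NextIndexOfSketch is a hypothetical class for illustration, not jsoup's buffered implementation.

```java
// Hypothetical illustration of the relative-offset contract; not jsoup code.
final class NextIndexOfSketch {
    private final String input;
    private final int pos; // current read position within input

    NextIndexOfSketch(String input, int pos) {
        this.input = input;
        this.pos = pos;
    }

    // Offset from pos to the next occurrence of c, or -1 if not found.
    int nextIndexOf(char c) {
        int i = input.indexOf(c, pos);
        return i == -1 ? -1 : i - pos;
    }

    // Offset from pos to the next occurrence of seq, or -1 if not found.
    int nextIndexOf(CharSequence seq) {
        int i = input.indexOf(seq.toString(), pos);
        return i == -1 ? -1 : i - pos;
    }

    public static void main(String[] args) {
        // Position 3 is the 'O' in "One".
        NextIndexOfSketch r = new NextIndexOfSketch("<p>One</p>", 3);
        System.out.println(r.nextIndexOf("</p>")); // prints 3
    }
}
```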
    • consumeTo

      public String consumeTo(char c)
      Reads characters up to the specific char.
      Parameters:
      c - the delimiter
      Returns:
      the chars read
    • consumeTo

      String consumeTo(String seq)
    • consumeToAny

      public String consumeToAny(char... chars)
      Read characters until the first of any delimiters is found.
      Parameters:
      chars - delimiters to scan for
      Returns:
      characters read up to the matched delimiter.
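
The consumeTo and consumeToAny contracts can be sketched over a plain String as below. ConsumeSketch is a hypothetical class, not jsoup code. In particular, reading to the end of the input when no delimiter is found is an assumption of this sketch, since the descriptions above do not state what happens in that case.

```java
// Hypothetical illustration of delimiter-based consumption; not jsoup code.
final class ConsumeSketch {
    private final String input;
    private int pos = 0;

    ConsumeSketch(String input) {
        this.input = input;
    }

    int pos() {
        return pos;
    }

    // Moves past one character, e.g. to skip a delimiter.
    void advance() {
        pos++;
    }

    // Reads up to, but not including, the first of any delimiter.
    // Assumption: with no delimiter present, everything to the end is read.
    String consumeToAny(char... chars) {
        int start = pos;
        while (pos < input.length() && !contains(chars, input.charAt(pos))) {
            pos++;
        }
        return input.substring(start, pos);
    }

    // The single-delimiter case.
    String consumeTo(char c) {
        return consumeToAny(c);
    }

    private static boolean contains(char[] chars, char target) {
        for (char c : chars) {
            if (c == target) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ConsumeSketch r = new ConsumeSketch("key=value;next");
        System.out.println(r.consumeToAny('=', ';')); // prints key
        r.advance();                                  // skip '='
        System.out.println(r.consumeToAny('=', ';')); // prints value
    }
}
```

The delimiter itself is left unconsumed, so the caller decides whether to skip it or inspect it.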
    • consumeToAnySorted

      String consumeToAnySorted(char... chars)
    • consumeData

      String consumeData()
    • consumeAttributeQuoted

      String consumeAttributeQuoted(boolean single)
    • consumeRawData

      String consumeRawData()
    • consumeTagName

      String consumeTagName()
    • consumeToEnd

      String consumeToEnd()
    • consumeLetterSequence

      String consumeLetterSequence()
    • consumeLetterThenDigitSequence

      String consumeLetterThenDigitSequence()
    • consumeHexSequence

      String consumeHexSequence()
    • consumeDigitSequence

      String consumeDigitSequence()
    • matches

      boolean matches(char c)
    • matches

      boolean matches(String seq)
    • matchesIgnoreCase

      boolean matchesIgnoreCase(String seq)
    • matchesAny

      boolean matchesAny(char... seq)
    • matchesAnySorted

      boolean matchesAnySorted(char[] seq)
    • matchesLetter

      boolean matchesLetter()
    • matchesAsciiAlpha

      boolean matchesAsciiAlpha()
      Checks if the current pos matches an ascii alpha (A-Z a-z) per https://infra.spec.whatwg.org/#ascii-alpha
      Returns:
      true if the current character is an ASCII alpha, false otherwise
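
The ASCII alpha check is narrower than a general Unicode letter check. A minimal sketch of the difference follows. AsciiAlphaSketch and isAsciiAlpha are hypothetical names for illustration, while Character.isLetter is the standard JDK method.

```java
// Hypothetical illustration of an ASCII-alpha test; not jsoup code.
final class AsciiAlphaSketch {
    // True only for A-Z and a-z, per the WHATWG "ASCII alpha" definition.
    static boolean isAsciiAlpha(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    public static void main(String[] args) {
        char eAcute = '\u00e9'; // a non-ASCII letter
        System.out.println(isAsciiAlpha('q'));          // prints true
        System.out.println(isAsciiAlpha(eAcute));       // prints false
        System.out.println(Character.isLetter(eAcute)); // prints true
    }
}
```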
    • matchesDigit

      boolean matchesDigit()
    • matchConsume

      boolean matchConsume(String seq)
    • matchConsumeIgnoreCase

      boolean matchConsumeIgnoreCase(String seq)
    • containsIgnoreCase

      boolean containsIgnoreCase(String seq)
      Used to check for the presence of a closing tag sequence when we're in RCData and see a <xxx. Only finds consistent case.
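
One plausible reading of "only finds consistent case" is that the sequence is searched for in all-lowercase and all-uppercase form, so a mixed-case occurrence is missed. The sketch below illustrates that reading only. ConsistentCaseSketch, its method name, and the sample end tag are assumptions, not jsoup's documented behavior.

```java
import java.util.Locale;

// Hypothetical illustration of a "consistent case" search; not jsoup code.
final class ConsistentCaseSketch {
    // Looks for seq fully lowercased or fully uppercased within remaining.
    static boolean containsConsistentCase(String remaining, String seq) {
        String lower = seq.toLowerCase(Locale.ENGLISH);
        String upper = seq.toUpperCase(Locale.ENGLISH);
        return remaining.contains(lower) || remaining.contains(upper);
    }

    public static void main(String[] args) {
        System.out.println(containsConsistentCase("text </TITLE>", "</title>")); // prints true
        System.out.println(containsConsistentCase("text </Title>", "</title>")); // prints false
    }
}
```

Two exact-match scans avoid the cost of a character-by-character case-insensitive comparison, at the price of missing mixed-case input.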
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • cacheString

      private static String cacheString(char[] charBuf, String[] stringCache, int start, int count)
      Caches short strings, as a flyweight pattern, to reduce GC load. The cache is kept just for this doc, to prevent leaks.

      Simplistic, and on hash collisions just falls back to creating a new string, vs a full HashMap with Entry list. That saves both having to create objects as hash keys, and running through the entry list, at the expense of some more duplicates.
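
The trade-off described above can be sketched as a fixed-size array indexed by hash, where a colliding entry is simply replaced rather than chained. This is a minimal sketch, not the jsoup implementation: the class name, the MAX_LEN value, the hash function, and the power-of-two cache size are all assumptions.

```java
// Hypothetical illustration of a flyweight string cache; not jsoup code.
final class StringCacheSketch {
    static final int MAX_LEN = 12; // assumed limit; longer strings are not cached

    // cache.length is assumed to be a power of two so the mask below works.
    static String cacheString(char[] buf, String[] cache, int start, int count) {
        if (count > MAX_LEN) return new String(buf, start, count);
        if (count < 1) return "";

        int hash = 0;
        for (int i = 0; i < count; i++) {
            hash = 31 * hash + buf[start + i];
        }
        int index = hash & (cache.length - 1); // always non-negative

        String cached = cache[index];
        if (cached != null && rangeEquals(buf, start, count, cached)) {
            return cached; // hit: reuse the existing instance
        }
        // Miss or collision: create a new string and take over the slot.
        String created = new String(buf, start, count);
        cache[index] = created;
        return created;
    }

    // True if buf[start, start + count) has the same characters as cached.
    static boolean rangeEquals(char[] buf, int start, int count, String cached) {
        if (count != cached.length()) return false;
        for (int i = 0; i < count; i++) {
            if (buf[start + i] != cached.charAt(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String[] cache = new String[512];
        char[] buf = "div div".toCharArray();
        String a = cacheString(buf, cache, 0, 3);
        String b = cacheString(buf, cache, 4, 3);
        System.out.println(a == b); // prints true: same cached instance
    }
}
```

No key objects or entry lists are allocated. The cost is that two strings sharing a slot evict each other and may be created more than once.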

    • rangeEquals

      static boolean rangeEquals(char[] charBuf, int start, int count, String cached)
      Check if the value of the provided range equals the string.
    • rangeEquals

      boolean rangeEquals(int start, int count, String cached)