Class Tag
- All Implemented Interfaces:
CharSequence, Comparable<Segment>
Represents either a StartTag or EndTag in a specific source document.
Take the following HTML segment as an example:
<p>This is a sample paragraph.</p>
The "<p>" is represented by a StartTag object, and the "</p>" is represented by an EndTag object, both of which are subclasses of the Tag class.
The whole segment, including the start tag, its corresponding end tag and all of the content in between, is represented by an Element object.
Tag Parsing Process
The following process describes how each tag is identified by the parser:
- Every '<' character found in the source document is considered to be the start of a tag. The characters following it are compared with the start delimiters of all the registered tag types, and a list of matching tag types is determined.
- A more detailed analysis of the source is performed according to the features of each matching tag type from the first step, in order of precedence, until a valid tag can be constructed.
The analysis performed in relation to each candidate tag type is a two-stage process:
- The position of the tag is checked to determine whether it is valid. In theory, a server tag is valid in any position, but a non-server tag is not valid inside any other tag, nor inside elements with CDATA content such as SCRIPT and STYLE elements. Theory dictates therefore that comments and explicit CDATA sections inside script elements should not be recognised as tags. The behaviour of the parser however does not always strictly adhere to the theory, to maintain compatibility with major browsers and also for efficiency reasons.
The TagType.isValidPosition(Source, int pos, int[] fullSequentialParseData) method is responsible for this check and has a common default implementation for all tag types (although custom tag types can override it if necessary). Its behaviour differs depending on whether or not a full sequential parse is performed. See the documentation of the isValidPosition method for full details.
- A final analysis is performed by the TagType.constructTagAt(Source, int pos) method of the candidate tag type. This method returns a valid Tag object if all conditions of the candidate tag type are met, otherwise it returns null and the process continues with the next candidate tag type.
- If the source does not match the start delimiter or syntax of any registered tag type, the segment spanning it and the next '>' character is taken to be an unregistered tag. Some tag search methods ignore unregistered tags. See the isUnregistered() method for more information.
See the documentation of the TagType class for more details on how tags are recognised.
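The candidate loop described above can be sketched in plain Java without the library. Everything in this sketch is an assumption made for illustration: the SimpleTagType class, the four delimiter pairs, their order of precedence, and the identify method are hypothetical, and the position check performed by isValidPosition is left out entirely. It only shows the idea of trying each matching type in turn and moving on when one fails to produce a tag.

```java
class TagCandidateSketch {
    /** Hypothetical, much simplified stand-in for a tag type: a start and a closing delimiter. */
    static final class SimpleTagType {
        final String name;
        final String startDelimiter;
        final String closingDelimiter;

        SimpleTagType(String name, String startDelimiter, String closingDelimiter) {
            this.name = name;
            this.startDelimiter = startDelimiter;
            this.closingDelimiter = closingDelimiter;
        }

        /** Returns the end index (exclusive) of a tag of this type at pos, or -1 if none can be built. */
        int constructTagAt(String source, int pos) {
            if (!source.startsWith(startDelimiter, pos)) return -1;
            int close = source.indexOf(closingDelimiter, pos + startDelimiter.length());
            return close == -1 ? -1 : close + closingDelimiter.length();
        }
    }

    // Assumed order of precedence: more specific start delimiters come first.
    static final SimpleTagType[] REGISTERED = {
        new SimpleTagType("comment", "<!--", "-->"),
        new SimpleTagType("cdata", "<![CDATA[", "]]>"),
        new SimpleTagType("end tag", "</", ">"),
        new SimpleTagType("start tag", "<", ">")
    };

    /** Tries each type in order of precedence and describes the first tag that can be built. */
    static String identify(String source, int pos) {
        for (SimpleTagType type : REGISTERED) {
            int end = type.constructTagAt(source, pos);
            if (end != -1) return type.name + ": " + source.substring(pos, end);
        }
        return null; // no candidate type produced a tag at this position
    }

    public static void main(String[] args) {
        String source = "<p>a <!-- note --> b</p>";
        System.out.println(identify(source, 0));              // start tag: <p>
        System.out.println(identify(source, 5));              // comment: <!-- note -->
        System.out.println(identify("<!-- unterminated", 0)); // null
    }
}
```

In the second call the comment type is tried first because its delimiter is listed first. In the third call every candidate fails, which corresponds to the case where the real parser would go on to consider an unregistered tag.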
Tag Search Methods
Methods that get tags in a source document are collectively referred to as Tag Search Methods.
They are found mostly in the Source and Segment classes, and can be generally categorised as follows:
- Open Search:
- These methods search for tags of any name and type.
getNextTag()
getPreviousTag()
Segment.getAllElements()
Segment.getFirstElement()
Source.getTagAt(int pos)
Source.getPreviousTag(int pos)
Source.getNextTag(int pos)
Source.getEnclosingTag(int pos)
Segment.getAllTags()
Segment.getAllStartTags()
Segment.getFirstStartTag()
Source.getPreviousStartTag(int pos)
Source.getNextStartTag(int pos)
Source.getPreviousEndTag(int pos)
Source.getNextEndTag(int pos)
- Named Search:
- These methods include a parameter called name which is used to specify the name of the tag to search for. Specifying a name that ends in a colon (:) searches for all elements or tags in the specified XML namespace.
Segment.getAllElements(String name)
Segment.getFirstElement(String name)
Segment.getAllStartTags(String name)
Segment.getFirstStartTag(String name)
Source.getPreviousStartTag(int pos, String name)
Source.getNextStartTag(int pos, String name)
Source.getPreviousEndTag(int pos, String name)
Source.getNextEndTag(int pos, String name)
Source.getNextEndTag(int pos, String name, EndTagType)
- Tag Type Search:
- These methods typically include a parameter called tagType which is used to specify the type of the tag to search for. In some methods the search parameter is restricted to the StartTagType or EndTagType subclass of TagType.
Segment.getAllElements(StartTagType)
Segment.getAllTags(TagType)
Segment.getAllStartTags(StartTagType)
Segment.getFirstStartTag(StartTagType)
Source.getPreviousTag(int pos, TagType)
Source.getPreviousStartTag(int pos, StartTagType)
Source.getPreviousEndTag(int pos, EndTagType)
Source.getNextTag(int pos, TagType)
Source.getNextStartTag(int pos, StartTagType)
Source.getNextEndTag(int pos, EndTagType)
Source.getEnclosingTag(int pos, TagType)
Source.getNextEndTag(int pos, String name, EndTagType)
- Attribute Search:
- These methods perform the search based on an attribute name and value.
Segment.getAllElements(String attributeName, String value, boolean valueCaseSensitive)
Segment.getFirstElement(String attributeName, String value, boolean valueCaseSensitive)
Segment.getAllStartTags(String attributeName, String value, boolean valueCaseSensitive)
Segment.getFirstStartTag(String attributeName, String value, boolean valueCaseSensitive)
Segment.getAllElements(String attributeName, Pattern valueRegexPattern)
Segment.getFirstElement(String attributeName, Pattern valueRegexPattern)
Segment.getAllStartTags(String attributeName, Pattern valueRegexPattern)
Segment.getFirstStartTag(String attributeName, Pattern valueRegexPattern)
Segment.getAllElementsByClass(String className)
Segment.getFirstElementByClass(String className)
Segment.getAllStartTagsByClass(String className)
Segment.getFirstStartTagByClass(String className)
Source.getElementById(String id)
Source.getNextElement(int pos, String attributeName, Pattern valueRegexPattern)
Source.getNextElement(int pos, String attributeName, String value, boolean valueCaseSensitive)
Source.getNextElementByClass(int pos, String className)
Source.getNextStartTag(int pos, String attributeName, Pattern valueRegexPattern)
Source.getNextStartTag(int pos, String attributeName, String value, boolean valueCaseSensitive)
Source.getNextStartTagByClass(int pos, String className)
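The namespace convention mentioned under Named Search can be illustrated with a small stand-alone sketch. The nameMatches helper is hypothetical and is not part of the library. It assumes both names are already in lower case, as getName() returns them, and it ignores the special handling of unregistered tags.

```java
class NamedSearchSketch {
    /**
     * Hypothetical helper: decides whether a tag name satisfies a search name.
     * A search name ending in ':' is treated as an XML namespace prefix,
     * any other search name must match the tag name exactly.
     */
    static boolean nameMatches(String tagName, String searchName) {
        if (searchName.endsWith(":")) {
            return tagName.startsWith(searchName);
        }
        return tagName.equals(searchName);
    }

    public static void main(String[] args) {
        System.out.println(nameMatches("o:p", "o:"));       // true
        System.out.println(nameMatches("p", "o:"));         // false
        System.out.println(nameMatches("div", "div"));      // true
        System.out.println(nameMatches("divider", "div"));  // false
    }
}
```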
-
Method Summary
Modifier and Type | Method | Description
abstract Element | getElement() | Returns the element that is started or ended by this tag.
final String | getName() | Returns the name of this tag, always in lower case.
Segment | getNameSegment() | Returns the segment spanning the name of this tag.
Tag | getNextTag() | Returns the next tag in the source document.
Tag | getPreviousTag() | Returns the previous tag in the source document.
abstract TagType | getTagType() | Returns the type of this tag.
Object | getUserData() | Returns the general purpose user data object that has previously been associated with this tag via the setUserData(Object) method.
abstract boolean | isUnregistered() | Indicates whether this tag has a syntax that does not match any of the registered tag types.
static final boolean | isXMLName(CharSequence text) | Indicates whether the specified text is a valid XML Name.
static final boolean | isXMLNameChar(char ch) | Indicates whether the specified character is valid anywhere in an XML Name.
static final boolean | isXMLNameStartChar(char ch) | Indicates whether the specified character is valid at the start of an XML Name.
void | setUserData(Object userData) | Associates the specified general purpose user data object with this tag.
abstract String | tidy() | Returns an XML representation of this tag.
Methods inherited from class net.htmlparser.jericho.Segment
charAt, compareTo, encloses, encloses, equals, getAllCharacterReferences, getAllElements, getAllElements, getAllElements, getAllElements, getAllElements, getAllElementsByClass, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTagsByClass, getAllTags, getAllTags, getBegin, getChildElements, getDebugInfo, getEnd, getFirstElement, getFirstElement, getFirstElement, getFirstElement, getFirstElementByClass, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTagByClass, getFormControls, getFormFields, getMaxDepthIndicator, getNodeIterator, getRenderer, getRowColumnVector, getSource, getStyleURISegments, getTextExtractor, getURIAttributes, hashCode, ignoreWhenParsing, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.CharSequence
chars, codePoints, isEmpty
-
Method Details
-
getElement
Returns the element that is started or ended by this tag.
StartTag.getElement() is guaranteed not null. EndTag.getElement() can return null if the end tag is not properly matched to a start tag.
- Returns:
- the element that is started or ended by this tag.
-
getName
Returns the name of this tag, always in lower case.
The name always starts with the name prefix defined in this tag's type. For some tag types, the name consists only of this prefix, while in others it must be followed by a valid XML name (see StartTagType.isNameAfterPrefixRequired()).
If the name is equal to one of the constants defined in the HTMLElementName interface, this method is guaranteed to return the constant itself. This allows comparisons to be performed using the == operator instead of the less efficient String.equals(Object) method.
For example, the following expression can be used to test whether a StartTag is from a SELECT element:
startTag.getName()==HTMLElementName.SELECT
To get the name of this tag in its original case, use getNameSegment().toString().
- Returns:
- the name of this tag, always in lower case.
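The reason the == comparison is safe can be shown with a stand-alone sketch. It is only a guess at the general technique, not the library's implementation: the SELECT and OPTION constants stand in for those of HTMLElementName, and canonicalName is a hypothetical helper that returns the shared constant whenever the lower-cased name equals one.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CanonicalNameSketch {
    // Hypothetical constants standing in for those defined in HTMLElementName.
    static final String SELECT = "select";
    static final String OPTION = "option";

    // Maps each known name to its single shared String instance.
    private static final Map<String, String> CONSTANTS = new HashMap<>();
    static {
        CONSTANTS.put(SELECT, SELECT);
        CONSTANTS.put(OPTION, OPTION);
    }

    /** Lower-cases the raw name and returns the shared constant when one exists. */
    static String canonicalName(String rawName) {
        String lower = rawName.toLowerCase(Locale.ROOT);
        String constant = CONSTANTS.get(lower);
        return constant != null ? constant : lower;
    }

    public static void main(String[] args) {
        // Identity comparison works because the constant itself is returned.
        System.out.println(canonicalName("SeLeCt") == SELECT);   // true
        // Names without a constant still need equals().
        System.out.println(canonicalName("X-Widget").equals("x-widget")); // true
    }
}
```

Identity comparison is only reliable against the constants themselves. A name that has no matching constant should still be compared with equals().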
-
getNameSegment
Returns the segment spanning the name of this tag.
The code getNameSegment().toString() can be used to retrieve the name of this tag in its original case.
Every call to this method constructs a new Segment object.
-
getTagType
Returns the type of this tag.
- Returns:
- the type of this tag.
-
getUserData
Returns the general purpose user data object that has previously been associated with this tag via the setUserData(Object) method.
If setUserData(Object) has not been called, this method returns null.
- Returns:
- the generic data object that has previously been associated with this tag via the setUserData(Object) method.
-
setUserData
Associates the specified general purpose user data object with this tag.
This property can be useful for applications that need to associate extra information with tags. The object can be retrieved later via the getUserData() method.
- Parameters:
userData - general purpose user data of any type.
-
getNextTag
Returns the next tag in the source document.
This method also returns server tags.
The result of a call to this method is cached. Performing a full sequential parse prepopulates this cache.
If the result is not cached, a call to this method is equivalent to source.getNextTag(getBegin()+1).
See the Tag class documentation for more details about the behaviour of this method.
- Returns:
- the next tag in the source document, or null if this is the last tag.
-
getPreviousTag
Returns the previous tag in the source document.
This method also returns server tags.
The result of a call to this method is cached. Performing a full sequential parse prepopulates this cache.
If the result is not cached, a call to this method is equivalent to source.getPreviousTag(getBegin()-1).
See the Tag class documentation for more details about the behaviour of this method.
- Returns:
- the previous tag in the source document, or null if this is the first tag.
-
isUnregistered
public abstract boolean isUnregistered()
Indicates whether this tag has a syntax that does not match any of the registered tag types.
The only requirement of an unregistered tag type is that it starts with '<' and there is a closing '>' character at some position after it in the source document.
The absence or presence of a '/' character after the initial '<' determines whether an unregistered tag is respectively a StartTag with a type of StartTagType.UNREGISTERED or an EndTag with a type of EndTagType.UNREGISTERED.
There are no restrictions on the characters that might appear between these delimiters, including other '<' characters. This may result in a '>' character that is identified as the closing delimiter of two separate tags, one an unregistered tag, and the other a tag of any type that begins in the middle of the unregistered tag. As explained below, unregistered tags are usually only found when specifically looking for them, so it is up to the user to detect and deal with any such nonsensical results.
Unregistered tags are only returned by the Source.getTagAt(int pos) method, named search methods, where the specified name matches the first characters inside the tag, and by tag type search methods, where the specified tagType is either StartTagType.UNREGISTERED or EndTagType.UNREGISTERED.
Open tag searches and other searches always ignore unregistered tags, although every discovery of an unregistered tag is logged by the parser.
The logic behind this design is that unregistered tag types are usually the result of a '<' character in the text that was mistakenly left unencoded, or a less-than operator inside a script, or some other occurrence which is of no interest to the user. By returning unregistered tags in named and tag type search methods, the library allows the user to specifically search for tags with a certain syntax that does not match any existing TagType. This expediency feature avoids the need for the user to create a custom tag type to define the syntax before searching for these tags. By not returning unregistered tags in the less specific search methods, it is providing only the information that most users are interested in.
- Returns:
- true if this tag has a syntax that does not match any of the registered tag types, otherwise false.
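The delimiter rule for unregistered tags, and the shared '>' it can produce, can be reproduced with a few lines of plain Java. The unregisteredSpan and isEndTag helpers are hypothetical names used only in this sketch, which also assumes that the '/' must come directly after the '<'.

```java
class UnregisteredTagSketch {
    /** Returns the text from the '<' at pos up to and including the next '>', or null if there is none. */
    static String unregisteredSpan(String source, int pos) {
        if (pos < 0 || pos >= source.length() || source.charAt(pos) != '<') return null;
        int close = source.indexOf('>', pos + 1);
        return close == -1 ? null : source.substring(pos, close + 1);
    }

    /** Assumption: a '/' directly after the initial '<' marks an end tag, otherwise a start tag. */
    static boolean isEndTag(String span) {
        return span.length() > 1 && span.charAt(1) == '/';
    }

    public static void main(String[] args) {
        String source = "if (a < b) <b>bold</b>";
        // The stray less-than sign at index 6 swallows everything up to the '>' of the real tag.
        System.out.println(unregisteredSpan(source, 6));   // < b) <b>
        // The real tag at index 11 ends at the very same '>' character.
        System.out.println(unregisteredSpan(source, 11));  // <b>
        System.out.println(isEndTag("</x y>"));            // true
    }
}
```

The first two calls end at the same '>' character, which is the overlapping result the description above warns about.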
-
tidy
Returns an XML representation of this tag.
This is an abstract method which is implemented in the StartTag and EndTag subclasses. See the documentation of the StartTag.tidy() and EndTag.tidy() methods for details.
- Returns:
- an XML representation of this tag.
-
isXMLName
Indicates whether the specified text is a valid XML Name.
This implementation first checks that the first character of the specified text is a valid XML Name start character as defined by the isXMLNameStartChar(char) method, and then checks that the rest of the characters are valid XML Name characters as defined by the isXMLNameChar(char) method.
Note that this implementation does not exactly adhere to the formal definition of an XML Name, but the differences are unlikely to be significant in real-world XML or HTML documents.
- Parameters:
text - the text to test.
- Returns:
- true if the specified text is a valid XML Name, otherwise false.
-
isXMLNameStartChar
public static final boolean isXMLNameStartChar(char ch)
Indicates whether the specified character is valid at the start of an XML Name.
The XML 1.0 specification section 2.3 defines a Name as starting with one of the characters (Letter | '_' | ':').
This method uses the expression Character.isLetter(ch) || ch=='_' || ch==':'.
Note that there are many differences between the Character.isLetter() definition of a Letter and the XML definition of a Letter, but these differences are unlikely to be significant in real-world XML or HTML documents.
- Parameters:
ch - the character to test.
- Returns:
- true if the specified character is valid at the start of an XML Name, otherwise false.
-
isXMLNameChar
public static final boolean isXMLNameChar(char ch)
Indicates whether the specified character is valid anywhere in an XML Name.
The XML 1.0 specification section 2.3 uses the entity NameChar to represent this set of characters, which is defined as (Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender).
This method uses the expression Character.isLetterOrDigit(ch) || ch=='.' || ch=='-' || ch=='_' || ch==':'.
Note that there are many differences between these definitions, but these differences are unlikely to be significant in real-world XML or HTML documents.
- Parameters:
ch - the character to test.
- Returns:
- true if the specified character is valid anywhere in an XML Name, otherwise false.
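The three checks can be combined into a stand-alone sketch that uses only the two expressions quoted above. The class and method names here are hypothetical, and the treatment of null or empty text as invalid is an assumption of the sketch, not something stated in this documentation.

```java
class XmlNameSketch {
    /** Same expression as quoted for isXMLNameStartChar(char). */
    static boolean isXmlNameStartChar(char ch) {
        return Character.isLetter(ch) || ch == '_' || ch == ':';
    }

    /** Same expression as quoted for isXMLNameChar(char). */
    static boolean isXmlNameChar(char ch) {
        return Character.isLetterOrDigit(ch) || ch == '.' || ch == '-' || ch == '_' || ch == ':';
    }

    /** First character must be a start character, every later character a name character. */
    static boolean isXmlName(CharSequence text) {
        // Assumption: null or empty text is not a valid name.
        if (text == null || text.length() == 0) return false;
        if (!isXmlNameStartChar(text.charAt(0))) return false;
        for (int i = 1; i < text.length(); i++) {
            if (!isXmlNameChar(text.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isXmlName("o:p"));      // true
        System.out.println(isXmlName("data-id"));  // true
        System.out.println(isXmlName("1abc"));     // false, a digit cannot start a name
        System.out.println(isXmlName("a b"));      // false, a space is not a name character
        // U+00B7 (middle dot) is an Extender in XML 1.0 but is neither a letter nor a digit in Java.
        System.out.println(isXmlNameChar('\u00B7')); // false
    }
}
```

The last line shows one of the differences from the formal definition: the middle dot is accepted by the XML 1.0 NameChar production but rejected by the Character.isLetterOrDigit based expression.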
-