Class Element
- All Implemented Interfaces: CharSequence, Comparable<Segment>
Take the following HTML segment as an example:
<p>This is a sample paragraph.</p>
The whole segment is represented by an Element object. It comprises the StartTag "<p>", the EndTag "</p>", and the text in between.
An element may also contain other elements between its start and end tags.
The term normal element refers to an element having a start tag with a type of StartTagType.NORMAL.
This comprises all HTML elements and non-HTML elements.
Element instances are obtained using one of the following methods:
- StartTag.getElement()
- EndTag.getElement()
- Segment.getAllElements()
- Segment.getAllElements(String name)
- Segment.getAllElements(StartTagType)
See also the HTMLElements class, and the XML 1.0 specification for elements.
Element Structure
The three possible structures of an element are listed below:
- Single Tag Element:
Example:
<img src="mypicture.jpg">
The element consists only of a single start tag and has no element content (although the start tag itself may have tag content).
- getEndTag()==null
- isEmpty()==true
- getEnd()==getStartTag().getEnd()
This occurs in the following situations:
- An HTML element for which the end tag is forbidden.
- An HTML element for which the end tag is required, but the end tag is not present in the source document.
- An HTML element for which the end tag is optional, where the implicitly terminating tag is situated immediately after the element's start tag.
- An empty element tag
- A non-HTML element that is not an empty element tag but is missing its end tag.
- An element with a start tag of a type that does not define a corresponding end tag type.
- An element with a start tag of a type that does define a corresponding end tag type but is missing its end tag.
- Explicitly Terminated Element:
Example:
<p>This is a sample paragraph.</p>
The element consists of a start tag, content, and an end tag.
- getEndTag()!=null
- isEmpty()==false (provided the end tag doesn't immediately follow the start tag)
- getEnd()==getEndTag().getEnd()
This occurs in the following situations, assuming the start tag's matching end tag is present in the source document:
- An HTML element for which the end tag is either required or optional.
- A non-HTML element that is not an empty element tag.
- An element with a start tag of a type that defines a corresponding end tag type.
- Implicitly Terminated Element:
Example:
<p>This text is included in the paragraph element even though no end tag is present.
<p>This is the next paragraph.
The element consists of a start tag and content, but no end tag.
- getEndTag()==null
- isEmpty()==false
- getEnd()!=getStartTag().getEnd()
This only occurs in an HTML element for which the end tag is optional.
The element ends at the start of a tag which implies the termination of the element, called the implicitly terminating tag. If the implicitly terminating tag is situated immediately after the element's start tag, the element is classed as a single tag element.
See the element parsing rules for HTML elements with optional end tags for details on which tags can implicitly terminate a given element.
See also the documentation of the HTMLElements.getEndTagOptionalElementNames() method.
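The three structures can be told apart from the properties listed for each one. The following is a minimal sketch of that classification, not the library's code: the class ElementStructureSketch, its enum, and the classify method are hypothetical names, and plain offsets stand in for the values that getEndTag(), getStartTag().getEnd() and getEnd() would supply.

```java
/** Illustrative model only; it does not use or reproduce the library's classes. */
final class ElementStructureSketch {

    enum Structure { SINGLE_TAG, EXPLICITLY_TERMINATED, IMPLICITLY_TERMINATED }

    /**
     * @param hasEndTag   stands in for getEndTag() != null
     * @param startTagEnd stands in for getStartTag().getEnd()
     * @param elementEnd  stands in for getEnd()
     */
    static Structure classify(boolean hasEndTag, int startTagEnd, int elementEnd) {
        if (hasEndTag) {
            return Structure.EXPLICITLY_TERMINATED;
        }
        // No end tag: the element either stops at its own start tag or runs on to a later position.
        return elementEnd == startTagEnd ? Structure.SINGLE_TAG : Structure.IMPLICITLY_TERMINATED;
    }

    public static void main(String[] args) {
        String img = "<img src=\"mypicture.jpg\">";
        System.out.println(classify(false, img.length(), img.length()));

        String para = "<p>This is a sample paragraph.</p>";
        System.out.println(classify(true, "<p>".length(), para.length()));

        String open = "<p>First paragraph.\n<p>Next paragraph.";
        // The first element ends where the second <p> begins.
        System.out.println(classify(false, "<p>".length(), open.indexOf("<p>", 1)));
    }
}
```

Checking for an end tag first keeps the two end-tag-less cases together, where the only distinguishing property is whether the element ends at the end of its start tag.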
Element Parsing Rules
The following rules describe the algorithm used in the StartTag.getElement() method to construct an element.
The detection of the start tag's matching end tag or other terminating tags always takes into account the possible nesting of elements.
- If the start tag has a type of StartTagType.NORMAL:
  - If the name of the start tag matches one of the recognised HTML element names (indicating an HTML element):
    - If the end tag for an element of this name is forbidden, the parser does not conduct any search for an end tag and a single tag element is created.
    - If the end tag for an element of this name is required, the parser searches for the start tag's matching end tag.
      - If the matching end tag is found, an explicitly terminated element is created.
      - If no matching end tag is found, the source document is not valid HTML and the incident is logged as a missing required end tag. In this situation a single tag element is created.
    - If the end tag for an element of this name is optional, the parser searches not only for the start tag's matching end tag, but also for any other tag that implicitly terminates the element. For each tag (T2) following the start tag (ST1) of this element (E1):
      - If T2 is a start tag:
        - If the name of T2 is in the list of non-terminating element names for E1, then continue evaluating tags from the end of T2's corresponding element.
        - If the name of T2 is in the list of terminating start tag names for E1, then E1 ends at the beginning of T2. If T2 follows immediately after ST1, a single tag element is created, otherwise an implicitly terminated element is created.
      - If T2 is an end tag:
        - If the name of T2 is the same as that of ST1, an explicitly terminated element is created.
        - If the name of T2 is in the list of terminating end tag names for E1, then E1 ends at the beginning of T2. If T2 follows immediately after ST1, a single tag element is created, otherwise an implicitly terminated element is created.
      - If no more tags are present in the source document, then E1 ends at the end of the file, and an implicitly terminated element is created.
    See the isEmptyElementTag() method for more information.
  - If the name of the start tag does not match one of the recognised HTML element names (indicating a non-HTML element):
    - If the start tag is syntactically an empty-element tag, the parser does not conduct any search for an end tag and a single tag element is created.
    - Otherwise, section 3.1 of the XML 1.0 specification states that a matching end tag MUST be present, and the parser searches for the start tag's matching end tag.
      - If the matching end tag is found, an explicitly terminated element is created.
      - If no matching end tag is found, the source document is not valid XML and the incident is logged as a missing required end tag. In this situation a single tag element is created.
- If the start tag has any type other than StartTagType.NORMAL:
  - If the start tag's type does not define a corresponding end tag type, the parser does not conduct any search for an end tag and a single tag element is created.
  - If the start tag's type does define a corresponding end tag type, the parser assumes that a matching end tag is required and searches for it.
    - If the matching end tag is found, an explicitly terminated element is created.
    - If no matching end tag is found, the missing required end tag is logged and a single tag element is created.
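The scan for an HTML element with an optional end tag is the most involved of these rules, so a simplified sketch may help. It is a model under stated assumptions rather than the parser's implementation: OptionalEndTagScanSketch, Tag and scan are hypothetical names, the terminating name sets are passed in by the caller instead of coming from HTMLElements, and nested elements are not skipped as a unit the way the rule for non-terminating element names describes.

```java
import java.util.List;
import java.util.Set;

/** Simplified model of the optional-end-tag scan; not the library's implementation. */
final class OptionalEndTagScanSketch {

    enum Outcome { SINGLE_TAG, EXPLICITLY_TERMINATED, IMPLICITLY_TERMINATED }

    /** A tag following the start tag ST1: its name, whether it is an end tag, and where it begins. */
    static final class Tag {
        final String name;
        final boolean endTag;
        final int begin;

        Tag(String name, boolean endTag, int begin) {
            this.name = name;
            this.endTag = endTag;
            this.begin = begin;
        }
    }

    static Outcome scan(String st1Name, int st1End, List<Tag> following,
                        Set<String> terminatingStartTags, Set<String> terminatingEndTags) {
        for (Tag t2 : following) {
            if (t2.endTag && t2.name.equals(st1Name)) {
                return Outcome.EXPLICITLY_TERMINATED; // matching end tag found
            }
            Set<String> terminators = t2.endTag ? terminatingEndTags : terminatingStartTags;
            if (terminators.contains(t2.name)) {
                // E1 ends at the beginning of T2.
                return t2.begin == st1End ? Outcome.SINGLE_TAG : Outcome.IMPLICITLY_TERMINATED;
            }
            // Any other tag is passed over. Simplification: nested elements are not skipped as a unit.
        }
        return Outcome.IMPLICITLY_TERMINATED; // no more tags: E1 ends at the end of the file
    }

    public static void main(String[] args) {
        // Hypothetical terminating sets for a <p> element, chosen only for this demonstration.
        Set<String> startTerminators = Set.of("p", "div");
        Set<String> endTerminators = Set.of("div", "body");
        // Models "<p>text<p>next": the second <p> begins after some content.
        List<Tag> tags = List.of(new Tag("p", false, 7));
        System.out.println(scan("p", 3, tags, startTerminators, endTerminators));
    }
}
```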
Method Summary
Modifier and Type | Method | Description
Attributes | getAttributes() | Returns the attributes specified in this element's start tag.
String | getAttributeValue(String attributeName) | Returns the decoded value of the attribute with the specified name (case insensitive).
List<Element> | getChildElements() | Returns a list of the immediate children of this element in the document element hierarchy.
Segment | getContent() | Returns the segment representing the content of the element.
String | getDebugInfo() | Returns a string representation of this object useful for debugging purposes.
int | getDepth() | Returns the nesting depth of this element in the document element hierarchy.
EndTag | getEndTag() | Returns the end tag of the element.
FormControl | getFormControl() | Returns the FormControl defined by this element.
String | getName() | Returns the name of the start tag of this element, always in lower case.
Element | getParentElement() | Returns the parent of this element in the document element hierarchy.
StartTag | getStartTag() | Returns the start tag of the element.
boolean | isEmpty() | Indicates whether this element has zero-length content.
boolean | isEmptyElementTag() | Indicates whether this element is an empty-element tag.
Methods inherited from class net.htmlparser.jericho.Segment
charAt, compareTo, encloses, encloses, equals, getAllCharacterReferences, getAllElements, getAllElements, getAllElements, getAllElements, getAllElements, getAllElementsByClass, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTagsByClass, getAllTags, getAllTags, getBegin, getEnd, getFirstElement, getFirstElement, getFirstElement, getFirstElement, getFirstElementByClass, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTagByClass, getFormControls, getFormFields, getMaxDepthIndicator, getNodeIterator, getRenderer, getRowColumnVector, getSource, getStyleURISegments, getTextExtractor, getURIAttributes, hashCode, ignoreWhenParsing, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.CharSequence
chars, codePoints
-
Method Details
-
getParentElement
Returns the parent of this element in the document element hierarchy.
The Source.fullSequentialParse() method must be called (either explicitly or implicitly) immediately after construction of the Source object if this method is to be used. An IllegalStateException is thrown if a full sequential parse has not been performed or if it was performed after this element was found.
This method returns null for a top-level element, as well as any element formed from a server tag, regardless of whether it is nested inside a normal element.
See the Source.getChildElements() method for more details.
- Returns:
- the parent of this element in the document element hierarchy, or null if this element is a top-level element.
- Throws:
- IllegalStateException - if a full sequential parse has not been performed or if it was performed after this element was found.
getChildElements
Returns a list of the immediate children of this element in the document element hierarchy.
The objects in the list are all of type Element.
See the Source.getChildElements() method for more details.
- Overrides:
- getChildElements in class Segment
- Returns:
- a list of the immediate children of this element in the document element hierarchy, guaranteed not null.
getDepth
public int getDepth()
Returns the nesting depth of this element in the document element hierarchy.
The Source.fullSequentialParse() method must be called (either explicitly or implicitly) after construction of the Source object if this method is to be used. An IllegalStateException is thrown if a full sequential parse has not been performed or if it was performed after this element was found.
A top-level element has a nesting depth of 0.
An element formed from a server tag always has a nesting depth of 0, regardless of whether it is nested inside a normal element.
See the Source.getChildElements() method for more details.
- Returns:
- the nesting depth of this element in the document element hierarchy.
- Throws:
- IllegalStateException - if a full sequential parse has not been performed or if it was performed after this element was found.
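One way to picture how a nesting depth relates to a parent hierarchy is the small tree model below. It assumes, for illustration only, that depth equals the number of ancestors, which is consistent with a top-level element having depth 0. DepthSketch and Node are hypothetical names; nothing here calls the library or reflects how it stores depth internally.

```java
/** Illustrative tree model; the depth rule used here is an assumption of this sketch. */
final class DepthSketch {

    static final class Node {
        final String name;
        final Node parent; // null for a top-level node

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Counts ancestors, so a node without a parent has depth 0. */
    static int depth(Node node) {
        int depth = 0;
        for (Node p = node.parent; p != null; p = p.parent) {
            depth++;
        }
        return depth;
    }

    public static void main(String[] args) {
        Node html = new Node("html", null);
        Node body = new Node("body", html);
        Node p = new Node("p", body);
        System.out.println(p.name + " depth: " + depth(p));
    }
}
```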
getContent
Returns the segment representing the content of the element.
This segment spans between the end of the start tag and the start of the end tag. If the end tag is not present, the content reaches to the end of the element.
A zero-length segment is returned if the element is empty.
- Returns:
- the segment representing the content of the element, guaranteed not null.
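The span rule can be sketched with plain string offsets. This is an illustration, not the library's code: ContentSpanSketch and content are hypothetical names, and the offsets are supplied by the caller rather than taken from StartTag and EndTag objects.

```java
/** Illustrative offset arithmetic only; it does not use the library's Segment class. */
final class ContentSpanSketch {

    /**
     * @param startTagEnd offset just past the start tag
     * @param endTagBegin offset where the end tag begins, or -1 if there is no end tag
     * @param elementEnd  offset just past the whole element
     */
    static String content(String source, int startTagEnd, int endTagBegin, int elementEnd) {
        // Without an end tag, the content reaches to the end of the element.
        int contentEnd = endTagBegin >= 0 ? endTagBegin : elementEnd;
        return source.substring(startTagEnd, contentEnd);
    }

    public static void main(String[] args) {
        String para = "<p>This is a sample paragraph.</p>";
        System.out.println(content(para, "<p>".length(), para.indexOf("</p>"), para.length()));

        String img = "<img src=\"mypicture.jpg\">";
        // A single tag element: start tag end and element end coincide, so the content is zero-length.
        System.out.println(content(img, img.length(), -1, img.length()).isEmpty());
    }
}
```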
-
getStartTag
Returns the start tag of the element.
- Returns:
- the start tag of the element.
-
getEndTag
Returns the end tag of the element.
If the element has no end tag this method returns null.
- Returns:
- the end tag of the element, or null if the element has no end tag.
-
getName
Returns the name of the start tag of this element, always in lower case.
This is equivalent to getStartTag().getName().
See the Tag.getName() method for more information.
- Returns:
- the name of the start tag of this element, always in lower case.
-
isEmpty
public boolean isEmpty()
Indicates whether this element has zero-length content.
This is equivalent to getContent().length()==0.
Note that this definition is broader than both the HTML definition of an empty element, which covers only those elements whose end tag is forbidden, and the XML definition of an empty element, which is "either a start-tag immediately followed by an end-tag, or an empty-element tag". The other possibility covered by this property is the case of an HTML element with an optional end tag that is immediately followed by another tag that implicitly terminates the element.
- Returns:
- true if this element has zero-length content, otherwise false.
isEmptyElementTag
public boolean isEmptyElementTag()
Indicates whether this element is an empty-element tag.
This is equivalent to getStartTag().isEmptyElementTag().
- Returns:
- true if this element is an empty-element tag, otherwise false.
-
getAttributes
Returns the attributes specified in this element's start tag.
This is equivalent to getStartTag().getAttributes().
- Returns:
- the attributes specified in this element's start tag.
getAttributeValue
Returns the decoded value of the attribute with the specified name (case insensitive).
Returns null if the start tag of this element does not have attributes, no attribute with the specified name exists or the attribute has no value.
This is equivalent to getStartTag().getAttributeValue(attributeName).
- Parameters:
- attributeName - the name of the attribute to get.
- Returns:
- the decoded value of the attribute with the specified name, or null if the attribute does not exist or has no value.
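The lookup contract (case-insensitive name, null for a missing or valueless attribute) can be modelled with a standard map. This is a sketch of the behaviour described above, not the library's implementation: AttributeLookupSketch and its put method are hypothetical, and decoding of character references is not modelled.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative model of the lookup contract only; value decoding is not modelled. */
final class AttributeLookupSketch {

    // Keys compare case-insensitively; a null value models an attribute that has no value.
    private final Map<String, String> attributes = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    void put(String name, String value) {
        attributes.put(name, value);
    }

    /** Returns the stored value, or null if the attribute is absent or has no value. */
    String getAttributeValue(String attributeName) {
        return attributes.get(attributeName);
    }

    public static void main(String[] args) {
        AttributeLookupSketch img = new AttributeLookupSketch();
        img.put("SRC", "mypicture.jpg");
        img.put("ismap", null); // attribute present but without a value
        System.out.println(img.getAttributeValue("src"));
        System.out.println(img.getAttributeValue("ISMAP"));
        System.out.println(img.getAttributeValue("alt"));
    }
}
```

Because null covers both "absent" and "present without a value", a caller that needs to distinguish the two would have to inspect the attributes themselves rather than rely on this lookup.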
-
getFormControl
Returns the FormControl defined by this element.
- Returns:
- the FormControl defined by this element, or null if it is not a control.
-
getDebugInfo
Description copied from class: Segment
Returns a string representation of this object useful for debugging purposes.
- Overrides:
- getDebugInfo in class Segment
- Returns:
- a string representation of this object useful for debugging purposes.