This class is intended to parse markup languages, not to validate them. "Malformed" data is interpreted as leniently as possible in order to extract as much information as possible. For instance, spaces are allowed between the "<" and the tag name, attribute values do not need to be quoted, and unbalanced quotes are accepted.
One type of "malformed" data that is specifically not handled is a quoted ">" character occurring within the body of a tag. Even if it is quoted, a ">" in the attributes of a tag will be interpreted as the end of the tag. For example, the single tag <img src='foo.jpg' alt='xyz > abc'> will be erroneously broken by this parser into two tokens: the tag <img src='foo.jpg' alt='xyz > and the trailing text " abc'>".
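The split described above can be reproduced with a naive scan for the first ">" after the opening "<". The following is only an illustrative sketch: the class NaiveTagScan and the method naiveClose are hypothetical names, not part of this class, and the real parser's internals may differ.

```java
// Illustrative sketch only: NaiveTagScan and naiveClose are hypothetical
// names and are not part of the documented parser class.
final class NaiveTagScan {

    /**
     * Returns the index of the first '>' at or after start, or -1 if
     * there is none. Quotes are deliberately ignored, which mirrors the
     * limitation described in the text.
     */
    static int naiveClose(String text, int start) {
        return text.indexOf('>', start);
    }

    public static void main(String[] args) {
        String input = "<img src='foo.jpg' alt='xyz > abc'>";
        int close = naiveClose(input, 1);

        // Everything up to and including the first '>' is taken as the tag.
        String tag = input.substring(0, close + 1);
        // Whatever follows is left over as a separate token.
        String rest = input.substring(close + 1);

        System.out.println("tag:  [" + tag + "]");   // tag:  [<img src='foo.jpg' alt='xyz >]
        System.out.println("rest: [" + rest + "]");  // rest: [ abc'>]
    }
}
```

The quoted ">" in the alt attribute ends the tag early, so the rest of the attribute value becomes a second token.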
This class also may not properly parse all well-formed XML tags, such as tags with extended paired delimiters <& and &>, <? and ?>, or <![CDATA[ and ]]>. Additionally, XML tags that have embedded comments containing the ">" character will not be parsed correctly (for example: <!DOCTYPE foo SYSTEM -- a > b -- foo.dtd>), since the ">" in the comment will be interpreted as the end of the declaration tag, for the same reason mentioned above.
Note: this behavior may be changed on a per-application basis by overriding the findClose method in a subclass.
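As a rough illustration of the kind of logic a findClose override might contain, the sketch below skips any ">" that falls inside a quoted attribute value. It is written as a standalone helper because the actual signature of findClose is not given here: the class QuoteAwareClose, the method findCloseQuoteAware, and its (String, int) parameters are all assumptions made for this example.

```java
// Illustrative sketch only: QuoteAwareClose and findCloseQuoteAware are
// hypothetical; the real findClose signature is not shown in this text.
final class QuoteAwareClose {

    /**
     * Returns the index of the first '>' at or after start that is not
     * inside a single- or double-quoted run. If no such '>' exists (for
     * example because a quote is unbalanced), falls back to the first
     * '>' of any kind, or -1 if there is none.
     */
    static int findCloseQuoteAware(String text, int start) {
        char quote = 0; // 0 means "not currently inside quotes"
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // matching quote closes the run
                }
            } else if (c == '\'' || c == '"') {
                quote = c; // opening quote starts a run
            } else if (c == '>') {
                return i; // unquoted '>' ends the tag
            }
        }
        // No unquoted '>' found: behave like the naive scan so that
        // unbalanced quotes are still accepted.
        return text.indexOf('>', start);
    }

    public static void main(String[] args) {
        String input = "<img src='foo.jpg' alt='xyz > abc'>";
        int close = findCloseQuoteAware(input, 1);
        System.out.println(input.substring(0, close + 1));
        // prints <img src='foo.jpg' alt='xyz > abc'>
    }
}
```

The fallback to a plain scan is a design choice made for this sketch: it keeps the lenient handling of unbalanced quotes described earlier, at the cost of reverting to the original behavior for such input.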
@author Colin Stevens (colin.stevens@sun.com)
@version 2.6