Examples of net.htmlparser.jericho.TextExtractor

The output is well suited to feeding into a text search engine such as Apache Lucene, especially when the {@link #setIncludeAttributes(boolean) IncludeAttributes} property has been set to true.

Use one of the following methods to obtain the output:

{@link #writeTo(Writer)}
{@link #appendTo(Appendable)}
{@link #toString()}
{@link CharStreamSourceUtil#getReader(CharStreamSource) CharStreamSourceUtil.getReader(this)}

The process removes all of the tags and {@linkplain CharacterReference#decodeCollapseWhiteSpace(CharSequence) decodes the result, collapsing all white space}. A space character is included in the output wherever a normal tag is present in the source, unless the tag belongs to an {@linkplain HTMLElements#getInlineLevelElementNames() inline-level} element. The one exception is the {@link HTMLElementName#BR BR} element, which is converted to a space despite being an inline-level element.

Text inside {@link HTMLElementName#SCRIPT SCRIPT} and {@link HTMLElementName#STYLE STYLE} elements contained within this segment is ignored.
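
The spacing rules described above can be illustrated with a small standalone sketch. It is not Jericho's implementation and does not use the Jericho API: it is a simplified regular-expression approximation written only to show the behaviour. The class name TagSpacingSketch, the method extract and the short list of inline-level element names are assumptions made for this illustration. The sketch does not decode character references, handle comments or server tags, or cope with a ">" inside an attribute value.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified illustration of the tag-to-space rule; NOT Jericho's implementation. */
final class TagSpacingSketch {
    // Assumed subset of inline-level element names, for illustration only.
    private static final Set<String> INLINE = Set.of("a", "b", "em", "i", "span", "strong");

    // Matches a SCRIPT or STYLE element together with its content.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Matches a start or end tag and captures the element name.
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    static String extract(String html) {
        // Step 1: ignore text inside SCRIPT and STYLE elements.
        String source = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        Matcher m = TAG.matcher(source);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(source, last, m.start());
            String name = m.group(1).toLowerCase(Locale.ROOT);
            // Step 2: BR and non-inline tags become a space; other inline tags vanish.
            if (name.equals("br") || !INLINE.contains(name)) {
                out.append(' ');
            }
            last = m.end();
        }
        out.append(source, last, source.length());
        // Step 3: collapse runs of white space and trim both ends.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("<div><b>O</b>ne</div><div><b>Th</b><script>//x</script>ree</div>"));
    }
}
```

In this sketch the B tags disappear without leaving a gap, so "O" and "ne" join into one word, while the DIV tags each contribute a space that the final step collapses into a single separator.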

Setting the {@link #setExcludeNonHTMLElements(boolean) ExcludeNonHTMLElements} property results in the exclusion of any content within a non-HTML element.

See the {@link #excludeElement(StartTag)} method for details on how to implement a more complex mechanism to determine whether the {@linkplain Element#getContent() content} of each {@link Element} is to be excluded from the output.

All tags that are not normal tags, such as {@linkplain TagType#isServerTag() server tags}, {@linkplain StartTagType#COMMENT comments} etc., are removed from the output without adding white space to the output.

Note that segments on which the {@link Segment#ignoreWhenParsing()} method has been called are treated as text rather than markup, resulting in their inclusion in the output. To remove specific segments before extracting the text, create an {@link OutputDocument} and call its {@link OutputDocument#remove(Segment) remove(Segment)} or {@link OutputDocument#replaceWithSpaces(int,int) replaceWithSpaces(int begin, int end)} method for each segment to be removed. Then create a new source document using {@link Source#Source(CharSequence) new Source(outputDocument.toString())} and perform the text extraction on this new source object.

Extracting the text from an entire {@link Source} object performs a {@linkplain Source#fullSequentialParse() full sequential parse} automatically.

To perform a simple rendering of HTML markup into text, which is more readable than the output of this class, use the {@link Renderer} class instead.

Example: Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".
