List of fullSequentialParse() Examples

Examples of fullSequentialParse()

net.htmlparser.jericho.Source.fullSequentialParse()
Parses all of the {@linkplain Tag tags} in this source document sequentially from beginning to end.
Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.
Calling the {@link #getAllTags()}, {@link #getAllStartTags()}, {@link #getAllElements()}, {@link #getChildElements()}, {@link #iterator()} or {@link #getNodeIterator()} method on the Source object performs a full sequential parse automatically. There are, however, still circumstances where it should be called manually, such as when it is known that most or all of the tags in the document will need to be parsed, but none of the above-mentioned methods are used, or they are called only after one or more other tag search methods.
If this method is called manually, it should be called soon after the Source object is created, before any tag search methods are called.
By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a {@linkplain TagType#isValidPosition(Source,int,int[]) valid position}.
Generally speaking, a tag is in a valid position if it does not appear inside any other tag. {@linkplain TagType#isServerTag() Server tags} can appear anywhere in a document, including inside other tags, so this relates only to non-server tags. Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed, otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.
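The idea of a valid position can be illustrated with a toy scanner. This is a minimal sketch that uses only the Java standard library and is not Jericho's implementation: the class name ValidPositionSketch and both of its methods are hypothetical, and the model assumes a tag is simply a run from '<' to the next '>' that lies outside a quoted attribute value. Because it walks the text from the beginning, any '<' that falls inside an already-recognised tag is skipped without a separate check.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not part of Jericho, and far simpler than a real HTML parser.
class ValidPositionSketch {

    // Returns the index just past the '>' that closes the tag starting at
    // 'start', ignoring any '>' inside a quoted attribute value.
    // Returns -1 if the tag is never closed.
    static int findTagEnd(String doc, int start) {
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = start + 1; i < doc.length(); i++) {
            char c = doc.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;
                }
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i + 1;
            }
        }
        return -1;
    }

    // Scans from the beginning and returns the start index of every tag that
    // is in a valid position, meaning it is not enclosed by an earlier tag.
    static List<Integer> validTagStarts(String doc) {
        List<Integer> starts = new ArrayList<>();
        int pos = 0;
        while (pos < doc.length()) {
            int lt = doc.indexOf('<', pos);
            if (lt < 0) {
                break;
            }
            int end = findTagEnd(doc, lt);
            if (end < 0) {
                pos = lt + 1; // unterminated '<': treat it as plain text
                continue;
            }
            starts.add(lt);
            pos = end; // jumping past the tag skips any '<' enclosed by it
        }
        return starts;
    }
}
```

For the input `<a title="<b>">x</a>`, the `<b>` inside the attribute value is never reported, because the scan resumes only after the enclosing `<a ...>` tag has ended.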
When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical. This compromise involves only checking whether the position is enclosed by other tags with {@linkplain TagType#getTagTypesIgnoringEnclosedMarkup() certain tag types}. The added complexity of this technique makes parsing each tag slower compared to when a full sequential parse is performed, but when only a few tags need parsing this is an extremely beneficial trade-off.
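The compromise described above, checking enclosure only against certain tag types, can be sketched for a single tag type. The following is a hypothetical illustration using only the Java standard library, with HTML comments assumed as the one enclosing type; the class EnclosureCheckSketch and its method are not Jericho APIs, and the library's real check is documented under TagType#isValidPosition.

```java
// Toy model only: checks a single enclosing tag type (HTML comments).
class EnclosureCheckSketch {

    // Returns true if 'pos' lies inside a terminated <!-- ... --> comment.
    // No preceding tags need to have been parsed; the method searches
    // backwards for the nearest comment opener and then for its terminator.
    static boolean isInsideComment(String doc, int pos) {
        int open = doc.lastIndexOf("<!--", pos - 1);
        if (open < 0) {
            return false; // no comment opener before this position
        }
        int close = doc.indexOf("-->", open + 4);
        // Enclosed only if the comment is terminated at or after 'pos'.
        return close >= 0 && pos < close + 3;
    }
}
```

Each call repeats the backward search, which is the kind of per-tag overhead that a single pass from the beginning of the document avoids.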
The documentation of the {@link TagType#isValidPosition(Source, int pos, int[] fullSequentialParseData)} method, which is called internally by the parser to perform the valid position check, includes a more detailed explanation of the differences between the two modes of operation.
Calling this method a second or subsequent time has no effect.
This method returns the same list of tags as the {@link Source#getAllTags() Source.getAllTags()} method, but as an array instead of a list.
If this method is called after any of the tag search methods are called, the {@linkplain #getCacheDebugInfo() cache} is cleared of any previously found tags before being restocked via the full sequential parse. This means that if you still have references to tags or elements from before the full sequential parse, they will not be the same objects as those that are returned by tag search methods after the full sequential parse, which can cause confusion if you are allocating {@linkplain Tag#setUserData(Object) user data} to tags. It is also significant if the {@link Segment#ignoreWhenParsing()} method has been called since the tags were first found, as any tags inside the ignored segments will no longer be returned by any of the tag search methods.
See also the {@link Tag} class documentation for more general details about how tags are parsed.
@return an array of all {@linkplain Tag tags} in this source document.

Examples of net.htmlparser.jericho.Source.fullSequentialParse()

            if (MainFrame.downloadTomcatFlag.isSelected()) {


                Pattern pattern = Pattern.compile("^http://.*/tomcat/.*bin/apache-tomcat-[[0-9]+\\.]+zip");
                Source source = new Source(new URL("http://tomcat.apache.org/download-70.cgi"));
                source.setLogger(null);
                source.fullSequentialParse();
                List<Element> linkElements = source.getAllElements(HTMLElementName.A);


                for (Element linkElement : linkElements) {
                    String href = linkElement.getAttributeValue("href");
                    if (href != null) {


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

        List<DBpediaResource> entities = new ArrayList<DBpediaResource>();


        try {
            InputStream is = new ByteArrayInputStream(html.getBytes("UTF-8"));
            parser = new Source(is);
            parser.fullSequentialParse();
            parser.getElementById("div");
        } catch (IOException e) {
            throw new AnnotationException("Error reading output from WikiMachine ",e);
        }
        List<Element> KeywordElements = parser.getAllElementsByClass("keywords");


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

     * @return plain text
     */
    public static String getPlainText ( final String html, final String lineSeparator )
    {
        final Source source = new Source ( html );
        final Tag[] tags = source.fullSequentialParse ();
        if ( tags.length > 0 )
        {
            final Renderer renderer = source.getRenderer ();
            renderer.setIncludeHyperlinkURLs ( false );
            renderer.setIncludeAlternateText ( false );


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

    public static boolean hasTag ( final String text, final String tag )
    {
        if ( text != null && text.trim ().length () > 0 )
        {
            final Source source = new Source ( text );
            source.fullSequentialParse ();
            return source.getFirstElement ( tag ) != null;
        }
        else
        {
            return false;


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

    private void loadFirstResource ( final List<ResourceFile> resources, final List<String> xmlContent, final List<String> xmlNames,
                                     final List<ResourceFile> xmlFiles ) throws IOException
    {
        final ResourceFile rf = resources.get ( 0 );
        final Source xmlSource = new Source ( ReflectUtils.getClassSafely ( rf.getClassName () ).getResource ( rf.getSource () ) );
        xmlSource.fullSequentialParse ();


        final Element baseClassTag = xmlSource.getFirstElement ( SkinInfoConverter.CLASS_NODE );
        final String baseClass = baseClassTag != null ? baseClassTag.getContent ().toString () : null;


        for ( final Element includeTag : xmlSource.getAllElements ( SkinInfoConverter.INCLUDE_NODE ) )


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

    MicrosoftConditionalCommentTagTypes.register();
    PHPTagTypes.register();
    PHPTagTypes.PHP_SHORT.deregister(); // remove PHP short tags for this example otherwise they override processing instructions
    MasonTagTypes.register();
    Source source=new Source(rawPage);
    source.fullSequentialParse();


    if (depth==0 || depth==2) {
      List<Element> linkElements=source.getAllElements(HTMLElementName.FRAME);
      for (Element linkElement : linkElements) {
        String link=linkElement.getAttributeValue("src");


Examples of net.htmlparser.jericho.Source.fullSequentialParse()


  }


  public String changeTagCase(String contents, boolean uppercase) {
    Source source = new Source(contents);
    source.fullSequentialParse();
    OutputDocument outputDocument = new OutputDocument(source);
    List<Tag> tags = source.getAllTags();
    int pos = 0;
    for (Tag tag : tags) {
      Element tagElement = tag.getElement();


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

  }


  @Test
  public void extractLinksWithText() throws IOException {
    Source source = new Source(TableOfLinks.getUrl());
    source.fullSequentialParse();
    List<Link> links = ScraperUtil.extractLinks(source.toString());
    log.info("found following links in table: {}", links);
  }


  @Test


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

  @Test
  public void canParseLIWithStrong() {
    String li = "<li><strong> Minimum Term&nbsp;&nbsp;&nbsp;</strong> &nbsp;</li>";


    Source source = new Source(li);
    source.fullSequentialParse();


    String[] parsedOnClosingTag = source.toString().split("</");


    log.info("split on close tag: {} and {}", parsedOnClosingTag[0], parsedOnClosingTag[1]);
    Element liElement = source.getAllElements(HTMLElementName.LI).get(0);


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

   */
  public List<URL> getLinks() throws IOException {
    List<URL> links = new ArrayList<URL>();


    Source source = new Source(url);
    source.fullSequentialParse();
    List<Element> linkElements = source.getAllElements(HTMLElementName.A);
    for (Element linkElement : linkElements) {
      String href = linkElement.getAttributeValue("href");
      if (href == null) {
        continue;



