Package de.jungblut.crawl.extraction.HtmlExtrator

Examples of de.jungblut.crawl.extraction.HtmlExtrator.HtmlFetchResult


    try {
      // Open the connection and read the full response body into a string.
      InputStream connection = getConnection(site);
      String html = consumeStream(connection);
      // Decode HTML entities before extracting the outgoing links.
      html = StringEscapeUtils.unescapeHtml(html);
      final HashSet<String> outlinkSet = extractOutlinks(html, site);
      return new HtmlFetchResult(site, outlinkSet, html);
    } catch (ParserException pEx) {
      // ignore parser exceptions, they contain mostly garbage
    } catch (Exception e) {
      // getMessage() may return null, so guard before truncating to 150 characters.
      String msg = String.valueOf(e.getMessage());
      String errMsg = msg.length() > 150 ? msg.substring(0, 150) : msg;
      // ...
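The excerpt above follows a common crawler pattern: fetch a page, collect its outgoing links, wrap everything in a result object, and log a shortened error message if anything fails. The following is a minimal, self-contained sketch of that pattern, not the library's implementation. The names FetchSketch, FetchResult, shortMessage and toResult are hypothetical, the regex-based link extraction is a simplification standing in for a real HTML parser, and returning null on failure is an assumption, since the excerpt is cut off before that point.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchSketch {

  /** Hypothetical stand-in for HtmlFetchResult: source URL, outlinks, raw HTML. */
  public static final class FetchResult {
    public final String url;
    public final Set<String> outlinks;
    public final String html;

    public FetchResult(String url, Set<String> outlinks, String html) {
      this.url = url;
      this.outlinks = outlinks;
      this.html = html;
    }
  }

  // Simplification: matches only double-quoted absolute http(s) href values.
  private static final Pattern HREF =
      Pattern.compile("href=\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

  /** Collects absolute http(s) links from href attributes, in document order. */
  public static Set<String> extractOutlinks(String html) {
    Set<String> links = new LinkedHashSet<>();
    Matcher m = HREF.matcher(html);
    while (m.find()) {
      links.add(m.group(1));
    }
    return links;
  }

  /** Null-safe truncation of an exception message to at most max characters. */
  public static String shortMessage(Throwable t, int max) {
    String msg = String.valueOf(t.getMessage());
    return msg.length() > max ? msg.substring(0, max) : msg;
  }

  /** Builds a result from already-fetched HTML; returns null on failure (assumption). */
  public static FetchResult toResult(String url, String html) {
    try {
      return new FetchResult(url, extractOutlinks(html), html);
    } catch (Exception e) {
      System.err.println("Failed to process " + url + ": " + shortMessage(e, 150));
      return null;
    }
  }

  public static void main(String[] args) {
    String html = "<a href=\"http://example.com/a\">a</a> "
        + "<a href=\"http://example.com/b\">b</a>";
    FetchResult result = toResult("http://example.com", html);
    System.out.println(result.outlinks);
  }
}
```

Keeping the message truncation in its own null-safe helper avoids a NullPointerException when an exception carries no message, which the inline ternary in the excerpt does not guard against.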



Copyright © 2018 www.massapicom. All rights reserved.
All source code is the property of its respective owners. Java is a trademark of Sun Microsystems, Inc. and owned by ORACLE Inc. Contact coftware#gmail.com.