Package de.l3s.boilerpipe.sax

Examples of de.l3s.boilerpipe.sax.HTMLDocument


                // Reuse previously extracted text when this URL has been seen before.
                String text;
                if (cache.contains(url.toString())) {
                    text = cache.get(url.toString());
                    logger.debug("  Fetched from cache:" + url.toString());
                } else {
                    // Download the page, parse it into a TextDocument, then extract the article text.
                    HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
                    TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
                    text = ArticleExtractor.INSTANCE.getText(doc);
                    cache.put(url.toString(), text);
                    logger.debug("Fetched from web:" + url.toString());
                }
                if (text.length() < 100) {
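The snippet above follows a check-the-cache-then-fetch pattern. A minimal, library-free sketch of that pattern is shown here. The class name `CachedTextLookup`, the `misses()` counter, and the `Function<String, String>` extractor are hypothetical stand-ins introduced for illustration; in the original snippet the extractor role is played by `HTMLFetcher.fetch`, `BoilerpipeSAXInput`, and `ArticleExtractor`, and the type of `cache` is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: caches extracted text per URL string.
public class CachedTextLookup {
    private final Map<String, String> cache = new HashMap<>();
    // Assumption: stands in for the fetch + parse + extract steps of the snippet.
    private final Function<String, String> extractor;
    private int misses = 0;

    public CachedTextLookup(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    // Returns the cached text if present; otherwise extracts it and stores it.
    public String textFor(String url) {
        String cached = cache.get(url);
        if (cached != null) {
            return cached;
        }
        misses++;
        String text = extractor.apply(url);
        cache.put(url, text);
        return text;
    }

    // Number of times the extractor had to be invoked.
    public int misses() {
        return misses;
    }

    public static void main(String[] args) {
        CachedTextLookup lookup = new CachedTextLookup(u -> "text for " + u);
        System.out.println(lookup.textFor("http://example.com/a"));
        System.out.println(lookup.textFor("http://example.com/a"));
        System.out.println("extractor calls: " + lookup.misses());
    }
}
```

Keying the cache on the URL string, as the original does with `url.toString()`, avoids relying on `java.net.URL.equals`, which may perform host name resolution.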


                }
                in.close();

                // Snapshot everything buffered so far and wrap it together with the charset cs.
                final byte[] data = bos.toByteArray();

                return new HTMLDocument(data, cs);
        }
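The fragment above ends by turning a buffered stream into a `byte[]` before constructing the `HTMLDocument`. The read loop itself is cut off in the excerpt, so the following is only a sketch of how such buffering is commonly written with the standard library. The class `StreamBuffering` and method `readAll` are hypothetical names, and the 4096-byte buffer size is an arbitrary choice; the sketch stops at the byte array and does not depend on boilerpipe.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: drains an InputStream into memory.
public class StreamBuffering {

    // Reads the stream to its end, closes it, and returns all bytes read.
    static byte[] readAll(InputStream in) {
        try (InputStream src = in;
             ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int r;
            while ((r = src.read(buf)) != -1) {
                bos.write(buf, 0, r);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            // Wrapped so callers do not need a checked-exception handler.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset cs = StandardCharsets.UTF_8;
        byte[] html = "<html><body>hi</body></html>".getBytes(cs);
        byte[] data = readAll(new ByteArrayInputStream(html));
        // In the original fragment, data and cs are then passed to
        // new HTMLDocument(data, cs).
        System.out.println(new String(data, cs));
    }
}
```

The try-with-resources form closes the input stream even when a read fails, which the explicit `in.close()` in the fragment does only on the success path shown.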


