Examples of Fetcher

cn.edu.hfut.dmic.webcollector.fetcher.Fetcher
抓取器 @author hu
com.volantis.xml.pipeline.sax.drivers.uri.Fetcher
Fetches content from a URL and inserts it into the pipeline.
net.azib.ipscan.fetchers.Fetcher
Interface of all IP Fetchers. Fetcher is responsible for gathering a certain type of information about the provided scanning subject (in GUI terms, Fetcher is a column in the results list). Fetchers do the actual information fetching about each scanned IP address. Instances of this classes are shared among all the threads, so implementations must be thread safe and stateless. @author Anton Keks
org.apache.hadoop.mapred.CoronaJTState.Fetcher
org.apache.nutch.fetcher.Fetcher
A queue-based fetcher.
This fetcher uses a well-known model of one producer (a QueueFeeder) and many consumers (FetcherThread-s).
QueueFeeder reads input fetchlists and populates a set of FetchItemQueue-s, which hold FetchItem-s that describe the items to be fetched. There are as many queues as there are unique hosts, but at any given time the total number of fetch items in all queues is less than a fixed number (currently set to a multiple of the number of threads).
As items are consumed from the queues, the QueueFeeder continues to add new input items, so that their total count stays fixed (FetcherThread-s may also add new items to the queues e.g. as a results of redirection) - until all input items are exhausted, at which point the number of items in the queues begins to decrease. When this number reaches 0 fetcher will finish.
This fetcher implementation handles per-host blocking itself, instead of delegating this work to protocol-specific plugins. Each per-host queue handles its own "politeness" settings, such as the maximum number of concurrent requests and crawl delay between consecutive requests - and also a list of requests in progress, and the time the last request was finished. As FetcherThread-s ask for new items to be fetched, queues may return eligible items or null if for "politeness" reasons this host's queue is not yet ready.
If there are still unfetched items in the queues, but none of the items are ready, FetcherThread-s will spin-wait until either some items become available, or a timeout is reached (at which point the Fetcher will abort, assuming the task is hung). @author Andrzej Bialecki
org.apache.tez.runtime.library.shuffle.common.Fetcher
Responsible for fetching inputs served by the ShuffleHandler for a single host. Construct using {@link FetcherBuilder}
org.fenixedu.academic.ui.struts.action.teacher.siteArchive.Fetcher
The Fetcher manages a queue of {@link org.fenixedu.academic.ui.struts.action.teacher.siteArchive.Resource} and it'sresponsible for retrieving and transforming each resource in the queue.
Each resource is retrieved by creating a new RequestDispatcher to the resource's url and by forwarding the request to that dispatcher. The current request and response are wrapped to avoid unwanted secondary effects and to allow the called to generate it's own content to the user.
If the resource is an HTML page then url's present in the page are transformed using the resource's rules. @author cfgi
org.stringtree.Fetcher

Examples of net.azib.ipscan.fetchers.Fetcher

    
    result.setType(subject.getResultType());
  }


  public void interrupt(Thread thread) {
    Fetcher fetcher = currentFetchers.get(thread.getId());
    if (fetcher != null) fetcher.cleanup();
  }

View Full Code Here

Examples of org.apache.hadoop.mapred.CoronaJTState.Fetcher

      parentClient = RPC.waitForProxy(
          InterCoronaJobTrackerProtocol.class,
          InterCoronaJobTrackerProtocol.versionID, parentAddr, conf,
          RemoteJTProxy.getRemotJTTimeout(conf));
      // Fetch saved state and prepare for application
      stateFetcher = new Fetcher(parentClient, jtAttemptId);
      // Remote JT should ask local for permission to commit
      commitPermissionClient = new CommitPermissionClient(attemptId,
          parentAddr, conf);
    } else {
      stateFetcher = new Fetcher();
      commitPermissionClient = new CommitPermissionClient();
    }
    // Start with dummy submitter until saved state is restored
    localJTSubmitter = new Submitter();

View Full Code Here

Examples of org.apache.hadoop.mapred.CoronaJTState.Fetcher

    this.maxEventsPerRpc = conf.getInt(TASK_COMPLETION_EVENTS_PER_RPC, 100);
    this.conf = conf;
    this.trackerStats = new TrackerStats(conf);
    this.parentAddr = null;
    this.fs = FileSystem.get(conf);
    this.stateFetcher = new Fetcher();
    this.localJTSubmitter = new Submitter();
    this.jtAttemptId = null;
    this.commitPermissionClient = new CommitPermissionClient();
    
    initializePJTClient();

View Full Code Here

Examples of org.apache.nutch.fetcher.Fetcher

    Path index = new Path(dir + "/index");


    Path tmpDir = job.getLocalPath("crawl"+Path.SEPARATOR+getDate());
    Injector injector = new Injector(conf);
    Generator generator = new Generator(conf);
    Fetcher fetcher = new Fetcher(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);
    LinkDb linkDbTool = new LinkDb(conf);
      
    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);
    int i;
    for (i = 0; i < depth; i++) {             // generate new segment
      Path[] segs = generator.generate(crawlDb, segments, -1, topN, System
          .currentTimeMillis());
      if (segs == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segs[0], threads, org.apache.nutch.fetcher.Fetcher.isParsing(conf));  // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segs[0]);    // parse it, if needed
      }
      crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
    }

View Full Code Here

Examples of org.apache.nutch.fetcher.Fetcher

    Path index = new Path(dir + "/index");


    Path tmpDir = job.getLocalPath("crawl"+Path.SEPARATOR+getDate());
    Injector injector = new Injector(conf);
    Generator generator = new Generator(conf);
    Fetcher fetcher = new Fetcher(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);
    LinkDb linkDbTool = new LinkDb(conf);
    Indexer indexer = new Indexer(conf);
    DeleteDuplicates dedup = new DeleteDuplicates(conf);
    IndexMerger merger = new IndexMerger(conf);
      
    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);
    int i;
    for (i = 0; i < depth; i++) {             // generate new segment
      Path segment = generator.generate(crawlDb, segments, -1, topN, System
          .currentTimeMillis(), false, false);
      if (segment == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segment, threads);  // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segment);    // parse it, if needed
      }
      crawlDbTool.update(crawlDb, new Path[]{segment}, true, true); // update crawldb
    }

View Full Code Here

Examples of org.apache.nutch.fetcher.Fetcher

    Path index = new Path(dir + "/index");


    Path tmpDir = job.getLocalPath("crawl"+Path.SEPARATOR+getDate());
    Injector injector = new Injector(conf);
    Generator generator = new Generator(conf);
    Fetcher fetcher = new Fetcher(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);
    LinkDb linkDbTool = new LinkDb(conf);
    Indexer indexer = new Indexer(conf);
    DeleteDuplicates dedup = new DeleteDuplicates(conf);
    IndexMerger merger = new IndexMerger(conf);
      
    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);
    int i;
    for (i = 0; i < depth; i++) {             // generate new segment
      Path segment = generator.generate(crawlDb, segments, -1, topN, System
          .currentTimeMillis(), false, false);
      if (segment == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segment, threads);  // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segment);    // parse it, if needed
      }
      crawlDbTool.update(crawlDb, new Path[]{segment}, true, true); // update crawldb
    }

View Full Code Here

Examples of org.apache.nutch.fetcher.Fetcher

     */
    public Nutch9Fetcher() {
        Config config = Config.getConfig("nutchfetcher.properties");
        segmentsDir = config.getString("fetchlist.dir");
        keepUrl = config.getBoolean("keep.original.url.on.redirect");
        fetcher = new Fetcher();
        Configuration conf = new Configuration();
        // conf.addDefaultResource("crawl-tool.xml");
        conf.addDefaultResource("nutch-default.xml");
        conf.addDefaultResource("nutch-site.xml");
        JobConf job = new NutchJob(conf);

View Full Code Here

Examples of org.apache.nutch.fetcher.Fetcher


            // generate
            Generator g = new Generator(conf);
            // fetch
            conf.setBoolean("fetcher.parse", true);
            Fetcher fetcher = new Fetcher(conf);
            CrawlDb crawlDbTool = new CrawlDb(conf);


            int depth = 5;
            int threads = 4;
            for (int i = 0; i < depth; i++) { // generate new segment
                Path generatedSegment = g.generate(crawldbPath, segmentsPath, 1, Long.MAX_VALUE, Long.MAX_VALUE, false,
                        false);


                if (generatedSegment == null) {
                    logger.info("Stopping at depth=" + i + " - no more URLs to fetch.");
                    break;
                }
                fetcher.fetch(generatedSegment, threads, true);
                crawlDbTool.update(crawldbPath, new Path[] { generatedSegment }, true, true);
            }
        } catch (IOException e) {
            logger.error("Exception while crawling", e);
        }

View Full Code Here

Examples of org.apache.nutch.fetcher.Fetcher

    Path index = new Path(dir + "/index");


    Path tmpDir = job.getLocalPath("crawl"+Path.SEPARATOR+getDate());
    Injector injector = new Injector(conf);
    Generator generator = new Generator(conf);
    Fetcher fetcher = new Fetcher(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);
    LinkDb linkDbTool = new LinkDb(conf);
    Indexer indexer = new Indexer(conf);
    DeleteDuplicates dedup = new DeleteDuplicates(conf);
    IndexMerger merger = new IndexMerger(conf);
      
    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);
    int i;
    for (i = 0; i < depth; i++) {             // generate new segment
      Path segment = generator.generate(crawlDb, segments, -1, topN, System
          .currentTimeMillis());
      if (segment == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segment, threads, org.apache.nutch.fetcher.Fetcher.isParsing(conf));  // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segment);    // parse it, if needed
      }
      crawlDbTool.update(crawlDb, new Path[]{segment}, true, true); // update crawldb
    }

View Full Code Here

Examples of org.apache.nutch.fetcher.Fetcher

    Path index = new Path(dir + "/index");


    Path tmpDir = job.getLocalPath("crawl"+Path.SEPARATOR+getDate());
    Injector injector = new Injector(getConf());
    Generator generator = new Generator(getConf());
    Fetcher fetcher = new Fetcher(getConf());
    ParseSegment parseSegment = new ParseSegment(getConf());
    CrawlDb crawlDbTool = new CrawlDb(getConf());
    LinkDb linkDbTool = new LinkDb(getConf());
      
    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);
    int i;
    for (i = 0; i < depth; i++) {             // generate new segment
      Path[] segs = generator.generate(crawlDb, segments, -1, topN, System
          .currentTimeMillis());
      if (segs == null) {
        LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
        break;
      }
      fetcher.fetch(segs[0], threads);  // fetch it
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segs[0]);    // parse it, if needed
      }
      crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
    }

View Full Code Here

0 1 2 3 4

TOP

All source code are property of their respective owners. Java is a trademark of Sun Microsystems, Inc and owned by ORACLE Inc. Contact coftware#gmail.com.