Examples of com.datasalt.pangool.tuplemr.TupleMRBuilder$Input

com.datasalt.pangool.tuplemr.TupleMRBuilder
TupleMRBuilder creates Tuple-based Map-Reduce jobs.
One of the key concepts of Tuple-based Map-Reduce is that Hadoop Key-Value pairs are no longer used. Instead, they are replaced by tuples.
Tuples (see {@link ITuple}) are just ordered lists of elements whose types are defined in a {@link Schema}. TupleMRBuilder contains several methods to define how grouping and sorting among tuples will be performed, avoiding the complex task of defining custom binary {@link SortComparator}, {@link GroupComparator} and {@link TupleHashPartitioner} implementations.
A Tuple-based Map-Reduce job, in its simplest form, requires defining:
- Intermediate schemas:
  A schema specifies the names and types of a Tuple's fields. Several schemas can be defined in order to perform joins among different input data. It is mandatory to specify at least one schema using {@link #addIntermediateSchema(Schema)}.
- Group-by fields:
  These specify how the tuples will be grouped. Tuples with the same values in the group-by fields are grouped and reduced together in the Reduce phase.
- Tuple-based Mapper:
  The job needs to specify a {@link TupleMapper} instance, the Tuple-based implementation of Hadoop's {@link Mapper}. Unlike Hadoop's Mappers, Tuple-based mappers are configured using stateful serializable instances and not static class definitions.
- Tuple-based Reducer:
  Similar to mapper instances, the job needs to specify a {@link TupleReducer} instance, the Tuple-based implementation of Hadoop's {@link Reducer}.
@see ITuple @see Schema @see TupleMapper @see TupleReducer
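
The grouping and sorting contract described above can be illustrated without any Pangool classes. The following is a minimal plain-Java sketch, not Pangool code: the `Row` class and the `groupSorted` helper are hypothetical names invented for this illustration. It sorts rows by (intField, strField, longField) and then splits them into groups keyed by (intField, strField), which mimics what a reducer would observe: one group per distinct key, with the remaining field arriving in sorted order inside each group.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupSortSketch {

  /** Hypothetical stand-in for a tuple with three fields. */
  static final class Row {
    final int intField;
    final String strField;
    final long longField;

    Row(int intField, String strField, long longField) {
      this.intField = intField;
      this.strField = strField;
      this.longField = longField;
    }
  }

  /**
   * Sorts rows by (intField, strField, longField) ascending, then groups them
   * by (intField, strField). Returns, per group key, the longField values in
   * the order a reducer would see them.
   */
  static Map<String, List<Long>> groupSorted(List<Row> rows) {
    List<Row> sorted = new ArrayList<>(rows);
    sorted.sort(Comparator.comparingInt((Row r) -> r.intField)
        .thenComparing(r -> r.strField)
        .thenComparingLong(r -> r.longField));

    // LinkedHashMap keeps the groups in the sorted order in which they appear.
    Map<String, List<Long>> groups = new LinkedHashMap<>();
    for (Row r : sorted) {
      String key = r.intField + "/" + r.strField;
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(r.longField);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Row> rows = new ArrayList<>();
    rows.add(new Row(2, "b", 7L));
    rows.add(new Row(1, "a", 9L));
    rows.add(new Row(1, "a", 3L));
    rows.add(new Row(2, "b", 1L));
    System.out.println(groupSorted(rows));
  }
}
```

The sort uses more fields than the grouping does. That is the essence of a secondary sort: the group-by fields form a prefix of the sort criteria, and the extra sort field only orders the values inside each group.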

    fields.add(Field.create("strField", Type.STRING));
    fields.add(Field.create("longField", Type.LONG));
    fields.add(Field.create("doubleField", Type.DOUBLE));
    Schema schema = new Schema("schema", fields);


    TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Secondary Sort");
    mr.addIntermediateSchema(schema);
    mr.setGroupByFields("intField", "strField");
    mr.setOrderBy(new OrderBy().add("intField", Order.ASC).add("strField", Order.ASC).add("longField", Order.ASC));
    mr.setTupleReducer(new Handler());
    mr.addInput(new Path(input), new HadoopInputFormat(TextInputFormat.class), new IProcessor());
    mr.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class), Text.class,
        DoubleWritable.class);
    mr.createJob().waitForCompletion(true);
    return 1;
  }


    fields.add(Field.create("date", Type.STRING));
    fields.add(Field.create("hashtag", Type.STRING));
    fields.add(Field.create("count", Type.INT));
    Schema schema = new Schema("my_schema", fields);


    TupleMRBuilder mr = new TupleMRBuilder(conf);
    mr.addIntermediateSchema(schema);
    mr.setGroupByFields("location", "date", "hashtag");
    mr.setOrderBy(new OrderBy().add("location", Order.ASC).add("date", Order.ASC).add("hashtag", Order.ASC));
    mr.setRollupFrom("date");
    // Input / output and such
    mr.setTupleReducer(new TweetsHandler(n));
    mr.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class), Text.class, NullWritable.class);
    mr.addInput(new Path(input), new HadoopInputFormat(TextInputFormat.class), new TweetsProcessor());
    mr.createJob().waitForCompletion(true);
    return 0;
  }


    fields.add(Field.create("date", Type.STRING));
    fields.add(Field.create("visits", Type.INT));


    Schema schema = new Schema("my_schema", fields);


    TupleMRBuilder mr = new TupleMRBuilder(conf);
    mr.addIntermediateSchema(schema);
    mr.setGroupByFields("url");
    mr.setOrderBy(new OrderBy().add("url", Order.ASC).add("date", Order.ASC));
    // Input / output and such
    mr.setTupleReducer(new MovingAverageHandler(nDaysAverage));
    mr.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class), Text.class, NullWritable.class);
    mr.addInput(new Path(input), new HadoopInputFormat(TextInputFormat.class), new URLVisitsProcessor());
    mr.createJob().waitForCompletion(true);
    return 1;
  }


    fields.add(Field.create("all", Type.BOOLEAN));
    fields.add(Field.create("clicks", Type.INT));


    Schema schema = new Schema("my_schema", fields);


    TupleMRBuilder mr = new TupleMRBuilder(conf);
    mr.addIntermediateSchema(schema);
    mr.setGroupByFields("user", "all", "feature");
    mr.setOrderBy(new OrderBy().add("user", Order.ASC).add("all", Order.DESC).add("feature", Order.ASC));
    // Rollup from "user" - all features from same user will go to the same Reducer
    mr.setRollupFrom("user");
    // Input / output and such
    mr.setTupleCombiner(new CountCombinerHandler());
    mr.setTupleReducer(new NormalizingHandler());
    mr.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class), Text.class, NullWritable.class);
    mr.addInput(new Path(input), new HadoopInputFormat(TextInputFormat.class), new UserActivityProcessor());
    mr.createJob().waitForCompletion(true);
    
    return 1;
  }


    }
    String inputExamples = args[0];
    String output = args[1];
    delete(output);


    TupleMRBuilder job = new TupleMRBuilder(conf, "Naive Bayes Model Generator");
    job.addIntermediateSchema(INTERMEDIATE_SCHEMA);
    // perform per-category word count mapping
    job.addInput(new Path(inputExamples), new HadoopInputFormat(TextInputFormat.class),
        new TupleMapper<LongWritable, Text>() {


          ITuple tuple = new Tuple(INTERMEDIATE_SCHEMA);


          @Override
          public void map(LongWritable toIgnore, Text value, TupleMRContext context, Collector collector)
              throws IOException, InterruptedException {


            Category category = Category.valueOf(value.toString().split("\t")[0]);
            StringTokenizer itr = new StringTokenizer(value.toString().split("\t")[1]);
            tuple.set("category", category);
            tuple.set("count", 1);
            while(itr.hasMoreTokens()) {
              tuple.set("word", normalizeWord(itr.nextToken()));
              collector.write(tuple);
            }
          }
        });


    TupleReducer countReducer = new TupleReducer<ITuple, NullWritable>() {


      public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
          throws IOException, InterruptedException, TupleMRException {
        int count = 0;
        ITuple outputTuple = null;
        for(ITuple tuple : tuples) {
          count += (Integer) tuple.get("count");
          outputTuple = tuple;
        }
        outputTuple.set("count", count);
        collector.write(outputTuple, NullWritable.get());
      }
    };
    job.setTupleCombiner(countReducer);
    job.setTupleReducer(countReducer);
    job.setGroupByFields("word", "category");
    job.setTupleOutput(new Path(output), INTERMEDIATE_SCHEMA);
    if(job.createJob().waitForCompletion(true)) {
      return 1;
    }
    return -1;
  }
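
The snippet above registers the same instance as both combiner and reducer. That works because summing counts is associative: partial sums computed per split can be summed again without changing the result. The following plain-Java sketch (it does not use Pangool; `sumByKey`, `combineThenReduce`, `one` and the sample keys are hypothetical names chosen for illustration) shows the idea, with a string key standing in for a (word, category) group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CountCombineSketch {

  /** Sums the counts of all pairs sharing a key; the key stands in for a (word, category) group. */
  static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
    Map<String, Integer> totals = new TreeMap<>();
    for (Map.Entry<String, Integer> p : pairs) {
      totals.merge(p.getKey(), p.getValue(), Integer::sum);
    }
    return totals;
  }

  /** Applies sumByKey to each split (combiner step), then once more to the partial sums (reducer step). */
  static Map<String, Integer> combineThenReduce(List<List<Map.Entry<String, Integer>>> splits) {
    List<Map.Entry<String, Integer>> partials = new ArrayList<>();
    for (List<Map.Entry<String, Integer>> split : splits) {
      partials.addAll(sumByKey(split).entrySet());
    }
    return sumByKey(partials);
  }

  /** Emits a (key, 1) pair, as a word-count style mapper would. */
  static Map.Entry<String, Integer> one(String key) {
    return new SimpleEntry<>(key, 1);
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> split1 =
        Arrays.asList(one("good/POSITIVE"), one("bad/NEGATIVE"), one("good/POSITIVE"));
    List<Map.Entry<String, Integer>> split2 =
        Arrays.asList(one("good/POSITIVE"), one("bad/NEGATIVE"));

    // Reducing everything at once and combining per split first give the same totals.
    List<Map.Entry<String, Integer>> all = new ArrayList<>(split1);
    all.addAll(split2);
    System.out.println(sumByKey(all));
    System.out.println(combineThenReduce(Arrays.asList(split1, split2)));
  }
}
```

An operation that is not associative, such as averaging, could not be reused as a combiner this way without carrying extra state (for example a sum and a count).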



    delete(args[1]);
    // Parse the size of the Top
    Integer n = Integer.parseInt(args[2]);


    TupleMRBuilder builder = new TupleMRBuilder(conf, "Pangool Topic Fingerprint From Topical Word Count");
    builder.addIntermediateSchema(TopicalWordCount.getSchema());
    // We need to group the counts by (topic)
    builder.setGroupByFields("topic");
    // Then we need to sort by topic and count (DESC) -> This way we will receive the most relevant words first.
    builder.setOrderBy(new OrderBy().add("topic", Order.ASC).add("count", Order.DESC));
    // Note that we are changing the grouping logic in the job configuration.
    // However, as we work with tuples, we don't need to write specific code for grouping the same data differently,
    // so an IdentityTupleMapper is sufficient for this job.
    builder.addTupleInput(new Path(args[0]), new IdentityTupleMapper()); // Note the use of "addTupleInput"
    /*
     * TODO Add Combiner as same Reducer when possible
     */
    builder.setTupleOutput(new Path(args[1]), TopicalWordCount.getSchema());
    builder.addNamedTupleOutput(OUTPUT_TOTALCOUNT, getOutputCountSchema());
    builder.setTupleReducer(new TopNWords(n));


    builder.createJob().waitForCompletion(true);


    return 1;
  }


    }


    delete(args[1]);
    List<String> stopWords = Files.readLines(new File(args[2]), Charset.forName("UTF-8"));


    TupleMRBuilder cg = new TupleMRBuilder(conf, "Pangool Topical Word Count With Stop Words");
    cg.addIntermediateSchema(TopicalWordCount.getSchema());
    // We will count each (topicId, word) pair
    // Note that the order in which we defined the fields of the Schema is not relevant here
    cg.setGroupByFields("topic", "word");
    // Here we instantiate a mapper with stop words:
    // Note that we don't need to use the DistributedCache for that because mappers, reducers, etc. are themselves instantiable
    StopWordMapper mapper = new StopWordMapper(stopWords);
    cg.addInput(new Path(args[0]), new HadoopInputFormat(TextInputFormat.class), mapper);
    // We'll use a TupleOutputFormat with the same schema as the intermediate schema
    cg.setTupleOutput(new Path(args[1]), TopicalWordCount.getSchema());
    cg.setTupleReducer(new CountReducer());
    cg.setTupleCombiner(new CountReducer());


    cg.createJob().waitForCompletion(true);


    return 1;
  }


    String input = args[0];
    String output = args[1];
    
    delete(output);
    
    TupleMRBuilder builder = new TupleMRBuilder(conf);
    builder.addIntermediateSchema(getSchema());
    builder.setGroupByFields("first");
    builder.setOrderBy(new OrderBy().add("first", Order.ASC).add("second", Order.ASC));
    // Input / output and such
    builder.setTupleReducer(new Handler());
    builder.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class), Text.class, NullWritable.class);
    builder.addInput(new Path(input), new HadoopInputFormat(TextInputFormat.class), new IProcessor());
    builder.createJob().waitForCompletion(true);


    return 1;
  }


    Path tweetsPath = new Path(args[0]);
    Path retweetsPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);
    delete(outputPath.toString());


    TupleMRBuilder mr = new TupleMRBuilder(conf, "AvroTweetsJoin");
    mr.addIntermediateSchema(getPangoolTweetSchema());
    mr.addIntermediateSchema(getPangoolRetweetSchema());
    mr.setGroupByFields("tweet_id");
    mr.setOrderBy(new OrderBy().add("tweet_id", Order.ASC).addSchemaOrder(Order.ASC));


    mr.addInput(tweetsPath, new AvroInputFormat<Record>(getAvroTweetSchema()), new TweetsMapper());
    mr.addInput(retweetsPath, new HadoopInputFormat(TextInputFormat.class), new RetweetsMapper());
    mr.setOutput(outputPath, new AvroOutputFormat<Record>(getAvroOutputSchema()), AvroWrapper.class, NullWritable.class);


    mr.setTupleReducer(new Red());


    Job job = mr.createJob();
    job.waitForCompletion(true);


    return 0;
  }
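
The join above declares two intermediate schemas, groups by "tweet_id" and adds a schema order to the sort. The sketch below is a plain-Java illustration of that kind of reduce-side join, not Pangool code. It assumes, as the `addSchemaOrder(Order.ASC)` call suggests but the snippet does not confirm, that within each group the tuples of the first declared schema arrive before those of the second. The `Rec` class, the numeric `schemaId` and the `join` helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SchemaOrderJoinSketch {

  /** Hypothetical record: schemaId 0 marks a tweet, 1 marks a retweet. */
  static final class Rec {
    final long tweetId;
    final int schemaId;
    final String payload;

    Rec(long tweetId, int schemaId, String payload) {
      this.tweetId = tweetId;
      this.schemaId = schemaId;
      this.payload = payload;
    }
  }

  /**
   * Sorts by (tweetId, schemaId) so that each tweet precedes its retweets,
   * then pairs every retweet with the tweet seen earlier in the same group.
   * Retweets without a matching tweet are dropped (inner join).
   */
  static List<String> join(List<Rec> input) {
    List<Rec> sorted = new ArrayList<>(input);
    sorted.sort(Comparator.comparingLong((Rec r) -> r.tweetId)
        .thenComparingInt(r -> r.schemaId));

    List<String> out = new ArrayList<>();
    boolean inGroup = false;
    long currentId = 0L;
    String currentTweet = null;
    for (Rec r : sorted) {
      if (!inGroup || r.tweetId != currentId) {
        // A new group starts: forget the tweet of the previous group.
        inGroup = true;
        currentId = r.tweetId;
        currentTweet = null;
      }
      if (r.schemaId == 0) {
        currentTweet = r.payload;
      } else if (currentTweet != null) {
        out.add(currentTweet + " <- " + r.payload);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Rec> input = new ArrayList<>();
    input.add(new Rec(2L, 1, "rt-by-bob"));
    input.add(new Rec(1L, 0, "hello"));
    input.add(new Rec(1L, 1, "rt-by-ann"));
    input.add(new Rec(2L, 0, "world"));
    input.add(new Rec(3L, 1, "orphan"));
    System.out.println(join(input));
  }
}
```

Because the tweet comes first in each group, the join only needs to hold one tweet at a time rather than buffering a whole group.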


      return -1;
    }


    delete(args[1]);


    TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Topical Word Count");
    mr.addIntermediateSchema(getSchema());
    mr.setGroupByFields("my_avro");
    // Here the custom comparator that groups by "topic,word" is used.
    MyAvroComparator customComp = new MyAvroComparator(getAvroSchema(),"topic","word");
    mr.setOrderBy(new OrderBy().add("my_avro",Order.ASC,customComp));
    mr.addInput(new Path(args[0]), new HadoopInputFormat(TextInputFormat.class), new TokenizeMapper());
    // We'll use a TupleOutputFormat with the same schema as the intermediate schema
    mr.setTupleOutput(new Path(args[1]), getSchema());
    mr.setTupleReducer(new CountReducer());
    mr.setTupleCombiner(new CountReducer());


    mr.createJob().waitForCompletion(true);


    return 1;
  }



Related Classes of com.datasalt.pangool.tuplemr.TupleMRBuilder$Input

com.datasalt.pangool.examples.avro.AvroCustomSerializationJob

com.datasalt.pangool.examples.avro.AvroTopicalWordCount

com.datasalt.pangool.examples.avro.AvroTweetsJoin

com.datasalt.pangool.examples.gameoflife.GameOfLifeJob

com.datasalt.pangool.examples.movingaverage.MovingAverage

com.datasalt.pangool.examples.naivebayes.NaiveBayesGenerate

com.datasalt.pangool.examples.secondarysort.SecondarySort

com.datasalt.pangool.examples.simplesecondarysort.SimpleSecondarySort

com.datasalt.pangool.examples.solr.MultiShakespeareIndexer

com.datasalt.pangool.examples.topicalwordcount.TopicalWordCount
