Examples of cascading.pipe.GroupBy

cascading.pipe.GroupBy
The GroupBy pipe groups the {@link Tuple} stream by the given groupFields.
If more than one {@link Pipe} instance is passed to the constructor, all branches will be merged. All Pipe instances must output the same field names; otherwise the {@link cascading.flow.FlowConnector} will fail to create a {@link cascading.flow.Flow} instance. Again, the Pipe instances are merged together as if they were one Tuple stream; they are not joined. See {@link CoGroup} for joining by common fields.
Typically an {@link Every} follows GroupBy to apply an {@link Aggregator} function to every grouping. The {@link Each} operator may also follow GroupBy to apply a {@link Function} or {@link Filter} to the resulting stream. But an Each cannot come immediately before an Every.
Optionally, a stream can be further sorted by providing sortFields. This allows an Aggregator to receive values in the order of the sortFields.
Note that local sorting always happens on the groupFields; sortFields are a secondary sort on the grouped values within the current grouping. sortFields is particularly useful when the Aggregators following the GroupBy need to see their arguments in order.
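The relationship between grouping and secondary sorting can be pictured with a small plain-Java sketch. It does not use Cascading at all: the class and method names are hypothetical, the sorted map keys stand in for groupFields, and the per-group value sort stands in for sortFields.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names are Cascading API.
final class SecondarySortSketch
  {
  // Groups (key, value) pairs by key, with keys kept in sorted order,
  // then sorts the values inside each group as a secondary sort.
  static Map<String, List<Integer>> groupAndSort( List<Map.Entry<String, Integer>> tuples )
    {
    Map<String, List<Integer>> groups = new TreeMap<>();

    for( Map.Entry<String, Integer> tuple : tuples )
      groups.computeIfAbsent( tuple.getKey(), key -> new ArrayList<>() ).add( tuple.getValue() );

    for( List<Integer> values : groups.values() )
      values.sort( Comparator.naturalOrder() );

    return groups;
    }

  public static void main( String[] args )
    {
    List<Map.Entry<String, Integer>> tuples = List.of(
      Map.entry( "b", 3 ), Map.entry( "a", 2 ), Map.entry( "b", 1 ), Map.entry( "a", 5 ) );

    // prints {a=[2, 5], b=[1, 3]}
    System.out.println( groupAndSort( tuples ) );
    }
  }
```

An aggregator walking the values of group "b" in this sketch would see 1 before 3, which is the ordering guarantee sortFields is described as providing.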
For more control over sorting at the group or secondary sort level, use {@link cascading.tuple.Fields} containing {@link java.util.Comparator} instances for the appropriate fields when setting the groupFields or sortFields values. Fields allows you to set a custom {@link java.util.Comparator} instance for each field name or position. It is required that each Comparator class also be {@link java.io.Serializable}.
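As a minimal sketch of the Serializable requirement, the comparator below reverses the natural String order and implements both interfaces. The class name is hypothetical, and attaching the comparator to a Fields instance is deliberately not shown here.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical comparator for illustration; not part of Cascading.
final class ReverseStringComparator implements Comparator<String>, Serializable
  {
  private static final long serialVersionUID = 1L;

  @Override
  public int compare( String lhs, String rhs )
    {
    return rhs.compareTo( lhs ); // reverse of the natural String order
    }

  public static void main( String[] args )
    {
    List<String> values = new ArrayList<>( List.of( "b", "a", "c" ) );

    values.sort( new ReverseStringComparator() );

    // prints [c, b, a]
    System.out.println( values );
    }
  }
```

A lambda or anonymous Comparator would usually not satisfy the requirement by itself, which is why a named class declaring both interfaces is used in this sketch.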
It should be noted that for MapReduce systems, distributed group sorting is not 'total'. That is, groups are sorted as seen by each Reducer, but they are not sorted across Reducers. See the MapReduce algorithm for details.
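The difference between per-Reducer ordering and a total ordering can be illustrated without any MapReduce machinery. The sketch below is plain Java with hypothetical names; it assumes keys are assigned to partitions by hash code, which is a simplification rather than a description of any particular partitioner.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration only; not Cascading or Hadoop API.
final class PartitionedSortSketch
  {
  // Assigns each key to a partition by hash code, then sorts every partition independently.
  static List<List<String>> partitionAndSort( List<String> keys, int numPartitions )
    {
    List<List<String>> partitions = new ArrayList<>();

    for( int i = 0; i < numPartitions; i++ )
      partitions.add( new ArrayList<>() );

    for( String key : keys )
      partitions.get( Math.floorMod( key.hashCode(), numPartitions ) ).add( key );

    for( List<String> partition : partitions )
      Collections.sort( partition );

    return partitions;
    }

  public static void main( String[] args )
    {
    // prints [[b, d], [a, c]]
    System.out.println( partitionAndSort( List.of( "d", "c", "b", "a" ), 2 ) );
    }
  }
```

Each inner list is sorted, but reading the partitions one after another yields b, d, a, c, which is not a globally sorted sequence.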
See the {@link cascading.tuple.Hasher} interface when providing a custom {@link java.util.Comparator} on the grouping keys that makes two values with differing hashCode values equal. For example, {@code new BigDecimal( 100.0D )} and {@code new Double( 100.0D )} are equal using a custom Comparator, but {@link Object#hashCode()} will be different, thus forcing each value into differing partitions.
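The mismatch described above can be reproduced in plain Java. In this sketch the comparator and the `normalizedHash` helper are hypothetical stand-ins: the helper only illustrates the idea of hashing on the same normalized value the Comparator compares, and is not Cascading's Hasher interface.

```java
import java.math.BigDecimal;
import java.util.Comparator;

// Plain-Java illustration only; none of these names are Cascading API.
final class HasherSketch
  {
  // A custom Comparator that treats Numbers with the same double value as equal.
  static final Comparator<Number> BY_VALUE = Comparator.comparingDouble( Number::doubleValue );

  // Hashes on the normalized double value, so values the Comparator
  // considers equal also receive the same hash.
  static int normalizedHash( Number value )
    {
    return Double.hashCode( value.doubleValue() );
    }

  public static void main( String[] args )
    {
    Number decimal = new BigDecimal( 100.0D );
    Number floating = Double.valueOf( 100.0D );

    System.out.println( BY_VALUE.compare( decimal, floating ) == 0 ); // equal per the Comparator
    System.out.println( decimal.hashCode() == floating.hashCode() ); // default hash codes differ
    System.out.println( normalizedHash( decimal ) == normalizedHash( floating ) ); // normalized hashes agree
    }
  }
```

If partitions are chosen from the default hash codes, the two values can land in different partitions even though the Comparator would have placed them in the same group.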
Note that grouping a String key with a lowercase value together with a String key with an uppercase value using a "case insensitive" Comparator will not give consistent results. The grouping will execute and be correct, but the actual values in the key columns may be replaced with "equivalent" values from other streams.
That is, if two streams are merged and then grouped on a key, where the key values are uppercase in one stream and lowercase in the other, the resulting key value for the grouping may arbitrarily be either upper or lower case.
If the original key values must be retained, consider normalizing the keys with a Function and then grouping on the resulting field.
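The effect can be sketched in plain Java with a sorted map, which is only an analogy for the grouping and not how Cascading implements it. All names below are hypothetical. In the first method the key kept for the group depends on arrival order; in the second, grouping happens on a normalized key while the original values travel along unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names are Cascading API.
final class CaseInsensitiveGroupingSketch
  {
  // Groups with a case-insensitive Comparator: the key kept for a group is whichever arrived first.
  static Map<String, List<String>> groupIgnoringCase( List<String> keys )
    {
    Map<String, List<String>> groups = new TreeMap<>( String.CASE_INSENSITIVE_ORDER );

    for( String key : keys )
      groups.computeIfAbsent( key, k -> new ArrayList<>() ).add( key );

    return groups;
    }

  // Groups on a normalized (lowercased) key and keeps the original values inside the group.
  static Map<String, List<String>> groupNormalized( List<String> keys )
    {
    Map<String, List<String>> groups = new TreeMap<>();

    for( String key : keys )
      groups.computeIfAbsent( key.toLowerCase( Locale.ROOT ), k -> new ArrayList<>() ).add( key );

    return groups;
    }

  public static void main( String[] args )
    {
    System.out.println( groupIgnoringCase( List.of( "KEY", "key" ) ) ); // prints {KEY=[KEY, key]}
    System.out.println( groupIgnoringCase( List.of( "key", "KEY" ) ) ); // prints {key=[key, KEY]}
    System.out.println( groupNormalized( List.of( "KEY", "key" ) ) ); // prints {key=[KEY, key]}
    }
  }
```

The first two calls differ only in input order yet report different group keys, while the normalized variant gives the same group key either way.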

    sources.put( "a", source );


    Pipe pipeA = new Pipe( "a" );
    Pipe pipeB = new Pipe( "b" );


    Pipe group1 = new GroupBy( pipeA );
    Pipe group2 = new GroupBy( pipeB );


    Pipe merge = new GroupBy( "tail", Pipe.pipes( group1, group2 ), new Fields( "first", "second" ) );


    sinks.put( merge.getName(), new Hfs( new TextLine(), "output/path" ) );


    try
      {
      Flow flow = getPlatform().getFlowConnector().connect( sources, sinks, merge );
      fail( "did not catch missing source tap" );


    sources.put( "b", tap );


    Pipe pipeA = new Pipe( "a" );
    Pipe pipeB = new Pipe( "b" );


    Pipe group1 = new GroupBy( pipeA );
    Pipe group2 = new GroupBy( pipeB );


    Pipe merge = new GroupBy( "tail", Pipe.pipes( group1, group2 ), new Fields( "first", "second" ) );


//    sinks.put( merge.getName(), new Hfs( new TextLine(), "output/path" ) );


    try
      {


    sources.put( "c", tap );


    Pipe pipeA = new Pipe( "a" );
    Pipe pipeB = new Pipe( "b" );


    Pipe group1 = new GroupBy( pipeA );
    Pipe group2 = new GroupBy( pipeB );


    Pipe merge = new GroupBy( "tail", Pipe.pipes( group1, group2 ), new Fields( "first", "second" ) );


    sinks.put( merge.getName(), new Hfs( new TextLine(), "output/path" ) );


    try
      {
      Flow flow = getPlatform().getFlowConnector().connect( sources, sinks, merge );
      fail( "did not catch extra source tap" );


    sources.put( "b", tap );


    Pipe pipeA = new Pipe( "a" );
    Pipe pipeB = new Pipe( "b" );


    Pipe group1 = new GroupBy( pipeA );
    Pipe group2 = new GroupBy( pipeB );


    Pipe merge = new GroupBy( "tail", Pipe.pipes( group1, group2 ), new Fields( "first", "second" ) );


    sinks.put( merge.getName(), new Hfs( new TextLine(), "output/path" ) );
    sinks.put( "c", new Hfs( new TextLine(), "output/path" ) );


    try
      {
      Flow flow = getPlatform().getFlowConnector().connect( sources, sinks, merge );



    sources.put( "count", new Hfs( new TextLine( new Fields( "first", "second" ) ), "input/path" ) );
    sinks.put( "count", new Hfs( new TextLine( new Fields( 0, 1 ) ), "output/path" ) );


    Pipe pipe = new Pipe( "count" );
    pipe = new GroupBy( pipe, new Fields( 1 ) );
    pipe = new Every( pipe, new Fields( 1 ), new TestBuffer( new Fields( "fourth" ), "value" ), new Fields( 0, 1 ) );


    List steps = getPlatform().getFlowConnector().connect( sources, sinks, pipe ).getFlowSteps();


    assertEquals( "wrong size", 1, steps.size() );



    sources.put( "count", new Hfs( new TextLine( new Fields( "first", "second" ) ), "input/path" ) );
    sinks.put( "count", new Hfs( new TextLine( new Fields( 0, 1 ) ), "output/path" ) );


    Pipe pipe = new Pipe( "count" );
    pipe = new GroupBy( pipe, new Fields( 1 ) );
    pipe = new Every( pipe, new Fields( 1 ), new TestBuffer( new Fields( "fourth" ), "value" ), new Fields( 0, 1 ) );
    pipe = new Every( pipe, new Fields( 1 ), new Count(), new Fields( 0, 1 ) );


    try
      {



    sources.put( "count", new Hfs( new TextLine( new Fields( "first", "second" ) ), "input/path" ) );
    sinks.put( "count", new Hfs( new TextLine( new Fields( 0, 1 ) ), "output/path" ) );


    Pipe pipe = new Pipe( "count" );
    pipe = new GroupBy( pipe, new Fields( 1 ) );
    pipe = new Every( pipe, new Fields( 1 ), new Count(), new Fields( 0, 1 ) );
    pipe = new Every( pipe, new Fields( 1 ), new TestBuffer( new Fields( "fourth" ), "value" ), new Fields( 0, 1 ) );


    try
      {



    Tap source = getPlatform().getTextFile( Fields.size( 2 ), inputFileIps );
    Tap sink = getPlatform().getTextFile( Fields.size( 1 ), getOutputPath( name ), SinkMode.REPLACE );


    Pipe pipe = new Pipe( "count" );
    pipe = new GroupBy( pipe, new Fields( 1 ) );
    pipe = new Every( pipe, argumentSelector, new Count( fieldDeclaration ), outputSelector );


    Flow flow = getPlatform().getFlowConnector().connect( source, sink, pipe );


    flow.start(); // simple test for start


    pipe = new Each( pipe, new Fields( 1 ), parser, new Fields( 0, 2 ) );


    // test that selector against incoming creates proper outgoing
    pipe = new Each( pipe, new Fields( 1 ), new Identity() );


    pipe = new GroupBy( pipe, new Fields( 0 ) );


    Aggregator counter = new Count();


    pipe = new Every( pipe, new Fields( 0 ), counter, new Fields( 0, 1 ) );



    pipe = new Each( pipe, new Fields( 1 ), new Identity() );


    pipe = new Each( pipe, Fields.ALL, new RegexFilter( "a|b|c" ) );


    pipe = new GroupBy( pipe, new Fields( 0 ) );


    Aggregator counter = new Count();


    pipe = new Every( pipe, new Fields( 0 ), counter, new Fields( 0, 1 ) );



Related Classes of cascading.pipe.GroupBy

bixo.examples.crawl.DemoCrawlWorkflow

bixo.examples.crawl.LatestUrlDatumBufferTest

bixo.examples.webmining.DemoWebMiningWorkflow

bixo.pipes.FetchPipe

cascading.assembly.CrossTab

cascading.BasicPipesPlatformTest

cascading.BasicTrapPlatformTest

cascading.BufferPipesPlatformTest

cascading.CoGroupFieldedPipesPlatformTest

cascading.detail.EveryAssemblyFactory
