Examples of cascading.pipe.HashJoin

cascading.pipe.HashJoin
The HashJoin pipe allows two or more tuple streams to be joined into a single stream via a {@link Joiner} when all but one tuple stream is considered small enough to fit into memory.
When planned onto MapReduce, this is effectively a non-blocking "asymmetrical join" or "replicated join", where the left-most side will not block (accumulate into memory) in order to complete the join, but the right-most sides will. See below...
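The build-then-probe behavior described above can be sketched in plain Java. This is only an illustration of the general replicated-join technique, not Cascading's implementation; the class `ReplicatedJoinSketch`, the method `innerJoin`, and the row layout (a `String[]` whose first element is the join key) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names belong to the Cascading API.
class ReplicatedJoinSketch {

  // Inner-joins streamed left rows against an in-memory index of the right rows.
  // Each row is a String[] of {key, value}.
  static List<String> innerJoin(Iterable<String[]> left, List<String[]> right) {
    // Build phase: the small right side is accumulated completely (it "blocks").
    Map<String, List<String[]>> index = new HashMap<>();
    for (String[] row : right) {
      index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
    }

    // Probe phase: left rows are consumed one at a time and never buffered.
    List<String> joined = new ArrayList<>();
    for (String[] l : left) {
      List<String[]> matches = index.getOrDefault(l[0], Collections.<String[]>emptyList());
      for (String[] r : matches) {
        joined.add(l[0] + ":" + l[1] + ":" + r[1]);
      }
    }
    return joined;
  }

  public static void main(String[] args) {
    List<String[]> left = Arrays.asList(
        new String[] {"1", "a"}, new String[] {"2", "b"}, new String[] {"1", "c"});
    List<String[]> right = Arrays.asList(
        new String[] {"1", "A"}, new String[] {"3", "C"});
    System.out.println(innerJoin(left, right)); // prints [1:a:A, 1:c:A]
  }
}
```

Only the right side is held in the map, which is why the right-hand streams must be the small ones.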
No aggregations can be performed with a HashJoin pipe as there is no guarantee that all values will be associated with a given grouping key. In fact, an Aggregator would see the same grouping many times with a partial set of values.
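A small plain-Java sketch may help show why this matters. It assumes an aggregator that treats each run of consecutive equal keys as one group, which is only correct for sorted or grouped input; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the Cascading API.
class PartialGroupSketch {

  // Sums values over runs of consecutive equal keys, the way an aggregator
  // would behave if it were fed a stream that has not been grouped.
  static List<String> sumConsecutiveRuns(String[] keys, int[] values) {
    List<String> results = new ArrayList<>();
    int i = 0;
    while (i < keys.length) {
      String key = keys[i];
      int sum = 0;
      // Consume one run of identical keys.
      while (i < keys.length && keys[i].equals(key)) {
        sum += values[i];
        i++;
      }
      results.add(key + "=" + sum);
    }
    return results;
  }

  public static void main(String[] args) {
    // Join output arrives in left-stream order, so key "a" appears in two runs.
    String[] keys = {"a", "b", "a"};
    int[] values = {1, 2, 3};
    System.out.println(sumConsecutiveRuns(keys, values)); // prints [a=1, b=2, a=3]
  }
}
```

The key `a` is reported twice, each time with a partial sum, instead of once with the total of 4.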
For every incoming {@link Pipe} instance, a {@link Fields} instance must be specified that denotes the field names or positions that should be joined with the other given Pipe instances. If the incoming Pipe instances declare one or more fields with the same name, the declaredFields must be given to name the outgoing Tuple stream fields to overcome field name collisions.
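The collision rule can be modeled with a few lines of plain Java. This is a simplified model for illustration, not Cascading's actual field resolution logic; `DeclaredFieldsSketch`, `resolve`, and the requirement that the declared list has exactly one name per joined field are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical model of the naming rule; not part of the Cascading API.
class DeclaredFieldsSketch {

  // Returns the outgoing field names for a join of two streams.
  // "declared" may be null when the caller supplies no declared fields.
  static List<String> resolve(List<String> lhs, List<String> rhs, List<String> declared) {
    List<String> combined = new ArrayList<>(lhs);
    combined.addAll(rhs);

    if (declared != null) {
      // Assumption: declared fields must rename every field in the joined tuple.
      if (declared.size() != combined.size()) {
        throw new IllegalArgumentException("declared fields must name every joined field");
      }
      return declared;
    }

    // Without declared fields, duplicate names cannot be told apart.
    if (new HashSet<>(combined).size() != combined.size()) {
      throw new IllegalArgumentException("field name collision: " + combined);
    }
    return combined;
  }

  public static void main(String[] args) {
    List<String> both = Arrays.asList("num", "char");
    List<String> declared = Arrays.asList("num1", "char1", "num2", "char2");
    System.out.println(resolve(both, both, declared)); // prints [num1, char1, num2, char2]
  }
}
```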
By default HashJoin performs an inner join via the {@link cascading.pipe.joiner.InnerJoin} {@link cascading.pipe.joiner.Joiner} class.
Self joins can be achieved by using a constructor that takes a single Pipe and a numSelfJoins value. A value of 1 for numSelfJoins will join the Pipe with itself once. Note that a self join will block until all data is accumulated thus the stream must be reasonably small.
Note that "outer" joins on the left-most side will not behave as expected. All observed keys on the right-most sides will be emitted with {@code null} for the left-most stream, thus when running distributed, duplicate values will emerge from every Map task split on the MapReduce platform.
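The duplication can be reproduced with a toy simulation in plain Java. It assumes two map splits of the left stream, each holding a full copy of the right-side keys and running an outer join locally; the class `OuterJoinSplitSketch` and its key-only rows are simplifications invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation; not part of the Cascading API.
class OuterJoinSplitSketch {

  // Outer-joins one split of the left stream against ALL right-side keys.
  // Rows are rendered as "leftKey|rightKey", with "null" for a missing side.
  static List<String> outerJoinSplit(List<String> leftSplit, List<String> rightKeys) {
    List<String> out = new ArrayList<>();
    Set<String> matched = new HashSet<>();
    for (String key : leftSplit) {
      if (rightKeys.contains(key)) {
        out.add(key + "|" + key);
        matched.add(key);
      } else {
        out.add(key + "|null");
      }
    }
    // Each split only knows which right keys IT failed to match.
    for (String key : rightKeys) {
      if (!matched.contains(key)) {
        out.add("null|" + key);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> rightKeys = Arrays.asList("x", "y");
    List<String> all = new ArrayList<>();
    all.addAll(outerJoinSplit(Arrays.asList("x"), rightKeys)); // first map split
    all.addAll(outerJoinSplit(Arrays.asList("z"), rightKeys)); // second map split
    System.out.println(all); // prints [x|x, null|y, z|null, null|x, null|y]
  }
}
```

In this run `null|y` is emitted once per split, and `null|x` appears even though `x` was matched in the first split.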
HashJoin does not scale well to large data sizes and thus requires streams with more data on the left hand side to join with more sparse data on the right hand side. That is, always attempt to effect M x N joins where M is large and N is small, instead of where M is small and N is large. Right hand side streams will be accumulated, and spilled to disk if the collection reaches a specific threshold when using Hadoop.
If spills are happening, consider increasing the spill thresholds, see {@link cascading.tuple.collect.SpillableTupleMap}.

If one of the right hand side streams starts larger than memory but is filtered (likely by a {@link cascading.operation.Filter} implementation) down to the point it fits into memory, it may be useful to use a {@link Checkpoint} Pipe to persist the stream and force a new FlowStep (MapReduce job) to read the data from disk, instead of applying the filter redundantly. This will minimize the amount of data "replicated" across the network.
See the {@link cascading.tuple.collect.TupleCollectionFactory} and {@link cascading.tuple.collect.TupleMapFactory} for a means to use alternative spillable types.
@see cascading.pipe.joiner.InnerJoin
@see cascading.pipe.joiner.OuterJoin
@see cascading.pipe.joiner.LeftJoin
@see cascading.pipe.joiner.RightJoin
@see cascading.pipe.joiner.MixedJoin
@see cascading.tuple.Fields
@see cascading.tuple.collect.SpillableTupleMap

    RegexFilter filter = new RegexFilter( "^\\S\\S+$" );
    tweetPipe = new Each( tweetPipe, new Fields( "token" ), filter );


    // create PIPEs for left join on the stop words
    Pipe stopPipe = new Pipe( "stop" ); // name branch
    Pipe joinPipe = new HashJoin( tweetPipe, new Fields( "token" ), stopPipe, new Fields( "stop" ), new LeftJoin() );
    joinPipe = new Each( joinPipe, new Fields( "stop" ), new RegexFilter( "^$" ) );


    joinPipe = new Retain( joinPipe, new Fields( "uid", "token" ) );
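The left-join-then-filter idiom in the snippet above amounts to an anti-join: keep only the tokens that found no partner in the stop list. A plain-Java sketch of the same idea follows, assuming rows of `{uid, token}` and an in-memory stop-word set; `StopWordJoinSketch` and `removeStopWords` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of the Cascading API.
class StopWordJoinSketch {

  // Each row is {uid, token}. Returns "uid:token" for every non-stop-word row.
  static List<String> removeStopWords(List<String[]> rows, Set<String> stopWords) {
    List<String> kept = new ArrayList<>();
    for (String[] row : rows) {
      // Left join: every left row survives; "stop" stays null without a match.
      String stop = stopWords.contains(row[1]) ? row[1] : null;
      // Filter: keep only the rows where the join found nothing.
      if (stop == null) {
        // Retain: only uid and token are carried forward.
        kept.add(row[0] + ":" + row[1]);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String[]> rows = Arrays.asList(
        new String[] {"u1", "the"}, new String[] {"u1", "rain"}, new String[] {"u2", "and"});
    Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and"));
    System.out.println(removeStopWords(rows, stopWords)); // prints [u1:rain]
  }
}
```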


    /*


    docPipe = new Each( docPipe, scrubArguments, new ScrubFunction( scrubArguments ), Fields.RESULTS );


    // perform a left join to remove stop words, discarding the rows
    // which joined with stop words, i.e., were non-null after left join
    Pipe stopPipe = new Pipe( "stop" );
    Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );
    tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );
    tokenPipe = new Retain( tokenPipe, fieldSelector );


    // one branch of the flow tallies the token counts for term frequency (TF)
    Pipe tfPipe = new Pipe( "TF", tokenPipe );
    Fields tf_count = new Fields( "tf_count" );
    tfPipe = new CountBy( tfPipe, new Fields( "doc_id", "token" ), tf_count );


    Fields tf_token = new Fields( "tf_token" );
    tfPipe = new Rename( tfPipe, token, tf_token );


    // one branch counts the number of documents (D)
    Fields doc_id = new Fields( "doc_id" );
    Fields tally = new Fields( "tally" );
    Fields rhs_join = new Fields( "rhs_join" );
    Fields n_docs = new Fields( "n_docs" );
    Pipe dPipe = new Unique( "D", tokenPipe, doc_id );
    dPipe = new Each( dPipe, new Insert( tally, 1 ), Fields.ALL );
    dPipe = new Each( dPipe, new Insert( rhs_join, 1 ), Fields.ALL );
    dPipe = new SumBy( dPipe, rhs_join, tally, n_docs, long.class );


    // one branch tallies the token counts for document frequency (DF)
    Pipe dfPipe = new Unique( "DF", tokenPipe, Fields.ALL );
    Fields df_count = new Fields( "df_count" );
    dfPipe = new CountBy( dfPipe, token, df_count );


    Fields df_token = new Fields( "df_token" );
    Fields lhs_join = new Fields( "lhs_join" );
    dfPipe = new Rename( dfPipe, token, df_token );
    dfPipe = new Each( dfPipe, new Insert( lhs_join, 1 ), Fields.ALL );


    // join to bring together all the components for calculating TF-IDF
    // the D side of the join is smaller, so it goes on the RHS
    Pipe idfPipe = new HashJoin( dfPipe, lhs_join, dPipe, rhs_join );


    // the IDF side of the join is smaller, so it goes on the RHS
    Pipe tfidfPipe = new CoGroup( tfPipe, tf_token, idfPipe, df_token );


    // calculate the TF-IDF weights, per token, per document
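The excerpt stops before the weight calculation itself. As a rough guide, one common TF-IDF formulation multiplies the term frequency by the logarithm of the document count over one plus the document frequency. The sketch below assumes that formulation and reuses the field names from the snippet as parameter names; it is not taken from the original code, which may use a different expression.

```java
// Hypothetical illustration; the original expression is not shown in the excerpt.
class TfIdfSketch {

  // tfCount: occurrences of the token in one document (tf_count)
  // dfCount: number of documents containing the token (df_count)
  // nDocs:   total number of documents (n_docs)
  static double tfidf(long tfCount, long dfCount, long nDocs) {
    // Assumed formulation: tf * ln( N / (1 + df) ).
    return tfCount * Math.log((double) nDocs / (1.0 + dfCount));
  }

  public static void main(String[] args) {
    // A token seen twice in a document, present in 1 of 4 documents: 2 * ln(4 / 2).
    System.out.println(tfidf(2, 1, 4));
  }
}
```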


    ExpressionFunction exprFunc = new ExpressionFunction( new Fields( "tree_species" ), expression, String.class );
    treePipe = new Each( treePipe, new Fields( "scrub_species" ), exprFunc, Fields.ALL );


    // join with tree metadata
    Pipe metaTreePipe = new Pipe( "meta_tree" );
    treePipe = new HashJoin( treePipe, new Fields( "tree_species" ), metaTreePipe, new Fields( "species" ), new InnerJoin() );
    treePipe = new Rename( treePipe, new Fields( "blurb" ), new Fields( "tree_name" ) );


    regex = "^(\\S+),(\\S+),(\\S+)\\s*$";
    int[] gpsGroups = { 1, 2, 3 };
    parser = new RegexParser( new Fields( "tree_lat", "tree_lng", "tree_alt" ), regex, gpsGroups );
    treePipe = new Each( treePipe, new Fields( "geo" ), parser, Fields.ALL );


    // determine a tree geohash
    Fields geohashArguments = new Fields( "tree_lat", "tree_lng" );
    treePipe = new Each( treePipe, geohashArguments, new GeoHashFunction( new Fields( "tree_geohash" ), 6 ), Fields.ALL );


    Fields fieldSelector = new Fields( "tree_name", "priv", "tree_id", "situs", "tree_site", "species", "wikipedia", "calflora", "min_height", "max_height", "tree_lat", "tree_lng", "tree_alt", "tree_geohash" );
    treePipe = new Retain( treePipe, fieldSelector );


    // parse the "road" output
    Pipe roadPipe = new Pipe( "road", tsvCheck );
    regex = "^\\s+Sequence\\:.*\\s+Year Constructed\\:\\s+(\\d+)\\s+Traffic Count\\:\\s+(\\d+)\\s+Traffic Index\\:\\s+(\\w.*\\w)\\s+Traffic Class\\:\\s+(\\w.*\\w)\\s+Traffic Date.*\\s+Paving Length\\:\\s+(\\d+)\\s+Paving Width\\:\\s+(\\d+)\\s+Paving Area\\:\\s+(\\d+)\\s+Surface Type\\:\\s+(\\w.*\\w)\\s+Surface Thickness.*\\s+Bike Lane\\:\\s+(\\w+)\\s+Bus Route\\:\\s+(\\w+)\\s+Truck Route\\:\\s+(\\w+)\\s+Remediation.*$";
    roadPipe = new Each( roadPipe, new Fields( "misc" ), new RegexFilter( regex ) );
    Fields roadFields = new Fields( "year_construct", "traffic_count", "traffic_index", "traffic_class", "paving_length", "paving_width", "paving_area", "surface_type", "bike_lane", "bus_route", "truck_route" );
    int[] roadGroups = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 };
    parser = new RegexParser( roadFields, regex, roadGroups );
    roadPipe = new Each( roadPipe, new Fields( "misc" ), parser, Fields.ALL );


    // join with road metadata
    Pipe metaRoadPipe = new Pipe( "meta_road" );
    roadPipe = new HashJoin( roadPipe, new Fields( "surface_type" ), metaRoadPipe, new Fields( "pavement_type" ), new InnerJoin() );
    roadPipe = new Rename( roadPipe, new Fields( "blurb" ), new Fields( "road_name" ) );


    // estimate albedo based on the road surface age and pavement type
    Fields albedoArguments = new Fields( "year_construct", "albedo_new", "albedo_worn" );
    roadPipe = new Each( roadPipe, albedoArguments, new AlbedoFunction( new Fields( "albedo" ), 2002 ), Fields.ALL );


  public void testPipeHashJoin()
    {
    Pipe pipe = new Pipe( "foo" );


    pipe = new Each( pipe, new Fields( "a" ), new Identity() );
    pipe = new HashJoin( pipe, new Fields( "b" ), new Pipe( "bar" ), new Fields( "c" ) );


    assertEqualsTrace( "cascading.TraceTest.testPipeHashJoin(TraceTest.java", pipe.getTrace() );
    }


    else if( !isMerge && isGroup )
      join = new CoGroup( pipeLower, numLHS, pipeUpper, numRHS, declaredFields, new InnerJoin() );
    else if( isMerge && !isGroup )
      join = new Merge( pipeLower, pipeUpper );
    else
      join = new HashJoin( pipeLower, numLHS, pipeUpper, numRHS, declaredFields, new InnerJoin() );


    Flow flow = null;
    try
      {
      flow = getPlatform().getFlowConnector().connect( sources, sink, join );


    Tap sink = getPlatform().getTextFile( new Fields( "line" ), getOutputPath( "cross" ), SinkMode.REPLACE );


    Pipe pipeLower = new Each( "lhs", new Fields( "line" ), new RegexSplitter( new Fields( "numLHS", "charLHS" ), " " ) );
    Pipe pipeUpper = new Each( "rhs", new Fields( "line" ), new RegexSplitter( new Fields( "numRHS", "charRHS" ), " " ) );


    Pipe cross = new HashJoin( pipeLower, new Fields( "numLHS" ), pipeUpper, new Fields( "numRHS" ), new InnerJoin() );


    Flow flow = getPlatform().getFlowConnector().connect( sources, sink, cross );


    flow.complete();


    Function splitter = new RegexSplitter( new Fields( "num", "char" ), " " );


    Pipe pipeLower = new Each( new Pipe( "lower" ), new Fields( "line" ), splitter );
    Pipe pipeUpper = new Each( new Pipe( "upper" ), new Fields( "line" ), splitter );


    Pipe splice = new HashJoin( pipeLower, new Fields( "num" ), pipeUpper, new Fields( "num" ), Fields.size( 4 ) );


    Map<Object, Object> properties = getProperties();


    Flow flow = getPlatform().getFlowConnector( properties ).connect( sources, sink, splice );


    pipeUpper = new Pipe( "right", pipeUpper );


//    pipeLower = new Each( pipeLower, new Debug( true ) );
//    pipeUpper = new Each( pipeUpper, new Debug( true ) );


    Pipe splice = new HashJoin( pipeLower, new Fields( "num" ), pipeUpper, new Fields( "num" ), Fields.size( 4 ) );


//    splice = new Each( splice, new Debug( true ) );
    splice = new Pipe( "splice", splice );
    splice = new Pipe( "tail", splice );


    Function splitter = new RegexSplitter( Fields.UNKNOWN, " " );


    Pipe pipeLower = new Each( new Pipe( "lower" ), new Fields( "line" ), splitter );
    Pipe pipeUpper = new Each( new Pipe( "upper" ), new Fields( "line" ), splitter );


    Pipe splice = new HashJoin( pipeLower, new Fields( 0 ), pipeUpper, new Fields( 0 ), Fields.size( 4 ) );


    Flow flow = getPlatform().getFlowConnector().connect( sources, sink, splice );


    flow.complete();


    Pipe pipeLower = new Each( new Pipe( "lower" ), new Fields( "line" ), splitter );
    Pipe pipeUpper = new Each( new Pipe( "upper" ), new Fields( "line" ), splitter );
    pipeUpper = new Each( pipeUpper, new Fields( "num" ), new RegexFilter( "^fobar" ) ); // intentionally filtering all
    pipeUpper = new GroupBy( pipeUpper, new Fields( "num" ) );


    Pipe splice = new HashJoin( pipeLower, new Fields( "num" ), pipeUpper, new Fields( "num" ), Fields.size( 4 ), new OuterJoin() );


    Flow flow = getPlatform().getFlowConnector().connect( sources, sink, splice );


    flow.complete();



Related Classes of cascading.pipe.HashJoin

cascading.flow.hadoop.BuildJobsHadoopPlatformTest

cascading.flow.iso.graph.HashJoinAroundHashJoinLeftMostGraph

cascading.flow.iso.graph.HashJoinMergeIntoHashJoinStreamedStreamedMergeGraph

cascading.flow.iso.graph.HashJoinSameSourceGraph

cascading.flow.iso.graph.HashJoinsIntoMerge

cascading.flow.iso.graph.JoinAroundJoinRightMostGraph

cascading.flow.iso.graph.JoinAroundJoinRightMostGraphSwapped

cascading.JoinFieldedPipesPlatformTest

cascading.lingual.optiq.CascadingAggregateRel

cascading.MergePipesPlatformTest
