Examples of com.liveramp.cascading_ext.assembly.BloomJoin

com.liveramp.cascading_ext.assembly.BloomJoin

This SubAssembly behaves almost exactly like CoGroup, except that the LHS is filtered using a bloom filter built from the keys on the RHS. This means that the side with fewer keys should always be passed in as the RHS.
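The filtering idea described above can be sketched in plain Java, independent of Cascading. This is only an illustration under stated assumptions, not the cascading_ext implementation: `SimpleBloomFilter`, `filterLhs`, and `innerJoin` are hypothetical names, and the filter size (1024 bits, 3 hashes) is arbitrary. A bloom filter never rejects a key that was added to it, so pre-filtering the LHS cannot change the result of an inner join; it only reduces how many LHS rows reach the join.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BloomPrefilterSketch {

  /** Tiny fixed-size bloom filter over serialized key bytes (illustrative only). */
  static final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    void add(byte[] key) {
      for (int i = 0; i < numHashes; i++) {
        bits.set(index(key, i));
      }
    }

    /** May return true for a key that was never added, but never false for one that was. */
    boolean mightContain(byte[] key) {
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(index(key, i))) {
          return false;
        }
      }
      return true;
    }

    /** Double hashing: combines two hashes of the key bytes to get the i-th bit index. */
    private int index(byte[] key, int i) {
      int h1 = Arrays.hashCode(key);
      int h2 = 0x811C9DC5; // FNV-1a style second hash
      for (byte b : key) {
        h2 ^= (b & 0xff);
        h2 *= 16777619;
      }
      return Math.floorMod(h1 + i * h2, numBits);
    }
  }

  /** Keys are serialized to bytes before hashing; UTF-8 is an assumption of this sketch. */
  static byte[] serialize(String key) {
    return key.getBytes(StandardCharsets.UTF_8);
  }

  /** Builds the filter from the RHS keys and keeps only LHS keys that might match. */
  static List<String> filterLhs(List<String> lhsKeys, List<String> rhsKeys) {
    SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
    for (String key : rhsKeys) {
      filter.add(serialize(key));
    }
    List<String> kept = new ArrayList<>();
    for (String key : lhsKeys) {
      if (filter.mightContain(serialize(key))) {
        kept.add(key);
      }
    }
    return kept;
  }

  /** Exact inner join on keys; stands in for the CoGroup step and drops any false positives. */
  static List<String> innerJoin(List<String> lhsKeys, List<String> rhsKeys) {
    Set<String> rhs = new HashSet<>(rhsKeys);
    List<String> out = new ArrayList<>();
    for (String key : lhsKeys) {
      if (rhs.contains(key)) {
        out.add(key);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> lhs = Arrays.asList("a", "b", "c", "d", "e", "f");
    List<String> rhs = Arrays.asList("b", "e");
    List<String> filtered = filterLhs(lhs, rhs);
    System.out.println("LHS rows surviving the filter: " + filtered.size() + " of " + lhs.size());
    System.out.println("Join result: " + innerJoin(filtered, rhs)); // prints "Join result: [b, e]"
  }
}
```

The filter is built from the RHS keys, so its size and accuracy depend on how many distinct keys the RHS has, which is consistent with the advice to pass the side with fewer keys as the RHS.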

Note that if there are field name conflicts on the LHS and RHS, you'll have to pass in renameFields (just like with CoGroup). Additionally, the coGroupOrder parameter allows tweaking of reduce performance (default is LARGE_LHS).

Note: if this SubAssembly is used without CascadingUtil, the flow will need certain properties set. See BloomJoinExampleWithoutCascadingUtil for details.

Note: In the current implementation, using a LeftJoin joiner with LARGE_LHS or a RightJoin joiner with LARGE_RHS will fall back to a regular CoGroup.

IMPORTANT: one behavior difference between BloomJoin and CoGroup is that RHS and LHS keys which are expected to match in the join MUST also serialize identically (the bloom filter is built by serializing key fields). If normal Java/Hadoop types are used this should not be a problem, but comparing key types which extend each other WILL cause data loss unless custom serializers are used. For example, LHS = (BytesWritable), RHS = (x extends BytesWritable) loses data when the types serialize differently.
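The effect of mismatched serialization can be seen in a small plain-Java sketch. It is an analogy rather than the Hadoop case from the note: the same logical key, 42, is written once as a 4-byte int and once as an 8-byte long, and an exact set of serialized keys stands in for the bloom filter so that the byte mismatch is the only effect shown. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

class SerializationMismatchSketch {

  /** Hypothetical serializer: writes the key as a 4-byte int. */
  static ByteBuffer asIntBytes(int key) {
    ByteBuffer buffer = ByteBuffer.allocate(4);
    buffer.putInt(key);
    buffer.flip(); // make the written bytes readable for equals/hashCode
    return buffer;
  }

  /** Hypothetical serializer: writes the key as an 8-byte long. */
  static ByteBuffer asLongBytes(long key) {
    ByteBuffer buffer = ByteBuffer.allocate(8);
    buffer.putLong(key);
    buffer.flip();
    return buffer;
  }

  public static void main(String[] args) {
    // RHS keys are recorded by their serialized bytes, as a byte-level filter would see them.
    Set<ByteBuffer> rhsKeys = new HashSet<>();
    rhsKeys.add(asLongBytes(42L));

    // Same value, same serialization: the key is found.
    System.out.println(rhsKeys.contains(asLongBytes(42L))); // prints "true"

    // Same value, different serialization: the key is missed, so the LHS row would be dropped.
    System.out.println(rhsKeys.contains(asIntBytes(42))); // prints "false"
  }
}
```

A regular CoGroup compares key values through their comparators, so two types that compare as equal would still join; a filter keyed on serialized bytes only sees that the byte sequences differ.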

    // Source taps, keyed by the name of the pipe each one feeds
    Map<String, Tap> sources = new HashMap<String, Tap>();
    sources.put("source1", ExampleFixtures.SOURCE_TAP_1);
    sources.put("source2", ExampleFixtures.SOURCE_TAP_2);

    // Sink for the joined tuples; outputDir is supplied by the caller
    Hfs sink = new Hfs(new SequenceFile(new Fields("field1", "field2", "field3", "field4")), outputDir);

    Pipe source1 = new Pipe("source1");
    Pipe source2 = new Pipe("source2");

    // source1 is the LHS and source2 is the RHS: the bloom filter is built from
    // source2's "field3" keys and used to filter source1 on "field1"
    Pipe joined = new BloomJoin(source1, new Fields("field1"),
        source2, new Fields("field3"));

    CascadingUtil.get().getFlowConnector().connect("Example flow", sources, sink, joined).complete();

    // Take a look at the output tuples
    TupleEntryIterator output = sink.openForRead(CascadingUtil.get().getFlowProcess());
