org.apache.hadoop.hive.ql.optimizer.listbucketingpruner.ListBucketingPruner
Note: this class is not designed for general use; it exists for the list bucketing pruner only.

The structure addresses the following requirements:
1. It is a multi-dimensional collection.
2. The length of each dimension is dynamic and is decided at runtime.

The first user is the list bucketing pruner, which uses it in the pruning phase:
1. Each skewed column has a batch of skewed elements.
2. One skewed column represents one dimension.
3. The length of a dimension is the number of skewed elements.
4. The number of skewed columns and the length of each dimension are dynamic and configured by the user.

Use cases:
==========

Use case #1:
The multi-dimensional collection records whether to select the directory represented by each cell.

skewed column: C1, C2, C3
skewed value: (1,a,x), (2,b,x), (1,c,x), (2,a,y)
Other: stands for any value of the column that is not part of a skewed value.

C3 = x
C1\C2 | a              | b              | c              | Other
1     | Boolean(1,a,x) | X              | Boolean(1,c,x) | X
2     | X              | Boolean(2,b,x) | X              | X
other | X              | X              | X              | X

C3 = y
C1\C2 | a              | b | c | Other
1     | X              | X | X | X
2     | Boolean(2,a,y) | X | X | X
other | X              | X | X | X

Boolean is the cell type, which can be False/True/Null (Unknown). The tuple inside the parentheses, such as (1,a,x), is for information only: it shows which skewed value the cell represents.
1. The value of Boolean(1,a,x) indicates whether we select that directory for list bucketing.
2. The value of Boolean(2,b,x) indicates whether we select that directory for list bucketing.
...
3. All the remaining cells, marked "X", decide whether to pick up the default directory.
4. This covers not only the "other" columns/rows but also every other cell that does not represent a skewed value.

For a cell representing a skewed value:
1. False: skip the directory.
2. True/Unknown: select the directory.

For the cells representing the default directory:
1. Only if all cells are false, skip the directory.
2. In all other cases, select the directory.

Use case #2:
The multi-dimensional collection holds the skewed elements so that the tree can be walked one cell at a time. Each cell is a List holding the skewed value that its index path maps to.
skewed column: C1, C2, C3
skewed value: (1,a,x), (2,b,x), (1,c,x), (2,a,y)
Other: stands for any value of the column that is not part of a skewed value.

C3 = x
C1\C2 | a       | b       | c       | Other
1     | (1,a,x) | X       | (1,c,x) | X
2     | X       | (2,b,x) | X       | X
other | X       | X       | X       | X

C3 = y
C1\C2 | a       | b | c | Other
1     | X       | X | X | X
2     | (2,a,y) | X | X | X
other | X       | X | X | X

Implementation:
===============
Please see another example in {@link ListBucketingPruner#prune}.

We use a HashMap to represent the dynamic multi-dimensional collection:
1. The key is a List representing the index path to the cell.
2. The value represents the cell (Boolean for use case #1, List for use case #2).

For example:
1. skewed column (list): C1, C2, C3
2. skewed value (list of list): (1,a,x), (2,b,x), (1,c,x), (2,a,y)

From the skewed values, we calculate the unique skewed elements for each skewed column:
C1: (1,2)
C2: (a,b,c)
C3: (x,y)

We store them in a list of lists. We don't need to store the skewed column names since we match by order:
1. Skewed column (list): C1, C2, C3
2. Unique skewed elements for each skewed column (list of list): (1,2,other), (a,b,c,other), (x,y,other)
3. Index: (0,1,2) (0,1,2,3) (0,1,2)

We use the indexes, starting at 0, to construct the HashMap representing the dynamic multi-dimensional collection:
key (what skewed value the key represents) -> value (Boolean for use case #1, List for use case #2).
(0,0,0) (1,a,x)
(0,0,1) (1,a,y)
(0,1,0) (1,b,x)
(0,1,1) (1,b,y)
(0,2,0) (1,c,x)
(0,2,1) (1,c,y)
(1,0,0) (2,a,x)
(1,0,1) (2,a,y)
(1,1,0) (2,b,x)
(1,1,1) (2,b,y)
(1,2,0) (2,c,x)
(1,2,1) (2,c,y)
...
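
The index-path scheme and the selection rules described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code in ListBucketingPruner: the class name IndexPathSketch, its method names, and the "other" placeholder string are all hypothetical. It derives the unique elements per column, appends "other", and maps every index path to the value tuple it denotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: names here are hypothetical, not Hive's API.
final class IndexPathSketch {

    // Placeholder for "any value that is not a skewed element" (assumed label).
    static final String OTHER = "other";

    // Unique skewed elements per column, in first-seen order, with OTHER appended.
    static List<List<String>> uniqueElements(List<List<String>> skewedValues, int columns) {
        List<Set<String>> seen = new ArrayList<>();
        for (int c = 0; c < columns; c++) {
            seen.add(new LinkedHashSet<>());
        }
        for (List<String> value : skewedValues) {
            for (int c = 0; c < columns; c++) {
                seen.get(c).add(value.get(c));
            }
        }
        List<List<String>> result = new ArrayList<>();
        for (Set<String> column : seen) {
            List<String> elements = new ArrayList<>(column);
            elements.add(OTHER);
            result.add(elements);
        }
        return result;
    }

    // Maps every index path (one index per dimension) to the value tuple it denotes.
    static Map<List<Integer>, List<String>> build(List<List<String>> elements) {
        Map<List<Integer>, List<String>> cells = new LinkedHashMap<>();
        walk(elements, 0, new ArrayList<>(), new ArrayList<>(), cells);
        return cells;
    }

    private static void walk(List<List<String>> elements, int dim, List<Integer> path,
                             List<String> tuple, Map<List<Integer>, List<String>> cells) {
        if (dim == elements.size()) {
            // Copy both lists, because path and tuple are mutated while backtracking.
            cells.put(new ArrayList<>(path), new ArrayList<>(tuple));
            return;
        }
        List<String> dimension = elements.get(dim);
        for (int i = 0; i < dimension.size(); i++) {
            path.add(i);
            tuple.add(dimension.get(i));
            walk(elements, dim + 1, path, tuple, cells);
            path.remove(path.size() - 1);   // remove(int index): drop the last index
            tuple.remove(tuple.size() - 1); // remove(int index): drop the last element
        }
    }

    // Use case #1, skewed-value cell: skip only on False; select on True or Null (Unknown).
    static boolean selectSkewedDirectory(Boolean cell) {
        return !Boolean.FALSE.equals(cell);
    }

    // Use case #1, default directory: skip only if every cell is False.
    static boolean selectDefaultDirectory(List<Boolean> defaultCells) {
        for (Boolean cell : defaultCells) {
            if (!Boolean.FALSE.equals(cell)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> skewedValues = Arrays.asList(
                Arrays.asList("1", "a", "x"), Arrays.asList("2", "b", "x"),
                Arrays.asList("1", "c", "x"), Arrays.asList("2", "a", "y"));
        Map<List<Integer>, List<String>> cells = build(uniqueElements(skewedValues, 3));
        cells.forEach((path, tuple) -> System.out.println(path + " -> " + tuple));
    }
}
```

A List works as a HashMap key here because List defines equals and hashCode by content, so two index paths with the same indexes address the same cell. Unlike the listing above, the sketch also enumerates the paths that include "other".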