apache pig - How to improve the performance of a Pig job that uses DataFu's HyperLogLog for estimating cardinality?


I am using DataFu's HyperLogLog UDF to estimate the count of unique ids in my dataset. In this case I have 320 million unique ids that may appear multiple times in the dataset.

Dataset: country, id.

Here is my code:

    REGISTER datafu-1.2.0.jar;

    DEFINE HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();

    -- id is a UUID, for example: de305d54-75b4-431b-adb2-eb6b9e546014
    all_ids =
        LOAD '$data'
        USING PigStorage(';') AS (country:chararray, id:chararray);

    estimate_unique_ids =
        FOREACH (GROUP all_ids BY country)
        GENERATE
            'total ids' AS label,
            HyperLogLogPlusPlus(all_ids) AS reach;

    STORE estimate_unique_ids INTO '$output' USING PigStorage();

I am using 120 reducers and noticed that the majority of them completed within a few minutes. However, a handful of reducers were overloaded with data and ran forever. I killed them after 24 hours.

I thought HyperLogLog was supposed to be more efficient than counting for real. What is going wrong here?

In DataFu 1.3.0, an Algebraic implementation of HyperLogLog was added. This allows the UDF to use the combiner, so partial results are computed on the map side, which should improve performance in skewed situations like this one.
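To illustrate why an algebraic UDF helps with skew, here is a minimal Python sketch of the underlying idea: HyperLogLog sketches built on separate chunks of data can be merged by taking the element-wise maximum of their registers, and the result is identical to a sketch built in a single pass. This is a toy HyperLogLog, not DataFu's HyperLogLog++ implementation; the class `TinyHLL`, the helper `demo`, the SHA-1 based hash, and the chunk count of 4 are all assumptions made for illustration.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog with 2**p registers (illustrative, not DataFu's code)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank is the 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging is an element-wise max; both sketches must use the same p.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def demo(num_chunks=4):
    ids = ["id-%d" % i for i in range(50000)]
    data = ids + ids[:20000]          # 50,000 distinct ids, some repeated

    single = TinyHLL()                # one pass over everything ("no combiner")
    for v in data:
        single.add(v)

    merged = TinyHLL()                # per-chunk sketches, then merged ("combiner")
    for k in range(num_chunks):
        partial = TinyHLL()
        for v in data[k::num_chunks]:
            partial.add(v)
        merged.merge(partial)
    return single, merged


if __name__ == "__main__":
    single, merged = demo()
    print("same registers:", single.registers == merged.registers)
    print("estimate:", round(merged.estimate()))
```

Because the merge is lossless, each map task can ship one small sketch per key instead of every raw id, so a reducer that owns a very large country merges a handful of sketches rather than scanning millions of rows.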

However, in the comments on the JIRA issue there is a discussion of some other performance problems that can arise when using HyperLogLog. The relevant quote is below:

The thing to keep in mind is that each instance of HyperLogLogPlus allocates a pretty large byte array. I can't remember the exact numbers, but I think for the default precision of 20 it is hundreds of KB. So in your example, if the cardinality of "a" is large, you are going to allocate a lot of large byte arrays that will need to be transmitted from combiner to reducer. So I would avoid using it in "group by" situations unless you know the key cardinality is quite small. This UDF is better suited for "group all" scenarios where you have a lot of input data. Also, if the input data is much smaller than the byte array, then you could be worse off using this UDF. If you can accept worse precision, then the byte array could be made smaller.
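The trade-off in the quote can be put into rough numbers. The sketch below assumes a dense layout of 6 bits per register, as described for HyperLogLog++, and the standard HyperLogLog relative error of about 1.04 / sqrt(m) for m = 2^p registers. The actual memory layout inside DataFu and its underlying library may differ, and sparse representations are ignored, so treat the byte counts as order-of-magnitude estimates. The function names and the key and mapper counts in the usage lines are hypothetical.

```python
import math

# Assumption: dense HLL++ layout with 6 bits per register.
# Real libraries may pack registers differently.
BITS_PER_REGISTER = 6


def dense_sketch_bytes(p):
    """Approximate size in bytes of one dense sketch with 2**p registers."""
    return (1 << p) * BITS_PER_REGISTER // 8


def relative_std_error(p):
    """Standard HyperLogLog relative error, about 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(1 << p)


def worst_case_shuffle_bytes(num_keys, num_mappers, p):
    """Upper bound: every mapper emits one dense sketch for every key."""
    return num_keys * num_mappers * dense_sketch_bytes(p)


if __name__ == "__main__":
    for p in (12, 14, 16, 20):
        print("p=%d  bytes=%d  error=%.3f%%"
              % (p, dense_sketch_bytes(p), 100 * relative_std_error(p)))
    # Hypothetical job: 200 group keys, 500 map tasks, precision 20.
    print(worst_case_shuffle_bytes(200, 500, 20) / 2.0 ** 30, "GiB")
```

Under these assumptions a precision of 20 costs about 768 KiB per sketch, which matches the "hundreds of KB" in the quote, while a precision of 14 needs about 12 KiB at roughly 0.8% relative error. This is why the sketch pays off for a few keys with many rows each, and why it can cost more than it saves when a key has fewer rows than the sketch has bytes.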

