How are wordCount MapReduce jobs run on a Hadoop YARN cluster with Apache Tez?
As the GitHub page of Tez says, Tez is simple and, at its heart, has 2 components:
- the data-processing pipeline engine, and
- a master for data-processing applications, whereby one can put together arbitrary data-processing 'tasks' into a task-DAG.
Well, the first question is: how are existing MapReduce jobs, like the wordcount in tez-examples.jar, converted into a task-DAG? And where does that happen? Or are they not converted at all...?
And the second, more important question:
Every 'task' in Tez has the following:
- an Input to consume key/value pairs from,
- a Processor to process them, and
- an Output to collect the processed key/value pairs.
Who is in charge of splitting the input data between the Tez tasks? Is it the code the user provides, YARN (the resource manager), or Tez itself?
The same question applies to the output phase. Thanks in advance.
To answer your first question on converting MapReduce jobs to Tez DAGs:
Any MapReduce job can be thought of as a single DAG with 2 vertices (stages). The first vertex is the map phase, and it is connected to the downstream reduce vertex via a shuffle edge.
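That two-vertex shape can be illustrated with a toy word count in plain Python. This is a conceptual sketch only: the `map_vertex`, `shuffle_edge`, and `reduce_vertex` functions are illustrative stand-ins for the map vertex, shuffle edge, and reduce vertex, not Tez or MapReduce APIs.

```python
from collections import defaultdict

def map_vertex(lines):
    # Map phase: emit (word, 1) for every word, like WordCount's mapper.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_edge(pairs):
    # Shuffle edge: group values by key so each reducer
    # sees all values for a given key together.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_vertex(grouped):
    # Reduce phase: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_vertex(shuffle_edge(map_vertex(["hello tez", "hello yarn"])))
print(counts)  # {'hello': 2, 'tez': 1, 'yarn': 1}
```

In a real cluster each vertex runs as many parallel tasks and the shuffle moves data over the network; here the three stages are just chained function calls.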
There are 2 ways in which MR jobs can be run on Tez:
- One approach is to write a native 2-stage DAG using the Tez APIs directly. This is what is present in tez-examples.
- The second is to use the MapReduce APIs themselves in yarn-tez mode. In this scenario, there is a layer which intercepts the MR job submission and, instead of using MR, translates the MR job into a 2-stage Tez DAG and executes that DAG on the Tez runtime.
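The yarn-tez interception in the second approach is enabled through cluster configuration. The commonly documented setting is `mapreduce.framework.name=yarn-tez` in `mapred-site.xml` (shown below as a sketch; check the install guide for your Tez release for the full set of required properties):

```xml
<!-- mapred-site.xml: route MR job submissions through the Tez translation layer -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn-tez</value>
</property>
```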
For the data-handling related questions that you have:
The user provides the logic on how to read the data and how to split it. Tez then takes each split of data and takes over the responsibility of assigning a split, or a set of splits, to a given task.
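That division of labour can be sketched in plain Python: `user_split_logic` below plays the role of the user-provided InputFormat-style split generation, while `framework_assign` mimics the framework handing splits to tasks. Both function names and the round-robin policy are illustrative assumptions, not actual Tez behaviour or APIs.

```python
def user_split_logic(data, split_size):
    # User-provided logic: decide how the input is divided into splits.
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]

def framework_assign(splits, num_tasks):
    # Framework-side logic: the engine (not the user) decides which task
    # runs which split(s); shown here as a simple round-robin assignment.
    tasks = {t: [] for t in range(num_tasks)}
    for i, split in enumerate(splits):
        tasks[i % num_tasks].append(split)
    return tasks

records = list(range(10))
splits = user_split_logic(records, 3)     # 4 splits: [0,1,2], [3,4,5], [6,7,8], [9]
assignment = framework_assign(splits, 2)  # 2 tasks, 2 splits each
```

The key point is the boundary: the user's code defines what a split is; the framework owns the split-to-task mapping.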
The Tez framework then controls the generation and movement of data, i.e. where to generate the data between intermediate steps and how to move it between 2 vertices/stages. However, it does not control the underlying data contents/structure, partitioning, or serialization logic, which are provided by user plugins.
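The same boundary applies to intermediate data, and can be sketched the same way: the framework physically routes key/value pairs between stages, but which downstream task a key lands on is decided by a user-pluggable partitioner. The hash partitioner below only mirrors the idea of a default hash partitioner conceptually; real MR/Tez plugins implement a Partitioner interface rather than a bare function.

```python
def hash_partition(key, num_partitions):
    # User plugin: map a key to one of the downstream tasks.
    return hash(key) % num_partitions

def route(pairs, num_partitions):
    # Framework side: move each pair to the partition the user plugin
    # chose; the contents and serialization stay user-defined.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return partitions

# Integer keys 0 and 2 hash to partition 0; key 1 goes to partition 1.
parts = route([(0, "a"), (1, "b"), (2, "c")], 2)
```

Swapping in a different `hash_partition` changes where keys go without touching the routing machinery, which is the point of keeping partitioning in user-pluggable code.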
The above is a high-level view, and there are additional intricacies. You will get more detailed answers by posting specific questions to the development list ( http://tez.apache.org/mail-lists.html ).