You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReducer job at runtime?
Answer : C
The usage of the jar command is like this,
Usage: hadoop jar <jar> [mainClass] args...
If you want the commons-math3.jar to be available for all the tasks you can do any one of
these
1. Copy the jar file in $HADOOP_HOME/lib dir
or
2. Use the generic option -libjars.
Given the following Hive commands:
Which one of the following statements Is true?
Answer : A
Indentify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
Answer : D
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Which one of the following is NOT a valid Oozie action?
Answer : D
Examine the following Hive statements:
Assuming the statements above execute successfully, which one of the following statements is true?
Answer : A
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or a number of machines are performing poorly and starts more copies of a map or reduce task. All the tasks run simultaneously and the task finish first are used. This is called:
Answer : E
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Note:
* Hadoop uses 'speculative execution.' The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed.
Failed tasks are tasks that error out.
* There are a few reasons Hadoop can kill tasks by his own decisions:
a) Task does not report progress during timeout (default is 10 minutes)
b) FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).
c) Speculative execution causes results of task not to be needed since it has completed on other place.
Which one of the following statements is true regarding a MapReduce job?
Answer : A