Cloudera CCA Spark and Hadoop Developer CCA175 Exam Practice Test

Total 96 questions


Question 1

Problem Scenario 59 : You have been given the code snippet below.

val x = sc.parallelize(1 to 20)

val y = sc.parallelize(10 to 30)

operation1

z.collect

Write a correct code snippet for operation1 which will produce the desired output, shown below.

Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)

A. Solution :
val z = x.intersection(y)
intersection : Returns the elements in the two RDDs which are the same.

B. Solution :
val z = x.intersection(y)
intersection : Returns the elements in the two RDs which are the same.

Answer : A
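As a rough illustration of what intersection does in Question 1, the sketch below uses plain Python sets as stand-ins for the two RDDs. It does not use Spark, and the ordering of a real RDD result may differ from the sorted list printed here.

```python
# Plain-Python stand-ins for the two RDDs in the question (no Spark required).
x = range(1, 21)   # mirrors sc.parallelize(1 to 20); Scala's `to` is inclusive
y = range(10, 31)  # mirrors sc.parallelize(10 to 30)

# Like RDD.intersection, a set intersection keeps each common element once
# and makes no promise about ordering.
z = set(x) & set(y)

print(sorted(z))
```

The eleven common values are 10 through 20, which matches the elements of the expected Array in the question once ordering is ignored.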

Question 2

Problem Scenario 52 : You have been given the code snippet below.

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

Operation_xyz

Write a correct code snippet for Operation_xyz which will produce the output below.

scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

A. Solution :
b.countByValue
countByValue
Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)
Listing Variants
def countByValue(): Map[T, Long]

B. Solution :
b.countByValue
countByValue
Returns a map that contains all unique values of the RTT and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)
Listing Variants
def countByValue(): Map[T, Long]

Answer : A
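The counting behaviour of countByValue in Question 2 can be mimicked locally with Python's collections.Counter. This is only a sketch of the semantics on an in-memory list, not the Spark call itself.

```python
from collections import Counter

# Same values as the List passed to sc.parallelize in the question.
b = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1]

# Counter maps each distinct value to its number of occurrences,
# which is the same shape of result that countByValue returns.
counts = dict(Counter(b))

print(counts)
```

The value 1 appears six times, 2 three times and 4 twice, consistent with the expected Map in the question.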

Question 3

Problem Scenario 44 : You have been given 4 files, with the content as given below:

spark11/file1.txt

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework

spark11/file2.txt

The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.

spark11/file3.txt

This approach takes advantage of data locality nodes manipulating the data they have access to to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking

spark11/file4.txt

Apache Storm is focused on stream processing or what some call complex event processing. Storm implements a fault tolerant method for performing a computation or pipelining multiple computations on an event as it flows into a system. One might use Storm to transform unstructured data as it flows into a system into a desired format


Write a Spark program which will give you the highest occurring word in each file, along with the file name.

A. Solution :
Step 1 : Create all 4 files first using Hue in HDFS.
Step 2 : Load each file as an RDD.
val file1 = sc.textFile("spark11/file1.txt")
val file2 = sc.textFile("spark11/file2.txt")
val file3 = sc.textFile("spark11/file3.txt")
val file4 = sc.textFile("spark11/file4.txt")
Step 3 : Now do the word count for each file and sort in reverse order of count.
val content1 = file1.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content2 = file2.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content3 = file3.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content4 = file4.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
Step 4 : Take the most frequent word of each file and create a one-element RDD holding the file name, the word and its count.
val file1word = sc.makeRDD(Array(file1.name + "->" + content1.first()._1 + "-" + content1.first()._2))
val file2word = sc.makeRDD(Array(file2.name + "->" + content2.first()._1 + "-" + content2.first()._2))
val file3word = sc.makeRDD(Array(file3.name + "->" + content3.first()._1 + "-" + content3.first()._2))
val file4word = sc.makeRDD(Array(file4.name + "->" + content4.first()._1 + "-" + content4.first()._2))
Step 5 : Union all the RDDs.
val unionRDDs = file1word.union(file2word).union(file3word).union(file4word)
Step 6 : Save the results in a text file as below.
unionRDDs.repartition(1).saveAsTextFile("spark11/union.txt")

B. Solution :
Step 1 : Create all 4 files first using Hue in HDFS.
Step 2 : Load each file as an RDD.
val file1 = sc.textFile("spark11/file1.txt")
val file2 = sc.textFile("spark11/file2.txt")
val file3 = sc.textFile("spark11/file3.txt")
val file4 = sc.textFile("spark11/file4.txt")
Step 3 : Now do the word count for each file and sort in reverse order of count.
val content1 = file1.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content2 = file2.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content3 = file3.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
val content4 = file4.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(false).map(e => e.swap)
Step 4 : Union all the RDDs.
val unionRDDs = file1word.union(file2word).union(file3word).union(file4word)
Step 5 : Save the results in a text file as below.
unionRDDs.repartition(1).saveAsTextFile("spark11/union.txt")

Answer : A
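The per-file logic in Question 3 (count words, order by count descending, take the first) can be sketched without Spark. The helper name top_word and the two miniature inputs below are hypothetical and exist only for illustration; they are not the four files from the question.

```python
from collections import Counter

def top_word(text):
    """Return the (word, count) pair with the highest count in `text`."""
    counts = Counter(text.split(" "))   # same role as flatMap + map + reduceByKey
    return counts.most_common(1)[0]     # same role as sorting by count and taking the first

# Hypothetical miniature inputs standing in for the HDFS files.
files = {
    "demo/a.txt": "spark hadoop spark hive",
    "demo/b.txt": "storm storm storm kafka",
}

# One "name->word-count" line per file, mirroring the string format used in the solution.
results = [
    "{}->{}-{}".format(name, *top_word(text)) for name, text in files.items()
]
for line in results:
    print(line)
```

When several words tie for the highest count, which one is reported depends on the sort, so a real job may want an explicit tie-break.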

Question 4

Problem Scenario 31 : You have been given the following two files

1. Content.txt: Contains a huge text file containing space-separated words.

2. Remove.txt: Ignore/filter all the words given in this file (comma-separated).

Write a Spark program which reads the Content.txt file and loads it as an RDD, then removes all the words held in a broadcast variable (which is loaded as an RDD of words from Remove.txt). Then count the occurrences of each word and save the result as a text file in HDFS.

Content.txt

Hello this is ABCTech.com

This is TechABY.com

Apache Spark Training

This is Spark Learning Session

Spark is faster than MapReduce

Remove.txt

Hello, is, this, the

A. Solution :
Step 1 : Create both files in HDFS in a directory called spark2 (we will do this using Hue). However, you can first create them in the local filesystem and then upload them to HDFS.
Step 2 : Load the Content.txt file
val content = sc.textFile("spark2/Content.txt") //Load the text file
Step 3 : Load the Remove.txt file
val remove = sc.textFile("spark2/Remove.txt") //Load the text file
Step 4 : Create an RDD from remove. However, each word could have leading or trailing spaces, so remove that whitespace as well. We have used the functions flatMap, map and trim here.
val removeRDD = remove.flatMap(x => x.split(",")).map(word => word.trim) //Create an RDD of trimmed words
Step 5 : Broadcast the words which you want to ignore
val bRemove = sc.broadcast(removeRDD.collect().toList) // It should be a list of Strings
Step 6 : Split the content RDD, so we can have an RDD of words.
val words = content.flatMap(line => line.split(" "))
Step 7 : Filter the RDD, so it keeps only the words which are not present in the broadcast variable.
val filtered = words.filter { case (word) => !bRemove.value.contains(word) }
Step 8 : Create a PairRDD, so we can have (word,1) tuples.
val pairRDD = filtered.map(word => (word, 1))
Step 9 : Now do the word count on the PairRDD.
val wordCount = pairRDD.reduceByKey(_ + _)
Step 10 : Save the output as a text file.
wordCount.saveAsTextFile("spark2/result.txt")

B. Solution :
Step 1 : Create both files in HDFS in a directory called spark2 (we will do this using Hue). However, you can first create them in the local filesystem and then upload them to HDFS.
Step 2 : Load the Content.txt file
val content = sc.textFile("spark2/Content.txt") //Load the text file
Step 3 : Load the Remove.txt file
val remove = sc.textFile("spark2/Remove.txt") //Load the text file
Step 4 : Create an RDD from remove. However, each word could have leading or trailing spaces, so remove that whitespace as well. We have used the functions flatMap, map and trim here.
val removeRDD = remove.flatMap(x => x.split(",")).map(word => word.trim) //Create an RDD of trimmed words
Step 5 : Filter the RDD, so it keeps only the words which are not present in the broadcast variable.
val filtered = words.filter { case (word) => !bRemove.value.contains(word) }
Step 6 : Create a PairRDD, so we can have (word,1) tuples.
val pairRDD = filtered.map(word => (word, 1))
Step 7 : Now do the word count on the PairRDD.
val wordCount = pairRDD.reduceByKey(_ + _)
Step 8 : Save the output as a text file.
wordCount.saveAsTextFile("spark2/result.txt")

Answer : A
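The filter-then-count flow of Question 4 can be checked locally on the sample data from the question. The sketch below is plain Python rather than Spark, with a set standing in for the broadcast variable; the variable names are illustrative.

```python
from collections import Counter

# Sample data copied from the question.
content = [
    "Hello this is ABCTech.com",
    "This is TechABY.com",
    "Apache Spark Training",
    "This is Spark Learning Session",
    "Spark is faster than MapReduce",
]
remove = "Hello, is, this, the"

# Split on commas and trim whitespace; this set plays the role of the broadcast value.
remove_words = {w.strip() for w in remove.split(",")}

# Split each line into words, drop the unwanted ones, then count what is left.
words = [w for line in content for w in line.split(" ")]
filtered = [w for w in words if w not in remove_words]
word_count = Counter(filtered)

for word, count in sorted(word_count.items()):
    print(word, count)
```

The comparison is case-sensitive, so "This" survives the filter even though "this" is removed; lower-casing both sides would change that if it is not the intended behaviour.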

Question 5

Problem Scenario 5 : You have been given the following MySQL database details.

user=retail_dba

password=cloudera

database=retail_db

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish the following activities.

1. List all the tables from retail_db using a sqoop command.

2. Write a simple sqoop eval command to check whether you have permission to read the database tables.

3. Import all the tables as avro files in /user/hive/warehouse/retail_cca174.db

4. Import the departments table as a text file in /user/cloudera/departments.

A. Solution:
Step 1 : List tables using sqoop
sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera
Step 2 : Eval command, just run a count query on one of the tables.
sqoop eval \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--query 'select count(1) from order_items'
Step 3 : Import all the tables as avro files.
sqoop import-all-tables \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--as-avrodatafile \
--warehouse-dir=/user/hive/warehouse/retail_stage.db \
-m 1
Step 4 : Import departments table as a text file in /user/cloudera/departments
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--as-textfile \
--target-dir=/user/cloudera/departments
Step 5 : Verify the imported data.
hdfs dfs -ls /user/cloudera/departments
hdfs dfs -ls /user/hive/warehouse/retail_stage.db
hdfs dfs -ls /user/hive/warehouse/retail_stage.db/products

B. Solution:
Step 1 : List tables using sqoop
sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera
Step 2 : Eval command, just run a count query on one of the tables.
sqoop eval \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--query 'select count(1) from order_items'
Step 3 : Import all the tables as avro files.
sqoop import-all-tables \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--as-avrodatafile \
--warehouse-dir=/user/hive/warehouse/retail_stage.db \
-m 1
Step 4 : Verify the imported data.
hdfs dfs -ls /user/cloudera/departments
hdfs dfs -ls /user/hive/warehouse/retail_stage.db
hdfs dfs -ls /user/hive/warehouse/retail_stage.db/products

Answer : A

Question 6

Problem Scenario 77 : You have been given a MySQL DB with the following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.orders

table=retail_db.order_items

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Columns of orders table : (order_id, order_date, order_customer_id, order_status)

Columns of order_items table : (order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)

Please accomplish the following activities.

1. Copy the "retail_db.orders" and "retail_db.order_items" tables to HDFS in the respective directories p92_orders and p92_order_items.

2. Join these data using order_id in Spark and Python

3. Calculate total revenue per day and per order

4. Calculate total and average revenue for each date, using:

- combineByKey

- aggregateByKey

A. Solution :
Step 1 : Import each table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p92_orders -m 1
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p92_order_items -m 1
Note : Please check that you do not have a space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2 : Read the data from one of the partitions created using the above commands.
hadoop fs -cat p92_orders/part-m-00000
hadoop fs -cat p92_order_items/part-m-00000
Step 3 : Load the above two directories as RDDs using Spark and Python (open a pyspark terminal and do the following).
orders = sc.textFile('p92_orders')
orderItems = sc.textFile('p92_order_items')
Step 4 : Convert each RDD into key-value pairs (order_id as the key and the whole line as the value)
#First value is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(',')[0]), line))
#Second value is order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(',')[1]), line))
Step 5 : Join both the RDDs using order_id
joinedData = orderItemsKeyValue.join(ordersKeyValue)
#print the joined data
for line in joinedData.collect():
    print(line)
Format of joinedData as below.
[order_id, 'All columns from orderItemsKeyValue', 'All columns from ordersKeyValue']
Step 6 : Now fetch the selected values: order_id, order date and amount collected on this order.
#Returned row will contain ((order_date, order_id), amount_collected)
revenuePerDayPerOrder = joinedData.map(lambda row: ((row[1][1].split(',')[1], row[0]), float(row[1][0].split(',')[4])))
#print the result
for line in revenuePerDayPerOrder.collect():
    print(line)
Step 7 : Now calculate total revenue per day and per order
A . Using reduceByKey
totalRevenuePerDayPerOrder = revenuePerDayPerOrder.reduceByKey(lambda runningSum, value: runningSum + value)
for line in totalRevenuePerDayPerOrder.sortByKey().collect(): print(line)
#Generate data as (date, amount_collected) (ignore order_id)
dateAndRevenueTuple = totalRevenuePerDayPerOrder.map(lambda line: (line[0][0], line[1]))
for line in dateAndRevenueTuple.sortByKey().collect(): print(line)
Step 8 : Calculate the total amount collected for each day, and also count the number of records for each day.
#Generate output as (date, (total revenue for date, number of records for date))
#Line 1 : it will generate tuple (revenue, 1)
#Line 2 : Here, we sum all revenues and at the same time keep a counter of the number of records.
#Line 3 : Final function to merge all the combiners
totalRevenueAndTotalCount = dateAndRevenueTuple.combineByKey( \
lambda revenue: (revenue, 1), \
lambda revenueSumTuple, amount: (revenueSumTuple[0] + amount, revenueSumTuple[1] + 1), \
lambda tuple1, tuple2: (round(tuple1[0] + tuple2[0], 2), tuple1[1] + tuple2[1]) \
)
for line in totalRevenueAndTotalCount.collect(): print(line)
Step 9 : Now calculate the average for each date
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))
for line in averageRevenuePerDate.collect(): print(line)
Step 10 : Using aggregateByKey
#line 1 : (Initialize both values, revenue and count)
#line 2 : runningRevenueSumTuple (a tuple of total revenue and total record count for each date)
#line 3 : Summing the revenue and count of all partitions
totalRevenueAndTotalCount = dateAndRevenueTuple.aggregateByKey( \
(0,0), \
lambda runningRevenueSumTuple, revenue: (runningRevenueSumTuple[0] + revenue, runningRevenueSumTuple[1] + 1), \
lambda tupleOneRevenueAndCount, tupleTwoRevenueAndCount: (tupleOneRevenueAndCount[0] + tupleTwoRevenueAndCount[0], tupleOneRevenueAndCount[1] + tupleTwoRevenueAndCount[1]) \
)
for line in totalRevenueAndTotalCount.collect(): print(line)
Step 11 : Calculate the average revenue per date
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))
for line in averageRevenuePerDate.collect(): print(line)

B. Solution :
Step 1 : Import each table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p92_orders -m 1
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p92_order_items -m 1
Note : Please check that you do not have a space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2 : Read the data from one of the partitions created using the above commands.
hadoop fs -cat p92_orders/part-m-00000
hadoop fs -cat p92_order_items/part-m-66666
Step 3 : Load the above two directories as RDDs using Spark and Python (open a pyspark terminal and do the following).
orders = sc.textFile('p92_orders')
orderItems = sc.textFile('p92_order_items')
Step 4 : Convert each RDD into key-value pairs (order_id as the key and the whole line as the value)
#First value is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(',')[0]), line))
#Second value is order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(',')[1]), line))
Step 5 : Join both the RDDs using order_id
joinedData = orderItemsKeyValue.join(ordersKeyValue)
#print the joined data
for line in joinedData.collect():
    print(line)
Format of joinedData as below.
[order_id, 'All columns from orderItemsKeyValue', 'All columns from ordersKeyValue']
Step 6 : Now fetch the selected values: order_id, order date and amount collected on this order.
#Returned row will contain ((order_date, order_id), amount_collected)
revenuePerDayPerOrder = joinedData.map(lambda row: ((row[1][1].split(',')[1], row[0]),
print(line)
Step 7 : Now calculate total revenue per day and per order
A . Using reduceByKey
totalRevenuePerDayPerOrder = revenuePerDayPerOrder.reduceByKey(lambda runningSum, value: runningSum + value)
for line in totalRevenuePerDayPerOrder.sortByKey().collect(): print(line)
#Generate data as (date, amount_collected) (ignore order_id)
dateAndRevenueTuple = totalRevenuePerDayPerOrder.map(lambda line: (line[0][0], line[1]))
for line in dateAndRevenueTuple.sortByKey().collect(): print(line)
Step 8 : Calculate the total amount collected for each day, and also count the number of records for each day.
#Generate output as (date, (total revenue for date, number of records for date))
#Line 1 : it will generate tuple (revenue, 1)
#Line 2 : Here, we sum all revenues and at the same time keep a counter of the number of records.
#Line 3 : Final function to merge all the combiners
totalRevenueAndTotalCount = dateAndRevenueTuple.combineByKey( \
lambda revenue: (revenue, 1), \
lambda revenueSumTuple, amount: (revenueSumTuple[0] + amount, revenueSumTuple[1] + 1), \
lambda tuple1, tuple2: (round(tuple1[0] + tuple2[0], 2), tuple1[1] + tuple2[1]) \
)
for line in totalRevenueAndTotalCount.collect(): print(line)
Step 9 : Now calculate the average for each date
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))
for line in averageRevenuePerDate.collect(): print(line)
Step 10 : Using aggregateByKey
#line 1 : (Initialize both values, revenue and count)
#line 2 : runningRevenueSumTuple (a tuple of total revenue and total record count for each date)
#line 3 : Summing the revenue and count of all partitions
totalRevenueAndTotalCount = dateAndRevenueTuple.aggregateByKey( \
(0,0), \
lambda runningRevenueSumTuple, revenue: (runningRevenueSumTuple[0] + revenue, runningRevenueSumTuple[1] + 1), \
lambda tupleOneRevenueAndCount, tupleTwoRevenueAndCount: (tupleOneRevenueAndCount[0] + tupleTwoRevenueAndCount[0], tupleOneRevenueAndCount[1] + tupleTwoRevenueAndCount[1]) \
)
for line in totalRevenueAndTotalCount.collect(): print(line)
Step 11 : Calculate the average revenue per date
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))
for line in averageRevenuePerDate.collect(): print(line)

Answer : A
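To see how the three functions passed to combineByKey in Question 6 cooperate, the sketch below emulates them in plain Python over two hand-made partitions. The helper combine_by_key and the sample dates and amounts are hypothetical; Spark's real implementation also shuffles data between executors, which is not modelled here.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Emulate combineByKey: build combiners per partition, then merge them."""
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                # Key already seen in this partition: fold the value in.
                combiners[key] = merge_value(combiners[key], value)
            else:
                # First value for this key in this partition.
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged

# Hypothetical (date, revenue) pairs split across two partitions.
partitions = [
    [("2013-07-25", 100.0), ("2013-07-25", 50.0), ("2013-07-26", 20.0)],
    [("2013-07-25", 150.0), ("2013-07-26", 40.0)],
]

totals = combine_by_key(
    partitions,
    lambda revenue: (revenue, 1),                       # start a (sum, count) pair
    lambda acc, revenue: (acc[0] + revenue, acc[1] + 1),  # add one record
    lambda a, b: (a[0] + b[0], a[1] + b[1]),              # merge two partitions
)
averages = {date: total / count for date, (total, count) in totals.items()}

print(totals)
print(averages)
```

aggregateByKey follows the same two-phase pattern, except that each key starts from a supplied zero value such as (0, 0) instead of calling a create-combiner function on the first value.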

Question 7

Problem Scenario 47 : You have been given the code snippet below, with intermediate output.

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

// let's first print out the contents of the RDD with partition labels

def myfunc(index: Int, iter: Iterator[(Int)]): Iterator[String] = {

iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator

}

// In each run, the output could be different; while solving the problem, assume the output below only.

z.mapPartitionsWithIndex(myfunc).collect

res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])

Now apply the aggregate method on RDD z with two reduce functions: the first will select the max value in each partition, and the second will add all the maximum values from all partitions.

Initialize the aggregate with value 5; hence the expected output will be 16.

See Below Explanation:

Answer : A
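The expected value of 16 in Question 7 can be traced with a small plain-Python emulation of aggregate. The helper below is an illustrative sketch, not Spark code; it assumes the partition layout shown in the intermediate output and applies the zero value once per partition and once more in the final combine, which is how the stated result arises.

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Emulate RDD.aggregate over an explicit list of partitions."""
    # seq_op folds each partition's elements, starting from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # comb_op merges the per-partition results, again starting from the zero value.
    return reduce(comb_op, partials, zero)

# Partition layout taken from the intermediate output in the question.
partitions = [[1, 2, 3], [4, 5, 6]]

# Partition 0: max(5, 1, 2, 3) = 5; partition 1: max(5, 4, 5, 6) = 6.
# Final combine: 5 (zero value) + 5 + 6 = 16.
result = aggregate(partitions, 5, max, lambda a, b: a + b)

print(result)
```

In Scala the equivalent call would take the form z.aggregate(5)(math.max(_, _), _ + _).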