Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate
format for this kind of data?
Answer : D
The schema passed to the schema() method should be of type StructType or a string, so all options in which a list is passed are incorrect.
In addition, since all numbers are whole numbers, the IntegerType() data type is the correct option here. NumberType() is not a valid data type and StringType() would fail, since the parquet file is
stored in the 'most appropriate format for this kind of data', meaning that it is most likely an IntegerType, and Spark does not convert data types if a schema is provided.
Also note that StructType accepts only a single argument (a list of StructFields). So, passing multiple arguments is invalid.
Finally, Spark needs to know which format the file is in. However, none of the options fail on this point, since Spark assumes Parquet as the default format when no file format is explicitly specified. A short example of such a schema-based read follows the documentation references below.
More info: pyspark.sql.DataFrameReader.schema --- PySpark 3.1.2 documentation and StructType --- PySpark 3.1.2 documentation
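Since the exact answer options are not reproduced here, the following is only a plausible sketch of the kind of read described above; the column names are placeholders, and filePath is assumed to hold the file location given in the question.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: the column names are placeholders, since the file's layout is not shown.
# Note that StructType takes a single argument, a list of StructFields.
schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("value", IntegerType(), True),
])

# filePath is assumed to hold the location from the question.
# Spark assumes Parquet when no format is specified explicitly.
df = spark.read.schema(schema).parquet(filePath)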
Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?
Answer : C
itemsDf.withColumnRenamed('supplier', 'manufacturer')
Correct! This uses the relatively trivial DataFrame method withColumnRenamed for renaming column supplier to column manufacturer.
Note that the question asks for 'a copy of DataFrame itemsDf'. This may be confusing if you are not familiar with Spark yet. RDDs (Resilient Distributed Datasets) are the foundation of Spark DataFrames and are immutable. As such, DataFrames are immutable, too. Any command that changes anything in the DataFrame therefore necessarily returns a copy, or a new version, of it that has the changes applied (see the short example after this question's references).
itemsDf.withColumnsRenamed('supplier', 'manufacturer')
Incorrect. Spark's DataFrame API does not have a withColumnsRenamed() method.
itemsDf.withColumnRenamed(col('manufacturer'), col('supplier'))
No. Watch out -- although the col() method works for many methods of the DataFrame API, withColumnRenamed is not one of them. As outlined in the documentation linked below,
withColumnRenamed expects strings.
itemsDf.withColumn(['supplier', 'manufacturer'])
Wrong. While DataFrame.withColumn() exists in Spark, it has a different purpose than renaming columns. withColumn is typically used to add columns to DataFrames, taking the name of the new column as the first argument and a Column as the second argument. Learn more via the documentation that is linked below.
itemsDf.withColumn('supplier').alias('manufacturer')
No. While DataFrame.withColumn() exists, it requires 2 arguments. Furthermore, the alias() method on DataFrames would not help much with renaming a column. DataFrame.alias() can be useful for disambiguating the inputs of join statements. However, this is far outside the scope of this question. If you are curious nevertheless, check out the link below.
More info: pyspark.sql.DataFrame.withColumnRenamed --- PySpark 3.1.1 documentation, pyspark.sql.DataFrame.withColumn --- PySpark 3.1.1 documentation, and pyspark.sql.DataFrame.alias --- PySpark 3.1.1 documentation
(Databricks import instructions: https://bit.ly/sparkpracticeexams_import_instructions)
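A minimal, self-contained sketch of the correct approach; the data is made up, since itemsDf's full schema is not reproduced in this question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for itemsDf.
itemsDf = spark.createDataFrame(
    [(1, "Sports Company Inc."), (2, "YetiX")],
    ["itemId", "supplier"],
)

# Returns a new DataFrame in which supplier is renamed to manufacturer;
# itemsDf itself remains unchanged, since DataFrames are immutable.
manufacturerDf = itemsDf.withColumnRenamed("supplier", "manufacturer")
manufacturerDf.show()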
Which of the following are valid execution modes?
Answer : B
This is a tricky question to get right, since it is easy to confuse execution modes and deployment modes. Even in the literature, both terms are sometimes used interchangeably.
There are only 3 valid execution modes in Spark: client, cluster, and local. Execution modes do not refer to specific frameworks, but to where the Spark components are located with respect to each other.
In client mode, the driver sits on a machine outside the cluster. In cluster mode, the driver sits on a machine inside the cluster. Finally, in local mode, all Spark infrastructure is started in a single JVM (Java Virtual Machine) on a single computer, which then also hosts the driver.
Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses specific frameworks outside Spark. Valid deployment modes are standalone, Apache YARN,
Apache Mesos and Kubernetes.
Client, Cluster, Local
Correct, all of these are the valid execution modes in Spark.
Standalone, Client, Cluster
No, standalone is not a valid execution mode. It is a valid deployment mode, though.
Kubernetes, Local, Client
No, Kubernetes is a deployment mode, but not an execution mode.
Cluster, Server, Local
No, Server is not an execution mode.
Server, Standalone, Client
No, standalone and server are not execution modes.
More info: Apache Spark Internals - Learning Journal
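As a rough, non-authoritative sketch, this is how the three execution modes are typically selected in practice; the application and file names are placeholders. Note that, somewhat confusingly, the spark-submit flag that chooses between client and cluster execution is called --deploy-mode.
from pyspark.sql import SparkSession

# Local mode: driver and executors run inside a single JVM on one machine.
spark = SparkSession.builder.master("local[*]").appName("local-demo").getOrCreate()

# Client vs. cluster mode is chosen when submitting the application to a cluster manager, e.g.:
#   spark-submit --master yarn --deploy-mode client  my_app.py   # driver runs on the submitting machine
#   spark-submit --master yarn --deploy-mode cluster my_app.py   # driver runs on a machine inside the cluster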
The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to
transactionNumber. Find the error.
Code block:
transactionsDf.withColumn("transactionNumber", "transactionId")
Answer : E
Correct code block:
transactionsDf.withColumnRenamed('transactionId', 'transactionNumber')
Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.
More info: pyspark.sql.DataFrame.withColumnRenamed --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, Question 26 (Databricks import instructions)
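A minimal sketch contrasting the broken and the corrected code block, using a made-up stand-in for transactionsDf (only the relevant column is modelled):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for transactionsDf; the real DataFrame's other columns are omitted here.
transactionsDf = spark.createDataFrame([(1,), (2,)], ["transactionId"])

# Broken code block: raises an error, since withColumn expects a Column object
# (not the string "transactionId") as its second argument.
# transactionsDf.withColumn("transactionNumber", "transactionId")

# Corrected code block: returns a new DataFrame in which the column is renamed.
transactionsDf.withColumnRenamed("transactionId", "transactionNumber").show()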
The code block shown below should return all rows of DataFrame itemsDf that have at least 3 items in column itemNameElements. Choose the answer that correctly fills the blanks in the code block
to accomplish this.
Example of DataFrame itemsDf:
+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
|3     |Outdoors Backpack                 |Sports Company Inc.|[Outdoors, Backpack]                      |
+------+----------------------------------+-------------------+------------------------------------------+
Code block:
itemsDf.__1__(__2__(__3__)__4__)
Answer : D
Correct code block:
itemsDf.filter(size('itemNameElements')>3)
Output of code block:
+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
+------+----------------------------------+-------------------+------------------------------------------+
The big difficulty with this question is knowing the difference between count and size (refer to the documentation below). size is the correct function to choose here since it returns the number of elements in an array on a per-row basis.
The other consideration for solving this question is the difference between select and filter. Since we want to return the rows of the original DataFrame, filter is the right choice. If we used select, we would simply get a single-column DataFrame showing which rows match the criteria, like so:
+----------------------------+
|(size(itemNameElements) > 3)|
+----------------------------+
|true                        |
|true                        |
|false                       |
+----------------------------+
More info:
Count documentation: pyspark.sql.functions.count --- PySpark 3.1.1 documentation
Size documentation: pyspark.sql.functions.size --- PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1, Question 47 (Databricks import instructions)
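A self-contained sketch that rebuilds the example data (the itemName and supplier columns are omitted for brevity) and applies the correct code block:
from pyspark.sql import SparkSession
from pyspark.sql.functions import size

spark = SparkSession.builder.getOrCreate()

# Rebuild a reduced version of the example itemsDf shown above.
itemsDf = spark.createDataFrame(
    [
        (1, ["Thick", "Coat", "for", "Walking", "in", "the", "Snow"]),
        (2, ["Elegant", "Outdoors", "Summer", "Dress"]),
        (3, ["Outdoors", "Backpack"]),
    ],
    ["itemId", "itemNameElements"],
)

# size() counts the elements of the array column on a per-row basis;
# filter() keeps the full rows for which the condition holds.
itemsDf.filter(size("itemNameElements") > 3).show(truncate=False)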
Which of the following describes a valid concern about partitioning?
Answer : A
A shuffle operation returns 200 partitions if not explicitly set.
Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.
The coalesce() method should be used to increase the number of partitions.
Incorrect. The coalesce() method can only be used to decrease the number of partitions.
Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No. For narrow transformations, fewer partitions usually result in a longer overall runtime, if more executors are available than partitions.
A narrow transformation does not include a shuffle, so no data needs to be exchanged between executors. Shuffles are expensive and can be a bottleneck for executing Spark workloads.
Narrow transformations, however, are executed on a per-partition basis, blocking one executor per partition. So, it matters how many executors are available to perform work in parallel relative to the number of partitions. If the number of executors is greater than the number of partitions, some executors are idle while others process the partitions. On the flip side, if the number of executors is smaller than the number of partitions, the entire operation can only finish after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one would want the number of partitions to equal the number of executors (but not more).
So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No data is exchanged between executors when coalesce() is run.
No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.
Short partition processing times are indicative of low skew.
Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly.
Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition, but others do not. However, a short processing time is not per se indicative of low skew: It may simply be short because the partition is small.
A situation indicative of low skew may be when all executors finish processing their partitions in the same timeframe. High skew may be indicated by some executors taking much longer to finish their
partitions than others. But the answer does not make any comparison -- so by itself it does not provide enough information to make any assessment about skew.
More info: Spark Repartition & Coalesce - Explained and Performance Tuning - Spark 3.1.2 Documentation
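A small sketch illustrating the points above; the partition counts are arbitrary examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default number of shuffle partitions (200 unless overridden).
print(spark.conf.get("spark.sql.shuffle.partitions"))

df = spark.range(0, 1000)

# repartition() can increase or decrease the partition count, at the cost of a full shuffle.
print(df.repartition(8).rdd.getNumPartitions())    # 8

# coalesce() can only decrease the partition count and avoids a full shuffle.
print(df.coalesce(2).rdd.getNumPartitions())       # 2
print(df.coalesce(1000).rdd.getNumPartitions())    # unchanged: coalesce cannot increase partitions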
Which of the following describes Spark's Adaptive Query Execution?
Answer : D
Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.
This is almost correct. All of these features, except for dynamically injecting scan filters, are part of Adaptive Query Execution. Dynamically injecting scan filters for join operations to limit the amount
of data to be considered in a query is part of Dynamic Partition Pruning and not of Adaptive Query Execution.
Adaptive Query Execution reoptimizes queries at execution points.
No, Adaptive Query Execution reoptimizes queries at materialization points.
Adaptive Query Execution is enabled in Spark by default.
No, Adaptive Query Execution is disabled in Spark by default and needs to be enabled through the spark.sql.adaptive.enabled property.
Adaptive Query Execution applies to all kinds of queries.
No, Adaptive Query Execution applies only to queries that are not streaming queries and that contain at least one exchange (typically expressed through a join, aggregate, or window operator) or
one subquery.
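A brief sketch of how Adaptive Query Execution and the features mentioned above are switched on via configuration; the property names are the standard Spark SQL ones. Note that AQE is only disabled by default up to Spark 3.1, the version referenced in this document, and became the default in Spark 3.2.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Adaptive Query Execution (off by default in Spark 3.1).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# The individual AQE features have their own switches, for example:
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # dynamically coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # dynamically optimize skew joins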