Databricks Certified Associate Developer for Apache Spark 3.0 Exam Practice Test

Page: 1 / 14
Total 180 questions
Question 1

Which of the following is not a feature of Adaptive Query Execution?



Answer : D

Reroute a query in case of an executor failure.

Correct. Although this feature exists in Spark, it is not a feature of Adaptive Query Execution. The cluster manager keeps track of executors and will work together with the driver to launch an

executor and assign the workload of the failed executor to it (see also link below).

Replace a sort merge join with a broadcast join, where appropriate.

No, this is a feature of Adaptive Query Execution.

Coalesce partitions to accelerate data processing.

Wrong, Adaptive Query Execution does this.

Collect runtime statistics during query execution.

Incorrect, Adaptive Query Execution (AQE) collects these statistics to adjust query plans. This feedback loop is an essential part of accelerating queries via AQE.

Split skewed partitions into smaller partitions to avoid differences in partition processing time.

No, this is indeed a feature of Adaptive Query Execution. Find more information in the Databricks blog post linked below.

More info: Learning Spark, 2nd Edition, Chapter 12, On which way does RDD of spark finish fault-tolerance? - Stack Overflow, How to Speed up SQL Queries with Adaptive Query Execution
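
As a quick illustration of the features above, AQE and its sub-features are controlled by standard Spark 3.x configuration flags. The sketch below assumes an existing SparkSession named spark; note that there is no AQE flag for rerouting queries after executor failures, since that is handled by the cluster manager.

    # Enable Adaptive Query Execution and its sub-features (Spark 3.x)
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # Coalesce small shuffle partitions at runtime
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions to even out partition processing times
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    # Runtime statistics collected during execution may also let Spark replace
    # a sort merge join with a broadcast join when one side turns out to be
    # small enough. There is no AQE setting related to executor failure.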


Question 2

Which of the following statements about storage levels is incorrect?



Answer : D

MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.

Correct, this statement is wrong. MEMORY_AND_DISK does not replicate data: Spark prioritizes storage in memory, and will only spill data to disk that does not fit into memory. Replication is indicated by the _2 suffix, as in MEMORY_AND_DISK_2.

DISK_ONLY will not use the worker node's memory.

Wrong, this statement is correct. DISK_ONLY keeps data only on the worker node's disk, but not in memory.

In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.

Wrong, this statement is correct. In fact, Spark does not have a provision to cache DataFrames in the driver (which sits on the edge node in client mode). Spark caches DataFrames in the executors'

memory.

Caching can be undone using the DataFrame.unpersist() operator.

Wrong, this statement is correct. Caching, as achieved via the DataFrame.cache() or DataFrame.persist() operators, can be undone using the DataFrame.unpersist() operator. This operator

removes the DataFrame's cached blocks from the executors' memory and disk.

The cache operator on DataFrames is evaluated like a transformation.

Wrong, this statement is correct. DataFrame.cache() is evaluated like a transformation: Through lazy evaluation. This means that after calling DataFrame.cache() the command will not have any

effect until you call a subsequent action, like DataFrame.cache().count().

More info: pyspark.sql.DataFrame.unpersist --- PySpark 3.1.2 documentation
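
For reference, a minimal PySpark sketch of how storage levels are applied and undone, assuming an existing DataFrame named df:

    from pyspark import StorageLevel

    # cache() uses Spark's default storage level (memory first, spilling to
    # disk); like a transformation, nothing is materialized until an action runs.
    df.cache()
    df.count()          # this action actually populates the cache
    df.unpersist()      # removes the DataFrame's blocks from memory and disk

    # persist() accepts an explicit storage level; replication only happens
    # with the _2 levels, e.g. MEMORY_ONLY_2.
    df.persist(StorageLevel.DISK_ONLY)
    df.count()
    df.unpersist()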


Question 3

Which of the following statements about reducing out-of-memory errors is incorrect?



Answer : A

Concatenating multiple string columns into a single column may guard against out-of-memory errors.

Exactly, this is an incorrect answer! Concatenating string columns does not reduce the size of the data, it just structures it in a different way. This changes little about how Spark processes the data and

certainly does not reduce out-of-memory errors.

Reducing partition size can help against out-of-memory errors.

No, this is not incorrect. Reducing partition size is a viable way to aid against out-of-memory errors, since executors need to load partitions into memory before processing them. If the executor does

not have enough memory available to do that, it will throw an out-of-memory error. Decreasing partition size can therefore be very helpful for preventing that.

Decreasing the number of cores available to each executor can help against out-of-memory errors.

No, this is not incorrect. To process a partition, that partition needs to be loaded into the memory of an executor. Since every core in every executor can process a partition, potentially in

parallel with other executors, memory on the machine hosting the executors can fill up quite quickly. So, memory usage of executors is a concern, especially when multiple

partitions are processed at the same time. To strike a balance between performance and memory usage, decreasing the number of cores may help against out-of-memory errors.

Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.

No, this is not incorrect. When using commands like collect() that trigger the transmission of potentially large amounts of data from the cluster to the driver, the driver may experience out-of-memory

errors. One strategy to avoid this is to be careful about using commands like collect() that send back large amounts of data to the driver. Another strategy is setting the parameter

spark.driver.maxResultSize. If data to be transmitted to the driver exceeds the threshold specified by the parameter, Spark will abort the job and therefore prevent an out-of-memory error.

Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.

Wrong, this is not incorrect. As part of Spark's internal optimization, Spark may choose to speed up operations by broadcasting (usually relatively small) tables to executors. This broadcast is

happening from the driver, so all the broadcast tables are loaded into the driver first. If these tables are relatively big, or multiple mid-size tables are being broadcast, this may lead to an

out-of-memory error. The maximum table size for which Spark will consider broadcasting is set by the spark.sql.autoBroadcastJoinThreshold parameter.

More info: Configuration - Spark 3.1.2 Documentation and Spark OOM Error --- Closeup. Does the following look familiar when... | by Amit Singh Rathore | The Startup | Medium
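
The settings mentioned above can be sketched in PySpark as follows. The app name, input path, and all values are hypothetical examples, not recommendations:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("oom-mitigation-sketch")            # hypothetical app name
             # Abort a job instead of overwhelming the driver when collected
             # results would exceed this size (example value).
             .config("spark.driver.maxResultSize", "2g")
             # Only consider tables up to ~10 MB for automatic broadcast joins
             # (example value).
             .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
             # Fewer cores per executor means fewer partitions held in memory
             # at the same time (example value).
             .config("spark.executor.cores", "2")
             .getOrCreate())

    # Reducing partition size: spread the same data over more partitions.
    df = spark.read.parquet("/path/to/data")   # hypothetical input path
    df = df.repartition(400)                   # example partition count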


Question 4

Which of the following is a problem with using accumulators?



Answer : C

Accumulator values can only be read by the driver, but not by executors.

Correct. So, for example, you cannot use an accumulator variable for coordinating workloads between executors. The typical, canonical, use case of an accumulator value is to report data, for

example for debugging purposes, back to the driver. For example, if you wanted to count values that match a specific condition in a UDF for debugging purposes, an accumulator provides a good

way to do that.

Only numeric values can be used in accumulators.

No. While PySpark's Accumulator only supports numeric values (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (documentation linked below).

Accumulators do not obey lazy evaluation.

Incorrect -- accumulators do obey lazy evaluation. This has implications in practice: When an accumulator is encapsulated in a transformation, that accumulator will not be modified until a

subsequent action is run.

Accumulators are difficult to use for debugging because they will only be updated once, regardless of whether a task has to be re-run due to hardware failure.

Wrong. A concern with accumulators is in fact that under certain conditions they can be updated more than once per task. For example, if a hardware failure occurs during a task after an accumulator

variable has been incremented but before the task has finished, and Spark relaunches the task on a different worker in response to the failure, the increments that were already applied will be

repeated.

Only unnamed accumulators can be inspected in the Spark UI.

No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.

More info: Aggregating Results with Spark Accumulators | Sparkour, RDD Programming Guide - Spark 3.1.2 Documentation, pyspark.Accumulator --- PySpark 3.1.2 documentation, and

pyspark.AccumulatorParam --- PySpark 3.1.2 documentation
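
A minimal PySpark sketch of the driver-only read behaviour, assuming an existing SparkContext named sc:

    # Accumulator defined on the driver; executors can only add to it.
    bad_records = sc.accumulator(0)

    def check(value):
        # Runs on the executors; bad_records.value cannot be read here.
        if value is None:
            bad_records.add(1)

    rdd = sc.parallelize([1, None, 3, None, 5])
    rdd.foreach(check)          # an action: the accumulator updates happen now

    # Only the driver can read the accumulated value.
    print(bad_records.value)    # 2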


Question 5

Which of the following describes a valid concern about partitioning?



Answer : A

A shuffle operation returns 200 partitions if not explicitly set.

Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.

The coalesce() method should be used to increase the number of partitions.

Incorrect. The coalesce() method can only be used to decrease the number of partitions.

Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

No. For narrow transformations, fewer partitions usually result in a longer overall runtime, if more executors are available than partitions.

A narrow transformation does not include a shuffle, so no data need to be exchanged between executors. Shuffles are expensive and can be a bottleneck for executing Spark workloads.

Narrow transformations, however, are executed on a per-partition basis, blocking one executor per partition. So, it matters how many executors are available to perform work in parallel relative to the

number of partitions. If the number of executors is greater than the number of partitions, then some executors are idle while others process the partitions. On the flip side, if the number of executors is

smaller than the number of partitions, the entire operation can only be finished after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one

would want to have the number of partitions equal to the number of executors (but not more).

So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

No data is exchanged between executors when coalesce() is run.

No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.

Short partition processing times are indicative of low skew.

Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly.

Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition, but others do not. However, a short

processing time is not per se indicative of low skew: It may simply be short because the partition is small.

A situation indicative of low skew may be when all executors finish processing their partitions in the same timeframe. High skew may be indicated by some executors taking much longer to finish their

partitions than others. But the answer does not make any comparison -- so by itself it does not provide enough information to make any assessment about skew.

More info: Spark Repartition & Coalesce - Explained and Performance Tuning - Spark 3.1.2 Documentation
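
To make the partitioning mechanics above concrete, a short PySpark sketch, assuming an existing SparkSession named spark and a DataFrame named df:

    # Default number of partitions produced by a shuffle (joins, aggregations).
    print(spark.conf.get("spark.sql.shuffle.partitions"))   # '200' by default

    # Inspect the current number of partitions of a DataFrame.
    print(df.rdd.getNumPartitions())

    # repartition() can increase (or decrease) the partition count; it causes
    # a full shuffle.
    df_more = df.repartition(400)

    # coalesce() can only decrease the partition count; it avoids a full
    # shuffle but may still move data between executors.
    df_fewer = df.coalesce(10)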


Question 6

Which of the following statements about executors is correct?



Answer : B

Executors stop upon application completion by default.

Correct. Executors only persist during the lifetime of an application.

A notable exception to that is when Dynamic Resource Allocation is enabled (which it is not by default). With Dynamic Resource Allocation enabled, executors are terminated when they are idle,

independent of whether the application has been completed or not.

An executor can serve multiple applications.

Wrong. An executor is always specific to the application. It is terminated when the application completes (exception see above).

Each node hosts a single executor.

No. Each node can host one or more executors.

Executors store data in memory only.

No. Executors can store data in memory or on disk.

Executors are launched by the driver.

Incorrect. Executors are launched by the cluster manager on behalf of the driver.

More info: Job Scheduling - Spark 3.1.2 Documentation, How Applications are Executed on a Spark Cluster | Anatomy of a Spark Application | InformIT, and Spark Jargon for Starters. This blog is to

clear some of the... | by Mageswaran D | Medium
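
The Dynamic Resource Allocation exception mentioned above is controlled by configuration at application launch. A minimal sketch, where the property names are standard Spark settings and the values are examples only:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dynamic-allocation-sketch")        # hypothetical app name
             # Off by default; when enabled, idle executors are released even
             # while the application is still running.
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
             # Needed here because no external shuffle service is assumed.
             .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
             .getOrCreate())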


Question 7

Which of the following describes Spark actions?



Answer : C

The driver receives data upon request by actions.

Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.

Actions are Spark's way of exchanging data between executors.

No. In Spark, data is exchanged between executors via shuffles.

Writing data to disk is the primary purpose of actions.

No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.

Actions are Spark's way of modifying RDDs.

Incorrect. Firstly, RDDs are immutable -- they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.

Stage boundaries are commonly established by actions.

Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.
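
As an illustration of the distinction between transformations and actions, assuming an existing SparkSession named spark:

    df = spark.range(10)                            # lazily defines a DataFrame
    doubled = df.selectExpr("id * 2 AS doubled")    # transformation: nothing runs yet

    # Actions trigger the distributed execution of tasks on the executors and
    # return data to the driver.
    print(doubled.count())      # 10
    rows = doubled.collect()    # list of Row objects, now sitting on the driver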

