Databricks Certified Associate Developer for Apache Spark 3.5 - Python: Exam Questions

Page: 1 / 14
Total 135 questions
Question 1

A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path.

Which line of code ensures the data is saved to a specific location?

Options:



Answer : C

To persist a table and specify the save path, use:

users.write.option('path', '/some/path').saveAsTable('default_table')

The .option('path', ...) must be applied before calling saveAsTable.

Option A uses invalid syntax (write(path=...)).

Option B applies .option() after .saveAsTable(), which is too late; the table has already been written by then.

Option D uses incorrect syntax (no path parameter in saveAsTable).


Question 2

A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))

The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.

Which change should be made to the code to stop these customer records from being dropped?



Answer : A

In Spark, the default join type is an inner join, which returns only the rows with matching keys in both DataFrames. To retain all records from the left DataFrame (purch_df) and include matching records from the right DataFrame (cust_df), a left outer join should be used.

By specifying the join type as 'left', the modified code ensures that all records from purch_df are preserved, and matching records from cust_df are included. Records in purch_df without a corresponding match in cust_df will have null values for the columns from cust_df.

This approach is consistent with standard SQL join operations and is supported in PySpark's DataFrame API.


Question 3

A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming. They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

A.

query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

B.

query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()

C.

query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()

D.

query = df.writeStream \
    .outputMode("append") \
    .start()



Answer : A

To process data in fixed micro-batch intervals, use the .trigger(processingTime='interval') option in Structured Streaming.

Correct usage:

query = df.writeStream \
    .outputMode('append') \
    .trigger(processingTime='5 seconds') \
    .start()

This instructs Spark to process available data every 5 seconds.

Why the other options are incorrect:

B: continuous triggers are for continuous processing mode (different execution model).

C: once=True runs the stream a single time (batch mode).

D: Default trigger runs as fast as possible, not fixed intervals.


PySpark Structured Streaming Guide --- Trigger types: processingTime, once, continuous.

Databricks Exam Guide (June 2025): Section ''Structured Streaming'' --- controlling streaming triggers and batch intervals.


Question 4

A data engineer is working on the DataFrame:

(Referring to the table image: it has columns Id, Name, count, and timestamp.)

Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?



Answer : B

To extract unique values from a column and sort them alphabetically:

distinct() is required to remove duplicate values.

orderBy() is needed to sort the results alphabetically (ascending by default).

Correct code:

df.select('Name').distinct().orderBy(df['Name'])

This is standard DataFrame API usage in PySpark, as documented in the official Databricks Spark APIs. Option A is incorrect because it does not remove duplicates. Option C omits sorting. Option D sorts in descending order, which does not meet the requirement for alphabetical (ascending) order.


Question 5

A data engineer is working on a num_df DataFrame and has a Python UDF defined as:

def cube_func(val):
    return val * val * val

Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?

A.

spark.udf.register("cube_func", cube_func)
num_df.selectExpr("cube_func(num)").show()

B.

num_df.select(cube_func("num")).show()

C.

spark.createDataFrame(cube_func("num")).show()

D.

num_df.register("cube_func").select("num").show()



Answer : A

To use a Python function as a UDF (User Defined Function) in Spark SQL, it must first be registered using spark.udf.register().

Correct usage:

spark.udf.register('cube_func', cube_func)
num_df.selectExpr('cube_func(num)').show()

This registers cube_func as a callable SQL function available in expressions or queries.

Why the other options are incorrect:

B: A plain Python function cannot be called inside select(); it must first be wrapped with udf() or registered and invoked through selectExpr.

C: createDataFrame is for building DataFrames, not calling UDFs.

D: DataFrames cannot directly register UDFs.


PySpark SQL Functions --- spark.udf.register() and selectExpr().

Databricks Exam Guide (June 2025): Section ''Using Spark SQL'' --- user-defined functions and Spark SQL integration.


Question 6

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?



Answer : A

The number of tasks is controlled by the number of partitions. By default, spark.sql.shuffle.partitions is 200. If stages show very few tasks (fewer than the total number of executor cores), the cluster is not exploiting its full parallelism.

From the Spark tuning guide:

'To improve performance, especially for large clusters, increase spark.sql.shuffle.partitions to create more tasks and parallelism.'

Thus:

A is correct: increasing shuffle partitions increases parallelism

B is wrong: it further reduces parallelism

C is invalid: increasing dataset size doesn't guarantee more partitions

D is irrelevant to task count per stage



Question 7

A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off.

How can this be achieved?



Answer : C

In Structured Streaming, checkpoints store state information (offsets, progress, and metadata) needed to resume a stream after a failure or restart.

Correct usage:

Set the checkpointLocation option when writing the streaming output:

streaming_df.writeStream \
    .format('delta') \
    .option('checkpointLocation', '/path/to/checkpoint/dir') \
    .start('/path/to/output')

Spark uses this checkpoint directory to recover progress automatically and maintain exactly-once semantics.

Why the other options are incorrect:

A/D: recoveryLocation is not a valid Spark configuration option.

B: Checkpointing must be configured in writeStream, not during readStream.


PySpark Structured Streaming Guide --- Checkpointing and recovery.

Databricks Exam Guide (June 2025): Section ''Structured Streaming'' --- explains checkpointing and fault-tolerant streaming recovery.
