Databricks Certified Data Engineer Associate Exam Practice Test

Question 1

A data engineer who is new to Python needs to create a Python function that adds two integers together and returns the sum.

Which of the following code blocks can the data engineer use to complete this task?

A. Option A

B. Option B

C. Option C

D. Option D

E. Option E

Answer : D

https://www.w3schools.com/python/python_functions.asp

https://www.geeksforgeeks.org/python-functions/
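
The answer options are not reproduced above, so the following is only a minimal sketch of what a correct function looks like: it is declared with `def`, takes two parameters, and uses `return` rather than `print`. The name `add_integers` and its parameter names are illustrative and are not taken from the exam options.

```python
def add_integers(x, y):
    # `return` hands the sum back to the caller; `print` would only display it.
    return x + y


total = add_integers(2, 3)
```

A function that prints the sum instead of returning it would leave `total` set to `None`, which is a common way the incorrect options in this kind of question differ.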

Question 2

In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?

A. Checkpointing and Write-ahead Logs

B. Structured Streaming cannot record the offset range of the data being processed in each trigger.

C. Replayable Sources and Idempotent Sinks

D. Write-ahead Logs and Idempotent Sinks

E. Checkpointing and Idempotent Sinks

Answer : A

Structured Streaming uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. This lets the engine reliably track the exact progress of the processing and handle any kind of failure by restarting and/or reprocessing. Checkpointing saves the state of a streaming query to fault-tolerant storage (such as HDFS) so that it can be recovered after a failure. Write-ahead logs record the offset range of the data being processed in each trigger and are written to the checkpoint location before processing starts. After a failure, these logs are used to recover the query state and resume from the last recorded offset range. Replayable sources and idempotent sinks, which appear in options C, D, and E, are what allow reprocessed data to produce end-to-end exactly-once results, but they are not how the offsets are recorded. Reference: Structured Streaming Programming Guide, Fault Tolerance Semantics
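
The write-ahead idea can be illustrated with a toy model in plain Python. This is not Spark's implementation: the function `run_trigger`, the file names, and the JSON layout are all assumptions made for the sketch. It only shows the ordering described above, where the offset range is persisted before processing and a commit marker is written afterwards.

```python
import json
import os
import tempfile


def run_trigger(checkpoint_dir, batch_id, start, end, process):
    """Toy model: log the offset range first, process, then record the commit."""
    offsets_path = os.path.join(checkpoint_dir, f"offsets-{batch_id}.json")
    commit_path = os.path.join(checkpoint_dir, f"commit-{batch_id}")
    # Write-ahead step: persist the planned range before touching any data.
    if not os.path.exists(offsets_path):
        with open(offsets_path, "w") as f:
            json.dump({"start": start, "end": end}, f)
    # On a restart, the logged range is re-read so the same batch is replayed.
    with open(offsets_path) as f:
        logged = json.load(f)
    process(logged["start"], logged["end"])
    # The batch is marked as done only after processing succeeds.
    open(commit_path, "w").close()
    return logged


checkpoint_dir = tempfile.mkdtemp()
seen = []


def failing(start, end):
    raise RuntimeError("simulated crash")


try:
    run_trigger(checkpoint_dir, 0, 0, 100, failing)
except RuntimeError:
    pass

# After the "restart" a different range is proposed, but the logged one wins.
replayed = run_trigger(checkpoint_dir, 0, 0, 250, lambda s, e: seen.append((s, e)))
```

Because the range for batch 0 was logged before the simulated crash, the second call reprocesses offsets 0 to 100 rather than the newly proposed range.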

Question 3

A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:

Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

A. Replace predict with a stream-friendly prediction function

B. Replace schema(schema) with option('maxFilesPerTrigger', 1)

C. Replace 'transactions' with the path to the location of the Delta table

D. Replace format('delta') with format('stream')

E. Replace spark.read with spark.readStream

Answer : E

To read from a stream source, the data engineer needs to use spark.readStream instead of spark.read. spark.readStream returns a DataStreamReader, which is used to specify the details of the input source, such as the format, schema, path, and options. spark.read returns a DataFrameReader, which is only suitable for batch processing. The other changes are either unnecessary or incorrect for reading from a stream source. Reference: Structured Streaming Programming Guide, Read a stream, Databricks Data Sources
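
The difference between the two readers can be sketched as a pair of helper functions. The exam's original code block is not reproduced here, so the function names are hypothetical, and each function assumes it is passed an active SparkSession (as `spark` is in a Databricks notebook).

```python
def read_transactions_batch(spark):
    # Batch read: returns a static DataFrame over the table's current contents.
    return spark.read.table("transactions")


def read_transactions_stream(spark):
    # Streaming read: returns a streaming DataFrame that picks up new data incrementally.
    return spark.readStream.table("transactions")
```

Only the entry point changes between the two; the table name stays the same, which is why option E is sufficient.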

Question 4

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

A. trigger('5 seconds')

B. trigger()

C. trigger(once='5 seconds')

D. trigger(processingTime='5 seconds')

E. trigger(continuous='5 seconds')

Answer : D

The processingTime option specifies a time-based trigger interval for fixed-interval micro-batches. With this setting the query starts a micro-batch every 5 seconds, regardless of how much data is available. It is suitable for near-real-time workloads that need low latency and a consistent processing frequency. Options A and C are not valid calls: the interval must be passed with the processingTime keyword, and once expects a Boolean rather than an interval. Option B specifies no interval at all (leaving the trigger unset runs each micro-batch as soon as the previous one finishes). Option E selects the experimental continuous processing mode, where the interval is a checkpoint interval rather than a micro-batch interval. Reference: Databricks Documentation - Configure Structured Streaming trigger intervals, Databricks Documentation - Trigger.
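
As a sketch of where the trigger call sits in a streaming write, the helper below chains it onto a DataStreamWriter. The exam's code block is not reproduced here, so the function name, the checkpoint path parameter, and the target table parameter are hypothetical, and the function assumes it is passed a streaming DataFrame.

```python
def start_five_second_query(streaming_df, checkpoint_path, target_table):
    return (
        streaming_df.writeStream
        .trigger(processingTime="5 seconds")  # fixed-interval micro-batches
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .toTable(target_table)  # starts the query and returns a StreamingQuery
    )
```

The interval is passed by keyword; a bare positional string, as in option A, is not accepted by the PySpark `trigger` method.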

Question 5

Which types of workloads are compatible with Auto Loader?

A. Streaming workloads

B. Machine learning workloads

C. Serverless workloads

D. Batch workloads

Answer : A

Question 6

Which of the following code blocks will remove the rows where the value in column age is greater than 25 from the existing Delta table my_table and save the updated table?

A. SELECT * FROM my_table WHERE age > 25;

B. UPDATE my_table WHERE age > 25;

C. DELETE FROM my_table WHERE age > 25;

D. UPDATE my_table WHERE age <= 25;

E. DELETE FROM my_table WHERE age <= 25;

Answer : C

The DELETE command in Delta Lake removes the rows that match a predicate from a Delta table. Option C deletes every row where the value in the column age is greater than 25 from the existing Delta table my_table, and the change is saved to the table itself. The other options do not achieve this. Option A only selects the rows that match the predicate and does not delete them. Options B and D are not valid statements because UPDATE requires a SET clause, and UPDATE modifies rows rather than removing them. Option E deletes the rows where age is 25 or less, which is the opposite of what is wanted. Reference: Table deletes, updates, and merges (Delta Lake Documentation)
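
The effect of the statement in option C can be checked with a small runnable sketch. It uses an in-memory SQLite database as a stand-in, since a Delta table needs a Spark session; the `DELETE FROM ... WHERE` statement is the same, but the sample rows and the `name` column are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("Ana", 22), ("Ben", 25), ("Cy", 31), ("Di", 40)],
)

# Remove every row whose age is strictly greater than 25.
conn.execute("DELETE FROM my_table WHERE age > 25")

remaining = conn.execute("SELECT name, age FROM my_table ORDER BY age").fetchall()
```

The row with age exactly 25 survives because the predicate is a strict comparison.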

Question 7

A data engineer is working with two tables. Each of these tables is displayed below in its entirety.

The data engineer runs the following query to join these tables together:

Which of the following will be returned by the above query?

A. Option A

B. Option B

C. Option C

D. Option D

E. Option E

Answer : A

Option A is the correct answer because it shows the result of an INNER JOIN between the two tables. An INNER JOIN returns only the rows that have matching values in both tables based on the join condition. In this case, the join condition is ON a.customer_id = c.customer_id, which means that only the rows with the same customer ID in both tables are included in the output. The output has four columns: customer_id, name, account_id, and overdraft_amt. It has four rows, corresponding to the four customers who have accounts in the account table.
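
Since the exam's tables and query are not reproduced here, the following sketch uses made-up data to show how an INNER JOIN on customer_id drops unmatched rows. It runs against an in-memory SQLite database as a stand-in for Spark SQL; the table names `customer` and `account`, the sample rows, and the column list are assumptions based on the explanation above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
conn.execute(
    "CREATE TABLE account (account_id INTEGER, customer_id INTEGER, overdraft_amt INTEGER)"
)
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bo"), (3, "Cy")])
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)",
    [(101, 1, 50), (102, 3, 0), (103, 9, 20)],  # customer 9 has no customer row
)

rows = conn.execute(
    """
    SELECT c.customer_id, c.name, a.account_id, a.overdraft_amt
    FROM account a
    INNER JOIN customer c ON a.customer_id = c.customer_id
    ORDER BY a.account_id
    """
).fetchall()
```

In this made-up data, customer 2 has no account and account 103 has no matching customer, so neither appears in `rows`.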