Databricks Certified Data Engineer Associate Exam Questions

Page: 1 / 14
Total 109 questions
Question 1

A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.

Which of the following tools can the data engineer use to solve this problem?



Answer : D

Delta Live Tables is a tool that enables data engineers to build and manage reliable data pipelines with minimal code. One of the features of Delta Live Tables isdata quality monitoring, which allows data engineers to define quality expectations for their data and automatically check them at every step of the pipeline. Data quality monitoring can help detect and resolve data quality issues, such as missing values, duplicates, outliers, or schema changes. Data quality monitoring can also generate alerts and reports on the quality level of the data, and enable data engineers to troubleshoot and fix problems quickly.Reference:Delta Live Tables Overview,Data Quality Monitoring


Question 2

A data engineer wants to create a new table containing the names of customers that live in France.

They have written the following command:

A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (PII).

Which of the following lines of code fills in the above blank to successfully complete the task?



Answer : D

In Databricks, when creating a table, you can add a comment to columns or the entire table to provide more information about the data it contains. In this case, since it's organization policy to indicate that the new table includes personally identifiable information (PII), option D is correct. The line of code would be added after defining the table structure and before closing with a semicolon.Reference:Data Engineer Associate Exam Guide,CREATE TABLE USING (Databricks SQL)


Question 3

A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data.

They run the following command:

DROP TABLE IF EXISTS my_table

While the object no longer appears when they run SHOW TABLES, the data files still exist.

Which of the following describes why the data files still exist and the metadata files were deleted?



Answer : C

An external table is a table that is defined in the metastore and points to an existing location in the storage system. When you drop an external table, only the metadata is deleted from the metastore, but the data files are not deleted from the storage system. This is because external tables are meant to be shared by multiple applications and users, and dropping them should not affect the data availability. On the other hand, a managed table is a table that is defined in the metastore and also managed by the metastore. When you drop a managed table, both the metadata and the data files are deleted from the metastore and the storage system, respectively. This is because managed tables are meant to be exclusive to the application or user that created them, and dropping them should free up the storage space. Therefore, the correct answer is C, because the table was external and only the metadata was deleted when the table was dropped.Reference:Databricks Documentation - Managed and External Tables,Databricks Documentation - Drop Table


Question 4

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

What is the expected behavior when a batch of data containing data that violates these constraints is processed?



Answer : C

Delta Live Tables expectations are optional clauses that apply data quality checks on each record passing through a query. An expectation consists of a description, a boolean statement, and an action to take when a record fails the expectation. The ON VIOLATION clause specifies the action to take, which can be one of the following: warn, drop, or fail. The drop action means that invalid records are dropped from the target dataset before data is written to the target. The failure is reported as a metric for the dataset, which can be viewed by querying the Delta Live Tables event log. The event log contains information such as the number of records that violate an expectation, the number of records dropped, and the number of records written to the target dataset.Reference:

Manage data quality with Delta Live Tables

Monitor Delta Live Tables pipelines

Delta Live Tables SQL language reference


Question 5

Which of the following tools is used by Auto Loader process data incrementally?



Answer : B

Auto Loader provides a Structured Streaming source calledcloudFilesthat can process new data files as they arrive in cloud storage without any additional setup. Auto Loader uses a scalable key-value store to track ingestion progress and ensure exactly-once semantics. Auto Loader can ingest various file formats and load them into Delta Lake tables. Auto Loader is recommended for incremental data ingestion with Delta Live Tables, which extends the functionality of Structured Streaming and allows you to write declarative Python or SQL code to deploy a production-quality data pipeline.Reference:What is Auto Loader?,What is Auto Loader? | Databricks on AWS,Solved: How does Auto Loader ingest data? - Databricks - 5629


Question 6

A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).

Which of the following code blocks creates this SQL UDF?

A.

B.

C.

D.

E.



Answer : A

https://www.databricks.com/blog/2021/10/20/introducing-sql-user-defined-functions.html


Question 7

Which query is performing a streaming hop from raw data to a Bronze table?

A)

B)

C)

D)



Answer : D

The query performing a streaming hop from raw data to a Bronze table is identified by using the Spark streaming read capability and then writing to a Bronze table. Let's analyze the options:

Option A: Utilizes .writeStream but performs a complete aggregation which is more characteristic of a roll-up into a summarized table rather than a hop into a Bronze table.

Option B: Also uses .writeStream but calculates an average, which again does not typically represent the raw to Bronze transformation, which usually involves minimal transformations.

Option C: This uses a basic .write with .mode('append') which is not a streaming operation, and hence not suitable for real-time streaming data transformation to a Bronze table.

Option D: It employs spark.readStream.load() to ingest raw data as a stream and then writes it out with .writeStream, which is a typical pattern for streaming data into a Bronze table where raw data is captured in real-time and minimal transformation is applied. This approach aligns with the concept of a Bronze table in a modern data architecture, where raw data is ingested continuously and stored in a more accessible format.

Reference: Databricks documentation on Structured Streaming: Structured Streaming in Databricks


Page:    1 / 14   
Total 109 questions