Databricks Certified Data Engineer Professional Exam Practice Test

Page: 1 / 14
Total 120 questions
Question 1

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median task times as roughly the same, while the max duration is roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?



Answer : D

This is the correct answer because skew is a common cause of increased overall job duration. Skew occurs when some partitions hold far more data than others, so work is distributed unevenly across tasks and executors. It can be caused by an uneven distribution of key values, an improper partitioning strategy, or joins on hot keys, and it can lead to long-running straggler tasks, wasted resources, or even task failures due to memory or disk spills. Verified Reference: [Databricks Certified Data Engineer Professional], under "Performance Tuning" section; Databricks Documentation, under "Skew" section.
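A minimal PySpark sketch of two common skew mitigations; the table and column names (orders, customers, customer_id) are hypothetical, not taken from the question:

# Illustrative only: two common ways to mitigate join skew in Spark 3.x.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Option 1: let Adaptive Query Execution split skewed join partitions automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Option 2: salt a hot join key so a single key's rows spread over N partitions.
N = 16
orders = spark.table("orders").withColumn("salt", (F.rand() * N).cast("int"))
customers = (
    spark.table("customers")
    .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))
)
joined = orders.join(customers, ["customer_id", "salt"])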


Question 2

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?



Answer : D

To create an external (unmanaged) Delta Lake table, include a LOCATION clause in the CREATE TABLE statement pointing to the path where the data files are stored; the EXTERNAL keyword makes this explicit. Because the table is unmanaged, the catalog does not own the underlying data, and the data files are not deleted when the table is dropped. For example:

CREATE EXTERNAL TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING)
USING DELTA
LOCATION '/mnt/delta/events';

This creates an external Delta Lake table named events that references the data files in the '/mnt/delta/events' path. If you drop this table, the data files will remain intact and you can recreate the table with the same statement.
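As a hedged illustration of the drop-and-recreate behavior described above (same table name and path as the example; spark is the SparkSession predefined in a Databricks notebook):

# Confirm the table is unmanaged: DESCRIBE reports Type = EXTERNAL.
spark.sql("DESCRIBE TABLE EXTENDED events").filter("col_name = 'Type'").show()

# Dropping the table removes only the metastore entry; the Delta files stay in place.
spark.sql("DROP TABLE events")

# Re-register the table over the existing files; the schema is read from the Delta log.
spark.sql("CREATE TABLE events USING DELTA LOCATION '/mnt/delta/events'")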


https://docs.databricks.com/delta/delta-batch.html#create-a-table

https://docs.databricks.com/delta/delta-batch.html#drop-a-table

Question 3

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?



Answer : E

This is the correct answer because it indicates a bottleneck caused by code executing on the driver. A bottleneck is a situation in which the performance or capacity of a system is limited by a single component or resource, causing slow execution, high latency, or low throughput. Because the cluster has 3 executor nodes and the driver uses the same virtual machine type, the driver accounts for one of four identical nodes. When evaluating the Ganglia Metrics for this cluster, one can look at how the cluster resources are being utilized, such as CPU, memory, disk, or network. If overall cluster CPU utilization is around 25%, only one of the four nodes (driver + 3 executors) is using its full CPU capacity while the other three are idle or underutilized. This suggests that code executing on the driver is taking too long or consuming too much CPU, preventing the executors from receiving tasks or data to process. This can happen when the code performs driver-side operations that are not parallelized or distributed, such as collecting large amounts of data to the driver, performing complex calculations on the driver, or using non-Spark libraries on the driver. Verified Reference: [Databricks Certified Data Engineer Professional], under "Spark Core" section; Databricks Documentation, under "View cluster status and event logs - Ganglia metrics" section; Databricks Documentation, under "Avoid collecting large RDDs" section.

In a Spark cluster, the driver node is responsible for managing the execution of the Spark application, including scheduling tasks, managing the execution plan, and interacting with the cluster manager. If the overall cluster CPU utilization is low (e.g., around 25%), it may indicate that the driver node is not utilizing the available resources effectively and might be a bottleneck.
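A small sketch of the kind of driver-side pattern described above, contrasted with a distributed alternative (the table and column names are illustrative; spark is the active SparkSession):

from pyspark.sql import functions as F

df = spark.table("events")

# Anti-pattern: collect() pulls every row to the driver, and Python then sums them
# single-threaded while the executors sit idle (hence low overall cluster CPU).
total = sum(row["amount"] for row in df.collect())

# Distributed alternative: the executors aggregate and only the final value returns.
total = df.agg(F.sum("amount")).first()[0]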


Question 4

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:

key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. It also wishes to retain records containing PII in this table for only 14 days after initial ingestion, while retaining non-PII records indefinitely.

Which of the following solutions meets the requirements?



Question 5

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.

Which statement describes a main benefit that offsets this additional effort?



Answer : C
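As a hedged sketch of the design pattern the question alludes to (the function and test below are hypothetical): keeping transformation logic in small, DataFrame-in / DataFrame-out functions lets it be exercised with a local SparkSession, independently of any cluster or job.

from pyspark.sql import DataFrame, SparkSession, functions as F

def add_total_price(df: DataFrame) -> DataFrame:
    # Pure transformation with no I/O, so it can be unit tested in isolation.
    return df.withColumn("total_price", F.col("quantity") * F.col("unit_price"))

def test_add_total_price():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    source = spark.createDataFrame([(2, 5.0)], ["quantity", "unit_price"])
    assert add_total_price(source).first()["total_price"] == 10.0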


Question 6

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?



Answer : A

Regular expressions (regex) are a powerful way of matching patterns in text. They can be used to identify key areas of text when parsing Spark Driver log4j output, such as the log level, the timestamp, the thread name, the class name, the method name, and the message. Regex can be applied in various languages and frameworks, such as Scala, Python, Java, Spark SQL, and Databricks notebooks; a small illustrative sketch follows the references below. Reference:

https://docs.databricks.com/notebooks/notebooks-use.html#use-regular-expressions

https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html#using-regular-expressions-in-udfs

https://docs.databricks.com/spark/latest/sparkr/functions/regexp_extract.html

https://docs.databricks.com/spark/latest/sparkr/functions/regexp_replace.html
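A minimal Python sketch of the idea; the sample line and pattern are illustrative rather than the exact Databricks log4j format:

import re

line = "21/07/01 12:00:01 WARN TaskSetManager: Lost task 3.0 in stage 7.0"
pattern = r"^(\S+ \S+) (\w+) ([\w.]+): (.*)$"

match = re.match(pattern, line)
if match:
    timestamp, level, logger, message = match.groups()
    print(level, logger, message)  # WARN TaskSetManager Lost task 3.0 in stage 7.0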


Question 7

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?



Answer : E

The code uses the DELETE FROM command to delete records from the users table that match a condition based on a join with another table, delete_requests, which contains all users that have requested deletion. DELETE FROM removes records from a Delta Lake table by creating a new version of the table that does not contain the deleted records. This alone does not guarantee that the deleted records are no longer accessible, because Delta Lake supports time travel, which allows querying previous versions of the table using a timestamp or version number. Therefore, files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove the invalidated data files from physical storage. Verified Reference: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Delete from a table" section; Databricks Documentation, under "Remove files no longer referenced by a Delta table" section.
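A hedged sketch of the sequence the explanation describes, run via spark.sql (the exact delete logic from the question is not reproduced; the predicate below is illustrative):

# Step 1: DELETE writes a new table version without the matching records.
spark.sql("""
    DELETE FROM users
    WHERE user_id IN (SELECT user_id FROM delete_requests)
""")

# Step 2: VACUUM physically removes data files no longer referenced by the
# current table version. By default, files newer than 7 days are retained,
# so the deleted records remain reachable via time travel until then.
spark.sql("VACUUM users")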

