Google Cloud Certified Professional Data Engineer Exam Practice Test

Page: 1 / 14
Total 373 questions
Question 1

One of your encryption keys stored in Cloud Key Management Service (Cloud KMS) was exposed. You need to re-encrypt all of your CMEK-protected Cloud Storage data that used that key, and then delete the compromised key. You also want to reduce the risk of objects getting written without customer-managed encryption key (CMEK) protection in the future. What should you do?



Answer : C

To re-encrypt all of your CMEK-protected Cloud Storage data after a key has been exposed, and to ensure future writes are protected with a new key, creating a new Cloud KMS key and a new Cloud Storage bucket is the best approach. Here's why option C is the best choice:

Re-encryption of Data:

By creating a new Cloud Storage bucket and copying all objects from the old bucket to the new bucket while specifying the new Cloud KMS key, you ensure that all data is re-encrypted with the new key.

This process effectively re-encrypts the data, removing any dependency on the compromised key.

Ensuring CMEK Protection:

Creating a new bucket and setting the new CMEK as the default ensures that all future objects written to the bucket are automatically protected with the new key.

This reduces the risk of objects being written without CMEK protection.

Deletion of Compromised Key:

Once the data has been copied and re-encrypted, the old key can be safely deleted from Cloud KMS, eliminating the risk associated with the compromised key.

Steps to Implement:

Create a New Cloud KMS Key:

Create a new encryption key in Cloud KMS to replace the compromised key.
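For example, a replacement key could be created in an existing key ring (the key, key ring, and location names below are placeholders; adjust them to your environment):

# Placeholder names; the key ring must already exist in the chosen location
gcloud kms keys create [NEW_KEY_NAME] \
    --keyring [KEY_RING] \
    --location [LOCATION] \
    --purpose encryption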

Create a New Cloud Storage Bucket:

Create a new Cloud Storage bucket and set the default CMEK to the new key.
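For example (bucket, project, and key names are placeholders; the project's Cloud Storage service agent must be able to use the new key):

# Create the new bucket and make the new Cloud KMS key its default CMEK
gsutil mb -l [LOCATION] gs://new-bucket/
gsutil kms encryption -k projects/[PROJECT_ID]/locations/[LOCATION]/keyRings/[KEY_RING]/cryptoKeys/[NEW_KEY_NAME] gs://new-bucket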

Copy and Re-encrypt Data:

Use the gsutil tool to copy data from the old bucket to the new bucket, specifying the new CMEK for the copied objects (because the new bucket's default CMEK is also the new key, a plain copy would have the same effect):

gsutil -o "GSUtil:encryption_key=projects/[PROJECT_ID]/locations/[LOCATION]/keyRings/[KEY_RING]/cryptoKeys/[NEW_KEY_NAME]" -m cp -r "gs://old-bucket/*" gs://new-bucket/

Delete the Old Key:

After ensuring all data is copied and re-encrypted, delete the compromised key from Cloud KMS.
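In Cloud KMS, a key is retired by destroying its key versions, and destruction is scheduled rather than immediate. A sketch, using placeholder names:

# Destroy key version 1 of the compromised key (repeat for each active version)
gcloud kms keys versions destroy 1 \
    --key [COMPROMISED_KEY_NAME] \
    --keyring [KEY_RING] \
    --location [LOCATION]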


Cloud KMS Documentation

Cloud Storage Encryption

Re-encrypting Data in Cloud Storage

Question 2

You currently use a SQL-based tool to visualize your data stored in BigQuery. The data visualizations require the use of outer joins and analytic functions. Visualizations must be based on data that is no less than 4 hours old. Business users are complaining that the visualizations are too slow to generate. You want to improve the performance of the visualization queries while minimizing the maintenance overhead of the data preparation pipeline. What should you do?



Answer : C

To improve the performance of visualization queries while minimizing maintenance overhead, using materialized views is the most effective solution. Here's why option C is the best choice:

Materialized Views:

Materialized views store the results of a query physically, allowing faster access than regular views, which re-execute the underlying query each time they are read.

They can be automatically refreshed to reflect changes in the underlying data.

Incremental Updates:

The incremental updates capability of BigQuery materialized views ensures that only the changed data is processed during refresh operations, significantly improving performance and reducing computation costs.

This feature keeps the data in the materialized view up to date with minimal processing time, which comfortably satisfies the 4-hour data freshness requirement.

Performance and Maintenance:

By using materialized views, you can pre-compute and store the results of complex queries involving outer joins and analytic functions, resulting in faster query performance for data visualizations.

This approach also reduces the maintenance overhead, as BigQuery handles the incremental updates and refreshes automatically.

Steps to Implement:

Create Materialized Views:

Define materialized views for the visualization queries with the necessary configurations:

CREATE MATERIALIZED VIEW project.dataset.view_name

AS

SELECT ...

FROM ...

WHERE ...

Enable Incremental Updates:

Ensure that the materialized views are set up to handle incremental updates automatically.
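For example, automatic refresh and a 4-hour (240-minute) refresh interval can be configured when the view is created; the project, dataset, and view names are placeholders, and the SELECT is the existing visualization query:

-- Placeholder names; adjust the query and refresh interval to your needs
CREATE MATERIALIZED VIEW project.dataset.view_name
OPTIONS (enable_refresh = true, refresh_interval_minutes = 240)
AS
SELECT ...
FROM ...
WHERE ...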


Update the Visualization Tool:

Update the data visualization tool to reference the materialized views instead of running the original queries directly.

BigQuery Materialized Views

Optimizing Query Performance

Question 3

You are deploying a batch pipeline in Dataflow. This pipeline reads data from Cloud Storage, transforms the data, and then writes the data into BigQuery. The security team has enabled an organizational constraint in Google Cloud, requiring all Compute Engine instances to use only internal IP addresses and no external IP addresses. What should you do?



Answer : D

To deploy a batch pipeline in Dataflow that adheres to the organizational constraint of using only internal IP addresses, ensuring Private Google Access is the most effective solution. Here's why option D is the best choice:

Private Google Access:

Private Google Access allows resources in a VPC network that do not have external IP addresses to access Google APIs and services through internal IP addresses.

This ensures compliance with the organizational constraint of using only internal IPs while allowing Dataflow to access Cloud Storage and BigQuery.

Dataflow with Internal IPs:

Dataflow can be configured to use only internal IP addresses for its worker nodes, ensuring that no external IP addresses are assigned.

This configuration ensures secure and compliant communication between Dataflow, Cloud Storage, and BigQuery.

Firewall and Network Configuration:

Enabling Private Google Access requires ensuring the correct firewall rules and network configurations to allow internal traffic to Google Cloud services.

Steps to Implement:

Enable Private Google Access:

Enable Private Google Access on the subnetwork used by the Dataflow pipeline:

gcloud compute networks subnets update [SUBNET_NAME] \

--region [REGION] \

--enable-private-ip-google-access

Configure Dataflow:

Configure the Dataflow job to use only internal IP addresses:

gcloud dataflow jobs run [JOB_NAME] \

--gcs-location gs://[TEMPLATE_PATH] \

--region [REGION] \

--network [VPC_NETWORK] \

--subnetwork [SUBNETWORK] \

--disable-public-ips

For pipelines launched directly with the Apache Beam SDK rather than from a template, set the equivalent pipeline option (for example, --no_use_public_ips in Python).

Verify Access:

Ensure that firewall rules allow the Dataflow worker VMs to communicate with each other within the VPC network, and confirm that traffic to Cloud Storage and BigQuery reaches Google APIs over internal IPs through Private Google Access; a sample firewall rule is shown below.
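A minimal sketch of the worker-to-worker firewall rule, assuming the documented dataflow network tag on worker VMs and the standard shuffle ports; the rule name and network are placeholders:

# Allows Dataflow worker VMs (network tag "dataflow") to exchange data on ports 12345-12346
gcloud compute firewall-rules create allow-dataflow-internal \
    --network [VPC_NETWORK] \
    --direction INGRESS \
    --action ALLOW \
    --rules tcp:12345-12346 \
    --source-tags dataflow \
    --target-tags dataflow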


Private Google Access Documentation

Configuring Dataflow to Use Internal IPs

VPC Firewall Rules

Question 4

Different teams in your organization store customer and performance data in BigQuery. Each team needs to keep full control of their collected data, be able to query data within their projects, and be able to exchange their data with other teams. You need to implement an organization-wide solution, while minimizing operational tasks and costs. What should you do?



Answer : C

To enable different teams to manage their own data while allowing data exchange across the organization, using Analytics Hub is the best approach. Here's why option C is the best choice:

Analytics Hub:

Analytics Hub allows teams to publish their data as data exchanges, making it easy for other teams to discover and subscribe to the data they need.

This approach maintains each team's control over their data while facilitating easy and secure data sharing across the organization.

Data Publishing and Subscribing:

Teams can publish datasets they control, allowing them to manage access and updates independently.

Other teams can subscribe to these published datasets, ensuring they have access to the latest data without duplicating efforts.

Minimized Operational Tasks and Costs:

This method reduces the need for complex replication or data synchronization processes, minimizing operational overhead.

By centralizing data sharing through Analytics Hub, it also reduces storage costs associated with duplicating large datasets.

Steps to Implement:

Set Up Analytics Hub:

Enable Analytics Hub in your Google Cloud project.
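For example, the underlying API can be enabled in each participating project (project selection is assumed to be configured already):

# Enables the Analytics Hub API
gcloud services enable analyticshub.googleapis.com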

Provide training to teams on how to publish and subscribe to data exchanges.

Publish Data:

Each team publishes their datasets in Analytics Hub, configuring access controls and metadata as needed.

Subscribe to Data:

Teams that need access to data from other teams can subscribe to the relevant data exchanges, ensuring they always have up-to-date data.


Analytics Hub Documentation

Publishing Data in Analytics Hub

Subscribing to Data in Analytics Hub

Question 5

You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?



Answer : B

To ensure data recovery with minimal data loss and low latency in case of a single region failure, the best approach is to use a dual-region bucket with turbo replication. Here's why option B is the best choice:

Dual-Region Bucket:

A dual-region bucket provides geo-redundancy by replicating data across two regions, ensuring high availability and resilience against regional failures.

The chosen regions (us-central1 and us-south1) provide geographic diversity within the United States.

Turbo Replication:

Turbo replication ensures that data is replicated between the two regions within 15 minutes, meeting the Recovery Point Objective (RPO) of 15 minutes.

This minimizes data loss in case of a regional failure.

Running Dataproc Cluster:

Running the Dataproc cluster in the same region as the primary data storage (us-central1) ensures minimal latency for normal operations.

In case of a regional failure, redeploying the Dataproc cluster to the secondary region (us-south1) ensures continuity with minimal data loss.

Steps to Implement:

Create a Dual-Region Bucket:

Set up a dual-region bucket in the Google Cloud Console, selecting us-central1 and us-south1 regions.

Enable turbo replication to ensure rapid data replication between the regions.
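A sketch of the bucket configuration, assuming the gcloud storage CLI; the bucket name is a placeholder, the placement flag pairs the two regions, and the RPO setting enables turbo replication:

# Create a configurable dual-region bucket spanning us-central1 and us-south1 with turbo replication
# --location is the enclosing multi-region for the chosen region pair
gcloud storage buckets create gs://[BUCKET_NAME] \
    --location=US \
    --placement=us-central1,us-south1 \
    --rpo=ASYNC_TURBO

# Turbo replication can also be enabled later on an existing dual-region bucket
gsutil rpo set ASYNC_TURBO gs://[BUCKET_NAME]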

Deploy Dataproc Cluster:

Deploy the Dataproc cluster in the us-central1 region to read data from the bucket located in the same region for optimal performance.
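A minimal example of creating the cluster in the primary region (the cluster name is a placeholder, and a real cluster would also specify machine types and sizing):

# Creates a Dataproc cluster co-located with the primary bucket region
gcloud dataproc clusters create [CLUSTER_NAME] \
    --region us-central1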

Set Up Failover Plan:

Plan for redeployment of the Dataproc cluster to the us-south1 region in case of a failure in the us-central1 region.

Ensure that the failover process is well-documented and tested to minimize downtime and data loss.


Google Cloud Storage Dual-Region

Turbo Replication in Google Cloud Storage

Dataproc Documentation

Question 6

You have a BigQuery dataset named "customers". All tables will be tagged by using a Data Catalog tag template named "gdpr". The template contains one mandatory field, "has sensitive data", with a boolean value. All employees must be able to do a simple search and find tables in the dataset that have either true or false in the "has sensitive data" field. However, only the Human Resources (HR) group should be able to see the data inside the tables for which "has sensitive data" is true. You give the all employees group the bigquery.metadataViewer and bigquery.connectionUser roles on the dataset. You want to minimize configuration overhead. What should you do next?



Answer : D

To ensure that all employees can search and find tables with GDPR tags while restricting data access to sensitive tables only to the HR group, follow these steps:

Data Catalog Tag Template:

Use Data Catalog to create a tag template named 'gdpr' with a boolean field 'has sensitive data'. Set the visibility to public so all employees can see the tags.

Roles and Permissions:

Assign the datacatalog.tagTemplateViewer role to the all employees group. This role allows users to view the tags and search for tables based on the 'has sensitive data' field.

Assign the bigquery.dataViewer role to the HR group specifically on tables that contain sensitive data. This ensures only HR can access the actual data in these tables.

Steps to Implement:

Create the GDPR Tag Template:

Define the tag template in Data Catalog with the necessary fields and set visibility to public.
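A sketch of the template definition using the gcloud Data Catalog commands; the location is a placeholder, and the field ID uses underscores because Data Catalog field IDs cannot contain spaces:

# Creates the "gdpr" tag template with a mandatory boolean field
gcloud data-catalog tag-templates create gdpr \
    --location=[LOCATION] \
    --display-name="gdpr" \
    --field='id=has_sensitive_data,display-name=Has sensitive data,type=bool,required=TRUE'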

Assign Roles:

Grant the datacatalog.tagTemplateViewer role to the all employees group for visibility into the tags.

Grant the bigquery.dataViewer role to the HR group on tables marked as having sensitive data.
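Access for HR can be granted per table with BigQuery's GRANT statement; the project, dataset, table, and group names here are placeholders:

-- Grants read access on a sensitive table to the HR group only
GRANT `roles/bigquery.dataViewer`
ON TABLE `project.dataset.sensitive_table`
TO "group:hr-group@example.com";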


Data Catalog Documentation

Managing Access Control in BigQuery

IAM Roles in Data Catalog

Question 7

A web server sends click events to a Pub/Sub topic as messages. The web server includes an eventTimestamp attribute in the messages, which is the time when the click occurred. You have a Dataflow streaming job that reads from this Pub/Sub topic through a subscription, applies some transformations, and writes the result to another Pub/Sub topic for use by the advertising department. The advertising department needs to receive each message within 30 seconds of the corresponding click occurrence, but they report receiving the messages late. Your Dataflow job's system lag is about 5 seconds, and the data freshness is about 40 seconds. Inspecting a few messages shows no more than 1 second of lag between their eventTimestamp and publishTime. What is the problem and what should you do?



Answer : B

To ensure that the advertising department receives messages within 30 seconds of the click occurrence, and given the current system lag and data freshness metrics, the issue likely lies in the processing capacity of the Dataflow job. Here's why option B is the best choice:

System Lag and Data Freshness:

The system lag of 5 seconds indicates that Dataflow itself is processing messages relatively quickly.

However, the data freshness of 40 seconds suggests a significant delay before processing begins, indicating a backlog.

Backlog in Pub/Sub Subscription:

A backlog occurs when the rate of incoming messages exceeds the rate at which the Dataflow job can process them, causing delays.

Optimizing the Dataflow Job:

To handle the incoming message rate, the Dataflow job needs to be optimized or scaled up by increasing the number of workers, ensuring it can keep up with the message inflow.

Steps to Implement:

Analyze the Dataflow Job:

Inspect the Dataflow job metrics to identify bottlenecks and inefficiencies.

Optimize Processing Logic:

Optimize the transformations and operations within the Dataflow pipeline to improve processing efficiency.

Increase Number of Workers:

Scale the Dataflow job by increasing the number of workers to handle the higher load, reducing the backlog.
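One way to apply this, assuming the pipeline is written in Python and is relaunched as an in-place update of the running streaming job (pipeline.py stands in for your pipeline's entry point, and the job name, bucket, and worker ceiling are illustrative):

# Relaunches the pipeline with throughput-based autoscaling and a higher worker ceiling
python pipeline.py \
    --runner DataflowRunner \
    --project [PROJECT_ID] \
    --region [REGION] \
    --temp_location gs://[BUCKET]/tmp \
    --streaming \
    --update \
    --job_name [EXISTING_JOB_NAME] \
    --autoscaling_algorithm THROUGHPUT_BASED \
    --max_num_workers 20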


Dataflow Monitoring

Scaling Dataflow Jobs
