You are a data analyst working with sensitive customer data in BigQuery. You need to ensure that only authorized personnel within your organization can query this data, while following the principle of least privilege. What should you do?
Answer : A
Comprehensive and Detailed In-Depth
BigQuery uses IAM for access control, adhering to least privilege by granting only necessary permissions.
Option A: IAM roles (e.g., roles/bigquery.dataViewer for read-only) restrict query access to authorized users, aligning with Google's security best practices.
Option B: Granting dataset privileges with ad hoc SQL GRANT statements is not the Google-recommended pattern; BigQuery access should be managed centrally through IAM roles or authorized views.
Option C: Exporting to Cloud Storage with signed URLs bypasses BigQuery's native controls and adds complexity.
Option D: CMEK encrypts data at rest but doesn't control query access.
Extract from Google Documentation: From 'Controlling Access to BigQuery' (https://cloud.google.com/bigquery/docs/access-control): 'Use IAM to manage access to BigQuery resources. Predefined roles like roles/bigquery.dataViewer allow you to grant query access while following the principle of least privilege.'
Reference: Google Cloud Documentation - 'BigQuery IAM' (https://cloud.google.com/bigquery/docs/access-control).
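For illustration, here is a minimal sketch that grants a hypothetical analyst group read-only access to a dataset with the BigQuery Python client; the project, dataset, and group names are placeholders, and the same binding can be made in the console, with gcloud, or with Terraform.

```python
# A minimal sketch, assuming a dataset my-project.customer_data and an analyst
# group authorized-analysts@example.com (all hypothetical).
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.customer_data")

# Append a read-only access entry for the group (least privilege: query only).
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",  # dataset-level equivalent of roles/bigquery.dataViewer
        entity_type="groupByEmail",
        entity_id="authorized-analysts@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```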
You work for a gaming company that collects real-time player activity data. This data is streamed into Pub/Sub and needs to be processed and loaded into BigQuery for analysis. The processing involves filtering, enriching, and aggregating the data before loading it into partitioned BigQuery tables. You need to design a pipeline that ensures low latency and high throughput while following a Google-recommended approach. What should you do?
Answer : C
Comprehensive and Detailed In-Depth
Why C is correct: Dataflow is the recommended service for real-time stream processing on Google Cloud.
It provides scalable and reliable processing with low latency and high throughput.
Dataflow's streaming API is optimized for Pub/Sub integration and BigQuery streaming inserts.
Why other options are incorrect: A: Cloud Composer is for batch orchestration, not real-time streaming.
B: Dataproc and Spark streaming are more complex and not as efficient as Dataflow for this task.
D: Cloud Run functions are for stateless, event-driven applications, not continuous stream processing.
Dataflow Streaming: https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines
Pub/Sub to BigQuery with Dataflow: https://cloud.google.com/dataflow/docs/tutorials/pubsub-to-bigquery
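As an illustration of this pattern, below is a minimal Apache Beam (Python SDK) sketch of a Pub/Sub-to-BigQuery streaming pipeline with filtering, enrichment, and windowed aggregation. The subscription, table, and field names are hypothetical, and the destination is assumed to be a pre-created partitioned table; run it on Dataflow by supplying --runner=DataflowRunner and the usual project/region options.

```python
# A sketch, not a production pipeline: read player events from Pub/Sub,
# filter, enrich, aggregate per 1-minute window, and stream into BigQuery.
# Subscription, table, and field names are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # plus runner/project/region flags in practice

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/player-activity"
        )
        | "Parse" >> beam.Map(json.loads)
        | "FilterValid" >> beam.Filter(lambda e: e.get("player_id") is not None)
        | "Enrich" >> beam.Map(lambda e: {**e, "platform": e.get("platform", "unknown")})
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyByPlayer" >> beam.Map(lambda e: (e["player_id"], e.get("score", 0)))
        | "SumScores" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"player_id": kv[0], "score_sum": kv[1]})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:game_analytics.player_activity_agg",  # existing partitioned table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```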
You are designing a pipeline to process data files that arrive in Cloud Storage by 3:00 am each day. Data processing is performed in stages, where the output of one stage becomes the input of the next. Each stage takes a long time to run. Occasionally a stage fails, and you have to address the problem. You need to ensure that the final output is generated as quickly as possible. What should you do?
Answer : D
Using Cloud Composer to design the processing pipeline as a Directed Acyclic Graph (DAG) is the most suitable approach because:
Fault tolerance: Cloud Composer (based on Apache Airflow) allows for handling failures at specific stages. You can clear the state of a failed task and rerun it without reprocessing the entire pipeline.
Stage-based processing: DAGs are ideal for workflows with interdependent stages where the output of one stage serves as input to the next.
Efficiency: This approach minimizes downtime and ensures that only failed stages are rerun, leading to faster final output generation.
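A minimal Airflow DAG sketch of such a staged pipeline is shown below (Cloud Composer runs Airflow). The DAG ID, task names, schedule, and bash commands are hypothetical placeholders; in a real pipeline each stage would typically be a Dataflow, Dataproc, or BigQuery operator.

```python
# A sketch of a staged daily DAG; task names and commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_staged_processing",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",    # files arrive in Cloud Storage by 3:00 am
    catchup=False,
    default_args={"retries": 2},      # automatic retries before manual intervention
) as dag:
    stage_1 = BashOperator(task_id="stage_1_ingest", bash_command="echo ingest")
    stage_2 = BashOperator(task_id="stage_2_transform", bash_command="echo transform")
    stage_3 = BashOperator(task_id="stage_3_load", bash_command="echo load")

    # Each stage's output feeds the next; if a stage fails, clear that task and
    # rerun it without re-executing the upstream stages.
    stage_1 >> stage_2 >> stage_3
```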
You need to transfer approximately 300 TB of data from your company's on-premises data center to Cloud Storage. You have 100 Mbps internet bandwidth, and the transfer needs to be completed as quickly as possible. What should you do?
Answer : D
Comprehensive and Detailed In-Depth
Transferring 300 TB over a 100 Mbps connection would take an impractical amount of time (roughly 280 days even at the theoretical maximum throughput, and considerably longer once protocol overhead and contention are factored in). Google Cloud provides the Transfer Appliance for large-scale, time-sensitive transfers.
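As a quick back-of-the-envelope check of that estimate (assuming the full 100 Mbps line rate, no overhead, and decimal terabytes):

```python
# Rough transfer-time estimate for 300 TB over a 100 Mbps link.
data_bits = 300e12 * 8      # 300 TB expressed in bits (decimal terabytes)
line_rate_bps = 100e6       # 100 Mbps in bits per second
seconds = data_bits / line_rate_bps
print(seconds / 86400)      # ~278 days, before any real-world overhead
```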
Option A: Cloud Client Libraries over the internet would be slow and unreliable for 300 TB due to bandwidth limitations.
Option B: The gcloud storage command is similarly constrained by internet speed and not designed for such large transfers.
Option C: Compressing and splitting across multiple providers adds complexity and isn't a Google-supported method for Cloud Storage ingestion.
Option D: The Transfer Appliance is a physical device shipped to your location, capable of handling terabytes of data quickly (e.g., 300 TB in days via high-speed local copy), then shipped back to Google for upload to Cloud Storage.
Extract from Google Documentation: From 'Transferring Data to Google Cloud Storage' (https://cloud.google.com/storage/docs/transferring-data): 'For transferring large amounts of data (hundreds of terabytes or more), consider using the Transfer Appliance, a high-capacity storage server that you load with data and ship to a Google data center. This is ideal when transferring over the internet would take too long.'
Reference: Google Cloud Documentation - 'Transfer Appliance Overview' (https://cloud.google.com/transfer-appliance).
You are working with a large dataset of customer reviews stored in Cloud Storage. The dataset contains several inconsistencies, such as missing values, incorrect data types, and duplicate entries. You need to clean the data to ensure that it is accurate and consistent before using it for analysis. What should you do?
Answer : B
Using BigQuery to batch load the data and perform cleaning and analysis with SQL is the best approach for this scenario. BigQuery provides powerful SQL capabilities to handle missing values, enforce correct data types, and remove duplicates efficiently. This method simplifies the pipeline by leveraging BigQuery's built-in processing power for both cleaning and analysis, reducing the need for additional tools or services and minimizing complexity.
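A minimal sketch of that approach with the BigQuery Python client follows; the bucket path, dataset, table, and column names (review_id, customer_id, rating, review_text, review_timestamp) are hypothetical.

```python
# Batch load raw review files from Cloud Storage, then clean them with SQL.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Batch load the raw CSV files into a staging table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/reviews/*.csv",
    "my-project.reviews.raw_reviews",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=1,
    ),
)
load_job.result()

# 2. Clean with SQL: fill missing values, enforce types, drop duplicate reviews.
clean_sql = """
CREATE OR REPLACE TABLE `my-project.reviews.clean_reviews` AS
SELECT
  review_id,
  IFNULL(customer_id, 'unknown') AS customer_id,
  SAFE_CAST(rating AS INT64) AS rating,
  review_text
FROM `my-project.reviews.raw_reviews`
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY review_id ORDER BY review_timestamp DESC
) = 1
"""
client.query(clean_sql).result()
```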
Your retail company wants to predict customer churn using historical purchase data stored in BigQuery. The dataset includes customer demographics, purchase history, and a label indicating whether the customer churned or not. You want to build a machine learning model to identify customers at risk of churning. You need to create and train a logistic regression model for predicting customer churn, using the customer_data table with the churned column as the target label. Which BigQuery ML query should you use?
A)
B)
C)
D)
Answer : B
In BigQuery ML, when creating a logistic regression model to predict customer churn, the correct query should:
Exclude the target label column (in this case, churned) from the feature columns, since it is the value being predicted rather than a feature input.
Rename the target label column to label, because BigQuery ML uses a column named label as the target by default (unless input_label_cols is set in OPTIONS).
The chosen query satisfies these requirements:
SELECT * EXCEPT(churned), churned AS label: Excludes churned from features and renames it to label.
The OPTIONS(model_type='logistic_reg') specifies that a logistic regression model is being trained.
This setup ensures the model is correctly trained using the features in the dataset while targeting the churned column for predictions.
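The full statement described above would take roughly the following form (project, dataset, and model names are hypothetical), here submitted through the BigQuery Python client:

```python
# Train a BigQuery ML logistic regression model for churn prediction.
from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
CREATE OR REPLACE MODEL `my-project.customer_ds.churn_model`
OPTIONS(model_type = 'logistic_reg') AS
SELECT
  * EXCEPT(churned),
  churned AS label
FROM `my-project.customer_ds.customer_data`
"""
client.query(create_model_sql).result()  # waits for training to finish
```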
You have a Cloud SQL for PostgreSQL database that stores sensitive historical financial data. You need to ensure that the data is uncorrupted and recoverable in the event that the primary region is destroyed. The data is valuable, so you need to prioritize recovery point objective (RPO) over recovery time objective (RTO). You want to recommend a solution that minimizes latency for primary read and write operations. What should you do?
Answer : A
Comprehensive and Detailed In-Depth
The priorities are data integrity, recoverability after a regional disaster, low RPO (minimal data loss), and low latency for primary operations. Let's analyze:
Option A: Multi-region backups store point-in-time snapshots in a separate region. With automated backups and transaction logs, RPO can be near-zero (e.g., minutes), and recovery is possible post-disaster. Primary operations remain in one zone, minimizing latency.
Option B: Regional HA (failover to another zone) with hourly cross-region backups protects against zone failures, but hourly backups yield an RPO of up to 1 hour, which is too high for valuable data. Manual backup management adds overhead.
Option C: Synchronous replication to another zone ensures zero RPO within a region but doesn't protect against regional loss. Latency increases slightly due to sync writes across zones.
Option D: Asynchronous replication to another region keeps RPO low (typically seconds) without adding write latency, but replication lag is unbounded under load, committed transactions can be lost on failover, and corruption on the primary is replicated as well, so it does not by itself guarantee uncorrupted, recoverable data.
Why A is Best: Multi-region backups balance low RPO (continuous transaction logs plus frequent automated backups) with no latency impact on primary operations (single-zone instance). For example, enabling backups to us-east1 from a us-west1 instance ensures recoverability without affecting performance, prioritizing RPO as requested.
Extract from Google Documentation: From 'Cloud SQL Backup and Recovery' (https://cloud.google.com/sql/docs/postgres/backup-recovery): 'Configure multi-region backups to store data in a different region, ensuring recoverability after a regional failure with minimal RPO using automated backups and transaction logs, while keeping primary operations low-latency.'
Reference: Google Cloud Documentation - 'Cloud SQL Backups' (https://cloud.google.com/sql/docs/postgres/backup-recovery).
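A minimal sketch of such a configuration using the Cloud SQL Admin API is shown below; the project ID, instance name, and backup location are hypothetical, and the same settings can be applied in the console or with gcloud.

```python
# Enable automated backups with point-in-time recovery, stored in a
# multi-region location separate from the instance's own region.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1")  # uses application-default credentials

body = {
    "settings": {
        "backupConfiguration": {
            "enabled": True,
            "pointInTimeRecoveryEnabled": True,  # transaction logs for low RPO
            "location": "us",                    # multi-region backup storage
        }
    }
}
sqladmin.instances().patch(
    project="my-project", instance="finance-postgres", body=body
).execute()
```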