A company uses AWS Glue jobs to implement several data pipelines. The pipelines are critical to the company.
The company needs to implement a monitoring mechanism that will alert stakeholders if the pipelines fail.
Which solution will meet these requirements with the LEAST operational overhead?
Answer : A
Creating an EventBridge rule that triggers a Lambda function on AWS Glue job failure events and then sends notifications via Amazon SNS is the most direct and operationally efficient method:
"Practice Quiz 10: A data engineer must monitor the data pipeline... Which solution will meet these requirements?
A. Inspect the job run monitoring section of the AWS Glue console.
Correct Answer: A."
-- Ace the AWS Certified Data Engineer - Associate Certification - version 2 - apple.pdf
This reference supports using AWS Glue's built-in monitoring features, and it implies that solutions like A, which act directly on EventBridge failure events to automate alerting, are less complex than constructing custom logs and metrics.
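As an illustration of the failure events mentioned above, the following is a minimal sketch of an event pattern such an EventBridge rule might use. It assumes the aws.glue event source and the "Glue Job State Change" detail type. The choice of which states count as failures, the sample event, and the local matches helper (a simplified stand-in, not EventBridge's own matcher) are assumptions made for illustration.

```python
import json

# Event pattern for an EventBridge rule that fires when a Glue job run ends badly.
# Which states count as a failure is an assumption; adjust to the pipeline's needs.
GLUE_FAILURE_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT", "STOPPED"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local check that an event satisfies a pattern.

    Handles only exact-value lists and nested dicts, which is all the
    pattern above uses. It is not a full EventBridge pattern matcher.
    """
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical sample event, trimmed to the fields the pattern inspects.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {"jobName": "example-pipeline-job", "state": "FAILED"},
}

# JSON form of the pattern, as a rule definition would expect it.
pattern_json = json.dumps(GLUE_FAILURE_PATTERN)
print(matches(GLUE_FAILURE_PATTERN, sample_event))
```

The JSON string could serve as the event pattern of the rule, with the notification target attached to that rule.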
A company needs to store semi-structured transactional data in a serverless database.
The application writes data infrequently but reads it frequently, with millisecond retrieval required.
Answer : D
Amazon DynamoDB is a serverless, low-latency, NoSQL database ideal for semi-structured data.
Adding DynamoDB Accelerator (DAX) provides microsecond response times for read-heavy workloads.
"For applications requiring millisecond or sub-millisecond reads with serverless operation, use DynamoDB with DAX caching."
-- Ace the AWS Certified Data Engineer - Associate Certification - version 2 - apple.pdf
An ecommerce company stores sales data in an AWS Glue table named sales_data. The company stores the sales_data table in an Amazon S3 Standard bucket. The table contains columns named order_id, customer_id, product_id, order_date, shipping_date, and order_amount.
The company wants to improve query performance by partitioning the sales_data table by order_date. The company needs to add the partition to the existing sales_data table in AWS Glue.
Which solution will meet these requirements?
Answer : C
In AWS Glue, table partitions are managed as metadata objects within the AWS Glue Data Catalog. To add a new partition to an existing table, the correct and supported approach is to use the AWS Glue Data Catalog API, such as the CreatePartition operation, or equivalent console or SDK actions.
Updating the table schema alone does not create partitions or inform Glue about new partition values. Editing metadata files directly in Amazon S3 is unsupported and can corrupt the Data Catalog. Manually modifying the S3 bucket structure without registering partitions in Glue will result in Athena and other query engines being unable to recognize the partitions.
By adding partitions through the Glue Data Catalog API, query engines such as Amazon Athena and Amazon Redshift Spectrum can perform partition pruning, which significantly improves query performance by scanning only relevant data.
This method aligns with AWS best practices, ensures metadata consistency, and avoids unnecessary operational risk. Therefore, Option C is the correct solution.
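The CreatePartition call described above can be sketched as follows. This is a minimal sketch, assuming the table already declares order_date as a partition key; the bucket name, the database name passed to register_partition, and both helper functions are hypothetical. A real call would usually copy the table's full storage descriptor (columns, formats, SerDe) and override only the location.

```python
def build_partition_input(order_date: str, table_location: str) -> dict:
    """Build the PartitionInput structure for one order_date partition.

    Only the partition value and its S3 location are set here; the rest of
    the storage descriptor is omitted to keep the sketch short.
    """
    base = table_location.rstrip("/")
    return {
        "Values": [order_date],
        "StorageDescriptor": {
            "Location": f"{base}/order_date={order_date}/",
        },
    }


def register_partition(database: str, table: str, partition_input: dict) -> None:
    """Register the partition in the Glue Data Catalog.

    Requires boto3 and AWS credentials; boto3 is imported lazily so the
    builder above can be used without the SDK installed.
    """
    import boto3

    boto3.client("glue").create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput=partition_input,
    )


# Hypothetical bucket and date, for illustration only.
partition = build_partition_input("2024-01-15", "s3://example-bucket/sales_data")
print(partition)
```

Calling register_partition with a hypothetical database name, the table name sales_data, and this structure would add the partition to the catalog.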
A company processes 500 GB of audience and advertising data daily, storing CSV files in Amazon S3 with schemas registered in AWS Glue Data Catalog. They need to convert these files to Apache Parquet format and store them in an S3 bucket.
The solution requires a long-running workflow with 15 GiB memory capacity to process the data concurrently, followed by a correlation process that begins only after the first two processes complete.
Which solution will meet these requirements with the LEAST operational overhead?
Answer : C
Option C is correct because AWS Glue workflows are designed to orchestrate multiple ETL jobs, crawlers, and triggers with dependency management and a visual workflow graph. AWS documentation states that Glue workflows can create and visualize complex ETL activities involving multiple jobs and triggers, and that triggers can be configured so that a job starts only when multiple watched jobs satisfy specified completion conditions. AWS also states that a conditional trigger can fire when any or all watched jobs complete with the desired status. That directly matches the requirement to run the first two processes concurrently and begin the third process only after both finish.
This is also the option with the least operational overhead because the whole workflow stays inside AWS Glue, which already fits the ETL conversion use case from CSV to Parquet on S3 with catalog integration. Option A would work but adds MWAA operational complexity. Option B is more custom and infrastructure-heavy. Option D, which uses Lambda, is not ideal for long-running, memory-intensive ETL steps. Therefore, an AWS Glue workflow is the most direct and exam-aligned answer.
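The "start only after both jobs finish" condition can be sketched as a conditional trigger with a logical AND over the two watched jobs. This is a minimal sketch under stated assumptions: the workflow name, the three job names, and the helper functions are hypothetical, and the jobs and workflow are assumed to exist already.

```python
from typing import List


def build_join_trigger(workflow: str, upstream_jobs: List[str], downstream_job: str) -> dict:
    """Build arguments for a conditional trigger that fires only when every
    upstream job has succeeded, then starts the downstream job."""
    return {
        "Name": f"{workflow}-start-{downstream_job}",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Logical": "AND",  # all conditions must hold, not just any one
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in upstream_jobs
            ],
        },
        "Actions": [{"JobName": downstream_job}],
    }


def create_join_trigger(trigger_args: dict) -> None:
    """Create the trigger in AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("glue").create_trigger(**trigger_args)


# Hypothetical names, for illustration only.
trigger_args = build_join_trigger(
    "ad-data-workflow",
    ["convert-audience-to-parquet", "convert-advertising-to-parquet"],
    "correlate-datasets",
)
print(trigger_args["Predicate"]["Logical"])
```

Using AND rather than ANY in the predicate is what makes the correlation job wait for both conversion jobs instead of starting after the first one completes.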
A data engineer is building a new data pipeline that stores metadata in an Amazon DynamoDB table. The data engineer must ensure that all items that are older than a specified age are removed from the DynamoDB table daily.
Which solution will meet this requirement with the LEAST configuration effort?
Answer : A
DynamoDB Time to Live (TTL) automatically expires and deletes items based on a timestamp attribute. It requires minimal setup and runs without additional infrastructure.
The engineer only needs to define a TTL attribute (e.g., expiration_time) and set it to the expiration time, in Unix epoch seconds, when items are inserted.
"To automatically delete old items from a DynamoDB table, enable TTL and define a timestamp attribute. Items are removed automatically after the specified time with no extra maintenance."
-- Ace the AWS Certified Data Engineer - Associate Certification - version 2 - apple.pdf
This approach is serverless, incurs no additional charge for the deletions, and requires no scheduling or code-based cleanup.
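The TTL setup can be sketched in a few lines. This is a minimal sketch, assuming the expiration_time attribute name from the example above; the id key attribute, the 30-day retention period, and the helper functions are hypothetical.

```python
import time
from typing import Optional

TTL_ATTRIBUTE = "expiration_time"  # attribute name from the explanation above
SECONDS_PER_DAY = 86400


def expiry_epoch(max_age_days: int, now: Optional[float] = None) -> int:
    """Return the Unix epoch second at which an item written now should expire."""
    current = time.time() if now is None else now
    return int(current) + max_age_days * SECONDS_PER_DAY


def build_item(item_id: str, max_age_days: int, now: Optional[float] = None) -> dict:
    """Build a DynamoDB item (low-level attribute format) carrying the TTL value.

    The "id" key attribute is a hypothetical schema for illustration.
    """
    return {
        "id": {"S": item_id},
        TTL_ATTRIBUTE: {"N": str(expiry_epoch(max_age_days, now))},
    }


def enable_ttl(table_name: str) -> None:
    """Turn on TTL for the table (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without the SDK

    boto3.client("dynamodb").update_time_to_live(
        TableName=table_name,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": TTL_ATTRIBUTE},
    )


# Hypothetical item with a 30-day retention period.
item = build_item("pipeline-run-001", max_age_days=30)
print(item)
```

TTL is enabled once per table; after that, each write only needs to include the epoch value.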
A retail company stores customer data in an Amazon S3 bucket. Some of the customer data contains personally identifiable information (PII) about customers. The company must not share PII data with business partners.
A data engineer must determine whether a dataset contains PII before making objects in the dataset available to business partners.
Which solution will meet this requirement with the LEAST manual intervention?
Answer : A
Amazon Macie is a fully managed data security and privacy service that uses machine learning to automatically discover, classify, and protect sensitive data in AWS, such as PII. By configuring Macie for automated sensitive data discovery, the company can minimize manual intervention while ensuring PII is identified before data is shared.
A company extracts approximately 1 TB of data every day from data sources such as SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change.
A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation.
Which solution will meet these requirements with the LEAST operational overhead?
Answer : B
AWS Glue is a fully managed service that provides a serverless data integration platform. It can automatically discover and categorize data from various sources, including SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. It can also infer the schema of the data and store it in the AWS Glue Data Catalog, which is a central metadata repository. AWS Glue can then use the schema information to generate and run Apache Spark code to extract, transform, and load the data into an Amazon S3 bucket. AWS Glue can also monitor and optimize the performance and cost of the data pipeline, and handle any schema changes that may occur in the source data.
AWS Glue can meet the SLA of loading the data into the S3 bucket within 15 minutes of data creation, as it can trigger the data pipeline based on events, schedules, or on-demand. AWS Glue has the least operational overhead among the options, as it does not require provisioning, configuring, or managing any servers or clusters. It also handles scaling, patching, and security automatically.
Reference:
AWS Glue
AWS Glue Data Catalog
AWS Glue Developer Guide
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide