CompTIA DataAI Certification Exam DY0-001 Practice Questions

Page: 1 / 14
Total 85 questions
Question 1

Which of the following JOINS would generate the largest amount of data?



Answer : C

A CROSS JOIN produces the Cartesian product of the two tables: every row from the first is paired with every row from the second. Tables with m and n rows therefore yield m × n rows, which is typically far more than any of the other join types return.
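As a rough illustration of the row counts involved, the sketch below builds two small in-memory SQLite tables and counts the rows returned by three join types. The table names, columns, and sample rows are made up for this example and are not part of the exam question.

```python
import sqlite3

# Hypothetical sample data: 3 customers and 4 orders (one order has no matching customer).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 'pen'), (1, 'ink'), (2, 'pad'), (4, 'cap');
""")

def row_count(sql):
    """Return how many rows the given SELECT statement produces."""
    return conn.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]

cross_rows = row_count("SELECT * FROM customers CROSS JOIN orders")
inner_rows = row_count(
    "SELECT * FROM customers c INNER JOIN orders o ON c.id = o.customer_id"
)
left_rows = row_count(
    "SELECT * FROM customers c LEFT JOIN orders o ON c.id = o.customer_id"
)

print(cross_rows, inner_rows, left_rows)  # 12 3 4
```

With 3 and 4 input rows, the cross join returns 3 × 4 = 12 rows, while the inner and left joins return only the matched rows (plus, for the left join, the one unmatched customer).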


Question 2

Which of the following measures would a data scientist most likely use to calculate the similarity of two text strings?



Answer : B

Edit distance quantifies how many single-character insertions, deletions, or substitutions are needed to transform one string into another, making it a direct measure of their similarity.
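A minimal sketch of this calculation, assuming the Levenshtein variant of edit distance (unit cost for each insertion, deletion, or substitution). The function name is chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("flaw", "lawn"))       # 2
```

A smaller result means the strings are more similar; identical strings have a distance of 0.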


Question 3

A data scientist is using the following confusion matrix to assess model performance:

The model is predicting whether a delivery truck will be able to make 200 scheduled delivery stops. Every time the model is correct, the company saves an hour in planning and scheduling of maintenance work. Every time the model is wrong, the company loses four hours of delivery time for the truck. Which of the following is the net model impact for the company?



Answer : A

Treat each "predicted-to-fail" and "predicted-to-succeed" row as coming from 100 cases apiece (200 total).
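The arithmetic behind this kind of question can be sketched as follows. The confusion matrix is not reproduced here, so the counts below are placeholders invented for illustration, not the values from the exam; only the one hour saved per correct prediction and four hours lost per wrong prediction come from the question.

```python
def net_impact_hours(correct, wrong, saved_per_correct=1, lost_per_wrong=4):
    """Net hours gained: hours saved on correct calls minus hours lost on wrong ones."""
    return correct * saved_per_correct - wrong * lost_per_wrong

# Placeholder counts (NOT from the exam's matrix): two rows of 100 cases each.
predicted_fail_correct, predicted_fail_wrong = 90, 10
predicted_succeed_correct, predicted_succeed_wrong = 80, 20

correct = predicted_fail_correct + predicted_succeed_correct  # 170
wrong = predicted_fail_wrong + predicted_succeed_wrong        # 30

net = net_impact_hours(correct, wrong)
print(net)  # 50, i.e. 170 * 1 - 30 * 4
```

Substituting the actual diagonal (correct) and off-diagonal (wrong) counts from the matrix gives the net impact for the question.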


Question 4

Which of the following distance metrics for KNN is best described as a straight line?



Answer : B

Euclidean distance measures the straight-line distance between two points in space, matching the geometric "as-the-crow-flies" notion of distance.
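To make the straight-line idea concrete, the sketch below computes Euclidean distance for two points and, for contrast, Manhattan distance, which sums the per-axis differences instead. The helper names and sample points are chosen for this example.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Grid-path distance: sum of the absolute per-axis differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

For the points (0, 0) and (3, 4), the straight line is the hypotenuse of a 3-4-5 triangle, while the grid path covers both legs.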


Question 5

A data scientist wants to digitize historical hard copies of documents. Which of the following is the best method for this task?



Answer : B

OCR (optical character recognition) converts scanned images of text into machine-readable characters, making it the appropriate tool for digitizing printed or handwritten historical documents.


Question 6

Which of the following distribution methods or models can most effectively represent the actual arrival times of a bus that runs on an hourly schedule?



Answer : C

Scheduled buses tend to arrive around a fixed time with random delays that cluster symmetrically around the hour. A normal distribution effectively models those continuous, bell-shaped deviations from the exact schedule.
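As a small sketch of this model, the standard library's `statistics.NormalDist` can represent the deviation from the scheduled time. The 3-minute standard deviation is an assumption made up for this example; the question gives no figure.

```python
from statistics import NormalDist

# Deviation from the scheduled arrival in minutes (negative = early).
# mu=0 centers arrivals on the schedule; sigma=3.0 is an assumed spread.
delay = NormalDist(mu=0.0, sigma=3.0)

# Probability that the bus arrives within 5 minutes of the scheduled time.
within_5 = delay.cdf(5) - delay.cdf(-5)

# Probability that the bus arrives early; symmetry around the schedule gives 0.5.
early = delay.cdf(0)

print(round(within_5, 3), early)
```

Because the distribution is symmetric around the scheduled time, early and late arrivals are equally likely, and most arrivals fall within a couple of standard deviations of the hour.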


Question 7

Which of the following issues should a data scientist be most concerned about when generating a synthetic data set?



Answer : D

If synthetic data don't accurately mirror the real-world distributions and relationships, any models trained on them will perform poorly in deployment. Representativeness is the critical concern when generating synthetic data.

