Which of the following issues should a data scientist be most concerned about when generating a synthetic data set?
Answer : D
If synthetic data don't accurately mirror the real-world distributions and relationships, any models trained on them will perform poorly in deployment. Representativeness is the critical concern when generating synthetic data.
A data scientist is using the following confusion matrix to assess model performance:

The model is predicting whether a delivery truck will be able to make 200 scheduled delivery stops. Every time the model is correct, the company saves an hour in planning and scheduling of maintenance work. Every time the model is wrong, the company loses four hours of delivery time for the truck. Which of the following is the net model impact for the company?
Answer : A
Treat each "predicted-to-fail" and "predicted-to-succeed" row as coming from 100 cases apiece (200 total).
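The arithmetic behind the net impact can be sketched as follows. This is only an illustration of the calculation: the confusion matrix itself is not reproduced here, so the counts in the example call are hypothetical placeholders (chosen to total 200) and are not the values from the question. The function and parameter names are likewise assumptions.

```python
def net_impact_hours(correct, wrong, hours_saved_per_correct=1, hours_lost_per_wrong=4):
    """Return net hours gained (positive) or lost (negative) by the company."""
    return correct * hours_saved_per_correct - wrong * hours_lost_per_wrong


# Hypothetical counts for illustration only; NOT taken from the question's matrix.
example_correct = 150
example_wrong = 50
print(net_impact_hours(example_correct, example_wrong))
```

Substituting the matrix's actual correct counts (true positives plus true negatives) and wrong counts (false positives plus false negatives) gives the figure the question asks for.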
Which of the following does k represent in the k-means model?
Answer : C
In k-means clustering, the parameter k directly defines how many clusters the algorithm will partition the data into.
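To show where k enters the algorithm, here is a minimal pure-Python sketch of the standard iterative (Lloyd-style) k-means procedure. It is an illustration under simplifying assumptions, not a production implementation: the function name, the fixed iteration count, and the seeded random initialization are all choices made for this example.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Partition a list of equal-length numeric tuples into k clusters."""
    rng = random.Random(seed)
    # k determines how many initial centroids are drawn, and so how many clusters exist.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins the cluster of its nearest centroid.
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # leave the centroid in place if its cluster is empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters


sample_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(sample_points, k=2)
print(len(clusters))
```

Changing k in the call changes the number of centroids and therefore the number of groups returned; nothing else in the procedure decides that count.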
Which of the following distance metrics for KNN is best described as a straight line?
Answer : B
Euclidean distance measures the straight-line distance between two points in space, matching the geometric "as-the-crow-flies" notion of distance.
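As a small illustration, the straight-line distance is the square root of the summed squared coordinate differences. The sketch below assumes both points are numeric sequences of the same length; the function name is chosen for this example.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# A 3-4-5 right triangle: the straight line between the points has length 5.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```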
A data scientist is presenting the recommendations from a monthslong modeling and experiment process to the company's Chief Executive Officer. Which of the following is the best set of artifacts to include in the presentation?
Answer : B
Executive audiences need concise, high-level insights: what you found (results), what you suggest (recommendations), why it matters (justifications), and visual summaries (clear charts). Detailed methods, code, or raw data aren't appropriate at the C-suite level.
Which of the following measures would a data scientist most likely use to calculate the similarity of two text strings?
Answer : B
Edit distance (Levenshtein distance) counts how many single-character insertions, deletions, or substitutions are needed to transform one string into another. The smaller the count, the more similar the two strings are.
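The count of insertions, deletions, and substitutions can be computed with a standard dynamic-programming table. The sketch below is one common way to do it, keeping only the previous row of the table; the function name is an assumption for this example.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,                      # delete ca
                    curr[j - 1] + 1,                  # insert cb
                    prev[j - 1] + substitution_cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Identical strings score 0, and larger values indicate less similar strings.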
A data scientist is standardizing a large data set that contains website addresses. A specific string inside some of the web addresses needs to be extracted. Which of the following is the best method for extracting the desired string from the text data?
Answer : A