Snowflake DSA-C02 SnowPro Advanced: Data Scientist Certification Exam Practice Test

Page: 1 / 14
Total 65 questions
Question 1

What can a Snowflake data scientist do in the Snowflake Marketplace as a consumer?



Answer : A, B, C, D

As a consumer, you can do the following:

* Discover and test third-party data sources.

* Receive frictionless access to raw data products from vendors.

* Combine new datasets with your existing data in Snowflake to derive new business insights.

* Have datasets available instantly and updated continually for users.

* Eliminate the costs of building and maintaining various APIs and data pipelines to load and update data.

* Use the business intelligence (BI) tools of your choice.


Question 2

Which one of the following is not a feature engineering technique used in machine learning and data science?



Answer : D

Feature engineering is the pre-processing step of machine learning that transforms raw data into features that can be used to build a predictive model with machine learning or statistical modelling.

What is a feature?

Generally, all machine learning algorithms take input data to generate an output. The input data is in tabular form, consisting of rows (instances or observations) and columns (variables or attributes), and these attributes are often known as features. For example, an image is an instance in computer vision, but a line in the image could be a feature. Similarly, in NLP, a document can be an observation, and the word count could be a feature. So, a feature is an attribute that impacts a problem or is useful for solving it.

What is Feature Engineering?

Feature engineering extracts features from raw data. It helps represent the underlying problem to predictive models in a better way, which in turn improves the accuracy of the model on unseen data. A predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.

Some of the popular feature engineering techniques include:

1. Imputation

Feature engineering has to deal with inappropriate data, missing values, human error, general errors, insufficient data sources, and so on. Missing values in a dataset strongly affect the performance of an algorithm, and the imputation technique is used to deal with them. Imputation is responsible for handling such irregularities within the dataset.

One option is to drop an entire row or column when it contains a large percentage of missing values. To preserve the size of the dataset, however, the missing data can instead be imputed as follows:

For numerical data, missing values can be filled with a default value or with the mean or median of the column.

For categorical data, missing values can be replaced with the most frequent value in the column.
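The two imputation rules above can be sketched with pandas. This is a minimal illustration, not a prescribed method: the data frame and its `age` and `city` columns are hypothetical.

```python
import pandas as pd

# Hypothetical data with one gap in a numerical and one in a categorical column.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Paris", "Rome", None, "Paris"],
})

# Numerical column: fill missing values with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill missing values with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

The median is often preferred over the mean when the column contains extreme values, since it is less sensitive to them.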

2. Handling Outliers

Outliers are data points that lie so far from the other data points that they badly affect the performance of the model. This feature engineering technique first identifies the outliers and then removes them.

The standard deviation can be used to identify outliers. Each value lies at some distance from the mean, and a value whose distance exceeds a chosen threshold can be considered an outlier. The z-score can also be used to detect outliers.
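A minimal sketch of z-score based outlier detection with NumPy follows. The sample values and the threshold of 2 standard deviations are assumptions chosen for illustration; a threshold of 3 is also common on larger samples.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

# z-score: distance from the mean in units of standard deviation.
z = (values - values.mean()) / values.std()

threshold = 2.0  # assumed cut-off, not taken from the source
outliers = values[np.abs(z) > threshold]
cleaned = values[np.abs(z) <= threshold]
```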

3. Log transform

Logarithm transformation, or log transform, is one of the most commonly used mathematical techniques in machine learning. It helps in handling skewed data and brings the distribution closer to normal after transformation. It also reduces the effect of outliers: because it compresses differences in magnitude, the model becomes more robust.
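As a small illustration, assuming a hypothetical right-skewed income column, `np.log1p` (the logarithm of 1 + x, which also handles zeros) compresses the gap between typical and extreme values:

```python
import numpy as np

# Hypothetical right-skewed values: one income dwarfs the others.
incomes = np.array([20_000.0, 25_000.0, 30_000.0, 1_000_000.0])

# log1p computes log(1 + x), so a value of 0 would not cause an error.
log_incomes = np.log1p(incomes)

# Spread between largest and smallest value before and after the transform.
ratio_before = incomes.max() / incomes.min()
ratio_after = log_incomes.max() / log_incomes.min()
```

The transform is reversible with `np.expm1`, so predictions made on the log scale can be mapped back to the original units.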

4. Binning

In machine learning, overfitting is one of the main issues that degrade the performance of a model. It occurs when the model has a large number of parameters or the data is noisy. Binning, one of the popular feature engineering techniques, can be used to reduce the effect of noisy data. The process involves segmenting the values of a feature into bins.
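Binning can be sketched with `pandas.cut`. The ages, bin edges, and labels below are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical continuous feature.
ages = pd.Series([5, 17, 18, 34, 65, 80])

# Assumed bin edges and labels; right=False makes each bin include its
# left edge and exclude its right edge: [0, 18), [18, 65), [65, 120).
bins = [0, 18, 65, 120]
labels = ["child", "adult", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
```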

5. Feature Split

As the name suggests, feature split is the process of splitting a feature into two or more parts to create new features. This technique helps algorithms better understand and learn the patterns in the dataset.

Feature splitting also enables the new features to be clustered and binned, which helps extract useful information and improves the performance of the models.
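A minimal sketch of a feature split, assuming a hypothetical `full_name` column that is divided into two new features:

```python
import pandas as pd

# Hypothetical feature that packs two pieces of information into one string.
df = pd.DataFrame({"full_name": ["Ada Lovelace", "Alan Turing"]})

# Split on the first space only (n=1) and expand the parts into two columns.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)
```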

6. One hot encoding

One hot encoding is a popular encoding technique in machine learning. It converts categorical data into a form that machine learning algorithms can easily understand and therefore use to make good predictions. It enables the grouping of categorical data without losing any information.
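One hot encoding can be sketched with `pandas.get_dummies`. The `color` column is a hypothetical example.

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 column; dtype=int gives integers
# instead of booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
```

Every row has exactly one column set to 1, so the original category can always be recovered from the encoded columns.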


Question 3

Mark the incorrect statement regarding the usage of Snowflake Streams and Tasks.



Answer : D

All statements are correct except the one claiming that a standard stream tracks row inserts only.

A standard (i.e. delta) stream tracks all DML changes to the source object, including inserts, updates, and deletes (including table truncates).


Question 4

What is the formula for measuring skewness in a dataset?



Answer : C

Since the normal curve is symmetric about its mean, its skewness is zero. This is a theoretical explanation; for mathematical proofs, you can refer to books or websites that cover the topic in detail.
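The answer options for this question are not reproduced here, so the following is only a sketch of two widely used skewness formulas, not a statement of which one option C contains: the moment-based (Fisher-Pearson) coefficient and Pearson's second coefficient, 3 * (mean - median) / standard deviation. The function names and sample data are assumptions for illustration.

```python
import numpy as np

def moment_skewness(x):
    """Fisher-Pearson coefficient: third central moment / (second central moment) ** 1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5

def pearson_second_skewness(x):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    x = np.asarray(x, dtype=float)
    return 3 * (x.mean() - np.median(x)) / x.std()

symmetric = [1, 2, 3, 4, 5]          # symmetric about its mean
right_skewed = [1, 1, 2, 2, 3, 20]   # long tail to the right
```

Both functions return zero for the symmetric sample and a positive value for the right-skewed one, which matches the statement that a symmetric distribution has zero skewness.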


Question 5

There are several types of classification tasks in machine learning. Choose the classification type that best categorizes the following machine learning application tasks:

* To detect whether email is spam or not

* To determine whether or not a patient has a certain disease in medicine.

* To determine whether or not quality specifications were met when it comes to QA (Quality Assurance).



Answer : C

Supervised machine learning algorithms can be broadly divided into regression and classification algorithms. Regression algorithms predict continuous output values, whereas classification algorithms are needed to predict categorical values.

What is the Classification Algorithm?

A classification algorithm is a supervised learning technique used to identify the category of new observations on the basis of training data. In classification, a program learns from the given dataset or observations and then assigns each new observation to one of a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog. Classes are also called targets, labels, or categories.

Unlike regression, the output variable of classification is a category, not a numeric value, such as 'Green or Blue' or 'fruit or animal'. Because classification is a supervised learning technique, it takes labeled input data, meaning each input comes with its corresponding output.

In a classification algorithm, the input variable (x) is mapped to a discrete output (y) by a function f.

y=f(x), where y = categorical output

A well-known example of an ML classification algorithm is an email spam detector.

The main goal of a classification algorithm is to identify the category of a given observation, and these algorithms are mainly used to predict categorical outputs.

An algorithm that implements classification on a dataset is known as a classifier. There are two types of classifiers:

Binary classifier: If the classification problem has only two possible outcomes, the classifier is called a binary classifier.

Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.

Multi-class classifier: If the classification problem has more than two possible outcomes, the classifier is called a multi-class classifier.

Examples: classification of types of crops, classification of types of music.
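To make the mapping y = f(x) concrete, here is a minimal sketch of a one-nearest-neighbour classifier in plain Python. It is a toy illustration, not a recommended model: the features (a count of suspicious words, a tempo in beats per minute) and all training values are hypothetical.

```python
def nearest_neighbor_classifier(train_x, train_y):
    """Return a function f with y = f(x): the label of the closest training point."""
    def classify(x):
        closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[closest]
    return classify

# Binary classifier: two possible labels.
# Hypothetical feature: number of suspicious words in an email.
spam_filter = nearest_neighbor_classifier(
    [0, 1, 8, 10], ["not spam", "not spam", "spam", "spam"]
)

# Multi-class classifier: more than two possible labels.
# Hypothetical feature: tempo of a song in beats per minute.
genre = nearest_neighbor_classifier([60, 120, 180], ["ballad", "pop", "metal"])
```

In both cases the classifier learns from labeled examples and returns a category rather than a numeric value; only the number of possible labels differs.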

Binary classification in deep learning refers to the type of classification where there are two class labels, one normal and one abnormal. Some examples of binary classification use:

* To detect whether email is spam or not

* To determine whether or not a patient has a certain disease in medicine.

* To determine whether or not quality specifications were met when it comes to QA (Quality Assurance).

For example, the normal class label would be that a patient has the disease, and the abnormal class label would be that they do not, or vice versa.

As with every other type of classification, a binary classifier is only as good as the dataset it is trained on. In other words, the more training data it has, the better it performs.


Question 6

Consider a data frame df with 10 rows and index [ 'r1', 'r2', 'r3', 'row4', 'row5', 'row6', 'r7', 'r8', 'r9', 'row10']. What does the expression g = df.groupby(df.index.str.len()) do?



Answer : D

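The expression does not raise an error. df.index.str.len() returns the length of each index label, and groupby accepts such an array as its key, so the rows are grouped by label length: the six rows 'r1' to 'r3' and 'r7' to 'r9' (length 2), the three rows 'row4' to 'row6' (length 4), and the single row 'row10' (length 5).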
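Running the expression is the simplest way to check its behaviour. A minimal sketch, assuming a hypothetical single-column frame (the `value` column is not part of the question):

```python
import pandas as pd

index = ["r1", "r2", "r3", "row4", "row5", "row6", "r7", "r8", "r9", "row10"]
df = pd.DataFrame({"value": range(10)}, index=index)

# The key is the length of each index label, one entry per row.
g = df.groupby(df.index.str.len())

# Number of rows in each group, keyed by label length.
group_sizes = g.size().to_dict()
```

In pandas this produces a regular `DataFrameGroupBy` object with one group per distinct label length.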


Question 7

Skewness of Normal distribution is ___________



Answer : C

Since the normal curve is symmetric about its mean, its skewness is zero. This is a theoretical explanation; for mathematical proofs, you can refer to books or websites that cover the topic in detail.

