Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate
format for this kind of data?
Answer : D
The schema passed to the schema() method should be of type StructType or a string, so all options in which a list is passed are incorrect.
In addition, since all numbers are whole numbers, the IntegerType() data type is the correct option here. NumberType() is not a valid data type and StringType() would fail, since the parquet file is
stored in the 'most appropriate format for this kind of data', meaning that it is most likely an IntegerType, and Spark does not convert data types if a schema is provided.
Also note that StructType accepts only a single argument (a list of StructFields). So, passing multiple arguments is invalid.
Finally, Spark needs to know which format the file is in. However, none of the options fail on this point, since Spark assumes Parquet as the default format when no file format is explicitly specified. A short example of such a schema-based read follows the documentation references below.
More info: pyspark.sql.DataFrameReader.schema --- PySpark 3.1.2 documentation and StructType --- PySpark 3.1.2 documentation
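Since the exact answer options are not reproduced here, the following is only a plausible sketch of the kind of read described above; the column names are placeholders, and filePath is assumed to hold the file location given in the question.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: the column names are placeholders, since the file's layout is not shown.
# Note that StructType takes a single argument, a list of StructFields.
schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("value", IntegerType(), True),
])

# filePath is assumed to hold the location from the question.
# Spark assumes Parquet when no format is specified explicitly.
df = spark.read.schema(schema).parquet(filePath)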
Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?
Answer : C
itemsDf.withColumnRenamed('supplier', 'manufacturer')
Correct! This uses the relatively trivial DataFrame method withColumnRenamed for renaming column supplier to column manufacturer.
Note that the question asks for 'a copy of DataFrame itemsDf'. This may be confusing if you are not familiar with Spark yet. RDDs (Resilient Distributed Datasets) are the foundation of Spark DataFrames and are immutable. As such, DataFrames are immutable, too. Any command that changes anything in the DataFrame therefore necessarily returns a copy, or a new version, of it that has the changes applied (see the short example after this question's references).
itemsDf.withColumnsRenamed('supplier', 'manufacturer')
Incorrect. Spark's DataFrame API does not have a withColumnsRenamed() method.
itemsDf.withColumnRenamed(col('manufacturer'), col('supplier'))
No. Watch out -- although the col() method works for many methods of the DataFrame API, withColumnRenamed is not one of them. As outlined in the documentation linked below,
withColumnRenamed expects strings.
itemsDf.withColumn(['supplier', 'manufacturer'])
Wrong. While DataFrame.withColumn() exists in Spark, it has a different purpose than renaming columns. withColumn is typically used to add columns to DataFrames, taking the name of the new column as the first argument and a Column as the second argument. Learn more via the documentation that is linked below.
itemsDf.withColumn('supplier').alias('manufacturer')
No. While DataFrame.withColumn() exists, it requires 2 arguments. Furthermore, the alias() method on DataFrames would not help much with renaming a column. DataFrame.alias() can be useful for disambiguating the inputs of join statements. However, this is far outside the scope of this question. If you are curious nevertheless, check out the link below.
More info: pyspark.sql.DataFrame.withColumnRenamed --- PySpark 3.1.1 documentation, pyspark.sql.DataFrame.withColumn --- PySpark 3.1.1 documentation, and pyspark.sql.DataFrame.alias --- PySpark 3.1.1 documentation
(Databricks import instructions: https://bit.ly/sparkpracticeexams_import_instructions)
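A minimal, self-contained sketch of the correct approach; the data is made up, since itemsDf's full schema is not reproduced in this question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for itemsDf.
itemsDf = spark.createDataFrame(
    [(1, "Sports Company Inc."), (2, "YetiX")],
    ["itemId", "supplier"],
)

# Returns a new DataFrame in which supplier is renamed to manufacturer;
# itemsDf itself remains unchanged, since DataFrames are immutable.
manufacturerDf = itemsDf.withColumnRenamed("supplier", "manufacturer")
manufacturerDf.show()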
Which of the following are valid execution modes?
Answer : B
This is a tricky question to get right, since it is easy to confuse execution modes and deployment modes. Even in the literature, both terms are sometimes used interchangeably.
There are only 3 valid execution modes in Spark: client, cluster, and local. Execution modes do not refer to specific frameworks, but to where the Spark components are located with respect to each other.
In client mode, the driver sits on a machine outside the cluster. In cluster mode, the driver sits on a machine inside the cluster. Finally, in local mode, all Spark infrastructure is started in a single JVM (Java Virtual Machine) on a single computer, which then also hosts the driver.
Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses specific frameworks outside Spark. Valid deployment modes are standalone, Apache YARN,
Apache Mesos and Kubernetes.
Client, Cluster, Local
Correct, all of these are the valid execution modes in Spark.
Standalone, Client, Cluster
No, standalone is not a valid execution mode. It is a valid deployment mode, though.
Kubernetes, Local, Client
No, Kubernetes is a deployment mode, but not an execution mode.
Cluster, Server, Local
No, Server is not an execution mode.
Server, Standalone, Client
No, standalone and server are not execution modes.
More info: Apache Spark Internals - Learning Journal
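As a rough, non-authoritative sketch, this is how the three execution modes are typically selected in practice; the application and file names are placeholders. Note that, somewhat confusingly, the spark-submit flag that chooses between client and cluster execution is called --deploy-mode.
from pyspark.sql import SparkSession

# Local mode: driver and executors run inside a single JVM on one machine.
spark = SparkSession.builder.master("local[*]").appName("local-demo").getOrCreate()

# Client vs. cluster mode is chosen when submitting the application to a cluster manager, e.g.:
#   spark-submit --master yarn --deploy-mode client  my_app.py   # driver runs on the submitting machine
#   spark-submit --master yarn --deploy-mode cluster my_app.py   # driver runs on a machine inside the cluster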
The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to
transactionNumber. Find the error.
Code block:
transactionsDf.withColumn("transactionNumber", "transactionId")
Answer : E
Correct code block:
transactionsDf.withColumnRenamed('transactionId', 'transactionNumber')
Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block.
More info: pyspark.sql.DataFrame.withColumnRenamed --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, Question 26 (Databricks import instructions)
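A minimal sketch contrasting the broken and the corrected code block, using a made-up stand-in for transactionsDf (only the relevant column is modelled):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for transactionsDf; the real DataFrame's other columns are omitted here.
transactionsDf = spark.createDataFrame([(1,), (2,)], ["transactionId"])

# Broken code block: raises an error, since withColumn expects a Column object
# (not the string "transactionId") as its second argument.
# transactionsDf.withColumn("transactionNumber", "transactionId")

# Corrected code block: returns a new DataFrame in which the column is renamed.
transactionsDf.withColumnRenamed("transactionId", "transactionNumber").show()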
The code block shown below should return all rows of DataFrame itemsDf that have at least 3 items in column itemNameElements. Choose the answer that correctly fills the blanks in the code block
to accomplish this.
Example of DataFrame itemsDf:
+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
|3     |Outdoors Backpack                 |Sports Company Inc.|[Outdoors, Backpack]                      |
+------+----------------------------------+-------------------+------------------------------------------+
Code block:
itemsDf.__1__(__2__(__3__)__4__)
Answer : D
Correct code block:
itemsDf.filter(size('itemNameElements')>3)
Output of code block:
+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
+------+----------------------------------+-------------------+------------------------------------------+
The big difficulty with this question is knowing the difference between count and size (refer to the documentation below). size is the correct function to choose here since it returns the number of elements in an array on a per-row basis.
The other consideration for solving this question is the difference between select and filter. Since we want to return the rows of the original DataFrame, filter is the right choice. If we used select, we would simply get a single-column DataFrame showing which rows match the criteria, like so:
+----------------------------+
|(size(itemNameElements) > 3)|
+----------------------------+
|true                        |
|true                        |
|false                       |
+----------------------------+
More info:
Count documentation: pyspark.sql.functions.count --- PySpark 3.1.1 documentation
Size documentation: pyspark.sql.functions.size --- PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1, Question 47 (Databricks import instructions)
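A self-contained sketch that rebuilds the example data (the itemName and supplier columns are omitted for brevity) and applies the correct code block:
from pyspark.sql import SparkSession
from pyspark.sql.functions import size

spark = SparkSession.builder.getOrCreate()

# Rebuild a reduced version of the example itemsDf shown above.
itemsDf = spark.createDataFrame(
    [
        (1, ["Thick", "Coat", "for", "Walking", "in", "the", "Snow"]),
        (2, ["Elegant", "Outdoors", "Summer", "Dress"]),
        (3, ["Outdoors", "Backpack"]),
    ],
    ["itemId", "itemNameElements"],
)

# size() counts the elements of the array column on a per-row basis;
# filter() keeps the full rows for which the condition holds.
itemsDf.filter(size("itemNameElements") > 3).show(truncate=False)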
Which of the following describes a valid concern about partitioning?
Answer : A
A shuffle operation returns 200 partitions if not explicitly set.
Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.
The coalesce() method should be used to increase the number of partitions.
Incorrect. The coalesce() method can only be used to decrease the number of partitions.
Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No. For narrow transformations, fewer partitions usually result in a longer overall runtime, if more executors are available than partitions.
A narrow transformation does not include a shuffle, so no data needs to be exchanged between executors. Shuffles are expensive and can be a bottleneck for executing Spark workloads.
Narrow transformations, however, are executed on a per-partition basis, blocking one executor per partition. So, it matters how many executors are available to perform work in parallel relative to the number of partitions. If the number of executors is greater than the number of partitions, some executors are idle while others process the partitions. On the flip side, if the number of executors is smaller than the number of partitions, the entire operation can only finish after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one would want the number of partitions to equal the number of executors (but not more).
So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No data is exchanged between executors when coalesce() is run.
No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.
Short partition processing times are indicative of low skew.
Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly.
Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition, but others do not. However, a short processing time is not per se indicative of low skew: It may simply be short because the partition is small.
A situation indicative of low skew may be when all executors finish processing their partitions in the same timeframe. High skew may be indicated by some executors taking much longer to finish their
partitions than others. But the answer does not make any comparison -- so by itself it does not provide enough information to make any assessment about skew.
More info: Spark Repartition & Coalesce - Explained and Performance Tuning - Spark 3.1.2 Documentation
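A small sketch illustrating the points above; the partition counts are arbitrary examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default number of shuffle partitions (200 unless overridden).
print(spark.conf.get("spark.sql.shuffle.partitions"))

df = spark.range(0, 1000)

# repartition() can increase or decrease the partition count, at the cost of a full shuffle.
print(df.repartition(8).rdd.getNumPartitions())    # 8

# coalesce() can only decrease the partition count and avoids a full shuffle.
print(df.coalesce(2).rdd.getNumPartitions())       # 2
print(df.coalesce(1000).rdd.getNumPartitions())    # unchanged: coalesce cannot increase partitions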
Which of the following describes Spark's Adaptive Query Execution?
Answer : D
Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.
This is almost correct. All of these features, except for dynamically injecting scan filters, are part of Adaptive Query Execution. Dynamically injecting scan filters for join operations to limit the amount
of data to be considered in a query is part of Dynamic Partition Pruning and not of Adaptive Query Execution.
Adaptive Query Execution reoptimizes queries at execution points.
No, Adaptive Query Execution reoptimizes queries at materialization points.
Adaptive Query Execution is enabled in Spark by default.
No, Adaptive Query Execution is disabled in Spark by default and needs to be enabled through the spark.sql.adaptive.enabled property.
Adaptive Query Execution applies to all kinds of queries.
No, Adaptive Query Execution applies only to queries that are not streaming queries and that contain at least one exchange (typically expressed through a join, aggregate, or window operator) or
one subquery.
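A brief sketch of how Adaptive Query Execution and the features mentioned above are switched on via configuration; the property names are the standard Spark SQL ones. Note that AQE is only disabled by default up to Spark 3.1, the version referenced in this document, and became the default in Spark 3.2.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Adaptive Query Execution (off by default in Spark 3.1).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# The individual AQE features have their own switches, for example:
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # dynamically coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # dynamically optimize skew joins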