Which of the following statements about stages is correct?
A. Different stages in a job may be executed in parallel.
B. Stages consist of one or more jobs.
C. Stages ephemerally store transactions, before they are committed through actions.
D. Tasks in a stage may be executed by multiple machines at the same time.
E. Stages may contain multiple actions, narrow, and wide transformations.
Which of the following code blocks creates a new one-column, two-row DataFrame dfDates with column date of type timestamp?
A. dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
   dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))
B. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
   dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-ddHH:mm:ss"))
C. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
   dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))
D. dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
   dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-ddHH:mm:ss"))
E. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
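For reference, to_timestamp takes the column first and the format pattern second, and a list of one-element tuples plus a single column name yields a one-column DataFrame. A minimal sketch, assuming a SparkSession named spark is already available:

from pyspark.sql.functions import to_timestamp

# two single-element tuples -> two rows; one column name -> one column of type string
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",), ("24/01/2022 10:58:34",)], ["date"])
# overwrite the string column with its parsed timestamp equivalent
dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))
dfDates.printSchema()  # root |-- date: timestamp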
Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?
A. DataFrame.repartition(12)
B. DataFrame.coalesce(6).shuffle()
C. DataFrame.coalesce(6)
D. DataFrame.coalesce(6, shuffle=True)
E. DataFrame.repartition(6)
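For context, coalesce can only reduce the partition count and avoids a full shuffle, while repartition always performs a full shuffle and can either increase or decrease the count. A minimal sketch, assuming a DataFrame df that currently has 12 partitions:

# repartition(6) triggers a full shuffle and yields exactly 6 partitions
df_shuffled = df.repartition(6)
print(df_shuffled.rdd.getNumPartitions())  # 6

# coalesce(6) also yields 6 partitions, but merges existing ones without a full shuffle
df_merged = df.coalesce(6)
print(df_merged.rdd.getNumPartitions())  # 6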
The code block shown below should return a DataFrame with two columns, itemId and col. For each element in the column attributes of DataFrame itemsDf there should be a separate row in which the column itemId contains the associated itemId from itemsDf. The new DataFrame should only contain rows for those rows of itemsDf whose column attributes contains the element cozy.
A sample of DataFrame itemsDf is below.
Code block:
itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))
A. 1. filter  2. array_contains("cozy")  3. select  4. "itemId"  5. explode  6. "attributes"
B. 1. where  2. "array_contains(attributes, 'cozy')"  3. select  4. itemId  5. explode  6. attributes
C. 1. filter  2. "array_contains(attributes, 'cozy')"  3. select  4. "itemId"  5. map  6. "attributes"
D. 1. filter  2. "array_contains(attributes, cozy)"  3. select  4. "itemId"  5. explode  6. "attributes"
E. 1. filter  2. "array_contains(attributes, 'cozy')"  3. select  4. "itemId"  5. explode  6. "attributes"
Which of the following code blocks returns a DataFrame that matches the multi-column DataFrame itemsDf, except that integer column itemId has been converted into a string column?
A. itemsDf.withColumn("itemId", convert("itemId", "string"))
B. itemsDf.withColumn("itemId", col("itemId").cast("string"))
C. itemsDf.select(cast("itemId", "string"))
D. itemsDf.withColumn("itemId", col("itemId").convert("string"))
E. spark.cast(itemsDf, "itemId", "string")
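As a reference point, Column.cast accepts a type name string, and withColumn replaces an existing column when given its name while leaving all other columns untouched. A minimal sketch, assuming an itemsDf with an integer column itemId:

from pyspark.sql.functions import col

itemsDfStr = itemsDf.withColumn("itemId", col("itemId").cast("string"))
itemsDfStr.printSchema()  # itemId is now string; all other columns are unchanged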
Which of the following describes Spark's way of managing memory?
A. Spark uses a subset of the reserved system memory.
B. Storage memory is used for caching partitions derived from DataFrames.
C. As a general rule for garbage collection, Spark performs better on many small objects than few big objects.
D. Disabling serialization potentially greatly reduces the memory footprint of a Spark application.
E. Spark's memory usage can be divided into three categories: Execution, transaction, and storage.
Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?
A. counter = 0

   for index, row in itemsDf.iterrows():
       if 'Inc.' in row['supplier']:
           counter = counter + 1

   print(counter)
B. counter = 0

   def count(x):
       if 'Inc.' in x['supplier']:
           counter = counter + 1

   itemsDf.foreach(count)
   print(counter)
C. print(itemsDf.foreach(lambda x: 'Inc.' in x))
D. print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())
E. accum = sc.accumulator(0)

   def check_if_inc_in_supplier(row):
       if 'Inc.' in row['supplier']:
           accum.add(1)

   itemsDf.foreach(check_if_inc_in_supplier)
   print(accum.value)
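For context, foreach runs the supplied function on the executors, so an ordinary Python counter incremented inside it is never sent back to the driver; an accumulator is the mechanism Spark provides for driver-visible counts. A minimal sketch, assuming a SparkContext sc and an itemsDf with a supplier column:

accum = sc.accumulator(0)

def check_if_inc_in_supplier(row):
    # executed on the executors; accumulator updates are shipped back to the driver
    if 'Inc.' in row['supplier']:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)  # number of rows whose supplier contains 'Inc.'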
Which of the following options describes the responsibility of the executors in Spark?
A. The executors accept jobs from the driver, analyze those jobs, and return results to the driver.
B. The executors accept tasks from the driver, execute those tasks, and return results to the cluster manager.
C. The executors accept tasks from the driver, execute those tasks, and return results to the driver.
D. The executors accept tasks from the cluster manager, execute those tasks, and return results to the driver.
E. The executors accept jobs from the driver, plan those jobs, and return results to the cluster manager.
The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching column names and inserting null values where column names do not appear in both DataFrames. Find the error.
Sample of DataFrame transactionsDfMonday:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
Sample of DataFrame transactionsDfTuesday:
+-------+-------------+---------+-----+
|storeId|transactionId|productId|value|
+-------+-------------+---------+-----+
|     25|            1|        1|    4|
|      2|            2|        2|    7|
|      3|            4|        2| null|
|   null|            5|        2| null|
+-------+-------------+---------+-----+
Code block:
sc.union([transactionsDfMonday, transactionsDfTuesday])
A. The DataFrames' RDDs need to be passed into the sc.union method instead of the DataFrame variable names.
B. Instead of union, the concat method should be used, making sure to not use its default arguments.
C. Instead of the Spark context, transactionsDfMonday should be called with the join method instead of the union method, making sure to use its default arguments.
D. Instead of the Spark context, transactionsDfMonday should be called with the union method.
E. Instead of the Spark context, transactionsDfMonday should be called with the unionByName method instead of the union method, making sure to not use its default arguments.
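For context, DataFrame.union matches columns purely by position, whereas unionByName matches them by name and, with allowMissingColumns=True (available since Spark 3.1), fills columns that exist in only one DataFrame with nulls. A minimal sketch, assuming both DataFrames from the question exist:

combined = transactionsDfMonday.unionByName(transactionsDfTuesday, allowMissingColumns=True)
# columns present in only one DataFrame (e.g. predError, f) are null for the other DataFrame's rows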
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)
A. 1. filter  2. col("supplier").isin("Sports")  3. "itemName"  4. explode(col("attributes"))
B. 1. where  2. col("supplier").contains("Sports")  3. "itemName"  4. "attributes"
C. 1. where  2. col(supplier).contains("Sports")  3. explode(attributes)  4. itemName
D. 1. where  2. "Sports".isin(col("Supplier"))  3. "itemName"  4. array_explode("attributes")
E. 1. filter  2. col("supplier").contains("Sports")  3. "itemName"  4. explode("attributes")