Which of the following hyperparameter optimization methods automatically makes informed selections of hyperparameter values based on previous trials for each iterative model evaluation?
A. Random Search
B. Halving Random Search
C. Tree of Parzen Estimators
D. Grid Search
A data scientist is working with a feature set with the following schema:
The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.
Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?
A. customer_id, loyalty_tier
B. loyalty_tier
C. units
D. spend
E. customer_id
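The imputation idea in this question can be sketched in plain Python. This is a minimal illustration, not the feature set's actual data: the schema is not shown in the question, so the sample values, the assumption that loyalty_tier is categorical and spend is numeric, and the helper names impute_mode and impute_median are all hypothetical.

```python
from collections import Counter
from statistics import median


def impute_mode(values):
    """Replace None with the most frequent observed value (typical for categorical data)."""
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]


def impute_median(values):
    """Replace None with the median of the observed values (typical for numeric data)."""
    observed = [v for v in values if v is not None]
    mid = median(observed)
    return [mid if v is None else v for v in values]


# Hypothetical sample columns; the real schema is not given in the question.
loyalty_tier = ["gold", None, "silver", "gold", None]
spend = [10.0, None, 30.0, 50.0, None]

filled_tier = impute_mode(loyalty_tier)
filled_spend = impute_median(spend)
```

A primary key such as customer_id identifies a row uniquely, so filling it with a repeated common value would generally defeat its purpose; the sketch leaves it out for that reason.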
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
Hyperparameter 1: [2, 5, 10] Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?
A. 3
B. 5
C. 6
D. 18
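The counting behind this question can be worked through with a short sketch. The grid values and fold count come from the question; the dictionary keys and variable names are placeholders chosen for illustration.

```python
from itertools import product

# Grid and fold count taken from the question; key names are placeholders.
grid = {"hyperparameter_1": [2, 5, 10], "hyperparameter_2": [50, 100]}
n_folds = 3

# Every combination of values is one candidate configuration.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
n_combinations = len(combinations)

# Each candidate is fit once per cross-validation fold.
n_fits = n_combinations * n_folds
```

Because no fit in a grid search depends on the result of another, the individual (configuration, fold) fits are independent of one another; how many actually run at the same time depends on the available compute.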
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?
A. RMSE
B. Precision
C. Area under the residual operating curve
D. Accuracy
E. Recall
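To make the difference between the listed classification metrics concrete, here is a minimal sketch that computes precision and recall from confusion-matrix counts. The counts are made up for illustration and do not come from the question.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 100 infected patients, 80 of them flagged by the model,
# plus 40 healthy patients flagged by mistake.
tp, fp, fn = 80, 40, 20

model_precision = precision(tp, fp)
model_recall = recall(tp, fn)
```

Recall is the quantity that falls whenever a true positive case is missed, while precision falls when healthy patients are flagged incorrectly.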
A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use Databricks Runtime for Machine Learning.
Which of the following approaches describes how the machine learning engineer can begin using Databricks Runtime for Machine Learning?
A. They can add a line enabling Databricks Runtime ML in their init script when creating their clusters.
B. They can check the Databricks Runtime ML box when creating their clusters.
C. They can select a Databricks Runtime ML version from the Databricks Runtime Version dropdown when creating their clusters.
D. They can set the runtime-version variable in their Spark session to "ml".
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
A. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
B. One-hot encoding is dependent on the target variable's values which differ for each application.
C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:
They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df:
Which of the following lines of code can be used to complete the code block to successfully complete the task?
A. predict(*spark_df.columns)
B. mapInPandas(predict)
C. predict(Iterator(spark_df))
D. mapInPandas(predict(spark_df.columns))
E. predict(spark_df.columns)
A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.
Which of the following approaches can the team use to identify which task is the cause of the failure?
A. Run each notebook interactively
B. Review the matrix view in the Job's runs
C. Migrate the Job to a Delta Live Tables pipeline
D. Change each Task's setting to use a dedicated cluster
A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.
In which situation will the machine learning engineer be correct?
A. When the new solution requires if-else logic determining which model to use to compute each prediction
B. When the new solution's models have an average latency that is larger than the size of the original model
C. When the new solution requires the use of fewer feature variables than the original model
D. When the new solution requires that each model computes a prediction for every record
E. When the new solution's models have an average size that is larger than the size of the original model
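The inference-time trade-off in this question can be sketched with simple arithmetic. The sketch assumes a hypothetical latency of 10 ms per model, sequential execution, and negligible routing overhead; none of these numbers appear in the question.

```python
# Hypothetical per-model latency in milliseconds (all models assumed equal).
MODEL_LATENCY_MS = 10.0


def single_model_latency(latency_ms):
    """Original solution: one model scores each record."""
    return latency_ms


def routed_latency(latency_ms):
    """If-else routing: exactly one model scores each record (routing cost assumed negligible)."""
    return latency_ms


def all_models_latency(latency_ms, n_models):
    """Every model scores every record, one after another."""
    return latency_ms * n_models


original = single_model_latency(MODEL_LATENCY_MS)
routed = routed_latency(MODEL_LATENCY_MS)
ensemble = all_models_latency(MODEL_LATENCY_MS, n_models=3)
```

Under these assumptions the per-record cost only grows when several models must each produce a prediction for the same record; running the models in parallel would change the arithmetic but not the total compute spent.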
A data scientist is attempting to tune a logistic regression model logistic using scikit-learn. They want to specify a search space for two hyperparameters and let the tuning process randomly select values for each evaluation.
They attempt to run the following code block, but it does not accomplish the desired task:
Which of the following changes can the data scientist make to accomplish the task?
A. Replace the GridSearchCV operation with RandomizedSearchCV
B. Replace the GridSearchCV operation with cross_validate
C. Replace the GridSearchCV operation with ParameterGrid
D. Replace the random_state=0 argument with random_state=1
E. Replace the penalty=['l2', 'l1'] argument with penalty=uniform('l2', 'l1')
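The difference between exhaustive and randomized search over a search space can be illustrated without scikit-learn. The sketch below uses only the standard library and is not the question's missing code block: the search space (including the C values) and the helper names grid_candidates and random_candidates are assumptions made for illustration.

```python
import random
from itertools import product

# Hypothetical search space for a logistic regression model.
search_space = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}


def grid_candidates(space):
    """Enumerate every combination, as an exhaustive grid search would."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def random_candidates(space, n_iter, seed=0):
    """Draw n_iter configurations at random, one value per hyperparameter."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]


grid = grid_candidates(search_space)
sampled = random_candidates(search_space, n_iter=3)
```

A grid search evaluates all candidates in grid, whereas a randomized search evaluates only the sampled subset; the seed makes the draw reproducible but does not change which strategy is being used.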