Date: 19 October 2021 – 22 October 2021
Duration: 4 Days
Locations: Online

Take your knowledge to the next level with Cloudera’s Data Scientist Training

The workshop is designed for data scientists who use Python or R to work with small datasets on a single machine and who need to scale up their data science and machine learning workflows to large datasets on distributed clusters. Data engineers, data analysts, developers, and solution architects who collaborate with data scientists will also find this workshop valuable. Workshop participants walk through an end-to-end data science and machine learning workflow based on realistic scenarios and datasets from a fictitious technology company. The material is presented through a sequence of brief lectures, interactive demonstrations, extensive hands-on exercises, and lively discussions. The demonstrations and exercises are conducted in Python (with PySpark) using Cloudera Data Science Workbench (CDSW). Supplemental examples using R (with sparklyr) are provided.

Technologies

Through narrated lecture, recorded demonstrations, and hands-on exercises,you will learn how to:

How to use Apache Spark to run data science and machine learning workflows at scale
How to use Spark SQL and DataFrames to work with structured data
How to use MLlib, Spark’s machine learning library
How to use PySpark, Spark’s Python API
How to use sparklyr, a dplyr-compatible R interface to Spark
How to use Cloudera Data Science Workbench (CDSW)
How to use other Cloudera platform components including HDFS, Hive,
Impala, and Hue

Audience & Prerequisites

Workshop participants should have a basic understanding of Python or R and some experience exploring and analyzing data and developing statistical or machine learning models. Knowledge of Hadoop or Spark is not required.

Outline

Introduction to Apache NiFi

What Data Scientists Do
What Process Data Scientists Use
What Tools Data Scientists Use

Cloudera Data Science Workbench (CDSW)

Introduction to Cloudera Data

Science Workbench

How Cloudera Data Science

Workbench Works

How to Use Cloudera Data Science

Workbench

Entering Code
Getting Help
Accessing the Linux Command Line
Working with Python Packages
Formatting Session Output

Case Study

DuoCar
How DuoCar Works
DuoCar Datasets
DuoCar Business Goals
DuoCar Data Science Platform
DuoCar Cloudera EDH Cluster
HDFS
Apache Spark
Apache Hive
Apache Impala
Hue
YARN
DuoCar Cluster Architecture

Apache Spark

Apache Spark
HowSpark Works
The Spark Stack
Spark SQL
DataFrames
File Formats in Apache Spark
Text File Formats
Parquet File Format

Summarizing and Grouping DataFrames

Summarizing Data with Aggregate
Functions
Grouping Data
Pivoting Data

Window Functions

Introduction to Window Functions
Creating a Window Specification
Aggregating over a Window Specification

Exploring DataFrames

Possible Workflows for Big Data
Exploring a Single Variable
Exploring a Categorial Variable
Exploring a Continuous Variable
Exploring a Pair of Variables
Categorical-Categorical Pair
Categorical-Continuous Pair
Continuous-Continuous Pair

Apache Spark Job Execution

DataFrame Operations
Input Splits
Narrow Operations
Wide Operations
Stages and Tasks
Shuffle

Processing Text and Training and Evaluating Topic Models

Introduction to Topic Models
Scenario
Extracting and Transforming Features
Parsing Text Data
Removing Common (Stop) Words
Counting the Frequency of Words
Specifying a Topic Model
Training a topic model using Latent Dirichlet Allocation (LDA)
Assessing the Topic Model Fit
Examining a Topic Model
Applying a Topic Model

Training and Evaluating Recommender Models

Introduction to Recommender Models
Scenario
Preparing Data for a Recommender Model
Specifying a Recommender model
Spark Interface Languages
PySpark
Data Science with PySpark
sparklyr
dplyr and sparklyr
Comparison of PySpark and sparklyr
How sparklyr Works with dplyr
sparklyr DataFrame and MLib Functions
When to Use PySpark and sparklyr

Running a Spark Application from (CDSW)

Overview
Starting a Spark Application
Reading Data into a Spark SQL Data Frame
Examining the Schema of a Data Frame
Computing the Number of Rows

Columns of a DataFrame

Examining Rows of a DataFrame
Stopping a Spark Application

Inspecting a Spark SQL DataFrame

Overview
Inspecting a DataFrame
Inspecting a DataFrame Column
Inspecting a Primary Key Variable
Inspecting a Categorical Variable
Inspecting a Numerical Variable
Inspecting a Data and Time Variable

Transforming DataFrames

Spark SQLDataFrames
Working with Columns
Selecting Colums
Dropping Colums
Specifying Colums
Adding Columns
Changing the Column Name
Changing the Column Type

Monitoring, Tuning and Configuring Spark Applications

Monitoring Spark Applications
Persisting DataFrames
Partitioning DataFrames
Configuring the Spark Environment

Machine Learning Overview

Machine Learning
Underfitting and Overfitting
Model Validations
Hyperparameters
Supervised and Unsupervised Learning
Machine Learning Algorithms
Machine Learning Libraries
Apache Spark MLib

Training and Evaluating Regression Models

Introduction to Regression Models
Scenario
Preparing the Regression Data
Assembling the Feature Vector
Creating a Training and Test Set
Specifying a Linear Regression Model
Training a Linear Regression Model
Examining the Model Parameters
Examining Various Model Performance Measures
Examining Various Model Diagnostics
Applying the Linear Regression Model to the Test Data
Evaluating the Linear Regression Model on the Test Data
Plotting the Linear Regression Model
Training a Recommender Model using Alternating Least Squares
Examining a Recommender Model
Evaluating a Recommender Model
Generating Recommendations

Working with Machine Learning Pipelines

Specifying Pipeline Stages
Specifying a Pipeline
Training a Pipeline Model
Querying a Pipeline Model
Applying a Pipeline Model

Deploying Machine Learning Pipelines

Saving and Loading Pipelines and Pipeline Models in Python
Loading Pipelines and Pipeline Models in Scala
Working with Rows
Ordering Rows
Selecting a Fixed Number of Rows
Selecting Distinct Rows
Filtering Rows
Sampling Rows
Working with Missing Values

Transforming DataFrame Columns

Spark SQL Data Types
Working with Numerical Columns
Working with String Columns
Working with Date and Timestamp Columns
Working with Boolean Columns

Complex Types

Complex Collection Data Types
Arrays
Maps
Structs

User-Defined Functions

User-Defined Functions
Defining a Python Function
Registering a Python Function
Applying a User-Defined Function

Reading and Writing Data

Reading and Writing Data
Working with Delimited Text Files
Working with Text Files
Working with Parquet Files
Working with Hive Tables
Working on Object Stores
Working with Pandas DataFrames

Combining and Splitting DataFrames

Joining DataFrames
Cross Join
Inner Join
Left Semi Join
Left Anti Join
Left Outer Join
Right Outer Join
Full Outer Join
Applying Set Operations
DataFrames
Splitting a DataFrame

Training and Evaluating Classification Models

Introduction to Classification Models
Scenario
Preprocessing the Modeling Data
Generate a Label
Extract, Transform and Select Features
Create Train and Test Sets
Specify a Logistic Regression Model
Train the Logistic Regression Model
Evaluate Model Performance on the TestSet

Tuning Algorithm Hyperparameters Using Grid Search

Requirements for Hyperparameter Tuning
Specifying the Estimator
Specifying the Hyperparameter Grid
Specifying the Evaluator
Tuning Hyperparameters using Holdouts Cross-validation
Tuning Hyperparameters using K-folds Cross-validation

Training and Evaluating Clustering Models

Introduction to Clustering
Scenario
Preprocessing the Data
Extracting, Transforming and Selecting Features
Specifying a Gaussian Mixture Model
Training a Gaussian Mixture Model
Examining the Gaussian Mixture Model
Plotting the Clusters
Exploring the Cluster Profiles
Saving and Loading the Gaussian
Mixture Models

Overview of sparklyr

Connecting to Spark
Reading Data
Inspecting Data
Transforming Data Using dplyr Verbs
Using SQL Queries
Spark DataFrames Functions
Visualizing Data from Spark
Machine Learning with MLib

Introduction to Additional CDSQ Features

Collaboration
Jobs
Experiments
Models
Applications

How can we help with Training?

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Take your knowledge to the next level with Cloudera’s Data Scientist Training

Technologies

Audience & Prerequisites

Outline

Introduction to Apache NiFi

Cloudera Data Science Workbench (CDSW)

Science Workbench

Workbench Works

Workbench

Case Study

Apache Spark

Summarizing and Grouping DataFrames

Window Functions

Exploring DataFrames

Apache Spark Job Execution

Processing Text and Training and Evaluating Topic Models

Training and Evaluating Recommender Models

Running a Spark Application from (CDSW)

Columns of a DataFrame

Inspecting a Spark SQL DataFrame

Transforming DataFrames

Monitoring, Tuning and Configuring Spark Applications

Machine Learning Overview

Training and Evaluating Regression Models

Working with Machine Learning Pipelines

Deploying Machine Learning Pipelines

Transforming DataFrame Columns

Complex Types

User-Defined Functions

Reading and Writing Data

Combining and Splitting DataFrames

Training and Evaluating Classification Models

Tuning Algorithm Hyperparameters Using Grid Search

Training and Evaluating Clustering Models

Overview of sparklyr

Introduction to Additional CDSQ Features

How can we help with Training?

COMPANY

OUR SERVICES

GET IN TOUCH WITH BENTEGO