Goals
- Develop applications with Spark
- Use libraries for SQL, streaming and machine learning
- Translate real-world problems into parallel algorithms
- Develop business applications that integrate with Spark
Program
Defining Big Data and Calculations
What is Spark for
What are the benefits of Spark
Identify the performance limits of modern CPUs
Develop traditional parallel processing models
Use functional programming to execute programs in parallel (see the sketch below)
Translate real-world problems into parallel algorithms
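A minimal sketch of the functional idea in plain Scala (assuming Scala 2.12, where parallel collections are built in): because the function passed to map is pure, the same pipeline runs sequentially or in parallel without changes; Spark generalises this model to a whole cluster.

    object FunctionalParallelism {
      def main(args: Array[String]): Unit = {
        val numbers = (1 to 1000000).toVector
        // A pure function: no shared mutable state, so the runtime may
        // evaluate it on any thread, in any order.
        val square = (n: Int) => n.toLong * n
        val sequentialSum = numbers.map(square).sum      // one core
        val parallelSum   = numbers.par.map(square).sum  // all cores
        println(s"sequential=$sequentialSum parallel=$parallelSum")  // identical results
      }
    }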
Distribute data across the cluster with RDDs (Resilient Distributed Datasets) and DataFrames (sketched below)
Distribute task execution across multiple nodes
Launch applications with the Spark execution model
Build resilient, fault-tolerant clusters
Set up a scalable distributed storage system
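A minimal sketch of the two data abstractions (assuming Spark 2.x with spark-sql on the classpath; the local[*] master and partition count are illustrative):

    import org.apache.spark.sql.SparkSession

    object DistributedData {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DistributedData")
          .master("local[*]")  // replace with the cluster's master URL in production
          .getOrCreate()

        // RDD: the low-level API. The second argument sets the number of
        // partitions, the units of work Spark spreads across executor nodes.
        val rdd = spark.sparkContext.parallelize(1 to 1000000, 8)
        println(s"partitions=${rdd.getNumPartitions} sum=${rdd.sum()}")

        // DataFrame: the higher-level, schema-aware API on the same engine.
        import spark.implicits._
        val df = rdd.map(n => (n, n % 10)).toDF("value", "bucket")
        df.groupBy("bucket").count().show()

        spark.stop()
      }
    }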
Monitoring and Administering Spark Applications
View execution plans and results
Perform exploratory analysis with the Spark shell
Create stand-alone Spark applications (example below)
Programming with Scala and other compatible languages
Create applications with the core APIs
Enrich applications with the built-in libraries
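As a concrete example of a stand-alone application, a minimal word count (Spark 2.x, Scala; the input path is passed on the command line):

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // No master URL here: it is supplied by spark-submit at launch time.
        val spark = SparkSession.builder().appName("WordCount").getOrCreate()
        spark.sparkContext.textFile(args(0))
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .take(20)
          .foreach { case (word, count) => println(s"$word: $count") }
        spark.stop()
      }
    }

Packaged with sbt, it would be launched with something like spark-submit --class WordCount --master yarn wordcount.jar input.txt (the jar and file names are illustrative).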
Process queries with DataFrames and embedded SQL code (sketch below)
Extend SQL with user-defined functions (UDFs)
Use data sets in JSON and Parquet formats
Connect to databases with JDBC
Run Hive queries on external applications
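A minimal sketch tying several of these items together (Spark 2.x, Scala): reading a Parquet file, registering a UDF, and querying with embedded SQL. The file name and column are illustrative assumptions, not from the course material.

    import org.apache.spark.sql.SparkSession

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()

        // Self-describing formats; spark.read.json(...) works the same way,
        // and JDBC sources use spark.read.format("jdbc") with url/dbtable options.
        val people = spark.read.parquet("people.parquet")
        people.createOrReplaceTempView("people")

        // UDF: an ordinary Scala function registered for use inside SQL.
        spark.udf.register("initial", (name: String) => name.take(1).toUpperCase)

        spark.sql(
          "SELECT initial(name) AS letter, COUNT(*) AS n FROM people GROUP BY initial(name)"
        ).show()

        spark.stop()
      }
    }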
Use sliding windows (see the example below)
Track the state of a continuous data stream
Process simultaneous data streams
Improve performance and reliability
Process streams from integrated sources (log files, Twitter, Kinesis, Kafka, sockets)
Develop custom receivers
Process data with the Streaming API and Spark SQL
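A minimal sketch of windowed and stateful processing with the DStream API (Spark 2.x, Scala), using a socket source; Kafka and Kinesis attach through their own connector artifacts. The host, port and checkpoint path are illustrative.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches
        ssc.checkpoint("checkpoint")  // needed by stateful operations

        val words = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split("\\s+"))
          .map((_, 1))

        // Sliding window: counts over the last 30 seconds, refreshed every 10.
        words.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)).print()

        // State across the whole stream: a running total per word.
        words.updateStateByKey[Int]((batch: Seq[Int], state: Option[Int]) =>
          Some(batch.sum + state.getOrElse(0))).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }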
Predict outcomes with supervised learning
Build a decision-tree classifier
Group data with unsupervised learning
Cluster data with the k-means method
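A minimal sketch of unsupervised learning with spark.ml (Spark 2.x, Scala): k-means clustering of a few 2-D points into two groups. The toy data is illustrative; a decision-tree classifier follows the same fit/transform pattern via org.apache.spark.ml.classification.DecisionTreeClassifier.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    object KMeansSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("KMeansSketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // Two obvious groups of points, expressed as a "features" column.
        val points = Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
          Vectors.dense(9.0, 9.1), Vectors.dense(8.9, 9.3)
        ).map(Tuple1.apply).toDF("features")

        val model = new KMeans().setK(2).setSeed(1L).fit(points)
        model.clusterCenters.foreach(println)  // the two learned centroids
        model.transform(points).show()         // each point with its cluster id

        spark.stop()
      }
    }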
Expose Spark through a RESTful web service
Generate dashboards with Spark
Cloud service vs. on-premises
Choose a service provider (AWS, Azure, Databricks, etc.)
Deploy Spark on large clusters (configuration sketch below)
Improve the security of multi-vendor clusters
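A minimal configuration sketch for a larger deployment (Spark 2.x, Scala); the values are illustrative assumptions to be tuned per cluster, and each key can equally be passed to spark-submit via --conf.

    import org.apache.spark.sql.SparkSession

    object ClusterConfig {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ClusterConfig")                           // master URL comes from spark-submit
          .config("spark.executor.memory", "8g")              // heap per executor JVM
          .config("spark.executor.cores", "4")                // concurrent tasks per executor
          .config("spark.dynamicAllocation.enabled", "true")  // scale executors with load (needs the external shuffle service)
          .config("spark.authenticate", "true")               // shared-secret authentication between Spark processes
          .getOrCreate()

        println(spark.sparkContext.getConf.toDebugString)     // dump the effective configuration
        spark.stop()
      }
    }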
Tracking the ongoing evolution of Spark products in the market
Project Tungsten: pushing performance toward the limits of modern hardware
Use projects developed with Spark
Review the architecture of Spark for mobile platforms
Duration
4 days
Price
£2,367
Audience
Developers, system architects and technical managers who want to deploy Spark solutions in their company
Prerequisites
Proficiency in object-oriented programming in Java or C#
Reference
BUS100299-F
Sessions
Contact us for more information about session dates