Programming Hadoop in Java

Goals

- Develop efficient parallel algorithms

- Analyze unstructured files and develop MapReduce Java tasks

- Load and retrieve data from HBase and Hadoop Distributed File System (HDFS)

- Develop user-defined functions for Hive and Pig

Program

Introduction

Evaluate the value that Hadoop can bring to the business
Examine the Hadoop ecosystem
Choose a suitable distribution model

Address the complexity of parallel programming

Examine the difficulties associated with running parallel programs: algorithms, data exchange
Evaluate the storage mode and complexity of Big Data

Parallel programming with MapReduce

Decompose and solve large-scale problems
Discover tasks compatible with MapReduce
Solve common business problems

Apply the Hadoop MapReduce paradigm

Configure the development environment
Examine the Hadoop distribution
Study the Hadoop daemons
Create the different components of MapReduce tasks
Analyze the different stages of MapReduce processing: split, map, shuffle and reduce
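
As an illustration of the four processing stages named above, here is a minimal sketch in plain Java, with no Hadoop dependency. The class and method names (MapReduceStagesSketch, wordCount and so on) are hypothetical and chosen for this example only; a real job would use Hadoop's Mapper and Reducer classes and run across a cluster rather than in a single process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of split, map, shuffle and reduce.
public class MapReduceStagesSketch {

    // Split: cut the input into independent records (here, one per non-blank line).
    static List<String> split(String input) {
        List<String> records = new ArrayList<>();
        for (String line : input.split("\n")) {
            if (!line.isBlank()) {
                records.add(line);
            }
        }
        return records;
    }

    // Map: emit one (word, 1) pair for every word in a record.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.trim().toLowerCase().split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle: group all emitted values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: collapse the values of one key into a single sum.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Chain the four stages into a word count.
    static Map<String, Integer> wordCount(String input) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : split(input)) {
            pairs.addAll(map(record));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=1}
        System.out.println(wordCount("big data\nbig cluster"));
    }
}
```

In a real cluster the map calls run in parallel on separate input splits, and the shuffle moves data over the network between nodes; the sketch only shows the order of the stages and what each one consumes and produces.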

Create complex MapReduce jobs

Chain multiple mappers and reducers
Leverage partitioners and built-in map and reduce functions
Analyze time-series data with a secondary sort
Stream tasks written in different programming languages

Solve data manipulation issues

Run algorithms: parallel sorts, joins and searches
Analyze log files, social media data and emails

Implement partitioners and comparators

Identify network-bound, CPU-bound and disk I/O-bound parallel algorithms
Distribute workload with partitioners
Control grouping and sort order with comparators
Measure performance with counters
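
To make the roles of partitioners and comparators concrete, here is a minimal plain-Java sketch with no Hadoop dependency. The Reading key, the sensor and timestamp fields, and all method names are hypothetical and exist only for this illustration. The partition arithmetic (mask the sign bit, then take the remainder) follows the approach of Hadoop's default HashPartitioner; the two comparators stand in for the sort and grouping comparators that a secondary sort would configure on a job.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of partitioning and comparator-driven ordering.
public class PartitionSortSketch {

    // Composite key: a natural key (sensor) plus a secondary field (timestamp).
    static final class Reading {
        final String sensor;
        final long timestamp;

        Reading(String sensor, long timestamp) {
            this.sensor = sensor;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return sensor + "@" + timestamp;
        }
    }

    // Partition on the natural key only, so every reading of a sensor
    // is routed to the same reducer. Masking keeps the result non-negative.
    static int partition(Reading key, int numReducers) {
        return (key.sensor.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort comparator: order by sensor first, then by timestamp.
    static final Comparator<Reading> SORT_ORDER =
            Comparator.comparing((Reading r) -> r.sensor)
                      .thenComparingLong(r -> r.timestamp);

    // Grouping comparator: readings of the same sensor count as one group,
    // whatever their timestamps.
    static final Comparator<Reading> GROUPING =
            Comparator.comparing((Reading r) -> r.sensor);

    public static void main(String[] args) {
        List<Reading> readings = new ArrayList<>(List.of(
                new Reading("s2", 30L),
                new Reading("s1", 20L),
                new Reading("s1", 10L)));
        readings.sort(SORT_ORDER);
        for (Reading r : readings) {
            System.out.println(r + " -> reducer " + partition(r, 2));
        }
    }
}
```

Partitioning on the sensor alone while sorting on sensor and timestamp is what lets a reducer receive each sensor's readings already in time order, which is the idea behind a secondary sort.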

Rationale for distributed data

Optimize data throughput performance
Use redundancy to recover data

Duration

4 days

Price

£2821

Audience

Anyone who will use, administer, or deploy SharePoint in an organization

Anyone who wants to develop or administer SharePoint applications

Prerequisites

Experience at the level of course 471, Java Programming: Fundamentals, or more than six months of Java programming experience

Reference

BUS100298-F

Interfacing with the Hadoop Distributed File System

Analyze the structure and organization of HDFS
Load raw data and retrieve the result
Read and write data with a program
Manipulate Hadoop’s SequenceFile types
Share reference data with DistributedCache

Structuring data with HBase

Switch from structured storage to unstructured storage
Apply NoSQL principles with schema on read
Connect to HBase from MapReduce tasks
Compare HBase with other types of NoSQL datastores

Harness the power of SQL with Hive

Structure databases, tables, views and partitions
Integrate MapReduce jobs with Hive queries
Launch queries with HiveQL
Access Hive servers via JDBC
Add functionality to HiveQL with user-defined functions

Execute workflows with Pig

Develop Pig Latin scripts to consolidate workflows
Integrate Pig queries with Java
Interact with data through the Grunt console
Extend Pig with user-defined functions

Test and debug Hadoop code

Log significant events for auditing and debugging
Validate specifications with MRUnit
Debug in local mode

Deploy, monitor and fine-tune performance

Deploy the solution on a production cluster
Use administration tools to optimize performance
Monitor task execution via web user interfaces
