Goals
- Handle complex data sets stored in Hadoop without writing complex Java code
- Automate the transfer of data into Hadoop storage with Flume and Sqoop
- Filter data with Extract-Transform-Load (ETL) operations with Pig
- Query multiple datasets for analysis with Pig and Hive
Program
Hadoop Overview
Analyze Hadoop Components
Define Hadoop Architecture
Achieve reliable and secure storage
Monitor storage metrics
Control HDFS from the command line
Detail the MapReduce approach
Move the algorithms to the data rather than the data to the algorithms
Break down the key steps of a MapReduce task
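The key steps of a MapReduce task can be sketched in plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a minimal word-count sketch of the model, not Hadoop's actual Java API; the function names and sample data are invented for illustration.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the list of values for each key
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # -> 3
print(counts["fox"])  # -> 2
```

In Hadoop the shuffle is performed by the framework between the map and reduce tasks; only the map and reduce logic is written by the developer, which is the point of the model.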
Facilitate data ingestion and export
Aggregate data with Flume
Configure data fan in and fan out
Move relational data with Sqoop
Explain the differences between Pig and MapReduce
Identify Pig use cases
Identify key Pig configurations
Represent the data in the Pig data model
Execute the Pig Latin commands in the Grunt Shell
Express the transformations in the Pig Latin syntax
Call the load and store functions
Create new relationships with joins
Reduce data size by sampling
Extend Pig with user-defined functions
Consolidate datasets with unions
Partition datasets with splits
Add parameters in Pig scripts
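Pig Latin expresses the transformations above as relational operators (FILTER, JOIN, SPLIT, UNION). A rough Python analogue of a filter-join-split pipeline, using invented sample relations, illustrates the data-flow style of the Pig data model; it is not Pig's runtime, just the shape of the operations.

```python
# Hypothetical relations, mirroring Pig tuples: (name, age) and (name, city)
users  = [("alice", 34), ("bob", 17), ("carol", 52)]
cities = [("alice", "London"), ("carol", "Leeds")]

# FILTER users BY age >= 18
adults = [(name, age) for name, age in users if age >= 18]

# JOIN adults BY name, cities BY name
joined = [(n1, age, city)
          for n1, age in adults
          for n2, city in cities
          if n1 == n2]

# SPLIT joined INTO young IF age < 40, older IF age >= 40
young = [t for t in joined if t[1] < 40]
older = [t for t in joined if t[1] >= 40]

print(joined)  # [('alice', 34, 'London'), ('carol', 52, 'Leeds')]
```

In Pig each intermediate result (`adults`, `joined`) is a named relation, and the script is compiled into MapReduce jobs, so the same pipeline scales to data that does not fit in memory.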
Duration
4 days
Price
£2,367
Audience
Database technicians and specialists, managers, business analysts and BI professionals who want to use Big Data technologies in their business
Prerequisites
Fundamental knowledge of databases and SQL is a major asset
Reference
BUS100295-F
Break Hive down into its components
Impose structure on data with Hive
Create Hive databases and tables
Expose the differences between data types in Hive
Load and store data efficiently with SerDes
Populate tables from queries
Partition Hive tables for optimal queries
Compose HiveQL queries
Distinguish the joins available in Hive
Optimize join structure for performance
Sort, distribute and group data
Reduce query complexity with views
Improve query performance with indexes
Design Hive schemas
Establish data compression
Debug Hive scripts
Unify the data view with HCatalog
Use HCatalog to access the Hive metastore
Communicate via the HCatalog interfaces
Populate a Hive table from Pig
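Since HiveQL is SQL-like, the query shapes covered above (joins, grouping, views) can be previewed with any SQL engine. This sketch uses Python's built-in sqlite3 with made-up tables to show the kind of HiveQL a participant writes; Hive-specific features such as partitions, SerDes and the metastore are deliberately omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema, standing in for Hive tables
cur.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
cur.execute("CREATE TABLE customers (name TEXT, region TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [("alice", 10.0), ("alice", 5.0), ("bob", 20.0)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [("alice", "north"), ("bob", "south")])

# A join plus grouping, written as it would be in HiveQL
rows = cur.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer = c.name
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(rows)  # [('north', 15.0), ('south', 20.0)]

# A view that hides the join, reducing query complexity for readers
cur.execute("CREATE VIEW region_totals AS "
            "SELECT c.region AS region, SUM(o.amount) AS total "
            "FROM orders o JOIN customers c ON o.customer = c.name "
            "GROUP BY c.region")
north_total = cur.execute(
    "SELECT total FROM region_totals WHERE region = 'north'").fetchone()[0]
print(north_total)  # 15.0
```

In Hive the same statements run as distributed jobs over HDFS data, and partitioning the tables (not available in sqlite3) is what keeps such joins and aggregations fast at scale.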
Break down the fundamental components of Impala
Submit queries to Impala
Access Hive data from Impala
Reduce data access time with Spark-SQL
Query Hive data with Spark-SQL
Sessions
Contact us for more information about session dates