Hadoop Architecture

Introduction to Data Storage and Processing

Installing the Hadoop Distributed File System (HDFS)
• Defining key design assumptions and architecture
• Configuring and setting up the file system
• Issuing commands from the console
• Reading and writing files
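Console interaction is done through the `hdfs` client; a few representative commands (paths are placeholders, and a running cluster is assumed):

```shell
hdfs dfs -mkdir -p /user/alice           # create a directory in HDFS
hdfs dfs -put localfile.txt /user/alice  # copy a local file into HDFS
hdfs dfs -ls /user/alice                 # list directory contents
hdfs dfs -cat /user/alice/localfile.txt  # print a file's contents
hdfs dfsadmin -report                    # summarize cluster capacity and health
```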
Setting the stage for MapReduce
• Reviewing the MapReduce approach
• Introducing the computing daemons
• Dissecting a MapReduce job
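The map/shuffle/reduce flow dissected here can be sketched in plain Python with no Hadoop dependency; the function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) pairs, as a word-count mapper would."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, mimicking the framework's shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts for one word, as the reducer would."""
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

In a real job the framework distributes map tasks across the cluster and performs the shuffle over the network; this sketch shows only the data flow.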
Defining Hadoop Cluster Requirements
Planning the architecture
• Selecting appropriate hardware
• Designing a scalable cluster
Building the cluster
• Installing Hadoop daemons
• Optimizing the network architecture
Configuring a Cluster
Preparing HDFS
• Setting basic configuration parameters
• Configuring block allocation, redundancy and replication
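As a sketch of the parameters involved, block size and replication factor are set in `hdfs-site.xml`; the values below are the stock defaults (128 MB blocks, three replicas), not tuning advice for any particular cluster:

```xml
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB per block -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- three copies of each block -->
  </property>
</configuration>
```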
Deploying MapReduce
• Installing and setting up the MapReduce environment
• Delivering redundant load balancing via Rack Awareness
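The default rack-aware placement policy (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack) can be sketched in Python; the node and rack names are made up for illustration:

```python
import random

def place_replicas(writer_node, nodes_by_rack):
    """Sketch of HDFS's default placement policy: replica 1 on the writer's
    node, replica 2 on a node in a different rack, replica 3 on a second
    node in that remote rack (assumes the remote rack has >= 2 nodes)."""
    rack_of = {node: rack for rack, members in nodes_by_rack.items()
               for node in members}
    remote_racks = [r for r in nodes_by_rack if r != rack_of[writer_node]]
    remote_rack = random.choice(remote_racks)
    second = random.choice(nodes_by_rack[remote_rack])
    third = random.choice([n for n in nodes_by_rack[remote_rack] if n != second])
    return [writer_node, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))  # e.g. ['n1', 'n4', 'n3']
```

Keeping two replicas on one remote rack and one on the local rack survives the loss of an entire rack while limiting cross-rack write traffic.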
Maximizing HDFS Robustness
Creating a fault-tolerant file system
• Isolating single points of failure
• Maintaining High Availability
• Triggering manual failover
• Automating failover with ZooKeeper
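A minimal configuration sketch for automatic failover, assuming an HA NameNode pair is already defined; the ZooKeeper hostnames are placeholders:

```xml
<!-- hdfs-site.xml: enable automatic failover for the nameservice -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: the ZooKeeper ensemble used by the failover controllers -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```

Manual failover between two NameNodes (here with the illustrative IDs nn1 and nn2) is triggered with `hdfs haadmin -failover nn1 nn2`.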

Leveraging NameNode Federation
• Extending HDFS resources
• Managing the namespace volumes
Introducing YARN
• Critiquing the YARN architecture
• Identifying the new daemons
Managing Resources and Cluster Health
Allocating resources
• Setting quotas to constrain HDFS utilization
• Prioritizing access to MapReduce using schedulers
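Quotas are applied per directory with `hdfs dfsadmin`; a sketch with placeholder paths and limits:

```shell
# Cap a directory at 1 TB of raw space and 100,000 names (files + directories)
hdfs dfsadmin -setSpaceQuota 1t /user/alice
hdfs dfsadmin -setQuota 100000 /user/alice

# Review current quota usage for the directory
hdfs dfs -count -q /user/alice
```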

Maintaining HDFS
• Starting and stopping Hadoop daemons
• Monitoring HDFS status
• Adding and removing data nodes
Administering MapReduce
• Managing MapReduce jobs
• Tracking progress with monitoring tools
• Commissioning and decommissioning compute nodes
Extending Hadoop
Simplifying information access
• Enabling SQL-like querying with Hive
• Installing Pig to create MapReduce jobs
Integrating additional elements of the ecosystem
• Imposing a tabular view on HDFS with HBase
• Configuring Oozie to schedule workflows
Implementing Data Ingress and Egress
Facilitating generic input/output
• Moving bulk data into and out of Hadoop
• Transmitting HDFS data over HTTP with WebHDFS
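WebHDFS exposes every HDFS path under the `/webhdfs/v1` REST prefix; a small Python sketch that builds such URLs (the host and port are placeholders — 9870 is the Hadoop 3 NameNode HTTP default):

```python
from urllib.parse import urlencode

def webhdfs_url(namenode, path, op, **params):
    """Build a WebHDFS REST URL for the given HDFS path and operation.
    `namenode` is a placeholder host:port for the NameNode's HTTP address."""
    query = urlencode(dict(params, op=op))
    return f"http://{namenode}/webhdfs/v1{path}?{query}"

# Reading a file over HTTP amounts to a GET of an OPEN URL like this one:
url = webhdfs_url("namenode.example.com:9870", "/user/alice/data.txt", "OPEN")
print(url)
```

Because it is plain HTTP, any client that can issue GET/PUT requests can read and write HDFS data without Hadoop libraries installed.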
Acquiring application-specific data
• Collecting multi-sourced log files with Flume
• Importing and exporting relational information with Sqoop
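A sketch of both transfer directions with Sqoop; the JDBC URL, table names, and HDFS paths are placeholders:

```shell
# Import a relational table into HDFS
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --target-dir /data/sales/orders

# Export HDFS data back into a relational table
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --table order_summaries \
  --export-dir /data/sales/summaries
```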