# HDFS: Hadoop Distributed File System

## 1. **What is HDFS?**

The **Hadoop Distributed File System (HDFS)** is a distributed file system designed to store and manage large volumes of data across multiple machines in a Hadoop cluster. It is a core component of the Apache Hadoop ecosystem and is optimized for high-throughput access to data, making it ideal for big data applications. HDFS is highly scalable, fault-tolerant, and cost-effective, as it can run on commodity hardware.

## 2. **Key Features**

* **Distributed Storage**: Data is split into blocks and stored across multiple nodes in a cluster.
* **[Fault Tolerance](/glossary/fault-tolerance)**: Replicates data across nodes to ensure availability even if a node fails.
* **[Scalability](/glossary/scalability)**: Can scale horizontally by adding more nodes to the cluster.
* **High Throughput**: Optimized for batch processing and large data sets rather than low-latency access.
* **Cost-Effective**: Runs on inexpensive commodity hardware.
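* **Data Locality**: Exposes block locations so that processing frameworks such as MapReduce can schedule computation on or near the nodes that hold the data, minimizing network traffic.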

## 3. **Architecture**

HDFS follows a **master-slave architecture** with two main components:

1. **NameNode (Master)**:
   * Manages the file system namespace (metadata).
   * Tracks the location of data blocks across the cluster.
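   * Handles client requests for namespace operations (e.g., open, close, rename) and tells clients which DataNodes hold each block; file data itself never flows through the NameNode.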
2. **DataNode (Slave)**:
   * Stores the actual data blocks.
   * Serves read and write requests directly from clients, and creates, deletes, and replicates blocks as instructed by the NameNode.
   * Sends periodic heartbeats and block reports to the NameNode.

Additional components:

* **Secondary NameNode**: Periodically merges the NameNode's edit log into the file system image (fsimage) to produce a checkpoint. Despite its name, it is not a standby or failover NameNode.
* **Client**: Interacts with the NameNode and DataNodes to access or modify files.
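
The division of labor described above can be illustrated with a toy in-memory model. This is only a sketch, not how Hadoop is implemented: the class names `ToyNameNode` and `ToyDataNode`, the block IDs, and the file path are all hypothetical. It shows that the NameNode holds metadata only, while the bytes come from the DataNodes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataNode:
    """Holds block contents; knows nothing about file names."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


@dataclass
class ToyNameNode:
    """Holds metadata only: file -> block IDs, and block ID -> DataNode names."""
    file_blocks: dict = field(default_factory=dict)
    block_locations: dict = field(default_factory=dict)

    def get_block_locations(self, path):
        # Return (block_id, [datanode names]) pairs in file order.
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


def read_file(namenode, datanodes, path):
    """Client read: ask the NameNode where blocks live, then fetch from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        # Use the first listed replica; a real client prefers the closest one.
        data += datanodes[holders[0]].blocks[block_id]
    return data


dn1 = ToyDataNode("dn1", {"blk_1": b"hello ", "blk_2": b"world"})
dn2 = ToyDataNode("dn2", {"blk_1": b"hello ", "blk_2": b"world"})
nn = ToyNameNode(
    file_blocks={"/logs/app.log": ["blk_1", "blk_2"]},
    block_locations={"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn1"]},
)
datanodes = {"dn1": dn1, "dn2": dn2}
print(read_file(nn, datanodes, "/logs/app.log"))  # b'hello world'
```

Note that `read_file` never asks the NameNode for data, only for locations, which is why the NameNode does not become a bandwidth bottleneck for reads.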

## 4. **How HDFS Works**

1. **File Splitting**: Files are divided into fixed-size blocks (default size: 128 MB in Hadoop 2.x and later, configurable via `dfs.blocksize`); the last block of a file may be smaller.
2. **Replication**: Each block is replicated across multiple DataNodes (default replication factor: 3).
3. **Storage**: Blocks are distributed across the cluster for fault tolerance and load balancing.
4. **Metadata Management**: The NameNode maintains metadata about file locations and block mappings.
5. **Data Access**: Clients interact with the NameNode to locate data blocks and directly access them from DataNodes.
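
The splitting and replication steps can be worked through with a little arithmetic. The sketch below assumes the 128 MB default block size and a replication factor of 3; the function names and DataNode names are hypothetical, and the round-robin placement is a deliberate simplification of what HDFS actually does.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the dfs.blocksize default in Hadoop 2.x+
REPLICATION = 3                 # the dfs.replication default


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # the last block only occupies what it needs
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Round-robin placement (a simplification; real HDFS placement is rack-aware)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return [
        [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        for i in range(num_blocks)
    ]


MB = 1024 * 1024
sizes = split_into_blocks(300 * MB)
print([s // MB for s in sizes])  # [128, 128, 44]

for block, nodes in enumerate(place_replicas(len(sizes), ["dn1", "dn2", "dn3", "dn4"])):
    print(block, nodes)
# 0 ['dn1', 'dn2', 'dn3']
# 1 ['dn2', 'dn3', 'dn4']
# 2 ['dn3', 'dn4', 'dn1']
```

A 300 MB file therefore becomes three blocks and, with three replicas each, nine stored block copies totaling 900 MB of raw disk.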

## 5. **Key Concepts in HDFS**

* **Blocks**: The unit in which HDFS stores and replicates file data (default size: 128 MB, configurable via `dfs.blocksize`).
* **Replication**: Copies of data blocks stored on multiple nodes for fault tolerance.
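* **Rack Awareness**: Places replicas on more than one rack so that a single rack failure cannot destroy every copy of a block. With the default policy and a replication factor of 3, one replica goes on the writer's node (or rack) and the other two go on different nodes of a single remote rack.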
* **Heartbeat**: A signal sent by DataNodes to the NameNode to confirm they are operational.
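* **Checkpointing**: The process of merging the edit log into the file system image (fsimage), so that metadata can be restored quickly after a restart and the edit log does not grow without bound.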
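
Rack awareness is easier to see in code. The following is a minimal sketch of a rack-aware choice for three replicas, assuming the writer runs on a DataNode and at least one other rack has two or more nodes. The `topology` dictionary, node names, and function names are hypothetical; the real placement policy also weighs factors such as free space and load.

```python
import random


def rack_of(topology, node):
    """Return the rack that contains `node`; topology maps rack -> node list."""
    return next(rack for rack, nodes in topology.items() if node in nodes)


def choose_replica_nodes(topology, writer_node, rng=random):
    """Pick three nodes: the writer's node, then two distinct nodes on one remote rack."""
    local_rack = rack_of(topology, writer_node)
    # Only remote racks with at least two nodes can hold the 2nd and 3rd replica.
    candidates = [r for r, nodes in topology.items() if r != local_rack and len(nodes) >= 2]
    if not candidates:
        raise ValueError("need a remote rack with at least two nodes")
    remote_rack = rng.choice(candidates)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
replicas = choose_replica_nodes(topology, "dn1", rng=random.Random(42))
print(replicas)
print({rack_of(topology, n) for n in replicas})  # two racks: rack1 plus one remote rack
```

Using two racks rather than three is a trade-off: the block survives the loss of either rack, while the write pipeline crosses the inter-rack link only once.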

## 6. **Advantages of HDFS**

* **Scalability**: Can handle petabytes of data by adding more nodes.
* **Fault Tolerance**: Data replication ensures high availability even during hardware failures.
* **Cost-Effective**: Uses commodity hardware, reducing infrastructure costs.
* **High Throughput**: Optimized for large-scale batch processing.
* **Data Locality**: Minimizes data movement by processing data where it is stored.
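
To make the fault-tolerance claim concrete, the sketch below checks which blocks remain readable after a DataNode goes down. It is an illustration under assumed inputs: the `placement` mapping, block IDs, and node names are hypothetical, and the target replication factor is taken to be 3.

```python
TARGET_REPLICATION = 3  # assumed replication factor


def surviving_replicas(placement, failed):
    """Map each block to the replicas left after the nodes in `failed` go down."""
    return {blk: [n for n in nodes if n not in failed] for blk, nodes in placement.items()}


placement = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
    "blk_3": ["dn3", "dn4", "dn1"],
}
after = surviving_replicas(placement, failed={"dn3"})

# Blocks with some, but fewer than the target number of, replicas left.
under_replicated = [b for b, nodes in after.items() if 0 < len(nodes) < TARGET_REPLICATION]
# Blocks with no replica left at all.
lost = [b for b, nodes in after.items() if not nodes]

print(under_replicated)  # ['blk_1', 'blk_2', 'blk_3']
print(lost)              # []
```

Every block is still readable from two other nodes; in a real cluster the NameNode would then schedule new copies of the under-replicated blocks on healthy DataNodes.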

## 7. **Limitations of HDFS**

* **Not Suitable for Small Files**: Designed for large files. The NameNode keeps the metadata for every file, directory, and block in memory, so storing many small files can exhaust its heap.
* **High Latency**: Not optimized for real-time or low-latency access.
* **Single Point of Failure**: In a cluster without High Availability, the NameNode is a single point of failure, and its loss makes the whole file system unavailable. HDFS High Availability (Hadoop 2.x and later) mitigates this with a standby NameNode.
* **Complexity**: Requires expertise to set up, configure, and manage.
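
The small-files limitation can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block); that figure is an approximation that varies by Hadoop version and is not a measured value, and the function name is hypothetical.

```python
import math

BYTES_PER_OBJECT = 150          # rough rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size


def namenode_heap_estimate(num_files, avg_file_size):
    """Estimate NameNode heap in bytes: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(avg_file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


MB, GB, TB = 1024 ** 2, 1024 ** 3, 1024 ** 4
total = 10 * TB

# The same 10 TB stored as 1 MB files versus 1 GB files.
small = namenode_heap_estimate(total // MB, MB)
large = namenode_heap_estimate(total // GB, GB)

print(f"1 MB files: {small / MB:.0f} MB of heap")  # 1 MB files: 3000 MB of heap
print(f"1 GB files: {large / MB:.1f} MB of heap")  # 1 GB files: 13.2 MB of heap
```

Under these assumptions the same data costs a couple of hundred times more NameNode memory when stored as 1 MB files, which is why small files are usually packed into larger container files before being loaded.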

## 8. **Use Cases of HDFS**

* **Big Data Processing**: Used as the storage layer for Hadoop-based big data applications.
* **Data Warehousing**: Stores large volumes of structured and unstructured data for analytics.
* **Log Processing**: Collects and processes log data from servers and applications.
* **Machine Learning**: Stores training data for machine learning models.
* **Backup and Archiving**: Provides a cost-effective solution for storing large backups.

## 9. **HDFS Commands**

HDFS provides a command-line interface (CLI) for file operations. Some common commands include:

* **List Files**: `hdfs dfs -ls <path>`
* **Create Directory**: `hdfs dfs -mkdir <path>`
* **Copy from Local to HDFS**: `hdfs dfs -put <local_path> <hdfs_path>`
* **Copy from HDFS to Local**: `hdfs dfs -get <hdfs_path> <local_path>`
* **Delete File**: `hdfs dfs -rm <path>` (add `-r` to delete a directory recursively)
* **View File Content**: `hdfs dfs -cat <path>`
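
The commands above can also be driven from a script. The following is a small sketch that builds the `hdfs dfs` command lines in Python and prints them; the helper names `hdfs_dfs` and `run`, the file `sales.csv`, and the `/user/demo/input` directory are hypothetical placeholders. It only prints by default, so it can be tried on a machine without Hadoop installed.

```python
import shlex
import subprocess


def hdfs_dfs(*args):
    """Return the argv list for an `hdfs dfs` subcommand."""
    return ["hdfs", "dfs", *args]


def run(cmd):
    """Execute a command and return its stdout; raises if it exits non-zero."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# A typical round trip; the paths are hypothetical placeholders.
workflow = [
    hdfs_dfs("-mkdir", "-p", "/user/demo/input"),
    hdfs_dfs("-put", "sales.csv", "/user/demo/input/"),
    hdfs_dfs("-ls", "/user/demo/input"),
    hdfs_dfs("-cat", "/user/demo/input/sales.csv"),
    hdfs_dfs("-get", "/user/demo/input/sales.csv", "sales_copy.csv"),
    hdfs_dfs("-rm", "/user/demo/input/sales.csv"),
]

for cmd in workflow:
    # Swap print(...) for run(cmd) on a machine with a configured Hadoop client.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.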

## 10. **Key Takeaways**

* **HDFS**: A distributed file system for storing and managing large datasets in a Hadoop cluster.
* **Architecture**: Master-slave model with NameNode (metadata) and DataNodes (data storage).
* **Features**: Distributed storage, fault tolerance, scalability, high throughput, and cost-effectiveness.
* **How It Works**: Files are split into blocks, replicated, and distributed across nodes.
* **Advantages**: Scalability, fault tolerance, cost-effectiveness, and high throughput.
* **Limitations**: Poor fit for many small files, high access latency, and a NameNode that is a single point of failure unless High Availability is configured.
* **Use Cases**: Big data processing, data warehousing, log processing, machine learning, and backups.
* **Commands**: CLI commands for file operations like listing, copying, and deleting files.