Relative merits and demerits of the Google File System and the Bigtable storage system


HRJEET SINGH · Feb 19

Google File System (GFS) and Bigtable are both distributed storage systems designed by Google to handle large-scale data processing and storage requirements. Each system has its own set of merits and demerits:

Google File System (GFS):

Merits:

  1. Scalability: GFS is designed to scale horizontally, allowing it to handle petabytes of data across a large number of commodity servers.
  2. Reliability: It provides high reliability through replication and automatic data recovery mechanisms. Data is replicated across multiple servers to ensure fault tolerance.
  3. Sequential I/O Performance: GFS is optimized for sequential I/O operations, making it suitable for batch processing tasks like MapReduce.
  4. Simplicity: GFS provides a simple and straightforward interface for file storage and access, making it easier for developers to work with.

Demerits:

  1. Limited Support for Random Access: While GFS performs well with sequential I/O operations, it may not be as efficient for random access patterns, which can impact performance for certain types of workloads.
  2. Single Point of Failure: Although file data is replicated, all metadata flows through the single master, so a master outage can make the whole cluster unavailable until the master is restored.
  3. Metadata Bottleneck: The centralized metadata management in GFS can become a bottleneck as the system scales, impacting overall performance and scalability.

Bigtable:

Merits:

  1. Schema Flexibility: Bigtable offers schema flexibility, allowing developers to store semi-structured and unstructured data efficiently. This makes it suitable for a wide range of applications and use cases.
  2. High Throughput: Bigtable is optimized for high throughput and low-latency access to large-scale data, making it well-suited for real-time analytics and data serving.
  3. Automatic Sharding: Bigtable automatically shards data across multiple servers, enabling it to scale horizontally to handle massive datasets.
  4. Column-Oriented Storage: Bigtable stores data in a columnar format, which enables efficient retrieval of specific columns or column ranges, making it suitable for analytical queries.
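
To make this data model concrete, here is a minimal in-memory sketch of a Bigtable-style table in Python. Rows are kept sorted by key, and each cell is addressed by (row key, column family:qualifier, timestamp), which is what gives the store its schema flexibility and column-oriented access. The class and method names are illustrative only, not Bigtable's actual API.

```python
import time
from collections import defaultdict
from typing import Dict, List, Tuple

class SparseTable:
    """Toy Bigtable-style map: (row key, "family:qualifier", timestamp) -> value."""

    def __init__(self) -> None:
        # row key -> column -> list of (timestamp, value), newest first,
        # mirroring Bigtable's sparse, sorted, multi-versioned map.
        self._rows: Dict[str, Dict[str, List[Tuple[float, bytes]]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def put(self, row_key: str, column: str, value: bytes) -> None:
        # Rows do not need to share columns; a new column costs nothing
        # until a cell is actually written (schema flexibility).
        self._rows[row_key][column].insert(0, (time.time(), value))

    def get(self, row_key: str, column: str) -> bytes:
        # Read the most recent version of one cell.
        return self._rows[row_key][column][0][1]

    def scan_column(self, column: str) -> Dict[str, bytes]:
        # Column-oriented retrieval: pull one column across all rows, in key order.
        return {key: cols[column][0][1]
                for key, cols in sorted(self._rows.items())
                if column in cols}

# Example layout similar to a web table keyed by reversed URL.
table = SparseTable()
table.put("com.example/index", "contents:html", b"<html>...</html>")
table.put("com.example/index", "anchor:cnn.com", b"Example")
print(table.scan_column("contents:html"))
```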

Demerits:

  1. Complexity: Bigtable can be complex to manage and operate, especially for developers who are not familiar with distributed systems concepts. It requires careful planning and configuration to ensure optimal performance.
  2. Consistency Guarantees: Bigtable guarantees strong consistency only at the single-row level; when data is replicated across clusters, that replication is eventually consistent, so updates may not be immediately visible to all clients. Use cases that need strong consistency across rows or replicas have to work around this.
  3. Limited Support for Transactions: While Bigtable supports atomic operations within a single row, it does not provide full ACID transactions across multiple rows, which may be a limitation for certain applications.
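
Continuing the SparseTable sketch above, the transaction limitation can be illustrated as follows: several cell writes to one row can be grouped and applied as a unit, but nothing groups writes that span two rows. The commit_row helper is hypothetical, not part of any real client library.

```python
from typing import Iterable, Tuple

def commit_row(table, row_key: str, mutations: Iterable[Tuple[str, bytes]]) -> None:
    """Apply several cell writes to ONE row as a unit (single-row atomicity).

    A real tablet server makes this atomic; across rows Bigtable offers no
    multi-row transaction, so an application that needs "update row A and
    row B together" must handle partial failure itself.
    """
    for column, value in mutations:
        table.put(row_key, column, value)

# Usage with the SparseTable defined earlier:
commit_row(table, "com.example/index",
           [("contents:html", b"<html>v2</html>"), ("meta:crawled", b"2024-12-30")])
```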

Google File System

Google File System is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. It was initially described in a paper titled "The Google File System" by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung [1].

Designed to meet Google's rapidly growing needs, GFS explores radically different design choices. It is widely deployed within Google as the storage platform for the generation and processing of data used by its services. The largest cluster provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is accessed concurrently by hundreds of clients.

Google later adopted Colossus as its revamped file system. Colossus is built specifically for real-time services, whereas GFS was built for batch operations. Colossus, along with Caffeine, Google's new search infrastructure, powers virtually all of Google's web services, from Gmail and Google Docs to YouTube and Google Cloud Storage.

I was assigned GFS as part of an Operating Systems assignment, which is why I have skipped over most of the distributed-systems aspects. Check out the original paper for more.

Design Overview

Assumptions

GFS was designed around key observations about its expected workload and technological environment, and it reflects a departure from some earlier file system design assumptions.

  • Component failures are the norm rather than the exception. GFS clusters are built from large numbers of commodity machines, and the quantity and quality of these components guarantee that some are not functioning at any given time and some will never recover from their failures.

  • Files are huge by traditional standards. Multi-GB files and fast-growing data sets of many TBs are typical. As a result, GFS uses a chunk size of 64 MB, compared with the 4 KB block size of a traditional Linux file system (see the short calculation after this list).

  • Most files are mutated by appending new data rather than overwriting existing data. Once written, files are only read, and often only sequentially. Given this access pattern, appending becomes the focus of performance optimization and atomicity guarantees.

  • The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Atomicity with minimal synchronization overhead is essential.

  • High sustained bandwidth is more important than low latency. Target applications place a premium on processing data at a high rate and have few response-time requirements for individual reads or writes.
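
The chunk-size choice mentioned above is easy to quantify. The sketch below counts how many metadata entries the master would need for a hypothetical 100 GiB file (the file size is an assumption for illustration; the 64 MB and 4 KB figures come from the text).

```python
GIB = 1024 ** 3

def chunk_count(file_size_bytes: int, chunk_size_bytes: int) -> int:
    """Number of fixed-size chunks (or blocks) needed to cover a file."""
    return -(-file_size_bytes // chunk_size_bytes)  # ceiling division

file_size = 100 * GIB
print(chunk_count(file_size, 64 * 1024 ** 2))  # 64 MB chunks -> 1,600 entries
print(chunk_count(file_size, 4 * 1024))        # 4 KB blocks  -> 26,214,400 entries
```

Roughly 16,000 times fewer entries means far less chunk metadata for the master to track and far fewer client-master interactions.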

Interface

GFS provides a familiar file system interface, although it does not implement a standard API such as POSIX. The usual operations to create, delete, open, close, read, and write files are supported. Additionally, GFS supports snapshot and record append operations. Snapshot creates a copy of a file or directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each append.
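
Since the real GFS client library is proprietary, the following stub is only a hypothetical illustration of how record append and snapshot differ from an ordinary positional write; every name and signature here is an assumption.

```python
# Hypothetical GFS-style client interface (illustrative stubs, not a real API).

class GFSClient:
    def write(self, path: str, offset: int, data: bytes) -> None:
        """Ordinary write: the CLIENT chooses the offset. Concurrent writers
        to the same region may interleave and corrupt each other's data."""
        ...

    def record_append(self, path: str, data: bytes) -> int:
        """Record append: GFS chooses the offset, appends the record atomically
        at least once, and returns the offset it used. Many producers can
        safely append to one file (e.g. a shared log) without client-side locks."""
        ...

    def snapshot(self, src_path: str, dst_path: str) -> None:
        """Snapshot: fast copy of a file or directory tree."""
        ...
```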

Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Each of these is typically a commodity Linux machine running a user-level server process.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as plain Linux files.

Read and write operations are specified by a chunk handle and a byte range. For reliability, each chunk is replicated on multiple chunkservers; by default, three replicas are stored, though users can designate different replication levels.

Chunk size is one of the key design parameters. GFS chose 64 MB, which is much larger than a typical file system block size. Lazy space allocation avoids wasting space through internal fragmentation. A large chunk size reduces clients' need to interact with the master, reduces network overhead by letting a client maintain a persistent TCP connection to a chunkserver, and reduces the size of the metadata stored on the master.
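
On the client side, the fixed chunk size makes translating a file offset into a chunk trivial, which is part of why interaction with the master stays cheap. A minimal sketch (the function name is illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB GFS chunk size

def locate(offset: int) -> tuple:
    """Translate a file byte offset into (chunk index, offset within chunk).

    The client sends the file name and chunk index to the master, which
    replies with the chunk handle and replica locations; the byte range is
    then read from or written to a chunkserver directly.
    """
    return divmod(offset, CHUNK_SIZE)

print(locate(200 * 1024 * 1024))  # byte 200 MB -> chunk 3, 8 MB into that chunk
```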

The master maintains all file system metadata: the namespace, access-control information, the mapping of files to chunks, and the current locations of chunk replicas. It also controls system-wide activities such as chunk lease management, garbage collection, and chunk migration between chunkservers.

GFS client code linked into each application implements the file system API and communicates with the cluster. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the appropriate chunkservers, as sketched below.
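
The separation between the control path (master) and the data path (chunkservers) can be sketched as follows. This is a simplified illustration under assumed names: master.find_chunk and fetch_from_chunkserver stand in for RPCs and are not implemented here.

```python
from dataclasses import dataclass
from typing import List

CHUNK_SIZE = 64 * 1024 * 1024

@dataclass
class ChunkLocation:
    handle: int               # immutable 64-bit chunk handle
    replica_addrs: List[str]  # chunkservers currently holding replicas

def read(master, fetch_from_chunkserver, file_name: str,
         offset: int, length: int) -> bytes:
    """Sketch of a GFS read: metadata from the master, bytes from a chunkserver."""
    chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)

    # 1. Control path: ask the master which chunk backs this region and where
    #    its replicas live. Clients cache this, keeping the master off the
    #    data path for subsequent reads of the same chunk.
    loc: ChunkLocation = master.find_chunk(file_name, chunk_index)

    # 2. Data path: fetch the byte range from one replica directly.
    return fetch_from_chunkserver(loc.replica_addrs[0], loc.handle,
                                  chunk_offset, length)
```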

Performance

When run on a small cluster (15 servers), the file system achieves read speeds comparable to a single disk (80-100 MB/s) but has reduced write performance (30 MB/s) and a slow append rate (5 MB/s). Aggregating more servers also yields greater capacity and throughput, reaching up to 583 MB/s for 342 nodes.

Advantages

  1. GFS provides a location-independent namespace.
  2. GFS spreads a file's data across storage servers, distributing reads and writes.
  3. GFS uses commodity machines, lowering infrastructure costs.
  4. GFS keeps its metadata small enough to fit in memory, and the master reboots quickly (1-2 seconds).

Disadvantages

  1. GFS uses replication for redundancy and therefore consumes more raw storage than xFS or Swift.
  2. GFS uses a centralized design, and failure of the master stops the service.
  3. GFS is highly specialized for its workload and performs poorly as a general-purpose file system.
  4. GFS uses a relaxed consistency model, so clients have to perform their own consistency checks.

Summary

GFS is a highly customized file system, catering to Google's specific needs. It is an example of smart design driven by careful analysis of the expected workload. It prioritizes performance for its specific use cases over general-purpose performance.


References

[1] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google File System." In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.