Mastering Distributed Computing Principles for Scalable and Fault-Tolerant Systems

Gaurav Kumar
10 min read · Apr 20, 2024

This is part of the Data Engineering Roadmap.

1. Foundational Knowledge — Distributed Computing Principles.

1. What is distributed computing in MongoDB?

Distributed computing in MongoDB involves a database that stores data across multiple locations, which can be physical servers or virtual machines in a cloud database. This setup allows for horizontal scaling and improves data resiliency and availability.

Here are some key principles and concepts related to distributed computing in MongoDB:

  • Distributed Database: A database that spreads data across multiple locations to improve resiliency and availability.
  • Sharding: The process of distributing data across multiple machines (shards) to support horizontal scaling.
  • Homogeneous Distributed Databases: These databases use the same data model and operating system, and they share the same distributed database management system (DDBMS).
  • Schema Design: MongoDB supports a flexible schema model, allowing documents within a collection to have different fields and data types.
  • Data Modeling: It’s crucial to plan your schema to ensure logical structure and optimal performance. This involves identifying application workload, mapping object relationships, and applying design patterns.
  • Linking Related Data: MongoDB allows you to embed related data within a single document or store it in separate collections and access it with references.
  • Distributed Queries: MongoDB partitions data in a sharded collection based on shard key values, affecting the performance of write operations in the cluster.
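To make the sharding and shard-key ideas above concrete, here is a minimal sketch of hashed shard routing. The field name `user_id` and the shard count are illustrative assumptions; in a real MongoDB cluster this routing is handled by `mongos` and the config servers, not application code.

```python
import hashlib

NUM_SHARDS = 4  # illustrative cluster size

def shard_for(shard_key_value: str) -> int:
    """Map a shard key value to a shard via a stable hash (hashed sharding)."""
    digest = hashlib.md5(shard_key_value.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Documents with the same shard key always land on the same shard,
# so a query on that key can be routed to one shard instead of all of them.
doc = {"user_id": "u-1042", "name": "Ada"}
shard = shard_for(doc["user_id"])
assert shard == shard_for("u-1042")  # deterministic routing
assert 0 <= shard < NUM_SHARDS
```

Because the hash is deterministic, reads and writes for one key always hit the same shard, which is why the choice of shard key has such a strong effect on write distribution across the cluster.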

What is a distributed database?

A distributed database is one that disperses data across multiple locations, diverging from the conventional single-location storage. Instead of consolidating all data onto a single server or computer, it spreads data across numerous servers or a cluster of computers, each comprising individual nodes. These nodes are frequently dispersed geographically and may encompass physical computers or virtual machines within a cloud database setup.

Visualization of Cluster and Nodes

The MongoDB cluster configuration showcased above represents just one of the numerous setups possible for establishing a distributed database. However, in contrast to conventional centralized databases, all distributed databases share a common trait: the distribution of data across various locations, be they physical or virtual. This practice enhances data resilience and availability. Moreover, by distributing data across multiple locations, sharding facilitates horizontal scaling.

Types of Distributed Databases:

1. Homogeneous Distributed Databases:

  • Definition: Homogeneous distributed databases consist of multiple nodes that share the same data model, operating system, and distributed database management system (DDBMS).
  • Characteristics: All nodes in a homogeneous distributed database environment adhere to uniform standards and configurations, simplifying management and ensuring compatibility across the distributed system.
  • Example: A homogeneous distributed database setup may include multiple nodes running the same version of a database management system, such as MongoDB or Cassandra, with identical schemas and configurations.

2. Heterogeneous Distributed Databases:

  • Definition: Heterogeneous distributed databases comprise multiple nodes with varying data models, operating systems, and DDBMS implementations.
  • Characteristics: Nodes in a heterogeneous distributed database environment may differ in terms of their hardware, software, and configurations, making interoperability and data integration more complex.
  • Example: A heterogeneous distributed database environment may include nodes running different database management systems, such as MongoDB, MySQL, and PostgreSQL, each with its own schema and configuration settings.

3. Replicated Distributed Databases:

  • Definition: Replicated distributed databases replicate data across multiple nodes to improve fault tolerance, availability, and performance.
  • Characteristics: Data replication ensures that copies of data are stored redundantly across distributed nodes, enabling rapid data access and resilience against node failures.
  • Example: In a replicated distributed database setup, each node maintains a copy of the entire dataset, and changes made to one node are propagated to all other nodes in the cluster, ensuring data consistency and reliability.
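The full-replication behavior described above can be sketched in a few lines. This is a toy model with made-up class names, not any real database's replication protocol: writes go to a primary and are synchronously copied to every other node, so any node can serve a read after a failure.

```python
class Node:
    """A node holding a full copy of the dataset (full replication)."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class ReplicatedCluster:
    """Toy primary/secondary replication: writes hit the primary
    and are propagated to every other node in the cluster."""
    def __init__(self, node_names):
        self.nodes = [Node(n) for n in node_names]
        self.primary = self.nodes[0]

    def write(self, key, value):
        self.primary.data[key] = value
        for node in self.nodes[1:]:      # propagate to all replicas
            node.data[key] = value

    def read(self, key, node_index=0):
        return self.nodes[node_index].data.get(key)

cluster = ReplicatedCluster(["node-a", "node-b", "node-c"])
cluster.write("order:1", {"status": "shipped"})
# Any node can now serve the read, surviving failure of the others.
assert all(n.data["order:1"]["status"] == "shipped" for n in cluster.nodes)
```

Real systems replicate asynchronously or via consensus, which is where the consistency trade-offs come from; this sketch only shows the redundancy itself.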

4. Partitioned Distributed Databases:

  • Definition: Partitioned distributed databases partition or shard data across multiple nodes based on predefined criteria, such as key ranges or hash values.
  • Characteristics: Data partitioning distributes the dataset across distributed nodes, allowing for horizontal scaling and improved performance by parallelizing data access and processing.
  • Example: In a partitioned distributed database setup, data is divided into logical partitions or shards, with each shard managed by a separate node. Queries are routed to the appropriate shard based on the partitioning criteria, optimizing query performance and resource utilization.
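Range-based routing, as described in the partitioned example above, can be sketched as a sorted list of upper bounds. The specific bounds are invented for illustration; real systems store this shard map in metadata (e.g. MongoDB's config servers).

```python
import bisect

# Hypothetical range-partitioning scheme: each shard owns a key range.
# Upper bounds are exclusive; keys >= the last bound go to the final shard.
BOUNDS = [1000, 2000, 3000]     # shard 0: <1000, shard 1: <2000, ...

def shard_for_key(key: int) -> int:
    """Route a query to the shard whose range contains the key."""
    return bisect.bisect_right(BOUNDS, key)

assert shard_for_key(500) == 0
assert shard_for_key(1500) == 1
assert shard_for_key(9999) == 3  # everything past the last bound
```

Unlike hashed routing, range routing keeps adjacent keys on the same shard, which makes range scans cheap but can concentrate load on one shard if keys are written in order.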

Pros and Cons

Advantages of Distributed Computing Systems:

  • Scalability: Distributed systems typically offer greater scalability compared to centralized systems, enabling the seamless addition of new devices or systems to enhance processing and storage capabilities.
  • Reliability: Distributed systems often boast higher reliability than centralized systems, capable of maintaining operations even in the event of a device or system failure.
  • Flexibility: Distributed systems generally provide enhanced flexibility over centralized systems, offering easier configuration and reconfiguration to adapt to evolving computing requirements.

Challenges of Distributed Computing Systems

  • Complexity: Distributed systems often entail greater complexity compared to centralized systems, requiring coordination and management across multiple devices or systems.
  • Security: Securing a distributed system can pose challenges, as security measures must be implemented on each individual device or system to safeguard the entire system.
  • Performance: Distributed systems may not match the performance levels of centralized systems, given that processing and data storage are spread across numerous devices or systems.

Applications of Distributed Computing Systems:

  1. Cloud Computing: Distributed computing systems in the form of cloud computing deliver resources like computing power, storage, and networking via the Internet.
  2. Peer-to-Peer Networks: Peer-to-peer networks serve as distributed computing systems facilitating resource sharing among users, including files and computing power.
  3. Distributed Architectures: Modern computing systems, such as microservices architectures, adopt distributed architectures to distribute processing and data storage across multiple devices or systems.

How do distributed databases work?

Understanding Distributed Database Operations:

  • In a distributed database system, nodes represent individual servers or computers, each housing a portion of the data.
  • These nodes operate independently in a shared-nothing architecture: no hardware, such as disks or memory, is shared between them.

  • Each node runs on distributed database management system (DDBMS) software to manage its respective data set.

Importance of Data Distribution:

  • Effective data distribution is crucial for optimizing efficiency, ensuring security, and facilitating user access within a distributed database.
  • Data distribution, also known as data partitioning, determines how data is distributed among nodes to achieve these objectives.

Methods of Data Distribution:

1. Horizontal Partitioning:

  • Involves dividing data tables into rows and distributing them across multiple nodes.

2. Vertical Partitioning:

  • Splits tables into columns and distributes them across multiple nodes.

Resulting Data Sets:

  • The data sets resulting from horizontal or vertical partitioning are often referred to as shards, each representing a subset of the original table’s data.
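The two partitioning methods above differ only in which axis of the table is split. A minimal sketch, using an invented three-row table of dicts:

```python
# Toy table: each row is a dict; columns are the dict keys.
rows = [
    {"id": 1, "name": "Ada",  "email": "ada@example.com"},
    {"id": 2, "name": "Alan", "email": "alan@example.com"},
    {"id": 3, "name": "Barb", "email": "barb@example.com"},
]

# Horizontal partitioning: split by ROWS (here, ids 1-2 vs the rest).
shard_a = [r for r in rows if r["id"] <= 2]
shard_b = [r for r in rows if r["id"] > 2]

# Vertical partitioning: split by COLUMNS (identity vs contact info),
# keeping the id in both shards so rows can be rejoined.
identity = [{"id": r["id"], "name": r["name"]} for r in rows]
contact  = [{"id": r["id"], "email": r["email"]} for r in rows]

assert len(shard_a) + len(shard_b) == len(rows)
assert identity[0] == {"id": 1, "name": "Ada"}
```

Each of the resulting lists is a shard in the sense used above: a subset of the original table's data.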

Distributed database system communication

In a distributed database system, communication among nodes is vital for seamless operation. Unlike centralized databases, nodes in distributed systems operate independently, necessitating effective communication protocols.

There are three primary types of communication in distributed databases:

1. Broadcast Communication:

  • Broadcasts a single message to all nodes within the distributed database system.
  • Facilitates dissemination of information to all nodes simultaneously.

2. Multicast Communication:

  • Sends a message to a subset of nodes within the distributed database system.
  • Targets specific nodes for message delivery while excluding others.

3. Unicast Communication:

  • Involves one-to-one messaging between individual nodes within the distributed database system.
  • Enables direct communication between specific nodes without broadcasting to others.
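The three communication patterns above reduce to how the recipient set is chosen. A schematic sketch (node names and return shape are illustrative, not any real messaging API):

```python
def broadcast(nodes, message):
    """Deliver the message to every node in the system."""
    return {n: message for n in nodes}

def multicast(nodes, group, message):
    """Deliver the message only to the named subset of nodes."""
    return {n: message for n in nodes if n in group}

def unicast(nodes, target, message):
    """Deliver the message to exactly one node."""
    return {target: message} if target in nodes else {}

nodes = ["n1", "n2", "n3", "n4"]
assert len(broadcast(nodes, "hi")) == 4
assert set(multicast(nodes, {"n2", "n4"}, "hi")) == {"n2", "n4"}
assert unicast(nodes, "n3", "hi") == {"n3": "hi"}
```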

Transaction Management

Distributed databases often support distributed transactions spanning multiple nodes.
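One common way to coordinate such cross-node transactions (not named in the text above, but widely used) is two-phase commit: every participant first votes in a prepare phase, and the transaction commits only if all vote yes. A simplified sketch with invented class names:

```python
class Participant:
    """A node taking part in a distributed transaction."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.committed = name, healthy, False

    def prepare(self):   # phase 1: vote yes/no on the transaction
        return self.healthy

    def commit(self):    # phase 2: make the change durable
        self.committed = True

    def rollback(self):
        self.committed = False

def two_phase_commit(participants):
    """Commit only if every participant votes yes in the prepare phase."""
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False

assert two_phase_commit([Participant("a"), Participant("b")]) is True
assert two_phase_commit([Participant("a"), Participant("b", healthy=False)]) is False
```

The sketch omits what makes real 2PC hard: logging for crash recovery and the blocking that occurs if the coordinator fails between phases.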

Fault Tolerance

Due to the inherent complexity of distributed systems, ensuring fault tolerance is crucial for maintaining system reliability. Common fault tolerance processes include:

1. Data Replication:

  • Maintains multiple copies of data across nodes, servers, or sites.
  • Types of replication include full replication, partial replication, and merge replication.

2. Backup Protocols:

  • Implements automated data backup strategies to preserve data integrity and system availability.
  • Types of backups include full, differential, and incremental backups.

3. Continuous Failure Detection:

  • Monitors distributed systems for technical issues, disasters, or cyberattacks.
  • Techniques include heartbeating, watchdog timers, and data checksums.
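Heartbeat-based failure detection, the first technique listed above, can be sketched as a timeout check over each node's last-seen timestamp. The node names and timings are illustrative:

```python
def detect_failures(last_heartbeat, now, timeout):
    """Flag any node whose last heartbeat is older than the timeout."""
    return [node for node, ts in last_heartbeat.items() if now - ts > timeout]

# Timestamps in seconds; node-c has been silent for 12s against a 5s timeout.
heartbeats = {"node-a": 100.0, "node-b": 99.0, "node-c": 90.0}
assert detect_failures(heartbeats, now=102.0, timeout=5.0) == ["node-c"]
```

In practice the timeout must balance detection speed against false positives from slow networks, which is why production systems often require several missed heartbeats before declaring a node dead.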

Load Balancing

Efficiently distributing user requests and queries across database nodes is essential for optimal performance and resource utilization. Load balancing techniques include:

1. Load Balancer Deployment:

  • Deploys load balancer software to evenly distribute user requests across nodes.
  • Factors such as proximity, current load, and system rules guide load balancing decisions.
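Of the factors listed above, current load is the simplest to sketch: route each request to the least-loaded node. A minimal illustration with invented node names:

```python
def pick_node(loads):
    """Route the next request to the node with the lowest current load."""
    return min(loads, key=loads.get)

loads = {"node-a": 12, "node-b": 3, "node-c": 7}
target = pick_node(loads)
assert target == "node-b"
loads[target] += 1   # account for the request we just routed
```

Real load balancers combine this with proximity, health checks, and weighting rules, but least-load selection is the core of many of them.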

Query Optimization

Distributed databases employ query optimization techniques to enhance query performance and minimize data transfer overhead. Cost-based query optimization considers factors such as query complexity and data location to determine the most efficient query execution strategy.

By implementing effective communication, transaction management, fault tolerance, load balancing, and query optimization strategies, distributed databases can achieve scalability, reliability, and performance across distributed computing environments.

Case Studies and Real-World Applications:

Case Study: Optimizing Database Performance with CacheFront at Uber

Introduction:
Uber’s Docstore, a distributed database built on MySQL®, faces the challenge of serving low-latency reads at a high scale. As demands increase, traditional scaling methods become cost-prohibitive and operationally complex. To address this, Uber developed CacheFront, an integrated caching solution aimed at improving latency, scalability, and cost efficiency.

Challenges:

  • Growing demand for low-latency reads amidst complex microservices and dependency call graphs.
  • Vertical and horizontal scaling limitations in traditional database approaches.
  • Imbalance between read and write request rates impacting database performance.
  • Cost-prohibitive scaling options failing to address latency issues effectively.

Docstore Architecture:

Docstore architecture.

CacheFront Solution:

  • CacheFront minimizes the need for costly scaling by integrating caching directly into Docstore.
  • Reduces resource allocation to the database engine, improving cost efficiency.
  • Improves P50 and P99 latencies, stabilizing read latency spikes during microbursts.
  • Replaces custom caching solutions, streamlining development efforts and improving productivity.
  • Transparent integration allows seamless adoption without additional boilerplate for developers.

Implementation:

  • Utilizes a cache-aside strategy for cached reads, asynchronously populating Redis with data.
  • Leverages change data capture (CDC) for cache invalidation, ensuring consistency within seconds of database changes.
  • Implements negative caching to cache non-existent rows, further optimizing read operations.
  • Utilizes sharding and cache warming to ensure scalability and fault tolerance across regions.
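The cache-aside and negative-caching ideas above can be sketched as follows. This is a toy model in the spirit of the CacheFront description, not Uber's implementation: a plain dict stands in for Redis, and `invalidate` stands in for the CDC-driven invalidation pipeline.

```python
MISSING = object()   # sentinel for negative caching of non-existent rows

class CacheAsideStore:
    """Cache-aside read path with negative caching (dict stands in for Redis)."""
    def __init__(self, db):
        self.db, self.cache = db, {}

    def get(self, key):
        if key in self.cache:                    # cache hit
            hit = self.cache[key]
            return None if hit is MISSING else hit
        value = self.db.get(key)                 # miss: fall through to the DB
        self.cache[key] = MISSING if value is None else value
        return value

    def invalidate(self, key):
        """Called by a CDC-style pipeline when the underlying row changes."""
        self.cache.pop(key, None)

store = CacheAsideStore({"user:1": {"name": "Ada"}})
assert store.get("user:1") == {"name": "Ada"}    # miss, then cached
assert store.get("user:404") is None             # non-existent row...
assert "user:404" in store.cache                 # ...cached negatively
```

Negative caching matters because repeated lookups of absent rows would otherwise bypass the cache and hit the database every time.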

Results:

  • Significant latency improvements, with P75 latency down by 75% and P99.9 latency down by over 67%.

  • Successful handling of over 6 million requests per second (RPS) with a 99% cache hit rate.
  • Reduced resource requirements with only 3,000 Redis cores serving approximately 99.9% cache hits, compared to 60,000 CPU cores originally required.
  • Currently supports over 40 million requests per second across all Docstore instances in production.

Using CacheFront in Projects:

1. Identify Performance Needs: Assess the need for low-latency reads and high scalability in the project.
2. Evaluate Integrated Caching: Consider integrating a caching layer directly into your database to improve performance and scalability.
3. Implement Cache Invalidation: Utilize change data capture mechanisms for cache invalidation to ensure consistency with database changes.
4. Optimize Cache Strategy: Experiment with caching strategies such as negative caching and cache warming to further enhance performance.
5. Monitor and Measure: Continuously monitor cache performance and measure latency improvements to optimize cache configurations.

For More Details follow this link: How Uber Serves Over 40 Million Reads Per Second from Online Storage Using an Integrated Cache | Uber Blog

Case Study: Detecting Speech and Music in Audio Content at Netflix

Introduction:
Ever wondered how your favorite shows like Stranger Things or Casa de Papel create captivating audio experiences? From dramatic scores to impactful sound effects, audio plays a crucial role in storytelling. To unravel the secrets behind these audio experiences, Netflix needs to understand components like dialogue, music, and effects.

Practical Applications of Speech & Music Detection:

1. Audio Dataset Preparation:

  • Classifying and segmenting audio is essential for creating training datasets. For instance:
  • Segregating music segments aids in music retrieval tasks.
  • Identifying utterances helps in speech-related tasks like speaker diarization and emotion classification.

2. Dialogue Analysis & Processing:

  • Netflix uses speech-gated loudness for catalog-wide loudness management, ensuring a consistent volume experience for viewers.
  • Algorithms for dialogue intelligibility and speech transcription are applied only to regions with measured speech, enhancing accuracy.

3. Music Information Retrieval:

  • Music activity metadata is crucial for quality control and content analysis.
  • Tasks like singer identification and song lyrics transcription contribute to annotating musical passages in captions and subtitles.

4. Localization & Dubbing:

  • Accurate speech segmentation aids in post-production tasks like translation and dubbing.
  • Segmentation also helps in authoring accessibility features like Audio Description (AD).

Approach to Speech and Music Activity Detection:

  • Instead of relying on clean labels, they utilized Netflix’s extensive audio catalog with noisy labels, significantly expanding their dataset’s scale.
  • Their dataset, TVSM, encompasses diverse content from various countries and genres, featuring both speech and music labels at the frame level.
  • They employed manual annotation to ensure label quality and consistency, overcoming challenges in distinguishing between music, speech, and other audio elements.
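The frame-level labels mentioned above become useful once merged into time segments. A minimal sketch of that post-processing step (the label names, frame length, and function name are illustrative, not Netflix's pipeline):

```python
def frames_to_segments(labels, frame_seconds=1.0):
    """Merge consecutive identical frame labels into (start, end, label)
    segments, e.g. to build a speech/music activity timeline."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start * frame_seconds, i * frame_seconds, labels[start]))
            start = i
    return segments

labels = ["music", "music", "speech", "speech", "speech", "music"]
assert frames_to_segments(labels) == [
    (0.0, 2.0, "music"),
    (2.0, 5.0, "speech"),
    (5.0, 6.0, "music"),
]
```

Segments like these are what downstream tasks such as speech-gated loudness or dubbing workflows actually consume, rather than raw per-frame predictions.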

Model Architecture & Evaluation:

  • Adopted a convolutional recurrent neural network (CRNN) architecture, tailored to handle input/output requirements and model complexity.
  • Training involved a random sampling strategy, producing 20-second segments from the dataset.
  • Extensive evaluation, including an ablation study, highlighted the effectiveness of their approach in detecting speech and music activities.

Results:

  • Models exhibited excellent performance across diverse audio datasets, affirming the viability of utilizing real-world datasets with noisy labels for speech and music detection.
  • The robustness of their system underscores the importance of investing in algorithmically assisted tools for audio content understanding, facilitating tasks throughout the content production and delivery lifecycle.

For More Details follow this link: Detecting Speech and Music in Audio Content | by Netflix Technology Blog | Netflix TechBlog


Conclusion

Building a scalable data pipeline is a complex process that requires a robust framework, and much of the tech and many of the practices followed by Big Tech are difficult to replicate in personal projects. However, you can try out smaller, scaled-down versions of similar projects to get familiar with the practices. Keep up with the blogs from these tech houses for the latest updates and advancements in the industry. We will try to cover as much as possible here, but practice and debugging are what will make you a sharp Data Engineer. Happy Learning!
