Zero downtime data migrations and system design for a booking platform
🎬S2:E06 Targeting 99.999%
Hey, TechFlixers!
Before we get started, I have some EXCITING news to share. Starting from the next episode, TechFlix Weekly will become TechFlix Daily!!!
Why switch to a daily newsletter?
We aim to provide a consistent, in-depth engineering resource for hardcore software engineers. We currently track over 150 engineering and technology blogs and plan to increase to 300+. To keep up with this growth, a daily newsletter will ensure you never miss important updates.
What’s changing?
You'll get one email a day, perfect for a quick morning read, a tea break, or before bed. We're working on consistent delivery, and your support is appreciated!
Each edition will be shorter, crisper, and more visual. Just 5-7 minutes a day can significantly boost your software engineering knowledge. You'll find real solutions and references for deep engineering problems.
The new format includes:
Spotlight Story: One main story, plus related case studies or article links.
Tea Time: Stays the same.
PowerUp: Intuition around a brief interview question to help you revise and upskill.
In this episode, we explore one of the most important aspects of software engineering—database migrations.
🔦 Spotlight
📢 Stripe's Quest for 99.999% Uptime
Stripe's latest blog post delves into how their document database infrastructure achieves exceptional reliability with zero-downtime migrations. By leveraging a customized version of MongoDB, Stripe built DocDB to handle massive real-time data and support over five million queries per second.
Key Highlights
DocDB: An in-house extension of MongoDB, optimized for low latency (minimal delay) and high performance.
Data Movement Platform: Enables client-transparent migrations, ensuring data consistency and availability without downtime. This means users never notice the data moving.
Sharding Strategy: Distributes data across thousands of shards (smaller, manageable pieces of data), allowing seamless scaling and enhanced performance.
Check out the full article for a deep dive into their innovative solutions.
Other interesting reads
Migrating critical traffic at scale with zero downtime at Netflix.
How Apollo 24|7 migrated to Google Cloud with zero downtime.
🚀 Power Up
Designing a Flight Booking System
Let’s explore the nitty-gritty details of this question, which appeared in a recent D. E. Shaw interview. A similar approach works for other kinds of booking systems too.
Step-by-Step Breakdown
1. User Interface: The UI is the front door of your system, where users search for flights, book tickets, and make payments. It should be intuitive and responsive. A modern web framework such as Next.js works well here.
2. Backend Services:
Flight Search Service: Allows users to search for available flights based on their criteria (destination, date, etc).
Booking Service: Manages the booking process, ensuring that seats are reserved properly.
Payment Service: Handles payment transactions securely.
User Service: Manages user authentication and profiles.
3. Database: It stores user data, flight schedules, booking information, and payment details. Ensure it’s robust and secure.
Techstack Summary
For the Flight Search Service, use Python with Flask or FastAPI for fast development, combined with Elasticsearch for search and Redis for caching. Apache Kafka can handle real-time data processing (for example, keeping the displayed flight data current).
The Booking Service can be built with Java and Spring Boot for robust transaction management, or with Node.js and Express for asynchronous handling. Use PostgreSQL as its database and RabbitMQ or Apache Kafka for message queuing.
For the Payment Service, integrate with payment gateways like Stripe or PayPal SDK, and ensure security with OAuth2, JWT, and PCI-DSS compliance.
The User Service can be developed using Auth0 or Firebase Authentication for user management and MongoDB for flexible schema storage.
For the overall database layer, combine PostgreSQL for relational data with MongoDB for non-relational data. Encrypt the data, take regular backups, and replicate it. Read replicas and sharding provide scalability.
Booking Process
Search Flights: The user inputs search criteria, and the Flight Search Service queries the database and returns available flights.
Select Flight: The user selects a flight; the Booking Service checks seat availability.
Reserve Seat: The Booking Service uses optimistic locking to reserve the seat.
Payment: The payment service processes the transaction.
Confirm Booking: Booking Service updates the database with confirmed booking details.
Follow-Up Interview Questions
How would you handle the issue of double booking due to concurrent requests?
To handle this, implement optimistic locking and distributed locking.
Optimistic Locking
Each seat in the database has a version number.
When a user selects a seat, the system reads the seat’s version.
Before confirming the booking, the system checks if the seat’s version number has changed.
If unchanged, the booking proceeds and the version number is incremented.
If changed, it indicates another user has already booked the seat, so the system prompts the user to choose another seat.
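The version check above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module rather than PostgreSQL; the `seats` table, its columns, and the function names are assumptions made up for this example.

```python
import sqlite3


def read_version(conn: sqlite3.Connection, seat_id: str) -> int:
    """Read the seat's current version when the user selects it."""
    row = conn.execute(
        "SELECT version FROM seats WHERE seat_id = ?", (seat_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown seat {seat_id}")
    return row[0]


def confirm_booking(
    conn: sqlite3.Connection, seat_id: str, user_id: str, seen_version: int
) -> bool:
    """Book the seat only if its version is still the one we read earlier."""
    # The WHERE clause performs the version check and the increment in one
    # statement, so two confirmations cannot both succeed on the same version.
    cur = conn.execute(
        "UPDATE seats SET booked_by = ?, version = version + 1 "
        "WHERE seat_id = ? AND version = ?",
        (user_id, seat_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means another booking won


# Demo with a hypothetical seat "12A": two users read the same version.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seats (seat_id TEXT PRIMARY KEY, booked_by TEXT, "
    "version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO seats VALUES ('12A', NULL, 1)")
alice_seen = read_version(conn, "12A")
bob_seen = read_version(conn, "12A")
print(confirm_booking(conn, "12A", "alice", alice_seen))  # True
print(confirm_booking(conn, "12A", "bob", bob_seen))      # False
```

Bob's update matches zero rows because Alice's booking already bumped the version, which is the signal to ask him to pick another seat.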
Distributed Locking
Use a distributed lock (e.g., one backed by Redis) to coordinate across multiple service instances.
Before booking, acquire a lock on the seat, with an expiry so that a crashed instance cannot hold it forever.
Complete the booking transaction while holding the lock.
Release the lock once the transaction is complete.
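Here is a rough sketch of the acquire, book, release flow. To keep it runnable without a Redis server, it uses an in-process stand-in that mimics "set if absent, with expiry" semantics. The `InMemoryLockManager` class, `book_with_lock` function, and 30-second expiry are all assumptions for illustration, not a real Redis client API.

```python
import threading
import time
import uuid
from typing import Callable, Dict, Optional, Tuple


class InMemoryLockManager:
    """In-process stand-in for a shared lock store (hypothetical, not Redis)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._mutex = threading.Lock()
        self._locks: Dict[str, Tuple[str, float]] = {}  # key -> (token, expiry)

    def acquire(self, key: str, ttl_seconds: float) -> Optional[str]:
        """Return an owner token, or None if an unexpired lock exists."""
        with self._mutex:
            now = self._clock()
            held = self._locks.get(key)
            if held is not None and held[1] > now:
                return None
            token = uuid.uuid4().hex
            self._locks[key] = (token, now + ttl_seconds)
            return token

    def release(self, key: str, token: str) -> bool:
        """Release only if the caller still owns the lock."""
        with self._mutex:
            held = self._locks.get(key)
            if held is None or held[0] != token:
                return False
            del self._locks[key]
            return True


def book_with_lock(
    locks: InMemoryLockManager, seat_id: str, user_id: str, bookings: Dict[str, str]
) -> bool:
    """Acquire the seat lock, book while holding it, then release."""
    key = f"seat:{seat_id}"
    token = locks.acquire(key, ttl_seconds=30)
    if token is None:
        return False  # another instance is booking this seat right now
    try:
        if seat_id in bookings:
            return False  # seat was already taken
        bookings[seat_id] = user_id
        return True
    finally:
        locks.release(key, token)
```

The owner token matters: if a lock expires and another instance takes it, the original holder's late `release` call must not remove the new owner's lock.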
How would you design the system to handle peak load during high-traffic events, like a major sale or holiday season?
Auto-Scaling: Configure auto-scaling rules to add more instances of your services when CPU or memory usage exceeds a certain threshold.
Caching
Implement caching for read-heavy operations like flight searches.
Use in-memory data stores like Redis or Memcached to cache frequently accessed data, reducing the load on the database.
Cache user sessions and flight search results to speed up response times.
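One common way to wire this up is the cache-aside pattern with a time-to-live. The sketch below uses a plain in-process dictionary as a stand-in for Redis or Memcached; `TTLCache`, `search_flights`, and the `query_backend` callback are hypothetical names for this example.

```python
import time
from typing import Callable, Dict, Hashable, List, Optional, Tuple


class TTLCache:
    """Tiny in-process cache with per-entry expiry (stand-in for Redis)."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, object]] = {}

    def get(self, key: Hashable) -> Optional[object]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if expires_at <= self._clock():
            del self._entries[key]  # stale entry: drop it and report a miss
            return None
        return value

    def set(self, key: Hashable, value: object) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def search_flights(
    cache: TTLCache,
    origin: str,
    destination: str,
    date: str,
    query_backend: Callable[[str, str, str], List[str]],
) -> List[str]:
    """Cache-aside: try the cache first, fall back to the backend on a miss."""
    key = (origin, destination, date)
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: the database is not touched
    results = query_backend(origin, destination, date)
    cache.set(key, results)
    return results
```

A short TTL is a reasonable starting point for flight searches, since seat availability and prices change often and stale results are costly.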
Load Balancing
Ensure the load balancer supports health checks to route traffic only to healthy instances.
Pick the balancing algorithm to suit the traffic: round-robin works well when requests are short and uniform, while least-connections copes better when request durations vary.
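To make the two algorithms concrete, here is a small sketch of both, each skipping instances that failed their health check. The `Instance` record and its fields are assumptions for this example; a real load balancer tracks health and connection counts itself.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Instance:
    name: str
    healthy: bool = True
    active_connections: int = 0


class RoundRobinBalancer:
    """Cycle through instances in order, skipping unhealthy ones."""

    def __init__(self, instances: Sequence[Instance]) -> None:
        self._instances: List[Instance] = list(instances)
        self._next = 0

    def pick(self) -> Instance:
        # Try each instance at most once per call.
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


def pick_least_connections(instances: Sequence[Instance]) -> Instance:
    """Choose the healthy instance with the fewest active connections."""
    healthy = [inst for inst in instances if inst.healthy]
    if not healthy:
        raise RuntimeError("no healthy instances available")
    return min(healthy, key=lambda inst: inst.active_connections)
```

Round-robin needs no per-request bookkeeping, while least-connections requires the balancer to know how busy each instance currently is.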
Rate Limiting and Throttling
Implement rate limiting to prevent a single user from overwhelming the system with requests.
Throttle requests to critical services during peak times to maintain system stability.
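A token bucket is one widely used way to implement per-user rate limiting. The sketch below is a single-process illustration; the class names and the capacity and refill values are assumptions, and a multi-instance deployment would need to keep the counters in a shared store.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at a steady rate."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add the tokens earned since the last call, capped at capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._refill)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # out of tokens: reject or delay this request


class PerUserRateLimiter:
    """Keep one bucket per user so a single user cannot starve the others."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str) -> bool:
        bucket = self._buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill, self._clock)
            self._buckets[user_id] = bucket
        return bucket.allow()
```

The same bucket can throttle a critical service during peak times by keying it on the service name instead of the user ID.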
Circuit Breaker Pattern
Implement the circuit breaker pattern to prevent cascading failures.
If a service keeps failing, the circuit breaker trips and subsequent requests return an error immediately instead of piling up on the failing service.
Monitor the service and, after a cool-down period, let a trial request through. Reset the circuit breaker once the service has recovered.
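The trip and reset behavior can be sketched as a small state machine. This is a simplified, single-threaded illustration; the class name, the failure threshold, and the reset timeout are assumptions for this example, and production systems usually rely on an established resilience library.

```python
import time
from typing import Any, Callable


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the service."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # "closed", "open", or "half_open"
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half_open"  # cool-down elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call, or too many failures in a row, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()

    def _record_success(self) -> None:
        self._failures = 0
        self.state = "closed"
```

Wrapping calls to the Payment Service in `breaker.call(...)`, for instance, would let the Booking Service fail fast while payments are down instead of tying up its own threads.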
📨 Post Credits
And that’s a wrap! Stay tuned for the next episode and a new beginning!
Coming soon.