Hey, TechFlixers!
Storing data has a cost. As a company grows, the amount of data it stores can increase tremendously. How do you optimize storage? How do you clean up and delete data to reduce costs?
Let’s find out with a real-world case study from Medium.
🔦 Spotlight
📢 Medium: How to delete old data from DynamoDB
Medium had a large amount of data stored in DynamoDB, AWS’s managed NoSQL database, which was becoming expensive to store and maintain.
They decided to clean up the data by deleting old items that were no longer needed. Data in this category can include:
Data that isn’t needed after a few months, such as the metrics that ensure a post already recommended to a user isn’t recommended again.
Data like records of sent emails, which don’t need to be kept forever.
Medium weighed three possible scenarios for cleaning up the data: doing nothing, deleting the bad items, and migrating the good items to a new table.
They estimated the cost of each scenario using their own use case as an example, considering factors such as scanning and deletion costs, storage costs before and after cleanup, and the time it takes to complete a migration.
The best option depends on the specific use case, and Medium suggests that companies reproduce these estimates with their own data before starting any cleanup work.
🚀 Power Up
Design Patterns for Data Deletion
Tombstoning
What it is: Instead of deleting records outright, mark them as deleted.
Use case: Soft deletions where you might need to restore data later.
Example: A user deactivates their account but might return. Tombstoning allows you to reactivate without data loss.
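To make this concrete, here’s a minimal sketch of tombstoning on a DynamoDB table, in the spirit of the case study above. The `users` table, its `user_id` key, and the `deleted_at` attribute are illustrative assumptions, not Medium’s actual schema:

```python
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")  # hypothetical table keyed on user_id

def tombstone_user(user_id: str) -> None:
    """Mark the user as deleted without removing the item."""
    users.update_item(
        Key={"user_id": user_id},
        UpdateExpression="SET deleted_at = :now",
        ExpressionAttributeValues={":now": datetime.now(timezone.utc).isoformat()},
    )

def restore_user(user_id: str) -> None:
    """Reactivation is just clearing the tombstone."""
    users.update_item(Key={"user_id": user_id}, UpdateExpression="REMOVE deleted_at")
```

Reads then simply filter out items that carry a `deleted_at` attribute, and restoring an account is a one-line update.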
Archival
What it is: Move old data to a cheaper, slower storage tier.
Use case: Data that must be retained for compliance but isn't accessed frequently.
Example: Financial records older than five years moved from a live database to cold storage.
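Here’s a minimal sketch of the archive-then-delete flow, assuming a hypothetical `orders` DynamoDB table and an S3 bucket as the cold tier (S3’s GLACIER storage class is one common choice for rarely accessed data):

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
orders = dynamodb.Table("orders")          # hypothetical hot table keyed on order_id
ARCHIVE_BUCKET = "example-archive-bucket"  # hypothetical cold-storage bucket

def archive_order(order: dict) -> None:
    """Copy the item to cheap object storage, then remove it from the hot table."""
    s3.put_object(
        Bucket=ARCHIVE_BUCKET,
        Key=f"orders/{order['order_id']}.json",
        Body=json.dumps(order, default=str),
        StorageClass="GLACIER",  # cheaper per GB, slower to retrieve
    )
    orders.delete_item(Key={"order_id": order["order_id"]})
```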
Log Rotation
What it is: Regularly delete or archive logs after a certain period.
Use case: Log files that are essential for a short duration.
Example: Server logs that are only needed for the last 30 days.
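For application logs, Python’s standard library already implements this pattern. This self-contained example rotates at midnight and keeps exactly the last 30 days:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

# Rotate server.log at midnight; files older than 30 days are deleted on rollover.
handler = TimedRotatingFileHandler("server.log", when="midnight", backupCount=30)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info("Logging with automatic 30-day retention.")
```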
Algorithms for Efficient Data Deletion
Time-Based Deletion
Algorithm: Use timestamp fields to identify and delete old records.
Efficiency: Low overhead as it can be scheduled during low-usage periods.
Example: Deleting session data older than 90 days.
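In DynamoDB specifically, you can often skip writing a deletion job entirely: the built-in TTL feature removes expired items in the background at no extra cost. A sketch, with the `sessions` table and `expires_at` attribute as illustrative names:

```python
import time
import boto3

# One-time setup: tell DynamoDB which numeric attribute holds the expiry epoch.
boto3.client("dynamodb").update_time_to_live(
    TableName="sessions",  # hypothetical table
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# On every write, stamp the item with an expiry 90 days out.
sessions = boto3.resource("dynamodb").Table("sessions")
sessions.put_item(Item={
    "session_id": "abc123",
    "expires_at": int(time.time()) + 90 * 24 * 60 * 60,
})
```

TTL deletions are asynchronous and can lag expiry by up to a couple of days, so filter expired items out on reads if exact cutoffs matter.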
Batch Deletion
Algorithm: Process deletions in batches to avoid overwhelming the system.
Efficiency: Reduces load compared to deleting records individually.
Example: Deleting old log entries in chunks of 1000.
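Below is a hedged sketch of batch deletion against DynamoDB. One caveat: DynamoDB’s BatchWriteItem caps each request at 25 items, and boto3’s `batch_writer` buffers and flushes those chunks for you, so a 1,000-item batch is just how much you process per outer pass. The `logs` table and its `log_id`/`created_at` attributes are illustrative:

```python
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
logs = dynamodb.Table("logs")  # hypothetical table keyed on log_id

def delete_old_logs(cutoff_iso: str) -> None:
    """Scan for entries older than the cutoff and delete them in batches."""
    scan_kwargs = {
        "FilterExpression": Attr("created_at").lt(cutoff_iso),
        "ProjectionExpression": "log_id",  # fetch only the key to keep the scan cheap
    }
    with logs.batch_writer() as batch:  # buffers deletes into 25-item requests
        while True:
            page = logs.scan(**scan_kwargs)
            for item in page["Items"]:
                batch.delete_item(Key={"log_id": item["log_id"]})
            if "LastEvaluatedKey" not in page:
                break
            scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```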
Priority-Based Deletion
Algorithm: Assign priority levels to data and delete the least critical first.
Efficiency: Ensures important data remains accessible longer.
Example: In an email system, delete promotional emails before user-generated content.
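A small, storage-agnostic sketch of the idea: rank every record by priority and reclaim space starting from the least critical, oldest items. The priority levels and item fields here are made up for illustration:

```python
from enum import IntEnum

class Priority(IntEnum):
    PROMOTIONAL = 0    # safe to delete first
    NOTIFICATION = 1
    USER_CONTENT = 2   # keep the longest

def plan_deletions(emails: list[dict], bytes_to_free: int) -> list[dict]:
    """Select the lowest-priority, oldest emails until enough space is reclaimed."""
    to_delete, freed = [], 0
    for email in sorted(emails, key=lambda e: (e["priority"], e["received_at"])):
        if freed >= bytes_to_free:
            break
        to_delete.append(email)
        freed += email["size_bytes"]
    return to_delete

# Promotional mail is chosen first even though it is newer than the user content.
queue = [
    {"id": 1, "priority": Priority.USER_CONTENT, "received_at": "2024-01-01", "size_bytes": 500},
    {"id": 2, "priority": Priority.PROMOTIONAL, "received_at": "2024-06-01", "size_bytes": 300},
]
print(plan_deletions(queue, bytes_to_free=300))  # deletes only email 2
```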
Use Cases and Scenarios
E-commerce Platform
Scenario: Order history older than two years and stale customer session data.
Solution: Archive old order data and delete customer session data older than 30 days.
Result: Reduces primary database size, cutting storage costs.
Social Media Network
Scenario: User activity logs that clutter the database.
Solution: Implement log rotation and batch deletion.
Result: Keeps the database lean, improving performance and reducing costs.
SaaS Application
Scenario: Tenant data lingering after customers have left.
Solution: Use tombstoning for soft deletes and archival for long-term storage.
Result: Flexible data management, allowing for potential restoration if needed.
Pro Tips
Automate the Process: Use automated scripts to schedule deletions during off-peak hours.
Monitor and Adjust: Regularly review deletion policies and adjust based on data access patterns.
Test Before Deleting: Simulate deletions in a staging environment, or behind a dry-run flag, to avoid unexpected issues in production (see the sketch below).
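Tying the last two tips together: make every cleanup script a dry run by default and require an explicit flag to delete, then schedule it off-peak. A minimal, self-contained sketch with a stand-in dataset:

```python
import argparse
from datetime import datetime, timedelta, timezone

# Stand-in data; in practice this would be a query against your store.
ITEMS = [
    {"id": "a", "created_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {"id": "b", "created_at": datetime.now(timezone.utc)},
]

def cleanup(dry_run: bool) -> None:
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    for item in ITEMS:
        if item["created_at"] < cutoff:
            if dry_run:
                print(f"DRY RUN: would delete {item['id']}")
            else:
                print(f"deleting {item['id']}")  # replace with a real delete call

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--execute", action="store_true", help="actually delete")
    # Off-peak scheduling, e.g. cron: 0 3 * * * python cleanup.py --execute
    cleanup(dry_run=not parser.parse_args().execute)
```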
Optimizing data storage isn't just about reducing costs—it's about maintaining a streamlined and efficient system. By implementing these strategies, you can ensure your data management practices are both cost-effective and robust.