Hey, TechFlixers!
Today, we're diving into how Pinterest supercharged its analytics with StarRocks, achieving faster insights and better performance. In our Spotlight, we’ll see how they cut costs and boosted efficiency. Then, Power Up unpacks the essentials of fast analytics platforms.
🔦 Spotlight
📢 Delivering Faster Analytics at Pinterest
Pinterest, a visual discovery platform with over 500 million monthly active users, needed to improve its analytics system to provide real-time insights to advertisers and team members.
The company had been using Druid, but as its scale and requirements increased, it decided to migrate to StarRocks, an open-source OLAP database that supports standard SQL types and schemas, joins, sub-queries, and materialized views.
With the help of StarRocks, Pinterest was able to reduce p90 latency by 50% using only 32% of the instances required. This resulted in a three-fold increase in cost-performance efficiency and streamlined its data ingestion process, achieving data freshness of just 10 seconds.
🚀 Power Up
Let's dive into the nuts and bolts of fast analytics platforms, focusing on what makes them tick and why they're essential for companies like Pinterest.
What is an Analytics Platform?
It is a system that allows organizations to collect, process, and analyze large amounts of data to extract actionable insights. For a platform like Pinterest, where user engagement and advertiser satisfaction hinge on real-time feedback, having a fast analytics system is crucial. Fast analytics means quicker insights, leading to better decisions and a more responsive user experience.
General Architecture of an Analytics Platform
Data Ingestion: This is the process of importing data from various sources. It can include batch data (collected and processed at intervals) or streaming data (continuous, real-time data). The aim is to get data into the system as quickly as possible. We explored more on this in the last episode.
Data Storage: This is where the ingested data lives. It needs to be optimized for both read and write operations. Databases like Druid and StarRocks store this data in a way that allows for quick retrieval and processing.
Data Processing: This component transforms raw data into a format that's easy to query and analyze. It often involves operations like filtering, aggregation, and joining datasets.
Query Engine: This is the brain of the analytics platform. It takes user queries (usually written in SQL), finds the data, processes it, and returns results. The efficiency of this engine determines the speed of the platform.
Visualization: This involves displaying the data in a user-friendly manner, often through dashboards, charts, or reports.
Key Considerations for Fast Analytics Platforms
Data Freshness: For real-time insights, data must be processed and made available for querying almost immediately. Pinterest achieved this with a data freshness of just 10 seconds, meaning that the data is available for analysis within 10 seconds of being generated.
Latency Reduction: Low latency is crucial for a responsive system. Pinterest cut their p90 latency by 50%, meaning that 90% of their queries are completed within a very short time frame.
Cost-Performance Efficiency: The goal is to maximize performance while minimizing costs. By reducing the number of instances needed, Pinterest tripled its cost-performance efficiency. This balance is achieved through careful selection of technologies and architectures that optimize resource use.
Support for Complex Queries: As data complexity grows, the ability to perform sophisticated queries becomes essential. This includes handling joins, sub-queries, and maintaining materialized views to speed up access to frequently queried data.
What is an OLAP Database?
An OLAP (Online Analytical Processing) database is a specialized type of database designed for analyzing and querying large amounts of data quickly.
Unlike regular databases that handle daily transactions (like updating a shopping cart), OLAP databases focus on answering complex questions about data, such as identifying trends or comparing metrics across different dimensions.
Why OLAP Matters
Fast Queries: OLAP databases can quickly run complex queries, which is crucial for providing real-time insights.
Large Data Handling: They are built to handle massive amounts of data efficiently.
Data Freshness: They allow for rapid data updates, ensuring that analytics are based on the most recent information.
Scalability: They can grow with the data, keeping performance high even as data volume increases.
Druid and StarRocks are two examples of OLAP databases used for analytics.
Druid: Known for its fast data ingestion and real-time analytics capabilities, Druid is ideal for streaming data, where new information needs to be processed continuously and quickly.
StarRocks: Offers advanced SQL support, including complex queries and joins. It's great for detailed analysis and reporting, handling more sophisticated queries efficiently.
These databases have in-built query engines that determine the speed of the analytics platform.
(Note that these aren’t the only options available)
In real life, these systems help a business quickly adjust to market changes, provide timely feedback to advertisers, and ensure a smooth user experience.
For instance, an e-commerce site can instantly analyze shopping trends to adjust promotions, or a social media platform can monitor user engagement in real-time to optimize content recommendations.
Explore Further
Real-Time Analytics for Mobile App Crashes using Apache Pinot at Uber
Hybrid Tables for Data Reliability: Uber uses a mix of real-time and offline tables in Apache Pinot (another OLAP DB), ensuring continuous data availability even during failures.
Efficient Data Handling: They flatten and compress nested crash data, enabling faster queries and reducing storage costs.
Massive Event Processing: Netflix ingests over 2 million events per second and queries 1.5 trillion rows, using Apache Druid for high-speed data handling.
Efficient Data Aggregation: The platform utilizes time-based data chunks and rollup techniques (combining multiple records that share the same dimensional values into a single summarized record) to optimize storage and query speed.
Real-Time Insights and Testing: Netflix employs Druid for real-time device monitoring and A/B testing, ensuring a high-quality user experience.
That’s it for this edition. From Pinterest's analytics upgrade to the fundamentals of OLAP, we’ve covered some game-changing insights. Stay tuned for more tech deep dives.