Presto @ Pinterest

medium.com

We have hundreds of petabytes of data and tens of thousands of Hive tables. Our Presto clusters comprise a fleet of 450 r4.8xl EC2 instances. Together, the clusters have over 100 TB of memory and 14K vCPU cores. Within Pinterest, roughly 1,000 monthly active users (out of 1,600+ total Pinterest employees) use Presto, running about 400K queries on these clusters per month.

...that's a lot. To give you a sense of scale, those EC2 instances alone would cost ~$700k / month at undiscounted list prices (and Pinterest is almost certainly paying less than that).
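If you want to sanity-check the numbers, here is a rough back-of-envelope sketch. Only the instance count comes from the post. The per-instance specs (32 vCPUs, 244 GiB of memory) are the published r4.8xlarge figures, and the hourly price is my assumption of an approximate on-demand Linux rate in us-east-1, so treat the output as a ballpark rather than Pinterest's actual bill.

```python
# Back-of-envelope check of the fleet figures quoted above.
INSTANCE_COUNT = 450  # from the Pinterest post

# Assumptions (not from the post): published r4.8xlarge specs and an
# approximate on-demand Linux price for us-east-1.
VCPUS_PER_INSTANCE = 32
MEMORY_GIB_PER_INSTANCE = 244
ASSUMED_HOURLY_PRICE_USD = 2.128
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def fleet_estimate(count=INSTANCE_COUNT, hourly_price=ASSUMED_HOURLY_PRICE_USD):
    """Return total vCPUs, memory in TiB, and undiscounted monthly cost in USD."""
    return {
        "vcpus": count * VCPUS_PER_INSTANCE,
        "memory_tib": count * MEMORY_GIB_PER_INSTANCE / 1024,
        "monthly_cost_usd": count * hourly_price * HOURS_PER_MONTH,
    }


if __name__ == "__main__":
    est = fleet_estimate()
    print(f"vCPUs:        {est['vcpus']:,}")
    print(f"Memory (TiB): {est['memory_tib']:.1f}")
    print(f"Monthly cost: ${est['monthly_cost_usd']:,.0f}")
```

Under those assumptions the totals line up with the "over 100 TB" and "14K vCPU" figures in the quote, and the monthly cost lands just under $700k. Swapping in a reserved-instance or negotiated rate for `hourly_price` shows how quickly that number drops.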

This post is particularly interesting to me because, in the era of Snowflake & BigQuery, it hasn't always been clear to me what Presto's long-term role in the ecosystem is. I still don't have a perfectly clear answer to that question, and I do believe that the modern commercial databases reduce the number of appropriate deployments for Presto, but my theories on why it still makes sense here are:

  1. Cost. My guess is that at this scale using Snowflake or BigQuery would be many times the cost of Presto.
  2. Control. The team at Pinterest has made many decisions that would not be possible to implement in a commercial product.
  3. Competition. Companies of this scale don't necessarily want their data to be running through a commercial product owned by another company.

It all eventually comes down to resources. It's very clear reading this post that Pinterest has a huge investment in Presto and that it's working quite well for them. It's also clear that below some threshold, it simply wouldn't make sense to invest the resources to run this type of infrastructure effectively.
