Stitch Fix: Putting the Power of Kafka into the Hands of Data Scientists

multithreaded.stitchfix.com

It's been a while since I read an amazing "here's how we built our killer data infrastructure" post, but this one more than scratched the itch. It details a year-long project involving design, technology selection, and implementation.

The original goals:

  1. Fully self-service for data scientists (remember, this is the company that believes engineers shouldn't write ETL)
  2. Enable real-time analysis
  3. High-fidelity change capture

Even if you're not planning to take on a project like this tomorrow, this post distills three people's thinking over the course of an entire year and contains tons of wisdom. My absolute favorite part was at the end, where the author discusses their strong preference for investing in open source:

"Never have I worked on a project this challenging that involved writing so little code. We spent about two-thirds of the project timeline on research, design, debate and prototyping. Over months we whittled away at every component until it was as small and maintainable as possible. We tried to salvage as much as we could from our legacy infrastructure. More than once we submitted patches upstream to Kafka Connect to avoid extra complexity on our end."

An amazing model for much of modern software development.
