How to Datalab: Running notebooks against large datasets

Streaming your big data down to your local compute environment is slow and costly. In this episode of AI Adventures, we’ll see how to bring your notebook environment to your data!

Datalab and SageMaker (Google's and Amazon's notebook products, respectively) are interesting to me for exactly this reason: it's just not a great idea to process big data with your local machine. Sure, there are plenty of times when doing so works just fine, but if you accustom yourself to that workflow, all of your tooling will be built around it. When you eventually need to process a larger dataset, you'll suddenly have to step into a different tool set.

Instead, default to processing all of your data in the cloud. This doesn't necessarily mean you need to use a cloud provider's notebook product! You can absolutely set up a local notebook that uses cloud resources to perform computation, but this article presents a wonderful walkthrough of how Google Cloud Datalab makes that workflow really seamless and easy.

If you're still burning local CPU cycles on numpy, give this a shot. It's easy, and it scales really well.
