Data Science Roundup reader Vicki Boykis has a great new post on the idiosyncrasies of SparkR's dataframe. For a data scientist using R, SparkR is an incredibly powerful tool for extending your existing skillset into the world of parallelized computing, but it's important to understand what's going on under the hood. Vicki's article does a great job of showing exactly that.
Also, I'm embarrassed by just how much I enjoyed this joke from the article:
Some people, when confronted with a problem, think “I know, I’ll use multithreading”. Nothhw tpe yawrve o oblems.
It's a good point: use Spark only when you have to.