The size of some recently released language models is staggering. This is problematic for two reasons:
First, it hinders democratization. If we believe in a world where millions of engineers are going to use deep learning to make every application and device better, we won’t get there with massive models that take large amounts of time and money to train.
Second, it restricts scale. There are probably fewer than 100 million processors in every public and private cloud in the world. But there are already 3 billion mobile phones, 12 billion IoT devices, and 150 billion micro-controllers out there. In the long term, it’s these small, low-power devices that will consume the most deep learning, and massive models simply won’t be an option.
This is the best post I've read on model efficiency. It goes deep in certain tactical areas but remains extremely accessible at all points.