The hardware lottery says something along the lines of “the algorithms that achieve the best results are the ones that utilize the available hardware optimally.”
I prefer thinking about it differently. The nature of deep, high-dimensional pattern detection in the real world is that it is necessarily highly parallelizable. The hardware lottery exists because, in order to discover certain patterns, you need a certain amount of compute. More compute means better results, and more parallelization means more compute.
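To make the parallelism claim concrete, here is a minimal sketch (plain NumPy, with made-up layer sizes) of the observation it rests on: the dominant operation in deep learning, matrix multiplication, splits into independent pieces that never need to communicate until the end.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 512))   # a batch of 1024 inputs (hypothetical sizes)
W = rng.normal(size=(512, 256))    # one layer's weights

# Reference result: the whole layer computed at once.
full = X @ W

# The same result computed as independent row chunks. Each chunk depends only
# on its own slice of X, so the chunks could run on separate cores, GPUs, or
# machines, with no coordination until the results are concatenated.
chunks = [X[i:i + 128] @ W for i in range(0, X.shape[0], 128)]
assert np.allclose(full, np.concatenate(chunks, axis=0))
```

Because the chunks are independent, adding more parallel workers translates almost directly into more usable compute, which is the sense in which "more parallelization means more compute."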
For a given dataset and task, all deep learning algorithms capture the same patterns, albeit in different ways. Some algorithms may be better at finding certain patterns efficiently, but the underlying patterns in the data are the same, and “learning”, machine learning in particular, is the process of capturing those patterns.
Maybe I’m being pedantic, but it’s not that the best algorithm is the one that best fits the available hardware. Rather, the best algorithm needs a lot of compute, so it necessarily has to use a highly parallel processor efficiently.
(I acknowledge that both statements are practically equivalent; I phrase it differently because the new phrasing more closely matches my intuition.)