People, in my experience, tend to find it hard to get their heads around many “big data” concepts. It’s only when they attempt to implement a big data initiative, and are frustrated by the basics, that they start to “get it”.
One of the most basic things that people seem to misunderstand is the challenge of moving data around. Most big data tutorials assume that you already have terabytes (or petabytes) of data on your cluster. But how does it get there in the first place? If you have hundreds of terabytes of data in traditional storage, how do you get it onto your Spark (for instance) cluster?
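To make the pain concrete, here is a minimal PySpark sketch of the obvious approach: pulling the data across the network from an existing relational store via a JDBC read and landing it on the cluster as Parquet. The connection string, table, and column names are all hypothetical (and you’d need the matching JDBC driver on the Spark classpath); the point is that every byte still has to travel over the wire.

```python
# Hypothetical sketch of brute-force ingestion over the network.
# Connection details, table, and column names are invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bulk-ingest-sketch").getOrCreate()

# Read the source table over JDBC, partitioned on a numeric key so that
# many executors pull from the source database in parallel.
events = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://legacy-db.internal:5432/warehouse")
    .option("dbtable", "events")
    .option("user", "etl_user")
    .option("password", "change-me")   # use a secrets store in practice
    .option("partitionColumn", "event_id")
    .option("lowerBound", "1")
    .option("upperBound", "1000000000")
    .option("numPartitions", "64")
    .load()
)

# Land the data on the cluster's distributed storage as Parquet.
events.write.mode("overwrite").parquet("hdfs:///data/warehouse/events")
```

It works, but the transfer is bounded by your network link and by how hard you can hammer the source system, which is the whole problem.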
There’s no magic answer. Moving that amount of data is painful—plain and simple.
Microsoft has recognized this and introduced the Azure Import/Export service. Basically, you snail-mail them your hard disk(s) and they upload the data to Azure Blob storage over their high-speed, secure internal network.
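Once the import job completes, the uploaded files are just ordinary blobs, so the Spark cluster from earlier can read them directly. A minimal sketch, assuming the hadoop-azure (wasbs) connector is available on the cluster; the storage account, container, path, and key are placeholders:

```python
# Hypothetical sketch of reading imported data back out of Azure Blob storage.
# The storage account, container, path, and key below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("read-imported-data")
    # The spark.hadoop.* prefix passes settings through to the Hadoop Azure connector.
    .config(
        "spark.hadoop.fs.azure.account.key.mystorageaccount.blob.core.windows.net",
        "<storage-account-key>",
    )
    .getOrCreate()
)

# Read the imported files straight from Blob storage over the wasbs:// scheme.
df = spark.read.parquet(
    "wasbs://imported-data@mystorageaccount.blob.core.windows.net/events/"
)
df.printSchema()
```

The heavy lifting happens inside Microsoft’s data centre rather than over your WAN link.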
The Import/Export service is a start, but it’s trailing Amazon’s solution to this problem (the Snowmobile) by, oh, around 100 petabytes.