Managing Large Datasets at Scale
How to leverage Graviti’s Tensorbay platform to manage large unstructured datasets
Datasets play a central role in machine learning and sit at the heart of the recent deep learning revolution. Dataset sizes range from a few kilobytes, used by beginners learning the basics of machine learning, to hundreds of terabytes collected by large companies such as Google and Facebook. These large datasets keep growing as users worldwide interact with these companies' services, and they are used to train machine learning models that improve speech recognition, optimize delivery routes for parcels, serve targeted ads and much more.
However, most publicly available datasets don't belong to the “big data” realm and are often much smaller. Such datasets can be found on platforms such as Kaggle. Smaller datasets are generally stored locally for model training, since using a cloud service would be overkill (although the trained models can, of course, still be deployed to a service or the cloud). A minimal version of this workflow is sketched below.
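As a rough illustration of that local workflow (the file name and column names here are hypothetical, and all feature columns are assumed to be numeric), a small dataset downloaded from Kaggle can simply be read from disk and trained on in memory:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical small CSV downloaded from Kaggle and kept on local disk;
# assume "label" is the target column and the rest are numeric features.
df = pd.read_csv("data/small_dataset.csv")

X = df.drop(columns=["label"])
y = df["label"]

# The whole dataset fits comfortably in memory, so a simple in-memory
# train/test split is all that is needed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```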
As dataset size and complexity grow, this approach stops being viable: the dataset may no longer fit into local storage, or the local machine may not have enough computing power to train on it. In such cases, the dataset is stored on a cloud server instead, and samples are fetched or streamed as they are needed, as sketched below.
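As a generic illustration of that idea (not TensorBay's own API; the bucket and prefix names here are hypothetical), samples can be streamed one at a time from cloud object storage such as Amazon S3 instead of downloading the whole dataset to local disk:

```python
import boto3

# Hypothetical bucket and prefix holding the training samples
BUCKET = "my-training-data"
PREFIX = "images/train/"

s3 = boto3.client("s3")

# Page through the object listing so the full index is never held in memory
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Stream each sample on demand instead of storing the dataset locally
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
        raw_bytes = body.read()
        # ...decode raw_bytes and feed the sample into the training loop...
```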