machine learning datasets 24 June 2021
I know it’s easy to get carried away by all the shiny, sophisticated ML models out there, but we need to have a serious conversation about datasets. In most industrial applications of ML, it’s the datasets that limit performance, not model complexity.
The lack of high-quality training datasets is the biggest challenge for industrial machine learning right now. That should not be surprising if you consider how much bigger the supply of models is compared to the supply of datasets. A large portion of ML algorithm development happens thanks to the open source and academic communities, which gladly publish their work in exchange for reputation points. In contrast, developing and curating datasets is often less valued and not as glamorous.
My experience from working on data-driven solutions for industrial problems is that there is no shortage of data, and huge volumes are available in raw form; however, converting them into training and validation datasets is what usually blocks ML/AI progress.
It's hardly controversial to attribute 80% of a data scientist's time to what is collectively termed preprocessing, cleaning or structuring the data. The result of all of those activities is a dataset that hopefully represents reality accurately and can serve as ground truth for model development and evaluation. Creating high-quality datasets, however, takes effort, skill and the right blend of domain knowledge and experience with data.
This is exactly what we are passionate about at unifai. We approach dataset development as an engineering practice, and we create high-quality datasets to increase the adoption of machine learning in industry. Our experience in data science, data engineering, software development and the heavy asset industry puts us in a unique position to deliver datasets built on the right mix of domain knowledge and experience with data.
Unifai is not alone in betting on datasets. Andrew Ng, one of the biggest names in AI, recently started promoting a data-centric view of AI over the model-centric one. In a recent announcement together with landing.ai and deeplearning.ai, he launched a competition in which the model is kept fixed and participants work only on the data, with the aim of inspiring new methods for improving data.
Get in touch with me if you share similar challenges or if you're sitting on piles of raw data and your organization wants to get some quick wins with machine learning.
This post was originally published on LinkedIn.