How Crowdsourcing Can Help Small Businesses Leverage Machine Learning

Machine Learning (ML) is increasingly becoming a competitive differentiator for many businesses, especially in the financial, healthcare and retail verticals. However, very few firms are able to use machine learning to generate business value, build competitive advantage or produce deep research insights. Part of the challenge is that machine learning requires core competencies across several disciplines, as outlined below:

  • Sourcing massive training datasets
  • Data Wrangling (data cleansing, feature engineering, etc.)
  • Predictive model building using statistical methods and ML languages
  • Validating predictive model accuracy

While firms typically have in-house domain experts who understand domain-specific data, the key challenge they face is sourcing the needed data at scale to develop more realistic predictive models. Another challenge is the lack of in-house expertise in the advanced statistical techniques or ML languages needed to build truly reliable predictive models.

Crowdsourcing platforms for Machine Learning are coming of age to address these challenges. Several for-profit and not-for-profit organizations are using crowdsourced Machine Learning platforms and services to meet their business objectives or gain deeper insights.


A major challenge in creating an optimal predictive model is the scale of the training data. As the saying goes, garbage in, garbage out, and this is very much true of building good predictive models. Without enough relevant training data, the resulting models are of little use even if the underlying algorithms are very powerful.

It all boils down to having the right ingredients, i.e.:

  • Massive training datasets, i.e. sufficient data
  • Business relevant data
  • Cleansed data

A few crowdsourcing platforms, such as Amazon’s Mechanical Turk and CrowdFlower, aim to solve this dataset sourcing and data cleansing problem.

Amazon’s Mechanical Turk (AMT)

AMT is a crowdsourcing marketplace where a company can publish and coordinate a wide set of Human Intelligence Tasks (HITs), such as classification, tagging, surveys, and image data processing. By leveraging a global, on-demand workforce, massive sets of data can be tagged, labeled and cleaned so that they are ready to use as training data for predictive modeling.

One of the challenges associated with open platforms like AMT is the raw nature of the training data. Because of the scale of the training data, multiple Turks work on the same dataset, and the resulting output will not be homogeneous. More often than not, the people working on these datasets also have no prior domain knowledge, which leads to questionable dataset quality and in turn hurts the quality of the model's predictions.
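One common way to reduce this inconsistency is to have several workers label the same item and keep only the answer most of them agree on. The sketch below illustrates that idea with a simple majority vote; the function name, the `min_agreement` threshold, and the sample data are all hypothetical and are not part of AMT itself.

```python
from collections import Counter


def aggregate_labels(worker_labels, min_agreement=0.6):
    """Combine several workers' labels per item by majority vote.

    worker_labels maps an item id to the list of labels submitted for it.
    Items whose winning label falls below min_agreement are flagged for review.
    """
    consensus, needs_review = {}, []
    for item_id, labels in worker_labels.items():
        # Normalize case and whitespace so "Cat" and " cat" count as one answer.
        cleaned = [label.strip().lower() for label in labels]
        top_label, votes = Counter(cleaned).most_common(1)[0]
        if votes / len(cleaned) >= min_agreement:
            consensus[item_id] = top_label
        else:
            needs_review.append(item_id)
    return consensus, needs_review


# Hypothetical image-tagging results, three workers per item.
raw = {
    "img_001": ["cat", "Cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}
consensus, needs_review = aggregate_labels(raw)
print(consensus)
print(needs_review)
```

Items that land in `needs_review` could then be routed to an in-house domain expert rather than being used for training as-is.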

Recognizing this problem, several big-name firms use their own internal platforms to control the quality of the datasets being generated. Internal sourcing also helps because domain expertise can be leveraged to create more meaningful training datasets. With powerful wrangling tools like Trifacta, internal business analysts and data scientists can automate some of the data wrangling activities.
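To give a feel for the kind of wrangling step that can be automated, here is a minimal sketch in plain Python. It does not use Trifacta or any other product's API; the function, field names, and records are made up for illustration, and a real pipeline would cover many more cases.

```python
def clean_records(records, required_fields):
    """Trim whitespace, drop incomplete rows, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Strip stray whitespace from string values; leave other types alone.
        row = {}
        for key, value in record.items():
            row[key] = value.strip() if isinstance(value, str) else value
        # Skip rows where any required field is empty or missing.
        if any(row.get(field) in ("", None) for field in required_fields):
            continue
        # Skip rows that duplicate an earlier row after normalization.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


# Hypothetical raw records with padding, a duplicate, and missing values.
records = [
    {"customer": " Acme ", "amount": 120.0},
    {"customer": "Acme", "amount": 120.0},
    {"customer": "", "amount": 75.5},
    {"customer": "Globex", "amount": None},
    {"customer": "Initech", "amount": 42.0},
]
cleaned = clean_records(records, required_fields=["customer", "amount"])
print(cleaned)
```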


Once sufficient training datasets are acquired, the next big challenge firms face is developing accurate predictive models that serve their business needs. Several innovative companies provide crowdsourcing platforms and services to tackle this need. With a vast army of highly qualified statisticians and data scientists, these providers help businesses tackle the problem of building complex models.

Let’s have a closer look at some of the platforms solving model building for Machine Learning.


Kaggle is an industry pioneer in this space and is essentially a marketplace that matches firms seeking data science solutions with machine learning experts. Kaggle is reshaping the Machine Learning and data science landscape with a single objective: hosting data science problems that address critical business needs. Firms post challenging problems on the portal in the form of competitions with a fixed schedule, and any person or team registered on the portal can submit a solution.
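The competition format can be pictured as scoring each submission against labels that only the organizer can see, then ranking the teams. The sketch below assumes a simple accuracy metric and invented team names and data; actual scoring metrics differ from one competition to the next, and this is not Kaggle's own code.

```python
def accuracy(predictions, truth):
    """Fraction of predictions that match the hidden ground-truth labels."""
    if len(predictions) != len(truth):
        raise ValueError("submission length does not match the test set")
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


def build_leaderboard(submissions, truth):
    """Score every team and return (team, score) pairs, best score first."""
    scores = {team: accuracy(preds, truth) for team, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical hidden test labels held by the competition organizer.
hidden_labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical submissions from three registered teams.
submissions = {
    "team_a": [1, 0, 1, 0, 0, 0, 1, 0],
    "team_b": [1, 1, 1, 1, 0, 1, 1, 1],
    "team_c": [1, 0, 1, 1, 0, 0, 1, 0],
}

leaderboard = build_leaderboard(submissions, hidden_labels)
for rank, (team, score) in enumerate(leaderboard, start=1):
    print(rank, team, score)
```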

Some recent competitions posted on the Kaggle site are shown below.

Source: Kaggle

As can be seen from the competitions above, several key industry leaders are leveraging Kaggle to address critical business needs. One challenge with Kaggle training datasets is that they are rarely clean and need to be cleansed before predictive models can be built. This does not seem to deter competitors and businesses from participating, and Kaggle has one of the largest numbers of participants among machine learning model building competitions.


Numerai is a $15 billion asset management/hedge fund that uses crowdsourced competitions to source predictions that it uses internally to make trades. Its claim to fame is bringing network effects to the financial industry, as quoted below.

“The most valuable hedge fund in the 21st century will be the first hedge fund to bring network effects to capital allocation.” – Numerai

Numerai’s focus is on pure prediction (e.g. predicting stock prices) based on the models generated. Numerai also encrypts its financial data, buys the rights to the data and cleans it before presenting it to competitors. In this sense the data is already cleansed, and unlike on Kaggle, competitors need not worry about data cleansing.

Based on data reported by Numerai, prediction error rates appear to be falling steadily from a starting point of 0.5 (random prediction).
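To see why 0.5 is a natural starting point, consider a yes/no target where both outcomes are about equally likely: guessing at random is wrong roughly half the time. The sketch below simulates this under that assumption. The source does not state which error metric Numerai reports, so the plain misclassification rate here, along with the data and the 60% "informed" model, is purely illustrative.

```python
import random


def error_rate(predictions, truth):
    """Fraction of predictions that do not match the true labels."""
    wrong = sum(p != t for p, t in zip(predictions, truth))
    return wrong / len(truth)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Assumed balanced binary target (e.g. "up" = 1, "down" = 0).
truth = [rng.randint(0, 1) for _ in range(n)]

# A model with no information simply guesses.
random_guess = [rng.randint(0, 1) for _ in range(n)]

# A hypothetical model that is right about 60% of the time.
informed = [t if rng.random() < 0.6 else 1 - t for t in truth]

baseline_error = error_rate(random_guess, truth)
informed_error = error_rate(informed, truth)
print(round(baseline_error, 3), round(informed_error, 3))
```

Any error rate meaningfully below the random baseline indicates that a model has picked up some real signal.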

Numerai is disrupting the hedge fund industry by crowdsourcing its investment decisions to anonymous machine learning experts around the globe. If Numerai succeeds, it could pave the way for a new investment management model in the financial industry.

DREAM Challenges

DREAM Challenges is part of an open science mission and is a non-profit community effort that includes researchers from universities. Its focus is mainly on biology and medicine.

“As the volume and complexity of data continues to increase, it is critical to develop new methods to use data to address fundamental questions to better understand and improve biological sciences and human health.” – DREAM Challenges

DREAM challenges are created and managed by experts in systems biology, statistics, and challenge design so that the results are consistent and reproducible in a meaningful way. Organizers of the competition pre-test all data and predictions and develop custom scoring methodologies to ensure quality data and rigorous performance evaluation. Similar to Numerai, training data is cleansed and provided by in-house domain experts, while competitors come up with predictive models that organizers evaluate for accuracy and reliability.

As depicted by some recent DREAM challenges below, the DREAM community is trying to leverage the wisdom of the crowd to find new and better computational models for solving fundamental problems in biological sciences and human health.

Source: DREAM Challenges
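One simple way to picture the "wisdom of the crowd" in a modeling context is to average the predictions of several independent models. The sketch below uses invented numbers and a plain mean; it is not DREAM's scoring or aggregation method. Averaging is guaranteed to do no worse than the average individual model on mean squared error, though it will not always beat the single best model.

```python
def mean_squared_error(predictions, truth):
    """Average squared difference between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, truth)) / len(truth)


def average_ensemble(all_predictions):
    """Average the predictions of several models, item by item."""
    return [sum(values) / len(values) for values in zip(*all_predictions)]


# Hypothetical true values and predictions from three competitors' models.
truth = [0.2, 0.8, 0.5, 0.9]
model_predictions = [
    [0.1, 0.9, 0.4, 0.7],
    [0.4, 0.6, 0.7, 1.0],
    [0.3, 0.7, 0.3, 0.8],
]

individual_errors = [mean_squared_error(p, truth) for p in model_predictions]
ensemble_error = mean_squared_error(average_ensemble(model_predictions), truth)

print([round(e, 4) for e in individual_errors])
print(round(ensemble_error, 4))
```

In this made-up example the models err in different directions, so their mistakes partly cancel out when averaged.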


Platforms like Numerai, Kaggle and DREAM Challenges are helping firms address the complexity of ML model building by creating a statistical black box model that hides the underlying complexities. On the data sourcing front, platforms like AMT and CrowdFlower are trying to address data quality issues through the manual efforts of a crowdsourced labor pool.

Firms have to embrace open data principles to take full advantage of the crowdsourced model. Data privacy is a key challenge that prevents firms from fully embracing it. This concern is especially acute for for-profit businesses, where data may be proprietary. One possible solution is to adopt Numerai's model of encrypting the data so that data privacy concerns can be addressed. As AI becomes ubiquitous, vertical niche crowdsourcing models and Machine Learning as a service could be the next big wave in this space.
