To use an ML or AI model effectively, it is crucial to have high-quality training data. No matter how efficient or accurate the model architecture is, a model trained on a poor-quality dataset will never produce the desired output. One important quality attribute of any dataset, regardless of the problem, is diversity. In this tutorial, we will discuss why diverse data matters for your models and the different ways in which you can introduce diversity and variability into your datasets. So, without wasting any time, let’s begin!
Why is Diversity in a Dataset Important?
We have mentioned diverse data, but what exactly is diversity and why do we even need it? Diversity is simply the variety present in your data. When solving a problem with ML or AI models, the data we collect can be enormous, and that can severely impact the model’s performance and training time. What to do then? You can certainly cut down the data so that your model processes it faster, but removing data also discards valuable information, which will reduce the accuracy of your model. The question, then, is how to find a middle ground: a dataset small enough for the model to process in a reasonable amount of time, yet varied enough to cover the full range of cases the intended system will confront. The answer is simple: DIVERSITY.
When datasets are too large, often the only practical option is to extract much smaller subsets and analyze those instead. The subsets, however, need to be diverse enough that the model can learn to handle all the different cases of the problem it is trying to solve. Working with a diverse subset is far more practical than working with, say, a million data points, which can be impossible to process on a desktop computer. Take the example of a face recognition and classification model. If it is trained on a dataset of images in which, for each person, pictures are taken from different angles, under different lighting conditions, at varying distances from the camera lens, and against contrasting backgrounds, it is far more likely to classify faces accurately than if it were trained on thousands of near-identical images. In short, representative and diverse datasets are more likely to yield useful insights than those that do not cover all facets of the problem at hand.
How to Introduce Diversity in a Dataset?
Diversity in a dataset can be achieved in a number of ways. If you are collecting existing data, you can make your dataset more diverse by gathering relevant data from several different sources rather than just one. Keeping the context of the problem the model needs to solve in mind also helps you eliminate unsuitable sources, leaving only a handful of genuinely useful ones. There are many open-dataset repositories that you can draw from.
If you are gathering the data items on your own, e.g. you are taking pictures for an image classification problem, you can make sure that your dataset is as diverse and variable as possible by:
- Taking pictures at different angles.
- Taking pictures under different lighting conditions.
- Taking pictures at varying distances between the camera lens and the object in question.
- Varying the object size and shape if possible and then taking pictures.
- Changing the background of the object in question and then taking pictures.
- In the case of a colored object, taking pictures of it in different colors.
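Some of this variability can also be simulated in software after the pictures are taken. The article does not prescribe any particular tool, so as a minimal sketch the snippet below treats a grayscale image as a plain 2-D list of brightness values and generates flipped and re-lit variants in pure Python; a real pipeline would use an image-processing library instead.

```python
def flip_horizontal(image):
    """Mirror each row, simulating a picture taken from the opposite angle."""
    return [row[::-1] for row in image]

def adjust_brightness(image, factor):
    """Scale pixel values, simulating different lighting conditions."""
    return [[min(255, int(p * factor)) for p in row] for row in image]

def augment(image):
    """Produce a few varied copies of a single image."""
    variants = [image, flip_horizontal(image)]
    for factor in (0.6, 1.4):  # a darker and a brighter version
        variants.append(adjust_brightness(image, factor))
    return variants

# A tiny 2x3 grayscale "image" (pixel brightness values, 0-255)
img = [[10, 50, 200],
       [30, 120, 250]]
for variant in augment(img):
    print(variant)
```

Each original photo thus yields several distinct training examples, which is the programmatic counterpart of varying the angle and lighting at capture time.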
The same concepts of diversity apply to datasets of other types and natures as well.
If you want a diverse subset of a large dataset, one approach is to build a similarity matrix: a large table that maps every point in the dataset against every other point. The cell at the intersection of one item’s row and another item’s column holds the pair’s similarity score under some standard measure. However, this approach can be very time-consuming and resource-intensive when we are talking about on the order of a million data items in a matrix. Alternatively, you can use algorithms designed to build diverse subsets, such as the MIT researchers’ algorithm. In this algorithm, a small subset is first chosen at random from the much larger dataset; the algorithm then repeatedly selects one point inside the subset and one outside it, also at random. It applies one of three simple operations, chosen on the basis of factors such as the size of the large set and the size of the subset itself: swapping the two points, adding the outside point to the subset, or deleting the inside point. This process continues until the subset is diverse enough to meet a certain measurable level.
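The algorithm is described above only at a high level, so the sketch below fills in the unspecified details with assumptions: diversity is measured as the mean pairwise Euclidean distance, the three operations are chosen uniformly at random, and a change is kept only if it does not reduce the diversity score. Treat it as an illustration of the idea, not the researchers’ actual implementation.

```python
import math
import random

def mean_pairwise_distance(points):
    """Diversity score: average Euclidean distance over all pairs of points."""
    if len(points) < 2:
        return 0.0
    total, count = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += math.dist(points[i], points[j])
            count += 1
    return total / count

def diversify_subset(dataset, subset_size, target, steps=5000, seed=0):
    """Randomly refine a subset until its diversity reaches `target`."""
    rng = random.Random(seed)
    subset = rng.sample(dataset, subset_size)
    for _ in range(steps):
        score = mean_pairwise_distance(subset)
        if score >= target:
            break  # diverse enough: stop refining
        inside = rng.randrange(len(subset))   # point inside the subset
        outside = rng.choice(dataset)         # point outside (or anywhere in) the full set
        candidate = list(subset)
        op = rng.choice(("swap", "add", "delete"))
        if op == "swap":
            candidate[inside] = outside
        elif op == "add":
            candidate.append(outside)
        elif op == "delete" and len(candidate) > 2:
            candidate.pop(inside)
        # keep the change only if it does not reduce diversity (an assumption)
        if mean_pairwise_distance(candidate) >= score:
            subset = candidate
    return subset

# Clustered 2-D data: a purely random subset tends to miss the small cluster
rng = random.Random(42)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(20)]
picked = diversify_subset(data, subset_size=10, target=6.0)
print(len(picked), mean_pairwise_distance(picked))
```

Because each operation is accepted only when the diversity score does not drop, the subset drifts toward covering both clusters rather than oversampling the dominant one.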
Fairness and Ethics
As mentioned above, one way to obtain diverse data is to collect it from different sources. While doing so, however, it is important to keep fairness, ethics, and good values in mind. If you are collecting data from, say, a website, you should first ask the owner of the data for permission before using it for your work or personal projects. You can do so formally by sending them an email, or by contacting them in any other way possible, rather than accessing the data without consent. You should also cite the sources from which you gathered the data in your formal documentation and anywhere else you can.
Given the nature of the process, crowdsourcing is a very efficient way to introduce diversity into your data. Here at DATUMO, we crowdsource our tasks to diverse users located globally to ensure quality and quantity simultaneously. Moreover, our in-house managers double-check the quality of the collected or processed data!
Creating and maintaining diversity in your dataset is not an easy task; keeping track of everything mentioned above is quite a burden. For small- to medium-sized companies especially, managing the necessary human resources and technical specialties is very challenging. It is therefore often more efficient to use a service that handles the laborious work, including both collection and preprocessing, for you. For that, we could be your perfect solution! Check us out at datumo.com for more information! Let us be your HELP!
To sum it all up, in this tutorial we started by discussing how important it is to have a dataset that meets a certain standard of quality, and how diversity is one very important constituent of a good-quality dataset. Generally, a good dataset contains plentiful training data, and diversity ensures that this data provides more discriminative information to the model so that it can predict results accurately. We then discussed ways to introduce diversity into a dataset, such as collecting data from different sources and using algorithms to derive a diverse subset from a large set of data. Lastly, we touched upon the ethics and code of conduct that should be adopted while introducing variability into one’s dataset.