Creating the Best Quality Image Dataset

Building a model is only part of solving a problem. A good-quality dataset is equally important: no matter how efficient or accurate your model is, it will never produce the desired output if it is given the wrong dataset.

A good dataset is crucial to achieving the highest possible accuracy with your model. It is also important that the dataset is processed in such a way that the model can make sense of the information and learn from it successfully. The goal of this tutorial is therefore to discuss ways to gather a dataset of raw images and then filter those images to create the best possible dataset for image classification and other computer vision projects. So let’s begin!

Step 1: Planning

In the planning phase, before actually collecting any images, you must assess the context of the problem you need to solve and then choose the best way to build a dataset for it. For example, if you are working on a common image classification project, there are many open datasets that you can use. Alternatively, you can take the pictures yourself or download them from a website. We’ll be discussing both of these ways to gather images for a dataset.

Step 2: Gathering Images

Taking Pictures on your own:

While taking pictures, keep the following pointers in mind so that your training dataset is as varied and diverse as possible:

  1. Take pictures of the object to be classified at different angles
  2. Change your lighting conditions
  3. Change the object size
  4. Vary the distance of your camera from the object
  5. Vary the background of the object
  6. Take good-quality images that are in focus
  7. If the object comes in different colors, take images of each color

Following these pointers will help to ensure that your image dataset is as realistic as possible. Training on such images tends to improve performance, since a more diverse dataset generally leads to higher accuracy on real-world inputs.

Downloading images through the Fatkun Batch Download Image extension:

Pre-requisites:

  1. Google Chrome browser. If you do not already have it installed, download and install it first.
  2. Fatkun Batch Download Image, an extension for Chrome. If you do not already have it, install the extension in Chrome first.

Steps:

  1. After you have finished the installation, open the website that contains the pictures you want.
  2. Click on the extension’s icon and choose whether to collect images from the current tab or from all open tabs.
  3. The extension opens a new tab showing all the images it has detected. By default, every picture shown on this tab is selected for download. Once you have made your selection, click on ‘save image’.
  4. The extension now displays a warning, and you may be asked where to save each file before it is downloaded, in which case you have to confirm each image.

This way, you can download the images automatically. The extension creates a new folder named after the title of the website and saves the downloaded images there. Under ‘more options’, you can also filter the images by link, rename them, and sort them by size.

Step 3: Image Filtering

After downloading the images in bulk, you will most likely notice at first glance that some of them are unclear, low in resolution, irrelevant, or duplicates of other images. It is therefore very important to remove such images first in order to construct the best possible image dataset for your model.

Deleting image duplicates

While constructing an image dataset, quality should take clear preference over quantity. If there appear to be a lot of exact duplicates, you should filter them out, for example by using a network such as ResNet18 to extract a feature vector from each image and treating images with identical feature vectors as duplicates. This is not very practical for large datasets, but the idea behind it matters: duplicated images may allow a model to cheat on performance metrics if copies end up in both the train and test splits, so it is good to reduce them as much as possible. Perceptual hashing (pHash) is another potential method for removing duplicates. Its drawback is that it sometimes flags non-duplicate images as duplicates, because the images are resized to a very small size and converted to grayscale before hashing, so there is a tradeoff.
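As a starting point, byte-identical copies can be found without any model at all. The sketch below is not the ResNet18 or perceptual hashing approach described above; it swaps in a plain SHA-256 file hash, which only catches exact duplicates. The function names `file_digest` and `find_exact_duplicates` are hypothetical and chosen for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_exact_duplicates(folder):
    """Group files under `folder` whose contents are byte-identical.

    Returns a list of groups; each group is a list of paths (sorted by
    path) that share one digest. Files without a copy are left out.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

In each returned group you would keep one file and delete the rest. Copies that were resized or re-encoded have different bytes and will not be caught this way; those cases still need feature vectors or a perceptual hash.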

Deleting very small images

Very small images carry little information and are mostly of poor quality, so it is important to remove them from the image set. A good practice is to set reasonable size thresholds and remove any image whose size falls below the minimum or far above the maximum. For example, most image models take inputs between 224×224 and 512×512 pixels, so a threshold in that range helps to cut out the low-quality images that you may have downloaded. This process can also be automated with a Python script, which saves time and reduces manual mistakes.
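A minimal sketch of such a script is shown below. It works on a mapping of file names to `(width, height)` pairs; in practice these would come from an image library (for example, Pillow's `Image.open(path).size`, assuming Pillow is installed). The names `size_ok` and `filter_by_size` are hypothetical, and the upper limit of 4096 pixels is an assumed value, not one given in this post.

```python
def size_ok(width, height, min_side=224, max_side=4096):
    """Return True if the shorter side is at least `min_side` and the
    longer side is at most `max_side` (max_side is an assumed default)."""
    return min(width, height) >= min_side and max(width, height) <= max_side


def filter_by_size(sizes, min_side=224, max_side=4096):
    """Split a {name: (width, height)} mapping into (kept, removed) lists."""
    kept, removed = [], []
    for name, (width, height) in sizes.items():
        if size_ok(width, height, min_side, max_side):
            kept.append(name)
        else:
            removed.append(name)
    return kept, removed


sizes = {
    "tiny.jpg": (64, 64),        # below the minimum
    "ok.jpg": (640, 480),        # within range
    "huge.jpg": (12000, 8000),   # far above the maximum
}
kept, removed = filter_by_size(sizes)
print(kept)     # ['ok.jpg']
print(removed)  # ['tiny.jpg', 'huge.jpg']
```

Checking the shorter side against the minimum also rejects long, thin images (for example 800×100), which usually lose most of their content when resized to a square model input.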

Manual Pruning

Manual pruning is not the most practical step, but at times it is unavoidable if you want to improve the quality of your dataset: no matter how much of the image cleaning process you automate, you cannot beat the human eye when it comes to judging image quality. The aim is to remove low-quality or irrelevant images from the different classes in your dataset. The results of this step carry over to the quality of the final model and its classes, so it is recommended to be very aggressive in deleting images if you want well-defined classes.

Some Other Key Pointers

Amount of Data

Choosing the right amount of data, i.e., the number of images that you use to train your model, is also an important factor to consider. As a rule of thumb, for Machine Learning projects the number of images per class should be at least 10 times the number of features. For Deep Learning projects, it should be at least 100 times the number of features.
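The rule of thumb above is simple arithmetic, sketched below. The function name `min_images_per_class` is hypothetical, and the code assumes the rule is read as a minimum number of images per class given the number of features; what counts as a "feature" for your model is something you have to decide yourself.

```python
def min_images_per_class(n_features, deep_learning=False):
    """Rule-of-thumb minimum images per class:
    10 x features for Machine Learning, 100 x features for Deep Learning."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    factor = 100 if deep_learning else 10
    return factor * n_features


# Example with a hypothetical model that uses 50 features.
print(min_images_per_class(50))                      # 500
print(min_images_per_class(50, deep_learning=True))  # 5000
```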

In addition, keep the following in mind:

  1. The sample quantities should be balanced among classes.
  2. The samples should represent the real situations in which you are going to apply your model. For instance, if you are training a face classifier for use in situations where faces are smaller than 30×30 pixels, you should include low-quality, low-resolution face images in the training stage.
  3. The samples should have the maximum possible variety. For instance, in face classification, your dataset should contain faces of people of different ages, ethnicities, and genders, under different illumination conditions and with different in-plane and out-of-plane orientations, etc.
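Whether the classes are balanced can be checked quickly once the images are on disk. The sketch below assumes a layout with one sub-folder per class (for example `dataset/cats`, `dataset/dogs`); that layout and the names `class_counts` and `imbalance_ratio` are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def class_counts(root):
    """Count the files in each class sub-folder of `root`."""
    counts = Counter()
    for class_dir in Path(root).iterdir():
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for p in class_dir.iterdir() if p.is_file()
            )
    return counts


def imbalance_ratio(counts):
    """Largest class size divided by the smallest.

    1.0 means perfectly balanced; larger values mean more imbalance.
    """
    sizes = list(counts.values())
    if not sizes or min(sizes) == 0:
        raise ValueError("every class needs at least one image")
    return max(sizes) / min(sizes)
```

For example, a dataset with 400 cat images and 200 dog images has an imbalance ratio of 2.0. How large a ratio is acceptable depends on the problem, so treat the number as a prompt to collect more images for the smaller classes rather than as a hard limit.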

Quality control within a company is often quite a burden, especially if your company is small or medium-sized, since having enough human resources is always a great challenge for companies of that size. It is therefore often more efficient to find a service that does the laborious work for you. We could be your perfect solution!

Here at DATUMO, we crowdsource our tasks to diverse users located around the globe to ensure both quality and quantity on time. Moreover, our in-house managers double-check the quality of the collected or processed data.

In this post, we talked about the different practices that you can follow to build the best quality image dataset for your Machine Learning or Deep Learning projects. We first talked about why it is important to plan and choose the best route for gathering images before jumping into the actual image collection process. Remember the context of the problem that you are trying to solve, and then choose a collection technique: you can gather the images on your own, or get them from a website or an outside source, e.g., an open dataset. We discussed the best practices to follow when taking the pictures yourself and also talked about how quick and easy it is to use the Fatkun Batch Download Image extension for Chrome to download images in bulk. Filtering the collected images by removing duplicates, deleting very small pictures, and pruning manually is key to a good dataset. Lastly, deciding on the right quantity of data, balancing the quantities across classes, introducing diversity in your samples, and keeping in mind the context in which the model will be used are the practices that, if followed thoroughly, will help you build the best quality image dataset.
