Diversity? Accuracy? Important Properties of your Dataset

We know that a dataset is essentially a collection of data. It can consist of tables, where each column represents a particular variable and each row represents a value of that variable, or of various documents and files. Regardless of the format, a good quality dataset is extremely important because it is directly linked to the performance of your algorithms, such as machine learning models: a model will fail to serve its purpose if the quality of its dataset is poor or not up to the mark. So in this tutorial, we will discuss some of the characteristics that a good dataset should possess. Let’s get started!

Prerequisites

Before you go ahead, note that there are a few prerequisites for this tutorial. To follow the code samples, you should have basic programming knowledge in any language (preferably Python) and be familiar with basic machine learning concepts. We will be using Google Colab to write the code in our examples, but you can work in any code editor of your liking.

Characteristics of a Good Quality Dataset

A dataset is of high quality if it fulfills its purpose and satisfies the requirements of the application or client that uses it. A good machine learning model is of no use if it is trained on poor quality data, so it is vital to have a dataset of high quality. A good quality dataset should ideally have the following properties:

1. High Accuracy:

Accuracy refers to the correctness of the data: how close a value is to the true value that correctly represents the problem. To analyze the accuracy of the information in your dataset, ask yourself whether the information correctly reflects the situation or problem at hand. For example, if you are dealing with a dataset of houses and some of the house sizes are given in square centimeters, or some of the house prices are negative, then that information is most likely inaccurate.
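
As a minimal sketch (with entirely hypothetical column names and values), such inaccuracies can often be flagged with simple range checks in pandas:

Python Code:
import pandas as pd

# Hypothetical house data; the -40 size and -1 price are deliberately invalid.
houses = pd.DataFrame({
    'size_sqft': [1200, 850, -40],
    'price': [250000, -1, 410000],
})

# Flag rows that fail basic sanity checks: sizes and prices must be positive.
suspect = houses[(houses['size_sqft'] <= 0) | (houses['price'] <= 0)]
print(suspect)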

2. Reliability:

Reliability is an important data attribute: it ensures that the different values in the dataset do not contradict each other and that, overall, the information the dataset contains can be trusted; it concerns the qualitative aspects of your dataset. A model trained on reliable data is more likely to make correct predictions than one trained on unreliable data. When measuring the reliability of your dataset, you should make sure that it does not contain:

Duplicated values

As repeated values, duplicates hold no importance: they provide no new information and should be removed. You can remove duplicate values in Python using pandas’ drop_duplicates function. Consider the example below, which contains information about students at some universities and includes duplicates.

# FirstName LastName     Sex  Age Degree  Graduation
0     Jamie   Fallon    Male   20     SE        2019
1      Erin   Silver  Female   23     EE        2020
2      Phil   Rhodes    Male   19     ME        2021
3     Jamie   Fallon    Male   20     SE        2019
4      Erin   Silver  Female   23     EE        2020
5     Jamie   Fallon    Male   22     SE        2020
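
The snippets below assume this table has been loaded into a pandas DataFrame named info; a minimal sketch of that setup:

Python Code:
import pandas as pd

# Build the example student table as a DataFrame named `info`.
info = pd.DataFrame({
    'FirstName': ['Jamie', 'Erin', 'Phil', 'Jamie', 'Erin', 'Jamie'],
    'LastName': ['Fallon', 'Silver', 'Rhodes', 'Fallon', 'Silver', 'Fallon'],
    'Sex': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'Age': [20, 23, 19, 20, 23, 22],
    'Degree': ['SE', 'EE', 'ME', 'SE', 'EE', 'SE'],
    'Graduation': [2019, 2020, 2021, 2019, 2020, 2020],
})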

To remove duplicate rows:

Python Code:
# Drop rows that are exact duplicates of an earlier row.
new_info = info.drop_duplicates()
print(new_info)
Output:
# FirstName LastName     Sex  Age Degree  Graduation
0     Jamie   Fallon    Male   20     SE        2019
1      Erin   Silver  Female   23     EE        2020
2      Phil   Rhodes    Male   19     ME        2021
5     Jamie   Fallon    Male   22     SE        2020

To remove duplicates on the basis of particular columns, specify a subset of columns whose values should be unique. In our example, there are two distinct Jamie Fallon records. If we want to remove the one who graduates later, we can do so with the following code:

Python Code:
# Sort so the earliest graduation comes first, then keep only the first
# record per FirstName, dropping the later duplicate.
info = info.sort_values('Graduation', ascending=True)
info = info.drop_duplicates(subset='FirstName', keep='first')
print(info)
Output:
# FirstName LastName     Sex  Age Degree  Graduation
0     Jamie   Fallon    Male   20     SE        2019
1      Erin   Silver  Female   23     EE        2020
2      Phil   Rhodes    Male   19     ME        2021
Missing or omitted values

Missing values should also be removed, or recoded, i.e. represented differently (see the fillna sketch after the example below). You can remove rows with missing values in Python using pandas’ dropna function. Consider the example below with some missing values:

# FirstName LastName     Sex  Age Degree Graduation
0      Erin   Silver     NaN   23     EE       2020
1      Phil   Rhodes    Male   19     ME       2021
2     Helen    David  Female   23     EE       2020
3     Jamie   Fallon    Male   22     SE        NaT
Python Code:
# Drop any row that contains at least one missing value (NaN/NaT).
new_info = info.dropna()
print(new_info)
Output:
FirstName LastName     Sex  Age Degree Graduation
1      Phil   Rhodes    Male   19     ME       2021
2     Helen    David  Female   23     EE       2020
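
Alternatively, missing values can be recoded rather than dropped. A minimal sketch using pandas’ fillna (the 'Unknown' placeholder is our own choice; the NaT in the date column would need a date-appropriate fill instead):

Python Code:
# Replace missing Sex values with an explicit placeholder category.
recoded = info.fillna({'Sex': 'Unknown'})
print(recoded)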
 
Special characters, punctuation, and stop words

In the case of textual data, special characters, punctuation marks, and stop words like ‘a’, ‘is’, ‘and’, ‘the’, etc. do not add any meaning to the text; the model cares less about the grammar of the text than about the content words, such as the nouns and adjectives used. They should therefore be removed from your textual data. This is part of text pre-processing, and for this you need to:

1. Make the necessary downloads and imports.
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# One-time setup: download the tokenizer model and the stop word list.
nltk.download('punkt')
nltk.download('stopwords')
2. Split the words in the text file into tokens, i.e. list items. Consider a file text_file.txt whose contents produce the tokens shown below:
Python Code:
# Read the raw text from the file before tokenizing it.
with open('text_file.txt') as f:
    text = f.read()
tokens = word_tokenize(text)
print(tokens)
Output:
['This', 'is', '@', '@', '@', '@', '@', 'a', 'random', 'text', ',', 'file', 'thAt', 'CONTAINS', '(', 'punctuation', ')', 'marks', ',', 'special', '#', 'characters', '!', '!', '!', "''", '"', 'and', 'spaces', 'is', 'for', 'the', 'purpose', 'of', 'understanding', 'how', 'important^^^', 'text', 'preprocessing', 'is', 'fOr', 'the', '&', 'quality', 'of', 'a', 'dataset', '.', 'A', 'high', 'quality', 'dataset', 'has', 'preprocessed', '&', 'clean', 'text..', '!', '!']
 
3. Remove all punctuation marks and special characters from the list.
Python Code:
# Build a translation table that maps every punctuation character to None,
# then strip punctuation from each token.
intermediate_step = str.maketrans('', '', string.punctuation)
punctuation_removed = [a.translate(intermediate_step) for a in tokens]
print(punctuation_removed)
Output:
['This', 'is', '', '', '', '', '', 'a', 'random', 'text', '', 'file', 'thAt', 'CONTAINS', '', 'punctuation', '', 'marks', '', 'special', '', 'characters', '', '', '', '', '', 'and', 'spaces', 'is', 'for', 'the', 'purpose', 'of', 'understanding', 'how', 'important', 'text', 'preprocessing', 'is', 'fOr', 'the', '', 'quality', 'of', 'a', 'dataset', '', 'A', 'high', 'quality', 'dataset', 'has', 'preprocessed', '', 'clean', 'text', '', '']
 
4. Remove all non-alphabets from the list of tokens.
Python Code:
# Keep only purely alphabetic tokens; this also drops the empty strings
# left behind by the punctuation removal above.
only_alphabets = [a for a in punctuation_removed if a.isalpha()]
print(only_alphabets)
Output:
['This', 'is', 'a', 'random', 'text', 'file', 'thAt', 'CONTAINS', 'punctuation', 'marks', 'special', 'characters', 'and', 'spaces', 'is', 'for', 'the', 'purpose', 'of', 'understanding', 'how', 'important', 'text', 'preprocessing', 'is', 'fOr', 'the', 'quality', 'of', 'a', 'dataset', 'A', 'high', 'quality', 'dataset', 'has', 'preprocessed', 'clean', 'text']
5. Convert all words to lower-case for consistency.
Python Code:
# Normalize case so that 'thAt' and 'that' count as the same word.
lowercase_tokens = [a.lower() for a in only_alphabets]
print(lowercase_tokens)
Output:
['this', 'is', 'a', 'random', 'text', 'file', 'that', 'contains', 'punctuation', 'marks', 'special', 'characters', 'and', 'spaces', 'is', 'for', 'the', 'purpose', 'of', 'understanding', 'how', 'important', 'text', 'preprocessing', 'is', 'for', 'the', 'quality', 'of', 'a', 'dataset', 'a', 'high', 'quality', 'dataset', 'has', 'preprocessed', 'clean', 'text']
6. Remove all stop words from the list.
Python Code:
# Filter out common English stop words from the lowercased tokens.
stop_words = set(stopwords.words('english'))
stop_words_removed = [w for w in lowercase_tokens if w not in stop_words]
print(stop_words_removed)
print(stop_words_removed)
Output:
['random', 'text', 'file', 'contains', 'punctuation', 'marks', 'special', 'characters', 'spaces', 'purpose', 'understanding', 'important', 'text', 'preprocessing', 'quality', 'dataset', 'high', 'quality', 'dataset', 'preprocessed', 'clean', 'text']

3. Consistency in Feature Representation:

Consistency in feature representation is an important characteristic of a good dataset as it ensures data compatibility. For this, you need to:

  • Convert non-numeric features to numeric, e.g. to perform matrix multiplication, the data needs to be numeric, as such operations cannot be performed on strings.
  • Resize inputs to a fixed size, especially for image models, which require all images in the dataset to be of the same size. This can be done with the Python Imaging Library (PIL for short) using its image resize function.
  • Normalize numeric features, i.e. rescale the values of numeric columns to a standard range, typically 0 to 1 or -1 to +1, without distorting the relative differences between values. This helps models perform better and improves overall accuracy. A minimal sketch of all three steps follows this list.
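
The sketch below illustrates all three points; the file name dog.jpg, the target size of 224x224 pixels, and the example values are all hypothetical:

Python Code:
import pandas as pd
from PIL import Image

# 1. Convert a non-numeric feature to numeric via one-hot encoding.
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'price': [10.0, 250.0, 90.0]})
df = pd.get_dummies(df, columns=['color'])

# 2. Resize an image to a fixed size (hypothetical file and size).
img = Image.open('dog.jpg')
img = img.resize((224, 224))

# 3. Min-max normalization: rescale a numeric column to the range [0, 1].
df['price'] = (df['price'] - df['price'].min()) / (df['price'].max() - df['price'].min())
print(df)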

4. Right Dataset Size:

The right quantity or size of the dataset is an extremely important characteristic that in turn affects its overall quality. No matter how efficient your model is, the dataset size can become a bottleneck for its accuracy. There is no hard and fast rule about the size of the dataset; it is specific to the type of problem you are trying to solve, so size is mostly a matter of good judgment and should be sufficient to yield the expected performance. A rough rule of thumb is that the dataset should contain at least an order of magnitude more examples than the model has trainable parameters. You can also split the data into training and testing sets in a ratio of 80/20 or 70/30, or use an alternative approach like k-fold cross-validation.
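
A minimal sketch of an 80/20 split using scikit-learn’s train_test_split; the data here is randomly generated purely for illustration:

Python Code:
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 5 features each, and binary labels.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# An 80/20 train/test split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)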

5. Diversity in Dataset:

Diversity is a critical factor in making a good quality dataset. Consider an image recognition and classification model. If the model is trained on a dataset of images showing different breeds of dogs, and for each breed there are images taken from different angles, under varied lighting conditions, from varying distances, against contrasting backgrounds, and showing tails, paws, etc. differently, then it is likely to classify dogs more accurately than a model trained on a dataset of near-identical images. In short, non-representative or non-diverse datasets are unlikely to provide useful insights compared to those that cover all facets of the problem in question.

6. Completeness:

Completeness refers to how comprehensive the dataset is. It is an important measure of data quality, as it ensures that every important and relevant piece of information is included in the dataset the model trains on. If the information is incomplete, the data may become unusable.

7. Up-to-date Data:

Having up-to-date data in your dataset is an important quality characteristic, because data that is not up to date may not be applicable to the current scenario or problem the model is intended to solve. Take the example of a model that predicts house prices based on their sizes in square feet. If its dataset contains house prices from the 1960s, it may not be able to predict house prices for the year 2020.

8. Relevance:

This refers to how important the data is to the problem at hand. If your dataset contains information that is irrelevant or unrelated to the problem you are trying to solve, you will not attain the desired results and will waste your time. Your dataset should contain strictly relevant information and should meet the requirements of its intended use.

Creating and maintaining a top quality dataset is not an easy task. Especially for small- to medium-sized companies, managing the necessary human resources and technical specialties is very challenging. Therefore, it is often more efficient to find a service that does the laborious work (both collection and preprocessing) for you. For that, we could be your perfect solution!

Here at DATUMO, we crowdsource our tasks to diverse users located globally to ensure quality and quantity simultaneously. Moreover, our in-house managers double-check the quality of the collected and processed data! Check us out at datumo.com for more information. Let us be your HELP!

To sum it all up, we discussed how important it is for a model to be fed a high-quality dataset, since data quality drives the quality of the overall machine learning model. We pointed out the main characteristics of a good quality dataset: it should be accurate, complete, reliable, up to date, diverse, and relevant. We also walked through the steps of text pre-processing. With these factors in place, we can be confident of building a high-performance machine learning dataset and reaping the benefits of a robust, accurate model trained on superior quality data.
