What is CAPTCHA, and how does it collect data?

You may not be familiar with the term ‘CAPTCHA’ itself, but you have almost certainly come across one at some point while surfing the internet. Let me remind you.

 

Now it looks familiar, doesn’t it?

 

You MUST have seen an image similar to this while trying to create an account on a website, posting a comment on a blog, or purchasing something online. Tests like the one above are called CAPTCHAs, short for Completely Automated Public Turing test to tell Computers and Humans Apart. In this tutorial, we will discuss what CAPTCHAs are, the main concepts behind these tests, and how they work in general. We will also cover the different types of CAPTCHAs and some of their applications, an immensely important one being data collection for Machine Learning or Artificial Intelligence projects. So without further ado, let’s get started!

What is a CAPTCHA, and Why is it needed?

A CAPTCHA is a test designed to differentiate between human beings and computers based on abilities that humans have but machines largely lack. While performing an activity such as signing up on a website, you are required to type in a given sequence of letters or digits, or to select a particular set of images, e.g., those showing cats, to prove that you are actually a human and not a machine. The premise is that machines struggle to make out distorted or jumbled text while humans can read it. For the visually impaired, there is Audio CAPTCHA, which plays garbled spoken letters that are meant to be hard for machines to identify. All of this is done to prevent bots, or automated computer programs, from getting unauthorized access to an application. A CAPTCHA stops bots from adding spam and malicious URLs to an application’s database, hampering an application’s performance, or endangering its security in any other way. In short, the goal is to stop anyone from exploiting the weaknesses or shortcomings of a system.

 

Different types of CAPTCHAs

There are several different types of CAPTCHAs. We will introduce the six most popular ones here.

1. Character or Text Recognition

 

This is the most common type of CAPTCHA: you are shown a sequence of slightly distorted characters and are asked to type them into the given input field.

 

Example of character or text recognition

2. Image Recognition

 

Image recognition CAPTCHA consists of different forms of image tests, which can include naming images, picking out certain images from a set, or identifying the odd one out. This type of CAPTCHA exploits bots’ weakness at image recognition problems, which are a human’s forte.

 

Example of image recognition

3. Audio CAPTCHA

 

As mentioned before, this type of CAPTCHA is for the visually impaired. It mostly consists of audio of spoken letters or numbers that are slightly distorted, which is meant to puzzle a bot while remaining easy for a human to comprehend.

 

Several forms of CAPTCHA support the audio type as well

4. Social media login

 

While trying to register or create an account on a website, you are sometimes asked to sign up using an existing account, e.g., your Facebook or Google (Gmail) account. The idea is that bots are far less likely to have established social media accounts, so this method makes it harder for them to register. It also saves genuine users time, as they do not need to create an account from scratch and can instead reuse an account they already have. Because these accounts are more likely to be genuine, this method of registering helps to increase a site’s security.

 
Example of social media login

5. Math problem

 

A form appears with a simple math problem for you to solve. Most of the questions involve simple arithmetic, e.g., 8 + 2, 11 - 9, etc. These can be quite tricky for bots because they have to perform not only image recognition but also semantic analysis of the numbers and symbols to solve the problem.
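To make the idea concrete, here is a minimal Python sketch of how a site might create and check such a question. The function names and the operand range are assumptions chosen for illustration, and rendering the question as a distorted image is left out.

```python
import random

def make_math_challenge(rng=None):
    """Return (question_text, expected_answer) for a simple arithmetic CAPTCHA."""
    if rng is None:
        rng = random.Random()
    a, b = rng.randint(1, 12), rng.randint(1, 12)  # assumed operand range
    if rng.choice("+-") == "+":
        return f"{a} + {b}", a + b
    # Put the larger operand first so the answer is never negative
    hi, lo = max(a, b), min(a, b)
    return f"{hi} - {lo}", hi - lo

def check_math_answer(submitted, expected):
    """Accept the answer only if it parses as an integer equal to the expected value."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False
```

The server would keep the expected answer (for example in the user's session) and compare it with whatever the user types into the form.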

 

6. Time-based CAPTCHA

 

Another type of CAPTCHA records the amount of time taken to fill out a form and uses that time to differentiate between a bot and a human. Bots typically fill out all the fields in a form almost instantly, whereas humans naturally take some time to enter the information.
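A time-based check can be sketched in a few lines of Python. The three-second threshold and the function name below are assumptions for illustration; a real form would need a threshold tuned to its length.

```python
MIN_FILL_SECONDS = 3.0  # assumed threshold; tune it for the form in question

def looks_like_bot(rendered_at, submitted_at, min_seconds=MIN_FILL_SECONDS):
    """Flag a submission that arrived faster than a human could plausibly type.

    Both arguments are timestamps in seconds recorded by the server.
    """
    return (submitted_at - rendered_at) < min_seconds

print(looks_like_bot(100.0, 100.4))  # True: filled in under half a second
print(looks_like_bot(100.0, 112.0))  # False: took twelve seconds
```

Recording both timestamps on the server matters here, since a value sent by the browser could simply be forged by the bot.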

How does CAPTCHA work?

CAPTCHA works by exploiting the differences between humans and automated computer programs. A CAPTCHA is supposed to be three things:

  1. Easy for a human to solve
  2. Hard for a bot to solve or understand
  3. Easy for a tester machine to create and grade

The different types of CAPTCHAs mentioned above set tasks that are challenging for computers but not so much for humans. A CAPTCHA test makes use of humans’ strengths in the domains of invariant recognition, segmentation, and context. Humans can identify characters even when they appear in different shapes or forms, or are arranged in an unusual pattern. Even if one character overlaps another, a human being can segment the characters and understand them in their proper context, whereas a computer finds it quite tough to do all of this at the same time.
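The third property, being easy for a machine to create and grade, can be illustrated with a short Python sketch of a text challenge. The alphabet, the length, and the function names are assumptions for illustration; the step that actually makes the test hard for bots, drawing the text as a distorted image, is not shown.

```python
import secrets

# Assumed alphabet: look-alike characters such as 0/O and 1/I/L are left out
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"

def new_challenge(length=6):
    """Create a random challenge string that the server keeps secret."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def grade(expected, submitted):
    """Grade the user's answer, ignoring case and surrounding whitespace."""
    return submitted.strip().upper() == expected.upper()
```

Creating and grading takes almost no work for the server; all of the difficulty for a bot lies in reading the distorted rendering of the string.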

Data Collection and CAPTCHA

Apart from differentiating between a human and a computer, a CAPTCHA can also help to gather data for training machine learning models. You could very well consider it a crowdsourcing technique for data collection and annotation. Think about it: the information gathered from, e.g., an image recognition CAPTCHA is a way of using human intelligence to annotate a dataset. When users select all the images that show a cat, they are in fact helping to build an annotated dataset of cat images. The same is true for text recognition. Thus, CAPTCHAs are not only effective in ensuring the security of a website but are also a useful method for creating annotated datasets for Machine Learning or AI models.
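As a rough illustration of how many individual answers could be turned into labels, here is a Python sketch that accepts a label for an image only when enough users agree on it. This is a generic majority-vote scheme, not the method of any particular CAPTCHA provider; the function name and both thresholds are assumptions.

```python
from collections import Counter, defaultdict

def aggregate_labels(responses, min_votes=3, min_agreement=0.7):
    """Turn (image_id, label) answers from many users into one label per image.

    An image is labeled only if it has at least `min_votes` answers and the
    most common label accounts for at least `min_agreement` of them.
    """
    votes = defaultdict(Counter)
    for image_id, label in responses:
        votes[image_id][label] += 1

    accepted = {}
    for image_id, counts in votes.items():
        label, top = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_votes and top / total >= min_agreement:
            accepted[image_id] = label
    return accepted

responses = [("img1", "cat")] * 4 + [("img1", "dog"), ("img2", "cat"), ("img2", "dog")]
print(aggregate_labels(responses))  # {'img1': 'cat'}
```

Here img1 is accepted because four of its five answers agree, while img2 is held back since it has only two conflicting answers so far.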

Advantages of CAPTCHA

  1. Data collection and annotation to create large annotated datasets for Machine Learning models.
  2. Prevention of fake registrations on a website, for example by allowing social media logins. Big websites like Facebook and Gmail also integrate CAPTCHAs to prevent fake registrations.
  3. Prevention of spam comments by only allowing humans to post a comment. This stops spammers from wrongfully raising their websites’ search engine ranks by flooding comment sections with links and from leaving fake product reviews.
  4. Increasing the security of online purchasing or shopping by ensuring that the buyers are only humans. At times, competitors of a business may use invalid names, emails, and shipping addresses to order its products so that it wastes time and money trying to deliver them.

To train industry-level algorithms, companies need data collection and annotation, both of which are often very challenging. Moreover, it is difficult to control quality in-house, especially if yours is a small or medium-sized company. Therefore, it is often more efficient to find a service that does the laborious work for you. We could be your perfect solution!

Here at DATUMO, we crowdsource our tasks to diverse users located around the globe to deliver both quality and quantity on time. Moreover, our in-house managers double-check the quality of the collected or processed data. Do you need data? Do you need preprocessed data? Let us know!

To sum it all up, we started off this tutorial by discussing what CAPTCHAs really are and why they are needed, i.e., to tell a human user apart from a bot and, in a nutshell, to prevent someone from exploiting the weaknesses of a site or application. We then touched upon the different types of CAPTCHAs, such as text, image, and audio, and talked about how a CAPTCHA plays on a human’s strengths and a computer’s weaknesses. We also learned that, besides ensuring a site’s security, the information gathered through CAPTCHAs can be an effective way of involving human intellect to create a full-fledged dataset for AI or ML. Lastly, we talked about the overall advantages of CAPTCHAs.
