KorQuad Dataset 2.0

This Korean question and answer dataset for web documents MRC was created by LG CNS and Datumo. We created 80,000+ question and answer pairs based on Wikipedia which results in total of 100,000+ pairs of questions and answers, including those from KorQuAD 1.0.

About the dataset

Machine Reading Comprehension(MRC) is a natural language processing project which requires the AI model to comprehend the given text and the question, and point out the answer within the text. This is the core technology of automated question answering technology. The Korean standard dataset for MRC is called KorQuAD 1.0 and this dataset can be used not only for training AI models, but also as an objective standard to assess the performance of different models.

The preexisting Korean dataset performed question answering for short articles such as those from Wikipedia or the news. However, most texts we face in the real world, such as the texts from the web, product manuals, contracts, charts, lists, and so on, are constructed with various types of sentences. The sentences vary in structure, length, and complexity and most of the times MRC is required to perform within the whole text, rather than a single paragraph. As such, there was a gap between the actual tasks that needed MRC and academic research which eventually led to a problem of not being applicable, despite the algorithm’s academic performance.

In order to solve this problem, LG CNS created the KorQuAD 2.0 to enable machine reading comprehension for texts and documents with various types of sentences. Datumo collected training data of 80,000 question and answer pairs formed from approximately 50,000 Wikipedia texts. Additional to the 20,000 KorQuAD 1.0 dataset, Datumo and LG CNS created a total of 100,000 datasets.

Credit: https://korquad.github.io/dataset/KorQuAD_2.0/KorQuAD_2.0_paper.pdf

Dataset Specification

Number of documents and questions

	Training	Validation	Assessment	Total
Documents	38,496	4,736	4,736	47,957
Questions	83,486	10,165	9,309	102,960

Ratio of answer types

Categorized according to the length of the answer

Ratio of answer types

	Short	Long
Text	Choose answer from paragraph	Choose whole paragraph as an answer
Table	Choose answer from table	Choose whole table as an answer
List	Choose answer from list	Choose whole list as an answer

Long answer examples

Subtitle repetition (38%)

Q. How was the relationship between Oscar Peterson and Norman Granz formed?

Title. Oscar Peterson – #biography – #Norman Granz

Subtitle variation (47%)

Q. Does Lee have any brothers or sisters?

Title. Lee – #siblings

Creation (15%)

Q. What is the name of the law practiced in order to protect cultural assets?

Title. Volcanic Caves of the Upper Geomunoreum Lava Tube System – #limitedaccess

*Long answers refer to the whole paragraph within the according title.

Short answer examples

Short Answers

Phrase variation (48.0%)

Q. What year did Korea sell bottled water for foreign tourists?

“… around 1988 Seoul Olympics, Korea allowed selling bottled water for foreign tourists, but was prohibited soon after. …

Word variation (15.4%)

Q. What is the name of the field manager of Lotte Chiba who got fired during season in 2009?

“…As the news regarding the dismissal of field manager Bobby Valentine was published, some of the fans…”

Sentence combinations (8.0%)

Q. What is the name of the Korean mobile device manufacturer that used “Don’t Cha” as their commercial music?

“… their debut single “Don’t Cha” ranked number one in the UK, Australia, Canada, etc… Also this song has been used as the commercial music for mobile devices of SKY, a Korean mobile device manufacturer…

Chart/ List

Q. Which party is the second runner-up part of?

James Earl Carter Jr. (Walter Frederick Mondale) / Democratic / GA, MN
Gerald Rudolph Ford Jr. (Robert Joseph Dole) / Republican/ MI, KS
Ronald Wilson Reagan (Robert Joseph Dole) / Republican/ CA, KS

Number of data

100,000 total

Number of participants

1,372 people

Project period

July to August 2019

LG AI & Big Data Research Center

Korean question-answer dataset

Impressive quality of various data

“With Datumo, we were able to efficiently collect KorQuad 2.0, a Question-Answer dataset in Korean. We loved the quality and diversity of data, collected from broad workers. Especially, Datumo’s user guideline for our task was very impressive, capturing our expectations into clear explanations for the workers.”

AI & Big Data Research Center

Collect

Documents from Wikipedia

Create Q&A pairs

Large pool of crowd-workers from Cash Mission*

Validation

Cross-inspection for every single data

*: Datumo’s own developed crowd-sourcing platform which consists of 190K+ crowd-workers.

Process of annotation

KorQuAD 2.0 has been created based on the data collected from Cash Mission. During this process, every crowd-worker was tested to validate their capabilities of forming appropriate MRC questions, before participation. Through our crowd-sourcing platform, we were able to have a variety of participants and ensure consistent data quality by utilizing our unique tutorial system**.

**: Datumo’s tutorial system consists of strict guidelines and tests in order to accurately assess the worker’s understanding of the project.

Collection of documents

Used Wikipedia to collect structured documents on various topics
Selected the top 150,000 Wikipedia documents sorted by [Page view] from June 2016 to May 2019, to collect documents that people find interesting
Added additional 50,000 documents in order to cover a broader range of text domain

Formation of question-answer pairs

Used Cash Mission, Datumo’s own developed crowd-sourcing platform
Provided workers with parts of documents instead of the whole article, in order to prevent workers forming Q&A pairs only from the beginning of from “easier” sections (avoid data bias)
Created Q&A pairs based on the length of the answers – Long/Short

Validation

Data quality assurance

Cross-validation carried out by 2-3 validators, per data
Disqualified validators with project success rate lower than 85%

Specific guidelines to enhance data quality

Screenshot of the guideline on Cash Mission

Each and every participant was required to take a test to take part in the actual labeling project. Participants were required to check if the question provided as an example was correct or not, along with their reasons for choosing the answer.

Data to solve pragmatic NLP problems

https://korquad.github.io/

KorQuAD 2.0 is currently open to anyone who needs it and is being used to measure the AI model performances of major companies such as Samsung SDS, Kakao, and Naver. It has expanded the range of machine reading from simple sentences to long, complex ones, which solves some of the major issues in the NLP industry. Moreover, as fair evaluation continues, it is founding the basis for developing more effective NLP models.

References

https://korquad.github.io/dataset/KorQuAD_2.0/KorQuAD_2.0_paper.pdf

https://korquad.github.io/

Your AI Data Standard

LLM Evaluation Platform

Learn more

About Datumo

Learn more

AI Dataset datumo KorQuad Machine Reading Comprehension MRC

KorQuAD Dataset 2.0

About the dataset

Dataset Specification

Number of documents and questions

Ratio of answer types

Long answer examples

Short answer examples

Number of data

Number of participants

Project period

LG AI & Big Data Research Center

Impressive quality of various data

Collect

Create Q&A pairs

Validation

Process of annotation

Collection of documents

Formation of question-answer pairs

Validation

Specific guidelines to enhance data quality

Data to solve pragmatic NLP problems

References

Your AI Data Standard

LLM Evaluation Platform

About Datumo

Related Posts

Datumo: Future of Trustworthy AI

Datumo Eval Use Case: IoT Chatbot Evaluation and Red Teaming

AI for The Underdogs #3