Breaking the Data Boundaries With General Self-Supervised Learning Approach
Understand how Data2vec achieves general self-supervision on speech, vision, and text data. Modern data is complex, diverse, and unsupervised. Data can have different modalities, such as text, image, and audio. In the last two decades, artificial intelligence (AI) has demonstrated...
Preprocessing and Augmenting Images for Classification in TensorFlow Keras
Image preprocessing and augmentation are important steps before you train statistical algorithms, e.g., machine learning models, on images. They can help increase dataset size by generating similar copies of images. They can also improve data versatility, which...
The Datasets You Need for Developing Your First Chatbot
Conversational AI assistants are everywhere! Chatbots’ fast response times benefit those who want a quick answer to something without having to wait for long periods for human assistance; that’s handy! This is especially true when you need some immediate advice...
Understanding Input Data Shapes For Neural Networks in TensorFlow Keras
Data comes in a variety of shapes and forms. A dataset can be as simple as a table of rows and columns or as complex as images and videos with multiple colour channels. In the...
Can Unsupervised Speech Recognition Eliminate Speech Data Annotation?
Learn how modern speech recognition algorithms like wav2vec can potentially recognize all languages in the world. Many AI researchers believe that we are decades away from achieving Artificial General Intelligence (AGI) or true human-like intelligence. Though far away, recent advancements in...
What is machine translation and how does it work?
Automatic, instant translation from one language to another once seemed like a gimmick straight out of science fiction. Today, machine learning has infiltrated nearly every industry and revolutionized language translation worldwide. What was once a distant dream is now an...
KorQuAD Dataset 2.0
This Korean question-and-answer dataset for machine reading comprehension (MRC) of web documents was created by LG CNS and Datumo. We created 80,000+ question-and-answer pairs based on Wikipedia, resulting in a total of 100,000+ pairs of questions and answers, including those...
AI for The Underdogs #3
Technology to tear down communication barriers. *This dataset has been collected and annotated as part of Datumo’s <2021 AI Training Data Sponsorship Program> and is downloadable from Datumo’s Open Datasets website. Datasets for those with visual and mobility impairment: How is...
Advancing AI Technology for Vulnerable Populations 2: Currency Data Collection
Technology for the marginalized 90 percent. *This dataset has been collected and annotated as part of Datumo’s <2021 AI Training Data Sponsorship Program> and is downloadable from Datumo’s Open Datasets website. Currency information datasets for people with visual impairment: How is...









