
Data Augmentation In Natural Language Processing

Data augmentation is the process of increasing the amount and diversity of data by creating slightly modified copies of existing examples. Deep learning relies heavily on data augmentation, particularly when a dataset is too small or too uniform to train a robust model.

In computer vision, data augmentation uses techniques such as zooming, rotation, and flipping. Data augmentation can also be applied to text, in the field of Natural Language Processing (NLP).

Artificial intelligence or AI has made significant changes to all industries and sectors. Businesses rely heavily on AI development services to enhance business operations, boost growth and productivity, and offer customers a better service.

As a consumer, you are likely to have come across AI-powered applications in different contexts. The customer service chatbot you interacted with on your favourite e-commerce platform, the voice-operated GPS system you have installed in your vehicle, and the speech-to-text software you use on your phone were all developed by AI service providers. And they all use NLP.

NLP gives machines or computers the ability to understand text and spoken words in a similar manner to humans. NLP uses computational linguistics, machine learning, and deep learning models. How can data augmentation be used in NLP?

In January 2019, Jason Wei and Kai Zou presented a paper titled ‘EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks’.

In the nine-page document, Wei and Zou presented Easy Data Augmentation or EDA, which consisted of four simple but powerful operations. “On five text classification tasks, we show that EDA improves performance for both convolutional and recurrent neural networks. EDA demonstrates particularly strong results for smaller datasets,” they stated.

The study found that, on average, training with EDA across five datasets while using only 50 percent of the available training set achieved the same accuracy as normal training with all available data.

Techniques

The study used four techniques. In synonym replacement (SR), words that were not stop words were chosen from the sentence, and each was replaced with a synonym selected at random. In random insertion (RI), a random synonym of a non-stop-word was inserted into a random position in the sentence. In random swap (RS), two randomly chosen words in the sentence had their positions swapped.

In random deletion (RD), each word in the sentence was removed with probability p.
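To make the four operations concrete, here is a minimal Python sketch of how they might be implemented, using NLTK's WordNet to supply synonyms. It is an illustration based on the paper's descriptions, not Wei and Zou's released code; the function names, the word-level handling, and details such as replacing every occurrence of a chosen word are simplifying assumptions.

import random

import nltk
from nltk.corpus import stopwords, wordnet

nltk.download("wordnet", quiet=True)    # synonym source for SR and RI
nltk.download("stopwords", quiet=True)  # stop-word list used to filter candidates

STOP_WORDS = set(stopwords.words("english"))

def get_synonyms(word):
    # Collect WordNet synonyms for a word, excluding the word itself.
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            candidate = lemma.name().replace("_", " ").lower()
            if candidate != word:
                synonyms.add(candidate)
    return list(synonyms)

def synonym_replacement(words, n=1):
    # SR: replace up to n non-stop-words with a randomly chosen synonym.
    new_words = words.copy()
    candidates = [w for w in words if w.lower() not in STOP_WORDS]
    random.shuffle(candidates)
    replaced = 0
    for word in candidates:
        synonyms = get_synonyms(word.lower())
        if synonyms:
            new_words = [random.choice(synonyms) if w == word else w
                         for w in new_words]
            replaced += 1
        if replaced >= n:
            break
    return new_words

def random_insertion(words, n=1):
    # RI: insert a synonym of a random non-stop-word at a random position, n times.
    new_words = words.copy()
    for _ in range(n):
        candidates = [w for w in new_words if w.lower() not in STOP_WORDS]
        random.shuffle(candidates)
        for word in candidates:
            synonyms = get_synonyms(word.lower())
            if synonyms:
                position = random.randint(0, len(new_words))
                new_words.insert(position, random.choice(synonyms))
                break
    return new_words

def random_swap(words, n=1):
    # RS: swap the positions of two randomly chosen words, n times.
    new_words = words.copy()
    for _ in range(n):
        if len(new_words) < 2:
            break
        i, j = random.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def random_deletion(words, p=0.1):
    # RD: remove each word independently with probability p, keeping at least one.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]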

Wei and Zou used the sentence “A sad, superior human comedy played out on the back roads of life” as an example to demonstrate the four techniques.

Replacing “sad” and “back” with synonyms (SR), the sentence was converted to “A lamentable, superior human comedy played out on the backward road of life.”

Using the RI technique, the sentence became “A sad, superior human comedy played out on funniness the back roads of life.”

Random swap gave them the sentence “A sad, superior human comedy played out on roads back the of life.”

The fourth technique, random deletion, changed the sentence to “A sad, superior human out on the roads of life.”
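As a rough demonstration, the sketch above can be applied to the same sentence. The operations make random choices, so the outputs vary from run to run and will not reproduce the paper's hand-picked examples.

import random
import re

random.seed(0)  # fix the random choices so the demo is repeatable

sentence = "A sad, superior human comedy played out on the back roads of life"
words = re.findall(r"[A-Za-z]+", sentence)  # crude tokenisation; punctuation dropped

print("SR:", " ".join(synonym_replacement(words, n=2)))
print("RI:", " ".join(random_insertion(words, n=1)))
print("RS:", " ".join(random_swap(words, n=1)))
print("RD:", " ".join(random_deletion(words, p=0.1)))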

Limitations

One of the main limitations of the approach was that performance gain was marginal with sufficient data. The paper states that the average performance gain was less than one percent for the five classification tasks when training with full datasets.

“And while performance gains seem clear for small datasets, EDA might not yield substantial improvements when using pre-trained models,” Wei and Zou wrote.

Application

“Our paper aimed to address the lack of standardised data augmentation in NLP (compared to vision) by introducing a set of simple operations that might serve as a baseline for future investigation,” Wei and Zou wrote.

The authors went on to say that much recent work in NLP had focused on making neural models larger or more complex, whereas EDA does the opposite.

“We introduce simple operations, the result of asking the fundamental question, how can we generate sentences for augmentation without changing their true labels?” they wrote, adding that they do not expect EDA to be the go-to augmentation method for NLP.

Wei and Zou anticipated that easy-to-use, higher-performing augmentation techniques would follow. Such techniques are bound to shape how any natural language processing company uses data augmentation.