Robustness of LLMs to Perturbations in Text

Ayush Singh, Navpreet Singh, Shubham Vatsal
inQbator AI at eviCore Healthcare
Evernorth Health Services
[email protected]
Abstract

Having a clean dataset has been the foundational assumption of most natural language processing (NLP) systems. However, properly written text is rarely found in real-world scenarios, which often invalidates this assumption. Recently, large language models (LLMs) have achieved remarkable performance on a wide array of NLP tasks. Nevertheless, the degree to which these LLMs are robust to semantics-preserving morphological variations in text has been sparsely studied. In a world becoming increasingly dependent on LLMs for most NLP tasks, it becomes crucial to know their robustness to the numerous forms of noise found in real-world text. In this work, we systematically evaluate LLMs’ resilience to corrupted variations of the original text. We do so by artificially introducing different levels of noise into the datasets we study. We show that, contrary to popular belief, generative LLMs are quite robust to commonly found perturbations in text. Additionally, we test LLMs’ performance on multiple benchmarks, achieving a new state of the art on the task of grammar error correction. To empower future research, we are also releasing a dataset annotated by humans stating their preference for LLM vs. human-corrected outputs, along with the code to reproduce our results.



1 Introduction

Modern natural language processing (NLP) pipelines have depended heavily on the input data being as clean as possible. While this assumption holds in well-curated settings, it is often invalidated in real-world settings, leading to brittle systems that work well in laboratories but break down on noisy, naturally occurring data Wu et al. (2021). Consequently, there emerges a need to test any new innovation in the field of NLP for its robustness against the several different forms of noise that are pervasive in real-world scenarios Galliers and Spärck Jones (1993).

Noise in real-world datasets can originate from a plethora of sources. Some errors originate from human factors, such as spelling or grammatical mistakes, while others are machine-induced, such as errors from optical character recognition (OCR) or automated speech recognition (ASR) systems. These various forms of noise have been a significant impediment to deploying systems built on cleaner datasets in real-world settings. For example, even notes taken by native speakers oftentimes include spelling and grammatical mistakes, while text written by non-native speakers often misses determiners and exhibits numerous forms of valid but orthogonal variations. Their impact on downstream performance can vary in degree, ranging from a slight change in prediction probability to completely flipping the polarity or semantic meaning of a text (see Table 1 for examples). The field of NLP that deals with detecting when a meaning has changed or shifted is known as Lexical Semantic Change (LSC) detection Gulordava and Baroni (2011). Although LSC approaches help detect this shift, they do not indicate what changes can be made to move from one form to another, such as translating an incorrect text into a lexically and grammatically correct one without changing the meaning.

Type | Error | Correct
Agreement | He enjoys reading novels, play chess, and watch movies on weekends. | He enjoys reading novels, playing chess, and watching movies on weekends.
Determiner | They enjoy the sushi for dinner | They enjoy sushi for dinner
Morphological | I do not swim as well as he do. | I do not swim as well as he does.
Multiple | I sea the see from the seasoar | I saw the sea from the seesaw
Preposition | She put the book in the table. | She put the book on the table.
Punctuation | My favorite fruits are apples bananas and oranges | My favorite fruits are apples, bananas, and oranges.
Syntax | She the store went to. | She went to the store.
Tense/Aspect | I has been to London last year. | I went to London last year.
Unidiomatic | She was head over heels in the clouds when she received the promotion. | She was on cloud nine when she received the promotion.
Table 1: Examples of various types of errors commonly found in real-world natural language.

Traditionally, machine learning (ML) systems have handled noise in text by using data cleaning pipelines comprising multiple phases, the most important one being the grammar error correction (GEC) phase. Bryant et al. (2023) pointed out that GEC is a misnomer and has lately been more generally referred to as language error correction (LEC), which involves not only grammatical mistakes such as improper subject-verb agreement, but also spelling or typing errors and other forms of errors. However, LEC is not an easy task, which is why ML has been employed for it. Although ML has advanced the state of LEC, it has not solved it entirely. In the last decade, the emergence of advanced methods like subword embeddings has increased robustness to noise and hints at a future where downstream systems are so robust to noise that LEC might not be needed anymore. Lately, transformer-based large language models (LLMs) have shown great promise in this area. With LLMs being used predominantly in all aspects of NLP, it is imperative to evaluate the robustness of LLMs on the fundamental task of LEC, especially when recent work has shown a high degree of sensitivity of these systems to even word-level perturbations Srivastava et al. (2020); Wang et al. (2023).

LLMs have shown remarkable performance in most avenues of NLP tasks. This unprecedented gain comes from learning from a large corpus of natural language in an auto-regressive predictive manner. LLMs are trained on a mixture of clean and noisy text which allows them to incorporate robustness towards minor irregularities in text. However, a comprehensive evaluation of the degree to which they are robust to this noise has not been done yet. Tangentially, there has been a resurgence in the application of LLMs to LEC tasks as well Bryant et al. (2023).

In this work, we systematically measure the degree to which robustness to semantic-preserving corruption holds for LLMs. We call a set of one or more corruptions semantic-preserving if a human is still able to equate the original and the corrupted text. We measure robustness by comparing the internal LLM encoding of the clean text with that of its corresponding corrupted version. We corrupt the text synthetically with increasing degrees of severity, in the form of individual perturbations as well as combinations of them. Controlling the level of corruption gives us a better grasp of which aspects of noise affect LLMs. Additionally, we also measure LLMs’ performance on the downstream tasks of LSC and LEC, which capture nuances of errors commonly found in the wild. Our contributions are threefold: 1) using real and synthetic datasets, we show the extent to which this robustness holds in LLMs against various types of errors; 2) we report the performance of LLMs on downstream LSC and LEC tasks; and 3) we share a human-annotated dataset on the preference for LLM-corrected text vs. text corrected by humans.

2 Related Work

Noise.

Noise in natural language text has been a well-studied area of research for some time now. Research in this domain started with crude categorization of different types and progressed to more recent fine-grained specifications Lopresti (2008); Dey and Haque (2009); Passonneau et al. (2009); Xing et al. (2013); Al Sharou et al. (2021). Lopresti (2008) studied the negative effects of OCR system errors on NLP systems while Dey and Haque (2009) studied the negative effects of noise on text mining applications. With the advent of LLMs, Srivastava et al. (2020); Wang et al. (2023); Náplava et al. (2021) studied the impact of noisy text on LLMs and showed that it degrades their performance. However, they did not evaluate the more recent generative LLMs.

Lexical Semantic Change.

LSC detection techniques fall into the following three categories: (i) semantic vector spaces, (ii) topic distributions, and (iii) sense clusters. Because LLMs encode information about the meaning of an input in their internal layers as dense representations, popularly known as embeddings, the first category of LSC detection via semantic vector spaces has gained traction recently Gulordava and Baroni (2011); Kim et al. (2014); Xu and Kemp (2015); Eger and Mehler (2017); Hamilton et al. (2016); Hellrich and Hahn (2016). Schlechtweg et al. (2020) proposed the 2020 SemEval task on unsupervised LSC detection, which spurred a lot of interest in the field. Since then, several methods have been proposed to detect LSC, ranging from distance metrics Rosenfeld and Erk (2018); Qiu and Xu (2022) to differences in contextual dispersion between the two vectors Kisselew et al. (2016). Even though the choice of distance metric depends on the underlying task at hand, the most common metric has been cosine distance. Martinc et al. (2019); Montariol et al. (2021) demonstrated the efficacy of leveraging contextualized embeddings in detecting diachronic semantic shifts on various corpora and languages. We apply their techniques with more recent generative LLMs. Furthermore, Schlechtweg et al. (2019); Shoemark et al. (2020) corroborated the efficacy of this approach through a systematic comparison of semantic change detection approaches with embeddings using cosine similarity. For an in-depth review of all approaches, please refer to the recent survey by Tahmasebi et al. (2021).

Datasets.

In order to measure the performance of the proposed LEC techniques, several benchmark datasets have been created. Yannakoudakis et al. (2011) created the first dataset by manually annotating scripts from ESOL examinations into 88 categories. Later, Ng et al. (2014) introduced the CoNLL-2014 Shared Task on Grammatical Error Correction. However, both of these datasets were low in volume and had some inherent problems in the annotation, as revealed by Bryant et al. (2019), who then introduced a new dataset as well as an LEC task named Building Educational Applications-2019. Napoles et al. (2017) also introduced an LEC dataset which was much cleaner and operated on the sentence level. While datasets created for LEC specifically measure the ability of a system to successfully perform LEC, they do not offer the flexibility to measure the correlation between mistakes and downstream performance. Synthetically created datasets fill this crucial gap left by organically derived datasets. Ko et al. (2023) showed the benefits of more synthetic datasets in the field of NLP, which let us measure certain aspects of models that would otherwise go undetected in organic datasets.

Large language models for LEC.

Recently, LLMs’ ability to handle a large number of NLP tasks has hinted at their potential success in GEC Raheja et al. (2023). At the same time, there is a large body of research showing LLMs’ sensitivity to noise, but not in a systematic manner Srivastava et al. (2020); Wang et al. (2023). Even LLM-based LSC models only allow detection of a semantic change, whereas there are times when one would also want to correct the change in one way or another. This is where LEC shines, as it involves automatically correcting the syntactical errors in a given text as well as possible. Several works have explored using automated methods for LEC, from rule-based heuristics Sidorov et al. (2013); Xing et al. (2013), to machine learning models Garg et al. (2021), and more recently neural network-based methods Malykh (2019); Zhang et al. (2022); Raheja et al. (2023). Recently, Bout et al. (2023) posed GEC as a sequence-to-sequence task where the encoder takes the erroneous text as input and the decoder decodes the edits that need to be made. Fang et al. (2023) evaluated ChatGPT on LEC and found that it has excellent capabilities in not only English but also multilingual corrections; however, they did not compare against the recent version of GPT, nor did they evaluate against open source models.

Prompting LLMs.

The primary means of interacting with LLMs has been to prompt them with an instruction. Additionally, Brown et al. (2020) found that supplying examples with the instruction helps improve performance. This is known as few-shot prompting, while the former is called zero-shot, as no examples are provided Kojima et al. (2022). Recently, Wei et al. (2022) found that a further performance boost can be achieved by asking LLMs to explain how they arrived at an answer, naming it Chain-of-Thought (CoT) reasoning. Though several works followed the aforementioned three prompting techniques Zhou et al. (2022a, b), none of them brought a dramatic improvement.

3 Methods

In this section, we elaborate on the LSC detection methods used to assess the degree of the effect of corruption on the LLMs as well as the techniques used to configure LLMs to perform LEC.

3.1 Problem statement

We hypothesize that any change in how an LLM encodes a sequence can be measured by calculating the difference in its dense representations. Therefore, we measure the difference between an originally corrupt or noisy text $\vec{x}$ and its corrected counterpart $\vec{y}$.

Our null hypothesis $H_0$ is therefore that the internal encoding of an LLM should be different for a text and its semantically similar but incorrect version. On the other hand, our alternative hypothesis $H_a$ states that there should be no difference in how an LLM encodes a text and its grammatically incorrect version, as the LLM has learned to extract semantic meaning despite problems in how a text is written. To test our hypothesis, we measure how distant the embedding of the corrupted text is from that of its correct counterpart. If this distance is not significant, then our null hypothesis is rejected.

3.2 Lexical Semantic Change (LSC) Detection

To measure the similarity of the dense representations $\vec{x}$ and $\vec{y}$, and hence to decide whether an LSC is detected, we use a standard metric called cosine similarity. Cosine similarity is the cosine of the angle between two vectors, computed as their dot product divided by the product of their lengths, as given by the following equation:

$$
Sim_{cosine}(\vec{x},\vec{y}) = \frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{N} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{N} x_i^2}\cdot\sqrt{\sum_{i=1}^{N} y_i^2}} \tag{1}
$$
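As a minimal sketch of Equation (1), the following computes cosine similarity for two plain Python vectors. The function name and the toy three-dimensional vectors are illustrative assumptions; in the experiments the vectors would be model embeddings with far more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (Equation 1)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Toy stand-ins for the embeddings of a clean and a corrupted sentence.
clean = [0.2, 0.7, 0.1]
corrupt = [0.25, 0.65, 0.05]
print(round(cosine_similarity(clean, corrupt), 3))
```

A value close to 1 means the two encodings point in nearly the same direction, which is the sense in which a clean and a corrupted text are treated as semantically unchanged.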

3.3 Perturbations

Even though benchmark LEC datasets pair an incorrect version with its corresponding human-corrected version, the benchmarks only capture certain aspects of the errors occurring in text. Fortunately, prior research has done the immense job of categorizing the types of errors pervasive in real-world data, which allows us to simulate these errors and measure the robustness of LLMs against them. To that end, among the numerous forms of errors, we used domain knowledge and heuristics to rank the errors and selected the most relevant ones. These errors are elaborated as follows:

OCR error

The OCR augmenter emulates common recognition errors, such as substituting the numeral "0" with the characters "o" or "O". This is achieved through a predefined mapping table that specifies the replacements for such commonly confused characters.
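The mapping-table idea can be sketched as follows. This is an illustrative stand-in, not the augmenter used in the experiments: the confusion table, the function name `ocr_perturb`, and the per-character probability are all assumptions made for the example.

```python
import random
from typing import Optional

# Hypothetical confusion table: each key maps to visually similar characters.
OCR_CONFUSIONS = {
    "0": ["o", "O"],
    "o": ["0"],
    "O": ["0"],
    "1": ["l", "I"],
    "l": ["1"],
    "5": ["S"],
    "S": ["5"],
}


def ocr_perturb(text: str, char_prob: float = 0.3, seed: Optional[int] = None) -> str:
    """Replace each mapped character with a look-alike with probability char_prob."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = OCR_CONFUSIONS.get(ch)
        if candidates and rng.random() < char_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)  # unmapped characters are never touched
    return "".join(out)


print(ocr_perturb("Room 105 is on floor 10", char_prob=0.5, seed=0))
```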

Spelling and keyboard mistakes

Spelling errors are introduced by substituting words with commonly misspelled alternatives, which are stored in a predefined mapping table. Keyboard input errors emulate mistyping by replacing random characters with those located within one key distance on the QWERTY keyboard layout.
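A rough sketch of the keyboard perturbation is given below. It approximates "one key away" on a simple grid of the three QWERTY letter rows and ignores the physical stagger between rows; the grid approximation, the names, and the probability parameter are assumptions for illustration.

```python
import random
from typing import Dict, List, Optional

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows: List[str]) -> Dict[str, List[str]]:
    """Map each key to the keys at most one row and one column away."""
    neighbors: Dict[str, List[str]] = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            adjacent = set()
            for rr in range(max(0, r - 1), min(len(rows), r + 2)):
                for cc in range(max(0, c - 1), min(len(rows[rr]), c + 2)):
                    if (rr, cc) != (r, c):
                        adjacent.add(rows[rr][cc])
            neighbors[key] = sorted(adjacent)
    return neighbors


NEIGHBORS = build_neighbors(QWERTY_ROWS)


def keyboard_perturb(text: str, char_prob: float = 0.3, seed: Optional[int] = None) -> str:
    """Replace letters with a neighboring key with probability char_prob."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = NEIGHBORS.get(ch.lower())
        if options and rng.random() < char_prob:
            new = rng.choice(options)
            out.append(new.upper() if ch.isupper() else new)  # keep the case
        else:
            out.append(ch)
    return "".join(out)


print(keyboard_perturb("She went to the store.", seed=7))
```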

Split, swap and delete word

The split augmenter randomly splits a word into two separate sub-words. The swap augmenter randomly exchanges adjacent words within the text. The delete augmenter removes words at random, simulating missing or omitted content.
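The three word-level operations can be illustrated with a few lines of code. This is a simplified sketch operating on a whitespace-tokenized list; the function names and the choice of one operation per call are assumptions, not the configuration used in the experiments.

```python
import random
from typing import List


def split_word(words: List[str], rng: random.Random) -> List[str]:
    """Split one randomly chosen word (length > 1) into two sub-words."""
    candidates = [i for i, w in enumerate(words) if len(w) > 1]
    if not candidates:
        return list(words)
    i = rng.choice(candidates)
    cut = rng.randint(1, len(words[i]) - 1)  # cut point strictly inside the word
    return words[:i] + [words[i][:cut], words[i][cut:]] + words[i + 1:]


def swap_adjacent(words: List[str], rng: random.Random) -> List[str]:
    """Exchange one randomly chosen pair of adjacent words."""
    if len(words) < 2:
        return list(words)
    i = rng.randrange(len(words) - 1)
    out = list(words)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def delete_word(words: List[str], rng: random.Random) -> List[str]:
    """Remove one randomly chosen word, keeping at least one word."""
    if len(words) < 2:
        return list(words)
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]


rng = random.Random(0)
tokens = "She went to the store".split()
for op in (split_word, swap_adjacent, delete_word):
    print(op.__name__, "->", " ".join(op(tokens, rng)))
```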

Contextual insert and substitute word

The contextual augmenter leverages word embedding models like Word2Vec Mikolov et al. (2013) or language models like BERT Devlin et al. (2018) to identify similar alternative words and substitute or insert them, enriching the vocabulary diversity within the text.

Substitute, insert, swap and delete character

This versatile character augmenter performs substitution, insertion, deletion, and swapping of random characters throughout the textual input.

Synonym and antonym swap

This augmenter utilizes WordNet and PPDB for the strategic replacement of words with their synonyms or antonyms. It conducts preliminary checks before swapping to ensure the appropriateness of the replacement. Words that serve as determiners (e.g., a, an, the, etc) and words without a synonym or antonym are excluded.

3.4 Prompting

We evaluate LLMs on the LEC task by prompting them to correct the errors in the dataset. Although there are numerous advanced forms of prompting, they have been found to be task and dataset dependent. Therefore, in this work, we kept prompting variations to a minimum. We ask the model to correct the language of the text with as little external knowledge as possible in the following prompt format:

You are an English language expert who is responsible for grammatical, lexical and orthographic error corrections given an input sentence. Your job is to fix grammatical mistakes, awkward phrases, spelling errors, etc. following standard written usage conventions, but your corrections must be conservative. Please keep the original sentence (words, phrases, and structure) as much as possible. The ultimate goal of this task is to make the given sentence sound natural to native speakers of English without making unnecessary changes. Corrections are not required when the sentence is already grammatical and sounds natural.

Here is the input sentence containing errors that needs to be corrected.

Input Sentence:

### {input_sentence} ###
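As a small illustration, the prompt above can be stored as a template and filled in per sentence. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and the sketch deliberately stops before any model call, since the client code used in the experiments is not described here.

```python
PROMPT_TEMPLATE = (
    "You are an English language expert who is responsible for grammatical, "
    "lexical and orthographic error corrections given an input sentence. "
    "Your job is to fix grammatical mistakes, awkward phrases, spelling "
    "errors, etc. following standard written usage conventions, but your "
    "corrections must be conservative. Please keep the original sentence "
    "(words, phrases, and structure) as much as possible. The ultimate goal "
    "of this task is to make the given sentence sound natural to native "
    "speakers of English without making unnecessary changes. Corrections are "
    "not required when the sentence is already grammatical and sounds "
    "natural.\n\n"
    "Here is the input sentence containing errors that needs to be "
    "corrected.\n\n"
    "Input Sentence:\n\n"
    "### {input_sentence} ###"
)


def build_prompt(sentence: str) -> str:
    """Insert one input sentence into the zero-shot correction prompt."""
    return PROMPT_TEMPLATE.format(input_sentence=sentence.strip())


print(build_prompt("She the store went to."))
```

Because no examples are included in the prompt, this corresponds to the zero-shot setting.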

4 Experiments

We split our experiments into increasingly complex phases, starting with single perturbations and ramping up to combinations of five perturbations. Perturbations are applied sequentially: each method augments with a probability of 30%, and at most 10 words can be modified by substitution, insertion, or deletion. Additionally, we discard any sample whose unigram Jaccard similarity coefficient with the original text is less than 0.7 after all the corruptions have taken place. The models we chose for comparing dense representations are the latest (fourth) version of GPT OpenAI (2023) with 8k context window (text-embedding-ada-002), the 7 billion parameter version of the open source model LLaMa 3 Touvron et al. (2023) with 4k context window (decoder head embedding), and a non-generative model, BERT Devlin et al. (2018), with 512 context window (CLS token embedding). We kept the hyper-parameters at their defaults, with temperature, frequency penalty, and presence penalty set to 0 and max tokens set to 1000. All of the aforementioned LLMs are based on similar transformer-based architectures Vaswani et al. (2017). Additionally, we evaluate LLMs on two recent LEC benchmark datasets that earlier works have not evaluated against.
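The sequential corruption and the Jaccard filter described above can be sketched as follows. Whitespace tokenization, lowercasing, and the helper names are assumptions for illustration; the 0.7 threshold is the one stated in the text.

```python
from typing import Callable, Iterable


def unigram_jaccard(a: str, b: str) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the unigram sets of two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def corrupt_sequentially(text: str, perturbations: Iterable[Callable[[str], str]]) -> str:
    """Apply each perturbation to the output of the previous one."""
    for perturb in perturbations:
        text = perturb(text)
    return text


def keep_sample(clean: str, corrupted: str, threshold: float = 0.7) -> bool:
    """Discard samples whose similarity falls below the threshold."""
    return unigram_jaccard(clean, corrupted) >= threshold


clean = "the cat sat on the mat"
corrupted = corrupt_sequentially(clean, [lambda t: t.replace("the mat", "teh mat")])
print(corrupted, keep_sample(clean, corrupted))
```

In this toy case the unigram sets share 5 of 6 distinct words, so the coefficient is 5/6 and the sample is kept.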

4.1 Datasets

For the LEC task, we use two benchmark datasets named JFLEG and BEA-19. The JHU FLuency-Extended GUG corpus (JFLEG) extends the GUG (Grammatical/Ungrammatical) corpus by Heilman et al. (2014) with a layer of annotation by four human annotators Napoles et al. (2017). The key differentiator of JFLEG is that the corrections were made with an inclination for fluency rather than minimal edits. This is done on the GUG corpus, which is a cross-section of ungrammatical data containing sentences written by English language learners with different L1s and proficiency levels. BEA-19 Bryant et al. (2019) was introduced as a shared task in the Building Educational Applications 2019 workshop. The BEA dataset contains essays on approximately 50 topics written by more than 300 authors from around the world (including native English speakers such as British and American undergraduates).

Apart from including both of the aforementioned datasets in our LSC task, we also generate a synthetic dataset for LSC: we use the IMDB movie review dataset Maas et al. (2011) and sub-sample 1000 reviews from it. The IMDB dataset consists of crowd-sourced movie reviews with varying ratings. Even though a rating is available for each review, we did not need it for our use case. For perturbations, we used the work by Ma (2019) to generate different semantic-preserving variations of a given text.

Dataset | # sentence pairs | Scorer
BEA-19 train | 561,410 | -
BEA-19 dev | 2,377 | ERRANT
BEA-19 test | 4,477 | ERRANT
JFLEG | 747 | GLEU
IMDB | 11,495 | -
Table 2: Corpus statistics of the datasets used in our experiments.

4.2 Annotation

Bryant et al. (2023) posited that most of the LEC evaluation metrics have been calibrated against human preferences, as it is challenging to evaluate the quality of a correction computationally even when ground truth is present. Taking this into account, we also set up an annotation task as a true measure of LLMs’ ability to perform GEC. We hypothesized that LLMs might perform LEC in a way that differs from how humans approach it.

We sub-sample records from both the JFLEG and BEA datasets to be corrected by both GPT and humans (for humans, we already have the ground truth). We focus more on JFLEG than on BEA, as the former has four expert human annotations per data point compared to one for BEA in the dev set. Additionally, the BEA dataset itself was corrected by the same population that wrote the incorrect sentences, which makes the ground truth annotations of relatively lower quality. Our annotators came from all over the world and held graduate degrees in STEM fields. We used the Label Studio platform and provided the following instructions to the annotators:

  • For a given incorrectly written English text, please select one of two corrections presented.

  • Select the one that deems most correct to you semantically, grammatically and syntactically.

  • When both outputs seem correct, pick the one whose words or syntax is as close to the original incorrect sentence.

  • When both corrections are same, tick the box that they are same.

5 Results

Figure 1: Cosine similarity of clean and corrupted text for all perturbations. Combinations are labeled with the first character of each perturbation (see Section 3.3 for more).
Perturbation | GPT μ | GPT σ | LLaMa μ | LLaMa σ | BERT μ | BERT σ
BEA+1P | 98 | 1.1 | 50 | 10.6 | 98 | 1.1
BEA+2P | 95 | 1.3 | 39 | 3.3 | 95 | 1.5
BEA+3P | 93 | 1.3 | 40 | 4.7 | 95 | 0.4
BEA+4P | 92 | 0.4 | 31 | 1.6 | 93 | 1.3
BEA+5P | 91 | 0.9 | 32 | 1.7 | 91 | 0.6
IMDB+1P | 98 | 0.6 | 83 | 9.0 | 99 | 0.5
IMDB+2P | 97 | 0.7 | 79 | 2.7 | 97 | 0.8
IMDB+3P | 96 | 1.2 | 78 | 3.9 | 98 | 0.1
IMDB+4P | 95 | 1.3 | 70 | 1.6 | 96 | 0.8
IMDB+5P | 94 | 3.4 | 70 | 1.5 | 95 | 0.4
JFLEG+1P | 92 | 3.5 | 11 | 22.9 | 93 | 3.3
JFLEG+2P | 83 | 1.6 | 6 | 1.4 | 84 | 2.6
JFLEG+3P | 84 | 2.1 | 5 | 1.9 | 85 | 0.4
JFLEG+4P | 79 | 0.8 | 8 | 2.4 | 82 | 0.8
JFLEG+5P | 79 | 0.9 | 6 | 2.7 | 80 | 1.1
Table 3: Mean (μ) and standard deviation (σ) of cosine similarity of LSC detection task for all three models.

As shown in Table 3, even after severe degradation of the original text, the embeddings of the clean and corrupted versions of the text remain highly similar throughout for GPT and BERT. This leads us to reject our null hypothesis $H_0$ and accept the alternative hypothesis $H_a$ that LLMs are robust to semantic-preserving variations of text. To our surprise, LLaMa did not fare well under different forms of perturbations, and the null hypothesis cannot be rejected for LLaMa. Additionally, as can be seen in Figure 1, both BERT Devlin et al. (2018) and GPT OpenAI (2023) remain fairly robust to different forms of perturbations, whereas LLaMa Touvron et al. (2023) in some cases simply collapsed at detecting similarity. We elaborate on the worse performance of LLaMa in Section 6.

System | JFLEG | BEA-19 test
TagGEC Stahlberg and Kumar (2021) | 64.7 | 70.4
T5-XXL Rothe et al. (2021) | - | 75.9
GECToR Tarnavskyi et al. (2022) | 58.6 | 73.2
BART Bout et al. (2023) | - | 75.9
ChatGPT Fang et al. (2023) | 61.4 | 36.1
LLaMa3-7b zero-shot | 51.9 | 35.2
GPT-4 zero-shot | 64.9 | 59.6
Table 4: Results for both datasets on pre-existing supervised methods and more recent unsupervised methods, where GLEU score is used for JFLEG and ERRANT for BEA-19 test set.

Apart from passing the LSC detection test, as shown in Table 4, the unsupervised GPT approach achieves a new state of the art on the JFLEG dataset. On the BEA-19 dataset, GPT surpasses the previous state of the art in the unsupervised setting, ChatGPT, by a significant margin of 23.5 points (59.6 vs. 36.1, a relative improvement of roughly 65%). LLaMa performed nearly on par with ChatGPT, trailing by about 1 point in the ERRANT $F_{0.5}$ score. Nevertheless, the performance of unsupervised models still lags behind that of models trained specifically on the BEA-19 dataset. The drastically varying results on the two datasets open the door to an error analysis of why performance soared on one while lagging on the other. We set up some probing and annotation tasks to investigate this in the following section.

6 Discussion

System | Text
Original | New and new technology has been introduced to the society.
Human 1 | New technology has been introduced into the society.
Human 2 | Newer and newer technology has been introduced into society.
GPT | New technology has been introduced to society.
GPT Formal | Society has witnessed the successive introduction of pioneering technologies.
GPT Emotional | The influx of groundbreaking technology into society has been both exhilarating and transformative.
GPT Professional | The continuous integration of new technology into society has been a significant and impactful progression.
GPT Casual | Loads of new tech keep popping up and changing how society rolls.
Original | i have studed for just examination wayse studed.
Human 1 | I have studied for just examination.
Human 2 | I have studied for just examination we studied.
GPT | I have studied just for the sake of examinations.
GPT Formal | I diligently prepared solely for the examination through focused study sessions.
GPT Emotional | I poured my heart into studying, purely for the examination’s sake.
GPT Professional | My preparation was exclusively geared towards the examination, with dedicated study efforts.
GPT Casual | I’ve only been studying for the exam, just hitting the books for that.
Table 5: Examples of how human and systems corrected the sentences from the JFLEG dataset.

The authors of BEA-19 use a span-based evaluation metric, $F_{0.5}$. In span-based evaluation, a system is only rewarded if a system edit exactly matches a reference edit in terms of both its token offsets and its correction string. This is a harsh metric, especially for a task like LEC, where there can be many valid corrections for any given input. Additionally, all of the metrics for GEC depend on comparing the correction with a human-corrected ground truth. This comparison inherently assumes that the human correction is the best possible correction, whereas our findings suggest that this is not always true. On the contrary, we argue that because LLMs have been trained on far larger datasets than humans can ever peruse, an LLM will have a better estimate of the correction. In some cases, we even found that GPT’s LEC was far better than the ground truth corrections themselves.
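To make the strictness of span-based scoring concrete, the sketch below scores a set of system edits against a single reference using the standard $F_\beta$ formula with $\beta = 0.5$, which weights precision more heavily than recall. It is a simplified illustration and not the ERRANT scorer itself; the edit representation as `(start, end, correction)` tuples and the function name are assumptions.

```python
from typing import Iterable, Tuple

Edit = Tuple[int, int, str]  # (start token offset, end token offset, correction string)


def f_beta_from_edits(system_edits: Iterable[Edit],
                      reference_edits: Iterable[Edit],
                      beta: float = 0.5) -> float:
    """Span-based F_beta: an edit counts only if offsets AND correction match exactly."""
    system, reference = set(system_edits), set(reference_edits)
    tp = len(system & reference)
    fp = len(system - reference)
    fn = len(reference - system)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


system = [(1, 2, "went"), (3, 4, "on")]
reference = [(1, 2, "went"), (3, 4, "onto"), (5, 5, "the")]
print(round(f_beta_from_edits(system, reference), 4))
```

In this toy case the second system edit targets the right span but proposes "on" instead of "onto", so it earns no credit at all, which is exactly the behavior that penalizes valid alternative corrections.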

To measure the aspect or novelty of GPT that pre-existing metrics were not able to surface, we set up a preference learning task among 3 annotators where each document was annotated thrice. We sampled 100 records from both the JFLEG and BEA-19 datasets, had them corrected by GPT, and compared these with the human corrections. To assess the reliability of the annotations, we use the Fleiss’ kappa score, which came to 0.62, and the inter-annotator agreement (IAA), which was found to be 76%. Furthermore, the annotators found the corrections of GPT and the human to be the same on average 11% of the time. There were also occurrences when both corrections seemed correct and annotators were therefore unable to decide; this happened 9% of the time. As shown in Table 6, analysing the annotations revealed that annotators preferred the correction by GPT over the human correction 73% and 68% of the time for the JFLEG and BEA-19 datasets, respectively. This shows the superior capabilities of GPT on tasks like LEC.
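For readers unfamiliar with the reliability measure, the following is a minimal sketch of Fleiss' kappa computed from a count matrix. The category layout (for instance GPT, human, same) and the toy counts are hypothetical and are not the annotation data from this study.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    if not counts:
        raise ValueError("at least one rated item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    total = n_items * n_raters
    p_cats = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: 5 items, 3 raters, hypothetical categories (GPT, human, same).
toy = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1], [0, 0, 3]]
print(round(fleiss_kappa(toy), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is reported alongside the simpler percentage agreement.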

Dataset | Ann 1 | Ann 2 | Ann 3 | Mean
JFLEG | 76.63 | 71.83 | 71.42 | 73.29
BEA-19 | 86.53 | 56.75 | 62.16 | 68.46
Table 6: Preference scores (%) of GPT over human correction for each of the three annotators.

Furthermore, unlike pre-existing LEC systems, GPT goes further by offering the ability to tune, by means of prompts, how the correction should be made, for example with a focus on fluency, formality, or strict grammar. Raheja et al. (2023) show the extent of the different types of edits that can be formulated by LLMs. Examples of this can be seen in Table 5.

On the one hand, GPT performed well on LEC and LSC; on the other hand, LLaMa did not perform as well, despite being an LLM. We hypothesized that this might be due to 1) LLaMa being a decoder-only architecture, whereas GPT’s embedding API might be using only an encoder model, and 2) LLaMa not being accustomed to operating on short or single-sentence inputs. We tested the latter hypothesis by setting up an experiment in which we combined up to ten sentences before passing them to LLaMa for LEC. As expected, the LSC detection performance increased by 54%.
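The grouping step in this experiment can be sketched as below. How sentences were actually grouped is not specified, so joining consecutive sentences with a single space, as well as the function name, are assumptions made for illustration.

```python
from typing import List


def chunk_sentences(sentences: List[str], max_per_chunk: int = 10, sep: str = " ") -> List[str]:
    """Join consecutive sentences into passages of at most max_per_chunk sentences."""
    if max_per_chunk < 1:
        raise ValueError("max_per_chunk must be at least 1")
    return [
        sep.join(sentences[i:i + max_per_chunk])
        for i in range(0, len(sentences), max_per_chunk)
    ]


sentences = [f"Sentence number {i}." for i in range(23)]
passages = chunk_sentences(sentences)
print(len(passages), "passages; last one:", passages[-1])
```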

The methodology for measuring the performance of LEC systems goes as far back as the LEC task itself, and it has always been called out as falling short of judging the true performance of a system Bryant et al. (2023). This is why most works ultimately aim to measure performance with human preferences. Similarly, in our study, we found that GLEU and ERRANT do not paint the complete picture of GPT. Our annotators reported that the ground truths themselves were oftentimes incorrect, and, as evident from the preferences, annotators preferred GPT’s corrections over those of the humans. Another important aspect to note here, apart from the ground truth being unclean, is that our study compares supervised with unsupervised methods. Supervised models use 70%-80% of the available data as their training set, which allows them to learn the nuances of the dataset, unlike GPT, for which all experiments were conducted in a zero-shot setting regardless of the incorrect ground truths. Therefore, even though GPT’s performance on the BEA dataset lags behind the state-of-the-art systems, we posit that this is not due to a lack of ability but rather to the state-of-the-art systems being trained to mimic human corrections, which in and of themselves are sub-optimal.

Unlike BEA-19, JFLEG uses GLEU as its evaluation metric, which is not only less stringent than $F_{0.5}$ but, as our experiments show, also aligns with human preferences better than $F_{0.5}$. To reiterate, in an LEC task there can be many valid corrections for any given input. The authors of JFLEG take this factor into consideration and provide four different variations of ground-truth corrections, which makes it quite suitable for modern-day generative LLMs.
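For reference, $F_{0.5}$ is the standard $F_\beta$ score with $\beta = 0.5$, which weights precision more heavily than recall. The sketch below computes it from precision and recall; the function name and the numeric inputs are illustrative only and are not results from this study.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was matched
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Made-up values: the same pair of numbers scores very differently
# depending on which one is the precision.
high_precision = f_beta(precision=0.8, recall=0.4)
high_recall = f_beta(precision=0.4, recall=0.8)
print(round(high_precision, 4), round(high_recall, 4))
```

With $\beta = 0.5$, a system that proposes many edits, some of which differ from the single reference, is penalized more than a conservative one, which is one way to read the "more stringent" behavior described above.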

7 Conclusion & Future Work

Discerning whether a system is robust to noise and, more importantly, whether it understands what a corrupted text means is foundational to NLP. By building systems that are, for the most part, immune to the noise occurring in real-world data, we make sure that our NLP systems are not fragile and do not exhibit unintended behavior when deployed in the wild. In this work, we set out to show that modern-day LLMs are largely unaffected by corruptions as long as the text remains semantically the same. We do this by combining two tangential fields of NLP: lexical semantic change (LSC) detection and language error correction (LEC). On the one hand, we used LSC techniques to show that the internal encoding of LLMs remains unchanged in response to corruptions in text; on the other hand, we showed that unsupervised LLMs can perform on par or even better, zero-shot, in the downstream task of LEC. We also share a preference dataset with the community. Our work paves the way for advanced LLM-based LEC systems as we depart from the predominant inclusion of the LEC module as part of standard NLP systems.

As part of future work, we plan to extend our study on several fronts. First, we aim to study LEC on longer passages and documents rather than only sentence-level corrections. Second, we aim to include machine translation as part of standard LEC practice and to motivate the community at large to consider this an important step forward for text normalization and the future of LEC systems. Third, we aim to refine the perturbation methods, as they can at times change the meaning of sentences. Finally, as discussed in Section 6, we would like to improve the state of open-source models like LLaMa to make further progress in unsupervised LEC, since, as our study shows, unsupervised LEC has produced novel corrections compared to those of humans.

8 Limitations

One drawback of using LLMs like GPT is the side effects incurred, i.e., unintended transformations being performed on the text. This gets even worse with smaller LLMs like LLaMa. As an example from the BEA-19 dataset, the sentence “Around the city, you can find many places where people throw frigo, kitchen, "amianto", old things or furniture.” contains a French word and an Italian word, frigo and amianto respectively. Even though we did not explicitly ask for it in the prompt shown in Section 3.4, when GPT corrected this sentence it automatically translated frigo and amianto into their English equivalents, fridge and asbestos respectively. This can be seen as both an advantage and a disadvantage: on the one hand, GPT went beyond LEC by implicitly performing multilingual machine translation as part of the LEC task; on the other hand, this transformation was not requested. We leave further investigation of this phenomenon for future work.
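One simple way to surface such side effects for manual review is to diff the input against the corrected output at the word level. The sketch below uses Python's standard `difflib`; it is not part of the study's pipeline, the helper name `word_replacements` is hypothetical, and the sentence pair is a shortened illustration rather than the actual model output.

```python
import difflib
from typing import List, Tuple


def word_replacements(source: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (original span, new span) pairs for every word-level change."""
    src_tokens = source.split()
    out_tokens = corrected.split()
    matcher = difflib.SequenceMatcher(a=src_tokens, b=out_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep replacements, insertions, and deletions
            changes.append((" ".join(src_tokens[i1:i2]), " ".join(out_tokens[j1:j2])))
    return changes


# Shortened, illustrative pair (an assumption, not real model output).
source = "people throw frigo and amianto away"
corrected = "people throw fridge and asbestos away"
print(word_replacements(source, corrected))
# [('frigo', 'fridge'), ('amianto', 'asbestos')]
```

A reviewer, or a downstream filter, could then decide whether each listed change is a genuine correction or an unrequested transformation such as translation.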

References

  • Al Sharou et al. (2021) Khetam Al Sharou, Zhenhao Li, and Lucia Specia. 2021. Towards a better understanding of noise in natural language processing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 53–62.
  • Bout et al. (2023) Andrey Bout, Alexander Podolskiy, Sergey Nikolenko, and Irina Piontkovskaya. 2023. Efficient grammatical error correction via multi-task training and optimized training schedule.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Bryant et al. (2019) Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 52–75, Florence, Italy. Association for Computational Linguistics.
  • Bryant et al. (2023) Christopher Bryant, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng, and Ted Briscoe. 2023. Grammatical error correction: A survey of the state of the art.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dey and Haque (2009) Lipika Dey and SK Mirajul Haque. 2009. Studying the effects of noisy text on text mining applications. In Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, pages 107–114.
  • Eger and Mehler (2017) Steffen Eger and Alexander Mehler. 2017. On the linearity of semantic change: Investigating meaning variation via dynamic graph models. arXiv preprint arXiv:1704.02497.
  • Fang et al. (2023) Tao Fang, Shu Yang, Kaixin Lan, Derek F. Wong, Jinpeng Hu, Lidia S. Chao, and Yue Zhang. 2023. Is ChatGPT a highly fluent grammatical error correction system? a comprehensive evaluation.
  • Galliers and Spärck Jones (1993) Julia Rose Galliers and K Spärck Jones. 1993. Evaluating natural language processing systems. Technical report, University of Cambridge, Computer Laboratory.
  • Garg et al. (2021) Siddhant Garg, Goutham Ramakrishnan, and Varun Thumbe. 2021. Towards robustness to label noise in text classification via noise modeling. In Proceedings Of The 30th ACM International Conference On Information & Knowledge Management, pages 3024–3028.
  • Gulordava and Baroni (2011) Kristina Gulordava and Marco Baroni. 2011. A distributional similarity approach to the detection of semantic change in the google books ngram corpus. In Proceedings of the GEMS 2011 workshop on geometrical models of natural language semantics, pages 67–71.
  • Hamilton et al. (2016) William L Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Cultural shift or linguistic drift? comparing two computational measures of semantic change. In Proceedings of the conference on empirical methods in natural language processing, volume 2016, page 2116. NIH Public Access.
  • Heilman et al. (2014) Michael Heilman, Aoife Cahill, Nitin Madnani, Melissa Lopez, Matthew Mulholland, and Joel Tetreault. 2014. Predicting grammaticality on an ordinal scale. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 174–180, Baltimore, Maryland. Association for Computational Linguistics.
  • Hellrich and Hahn (2016) Johannes Hellrich and Udo Hahn. 2016. Bad company—neighborhoods in neural embedding spaces considered harmful. In Proceedings of coling 2016, the 26th international conference on computational linguistics: Technical papers, pages 2785–2796.
  • Kim et al. (2014) Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. arXiv preprint arXiv:1405.3515.
  • Kisselew et al. (2016) Max Kisselew, Laura Rimell, Alexis Palmer, and Sebastian Padó. 2016. Predicting the direction of derivation in english conversion. In Proceedings of the 14th sigmorphon workshop on computational research in phonetics, phonology, and morphology, pages 93–98.
  • Ko et al. (2023) Ching-Yun Ko, Pin-Yu Chen, Payel Das, Yung-Sung Chuang, and Luca Daniel. 2023. On robustness-accuracy characterization of large language models using synthetic datasets. In International Conference on Machine Learning.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
  • Lopresti (2008) Daniel Lopresti. 2008. Optical character recognition errors and their effects on natural language processing. In Proceedings of the second workshop on Analytics for Noisy Unstructured Text Data, pages 9–16.
  • Ma (2019) Edward Ma. 2019. Nlp augmentation. https://github.com/makcedward/nlpaug.
  • Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
  • Malykh (2019) Valentin Malykh. 2019. Robust to noise models in natural language processing tasks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 10–16, Florence, Italy. Association for Computational Linguistics.
  • Martinc et al. (2019) Matej Martinc, Petra Kralj Novak, and Senja Pollak. 2019. Leveraging contextual embeddings for detecting diachronic semantic shift. arXiv preprint arXiv:1912.01072.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
  • Montariol et al. (2021) Syrielle Montariol, Matej Martinc, and Lidia Pivovarova. 2021. Scalable and interpretable semantic change detection. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4642–4652, Online. Association for Computational Linguistics.
  • Náplava et al. (2021) Jakub Náplava, Martin Popel, Milan Straka, and Jana Straková. 2021. Understanding model robustness to user-generated noisy texts. arXiv preprint arXiv:2110.07428.
  • Napoles et al. (2017) Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 229–234, Valencia, Spain. Association for Computational Linguistics.
  • Ng et al. (2014) Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14, Baltimore, Maryland. Association for Computational Linguistics.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report.
  • Passonneau et al. (2009) Rebecca J Passonneau, Cynthia Rudin, Axinia Radeva, and Zhi An Liu. 2009. Reducing noise in labels and features for a real world dataset: Application of nlp corpus annotation methods. In Computational Linguistics and Intelligent Text Processing: 10th International Conference, CICLing 2009, Mexico City, Mexico, March 1-7, 2009. Proceedings 10, pages 86–97. Springer.
  • Qiu and Xu (2022) Wenjun Qiu and Yang Xu. 2022. Histbert: A pre-trained language model for diachronic lexical semantic analysis. arXiv preprint arXiv:2202.03612.
  • Raheja et al. (2023) Vipul Raheja, Dhruv Kumar, Ryan Koo, and Dongyeop Kang. 2023. Coedit: Text editing by task-specific instruction tuning. arXiv preprint arXiv:2305.09857.
  • Rosenfeld and Erk (2018) Alex Rosenfeld and Katrin Erk. 2018. Deep neural models of semantic shift. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 474–484.
  • Rothe et al. (2021) Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. 2021. A simple recipe for multilingual grammatical error correction. arXiv preprint arXiv:2106.03830.
  • Schlechtweg et al. (2019) Dominik Schlechtweg, Anna Hätty, Marco Del Tredici, and Sabine Schulte im Walde. 2019. A wind of change: Detecting and evaluating lexical semantic change across times and domains. arXiv preprint arXiv:1906.02979.
  • Schlechtweg et al. (2020) Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. Semeval-2020 task 1: Unsupervised lexical semantic change detection. arXiv preprint arXiv:2007.11464.
  • Shoemark et al. (2020) Philippa Shoemark, Farhana Ferdousi Liza, Dong Nguyen, Scott Hale, and Barbara McGillivray. 2020. Room to glo: A systematic comparison of semantic change detection approaches with word embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 66–76. Association for Computational Linguistics.
  • Sidorov et al. (2013) Grigori Sidorov, Anubhav Gupta, Martin Tozer, Dolors Catala, Angels Catena, and Sandrine Fuentes. 2013. Rule-based system for automatic grammar correction using syntactic n-grams for english language learning (l2). In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 96–101.
  • Srivastava et al. (2020) Ankit Srivastava, Piyush Makhija, and Anuj Gupta. 2020. Noisy text data: Achilles’ heel of bert. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 16–21.
  • Stahlberg and Kumar (2021) Felix Stahlberg and Shankar Kumar. 2021. Synthetic data generation for grammatical error correction with tagged corruption models. arXiv preprint arXiv:2105.13318.
  • Tahmasebi et al. (2021) Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2021. Survey of computational approaches to lexical semantic change detection. Computational approaches to semantic change, 6(1).
  • Tarnavskyi et al. (2022) Maksym Tarnavskyi, Artem Chernodub, and Kostiantyn Omelianchuk. 2022. Ensembling and knowledge distilling of large sequence taggers for grammatical error correction. arXiv preprint arXiv:2203.13064.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Wang et al. (2023) Haoyu Wang, Guozheng Ma, Cong Yu, Ning Gui, Linrui Zhang, Zhiqi Huang, Suwei Ma, Yongzhe Chang, Sen Zhang, Li Shen, Xueqian Wang, Peilin Zhao, and Dacheng Tao. 2023. Are large language models really robust to word-level perturbations?
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  • Wu et al. (2021) Di Wu, Yiren Chen, Liang Ding, and Dacheng Tao. 2021. Bridging the gap between clean data training and real-world inference for spoken language understanding. arXiv preprint arXiv:2104.06393.
  • Xing et al. (2013) Junwen Xing, Longyue Wang, Derek F Wong, Lidia S Chao, and Xiaodong Zeng. 2013. Um-checker: A hybrid system for english grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 34–42.
  • Xu and Kemp (2015) Yang Xu and Charles Kemp. 2015. A computational evaluation of two laws of semantic change. In CogSci.
  • Yannakoudakis et al. (2011) Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189, Portland, Oregon, USA. Association for Computational Linguistics.
  • Zhang et al. (2022) Yue Zhang, Bo Zhang, Zhenghua Li, Zuyi Bao, Chen Li, and Min Zhang. 2022. Syngec: Syntax-enhanced grammatical error correction with a tailored gec-oriented parser. arXiv preprint arXiv:2210.12484.
  • Zhou et al. (2022a) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022a. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
  • Zhou et al. (2022b) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022b. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910.