¹¹institutetext: University of Florida, Gainesville, FL, USA ¹¹email: {youms⁰,chathuri.jayawee,bonniejdorr,juliana.prada}@ufl.edu²²institutetext: IHMC, Pensacola, FL, USA ²²email: [email protected]

DAHRS: Divergence-Aware Hallucination-Remediated SRL Projection

Sangpil Youm Corresponding Author.11 Brodie Mather 22 Chathuri Jayaweera 11
Juliana Prada 11 Bonnie Dorr 11

Abstract

Semantic role labeling (SRL) enriches many downstream applications, e.g., machine translation, question answering, summarization, and stance/belief detection. However, building multilingual SRL models is challenging due to the scarcity of semantically annotated corpora for multiple languages. Moreover, state-of-the-art SRL projection (XSRL) based on large language models (LLMs) yields output that is riddled with spurious role labels. Remediation of such hallucinations is not straightforward due to the lack of explainability of LLMs. We show that hallucinated role labels are related to naturally occurring divergence types that interfere with initial alignments. We implement Divergence-Aware Hallucination-Remediated SRL projection (DAHRS), leveraging linguistically-informed alignment remediation followed by greedy First-Come First-Assign (FCFA) SRL projection. DAHRS improves the accuracy of SRL projection without additional transformer-based machinery, beating XSRL in both human and automatic comparisons, and advancing beyond headwords to accommodate phrase-level SRL projection (e.g., EN-FR, EN-ES). Using CoNLL-2009 as our ground truth, we achieve a higher word-level F1 over XSRL: 87.6% vs. 77.3% (EN-FR) and 89.0% vs. 82.7% (EN-ES). Human phrase-level assessments yield 89.1% (EN-FR) and 91.0% (EN-ES). We also define a divergence metric to adapt our approach to other language pairs (e.g., English-Tagalog).

Keywords:

semantic role labeling, hallucination remediation, explainability, divergences

1 Introduction

The natural language processing (NLP) task of semantic role labeling (SRL) captures “who did what to whom” for many downstream applications, e.g., machine translation, question answering, and summarization [21, 14]. Semantic roles are central to inferring unstated information (e.g., stances [26, 25] and emotional cues [3]) that are absent from the output of NLP tools such as dependency parsing.

Disappointingly, SRL has been studied primarily in English due to highly available English-specific SRL annotated datasets [12]. The scarcity of multilingual SRL-annotated corpora motivates the need for cross-language approaches that project semantic roles from English to other languages.

Many studies have explored pre-trained SRL models [28, 34] and generative AI approaches for semantic tasks that include SRL [36]. These LLM-centric studies tend to focus exclusively on English. The associated LLMs thus introduce hallucinations without obvious recourse due to an inherent lack of explainability.

Our approach, “Divergence-Aware Hallucination-Remediated SRL Projection” (DAHRS) adopts a generalized characterization of divergence types [9, 23] and corrects alignnments, remediating hallucinated semantic-role transfer from source to target languages (e.g., English-French and English-Spanish). We introduce a greedy “First-Come First-Assign” (FCFA) algorithm within DAHRS that projects roles from corrected initial alignments. FCFA also remediates the hallucinated lack of semantic role projections emerging from corrected initial alignments.

The key insight here is that leveraging linguistic knowledge overcomes deficiencies in current transformer-based alignment-projection approaches. Transformer-based alignment treats target words as a bag-of-words, frequently aligning source-language terms to hallucinated target-language terms. By contrast, DAHRS injects an awareness of naturally occurring language divergences, e.g., one-to-many/many-to-one translations or word/phrase order distinctions, into alignment. Straightforward correction of alignments that would otherwise lead to hallucinated incorrect roles supports effective and explainable transfer of semantic roles from the source language to the target language.

State-of-the-art XSRL [6] addresses a subset of language divergences explored in this paper: nominalizations and separable verb prefixes. In cases where the initial alignment is correct, XSRL fails to project valid roles in the context of other types of divergences, often hallucinating a lack of semantic role projections on the right-hand side. DAHRS is designed to address two types of hallucinations simultaneously: alignment and projection. The performance of DAHRS is compared to that of XSRL using data processed by both methods (see section 5).

Hallucination remediation in DAHRS starts with token-level and phrase-level corrections to an initial transformer-based mBERT [11] alignment. Following this, additional hallucination remediation takes place during projection. Fig. 1 illustrates two representative cases of divergences that have triggered hallucinations in prior work: Light Verb and Structural.¹¹1Fig. 1 inputs: (a) EN: The dow ’s dive was the 12th - worst ever and the sharpest since the market fell 156.83 FR: La chute du dow jones a été la 12e - la pire et la plus forte depuis que le marché a chuté de 156.83. (b) EN: Some “circuit breakers” installed after the october 1987 crash failed their first test. FR: Certains “disjoncteurs” installés après l’écrasement d’octobre 1987 ont échoué leur premier test. Square brackets ‘[]’ indicate SRL projections, with unaligned words indicated by $\epsilon$ . The output shown at each stage explainably pin-points which sub-components fail or succeed (alignment or projection, or both).

(a) Light Verb Divergence. The single verb fell maps to a combination of a “light” verb (a) and content word “fallen” (chuté). Despite the correct initial mBERT alignment, XSRL is unable to “see past” this divergence to project semantic roles to the target-language side. The inherent uninterpretability of the underlying models impedes the ability to determine what has gone awry, but we observe that this divergence type almost leads to a hallucinated lack of SRL assignments. By contrast, DAHRS correctly transfers labels V, ARG1 (EN market to FR marché), and ARG2 (EN 156.83 to FR 156.83), leaving a, de appropriately unassigned. Also, chuté is an adjectival participle in French, but its verbal nature supports ARG1 assignments, so the V label is retained by design.

(a) Light Verb (Hallucinated lack of roles):
market fell 156.83 - marché a chuté de 156.83

mBERT-based Alignment:

market	—	marché
$\epsilon$	—	a
fell	—	chuté
$\epsilon$	—	de
156.83	—	156.83

XSRL:

[ARG1] market	—	marché
$\epsilon$	—	a
[V] fell	—	chuté
$\epsilon$	—	de
[ARG2] 156.83	—	156.83

DAHRS:

[ARG1] market	—	[ARG1] marché
$\epsilon$	—	a
[V] fell	—	[V] chuté
$\epsilon$	—	de
[ARG2] 156.83	—	[ARG2] 156.83

(b) Structural (Hallucinated incorrect roles):
october 1987 crash -écrasement d’ octobre 1987

mBERT-based Alignment:

october	—	écrasement
october	—	octobre
$\epsilon$	—	d’
1987	—	1987
crash	—	ècrasement

XSRL:

[ARGM-TMP] october	—	[ARGM-TMP] octobre
$\epsilon$	—	d’
[ARGM-TMP] 1987	—	[ARGM-TMP] 1987
[ARGM-TMP] crash	—	$\epsilon$

DAHRS:

[ARGM-TMP] october	—	[ARGM-TMP] octobre
$\epsilon$	—	d’
[ARGM-TMP] 1987	—	[ARGM-TMP] 1987
[ARGM-TMP] crash	—	[ARGM-TMP] écrasement

Figure 1: Divergence cases corresponding to two hallucination types: (a) Light Verbs introduce one-to-many/many-to-one divergences that impede XSRL transfer of semantic roles even when the initial alignment is correct, thus hallucinating a lack of roles on the target-language side; (b) Structural divergences introduce word/phrase order distinctions that result in extra, spuriously aligned terms, thus hallucinating incorrect roles.

(b) Structural Divergence. A difference in source/target word order (October 1987 crash vs. crash of October 1987) combined with a bag-of-words design leads to an incorrect mBERT alignment. Here, October aligns to Octobre (October) and a (hallucinated) occurrence of écrasement, while crash aligns to a second occurrence of écrasement. The resulting XSRL projection includes incorrect role transfers, leaving crash unaligned and thus without a role. By contrast, DAHRS applies alignment remediation, mapping crash to écrasement, and October to Octobre, and correctly transferring ARGM-TMP to French.

DAHRS identifies divergence types, remediates hallucinations at both the token/phrase level, and applies greedy FCFA SRL projection. Divergence handling couples alignment remediation with FCFA, which is parameterized to include syntactic properties of the source language (e.g., English is head-initial) to accommodate proper SRL projection. This simple, efficient design transcends “yet another transformer” in both accuracy and explainability.

While numerous studies have focused on improving explainability in diverse NLP tasks and applications such as classification [22] or medical NLP [4], to our knowledge, ours is the first to address explainability for SRL in NLP. Our visualization of alignment and projection decisions (see Fig. 1) displays accessible, linguistically relevant representations associated with SRL transfers (and predicates, indicated as “V”). These visualized linguistic representations display how and why each SRL projection is made, highlighting the handling of translation divergences throughout the entire process.

Below we present related work, followed by a description of DAHRS. We then present automated and human-validated evaluations. We demonstrate that DAHRS outperforms XSRL in accuracy (87.6% vs. 77.3% F1 (EN-FR), 89.0% vs. 82.7% F1 (EN-ES)). We discuss the potential for generalization to low-resource languages. We then conclude and explore future work.

2 Related Work

Early applications for annotation-projection include: dependency parsing [19]; part-of-speech taggers [38]; machine translation [39, 33]; divergence-inspired alignment [10]; and creation of syntactic-dependency datasets for multiple languages [27]. We borrow the notion of annotation projection to produce explainable, cross-language SRL that advances the state of the art.

A contrasting SRL annotation projection approach is one where a source-language model is modified for direct applicability to a new language, using cross-lingually shared representations [20]. Such “model transferring” approaches do not align datasets across languages, but instead induce a separate dataset. By contrast, annotation projection approaches (including our own) propagate available information from one language to another via alignment.

Translation-based models provide an alternative approach for transferring SRL annotations. These have demonstrated promising performance due to recent improvements in neural machine translation (NMT) [12, 13, 17]. Translation-based projection involves tree-to-tree mappings to build cross-lingual SRL-annotated corpora [31], based on tree/graph-based representations [33]. By contrast, our approach aims to accommodate divergences for SRL projection via word-to-word mapping without relying on additional structure (e.g., trees or graphs).

Prior studies have demonstrated the benefits of embedding models in cross-language SRL projection. For example, Polyglot SRL [29] employs word vectors and is trained on the union of annotations between two languages. A cross-lingual encoder-decoder model is applied to simultaneously translate and apply SRL for resource-poor languages [5]. Adding a syntactic information layer to the embedding models demonstrates plausibility of transferring semantic roles [15]. By contrast, our approach enables improved SRL projection without additional vector-based machinery. Instead, we factor out syntactic variations, as these are not central to the transfer of semantic roles, and introduce a greedy SRL projection algorithm that is both accurate and efficient.

Translation divergences and associated alignment errors lead to considerable noise, often resulting in the implementation of intricate techniques. For example, projection probability distributions and gold-standard annotated data have been employed to improve alignment performance [1]. XSRL uses translations produced by DeepL [7], more than 10% of which are human-judged as improperly translated and removed. An mBERT [8] aligner is applied, followed by an additional transformer-based mechanism (BERT Score) [40], to project semantic roles to the target sentence. Although these approaches offer valuable SRL projection strategies, two major concerns are the added complexity (e.g., BERT-based scoring) and, in the case of XSRL, human filtering to remove noisy translations. The latter negatively impacts the resulting training data coverage.

While our approach involves projection, it differs from those above in that it operates on all translated sentence pairs (no human filtering) and produces a greedily induced SRL projection. The resulting annotations are consistent with translation divergence studies. Decisions on projected labels are made readily accessible and easily visualized, rather than hidden behind black box algorithms.

3 Divergence-Aware Hallucination-Remediated SRL Projection (DAHRS)

Refer to caption — Figure 2: Divergence-Aware Hallucination-Remediated SRL Projection (DAHRS) pipeline from English to French

DAHRS’s key contribution is its ability to compensate for potential semantic role errors emerging from hallucinated alignments that coincide with naturally occurring cross-language divergences. Leveraging source-language knowledge (e.g., English is head initial) coupled with a greedy FCFA algorithm, DAHRS transfers semantic roles to the target language.

Fig. 2 illustrates the DAHRS step-wise pipeline with an English-to-French example. DAHRS’s input is an initial mBERT-style alignment, as in XSRL, but prior to SRL projection it corrects hallucinated alignments and transfers semantic roles without additional transformer-based processing.

Fig. 3 shows three key steps in DAHRS: divergence identification (see Section 3.1), alignment correction (DAHRS₁ and DAHRS₂, see Section 3.2), and FCFA projection (DAHRS₃, also in Section 3.2). When divergence identification uncovers a divergence, DAHRS modifies the alignment prior to SRL projection. Otherwise it directly projects semantic roles through FCFA projection.

3.1 Divergence Identification

For divergence identification, DAHRS relies on a sub-categorization of divergences into three types, as shown in Fig. 4. For example, with regard to the divergences illustrated in Section 1, Light Verb divergences are associated with (a) one-to-many and (b) many-to-one sub-categories, and Structural divergences are associated with (c) the ordering sub-category.

(a) One-to-many
laptops	——	ordinateurs ; 17-23
laptops	——	portables ; 17-24

(b) Many-to-one
fell	——	effondrée ; 4-6
apart	——	effondrée ; 5-6

(c) Ordering
october	——	écrasement ; 9-7
october	——	octobre ; 9-9
$\epsilon$	——	d’ ; $\epsilon$ -8
1987	——	1987 ; 10-10
crash	——	ècrasement ; 11-7

Figure 4: Three subcategories of divergences (token level): One-to-many, Many-to-one, and Ordering

The identification of these divergence sub-categories for a given source-target input pair relies on position-value pairs. These pairs indicate the tokens and phrases that are mapped singularly or repeatedly across the source and target inputs. Divergence types are identified across tokenized source and target sentences, where each token is assigned a position value starting from 0.

Consider the French sentence fragment ordinateurs portable (laptops) in Fig. 4(a). This string is associated with position values of 17 in English and 23,24 in French. Source and target word mappings are denoted by a hyphenated position-value pair. For example, 17-23 and 17-24 indicate the 17th English word (laptops) aligns with the 23rd and 24th word French words (ordinateurs portable). This case is identified as a one-to-many divergence, i.e., a single source token aligns with multiple target tokens. Analogously, a many-to-one divergence is identified when multiple source tokens align with a single target token, as in Fig. 4(b), where fell(4) and apart(5) align with effondrée(6).

An ordering divergence is detected when a single source token is mBERT-aligned with multiple target tokens (one-to-many) while one of those same target tokens aligns with a different source token (many-to-one). Returning to our earlier example, October 1987 crash (translated in French as crash of October 1987), as shown in Fig. 4(c): october(9) aligns with écrasement(7) and octobre(9), while one of target tokens, écrasement(7) also aligns with crash(11).

Although state-of-the-art (mBERT-based) word-to-word alignment establishes a reasonable source-to-target baseline, ordering divergences are not adequately handled, due to mBERT’s bag-of-words design. These lead to incorrect alignments that must be remediated in order to avoid hallucinated SRL projections. We note that ordering distinctions have been a focus in statistical machine translation (SMT) for quite some time [32], but these have heretofore not been remediated for projection.

Subsequent to identifying divergence types, as described below, our approach remediates hallucinations due to divergences and projects semantic roles through FCFA SRL projection.

3.2 DAHRS Algorithms

DAHRS’s three key steps each correspond to a component-level algorithm: alignment correction at the token level (Algorithm 1) and phrase level (Algorithm 2) to remediate hallucinated incorrect role projections, followed by FCFA SRL projection (Algorithm 3) which remediates hallucinated lack of role projections.

3.2.1 $DAHRS{}_{1}$ : Token-Level Hallucination Remediation.

We remediate alignment hallucinations at the token level, using $DAHRS{}_{1}$ (see Algorithm 1). Such hallucinations are discerned from input pairs for one-to-one (tLevelOneToOne), one-to-many (tLevelOneToMany), many-to-one (tLevelManyToOne) alignments. Additionally, a head-initial flag (headInitialFlag) ensures proper SRL projection. This algorithm outputs a list of remediated alignments (remOneToOne).

Algorithm 1 Token-level Hallucination Remediation (

DAHRS{}_{1}

)

Input tLevelOneToOne, tLevelOneToMany, tLevelManyToOne, headInitialFlag

Output remOneToOne

1:function

DAHRS{}_{1}

(tLevelOneToOne,tLevelOneToMany, tLevelManyToOne,headInitialFlag)

2: remOnetoOne

\leftarrow[]

3: remTargetWords

\leftarrow[]

4: for

(src,tgt)\in\textit{tLevelOneToOne}

5: remOneToOne.insert((src, tgt))

6: tgtOneOne

\leftarrow

targets of tLevelOneToOne

7: for

(src,tgtList)\in tLevelOneToMany

8: for

tgt\in tgtList

9: if

tgt\in tgtOneOne

then

10: tgtList.delete(tgt)

11: else

12: remOneToOne.insert((src,tgt))

13: srcOneOne

\leftarrow

sources of tLevelOneToOne

14: for (srcList,tgt)

\in

tLevelManyToOne do

15: if

src\in srcOneOne

then

16: srcList.delete(src)

17: else

18: if

\textit{headInitialFlag}\equiv\textit{True}

then

19: remOneToOne.insert((srcList[0],tgt))

20: else

21: remOneToOne.insert((srcList[1],tgt))

22: return remOneToOne

$DAHRS{}_{1}$ initializes remediated alignments (remOneToOne), inserting mBERT-aligned source-target token pairs specified in the tLevelOneToOne list (lines 4-5). Next the target tokens in the one-to-many pair list (tLevelOneToMany) are examined for alignment with other source tokens, preparing for hallucination remediation (line 6). If a target token is found to be aligned with an alternate source token, the hallucinated alignment is removed from the target token list (tgtList) (lines 7-10). This action remediates alignment hallucinations that emerge in the context of ordering divergences. For example, in the earlier baseline alignment in Fig. 4(c), the word october is incorrectly aligned with écrasement. This is detected due to the simultaneous october-octobre alignment (where no other source word aligns with octobre). The spurious october-écrasement alignment is hypothesized to be a hallucination and is removed.

After remedying spurious alignments in the one-to-many pairs, $DAHRS{}_{1}$ proceeds to store the corrected source and target pairs in the output (remOneToOne) (lines 11-12). In the earlier baseline alignment in Fig. 4(a), $DAHRS{}_{1}$ correctly maps laptops to both ordinateurs and portables.

In the case of many-to-one alignment, $DAHRS{}_{1}$ examines the source tokens in the one-to-one pair list (tLevelOneToOne) for alignment with other target tokens, preparing for additional hallucination remediation (line 13). In this case, the algorithm addresses the potential for hallucinated (downstream) SRL projections due to the presence of particles or modifiers (e.g., apart in fell apart) that are aligned with the main verb.

Remediation removes such tokens from the source token list (srcList) (lines 14-16). For eaxample, in the earlier baseline alignment in Fig. 4(b), the apart-effondrée alignment is deleted. The remaining fell-effondrée alignment is retained and is positioned in the output (remOneToOne) according to the headIntitalFlag, where “True” indicates a head-initial language, selecting the first token and “False” indicates a head-final language (lines 17-21).

3.2.2 $DAHRS{}_{2}$ : Phrase-Level Hallucination Remediation.

$DAHRS{}_{2}$ shown in Algorithm 2 advances beyond the token-level processing of state-of-the-art (XSRL) in that it includes handling of phrases for SRL projection.

Algorithm 2 Phrase-level Hallucination Remediation (

DAHRS{}_{2}

)

Input pLevelOneToOne, pLevelOneToMany, pLevelManyToOne , srcPhRange, tgtPhRange, funcWordIdx, headInitialFlag

Output remOneToOne

1:function

DAHRS{}_{2}

(pLevelOneToOne,pLevelOneToMany, pLevelManyToOne, srcPhRange, tgtPhRange, funcWordIdx, headInitialFlag)

2: if

pLevelManyToOne=\emptyset

and

pLevelOneToMany=\emptyset

then

3: remOneToOne

\leftarrow

pLevelOneToOne

4: return remOneToOne

5: else

6: tgtOneOne

\leftarrow

targets index of pLevelOneToOne

7: for

(src,tgtList)\in pLevelOneToMany

8: for

tgt\in tgtList

9: if tgt

\notin

tgtPhRange

\mid

tgt

\in

tgtPlevelOneOne then

10: tgtList.delete(tgt)

11: for

tgt\in tgtList

12: remOneToOne.insert((src,tgt))

13: srcOneOne

\leftarrow

sources of pLevelOneToOne

14: for

(srcList,tgt)\in pLevelManyToOne

15: for

src\in srcList

16: if src

\notin

srcPhRange

\mid

src

\in

srcOneOne

\mid

src

\in

funcWordIdx then

17: srcList.delete(src)

18: if size of srcList

\equiv

1 then

19: remOneToOne.insert((srcList[0],tgt))

20: else

21: if

\textit{headInitialFlag}\equiv\textit{True}

then

22: remOneToOne.insert((srcList[0],tgt))

23: else

24: remOneToOne.insert((srcList[1],tgt))

25: remOneToOne

\leftarrow

remOneToOne

+

pLevelOneToOne

26: return remOneToOne

Phrase-level processing is similar to what is described above, but phrase identification is employed: BIO (Begin-Inside-Outside) tags are assigned to the source-language side via SRL-BERT [34].²²2SRL-BERT achieves an F1 Score of 86.49 on the English Ontonotes dataset [37], and it can be used non-exclusively. https://allenai.org/terms. These BIO-delineated phrasal units are brought together with alignment corrections for more robust alignment hallucination remediation. A phrase range is determined by arranging the source words in the order they appear within the sentence and employing BIO tags to identify phrases on the English side.³³3A phrase consists of a token that begins with a “B” tag and continues with tokens that have an “I” tag. The following token will have a new “B”, an “O”, or end of the sentence, indicating the end of the phrase.

Phrase information (start to end indices), encoded as a source phrase range (srcPhRange) and target phrase range (tgtPhRange), acts as phrase-level hallucination remediation input. Other inputs are lists of phrase-level alignment pairs: one-to-one (pLevelOneToOne), one-to-many (pLevelOneToMany), many-to-one (pLevelManyToOne). To support remediation, a list of function words (funcWordIdx) and a head-initial flag (headInitialFlag) are also introduced. This algorithm returns lists of remediated mappings (remOneToOne).

First, $DAHRS{}_{2}$ examines whether the mBERT-aligned input is indicative of a one-to-many or many-to-one divergence within a given phrase (a BIO-tagged pair). If no such divergence is present, all the tokens in the phrase are returned as output (remOneToOne) without correction (lines 2–4).

The next step remediates a detected hallucinated alignment resulting from an ordering divergence (lines 7–10). For each target list of phrasal one-to-many alignments, two aspects are examined: whether any target tokens are outside the corresponding phrase range, and whether any target tokens are simultaneously aligned with other source tokens. Tokens meeting one of these conditions are removed from the target list (tgtList). After $DAHRS{}_{2}$ remediates spurious alignments in the one-to-many pair list, non-hallucinated source and target token pairs are stored in the output (remOneToOne) (lines 11–12).

[I-ARG1] circuit	—	$\epsilon$
[I-ARG1] breakers	—	disjoncteurs; 4-2
[B-V] installed	—	installés; 6-4
[I-ARGM-TMP] after	—	après; 7-5
[I-ARGM-TMP] the	—	l’ ; 8-6
[I-ARGM-TMP] october	—	écrasement ; 9-7
[I-ARGM-TMP] october	—	octobre ; 9-9
[I-ARGM-TMP] 1987	—	1987 ; 10-10
[I-ARGM-TMP] crash	—	écrasement ; 11-7

Figure 5: One-to-many (yellow) and Many-to-one (green) phrase-level alignments

Lastly, $DAHRS{}_{2}$ remediates hallucinated alignments arising from many-to-one divergences (lines 14-24). Three conditions are tested for each source token that aligns to a given target token: whether the source token is correctly located within corresponding range, whether it is aligned with another target token, and whether the source token is function word. Any source token matching one of those conditions is removed from the source list (srcList). Following this step, the algorithm opts for the first option if headInitialFlag is true, or the second option otherwise.

We illustrate DAHRS₃ in Fig. 5, where the target token écrasement, has two distinct source token options (october (9) and crash (11)). Both fall within the correct source phrase range (7-30). Since the source token october already maps to octobre, october is removed from the source options for écrasement.

3.2.3 $DAHRS{}_{3}$ : First-Come First-Assign (FCFA) SRL Projection.

$DAHRS{}_{3}$ is a new greedy FCFA SRL projection that transfers semantic roles using the remediated alignments (one-to-one mappings, remOneToOne), as shown in Algorithm 3. Alignments are provided as an input along with corresponding role labels transferred from English (srcSRLSet).

Algorithm 3 First-Come First-Assign (FCFA) SRL Projection (

DAHRS{}_{3}

)

Input remOneToOne, srcSRLSet

Output tgtSRLList

1:function FCFA(remOneToOne, srcSRLSet)

2: tgtSRLList

\leftarrow[]

3: for

srcIdx,tgtIdx\in\textit{remOneToOne}

4: srcSRL

\leftarrow

srcIdx th item of srcSRLSet

5: if

\textit{tgtIdx}\equiv\textit{eps}

then

6: tgtSRL

\leftarrow

None

7: else

8: tgtSRL

\leftarrow

srcSRL

9: tgtSRLList.insert((tgtIdx,tgtSRL))

10: return tgtSRLList

Source side semantic roles are assigned to the remediated aligned target token (lines 3–9). Projection yields two outputs: a human interpretable alignment representation and a JSON formatted SRL representation. For example, in Fig. 6 (a), token-level FCFA projects label (“O”) to octobre and écrasement from october and crash (ordering). In addition, the source label from laptops is projected to both ordinateurs and portables, leveraging the correct (one-to-many) alignment. Advancing beyond state-of-the-art (XSRL), many-to-one handling results in the retention of V for effondrée and the elimination of the hallucinated ARG4 for apart.

(a) Token-level FCFA SRL Projection

One-to-many
[O] laptop — [O] ordinateurs ; 17-23 [O] laptop — [O] portables ; 17-24

Many to one
[B-V] fall — [B-V] effondrée; 4-6 [B-ARG4] apart — $\epsilon$

Ordering
[O] october — $\epsilon$ ; 9-7 [O] october — [O] octobre ; 9-9 [O] 1987 — [O] 1987 ; 10-10 [O] crash — [O] écrasement ; 11-7

(b) Phrase-level FCFA SRL Projection

[I-ARG1] circuit	—	$\epsilon$
[I-ARG1] breakers	—	[I-ARG1] disjoncteurs
[B-V] installed	—	[B-V] installés
[B-ARGM-TMP] after	—	[B-ARGM-TMP] après
[I-ARGM-TMP] the	—	[I-ARGM-TMP] l’
[I-ARGM-TMP] october	—	[I-ARGM-TMP] octobre
[I-ARGM-TMP] 1987	—	[I-ARGM-TMP] 1987
[I-ARGM-TMP] crash	—	[I-ARGM-TMP] écrasement

Figure 6: Token/phrase-level FCFA SRL projections

Phrase-level projection operates similarly. In Fig. 6 (b), october aligns with octobre and “ARGM-TMP” transfers to octobre (one-to-many). Due to the correct alignment of crash with écrasement, “ARGM-TMP” is transferred to écrasement (many-to-one). Furthermore, phrase-level projection considers whether the source language is head-initial or head-final. For example, [B-V-closed], [B-ARGM-MNR-down] — B-V-fermé, DAHRS projects “V” from closed, rather than “ARGM-MNR” from down.

3.3 Explainability and Visualization

In contrast to blackbox LLMs, which do not elucidate the decisions behind language alignment and SRL projections, DAHRS builds readily visualized representations that explain how it arrives at its output. Whereas prior work [18] has proposed metrics such as ‘goodness’, ‘user satisfaction’, and ‘understandability’ as proxies for explainability, DAHRS integrates human-interpretable representations directly into alignment and projection.

Two visualized products of our implementation (with French, Spanish as our test case) are: (a) a set of linguistically annotated alignment representations (one for each predicate indicated as “V”) that provides a window into why/how the system produces its output while elucidating errors that can be readily remedied, as depicted in Fig. 5; (b) a JSON formatted representation that specifies all semantic role-labeled tokens for each sentence, as depicted in Fig. 2 (French semantic role-labeled sentence). These examples showcase our handling of hallucination remediation in the face of divergences and highlight the assignment of predicates and corresponding semantic roles on the target side.

3.4 Model as a Diagnostic Tool

DAHRS employs a direct alignment-based source-to-target transfer mechanism, without requiring a filter or BERT Score (as implemented in XSRL). Moreover, the model based on this algorithm is an effective tool for assessing the accuracy of predicate and semantic role projection in longstanding community standard datasets. To illustrate this point, we explore a human-tagged English evaluation dataset from CoNLL-2009 [16], which has also been translated to French and Spanish data as part of XSRL’s research [6].

Preliminary tests using these datasets for SRL projection yield a much lower precision for DAHRS than that of XSRL: DAHRS: 65.9 (FR), 66.3 (ES), XSRL: 80.7 (FR), 85.4 (ES). Further investigation reveals that these data sets include a very large number of spurious V tags for non-predicates: 8341 (DAHRS) vs. 3777 (XSRL), 8401 (DAHRS) vs. 3870 (XSRL) for FR, ES, respectively. This is corroborated through analysis of part-of-speech (POS) attributes, which reveals that many verbs mislabeled as predicates do not have POS tag V or VB(D).

This overabundance of incorrectly labeled non-predicates in the pre-existing English CoNLL-09 dataset (where spurious V-tagged tokens/phrases would more appropriately be labeled ARG0, ARG1, ARG2) leads to significantly corrupted projections. We thus leverage DAHRS as a diagnostic tool, paving the way for refinements of the CoNLL-2009 gold dataset. We automatically remove the Y ( $=$ Yes) flag for predicates that do not have part-of-speech V or VB(D).⁴⁴4We have simplified the notion of predicate considerably in this discussion, focusing on verbs; however, other parts of speech may serve as predicates. For example, destruction of the city is a nominal phrase conveying a destroy event with a single argument: the city. Future work aims to explore other parts of speech as predicates. Correspondingly, incorrect transferal of falsely labeled predicates from the source is drastically reduced.

With this annotation refinement, we provide the updated new CoNLL-2009 dataset to the community. Correction of spurious predicate labels significantly improves the transferal of predicates and semantic roles during the application of DAHRS. In Section 4, all experiments use this newly updated dataset.

4 Data and Experimental Setup

We use our updated English CoNLL-2009 data for projecting semantic roles to French and Spanish datasets. Human-validated FR/ES datasets, parallel to the EN-CoNLL, are provided by XSRL. The original CoNLL-2009 data incorporates semantic roles for headwords only. In our headword-level experiment, semantic roles from English headwords are projected to the headwords of the FR/ES datasets. Since phrase-level test datasets are unavailable, we employ AllenNLP’s SRL-BERT to assign phrase-level semantic roles to the English corpora, which are then projected onto FR/ES corpora.

Phrasal-level semantic role assignment further enhances the accuracy of SRL, ensuring phrasal coverage—a significant advance over the head-word labeling in the original resource. For instance, without our phrase-level enrichment, the word The is considered a headword during SRL assignment in The Dow Jones industrials closed at 2569.26. The result is a single, inappropriate semantic role assignment of ARG1 to the word The. However, with our enrichment, an appropriate phrasal-level semantic role assignment is made possible: [ARG1- The Dow Jones industrials] [V-closed] [ARGM-EXT at 2569.26]. This corrected output yields a more thorough, accurate representation, which is crucial for downstream tools such as those enumerated in Section 1.

French and Spanish corpora, including their semantic roles, are projected from 2046 English sentences using XSRL (see details in section 2) and DAHRS. Subsequently, we evaluate these against both the community standard ground truth CoNLL-2009 (headword) from XSRL and the human judgment (phrasal). Our experiments run on 3 cores of AMD EPYC 75F3 32-Core Processor and using a NVIDIA A100 GPU.

5 Results and Analyses

Table 1: Word/Phrase-level projection evaluation for French and Spanish: DAHRS vs. baseline (XSRL)

Model	Language	Level	P	R	F1
XSRL	French	word	80.7	74.2	77.3
DAHRS	French	word	86.8	88.3	87.6
XSRL	Spanish	word	85.4	80.3	82.7
DAHRS	Spanish	word	88.1	89.9	89.0
XSRL	French	phrase	91.9	74.4	82.2
DAHRS	French	phrase	98.9	81.1	89.1
XSRL	Spanish	phrase	99.4	78.3	87.6
DAHRS	Spanish	phrase	99.6	83.8	91.0

We explore the performance of two projection-based models: DAHRS and XSRL. DAHRS achieves higher F1 scores in comparison to XSRL on our test data in both word-level and phrasal-level (see Table 1). Performance improvements are obtained as well as explanability.

To evaluate the correctness of the French and Spanish projection outputs, we employ the ground truth data from CoNLL-2009 for the headword dataset. Linguistically trained human taggers proficient in French and Spanish evaluate the phrasal output. Both evaluations use precision (P), recall (R), and F1 scores. Thus, we have achieved explainable transferability of semantic roles more efficiently and with more accurate outputs (P, R, F1).

We evaluate DAHRS against a human-validated CoNLL-2009 that assigns semantic roles only to headwords, per the original XSRL algorithm. We compare XSRL and a variant of DAHRS that produces only headword assignments against this same ground truth. In Table 1, DAHRS projection to headwords outperforms XSRL, with an F1 of 77.3 vs. 87.6 (FR), 82.7 vs. 89.0 (ES).

Furthermore, we conduct a post-analysis and evaluation of our phrasal-rich output against human judgment by French and Spanish proficient evaluators with linguistic training who evaluated 549 total labels (FR), and 582 total labels (ES). This analysis yields a F1 score (FR-89.1%, ES-91.0%, see Table 1). To our knowledge, this is the highest score achieved for this task, surpassing performance (accuracy) of single headword assignments without the overhead of human-labeled source data for French and Spanish.

6 Discussion: Beyond EN-FR / FR-ES

We explore hallucinations associated with linguistic divergences by considering language pairs beyond EN-FR / EN-ES. We consider Tagalog, a low-resource language notably influenced by Spanish at the word level [2], yet divergent from Spanish (and English) in that its subject follows the verb (VSO). Although our current study focuses on English as the source language, our future research focuses on Tagalog with both Spanish and English as source languages, further enriching our divergence exploration. This investigation aims to verify whether the pairs exhibit the divergent properties assumed by DAHRS and to provide a framework for testing longstanding hypotheses about cross-language divergences in the context of alignment.

We introduce divergence metrics that count the number of misalignments on both the source and target sides. When the target language demonstrates a higher number of misalignments, this typically indicates a one-to-many divergence case. Conversely, when the source side yields more misalignment, this typically corresponds to a many-to-one case. DAHRS effectively transfers semantic roles to the target language in both divergent cases, revealing the potential for generalizability to new language pairs.

As an early test case, we assess the applicability of our approach to English-Tagalog (EN-TL) or Spanish-Tagalog (ES-TL),⁵⁵5We use EN-ES-TL parallel data from LORELEI [35]. measuring misalignments on the source and target sides in both language pairs. Although mBERT alignment supports Tagalog, we are motivated to verify its effectiveness through this analysis, given that Tagalog is a low-resource language. We investigate alignment accuracy for EN-TL/ES-TL with the aid of a proficient human evaluator with ChatGPT [30] support. On average, 3.27 words (21.69%) and 4.04 words (26.77%) per sentence are corrected in EN-TL and ES-TL alignment, respectively.

After applying this alignment correction, we measure the misalignment in the source and target sides, revealing a decrease in misalignment of the EN-TL (3.94 (22.77%) to 3.0 (14.94%) words on the source side, 4.54 (34.82%) to 3.65 (26.05%) words on the target side). Notably, the findings demonstrate comparable misalignment in both language pairs (EN-TL and ES-TL).

Alignment regeneration and correction are prerequisites for employing DAHRS for low-resource languages like Tagalog. Base alignment (mBERT) is insufficient for aligning Tagalog with other languages such as English or Spanish necessitating meticulous customization of the alignment for such low-resource cases.

7 Conclusions and Future Work

We present a model for cross-language semantic role projection. Our work enhances semantically informed language processing with minimal overhead via a two-step process that rapidly identifies divergence cases and produces explainable, visualizable SRL output. We demonstrate performance improvements in accuracy without requiring a human-labeled French/Spanish corpus. Our evaluation relies on a community standard ground truth with SRL-tagged headwords (CoNNL-2009). Notable improvements are demonstrated when considering entire phrases, as evidenced by human judgments.

Future work will focus on expanding to other languages (Tagalog is underway) where hand-annotated labels are scarce. Although French and Spanish are investigated above, divergence-causing hallucinations, remediated by acknowledging the syntactic property of languages during DAHRS have been noted across many other languages, e.g., Spanish (categorial; [9]), Korean (structural; [23]), or German (light verb; [24]). As such, it is expected that DAHRS applies multilingually, both for mid-resource language pairs (e.g., English-Spanish/French) and for those that are low-resource language pairs (e.g., English-Tagalog).

Finally, our experiments reveal that a new model, DAHRS, improves the multilingual SRL projection task. We provide French and Spanish corpora, including SRL information per predicates. Additionally, we utilize DAHRS as a diagnostic tool to verify the accuracy of ground truth. Through this diagnostic tool, we identify errors in the data, enabling us to update and reproduce data for the language community. These data resources are not only beneficial for the SRL task but also may be leveraged for other tasks.

Acknowledgements

This research is based upon work supported by Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001121C0186. Any opinions, findings and conclusions or recommendations expressed in this research are those of the authors and do not necessarily reflect the views of the US Government.

References

[1] Akbik, A., Chiticariu, L., Danilevsky, M., Li, Y., Vaithyanathan, S., Zhu, H.: Generating high quality proposition banks for multilingual semantic role labeling. In: Proc. of ACL-IJCNLP (2015)
[2] Baklanova, E., Bellamy, K.: Spanish suffixes in tagalog: The case of common nouns. In: Traces of Contact in the Lexicon. BRILL (2023)
[3] Campagnano, C., Conia, S., Navigli, R.: SRL4E – Semantic Role Labeling for Emotions: A Unified Evaluation Framework. In: Proc. of ACL (2022)
[4] Danilevsky, M., Qian, K., Aharonov, R., Katsis, Y., Kawas, B., Sen, P.: A survey of the state of explainable ai for natural language processing. In: Proc. of AACL-IJCNLP (2020)
[5] Daza, A., Frank, A.: Translate and label! an encoder-decoder approach for cross-lingual semantic role labeling. In: Proc. of EMNLP-IJCNLP (2019)
[6] Daza, A., Frank, A.: X-srl: A parallel cross-lingual semantic role labeling dataset. In: Proc. of EMNLP (2020)
[7] DeepL SE: DeepL: Neural machine translation software (2017)
[8] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proc. of NAACL-HLT (2018)
[9] Dorr, B.J.: Machine translation divergences: A formal description and proposed solution. Computational Linguistics (1994)
[10] Dorr, B.J., Pearl, L., Hwa, R., Habash, N.: DUSTer: A method for unraveling cross-language divergences for statistical word-level alignment. In: Proc. of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers. vol. 2499 (2002)
[11] Dou, Z.Y., Neubig, G.: Word alignment by fine-tuning embeddings on parallel corpora. In: Proc. of the EACL (2021)
[12] Fei, H., Zhang, M., Ji, D.: Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus. In: Proc. of ACL (2020)
[13] Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: Proc. of ICML. vol. 3 (2017)
[14] Genest, P.E., Lapalme, G.: Framework for Abstractive Summarization using Text-to-Text Generation. In: Monolingual@ACL (2011)
[15] Guarasci, R., Silvestri, S., Pietro, G.D., Fujita, H., Esposito, M.: Bert syntactic transfer: A computational experiment on italian, french and english languages. Computer Speech & Language 71 (2022)
[16] Hajič, J., Ciaramita, M., Johansson, R., Meyers, A., Štěpánek, J., Nivre, J., Straňák, P., Surdeanu, M., Xue, N.B., Zhang, Y.: 2009 conll shared task part 2 (2012)
[17] Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., Liu, S., Liu, T.Y., Luo, R., Menezes, A., Qin, T., Seide, F., Tan, X., Tian, F., Wu, L., Wu, S., Xia, Y., Zhang, D., Zhang, Z., Zhou, M.: Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567 (2018)
[18] Hoffman, R.R., Mueller, S.T., Klein, G., Litman, J.: Metrics for explainable ai: Challenges and prospects (2018)
[19] Hwa, R., Resnik, P., Weinberg, A., Cabezas, C., Kolak, O.: Bootstrapping parsers via syntactic projection across parallel texts. Natural language engineering (2005)
[20] Kozhevnikov, M., Titov, I.: Cross-lingual transfer of semantic role labeling models. In: Proc. of ACL (2013)
[21] Liu, D., Gildea, D.: Semantic Role Features for Machine Translation (2010)
[22] Liu, H., Yin, Q., Wang, W.Y.: Towards explainable nlp: A generative explanation framework for text classification. In: Proc. of ACL (2020)
[23] Maniyar, S.N., Kulkarni, S.B., Bhise, P.R.: Linguistic divergence in various language pair in machine translation perceptive. IOSR Journal of Computer Engineering (IOSR-JCE) 23(1) (2021)
[24] Marzouk, S.: Chapter 3 german light verb construction in the course of the development of machine translation. In: Translation, interpreting, cognition: The way out of the box. Language Science Press (2021)
[25] Mather, B., Dorr, B.J., Dalton, A., de Beaumont, W., Rambow, O., Schmer-Galunder, S.M.: From stance to concern: Adaptation of propositional analysis to new tasks and domains. In: Findings of the ACL (2022)
[26] Mather, B., Dorr, B.J., Rambow, O., Strzalkowski, T.: A general framework for domain-specialization of stance detection. In: Proc. of FLAIRS (2021)
[27] McDonald, R., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., Hall, K., Petrov, S., Zhang, H., Täckström, O., Bedini, C., Castelló, B., Lee, J.: Universal dependency annotation for multilingual parsing. In: Proc. of ACL (2013)
[28] Mehta, S.V., Lee, J.Y., Carbonell, J.: Towards semi-supervised learning for deep semantic role labeling. In: Proc. of EMNLP (2018)
[29] Mulcaire, P., Swayamdipta, S., Smith, N.A.: Polyglot semantic role labeling. In: Proc. of ACL (2018)
[30] OpenAI: ChatGPT: Large-scale language model (2021)
[31] Pražák, O., Konopík, M.: Cross-lingual srl based upon universal dependencies. In: Proc. of RANLP (2017)
[32] Rottmann, K., Vogel, S.: Word reordering in statistical machine translation with a pos-based distortion model. In: Proc. of IEEE on TMI (2007)
[33] Shen, Y., Chu, C., Cromieres, F., Kurohashi, S.: Cross-language projection of dependency trees with constrained partial parsing for tree-to-tree machine translation. In: Proc. of the First Conference on Machine Translation (2016)
[34] Shi, P., Lin, J.: Simple bert models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255 (2019)
[35] Tracey, J., Strassel, S., Graff, D., Wright, J., Chen, S., Ryant, N., Kulick, S., Griffitt, K., Delgado, D., Arrigo, M.: Lorelei (low resource languages for emergent incidents) tagalog representative language pack (2023)
[36] Tsai, H.C., Kuo, C.W., Huang, Y.F.: Llamaloop: Enhancing information retrieval in llama with semantic relevance feedback loop. Preprint at Reserch Square (2023)
[37] Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., Xue, N., Tylor, A., Kaufman, J., Franchini, M., El-Bachouti, M., Belvin, R., Houston, A.: Ontonotes release 5.0 (2013)
[38] Yarowsky, D., Ngai, G.: Inducing multilingual pos taggers and np bracketers via robust projection across aligned corpora. In: Second Meeting of the NAACL (2001)
[39] Zhang, M., Jiang, H., Aw, A., Li, H., Tan, C.L., Li, S.: A tree sequence alignment-based tree-to-tree translation model. In: Proc. of the ACL-HLT (2008)
[40] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 (2019)

DAHRS: Divergence-Aware Hallucination-Remediated SRL Projection

Abstract

Keywords:

1 Introduction

2 Related Work

3 Divergence-Aware Hallucination-Remediated SRL Projection (DAHRS)

3.1 Divergence Identification

3.2 DAHRS Algorithms

3.2.1 DAHRS1DAHRS{}_{1}italic_D italic_A italic_H italic_R italic_S start_FLOATSUBSCRIPT 1 end_FLOATSUBSCRIPT: Token-Level Hallucination Remediation.

3.2.2 DAHRS2DAHRS{}_{2}italic_D italic_A italic_H italic_R italic_S start_FLOATSUBSCRIPT 2 end_FLOATSUBSCRIPT: Phrase-Level Hallucination Remediation.

3.2.3 DAHRS3DAHRS{}_{3}italic_D italic_A italic_H italic_R italic_S start_FLOATSUBSCRIPT 3 end_FLOATSUBSCRIPT: First-Come First-Assign (FCFA) SRL Projection.

3.3 Explainability and Visualization

3.4 Model as a Diagnostic Tool

4 Data and Experimental Setup

5 Results and Analyses

6 Discussion: Beyond EN-FR / FR-ES

7 Conclusions and Future Work

Acknowledgements

References

3.2.1 $DAHRS{}_{1}$ : Token-Level Hallucination Remediation.

3.2.2 $DAHRS{}_{2}$ : Phrase-Level Hallucination Remediation.

3.2.3 $DAHRS{}_{3}$ : First-Come First-Assign (FCFA) SRL Projection.