Automated English-Korean Translation For Enhanced Coalition Communications
Clifford J. Weinstein, Young-Suk Lee, Stephanie Seneff, Dinesh R. Tummala,
Beth Carlson, John T. Lynch, Jung-Taik Hwang, and Linda C. Kukolich
This article describes our progress on automated, two-way English-Korean
translation of text and speech for enhanced military coalition communications.
Our goal is to improve multilingual communications by producing accurate
translations across a number of languages. Therefore, we have chosen an
interlingua-based approach to machine translation that readily extends to
multiple languages. In this approach, a natural-language-understanding system
transforms the input into an intermediate-meaning representation called a
semantic frame, which serves as the basis for generating output in multiple
languages. To produce useful, accurate, and effective translation systems in the
short term, we have focused on limited military-task domains, and have
configured our system as a translator's aid so that the human translator can
confirm or edit the machine translation. We have obtained promising results in
translation of telegraphic military messages in a naval domain, and have
successfully extended the system to additional military domains. The system has
been demonstrated in a coalition exercise and at Combined Forces Command in
the Republic of Korea. From these demonstrations we learned that the system
must be robust enough to handle new inputs, which is why we have developed a
multistage robust translation strategy, including a part-of-speech tagging
technique to handle new words, and a fragmentation strategy for handling
complex sentences. Our current work emphasizes ongoing development of these
robust translation techniques and extending the translation system to
application domains of interest to users in the military coalition environment in
the Republic of Korea.
The U.S. military operates worldwide in a variety of international environments that require language translation.
Translators who can interpret military terminology
are a scarce commodity in countries such as the Republic of Korea (R.O.K.), and U.S. military leaders
there support the development of bilingual machine
translation. Although U.S. and R.O.K. military personnel have been working together for more than
forty years, the language barrier still significantly re-
duces the speed and effectiveness of coalition command and control. During hostilities, any time saved
by computers that can quickly and accurately translate command-and-control information could provide an advantage over the enemy and reduce the possibility of miscommunication with allies.
Machine translation has been a challenging area of
research for four decades, as described by W.J. Hutchins and H.L. Somers [1], and was one of the
original problems addressed with the development of
VOLUME 10, NUMBER 1, 1997
Machine-Translation Background
The pyramid diagram of Figure 1 shows source-language analysis along the left side and target-language
generation along the right side, and three machine-translation strategies: interlingua, transfer, and direct.
Most machine-translation strategies cut off the
source-language analysis at some point along the way,
and perform a bilingual transfer. The interlingua approach is different. It eliminates a bilingual transfer
phase by producing a language-independent meaning
representation called the interlingua that is directly
usable for target-language generation. In addition, it
greatly facilitates the development of a multilingual
system, because the same interlingua can be used to
generate multiple target languages. Although achieving a language-independent interlingual representation is a difficult challenge for general domains, the
interlingua approach offers significant advantages in
limited domains.
Direct translation systems do little source-language
analysis, proceeding immediately to a transfer. They
produce a word-for-word translation, much like an
automated bilingual-dictionary lookup. The resulting
translation generally does not have proper word order, syntax, or meaning in the target language, although it may be of some help to a user.
Transfer systems perform some intermediate level of source-language analysis, followed by a bilingual transfer into the target language.
computers. Although general, effective solutions remain elusive, we have made substantial advances in
developing an automated machine-translation system
to aid human translators in limited domains, specifically for military translation tasks in the Combined
Forces Command (CFC) in Korea. Our strategy to
enhance the probability of success in this effort has
been threefold: first, to build upon the tremendous
advances in the research and development community over the past decade in natural-language understanding and generation, machine translation, and
speech recognition; second, to carefully choose limited but operationally important translation applications to make the task manageable; and third, to facilitate user interaction with the translation system, so
that the primary goal is not a fully automated translator but an aid that helps the human translator be
more effective.
FIGURE 1. The interlingua, transfer, and direct approaches to machine translation. The interlingua approach differs from the other two by producing a language-independent meaning representation called the interlingua that is directly usable for target-language generation.
FIGURE 2. Architecture of the common coalition language system at Lincoln Laboratory (CCLINC). The understanding modules convert Korean or English input into a language-independent interlingual meaning representation, known in this case as a semantic frame. The use of semantic frames allows the CCLINC system to extend to multiple languages. The meaning representation in the semantic frame could also be used to provide two-way communication between a user and a Command, Control, Communications, Computing, and Intelligence (C4I) system.
The understanding module of CCLINC converts each input into an interlingual representation. In CCLINC, this interlingual representation is called a semantic frame. In the
case of speech input, the understanding module in
Figure 2 performs speech recognition and understanding of the recognition output. Our current
speech-recognition system and its performance on
speech translation are described in a later section. Although our original work on this project involved
speech-to-speech translation [2], we have recently
emphasized text translation [3] in response to the priorities of U.S. military users in Korea. An ongoing effort by Korean researchers in English-to-Korean text
translation is described in Reference 4.
The CCLINC translation system provides feedback to the originator on its understanding of each input sentence by forming a paraphrase in the originator's language. For example, when an English speaker enters a sentence into the system, the sentence is first transformed into a semantic frame by the English-understanding module. Then the English-generation module produces a paraphrase of what the system understood.
FIGURE 3. Process flow for English-to-Korean text translation in CCLINC. The TINA language-understanding system utilizes the English grammar and analysis lexicon to analyze the English text input and produce a semantic frame representing the meaning of the input sentence. The GENESIS language-generation system utilizes the Korean grammar and generation lexicon to produce a Korean output sentence based on the semantic frame.
This article deals mainly with our work in English-to-Korean text translation. Although the CCLINC
translation system is general and extendable, most of
our work to date has focused on English-to-Korean
text translation because it is the application of most
interest to U.S. forces in Korea. Our work has also
included two-way English-Korean translation of both
speech and text. We have started developing an interlingua-based Korean-to-English translation subsystem in CCLINC. (Our previous Korean-to-English system was developed by SYSTRAN, Inc.,
under a subcontract.) Our initial work on this project
included translation from English speech and text to
French text [2].
Input sentence: 0819 z uss sterett taken under fire by a kirov with ssn-12s.
[Figure 4 parse tree: sentence → full_parse → statement, with a pre-adjunct (time_expression → gmt_time → numeric_time, gmt), a subject (np → a_ship → ship_mod uss, ship_name sterett), and a passive participial phrase (vp_taken_under_fire → vtake under_fire, with v_by_agent covering by a kirov and v_with_instrument covering with ssn-12 s).]
FIGURE 4. Parse-tree example based on an English input sentence. The parse tree represents the structure of the input sentence in terms of both general syntactic categories, such as the subject or participial phrase, and domain-specific semantic categories of the material being translated (highlighted in red), such as the ship name.
:statement
  :time_expression
    :topic z
    :pred 819
  :topic
    :name sterett
    :pred uss
  :pred taken_under_fire
    :pred v_by
      :topic
        :quantifier indef
        :name kirov
      :pred v_with_instrument
        :topic
          :name ssn-12
          :number pl
Paraphrase:
819 Z USS Sterett taken under fire by a kirov with SSN-12s.
Translation:
[Korean Hangul output]
FIGURE 5. Semantic frame, paraphrase, and translation for
the example sentence of Figure 4. The semantic frame represents the meaning of the input in terms of fundamental language-neutral categories such as topic and predicate, and is
used as the basis for generation of both the English paraphrase and the Korean output sentence. Entries in red in the
semantic frame are replaced by the corresponding vocabulary items in the Korean-generation lexicon.
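To make the topic/predicate nesting concrete, the sketch below renders the Figure 5 frame as a small recursive data structure in Python. The (role, head, children) encoding and the heads traversal are illustrative assumptions for exposition; they are not CCLINC's actual internal format.

```python
# Hypothetical rendering of the Figure 5 semantic frame as a recursive
# (role, head, children) structure; NOT the actual CCLINC data format.
frame = ("statement", None, [
    ("time_expression", None, [("topic", "z", []), ("pred", "819", [])]),
    ("topic", "sterett", [("pred", "uss", [])]),
    ("pred", "taken_under_fire", [
        ("pred", "v_by", [
            ("topic", "kirov", [("quantifier", "indef", [])]),
            ("pred", "v_with_instrument", [
                ("topic", "ssn-12", [("number", "pl", [])]),
            ]),
        ]),
    ]),
])

def heads(node):
    """Depth-first list of frame heads: the raw material for generation."""
    role, head, children = node
    out = [head] if head is not None else []
    for child in children:
        out.extend(heads(child))
    return out

print(heads(frame))
# ['z', '819', 'sterett', 'uss', 'taken_under_fire', 'v_by',
#  'kirov', 'indef', 'v_with_instrument', 'ssn-12', 'pl']
```

A generation module walks such a structure, replacing each head through the target-language lexicon and ordering the constituents according to language-specific templates.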
Table 1. Sample Korean-Generation Lexicon

Semantic-frame entry    Korean form (Yale Romanization)
V1 ha                   PRESENT han, PAST hayss, PP hayss, PSV toy
indef
kirov                   khirob
ssn-12                  ssn-12 misail
sterett                 stheret
take_under_fire         V1 phokyek
uss                     mikwunham
z                       pyocwunsikan
v_by                    uyhay
v_with_instrument       lo
Figure 5 shows the English paraphrase and Korean translation generated from the example semantic frame. The paraphrase in this case is essentially identical to the original (except that 0819 is replaced by 819). The Korean
output is Hangul text composed from the 24 basic
letters and 16 complex letters of the Korean alphabet.
To produce translation output, the language-generation system requires three data files: a lexicon, a set
of message templates, and a set of rewrite rules. These
files are language-specific and external to the core language-generation system. Consequently, extending
the language-generation system to a new language requires creating only the data files for the new language. A pilot study of applying the GENESIS system to Korean language generation can be found in
Reference 9. For generating a sentence, all the vocabulary items in the semantic frame such as z, uss,
and by are replaced by the corresponding vocabulary
items provided in the lexicon. All phrase-level constituents represented by topic and pred are combined
recursively to derive the target-language word order,
as specified in the message templates. We give examples below of the data files that are necessary to
generate Korean translation output.
Table 1 shows a sample language-generation lexicon necessary to generate the Korean translation output of the input sentence from the semantic frame in Figure 5 (0819 z uss sterett taken under fire by a kirov with ssn-12s). Words and concepts in the semantic
frame are given in the left column of the table, and
the corresponding forms in Korean are given in the
right column. The Korean forms are in Yale
Romanized Hangul, a representation of Korean text
in a phonetic form that uses the Roman alphabet
[10]. Because the semantic frame uses English as its
specification language, lexicon entries contain words
and concepts found in the semantic frame with corresponding forms in Korean. (For a discussion about
designing interlingua lexicons, see Reference 11.)
In the lexicon, P stands for the part of speech preposition; N for noun; D for determiner; and V for verb.
Verbs are classified into several subgroups according
to grammatical rules that govern which tense forms
are used. The first row of the example in Table 1 says that the entry ha is a verb of category V1, for which the present tense is han, the past tense hayss, the past participle hayss, and the passive form toy.
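As a concrete illustration of how the lexicon and message templates drive generation, here is a toy GENESIS-style generator in Python. The frame encoding, the template slots, and the passive past ending toyessta are simplified assumptions, not GENESIS's actual file formats; only the lexicon pairs are taken from Table 1.

```python
# Toy GENESIS-style generation: language-specific data files (a lexicon
# and a message template) drive word choice and word order, while the
# engine itself is language-neutral. File formats here are simplified
# assumptions for illustration.

lexicon = {          # semantic-frame entry -> Korean (Yale Romanization)
    "uss": "mikwunham",
    "sterett": "stheret",
    "kirov": "khirob",
    "ssn-12": "ssn-12 misail",
    "take_under_fire": "phokyek",
    "v_by": "uyhay",            # agent postposition
    "v_with_instrument": "lo",  # instrument postposition
}

# Message template for a passive statement: Korean is verb-final, and
# case roles are marked by postpositions that follow their noun phrase.
template = ["topic", "agent", "instrument", "verb"]

def generate(frame):
    parts = {
        "topic": lexicon[frame["topic_mod"]] + " " + lexicon[frame["topic"]],
        "agent": lexicon[frame["agent"]] + " " + lexicon["v_by"],
        "instrument": lexicon[frame["instrument"]] + lexicon["v_with_instrument"],
        "verb": lexicon[frame["pred"]] + "toyessta",  # assumed passive past
    }
    return " ".join(parts[slot] for slot in template)

frame = {"topic_mod": "uss", "topic": "sterett",
         "agent": "kirov", "instrument": "ssn-12", "pred": "take_under_fire"}
print(generate(frame))
# mikwunham stheret khirob uyhay ssn-12 misaillo phokyektoyessta
```

Changing only the data files (lexicon and template) retargets the same engine to another language, which is the point of keeping them external to the core system.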
[Message-template examples: (a) statement, (b) topic (:quantifier :noun_phrase), (c) predicate (:topic :predicate), (d) np-uss (:predicate :noun_phrase), (e) np-v_by.]
                        Nominative case    Accusative case
Following a consonant   John-i             John-ul
Following a vowel       Maria-ka           Maria-lul
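The alternation in the table depends only on whether the noun ends in a consonant or a vowel. A minimal sketch in Python over the romanized forms (treating the final letter of the romanization as the consonant/vowel test is a simplification that happens to hold for these examples):

```python
# Choose Korean case particles from the final sound of a romanized noun:
# -i/-ul after a consonant, -ka/-lul after a vowel. Testing the last
# letter of the romanized form is a simplifying assumption.

VOWELS = set("aeiou")

def nominative(noun):
    return noun + ("-ka" if noun[-1] in VOWELS else "-i")

def accusative(noun):
    return noun + ("-lul" if noun[-1] in VOWELS else "-ul")

print(nominative("John"), accusative("John"))    # John-i John-ul
print(nominative("Maria"), accusative("Maria"))  # Maria-ka Maria-lul
```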
The user can edit the translation output by using a Hangul text editor. When the translation is acceptable, the user clicks on the check icon, and the
translated sentence is moved to the output window at
the bottom. Here, the translation of the prior sentence starting with 0819 z USS Sterett is shown in the
output window. If the user wishes to view the translation process in more detail, the parse tree or semantic
frame can be viewed by clicking on the tree or frame
icons.
In configuring our system as a translator's aid, we
provide the user with as much help as possible. If the
system is unable to parse and understand the input
sentence, a word-for-word translation is provided to
the user, consisting of a sequence of word translations
from the Korean-generation module. If some of the
English words are not in the generation lexicon, the
original English word is included in the translation
output in the place where its Korean equivalent
would have occurred. In both cases, the problem is
noted on the output.
The interlingua-based Korean-to-English translation system operates with the same graphical user interface, except that the U.S. and Korean flags are interchanged in the translation icon and the input language is Korean. The SYSTRAN transfer-based Korean-to-English translation system, however, does not provide the user a paraphrase, parse tree, or semantic frame.

FIGURE 6. Graphical user interface of the translator's aid in English-to-Korean translation mode. The input is entered by voice, through the keyboard, or from a file to the top window. The English paraphrase is shown below the input window, and the Korean translation of that sentence (in Hangul characters) is shown in the window below the English paraphrase. The user can edit the translation output by using a Hangul text editor. If the translation is acceptable, the translated sentence can be moved to the bottom window by clicking on the check icon. The parse tree and the semantic frame of the input sentence can be displayed by clicking on the tree and the frame buttons, respectively.
English-to-Korean System Development on
Naval Message Domain: A Domain-Specific
Grammar Approach
From June 1995 to April 1996 we trained our system
on the MUC-II corpus, a collection of naval operational report messages from the Second Message Understanding Conference (MUC-II). These messages
were collected and prepared by the Center for Naval
Research and Development (NRaD) to support
DARPA-sponsored research in message understanding. Lincoln Laboratory utilized these messages for
DARPA-sponsored machine-translation research. We
chose to use the MUC-II corpus for the following
reasons: (1) the messages were typical of actual military messages that our users would be interested in
translating, including high usage of telegraphic text
and military jargon and acronyms; (2) the domain
was limited but useful, so that we felt that our interlingua approach could be applied with reasonable
probability of success; and (3) the corpus was available to us in usable form.
The MUC-II Naval Message Corpus
MUC-II data consist of a set of naval operational report messages that feature incidents involving different platforms such as aircraft, surface ships, submarines, and land targets. The MUC-II corpus consists
of 145 messages that average 3 sentences per message
and 12 words per sentence [13, 14]. The total vocabulary size of the MUC-II corpus is about 2000
words. The following example shows that MUC-II
messages are highly telegraphic with many instances
of sentence fragments and missing articles:
At 1609 hostile forces launched massive recon effort from captured airfield against friendly units.
Have positive confirmation that battle force is targeted (2035z). Considered hostile act.
To accommodate sentences with an omitted preposition, the grammar needs to allow any instance of a noun phrase (NP) to be ambiguous between an NP and a prepositional phrase (PP). The following examples show how allowing the grammar to accept an input in which the copula verb be is omitted causes the past-tense form of a verb to be interpreted either as the main verb with the appropriate form of be omitted, as in phrase a, or as a reduced relative clause modifying the preceding noun, as in phrase b.
Aircraft launched at 1300 z.
(a) Aircraft were launched at 1300 z.
(b) Aircraft which were launched at 1300 z.
Syntactic ambiguity and the resultant misparse induced by such an omission often lead to a mistranslation. For example, the phrase TU-95 destroyed 220
nm could be misparsed as an active rather than a passive sentence due to the omission of the verb was, and
the prepositional phrase 220 nm could be misparsed
as the direct object of the verb destroy. The semantic
frame reflects these misunderstandings because it is
derived directly from the parse tree, as shown in Figure 7. The semantic frame then becomes the input to
the generation system, which produces the following
nonsensical Korean translation output:
TU-95-ka
220 hayli-lul
pakoy-hayssta.
TU-95-NOM 220 nautical mile-OBJ destroyed.
The sensible translation is
TU-95-ka
220 hayli-eyse
pakoy-toyessta.
TU-95-NOM 220 nautical mile-LOC was destroyed.
In the examples, NOM stands for the nominative case
marker, OBJ the object case marker, and LOC the
locative postposition. The problem with the nonsensical translation above is that the object particle lul
incorrectly identifies the preceding locative phrase
220 hayli as the object of the verb. This type of misunderstanding is not reflected in the English paraphrase because English does not have case particles
that overtly mark the case role of an NP.
Many instances of syntactic ambiguity are resolved
[Figure 7. Semantic frame for the misparsed input TU-95 destroyed 220 nm, in which 220 nm is treated as the direct object:]
:statement
  :topic nn_head
    :name tu-95
  :pred destroy
    :mode past
    :topic nn_head
      :name nm
      :pred cardinal
        :topic 220
:statement
  :topic aircraft
    :name tu-95
  :pred destroy
    :mode psv
  :pred at_locative
    :topic distance
      :name nm
      :pred 220

Translation: [Korean Hangul output]
FIGURE 8. Semantic frame for the accurate translation of the input TU-95 destroyed 220 nm. Entries in red are replaced by the corresponding vocabulary items in the Korean-generation lexicon. Unlike in the semantic frame of Figure 7, the expression 220 nm is correctly understood as a locative expression, and the sentence is translated in the passive voice. The correct translation results from the domain-specific knowledge of the grammar and the grammar-training capability of the language-understanding system.
els of the latter, whereas the former contains only domain-specific semantic categories at the lower levels.
On closer examination, the input sequence for the second-stage parsing does not consist solely of parts of speech, but of a mix of parts of speech and words. Unless a word is a verb or a preposition, we replace the word with its part of speech. By not substituting parts of speech for verbs and prepositions, we avoid ambiguity [15, 19].
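The substitution rule just described, keep verbs and prepositions as words and replace every other word with its part-of-speech tag, can be sketched as follows. The tiny tag lookup is a stand-in for the rule-based tagger; the tag names follow Figure 9.

```python
# Build the mixed word/part-of-speech input sequence for second-stage
# parsing: verbs and prepositions stay as words (to avoid attachment
# ambiguity); all other words are replaced by their tags. The lookup
# table is a stand-in for the rule-based tagger used in CCLINC.

TAGS = {"0819": "cardinal", "z": "gmt", "unknown": "adjective",
        "contact": "noun", "replied": "verb", "incorrectly": "adverb"}

KEEP_AS_WORDS = {"verb", "preposition"}

def mixed_sequence(words):
    return [w if TAGS.get(w, "noun") in KEEP_AS_WORDS else TAGS.get(w, "noun")
            for w in words]

sentence = "0819 z unknown contact replied incorrectly".split()
print(mixed_sequence(sentence))
# ['cardinal', 'gmt', 'adjective', 'noun', 'replied', 'adverb']
```

After parsing, each tag in the parse tree is replaced by the original word it stands for, as the Figure 9 caption describes.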
[Figure 9 parse tree: a pre-adjunct (time_expression → gmt_time → numeric_time → cardinal 0819, gmt), a subject (np → adjective, noun), and a predicate (vp_reply → vreply, adverb_phrase → adverb); the leaf words are 0819, unknown, contact, replied, and incorrectly.]
FIGURE 9. Parse tree derived from a mixed sequence of words and part-of-speech tags. The input
sentence at the top is converted into the mixed sequence below it by using the part-of-speech tagger.
This mixed sequence is the input to the parser. In the parse tree, part-of-speech units are shown in
red. When parsing is complete, the part-of-speech units are replaced by the words in the original
sentence. For example, adjective is replaced by unknown, and adverb is replaced by incorrectly.
Although stochastic taggers require a large amount of training data to achieve high rates of tagging accuracy, this rule-based tagger achieves performance comparable to or higher than that of stochastic taggers, even with a training corpus of modest size.
Given that the size of our training corpus is small
(7716 words), a rule-based tagger is well suited to our
needs.
The rule-based part-of-speech tagger operates in
two stages. First, each word in the tagged training
corpus has a lexicon entry consisting of a partially ordered list of tags, indicating the most likely tag for
that word, and all other tags seen with that word (in
no particular order). Every word is initially assigned
its most likely tag in isolation. Unknown words are
assumed to be nouns, and then cues based upon prefixes, suffixes, infixes, and adjacent word co-occurrences are used to update the most likely tag. Second,
after the most likely tag for each word is assigned,
contextual transformations are used to improve the
accuracy.
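A minimal sketch of this two-stage, Brill-style operation, with a toy lexicon and a single contextual transformation (both illustrative; the actual rules are learned from the tagged corpus):

```python
# Two-stage rule-based (Brill-style) tagging: (1) assign each word its
# most likely tag from the lexicon, defaulting unknown words to noun;
# (2) apply contextual transformations to repair initial errors. The
# lexicon and the single rule below are toy examples, not learned rules.

LEXICON = {"aircraft": "noun", "launched": "past_verb", "at": "preposition",
           "1300": "cardinal"}

def initial_tags(words):
    return [LEXICON.get(w, "noun") for w in words]  # unknown -> noun

def contextual(words, tags):
    # Toy transformation: a word tagged noun that directly follows a
    # cardinal preceded by "at" is retagged as a time-zone marker.
    out = list(tags)
    for i in range(2, len(words)):
        if tags[i] == "noun" and tags[i - 1] == "cardinal" and words[i - 2] == "at":
            out[i] = "gmt"
    return out

words = "hostile aircraft launched at 1300 zulu".split()
print(list(zip(words, contextual(words, initial_tags(words)))))
```

Here the unknown words hostile and zulu are first defaulted to noun; the contextual pass then retags zulu from its neighbors, which is exactly the error-repair role the second stage plays.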
We evaluated the tagger performance on the TEST
data set both before and after training on the MUC-II
corpus. Table 4 presents the results of our evaluations.
Tagging statistics before training are based on the
lexicon and rules acquired from the Brown corpus
and the Wall Street Journal (WSJ) corpus. Tagging
statistics after training are divided into two categories,
both of which are based on the rules acquired from
training data sets of the MUC-II corpus. The only
difference between the two is that in one case (after training I) we use a lexicon acquired from the MUC-II corpus, and in the other case (after training II) we use a lexicon acquired by combining the Brown corpus, the WSJ corpus, and the MUC-II corpus.
Table 4 shows that the tagger achieves a tagging accuracy of up to 98% after training and using the combined lexicon.

[Table 4. Rule-Based Part-of-Speech Tagger Evaluation on the TEST Data Set: tagging accuracy before training, after training I, and after training II.]

[Figure 10. Parsing coverage of MUC-II training and test sentences: the domain-specific grammar alone versus the part-of-speech tagger plus the domain-specific grammar.]
Korean-to-English Translation
In the early stages of our project, we learned that
SYSTRAN, Inc., a company with a long and successful history of work in machine translation, had just
embarked on a Department of Defense (DoD)-sponsored project in Korean-to-English translation [30,
31]. Rather than develop the Korean-to-English part
of the system ourselves, we chose to gain leverage
from that work, and initiated a subcontract with
SYSTRAN to adapt their Korean-to-English system
[Figure. Speech-recognition and speech-translation performance on MUC-II naval message data. The sentences averaged 12 words in length, and 54% of the sentences were perfectly recognized. Speech-translation performance, shown only for those sentences that were translated correctly by the text-translation system, is 85%, which demonstrates the capability of the parser to handle errors in the text input that it receives.]
to the MUC-II domain. Although this made our two-way system asymmetric in that the SYSTRAN system
uses a transfer approach instead of an interlingua approach, we decided that the advantage in expediting
development was worthwhile.
To provide a Korean MUC-II corpus, we separately contracted with another organization to produce human translations of 338 MUC-II corpus sentences into Korean. We then supplied this Korean
corpus to SYSTRAN for their training, developing,
and testing. From this Korean corpus, 220 sentences
were used for training and 118 sentences were used
for testing. During training, significant changes were
made to all modules because the system had never
dealt with telegraphic messages of this type. The system dictionary, which had about 20,000 Korean entries but lacked many terms in naval operations reports, was augmented to include the new words in
MUC-II. We found that performance on the MUC-II Korean-to-English task was good; 57% of the translations of the test sentences were at least close to correct. The C2W corpus, in contrast, contained a vocabulary of 8500 words and 3400 sentences with an average length of 15 words.
The new material created challenges. In particular,
the sentences were longer and more complex than
those in the MUC-II corpus. We were motivated by
the C2W corpus to confront some of the difficult
challenges in machine translation, which in turn led
us to develop a more complete and robust translation
system, as described below.
The C2W Data
For the C2W data, we focused our effort on developing a technique to handle complex sentences that includes fragmentation of a sentence into meaningful
subunits before parsing, and composition of the corresponding semantic-frame fragments into a single
unified semantic frame. Compared to those of the
MUC-II corpus, the sentences in the C2W data are
much longer and are written in grammatical English:
A mastery of military art is a prerequisite to successful practice of military deception but the mastery of military deception takes military art to a
higher level.
Although opportunities to use deception should not
be overlooked, the commander must also recognize
situations where deception is not appropriate.
Often, the skillful application of the tenets of military operations (initiative, agility, depth, synchronization, and versatility), combined with effective OPSEC, will suffice in dominating the actions of the opponent.
Such long, complex sentences are difficult to parse.
Acquiring a set of grammar rules that incorporate all
instances of complex sentences is not easy. Even if a
complex sentence is covered by the grammar, a long
sentence induces a higher degree of ambiguity than a
short sentence, requiring a much longer processing
time. To overcome the problems posed by complex sentences, we have been developing sentence-fragmentation and semantic-frame
composition techniques. We briefly describe these
techniques below.
Sentence Fragmentation
For sentence fragmentation, the input sentence is first
parsed with the Apple Pie Parser, a system developed
at New York University. This system runs on a corpus-based probabilistic grammar and produces the
parse tree with the highest score among the trees derived from the input [34]. Our sentence-fragmentation algorithm [35] is applied to the Apple Pie Parser
output, producing sentence fragments that each form
a meaningful unit. Figure 13 provides an example of
the Apple Pie Parser output and fragmenter output.
As the Apple Pie Parser output and the fragmented
output show, the fragmentation algorithm extracts
elements with category labels such as TOINF and
SBAR, each of which forms an independent meaning
unit [36]. Once a fragment is extracted from the
higher-level category, the label of the extracted element is left behind to compose the component
semantic frames at a later stage. In Figure 13, two fragments have been extracted from the input sentence: an adverbial clause (although opportunities to use deception should not be overlooked) whose category
label in the parsing output is SBAR, and a relative
clause (where deception is not appropriate ) whose category label is also SBAR. Labels of these two extracted
elements are left in the first fragment as adverbc1 and
relclause1, respectively. Likewise, an infinitival clause
whose category label in the parsing output is TOINF
has been extracted from the adverbial clause, leaving
its label toinfc1 in the second fragment.
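The extraction step can be sketched as a recursive walk over a bracketed parse: subtrees labeled SBAR or TOINF are pulled out as fragments, and a placeholder tag such as toinfc1 is left behind for later composition. The (label, children) tree encoding below is a simplified stand-in for Apple Pie Parser output.

```python
# Fragment a parse tree: extract SBAR/TOINF subtrees into separate
# fragments, leaving a tag (e.g. "toinfc1") behind where each was
# removed. Trees are (label, children) pairs with strings as leaves;
# this encoding is a simplified stand-in for Apple Pie Parser output.

EXTRACT = {"SBAR": "adverbc", "TOINF": "toinfc"}

def fragment(tree, fragments, counts):
    label, children = tree
    kept = []
    for child in children:
        if isinstance(child, str):
            kept.append(child)
        elif child[0] in EXTRACT:
            counts[child[0]] = counts.get(child[0], 0) + 1
            tag = EXTRACT[child[0]] + str(counts[child[0]])
            kept.append(tag)  # leave the label behind for composition
            fragments.append((tag, fragment(child, fragments, counts)))
        else:
            kept.append(fragment(child, fragments, counts))
    return (label, kept)

def words(tree):
    if isinstance(tree, str):
        return tree
    return " ".join(words(c) for c in tree[1])

tree = ("S", [("NP", ["military", "art"]),
              ("VP", ["focuses", "on",
                      ("NP", ["the", "direct", "use", "of", "military", "force"]),
                      ("TOINF", ["to", "impose", "one's", "intent",
                                 "on", "an", "opponent"])])])

fragments = []
main = fragment(tree, fragments, {})
print(words(main))
# military art focuses on the direct use of military force toinfc1
print(fragments[0][0], "->", words(fragments[0][1]))
```

The tags recorded in fragments are exactly what the composition stage later uses to splice each fragment's semantic frame back into the main frame.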
Input: Military art focuses on the direct use of military force to impose one's intent on an opponent.

Fragment 1: Military art focuses on the direct use of military force toinfc1
Fragment 2 (toinfc1): to impose one's intent on an opponent

Semantic frame 1 (for fragment 1):
:statement
  :topic art  :pred military
  :pred focus_v  :mode present  :number third
    :pred v_on
      :topic use  :pred direct
        :pred n_of
          :topic force  :pred military
      :to_infinitive "toinfc1"

Semantic frame 2 (for fragment 2):
fragment
  :tag toinfc1
  :to_infinitive to_inf
    :pred impose  :mode root
    :topic intent  :pred ones
    :pred v_on
      :topic opponent

Combined semantic frame:
:statement
  :topic art  :pred military
  :pred focus_v
    :pred v_on
      :topic use  :pred direct
        :pred n_of
          :topic force  :pred military
      :to_infinitive to_inf
        :pred impose
        :topic intent  :pred ones
        :pred v_on
          :topic opponent

[The parse trees of the two fragments, parts (a) and (c) of Figure 14, are omitted here.]
FIGURE 14. Operation of the robust translation system for parsing and understanding sentence fragments, composing the results into a combined semantic frame, and producing the final translation and paraphrase. In this example, two fragments are processed. The parts of the figure are (a) parse tree 1, (b) semantic frame 1, (c) parse tree 2, (d) semantic frame 2, and (e) the combined semantic frame with paraphrase and translation. The labels in red represent the categories that have been extracted by the fragmentation algorithm.
FIGURE 15. Process flow of robust translation system. Given an input sentence, the translation system assigns parts of speech
to each word. Parsing takes place with the part-of-speech sequence as input. If parsing succeeds at this stage, the corresponding semantic frame is produced. If parsing does not succeed, the input sentence is fragmented, and parsing takes place
on each fragment. Once parsing and semantic-frame generation of all of the fragments has been completed, the semantic
frames for the fragments are composed. Generation proceeds with the composed semantic frame as input.
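The control flow of Figure 15 amounts to a cascade of fallbacks. The skeleton below captures that cascade in Python; every component stage (parsing, fragmentation, composition, word-for-word understanding, generation) is stubbed with an illustrative toy so that the control flow itself can be exercised.

```python
# Multistage robust translation, following the Figure 15 control flow:
# attempt a full parse first; on failure, fragment the sentence and
# parse the pieces, composing their semantic frames; as a last resort,
# fall back to word-for-word understanding. All stages are toy stubs.

def translate(sentence, parse, fragment, compose, word_for_word, generate):
    tagged = sentence.split()          # stand-in for part-of-speech tagging
    frame = parse(tagged)
    if frame is None:                  # full parse failed: try fragments
        frames = [parse(f) for f in fragment(tagged)]
        if all(f is not None for f in frames):
            frame = compose(frames)
        else:                          # last resort
            frame = word_for_word(tagged)
    return generate(frame)

# Toy stages: "parse" succeeds only on short inputs; "fragment" halves.
parse = lambda ws: ("frame", ws) if len(ws) <= 4 else None
fragment = lambda ws: [ws[:len(ws) // 2], ws[len(ws) // 2:]]
compose = lambda frames: ("frame", [w for _, ws in frames for w in ws])
word_for_word = lambda ws: ("frame", ws)
generate = lambda frame: " ".join(frame[1])

# Short input: parsed in one pass. Longer input: fragmented, each half
# parsed, frames composed. The toy generate just re-joins the words.
print(translate("aircraft launched at 1300", parse, fragment, compose,
                word_for_word, generate))
print(translate("hostile forces launched massive recon effort from airfield",
                parse, fragment, compose, word_for_word, generate))
```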
FIGURE 16. Major software modules of the current implementation of the CCLINC automated translation system. The graphical user interface interacts with both the English-to-Korean and Korean-to-English translation systems. The English-to-Korean system consists of three subsystems: speech recognition, language understanding, and language generation. The language-understanding system interacts with two subsystems for robust processing: the rule-based part-of-speech tagger and the Apple Pie Parser and fragmenter. The Korean-to-English system consists of two systems that employ different approaches to machine translation: the interlingua-based system being developed at Lincoln Laboratory and the transfer-based system developed by SYSTRAN under a subcontract.
Input

CINC's Daily Guidance Letter

Purpose
- Disseminate CINC's guidance of the past 24 hours
- Planning guidance for Future Operations
- Guidance for Integrated Task Order (ITO)

Objectives
- Summary of CINC's operational guidance for future ops
- Issue CINC's prioritized guidance

Products
- C3 Plans provide draft to C3 prior to Morning Update for CINC approval
- C3 Plans provide approved letter to CFC Staff, Components, and Subordinates

Translation

[Korean Hangul translation of each sentence above]
FIGURE 17. Sample slide of Commander-in-Chief (CINC) briefing material, in which each English sentence has been translated
by CCLINC. The development of CCLINC to achieve high performance for a large variety of such material is the focus of our
current work.
sibility demonstrations of automated two-way English-Korean text and speech translation for military messages; (2) development of a modular, interlingua-based translation system that is extendable to multiple languages and to human interaction with C4I systems; (3) development of a multistage, robust translation system to handle complex text; (4) development of an integrated graphical user interface for a translator's aid; and (5) several successful demonstrations and technology-transfer activities, including participation in the RIMPAC 96 coalition exercise on board the USS Coronado and the RSO&I coalition exercises at CFC Korea.
Our plans for the future involve extending the system capability to additional application domains, including translation of operations orders and operations plans. We will expand our recently begun effort in developing an interlingua-based Korean-to-English translation system by using the same understanding-based technology that we have applied to English-to-Korean translation. Ultimately, we hope to integrate the system's understanding capabilities with C4I systems to allow multilingual human-computer and human-human communication. One such application would involve a report translated by the system for communication among coalition partners. The report's meaning, captured in the semantic frame, would be conveyed to the C4I system to update databases with situation-awareness information.
Acknowledgments
This project has benefited from the contributions of
individuals inside and outside Lincoln Laboratory,
and we particularly appreciate the contributions of
and interactions with people in the DoD and research
communities. We would like to cite the contributions
of the following people: Ronald Larsen, Allen Sears,
George Doddington, John Pennella, and Lt. Comdr.
Robert Kocher, DARPA; Seok Hong, James Koh,
Col. Joseph Jaremko, Lt. Col. Charles McMaster, Lt.
David Yi, and Willis Kim, U.S. Forces Korea-Combined Forces Command; Beth Sundheim and Christine Dean, NRaD; Capt. Richard Williams and Neil
Weinstein, USS Coronado, Command Ship of the
Third Fleet; Victor Zue, James Glass, Ed Hurley, and
Christine Pao, MIT Laboratory for Computer Science.
VOLUME 10, NUMBER 1, 1997
Clifford J. Weinstein
leads the Information Systems
Technology group and is
responsible for initiating and
managing research programs
in speech technology, machine
translation, and information
system survivability. He joined
Lincoln Laboratory as an MIT
graduate student in 1967, and
became group leader of the
Speech Systems Technology
group (now Information
Systems Technology group) in
1979. He has made technical
contributions and carried out
leadership roles in research
programs in speech recognition, speech coding, machine
translation, speech enhancement, packet speech communications, information system
survivability, integrated voice-data communication networks, digital signal processing, and radar signal
processing. Since 1986, Cliff
has been the U.S. technical
specialist on the NATO
RSG10 Speech Research
Group, authoring a comprehensive NATO report and
journal article on applying
advanced speech technology in
military systems. In 1993, he
was elected an IEEE Fellow for
technical leadership in speech
recognition, packet speech,
and integrated voice-data
network. He received S.B.,
S.M., and Ph.D. degrees in
electrical engineering from
MIT.
Young-Suk Lee
is a staff member in the Information Systems Technology
group, and has been working
on machine translation since
joining Lincoln Laboratory in
1995. As a principal investigator of the Korean-English
translation project, she helps
develop and integrate several
submodules of the CCLINC
system, including English and
Korean understanding and
generation, part-of-speech
tagging, robust parsing, grammar and lexicon acquisition
and updating, and graphical
user interface. Her main
research interest is in the
development of interlingual
representation with semantic
frames for multilingual machine translation and other
multilingual applications.
Before coming to Lincoln
Laboratory, she taught
linguistics at Yale University.
She received a B.A. degree in
English linguistics and literature from Seoul National
University, Korea, where she
graduated summa cum laude
in 1985. She also has an
M.S.E. degree in computer
and information science and a
Ph.D. degree in linguistics
from the University of Pennsylvania. She is a member of
the Association for Computational Linguistics and the
Linguistic Society of America.
Stephanie Seneff
is a principal research scientist
in the Spoken Language Systems group at the MIT Laboratory for Computer Science.
During the 1970s, she was a
member of the research staff at
Lincoln Laboratory, where her
research encompassed a wide
range of speech processing
topics, including speech synthesis, voice encoding, feature
extraction (formants and
fundamental frequency),
speech transmission over
networks, and speech recognition. Her doctoral thesis
concerned a model for human
auditory processing of speech,
and some of her later work has
focused on the application of
auditory modeling to computer speech recognition. Over
the past several years, she has
become interested in natural
language, and has participated
in many aspects of the development of spoken language
systems, including parsing,
grammar development, discourse and dialogue modeling,
probabilistic natural-language
design, and integration between speech and natural
language. She is a member of
the Association for Computational Linguistics and the
IEEE Society for Acoustics,
Speech and Signal Processing,
serving on their Speech Technical committee. She received
a B.S. degree in biophysics,
and M.S., E.E., and Ph.D.
degrees in electrical engineering, all from MIT.
Dinesh R. Tummala
works to expand and adapt
machine-translation systems to
larger and new domains as a
staff member in the Information Systems Technology
group. He also develops semiautomated lexicon and grammar acquisition techniques.
He joined Lincoln Laboratory
in 1993, after researching
pattern recognition systems
and natural-language interfaces in information retrieval
during internships at Digital
Equipment Corporation. He
received an S.B. degree in
computer science and engineering and an S.M. degree in
electrical engineering and
computer science from MIT.
He was awarded a National
Science Foundation Graduate
Fellowship.
Beth Carlson
is a former staff member of the
Information Systems Technology group. She researched and
developed algorithms for
information retrieval, machine
translation, and foreign language instruction before
leaving Lincoln Laboratory in
February 1997. Prior to this
position, she worked for GTE
Laboratories in Waltham,
Mass., developing speech-recognition algorithms for
telephone and cellular applications. She received B.E.E. and
Ph.D. degrees in electrical
engineering from Georgia
Institute of Technology.
John T. Lynch
worked with the Information
Systems Technology group for
twelve years before retiring in
1996 to study psychology. His
research involved test and
evaluation of speech technology systems and machine-translation systems. He also
worked on applications for
automated speech and text
information retrieval and
classification. During his last
three years, he served as an
appointed volunteer
ombudsperson. He joined the
Optical Communications
group at Lincoln Laboratory
in 1970 and worked for five
years on various aspects of the
Lincoln Experimental Satellites (LES) 8 and 9. Then he
spent three years at the MIT
Center for Advanced Engineering Study as director of
Tutored Video Instruction, a
continuing education program
for industrial engineers that
videotapes MIT classes. This
effort was followed by two
years of developing superconducting signal processing
devices with the Analog Device Technology group at
Lincoln Laboratory. He then
joined the faculty of Boston
University as associate professor of electrical engineering for
three years before returning to
Lincoln Laboratory. He received S.B. and S.M. degrees
in electrical engineering from
MIT and a Ph.D. degree in
electrical engineering from
Stanford University.
Jung-Taik Hwang
works for JLM Technologies,
Inc., in Boston, Mass., as a
system architect and consultant, designing solutions to
client problems. Prior to
joining JLM Technologies, he
was a research assistant in the
Information Systems Technology group, working on techniques to improve the performance of machine translation
of long sentences. He received
B.S. and S.M. degrees in
computer science from MIT.
Linda C. Kukolich
develops and maintains software systems for the Information Systems Technology
group. Previously she developed software for the Optical
Communications Systems
Technology group. She received a B.S. degree in applied
mathematics from MIT.