
Machine Translation for English to Kannada

Mallamma V Reddy (1), Dr. M. Hanumanthappa (2)

(1, 2) Department of Computer Science and Applications,
Bangalore University, Bangalore, INDIA
(1) [email protected]
(2) [email protected]

Abstract—An interlingua is an artificial language used to represent the meaning of natural languages, for example for the purposes of machine translation. It is an intermediate form between two or more languages. Machine translation is the process of translating source-language text into a target language. This paper proposes a new model of a machine translation system in which rule-based and example-based approaches are applied to English-to-Kannada/Telugu sentence translation. The proposed method has four steps: 1) analyze an English sentence into a string of grammatical nodes, based on Phrase Structure Grammar; 2) map the input pattern against a table of English-Kannada/Telugu sentence patterns; 3) look up the bilingual dictionary for the equivalent Kannada/Telugu words, reorder them, and generate output sentences; and 4) rank the possible combinations and eliminate ambiguous output sentences using a statistical method. The translated sentences are then stored in a bilingual corpus to serve as a guide or template for imitating the translation, i.e., the example-based approach. Future work will focus on sentence translation using semantic features to make the translation more precise.

Keywords—Morphological analyser, Machine Translation [MT], part-of-speech tagger

I. Introduction
Today, India has fifteen official languages. These languages originated from the Indo-Iranian branch of the Indo-European language family, the non-Indo-European Dravidian family, and the Austro-Asiatic, Tai-Kadai and Sino-Tibetan language families. The languages that stem from the Dravidian family are Tamil, Kannada, Malayalam and Telugu, spoken in the South Indian states of Tamil Nadu, Karnataka, Kerala and Andhra Pradesh. Most modern languages in North India, such as Hindi, Urdu, Punjabi, Gujarati, Bengali, Marathi, Kashmiri, Sindhi, Konkani, Rajasthani, Assamese and Oriya, stem from Sanskrit and Pali.

Kannada, or Canarese, is a language spoken in India predominantly in the state of Karnataka, and it is the 25th most spoken language in the world. It has given rise to Indian languages such as Tulu and Kodava, and it is one of the scheduled languages of India and the official and administrative language of the state of Karnataka [1]. Telugu is also one of the widely spoken languages in India, especially in the state of Andhra Pradesh and the district of Yanam. Both Kannada and Telugu use the "UTF-8" / Western Windows encoding and draw their vocabulary mainly from Sanskrit.

Various efforts have been made to develop machine translation (MT) systems for practical use. Historically, there have been many approaches to MT research: transfer-based, interlingua-based, etc.
Among these approaches, the most distinctive are the rule-based and corpus-based methods. Research on the corpus-based approach has emphasized the importance of text corpora as a source for linguistic and knowledge databases. There are two major corpus-based MT approaches, known as statistics-based and example-based. All approaches have their own pros and cons. Therefore some MT researchers [2] have selected and combined them to create a new, effective model. We also combine two promising approaches to produce our own strategy, namely rule-based and example-based.

A. Rule-Based and Example-Based Approaches
Rule-based translation mostly consists of (1) a process of analyzing input sentences of a source language morphologically, syntactically and/or semantically and (2) a process of generating output sentences of a target language based on an internal structure or interlingua. Each process is controlled by the dictionary and the rules. Meanwhile, the basic idea of the example-based method [2] is to translate a sentence by using translation examples of similar sentences. The primary steps of the example-based method are 1) collect examples in a database, 2) given an input, retrieve similar examples from the database, and 3) adapt the results of the similar examples to the current input and obtain the output.

B. The Hybrid Translation Method
Many researchers apply both the rule-based and example-based methods in their own hybrid methods. Shirai et al. [3] propose a new hybrid translation method that combines a rule-based with an example-based method. An outline of the hybrid algorithm is: 1) find candidate sentences which are similar to the input sentence, 2) select the template: (a) rank the candidates by similarity to the input sentence, (b) cluster the translations of the candidate sentences, (c) select the highest ranked pair of the best cluster, 3) translate the input sentence by analogy to the selected template, and 4) output the adjusted sentence. Each difference between the input sentence and the template is located and translated using the rule-based modules.

C. Interlingual Machine Translation
An interlingua is an artificial language used to represent the meaning of natural languages, for example for the purposes of machine translation. It is an intermediate form between two or more languages. Interlingual machine translation is a methodology that employs an interlingua for translation. Ideally, the interlingual representation of the text should be sufficient to generate sentences in any language. Languages can have different parts of speech. In some cases two or more words in one language have a single equivalent word in another language. The interlingua approach, shown in Fig. 1, addresses these structural differences between languages. The disadvantage is that the design of an interlingua is too complex, because no clear methodology has been developed so far to build a perfect interlingual representation.

An interlingual lexicon is necessary to store information about the nature and behavior of each word in the language. The information includes events and actions.
Fig 1: Interlingual Machine Translation.
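
As a rough illustration of the interlingual idea in Fig. 1, the Python sketch below analyzes a three-word English subject-verb-object sentence into a language-neutral frame and then synthesizes the constituent order for an SVO or an SOV target. The names `Frame`, `analyze_english_svo` and `synthesize`, and the three-word restriction, are assumptions made for this illustration and are not part of the proposed system; a real synthesizer would also substitute target-language words and inflections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Toy interlingual representation: who does what to what."""
    agent: str
    action: str
    theme: str


def analyze_english_svo(sentence: str) -> Frame:
    """Analyzer: map a three-word English SVO sentence to a Frame."""
    words = sentence.lower().rstrip(".").split()
    if len(words) != 3:
        raise ValueError("this toy analyzer handles exactly three words")
    subject, verb, obj = words
    return Frame(agent=subject, action=verb, theme=obj)


def synthesize(frame: Frame, order: str) -> list[str]:
    """Synthesizer: emit the frame's concepts in the target constituent order."""
    slots = {"S": frame.agent, "V": frame.action, "O": frame.theme}
    return [slots[symbol] for symbol in order]


frame = analyze_english_svo("Rama reads books.")
english_order = synthesize(frame, "SVO")  # subject, verb, object
kannada_order = synthesize(frame, "SOV")  # subject, object, verb
```

Because both orders are generated from the same frame, adding another target language in this sketch only requires another synthesizer setting, not another analyzer.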
A typical interlingual MT system has an analyzer and a synthesizer for each language. The analyzer produces an interlingual representation of the meaning of the given text. The synthesizer produces one or more sentences with the meaning given by the analyzer.

II. Challenges in Machine Translation

Machine translation [4] is the process of translating source-language text into the target language. The following is a list of challenges one has to face when attempting machine translation.

• Not all the words in one language have equivalent words in another language. In some cases a word in one language has to be expressed by a group of words in another.
• Two given languages may have completely different structures. For example, English has SVO structure while Kannada/Telugu has SOV structure.
• Sometimes there is a lack of one-to-one correspondence of parts of speech between two languages. For example, color terms in Kannada/Telugu are nouns whereas in English they are adjectives.
• The ways sentences are put together also differ among languages.
• Words can have more than one meaning, and sometimes a group of words or a whole sentence may have more than one meaning in a language. This problem is called ambiguity.
• Not all translation problems can be solved by applying rules of grammar.
• It is too difficult for software programs to predict meaning.
• Translation requires not only vocabulary and grammar but also knowledge gathered from past experience.
• The programmer should understand the rules under which complex human language operates and how the mechanism of this operation can be simulated by automatic means.
• The simulation of human language behavior by automatic means is almost impossible to achieve, as language is an open and dynamic system in constant change. More importantly, the system is not yet completely understood.

III. Machine Translation

The challenges mentioned above can be addressed by using all the phases involved in machine translation, depicted in Fig. 2.

Fig 2: A Typical Machine Translation Process.

Text Input: This is the first phase in the machine translation process [4] and is the first module in any MT system. The sentence categories can be classified based on the degree of difficulty of translation. Sentences that contain relations, expectations, assumptions, and conditions are very difficult for the MT system to understand. Speaker's
intentions and mental status expressed in the sentences require discourse analysis for interpretation. This is due to the interrelationship among adjacent sentences. World knowledge and commonsense knowledge could be required for interpreting some sentences.

Deformatting and reformatting: This makes the machine translation process easier and improves its quality. The source-language text may contain figures, flowcharts, etc. that do not require any translation, so only the portions to be translated should be identified. Once the text is translated, the target text is reformatted after post-editing. Reformatting ensures that the target text also contains the non-translated portions.

Pre-editing and Post-editing: The level of pre-editing and post-editing depends on the efficiency of the particular MT system. For some systems, segmenting long sentences into short sentences may be required. Fixing punctuation marks and blocking material that does not require translation are also done during pre-editing. Post-editing is done to make sure that the quality of the translation is up to the mark. Post-editing is unavoidable, especially for the translation of crucial information such as health information. Post-editing should continue until MT systems reach human-like quality.

Analysis, Transfer and Generation: Morphological analysis [5] determines the word form, such as inflections, tense, number, part of speech, etc., as shown in Table I and Table II. Syntactic analysis determines whether the word is a subject or an object. Semantic and contextual analysis determines a proper interpretation of a sentence from the results produced by the syntactic analysis. Syntactic and semantic analyses are often executed simultaneously and produce a syntactic tree structure and a semantic network, respectively. This results in the internal structure of a sentence. The sentence generation phase is just the reverse of the process of analysis.

TABLE I. FEW INFLECTIONS OF A VERB STEM AND ITS CORRESPONDING MEANINGS

TABLE II. DIFFERENT CASES AND THEIR CORRESPONDING CHARACTERISTIC SUFFIXES FOR NOUNS

Morphological analysis and generation: Computational morphology deals with recognition, analysis and generation of
words. Some of the morphological processes are inflection, derivation, affixes and combining forms, as shown in Table III. Inflection is the most regular and productive morphological process across languages. Inflection alters the form of the word in number, gender, mood, tense, aspect, person, and case. A morphological analyser [5] gives information concerning the morphological properties of the words it analyses.

In Kannada, adjacent words are often joined and pronounced as one word. Such word combinations occur in two ways: Sandhi and Samasa. Sandhi (morphophonemics) deals with the changes that occur when two words or separate morphemes come together to form a new word. A few sandhi types are native to Kannada and a few are borrowed from Sanskrit. Our tool handles only Kannada sandhi; it does not handle Samasa.

TABLE III. SANDHI TYPES AND EXAMPLES FOR WORD COMBINATION

Syntactic analysis and generation: As words are the foundation of speech and language processing, syntax can be considered the skeleton. Syntactic analysis is concerned with how words are grouped into classes called parts of speech (shown in Table IV), how they group with their neighbors into phrases, and the way in which words depend on other words in a sentence.

TABLE IV. INFLECTIONS OF A NOUN STEM AND ITS CORRESPONDING MEANINGS

Grammar formalism: A grammar formalism is a framework to explain the basic structure of a language. Researchers propose the following grammar formalisms: Phrase Structure Grammar (PSG), Dependency Grammar, Case Grammar, Systemic Grammar, and Montague Grammar.

The variants of PSG are: Context Free PSG, Context Sensitive PSG, Augmented Transition Network Grammar (ATN), Definite Clause (DC) Grammar, Categorial Grammar, Lexical Functional Grammar (LFG), Generalised PSG, Head Driven PSG, and Tree Adjoining Grammar (TAG).

Not all grammars suit a particular language. PSG, for example, does not suit Japanese, while dependency grammar does. Case grammar is popular because sentences in different languages that express the same contents may have the same case frames.

Parsing and Tagging: Tagging means the identification of linguistic properties of the
individual words and parsing is the assessment of the functions of the words in relation to each other.

Semantic and Contextual analysis and Generation: Semantic analysis composes the meaning representations and assigns them to the linguistic inputs. The semantic analyser uses the lexicon and grammar to create context-independent meanings. The sources of knowledge consist of the meanings of words, the meanings associated with grammatical structures, knowledge about the discourse context, and commonsense knowledge.

IV. Approach
The following approach is designed to produce an experimental system for translating English into Kannada/Telugu by using the 4 basic sentence patterns as a template. The output sentences are then stored as raw data for later application of an example-based method. The outline of the system is as follows:

1. Morphological analysis
2. Pattern mapping
3. Looking up bilingual dictionary
4. Disambiguating possible combinations

A. Morphological Analysis
An input sentence is first segmented into words. Written English sentences are segmented automatically, since each word is separated by a pause or space. Each word is then analyzed morphologically into a morpheme (in the form of a stem or root) by applying morphological analysis rules such as those shown in Fig. 3:

if check_RightPos(1) = "s" then
    if check_RightPos(2) = "es" then
        if check_RightPos(3) in {"ies", "ves"} then
            if check_RightPos(3) = "ies" then
                cut_RightPos(3);
                Add_char("y");
            else
                cut_RightPos(3);
                Add_char("f");
            end if
            if Search_Dic() = TRUE then
                break;
            end if
        else
            cut_RightPos(2);
            if Search_Dic() = TRUE then
                break;
            end if
        end if
    else
        cut_RightPos(1);
        if Search_Dic() = TRUE then
            break;
        end if
    end if
end if

Fig 3: Sample of morphological rules for cutting off the suffixes of English plurality.

B. Pattern Mapping
We attempt to map each pair of patterns, from the simplest to the least simple, using their similarity as the basis. In brief, a pair that can be mapped should be identical in both surface and deep structure. The syntactic and semantic criteria of pattern mapping that we have assumed, based on Phrase Structure Grammar [7] and Case Grammar respectively, are:

a) Each entry or word in a pair should have or represent the same syntactic relationship, such as "subject", "verb" and "object", lying in linear order from left to right,
b) Each entry should underlie the same semantic relationship, such as an "agent" of the action, an "object" or an "experiencer", etc.

Pattern mapping, or transfer between the two languages, involves a few steps. First, an English input sentence is syntactically analyzed into a series of non-terminal symbols (NP, VI, VT, ADJ, etc.). This string is checked against the table of E-K sentence pattern mapping (Fig. 4). If the pattern of the input sentence is identical to any English pattern, it is mapped to the corresponding Kannada/Telugu sentence pattern. Next, each English lexical entry is reordered according to the word order of the Kannada/Telugu [6] sentence pattern. If differing sections are found, the rules can be applied before entering the next stage.

The following are the Kannada grammatical productions for a robot to explain simple instructions:

Fig 4: E-K sentence pattern mapping.

C. Looking Up Bilingual Dictionary and Generating
The bilingual dictionary of 10,000 entries is created in dbase format and looked up to map the equivalent Kannada/Telugu entries onto the input string. Then a Kannada/Telugu output sentence is generated. Because one word can have multiple meanings, this process inevitably produces a large number of possible combinations. Therefore we plan to use statistical data to determine the most likely one. At the least, this can help reduce the number of candidates.

D. Possible Combinations
In this step the statistical method is used to calculate the probabilities of the words to be translated. In other words, we search through the stored statistical data and pick out the most likely word for our translation. With this method, we can eliminate a large number of possible combinations or candidate sentences. Output sentences that are ambiguous or nonsensical are deleted as far as possible. As a result, we can obtain the most accurate and acceptable outcome. For example, for a two-word query, the translation for the first word is {green} and the translations for the second word are {hang, designing}. Here, based on the context, we can see that the choice of translation [8] for the second word is water, since it is more likely to co-occur with river.

V. Experimental Results
The Cross Language Information Retrieval Tool [9] is built using ASP.NET as the front end, and in the database the Kannada text is encoded using the encoding system. Sample results are shown in the figures below.

Fig 5: sample result for English-Kannada for Name
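
The statistical selection described in Section IV.D can be sketched in Python as follows, assuming the stored statistical data are co-occurrence counts between English words. The function `pick_translation`, the count table `COOCCURRENCE`, and all of the numbers are hypothetical values invented for this illustration; they are not data from the system.

```python
from collections import Counter

# Hypothetical co-occurrence counts between English words (illustrative only).
COOCCURRENCE = Counter({
    ("river", "water"): 42,
    ("river", "hang"): 1,
    ("river", "designing"): 0,
})


def pick_translation(context_word: str, candidates: list[str]) -> str:
    """Return the candidate that co-occurs most often with the context word."""
    def score(candidate: str) -> int:
        # Counter returns 0 for pairs that were never observed.
        return COOCCURRENCE[(context_word, candidate)]
    return max(candidates, key=score)


best = pick_translation("river", ["hang", "water", "designing"])
```

Ties are resolved in favour of the first candidate listed, because `max` returns the first maximal element; a fuller system would more plausibly keep several ranked candidates instead of a single winner.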
Fig 6: sample result for English-Kannada for Sentence

Conclusion
In this paper we have explained the concepts and algorithms used in implementing a bilingual translation system for English to Kannada/Telugu, which translates a given input sentence in the source language into the target language using a hybrid approach. New rules have been added to the proposed system in order to make it more efficient. This work can be extended to other domains with the addition of new rules.

Acknowledgement
This work is part of the major research project entitled Cross-Language Information Retrieval, sanctioned to Dr. M. Hanumanthappa, PI-UGC-MH, Department of Computer Science and Applications, by the University Grants Commission. We thank the UGC for financial assistance. This paper is a continuation of the project carried out at Bangalore University, Bangalore, India.

References
[1] The Karnataka Official Language Act, "Official website of department of Parliamentary Affairs and Legislation", Government of Karnataka. Retrieved 2007-06-29.
[2] Prof. Abdullah H. Homiedan. "Machine Translation".
[3] Satoshi Shirai, Francis Bond and Yamato Takahashi. 1997. "A hybrid rule and example-based method for machine translation". In Proceedings of the Natural Language Processing Pacific Rim Symposium 1997, pages 49-54, December.
[4] S. Kereto, C. Wongchaisuwat, Y. Poovarawan. 1993. "Machine translation research and development". In Proceedings of the Symposium on Natural Language Processing in Thailand, pages 167-195, March.
[5] Dr. Ramakanth Kumar P, et al., "Kannada Morphological Analyser and Generator Using Trie", IJCSNS International Journal of Computer Science and Network Security, Vol. 11, No. 1, January 2011.
[6] Ganapathiraju Madhavi, Balakrishnan Mini, Balakrishnan N, Reddy Raj, "Om: One tool for many (Indian) languages", Journal of Zhejiang University SCIENCE, Vol. 6A, No. 11, pp. 1348-1353, Oct 2005.
[7] Wittaya Nathong. 1988. Contrastive analysis of English and Thai. Ramkhamhaeng University Press, Bangkok.
[8] Mallamma V. Reddy, M. Hanumanthappa, "Kannada and Telugu Native Languages to English Cross Language Information Retrieval", (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), Sep-Oct 2011, pages 1876-1880. ISSN: 0975-9646.
[9] Mallamma V. Reddy, Dr. M. Hanumanthappa, "CLIR Project (English to Kannada and Telugu)", http://bangaloreuniversitydictionary//menu.asp
