Showing 1–9 of 9 results for author: Batsuren, K

  1. arXiv:2404.13292  [pdf, other]

    cs.CL cs.AI

    Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

    Authors: Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

    Abstract: The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework…

    Submitted 20 April, 2024; originally announced April 2024.

  2. State-of-the-art generalisation research in NLP: A taxonomy and review

    Authors: Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin

    Abstract: The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any evaluation standards for generalisation. In this paper, we lay the groundwork to address both of these issues. We present a taxonomy for characterising and understanding generalisation…

    Submitted 12 January, 2024; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: This preprint was published as an Analysis article in Nature Machine Intelligence. Please refer to the published version when citing this work. 28 pages of content + 6 pages of appendix + 52 pages of references

    Journal ref: Nat Mach Intell 5, 1161-1174 (2023)

  3. arXiv:2210.01734  [pdf, other]

    cs.CL cs.LG

    Text Characterization Toolkit

    Authors: Daniel Simig, Tianlu Wang, Verna Dankers, Peter Henderson, Khuyagbaatar Batsuren, Dieuwke Hupkes, Mona Diab

    Abstract: In NLP, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that - especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations - deeper results analysis should become the de-facto standard when presenting new models or benchmarks. We p…

    Submitted 4 October, 2022; originally announced October 2022.

  4. arXiv:2206.07615  [pdf, other]

    cs.CL

    The SIGMORPHON 2022 Shared Task on Morpheme Segmentation

    Authors: Khuyagbaatar Batsuren, Gábor Bella, Aryaman Arora, Viktor Martinović, Kyle Gorman, Zdeněk Žabokrtský, Amarsanaa Ganbold, Šárka Dohnalová, Magda Ševčíková, Kateřina Pelegrinová, Fausto Giunchiglia, Ryan Cotterell, Ekaterina Vylomova

    Abstract: The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissi…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: The 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

  5. arXiv:2205.03608  [pdf, other]

    cs.CL

    UniMorph 4.0: Universal Morphology

    Authors: Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, et al. (71 additional authors not shown)

    Abstract: The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa…

    Submitted 19 June, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

    Comments: LREC 2022; The first two authors made equal contributions

  6. arXiv:2204.05049  [pdf, other]

    cs.CL

    Using Linguistic Typology to Enrich Multilingual Lexicons: the Case of Lexical Gaps in Kinship

    Authors: Temuulen Khishigsuren, Gábor Bella, Khuyagbaatar Batsuren, Abed Alhakim Freihat, Nandu Chandran Nair, Amarsanaa Ganbold, Hadi Khalilia, Yamini Chandrashekar, Fausto Giunchiglia

    Abstract: This paper describes a method to enrich lexical resources with content relating to linguistic diversity, based on knowledge from the field of lexical typology. We capture the phenomenon of diversity through the notions of lexical gap and language-specific word and use a systematic method to infer gaps semi-automatically on a large scale. As a first result obtained for the domain of kinship termino…

    Submitted 11 April, 2022; originally announced April 2022.

    Comments: LREC 2022

  7. arXiv:2203.04723  [pdf, other]

    cs.CL

    Language Diversity: Visible to Humans, Exploitable by Machines

    Authors: Gábor Bella, Erdenebileg Byambadorj, Yamini Chandrashekar, Khuyagbaatar Batsuren, Danish Ashgar Cheema, Fausto Giunchiglia

    Abstract: The Universal Knowledge Core (UKC) is a large multilingual lexical database with a focus on language diversity and covering over a thousand languages. The aim of the database, as well as its tools and data catalogue, is to make the somewhat abstract notion of diversity visually understandable for humans and formally exploitable by machines. The UKC website lets users explore millions of individual…

    Submitted 9 March, 2022; originally announced March 2022.

    Comments: Accepted for publication in ACL 2022

  8. arXiv:2104.05658  [pdf, other]

    cs.CY cs.AI

    Towards Algorithmic Transparency: A Diversity Perspective

    Authors: Fausto Giunchiglia, Jahna Otterbacher, Styliani Kleanthous, Khuyagbaatar Batsuren, Veronika Bogina, Tsvi Kuflik, Avital Shulner Tal

    Abstract: As the role of algorithmic systems and processes increases in society, so does the risk of bias, which can result in discrimination against individuals and social groups. Research on algorithmic bias has exploded in recent years, highlighting both the problems of bias, and the potential solutions, in terms of algorithmic transparency (AT). Transparency is important for facilitating fairness manage…

    Submitted 12 April, 2021; originally announced April 2021.

  9. arXiv:2103.16953  [pdf, other]

    cs.CY

    Mitigating Bias in Algorithmic Systems -- A Fish-Eye View

    Authors: Kalia Orphanou, Jahna Otterbacher, Styliani Kleanthous, Khuyagbaatar Batsuren, Fausto Giunchiglia, Veronika Bogina, Avital Shulner Tal, Alan Hartman, Tsvi Kuflik

    Abstract: Mitigating bias in algorithmic systems is a critical issue drawing attention across communities within the information and computer sciences. Given the complexity of the problem and the involvement of multiple stakeholders -- including developers, end-users, and third parties -- there is a need to understand the landscape of the sources of bias, and the solutions being proposed to address them, fr…

    Submitted 20 February, 2022; v1 submitted 31 March, 2021; originally announced March 2021.

    ACM Class: K.4.0; A.1