Showing 1–50 of 62 results for author: Van Nguyen, K

  1. arXiv:2406.17716  [pdf, other]

    cs.CL

    ViANLI: Adversarial Natural Language Inference for Vietnamese

    Authors: Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The development of Natural Language Inference (NLI) datasets and models has been inspired by innovations in annotation design. With the rapid development of machine learning models today, the performance of existing machine learning models has quickly reached state-of-the-art results on a variety of tasks related to natural language processing, including natural language inference tasks. By using…

    Submitted 1 July, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2405.07615  [pdf, other]

    cs.CL

    ViWikiFC: Fact-Checking for Vietnamese Wikipedia-Based Textual Knowledge Source

    Authors: Hung Tuan Le, Long Truong To, Manh Trong Nguyen, Kiet Van Nguyen

    Abstract: Fact-checking is essential due to the explosion of misinformation in the media ecosystem. Although false information exists in every language and country, most research on the problem has concentrated on large communities such as English and Chinese. For low-resource languages like Vietnamese, corpora and models for fact verification still need to be explored. To bridge this gap, we construct ViWik…

    Submitted 13 May, 2024; originally announced May 2024.

  3. arXiv:2405.00543  [pdf, other]

    cs.CL cs.AI

    New Benchmark Dataset and Fine-Grained Cross-Modal Fusion Framework for Vietnamese Multimodal Aspect-Category Sentiment Analysis

    Authors: Quy Hoang Nguyen, Minh-Van Truong Nguyen, Kiet Van Nguyen

    Abstract: The emergence of multimodal data on social media platforms presents new opportunities to better understand user sentiments toward a given aspect. However, existing multimodal datasets for Aspect-Category Sentiment Analysis (ACSA) often focus on textual annotations, neglecting fine-grained information in images. Consequently, these datasets fail to fully exploit the richness inherent in multimodal.…

    Submitted 1 May, 2024; originally announced May 2024.

  4. arXiv:2404.18397  [pdf, other]

    cs.CV

    ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images

    Authors: Huy Quang Pham, Thang Kien-Bao Nguyen, Quan Van Nguyen, Dan Quang Tran, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Optical Character Recognition - Visual Question Answering (OCR-VQA) is the task of answering questions about text contained in images, and it has developed significantly for the English language only in recent years. However, there are limited studies of this task in low-resource languages such as Vietnamese. To this end, we introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recogniti…

    Submitted 28 April, 2024; originally announced April 2024.

  5. arXiv:2404.10652  [pdf, other]

    cs.CL

    ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

    Authors: Quan Van Nguyen, Dan Quang Tran, Huy Quang Pham, Thang Kien-Bao Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images. Initially, research on this task focused on methods to help machines understand objects and scene contexts in images. However, some text appearing in the image that carries explicit information about the full content of the image is not mentioned. Along…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: Preprint submitted to IJCV

  6. arXiv:2403.15882  [pdf, other]

    cs.CL

    VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding

    Authors: Phong Nguyen-Thuan Do, Son Quoc Tran, Phu Gia Hoang, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The success of Natural Language Understanding (NLU) benchmarks in various languages, such as GLUE for English, CLUE for Chinese, KLUE for Korean, and IndoNLU for Indonesian, has facilitated the evaluation of new NLU models across a wide range of tasks. To establish a standardized set of benchmarks for Vietnamese NLU, we introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchm…

    Submitted 23 March, 2024; originally announced March 2024.

    Comments: Accepted at NAACL 2024 (Findings)

  7. arXiv:2402.02655  [pdf, other]

    cs.CL

    VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension

    Authors: Thinh Phuoc Ngo, Khoa Tran Anh Dang, Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or te…

    Submitted 6 April, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

    Comments: To appear as the main conference paper at EACL 2024

  8. arXiv:2401.16403  [pdf, other]

    cs.CL

    ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text

    Authors: Thanh-Nhi Nguyen, Thanh-Phong Le, Kiet Van Nguyen

    Abstract: Lexical normalization, a fundamental task in Natural Language Processing (NLP), involves the transformation of words into their canonical forms. This process has been proven to benefit various downstream NLP tasks greatly. In this work, we introduce Vietnamese Lexical Normalization (ViLexNorm), the first-ever corpus developed for the Vietnamese lexical normalization task. The corpus comprises over…

    Submitted 31 January, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at the EACL 2024 Main Conference

  9. Automatic Textual Normalization for Hate Speech Detection

    Authors: Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Nguyet Thi Nguyen, Khanh Thanh-Duy Ho, Kiet Van Nguyen

    Abstract: Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for the Vietnamese language address this issue as a problem of lexical normalization, involving the creation of manual rules or the implementation of multi-staged deep learning frameworks,…

    Submitted 25 July, 2024; v1 submitted 12 November, 2023; originally announced November 2023.

    Comments: 2023 International Conference on Intelligent Systems Design and Applications (ISDA2023)

    Journal ref: Intelligent Systems Design and Applications. Lecture Notes in Networks and Systems, vol 1049 (ISDA 2023) 1-12

  10. arXiv:2310.18046  [pdf, other]

    cs.CL cs.CV

    ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese

    Authors: Khiem Vinh Tran, Hao Phu Phan, Kiet Van Nguyen, Ngan Luu Thuy Nguyen

    Abstract: In recent years, Visual Question Answering (VQA) has gained significant attention for its diverse applications, including intelligent car assistance, aiding visually impaired individuals, and document image information retrieval using natural language queries. VQA requires effective integration of information from questions and images to generate accurate answers. Neural models for VQA have made r…

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: A pre-print version and submitted to journal

  11. arXiv:2310.11166  [pdf, other]

    cs.CL

    ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

    Authors: Quoc-Nam Nguyen, Thang Chau Phan, Duc-Vu Nguyen, Kiet Van Nguyen

    Abstract: English and Chinese, known as resource-rich languages, have witnessed the strong development of transformer-based language models for natural language processing tasks. Although Vietnam has approximately 100M people speaking Vietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, performed well on general Vietnamese NLP tasks, including POS tagging and named entity recognition…

    Submitted 28 October, 2023; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: Accepted at EMNLP'2023 Main Conference

  12. arXiv:2309.14677  [pdf, other]

    cs.CR cs.AI

    XGV-BERT: Leveraging Contextualized Language Model and Graph Neural Network for Efficient Software Vulnerability Detection

    Authors: Vu Le Anh Quan, Chau Thuan Phat, Kiet Van Nguyen, Phan The Duy, Van-Hau Pham

    Abstract: With the advancement of deep learning (DL) in various fields, there have been many attempts to reveal software vulnerabilities with data-driven approaches. Nonetheless, such existing works lack an effective representation that can retain the non-sequential semantic characteristics and contextual relationship of source code attributes. Hence, in this work, we propose XGV-BERT, a framework that combines the…

    Submitted 26 September, 2023; originally announced September 2023.

  13. arXiv:2309.02902  [pdf, other]

    cs.CL

    ViCGCN: Graph Convolutional Network with Contextualized Language Models for Social Media Mining in Vietnamese

    Authors: Chau-Thang Phan, Quoc-Nam Nguyen, Chi-Thanh Dang, Trong-Hop Do, Kiet Van Nguyen

    Abstract: Social media processing is a fundamental task in natural language processing with numerous applications. As Vietnamese social media and information science have grown rapidly, the necessity of information-based mining on Vietnamese social media has become crucial. However, state-of-the-art research faces several significant drawbacks, including imbalanced data and noisy data on social media platfo…

    Submitted 6 September, 2023; originally announced September 2023.

  14. arXiv:2308.16469  [pdf, other]

    cs.CL

    Link Prediction for Wikipedia Articles as a Natural Language Inference Task

    Authors: Chau-Thang Phan, Quoc-Nam Nguyen, Kiet Van Nguyen

    Abstract: The link prediction task is vital to automatically understanding the structure of large knowledge bases. In this paper, we present our system to solve this task at the Data Science and Advanced Analytics 2023 Competition "Efficient and Effective Link Prediction" (DSAA-2023 Competition) with a corpus containing 948,233 training and 238,265 for public testing. This paper introduces an approach to link p…

    Submitted 5 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: Accepted at the 10th IEEE International Conference On Data Science And Advanced Analytics (DSAA 2023)

  15. arXiv:2307.15335  [pdf, other]

    cs.CL cs.CV

    BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering

    Authors: Khiem Vinh Tran, Kiet Van Nguyen, Ngan Luu Thuy Nguyen

    Abstract: Visual Question Answering (VQA) is an intricate and demanding task that integrates natural language processing (NLP) and computer vision (CV), capturing the interest of researchers. The English language, renowned for its wealth of resources, has witnessed notable advancements in both datasets and models designed for VQA. However, there is a lack of models that target specific countries such as Vie…

    Submitted 28 July, 2023; originally announced July 2023.

  16. arXiv:2307.08247  [pdf, other]

    cs.CL

    PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese

    Authors: Nghia Hieu Nguyen, Kiet Van Nguyen

    Abstract: We present in this paper a novel scheme for multimodal learning named the Parallel Attention mechanism. In addition, to take into account the advantages of grammar and context in Vietnamese, we propose the Hierarchical Linguistic Features Extractor instead of using an LSTM network to extract linguistic features. Based on these two novel modules, we introduce the Parallel Attention Transformer (PAT…

    Submitted 17 July, 2023; originally announced July 2023.

  17. OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

    Authors: Nghia Hieu Nguyen, Duong T. D. Vo, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In recent years, visual question answering (VQA) has attracted attention from the research community because of its high-potential applications (such as virtual assistance on intelligent cars, assistant devices for blind people, or information retrieval from document images using natural language as queries) and its challenges. The VQA task requires methods that have the ability to fuse the informati…

    Submitted 6 May, 2023; originally announced May 2023.

    Comments: submitted to Elsevier

  18. arXiv:2303.18162  [pdf, other]

    cs.CL

    A Multiple Choices Reading Comprehension Corpus for Vietnamese Language Education

    Authors: Son T. Luu, Khoi Trong Hoang, Tuong Quang Pham, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Machine reading comprehension has been an interesting and challenging task in recent years, with the purpose of extracting useful information from texts. To attain the computer ability to understand the reading text and answer relevant information, we introduce ViMMRC 2.0 - an extension of the previous ViMMRC for the task of multiple-choice reading comprehension in Vietnamese Textbooks which conta…

    Submitted 31 March, 2023; originally announced March 2023.

  19. arXiv:2303.13355  [pdf, other]

    cs.CL cs.AI

    Revealing Weaknesses of Vietnamese Language Models Through Unanswerable Questions in Machine Reading Comprehension

    Authors: Son Quoc Tran, Phong Nguyen-Thuan Do, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Although the curse of multilinguality significantly restricts the language abilities of multilingual models in monolingual settings, researchers now still have to rely on multilingual models to develop state-of-the-art systems in Vietnamese Machine Reading Comprehension. This difficulty in researching is because of the limited number of high-quality works in developing Vietnamese language models.…

    Submitted 16 March, 2023; originally announced March 2023.

    Comments: Accepted at The 2023 EACL Student Research Workshop

  20. EVJVQA Challenge: Multilingual Visual Question Answering

    Authors: Ngan Luu-Thuy Nguyen, Nghia Hieu Nguyen, Duong T. D Vo, Khanh Quoc Tran, Kiet Van Nguyen

    Abstract: Visual Question Answering (VQA) is a challenging task of natural language processing (NLP) and computer vision (CV), attracting significant attention from researchers. English is a resource-rich language that has witnessed various developments in datasets and models for visual question answering. Visual question answering in other languages also would be developed for resources and models. In addi…

    Submitted 17 April, 2024; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: VLSP2022 EVJVQA challenge

  21. arXiv:2301.10186  [pdf, other]

    cs.CL

    ViHOS: Hate Speech Spans Detection for Vietnamese

    Authors: Phu Gia Hoang, Canh Duc Luu, Khanh Quoc Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The rise in hateful and offensive language directed at other users is one of the adverse side effects of the increased use of social networking platforms. This could make it difficult for human moderators to review tagged comments filtered by classification systems. To help address this issue, we present the ViHOS (Vietnamese Hate and Offensive Spans) dataset, the first human-annotated corpus cont…

    Submitted 26 January, 2023; v1 submitted 24 January, 2023; originally announced January 2023.

    Comments: EACL 2023

  22. UIT-HWDB: Using Transferring Method to Construct A Novel Benchmark for Evaluating Unconstrained Handwriting Image Recognition in Vietnamese

    Authors: Nghia Hieu Nguyen, Duong T. D. Vo, Kiet Van Nguyen

    Abstract: Recognizing handwriting images is challenging due to the vast variation in writing style across many people and distinct linguistic aspects of writing languages. In Vietnamese, besides the modern Latin characters, there are accent and letter marks together with characters that draw confusion to state-of-the-art handwriting recognition methods. Moreover, as a low-resource language, there are not ma…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: Accepted for publishing at the 16th International Conference on Computing and Communication Technologies (RIVF)

  23. arXiv:2209.10482  [pdf, other]

    cs.CL

    SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese

    Authors: Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Text classification is a typical natural language processing or computational linguistics task with various interesting applications. As the number of users on social media platforms increases, data acceleration promotes emerging studies on Social Media Text Classification (SMTC) or social media text mining on these valuable resources. In contrast to English, Vietnamese, one of the low-resource la…

    Submitted 21 September, 2022; originally announced September 2022.

    Comments: Accepted at The 36th annual Meeting of Pacific Asia Conference on Language, Information and Computation (PACLIC 36)

  24. arXiv:2206.09600  [pdf, other]

    cs.CL

    SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers for Medical Texts

    Authors: Nhung Thi-Hong Nguyen, Phuong Phan-Dieu Ha, Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Question answering (QA) systems have gained explosive attention in recent years. However, QA tasks in Vietnamese do not have many datasets. Significantly, there is mostly no dataset in the medical domain. Therefore, we built a Vietnamese Healthcare Question Answering dataset (ViHealthQA), including 10,015 question-answer passage pairs for this task, in which questions from health-interested users…

    Submitted 20 June, 2022; originally announced June 2022.

  25. arXiv:2206.00524  [pdf, other]

    cs.CL cs.AI cs.LG

    Vietnamese Hate and Offensive Detection using PhoBERT-CNN and Social Media Streaming Data

    Authors: Khanh Q. Tran, An T. Nguyen, Phu Gia Hoang, Canh Duc Luu, Trong-Hop Do, Kiet Van Nguyen

    Abstract: Society needs to develop a system to detect hate and offense to build a healthy and safe environment. However, current research in this field still faces four major shortcomings, including deficient pre-processing techniques, indifference to data imbalance issues, modest performance models, and lacking practical applications. This paper focused on developing an intelligent system capable of addres…

    Submitted 1 June, 2022; originally announced June 2022.

  26. arXiv:2204.07002  [pdf, other]

    cs.CL

    XLMRQA: Open-Domain Question Answering on Vietnamese Wikipedia-based Textual Knowledge Source

    Authors: Kiet Van Nguyen, Phong Nguyen-Thuan Do, Nhat Duy Nguyen, Tin Van Huynh, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Question answering (QA) is a natural language understanding task within the fields of information retrieval and information extraction that has attracted much attention from the computational linguistics and artificial intelligence research community in recent years because of the strong development of machine reading comprehension-based models. A reader-based QA system is a high-level search engi…

    Submitted 13 August, 2022; v1 submitted 14 April, 2022; originally announced April 2022.

    Comments: Accepted by ACIIDS 2022

  27. VLSP 2021 - ViMRC Challenge: Vietnamese Machine Reading Comprehension

    Authors: Kiet Van Nguyen, Son Quoc Tran, Luan Thanh Nguyen, Tin Van Huynh, Son T. Luu, Ngan Luu-Thuy Nguyen

    Abstract: One of the emerging research trends in natural language understanding is machine reading comprehension (MRC) which is the task to find answers to human questions based on textual data. Existing Vietnamese datasets for MRC research concentrate solely on answerable questions. However, in reality, questions can be unanswerable for which the correct answer is not stated in the given textual data. To a…

    Submitted 4 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: The 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021)

  28. arXiv:2112.09488  [pdf, other]

    cs.CL

    Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage Span Labeling

    Authors: Duc-Vu Nguyen, Linh-Bao Vo, Ngoc-Linh Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Chinese word segmentation and part-of-speech tagging are necessary tasks in terms of computational linguistics and application of natural language processing. Many researchers still debate the demand for Chinese word segmentation and part-of-speech tagging in the deep learning era. Nevertheless, resolving ambiguities and detecting unknown words are challenging problems in this field. Previous stu…

    Submitted 17 December, 2021; originally announced December 2021.

    Comments: In Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation (PACLIC 2021)

  29. arXiv:2110.07833  [pdf, other]

    cs.CL

    Span Detection for Aspect-Based Sentiment Analysis in Vietnamese

    Authors: Kim Thi-Thanh Nguyen, Sieu Khai Huynh, Luong Luc Phan, Phuc Huynh Pham, Duc-Vu Nguyen, Kiet Van Nguyen

    Abstract: Aspect-based sentiment analysis plays an essential role in natural language processing and artificial intelligence. Recently, researchers have focused only on aspect detection and sentiment classification while ignoring the sub-task of detecting user opinion spans, which has enormous potential in practical applications. In this paper, we present a new Vietnamese dataset (UIT-ViSD4SA) consisting of 35,396…

    Submitted 14 October, 2021; originally announced October 2021.

  30. arXiv:2108.13741  [pdf, other]

    cs.CL cs.AI

    Monolingual versus Multilingual BERTology for Vietnamese Extractive Multi-Document Summarization

    Authors: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

    Abstract: Recent research has demonstrated that BERT shows potential in a wide range of natural language processing tasks. It is adopted as an encoder for many state-of-the-art automatic summarizing systems, which achieve excellent performance. However, so far, there is not much work done for Vietnamese. In this paper, we showcase how BERT can be implemented for extractive text summarization in Vietnames…

    Submitted 16 October, 2021; v1 submitted 31 August, 2021; originally announced August 2021.

  31. arXiv:2108.02929  [pdf]

    cs.CV

    VinaFood21: A Novel Dataset for Evaluating Vietnamese Food Recognition

    Authors: Thuan Trong Nguyen, Thuan Q. Nguyen, Dung Vo, Vi Nguyen, Ngoc Ho, Nguyen D. Vo, Kiet Van Nguyen, Khang Nguyen

    Abstract: Vietnam is such an attractive tourist destination with its stunning and pristine landscapes and its top-rated unique food and drink. Among thousands of Vietnamese dishes, foreigners and native people are interested in easy-to-eat tastes and easy-to-do recipes, along with reasonable prices, mouthwatering flavors, and popularity. Due to the diversity and almost all the dishes have significant simila…

    Submitted 5 August, 2021; originally announced August 2021.

  32. arXiv:2105.15079  [pdf, other]

    cs.CL

    SA2SL: From Aspect-Based Sentiment Analysis to Social Listening System for Business Intelligence

    Authors: Luong Luc Phan, Phuc Huynh Pham, Kim Thi-Thanh Nguyen, Tham Thi Nguyen, Sieu Khai Huynh, Luan Thanh Nguyen, Tin Van Huynh, Kiet Van Nguyen

    Abstract: In this paper, we present a process of building a social listening system based on aspect-based sentiment analysis in Vietnamese from creating a dataset to building a real application. Firstly, we create UIT-ViSFD, a Vietnamese Smartphone Feedback Dataset as a new benchmark corpus built based on a strict annotation scheme for evaluating aspect-based sentiment analysis, consisting of 11,122 human-…

    Submitted 10 June, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

  33. arXiv:2105.09043  [pdf, other]

    cs.CL

    Sentence Extraction-Based Machine Reading Comprehension for Vietnamese

    Authors: Phong Nguyen-Thuan Do, Nhat Duy Nguyen, Tin Van Huynh, Kiet Van Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The development of natural language processing (NLP) in general and machine reading comprehension in particular has attracted the great attention of the research community. In recent years, there are a few datasets for machine reading comprehension tasks in Vietnamese with large sizes, such as UIT-ViQuAD and UIT-ViNewsQA. However, the datasets are not diverse in answers to serve the research. In t…

    Submitted 11 June, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

    Comments: Accepted by KSEM 2021 (International Conference on Knowledge Science, Engineering and Management)

  34. Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts

    Authors: Son T. Luu, Mao Nguyen Bui, Loi Duc Nguyen, Khiem Vinh Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Machine reading comprehension (MRC) is a sub-field in natural language processing that aims to help computers understand unstructured texts and then answer questions related to them. In practice, the conversation is an essential way to communicate and transfer information. To help machines understand conversation texts, we present UIT-ViCoQA, a new corpus for conversational machine reading compr…

    Submitted 30 September, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: Published at The 13th International Conference on Computational Collective Intelligence (ICCCI 2021)

  35. arXiv:2104.11969  [pdf, ps, other]

    cs.CL

    Vietnamese Complaint Detection on E-Commerce Websites

    Authors: Nhung Thi-Hong Nguyen, Phuong Phan-Dieu Ha, Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Customer product reviews play a role in improving the quality of products and services for business organizations or their brands. Complaining is an attitude that expresses dissatisfaction with an event or a product not meeting customer expectations. In this paper, we build an Open-domain Complaint Detection dataset (UIT-ViOCD), including 5,485 human-annotated reviews on four categories about produ…

    Submitted 5 July, 2021; v1 submitted 24 April, 2021; originally announced April 2021.

  36. arXiv:2104.07376  [pdf, other]

    cs.CL

    UIT-E10dot3 at SemEval-2021 Task 5: Toxic Spans Detection with Named Entity Recognition and Question-Answering Approaches

    Authors: Phu Gia Hoang, Luan Thanh Nguyen, Kiet Van Nguyen

    Abstract: The increase in toxic comments in online spaces is having tremendous effects on vulnerable users. For this reason, considerable efforts are made to deal with this, and SemEval-2021 Task 5: Toxic Spans Detection is one of those. This task asks competitors to extract spans that have toxicity from the given texts, and we have done several analyses to understand its structure before doing exper…

    Submitted 15 April, 2021; originally announced April 2021.

    Comments: Accepted at SemEval-2021 Task 5: Toxic Spans Detection, ACL-IJCNLP 2021

  37. A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts

    Authors: Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In recent years, Vietnam has witnessed the mass development of social network users on different social platforms such as Facebook, YouTube, Instagram, and TikTok. On social media, hate speech has become a critical problem for social network users. To solve this problem, we introduce ViHSD - a human-annotated dataset for automatically detecting hate speech on the social network. This dataset cont…

    Submitted 20 July, 2021; v1 submitted 21 March, 2021; originally announced March 2021.

    Comments: IEA/AIE 2021: Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices, pp 415-426

  38. Constructive and Toxic Speech Detection for Open-domain Social Media Comments in Vietnamese

    Authors: Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The rise of social media has led to an increase in comments on online forums. However, there still exist invalid comments which are not informative for users. Moreover, those comments are also quite toxic and harmful to people. In this paper, we create a dataset for constructive and toxic speech detection, named UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection dataset) with 10,00…

    Submitted 6 September, 2021; v1 submitted 18 March, 2021; originally announced March 2021.

    Comments: IEA/AIE 2021: Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices pp 572-583

  39. arXiv:2102.12136   

    cs.CL

    Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese

    Authors: Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Word segmentation and part-of-speech tagging are two critical preliminary steps for downstream tasks in Vietnamese natural language processing. In reality, people tend to consider also the phrase boundary when performing word segmentation and part of speech tagging rather than solely process word by word from left to right. In this paper, we implement this idea to improve word segmentation and par…

    Submitted 16 June, 2021; v1 submitted 24 February, 2021; originally announced February 2021.

    Comments: The comparison with existing methods in this paper is unfair because the hyper-parameters of Bi-LSTM are different compared with previous research. Importantly, there is a data leakage issue w.r.t this paper's experimental setup

  40. arXiv:2102.10794  [pdf, other]

    cs.CL

    ReINTEL Challenge 2020: Exploiting Transfer Learning Models for Reliable Intelligence Identification on Vietnamese Social Network Sites

    Authors: Kim Thi-Thanh Nguyen, Kiet Van Nguyen

    Abstract: This paper presents the system that we propose for the Reliable Intelligence Identification on Vietnamese Social Network Sites (ReINTEL) task of the Vietnamese Language and Speech Processing 2020 (VLSP 2020) Shared Task. In this task, VLSP 2020 provides a dataset with approximately 6,000 training news/posts annotated with reliable or unreliable labels, and a test set consisting of 2,000 exampl…

    Submitted 23 February, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

    Comments: 4 pages, ReINTEL Task at VLSP 2020 Workshop

  41. Gender Prediction Based on Vietnamese Names with Machine Learning Techniques

    Authors: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

    Abstract: As biological gender is one of the aspects of presenting an individual human, much work has been done on gender classification based on people's names. The proposals for English and Chinese languages are tremendous; still, there have been few works done for Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotate…

    Submitted 23 March, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: 6 pages, 6 figures. NLPIR 2020: 4th International Conference on Natural Language Processing and Information Retrieval

  42. arXiv:2010.09623  [pdf, other]

    cs.CL

    An Empirical Study for Vietnamese Constituency Parsing with Pre-training

    Authors: Tuan-Vi Tran, Xuan-Thien Pham, Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In this work, we use a span-based approach for Vietnamese constituency parsing. Our method follows the self-attention encoder architecture and a chart decoder using a CKY-style inference algorithm. We analyze experimental results comparing our method with the pre-trained models XLM-Roberta and PhoBERT on two Vietnamese datasets, VietTreebank and NIIVTB1. The result…

    Submitted 19 October, 2020; v1 submitted 19 October, 2020; originally announced October 2020.

  43. arXiv:2009.14725  [pdf, other]

    cs.CL

    A Vietnamese Dataset for Evaluating Machine Reading Comprehension

    Authors: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Over 97 million people speak Vietnamese as their native language in the world. However, there are few research studies on machine reading comprehension (MRC) for Vietnamese, the task of understanding a text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for the low-resourc…

    Submitted 7 November, 2020; v1 submitted 30 September, 2020; originally announced September 2020.

    Comments: Accepted by The 28th International Conference on Computational Linguistics (COLING 2020)

  44. arXiv:2009.13060  [pdf, other]

    cs.CL

    A Simple and Efficient Ensemble Classifier Combining Multiple Neural Network Models on Social Media Datasets in Vietnamese

    Authors: Huy Duc Huynh, Hang Thi-Thuy Do, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Text classification is a popular topic in natural language processing that has attracted numerous research efforts worldwide. The significant increase of data on social media demands considerable attention from researchers to analyze such data. There are various studies in this field for many languages, but few for the Vietnamese language. Therefore, this study aims to classify Vietnamese…

    Submitted 28 September, 2020; v1 submitted 28 September, 2020; originally announced September 2020.

    Comments: Accepted by The 34th Pacific Asia Conference on Language, Information and Computation (PACLIC2020)

  45. arXiv:2009.12319  [pdf, other]

    cs.CL

    Empirical Study of Text Augmentation on Social Media Text in Vietnamese

    Authors: Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In the text classification problem, the imbalance of labels in datasets affects the performance of text-classification models. In practice, user comments on social networking sites are not all visible: administrators often allow only positive comments and hide negative comments. Thus, when collecting the data about user comments on the social network, the data is usually…

    Submitted 9 October, 2020; v1 submitted 25 September, 2020; originally announced September 2020.

    Comments: Accepted by The 34th Pacific Asia Conference on Language, Information and Computation

  46. arXiv:2009.11005  [pdf, other]

    cs.CL

    Exploiting Vietnamese Social Media Characteristics for Textual Emotion Recognition in Vietnamese

    Authors: Khang Phuoc-Quy Nguyen, Kiet Van Nguyen

    Abstract: Textual emotion recognition has been a promising research topic in recent years. Many researchers aim to build more accurate and robust emotion detection systems. In this paper, we conduct several experiments to indicate how data pre-processing affects a machine learning method on textual emotion recognition. These experiments are performed on the Vietnamese Social Media Emotion Corpus (UIT-VSMEC)…

    Submitted 27 October, 2020; v1 submitted 23 September, 2020; originally announced September 2020.

    Comments: 6 pages, 9 tables, 2 figures of table, conference

  47. arXiv:2009.02935  [pdf, other]

    cs.CL

    UIT-HSE at WNUT-2020 Task 2: Exploiting CT-BERT for Identifying COVID-19 Information on the Twitter Social Network

    Authors: Khiem Vinh Tran, Hao Phu Phan, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Recently, COVID-19 has affected a variety of real-life aspects of the world and led to dreadful consequences. More and more tweets about COVID-19 have been shared publicly on Twitter. However, the plurality of those tweets are uninformative, which makes it challenging to build automatic systems to detect the informative ones for useful AI applications. In this paper, we present our results at the W-NUT 2…

    Submitted 13 November, 2020; v1 submitted 7 September, 2020; originally announced September 2020.

    Comments: Accepted by 2020 The 6th Workshop on Noisy User-generated Text (W-NUT) - EMNLP 2020

    Journal ref: https://www.aclweb.org/anthology/2020.wnut-1.53/

  48. An Experimental Study of Deep Neural Network Models for Vietnamese Multiple-Choice Reading Comprehension

    Authors: Son T. Luu, Kiet Van Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Machine reading comprehension (MRC) is a challenging task in natural language processing that requires computers to understand natural language texts and answer questions based on those texts. There are many techniques for solving this problem, and word representation is a very important technique that most affects the accuracy of machine reading comprehension in popular languages like…

    Submitted 18 February, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

    Comments: Published in the 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE)

  49. arXiv:2006.11138  [pdf, other]

    cs.CL

    New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

    Authors: Kiet Van Nguyen, Tin Van Huynh, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Large-scale and high-quality corpora are necessary for evaluating machine reading comprehension models on a low-resource language like Vietnamese. Besides, machine reading comprehension (MRC) for the health domain offers great potential for practical applications; however, there is still very little MRC research in this domain. This paper presents ViNewsQA as a new corpus for the Vietnamese langua…

    Submitted 11 February, 2021; v1 submitted 19 June, 2020; originally announced June 2020.

  50. arXiv:2006.07804  [pdf, other]

    cs.CL

    Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix Capture

    Authors: Duc-Vu Nguyen, Dang Van Thin, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In this paper, we approach Vietnamese word segmentation as a binary classification problem by using the Support Vector Machine classifier. We inherit features from prior works such as n-grams of syllables, n-grams of syllable types, and checking conjunction of adjacent syllables in the dictionary. We propose two novel feature extraction methods, one to reduce the overlap ambiguity and the other to increase…

    Submitted 14 June, 2020; originally announced June 2020.

    Comments: In Proceedings of the 16th International Conference of the Pacific Association for Computational Linguistics (PACLING 2019)