Search | arXiv e-print repository

arXiv:2401.16638 [pdf, other]

Breaking Free Transformer Models: Task-specific Context Attribution Promises Improved Generalizability Without Fine-tuning Pre-trained LLMs

Authors: Stepan Tytarenko, Mohammad Ruhul Amin

Abstract: Fine-tuning large pre-trained language models (LLMs) on particular datasets is a commonly employed strategy in Natural Language Processing (NLP) classification tasks. However, this approach usually results in a loss of model generalizability. In this paper, we present a framework that allows for maintaining generalizability, and enhances the performance on the downstream task by utilizing task-specific context attribution. We show that a linear transformation of the text representation from any transformer model using the task-specific concept operator results in a projection onto the latent concept space, referred to as context attribution in this paper. The specific concept operator is optimized during the supervised learning stage via novel loss functions. The proposed framework demonstrates that context attribution of the text representation for each task objective can improve the capacity of the discriminator function and thus achieve better performance for the classification task. Experimental results on three datasets, namely HateXplain, IMDB reviews, and Social Media Attributions, illustrate that the proposed model attains superior accuracy and generalizability. Specifically, for the non-fine-tuned BERT on the HateXplain dataset, we observe an 8% improvement in accuracy and a 10% improvement in F1-score. For the IMDB dataset, fine-tuned state-of-the-art XLNet is outperformed by 1% for both accuracy and F1-score. Furthermore, in an out-of-domain cross-dataset test, DistilBERT fine-tuned on the IMDB dataset in conjunction with the proposed model improves the F1-score on the HateXplain dataset by 7%. For the Social Media Attributions dataset of YouTube comments, we observe a 5.2% increase in F1-metric. The proposed framework is implemented with PyTorch and provided open-source on GitHub.

Submitted 29 January, 2024; originally announced January 2024.

Comments: 8 pages, 3 figures, 5 tables, To be published in 2024 AAAI workshop on Responsible Language Models (ReLM)

ACM Class: I.2.7; I.2.4

arXiv:2311.03078 [pdf]

BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer

Authors: Sadia Afrin, Md. Shahad Mahmud Chowdhury, Md. Ekramul Islam, Faisal Ahamed Khan, Labib Imam Chowdhury, MD. Motahar Mahtab, Nazifa Nuha Chowdhury, Massud Forkan, Neelima Kundu, Hakim Arif, Mohammad Mamun Or Rashid, Mohammad Ruhul Amin, Nabeel Mohammed

Abstract: Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning. However, due to the highly inflected nature and morphological richness of the language, lemmatization of Bangla text poses a complex challenge. In this study, we propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer specifically for Bangla. Our system aims to lemmatize words based on their part-of-speech class within a given sentence. Unlike previous rule-based approaches, we analyzed the suffix marker occurrence according to the morpho-syntactic values and then utilized sequences of suffix markers instead of entire suffixes. To develop our rules, we analyze a large corpus of Bangla text from various domains, sources, and time periods to observe the word formation of inflected words. The lemmatizer achieves an accuracy of 96.36% when tested against a test dataset manually annotated by trained linguists and demonstrates competitive performance on three previously published Bangla lemmatization datasets. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanLemma in order to contribute to the further advancement of Bangla NLP.

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2306.06147 [pdf]

doi 10.1145/3580305.3599904

SentiGOLD: A Large Bangla Gold Standard Multi-Domain Sentiment Analysis Dataset and its Evaluation

Authors: Md. Ekramul Islam, Labib Chowdhury, Faisal Ahamed Khan, Shazzad Hossain, Sourave Hossain, Mohammad Mamun Or Rashid, Nabeel Mohammed, Mohammad Ruhul Amin

Abstract: This study introduces SentiGOLD, a Bangla multi-domain sentiment analysis dataset. Comprising 70,000 samples, it was created from diverse sources and annotated by a gender-balanced team of linguists. SentiGOLD adheres to established linguistic conventions agreed upon by the Government of Bangladesh and a Bangla linguistics committee. Unlike English and other languages, Bangla lacks standard sentiment analysis datasets due to the absence of a national linguistics framework. The dataset incorporates data from online video comments, social media posts, blogs, news, and other sources while maintaining domain and class distribution rigorously. It spans 30 domains (e.g., politics, entertainment, sports) and includes 5 sentiment classes (strongly negative, weakly negative, neutral, and strongly positive). The annotation scheme, approved by the national linguistics committee, ensures a robust Inter Annotator Agreement (IAA) with a Fleiss' kappa score of 0.88. Intra- and cross-dataset evaluation protocols are applied to establish a standard classification system. Cross-dataset evaluation on the noisy SentNoB dataset presents a challenging test scenario. Additionally, zero-shot experiments demonstrate the generalizability of SentiGOLD. The top model achieves a macro F1 score of 0.62 (intra-dataset) across 5 classes, setting a benchmark, and 0.61 (cross-dataset from SentNoB) across 3 classes, comparable to the state-of-the-art. The fine-tuned sentiment analysis model can be accessed at https://sentiment.bangla.gov.bd.

Submitted 9 June, 2023; originally announced June 2023.

Comments: Accepted in KDD 2023 Applied Data Science Track; 12 pages, 14 figures

arXiv:2305.10698 [pdf]

Ranking the locations and predicting future crime occurrence by retrieving news from different Bangla online newspapers

Authors: Jumman Hossain, Rajib Chandra Das, Md. Ruhul Amin, Md. Saiful Islam

Abstract: Thousands of crimes happen every day, but statistics are kept for only a few of them; as a result, crime rates are increasing day by day. The reason may be a lack of concern or a lack of statistics on previous crimes. Statistics on previous crimes are important for the general public when deciding where to go, for the police when catching criminals and taking steps to restrain crime, and for tourists when making travel decisions. The National Institute of Justice releases crime survey data for the country, but does not offer crime statistics down to the Union or Thana level. Considering all of these cases, we have come up with an approach that gives people an approximation of the safety of a specific location, with a crime ranking of different areas, the crimes located on a map, and a mechanism for predicting future crime occurrence. Our approach relies on crawling crime data from different online Bangla newspapers, stemming and keyword extraction, a location-finding algorithm, cosine similarity, a naive Bayes classifier, and a custom crime prediction model.

Submitted 18 May, 2023; originally announced May 2023.

Comments: 9 pages

arXiv:2210.07286 [pdf, other]

Augmenting Online Classes with an Attention Tracking Tool May Improve Student Engagement

Authors: Arnab Sen Sharma, Mohammad Ruhul Amin, Muztaba Fuad

Abstract: Online remote learning has certain advantages, such as higher flexibility and greater inclusiveness. However, a caveat is the teachers' limited ability to monitor student interaction during an online class, especially while teachers are sharing their screens. We have taken feedback from 12 teachers experienced in teaching undergraduate-level online classes on the necessity of an attention tracking tool to understand student engagement during an online class. This paper outlines the design of such a monitoring tool that automatically tracks the attentiveness of the whole class by tracking students' gazes on the screen and alerts the teacher when the attention score goes below a certain threshold. We assume the benefits are twofold: 1) teachers will be able to ascertain whether the students are attentive or engaged with the lecture contents and 2) the students will become more attentive in online classes because of this passive monitoring system. In this paper, we present the preliminary design and feasibility of using the proposed tool and discuss its applicability in augmenting online classes. Finally, we surveyed 31 students asking their opinion on the usability as well as the ethical and privacy concerns of using such a monitoring tool.

Submitted 13 October, 2022; originally announced October 2022.

Comments: 18 pages, 10 figures

arXiv:2206.00372 [pdf]

BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts

Authors: Nauros Romim, Mosahed Ahmed, Md. Saiful Islam, Arnab Sen Sharma, Hriteshwar Talukder, Mohammad Ruhul Amin

Abstract: Social media platforms and online streaming services have spawned a new breed of Hate Speech (HS). Due to the massive amount of user-generated content on these sites, modern machine learning techniques are found to be feasible and cost-effective to tackle this problem. However, linguistically diverse datasets covering different social contexts in which offensive language is typically used are required to train generalizable models. In this paper, we identify the shortcomings of existing Bangla HS datasets and introduce a large manually labeled dataset BD-SHS that includes HS in different social contexts. The labeling criteria were prepared following a hierarchical annotation process, which is the first of its kind in Bangla HS to the best of our knowledge. The dataset includes more than 50,200 offensive comments crawled from online social networking sites and is at least 60% larger than any existing Bangla HS dataset. We present the benchmark result of our dataset by training different NLP models, with the best one achieving an F1-score of 91.0%. In our experiments, we found that a word embedding trained exclusively using 1.47 million comments from social media and streaming sites consistently resulted in better modeling of HS detection in comparison to other pre-trained embeddings. Our dataset and all accompanying code are publicly available at github.com/naurosromim/hate-speech-dataset-for-Bengali-social-media

Submitted 1 June, 2022; originally announced June 2022.

arXiv:2112.04298 [pdf, other]

GCA-Net : Utilizing Gated Context Attention for Improving Image Forgery Localization and Detection

Authors: Sowmen Das, Md. Saiful Islam, Md. Ruhul Amin

Abstract: Forensic analysis of manipulated pixels requires the identification of various hidden and subtle features from images. Conventional image recognition models generally fail at this task because they are biased and more attentive toward the dominant local and spatial features. In this paper, we propose a novel Gated Context Attention Network (GCA-Net) that utilizes non-local attention in conjunction with a gating mechanism in order to capture the finer image discrepancies and better identify forged regions. The proposed framework uses high dimensional embeddings to filter and aggregate the relevant context from coarse feature maps at various stages of the decoding process. This improves the network's understanding of global differences and reduces false-positive localizations. Our evaluation on standard image forensic benchmarks shows that GCA-Net can both compete against and improve over state-of-the-art networks by an average of 4.7% AUC. Additional ablation studies also demonstrate the method's robustness against attributions and resilience to false-positive predictions.

Submitted 7 April, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

Comments: Accepted for publication at the CVPR 2022 Media Forensics Workshop

arXiv:2112.01902 [pdf, other]

HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech Detection in Bangla

Authors: Nauros Romim, Mosahed Ahmed, Md Saiful Islam, Arnab Sen Sharma, Hriteshwar Talukder, Mohammad Ruhul Amin

Abstract: In this paper, we present HS-BAN, a binary class hate speech (HS) dataset in the Bangla language consisting of more than 50,000 labeled comments, of which 40.17% are hate speech and the rest are non-hate speech. While preparing the dataset, a strict and detailed annotation guideline was followed to reduce human annotation bias. The HS dataset was also preprocessed linguistically to extract different types of slang that people currently write using symbols, acronyms, or alternative spellings. These slang words were further categorized into traditional and non-traditional slang lists and included in the results of this paper. We explored traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection for the Bangla language. Our experimental results show that existing word embedding models trained with informal texts perform better than those trained with formal text. Our benchmark shows that a Bi-LSTM model on top of the FastText informal word embedding achieved an 86.78% F1-score. We will make the dataset available for public use.

Submitted 3 December, 2021; originally announced December 2021.

Comments: Submitted to ICON 21 (Rejected)

arXiv:2110.05906 [pdf, other]

Energy-cost aware off-grid base stations with IoT devices for developing a green heterogeneous network

Authors: Khondoker Ziaul Islam, MD. Sanwar Hossain, B. M. Ruhul Amin, Ferdous Sohel

Abstract: Heterogeneous network (HetNet) is a specified cellular platform to tackle the rapidly growing anticipated data traffic. From a communications perspective, data loads can be mapped to energy loads that are generally placed on the operator networks. Meanwhile, renewable energy aided networks offer to curtail fossil fuel consumption and so reduce environmental pollution. This paper proposes a renewable energy based power supply architecture for off-grid HetNet using a novel energy sharing model. Solar photovoltaic (PV) along with sufficient energy storage devices are used for each macro, micro, pico, or femto base station (BS). Additionally, a biomass generator (BG) is used for macro and micro BSs. The collocated macro and micro BSs are connected through end-to-end resistive lines. A novel weighted proportional-fair resource-scheduling algorithm with sleep mechanisms is proposed for non-real time (NRT) applications by trading off the power consumption and communication delays. Furthermore, the proposed algorithm with extended discontinuous reception (eDRX) and power saving mode (PSM) for narrowband internet of things (IoT) applications extends battery lifetime for IoT devices. HOMER optimization software is used to perform optimal system architecture, economic, and carbon footprint analyses, while a Monte-Carlo simulation tool is used for evaluating the throughput and energy efficiency performances. The proposed algorithms are valid for the practical data of the rural areas. We demonstrate that the proposed power supply architecture is energy-efficient, cost-effective, reliable, and eco-friendly.

Submitted 12 October, 2021; originally announced October 2021.

arXiv:2107.14095 [pdf, other]

Exploring the Scope and Potential of Local Newspaper-based Dengue Surveillance in Bangladesh

Authors: Nazia Tasnim, Md. Istiak Hossain Shihab, Moqsadur Rahman, Sheikh Rabiul Islam, Mohammad Ruhul Amin

Abstract: Dengue fever has been considered to be one of the global public health problems of the twenty-first century, especially in tropical and subtropical countries of the global south. The high morbidity and mortality rates of Dengue fever impose a huge economic and health burden on middle and low-income countries. It is so prevalent in such regions that enforcing a granular level of surveillance is quite impossible. Therefore, it is crucial to explore an alternative cost-effective solution that can provide updates of the ongoing situation in a timely manner. In this paper, we explore the scope and potential of a local newspaper-based dengue surveillance system, using well-known data-mining techniques, in Bangladesh from the analysis of the news contents written in the native language. In addition, we explain the working procedure of developing a novel database, using a human-in-the-loop technique, for further analysis and classification of dengue and its intervention-related news. Our classification method has an f-score of 91.45%, and matches the ground truth of reported cases quite closely. Based on the dengue and intervention-related news, we identified the regions where more intervention efforts are needed to reduce the rate of dengue infection. A demo of this project can be accessed at: http://erdos.dsm.fordham.edu:3009/

Submitted 7 July, 2021; originally announced July 2021.

Comments: 5 Pages, Joint KDD 2021 Health Day and 2021 KDD Workshop on Applied Data Science for Healthcare

arXiv:2102.09603 [pdf, other]

Towards Solving the DeepFake Problem : An Analysis on Improving DeepFake Detection using Dynamic Face Augmentation

Authors: Sowmen Das, Selim Seferbekov, Arup Datta, Md. Saiful Islam, Md. Ruhul Amin

Abstract: The creation of altered and manipulated faces has become more common due to the improvement of DeepFake generation methods. Simultaneously, we have seen the development of detection models for differentiating between a manipulated and an original face in image or video content. In this paper, we focus on identifying the limitations and shortcomings of existing deepfake detection frameworks. We identified some key problems surrounding deepfake detection through quantitative and qualitative analysis of existing methods and datasets. We found that deepfake datasets are highly oversampled, causing models to become easily overfitted. The datasets are created using a small set of real faces to generate multiple fake samples. When trained on these datasets, models tend to memorize the actors' faces and labels instead of learning fake features. To mitigate this problem, we propose a simple data augmentation method termed Face-Cutout. Our method dynamically cuts out regions of an image using the face landmark information. It helps the model selectively attend to only the relevant regions of the input. Our evaluation experiments show that Face-Cutout can successfully improve the data variation and alleviate the problem of overfitting. Our method achieves a reduction in LogLoss of 15.2% to 35.3% on different datasets, compared to other occlusion-based techniques. Moreover, we also propose a general-purpose data pre-processing guideline to train and evaluate existing architectures, allowing us to improve the generalizability of these models for deepfake detection.

Submitted 25 August, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

arXiv:2012.07538 [pdf, other]

Sentiment analysis in Bengali via transfer learning using multi-lingual BERT

Authors: Khondoker Ittehadul Islam, Md. Saiful Islam, Md Ruhul Amin

Abstract: Sentiment analysis (SA) in Bengali is challenging due to this Indo-Aryan language's highly inflected properties, with more than 160 different inflected forms for verbs, 36 different forms for nouns, and 24 different forms for pronouns. The lack of standard labeled datasets in the Bengali domain makes the task of SA even harder. In this paper, we present manually tagged 2-class and 3-class SA datasets in Bengali. We also demonstrate that the multi-lingual BERT model with relevant extensions can be trained via the approach of transfer learning over those novel datasets to improve the state-of-the-art performance in sentiment classification tasks. This deep learning model achieves an accuracy of 71% for 2-class sentiment classification compared to the current state-of-the-art accuracy of 68%. We also present the very first Bengali SA classifier for the 3-class manually tagged dataset, and our proposed model achieves an accuracy of 60%. We further use this model to analyze the sentiment of public comments in the online daily newspaper. Our analysis shows that people post negative comments for political or sports news more often, while the religious article comments represent positive sentiment. The dataset and code are publicly available at https://github.com/KhondokerIslam/Bengali_Sentiment.

Submitted 3 December, 2020; originally announced December 2020.

Comments: 5 pages

arXiv:1610.00369 [pdf]

Sentiment Analysis on Bangla and Romanized Bangla Text (BRBT) using Deep Recurrent models

Authors: A. Hassan, M. R. Amin, N. Mohammed, A. K. A. Azad

Abstract: Sentiment Analysis (SA) is an active research area in the digital age. With the rapid and constant growth of online social media sites and services, and the increasing amount of textual data such as statuses, comments, and reviews available on them, the application of automatic SA is on the rise. However, most research on SA in natural language processing (NLP) is based on the English language. Despite being the sixth most widely spoken language in the world, Bangla still does not have a large and standard dataset. Because of this, recent research in Bangla has failed to produce results that are both comparable to works done by others and reusable as stepping stones for future researchers to progress in this field. Therefore, we first tried to provide a textual dataset that includes not just Bangla but also Romanized Bangla texts, and that is substantial, post-processed, multiply validated, and ready to be used in SA experiments. We tested this dataset in a Deep Recurrent model, specifically Long Short Term Memory (LSTM), using two types of loss functions, binary crossentropy and categorical crossentropy, and also did some experimental pre-training by using data from one validation to pre-train the other and vice versa. Lastly, we documented the results along with some analysis of them, which were promising.

Submitted 23 November, 2016; v1 submitted 2 October, 2016; originally announced October 2016.

arXiv:1401.6082 [pdf]

Performance Evaluation of Two-Hop Wireless Link under Nakagami-m Fading

Authors: Afsana Nadia, Arifur Rahim Chowdhury, Md. Shoayeb Hossain, Md. Imdadul Islam, M. R. Amin

Abstract: Nowadays, intense research is under way on two-hop wireless links under different fading conditions and their remedial measures. In this paper, a two-hop link under three different conditions is considered: (i) MIMO on both hops, (ii) MISO in the first hop and SIMO in the second hop, and finally (iii) SIMO in the first hop and MISO in the second hop. The three models used here give the flexibility of using STBC (Space Time Block Coding) and a combining scheme on either the source-to-relay (S-R) or the relay-to-destination (R-D) link. Even incorporation of Transmitting Antenna Selection (TAS) is possible on any link. Here, the variation of SER (Symbol Error Rate) is determined against the mean SNR (Signal-to-Noise Ratio) of the R-D link for three different modulation schemes: BPSK, 8-PSK and 16-PSK, taking the number of antennas and the SNR of the S-R link as parameters under the Nakagami-m fading condition.

Submitted 21 December, 2013; originally announced January 2014.

Journal ref: IJACSA, Vol. 4, No. 7, July 2013

Showing 1–14 of 14 results for author: Amin, M R