Showing 1–13 of 13 results for author: Ptaszynski, M

  1. Cyberbullying Detection for Low-resource Languages and Dialects: Review of the State of the Art

    Authors: Tanjim Mahmud, Michal Ptaszynski, Juuso Eronen, Fumito Masui

    Abstract: The struggle of social media platforms to moderate content in a timely manner encourages users to abuse such platforms to spread vulgar or abusive language, which, when performed repeatedly, becomes cyberbullying: a social problem taking place in virtual environments, yet with real-world consequences, such as depression, withdrawal, or even suicide attempts of its victims. Systems for the automatic…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: 52 pages

    Report number: Volume 60, Issue 5, September 2023, 103454

    Journal ref: Information Processing & Management 2023

  2. Vulgar Remarks Detection in Chittagonian Dialect of Bangla

    Authors: Tanjim Mahmud, Michal Ptaszynski, Fumito Masui

    Abstract: The negative effects of online bullying and harassment are increasing with Internet popularity, especially in social media. One solution is using natural language processing (NLP) and machine learning (ML) methods for the automatic detection of harmful remarks, but these methods are limited in low-resource languages like the Chittagonian dialect of Bangla. This study focuses on detecting vulgar rem…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: 5 pages

    Report number: 978-83-232-4177-5 (PDF)

    Journal ref: Human Language Technologies as a Challenge for Computer Science and Linguistics, Zygmunt Vetulani, Patrick Paroubek (eds.), 2023

  3. arXiv:2306.00660  [pdf]

    cs.CL

    Improving Polish to English Neural Machine Translation with Transfer Learning: Effects of Data Volume and Language Similarity

    Authors: Juuso Eronen, Michal Ptaszynski, Karol Nowakowski, Zheng Lin Chia, Fumito Masui

    Abstract: This paper investigates the impact of data volume and the use of similar languages on transfer learning in a machine translation task. We find that having more data generally leads to better performance, as it allows the model to learn more patterns and generalizations from the data. However, related languages can also be particularly effective when there is limited data available for a specif…

    Submitted 1 June, 2023; originally announced June 2023.

  4. Zero-shot cross-lingual transfer language selection using linguistic similarity

    Authors: Juuso Eronen, Michal Ptaszynski, Fumito Masui

    Abstract: We study the selection of transfer languages for different Natural Language Processing tasks, specifically sentiment analysis, named entity recognition and dependency parsing. In order to select an optimal transfer language, we propose to utilize different linguistic similarity metrics to measure the distance between languages and make the choice of transfer language based on this information inst…

    Submitted 31 January, 2023; originally announced January 2023.

    Journal ref: Information Processing & Management, Volume 60, Issue 3, 2023, 103250

  5. arXiv:2301.07295  [pdf, other]

    cs.CL cs.LG eess.AS

    Adapting Multilingual Speech Representation Model for a New, Underresourced Language through Multilingual Fine-tuning and Continued Pretraining

    Authors: Karol Nowakowski, Michal Ptaszynski, Kyoko Murasaki, Jagna Nieuważny

    Abstract: In recent years, neural models learned through self-supervised pretraining on large scale multilingual text or speech data have exhibited promising results for underresourced languages, especially when a relatively large amount of data from related language(s) is available. While the technology has a potential for facilitating tasks carried out in language documentation projects, such as speech tr…

    Submitted 17 January, 2023; originally announced January 2023.

    Comments: 14 pages

    Journal ref: Information Processing & Management, Volume 60, Issue 2, March 2023, 103148, ISSN 0306-4573

  6. Can You Fool AI by Doing a 180? – A Case Study on Authorship Analysis of Texts by Arata Osada

    Authors: Jagna Nieuwazny, Karol Nowakowski, Michal Ptaszynski, Fumito Masui

    Abstract: This paper is our attempt at answering a twofold question covering the areas of ethics and authorship analysis. Firstly, since the methods used for performing authorship analysis imply that an author can be recognized by the content he or she creates, we were interested in finding out whether it would be possible for an author identification system to correctly attribute works to authors if in the…

    Submitted 19 July, 2022; originally announced July 2022.

    Journal ref: Information Processing & Management, Volume 58, Issue 5, 2021, 102644, ISSN 0306-4573

  7. arXiv:2206.01950  [pdf]

    cs.CL cs.AI cs.LG

    Comparing Performance of Different Linguistically-Backed Word Embeddings for Cyberbullying Detection

    Authors: Juuso Eronen, Michal Ptaszynski, Fumito Masui

    Abstract: In most cases, word embeddings are learned only from raw tokens or, in some cases, lemmas. This includes pre-trained language models like BERT. To investigate the potential of capturing deeper relations between lexical items and structures and to filter out redundant information, we propose to preserve the morphological, syntactic and other types of linguistic information by combining them with…

    Submitted 4 June, 2022; originally announced June 2022.

    Journal ref: Proceedings of the 2021 International Workshop on Modern Science and Technology, September 29, 2021

  8. arXiv:2206.01949  [pdf]

    cs.CL cs.AI cs.LG

    Exploring the Potential of Feature Density in Estimating Machine Learning Classifier Performance with Application to Cyberbullying Detection

    Authors: Juuso Eronen, Michal Ptaszynski, Fumito Masui, Gniewosz Leliwa, Michal Wroczynski

    Abstract: In this research, we analyze the potential of Feature Density (FD) as a way to comparatively estimate machine learning (ML) classifier performance prior to training. The goal of the study is to aid in solving the problem of resource-intensive training of ML models which is becoming a serious issue due to continuously increasing dataset sizes and the ever rising popularity of Deep Neural Networks (…

    Submitted 4 June, 2022; originally announced June 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2111.01689

    Journal ref: The 7th Workshop on Linguistic and Cognitive Approaches to Dialog Agents (LaCATODA 2021) collocated with IJCAI 2021, August 21–26, 2021, Montreal, Canada. CEUR Workshop Proceedings 2935, 5-14

  9. arXiv:2206.01889  [pdf]

    cs.CL cs.AI cs.LG

    Initial Study into Application of Feature Density and Linguistically-backed Embedding to Improve Machine Learning-based Cyberbullying Detection

    Authors: Juuso Eronen, Michal Ptaszynski, Fumito Masui, Gniewosz Leliwa, Michal Wroczynski, Mateusz Piech, Aleksander Smywinski-Pohl

    Abstract: In this research, we study the change in the performance of machine learning (ML) classifiers when various linguistic preprocessing methods of a dataset were used, with the specific focus on linguistically-backed embeddings in Convolutional Neural Networks (CNN). Moreover, we study the concept of Feature Density and confirm its potential to comparatively predict the performance of ML classifiers,…

    Submitted 3 June, 2022; originally announced June 2022.

    Journal ref: Proceedings of The 6th Linguistic and Cognitive Approaches to Dialog Agents (LaCATODA 2020) IJCAI 2020 Workshop, Yokohama, Japan, January 2020

  10. arXiv:2206.00962  [pdf]

    cs.CL cs.AI cs.LG

    Transfer Language Selection for Zero-Shot Cross-Lingual Abusive Language Detection

    Authors: Juuso Eronen, Michal Ptaszynski, Fumito Masui, Masaki Arata, Gniewosz Leliwa, Michal Wroczynski

    Abstract: We study the selection of transfer languages for automatic abusive language detection. Instead of preparing a dataset for every language, we demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection. This way we can use existing data from higher-resource languages to build better detection systems for low-resource languages. Our datasets are from sev…

    Submitted 2 June, 2022; originally announced June 2022.

    Journal ref: Information Processing & Management, Volume 59, Issue 4, July 2022, paper ID: 102981

  11. arXiv:2203.02116  [pdf]

    cs.CL cs.AI cs.LG

    In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis

    Authors: Michal Ptaszynski, Pawel Dybala, Tatsuaki Matsuba, Fumito Masui, Rafal Rzepka, Kenji Araki, Yoshio Momouchi

    Abstract: One of the burning problems lately in Japan has been cyber-bullying, or slandering and bullying people online. The problem has been especially noticed on unofficial Web sites of Japanese schools. Volunteers consisting of school personnel and PTA (Parent-Teacher Association) members have started Online Patrol to spot malicious contents within Web forums and blogs. In practise, Online Patrol assumes…

    Submitted 3 March, 2022; originally announced March 2022.

    Comments: 12 pages, 11 tables, 6 figures

    Journal ref: International Journal of Computational Linguistics Research, Vol. 1, Issue 3, pp. 135-154, 2010

  12. arXiv:2111.01689  [pdf]

    cs.CL cs.AI cs.CY

    Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density

    Authors: Juuso Eronen, Michal Ptaszynski, Fumito Masui, Aleksander Smywiński-Pohl, Gniewosz Leliwa, Michal Wroczynski

    Abstract: We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods in order to estimate dataset complexity, which in turn is used to comparatively estimate the potential performance of machine learning (ML) classifiers prior to any training. We hypothesise that estimating dataset complexity allows for the reduction of the number of required exper…

    Submitted 2 November, 2021; v1 submitted 2 November, 2021; originally announced November 2021.

    Comments: 73 pages, 4 figures, 19 tables, Information Processing and Management, Vol. 58, Issue 5, September 2021, paper ID: 102616

    Journal ref: Information Processing and Management, Vol. 58, Issue 5, September 2021, paper ID: 102616

  13. arXiv:1808.00926  [pdf, other]

    cs.CL

    Cyberbullying Detection -- Technical Report 2/2018, Department of Computer Science AGH, University of Science and Technology

    Authors: Michał Ptaszyński, Gniewosz Leliwa, Mateusz Piech, Aleksander Smywiński-Pohl

    Abstract: The research described in this paper concerns automatic cyberbullying detection in social media. There are two goals to achieve: building a gold standard cyberbullying detection dataset and measuring the performance of the Samurai cyberbullying detection system. The Formspring dataset provided in a Kaggle competition was re-annotated as a part of the research. The annotation procedure is described…

    Submitted 2 August, 2018; originally announced August 2018.

    Report number: 2/2018 CS AGH