Online news is information about an event or occurrence presented through internet media. To help readers select what they want to read, online news portals usually categorize articles by label, typically based on the topic the article discusses. An article that covers more than one topic therefore carries multiple labels. Online news portals can publish tens to hundreds of stories every day, and one of the most frequently occurring topics is sexual violence. A news article about sexual violence is likely to cover more than one topic related to sexual violence, so multi-label classification is needed to group such articles and to obtain the appropriate keywords for each case. Many cases of sexual violence are reported in online news: in 2018, online news with the tagline “Berita Terkini Indonesia” carried 938 news articles about sexual violence from 39 news sources.
According to data from BPS-Statistics Indonesia, there were 5,233 cases of sexual violence in 2019, rising to 6,872 cases in 2020. In 2022, the Law of the Republic of Indonesia Number 12 of 2022 was passed with the aim of “preventing the occurrence of all forms of sexual violence, treating, protecting, and rehabilitating victims of sexual violence, prosecuting and rehabilitating perpetrators of sexual violence, creating an environment free from sexual violence and ensuring that sexual violence does not recur”. Sexual violence has become part of the broader problem of violence, and many cases occur throughout Indonesia, involving children, adolescents, and adults. Records from the National Commission on Violence against Women (Komisi Nasional Perempuan) from 2015 to 2020 show that sexual violence occurred at all levels of education, with 27% of it occurring at the university level. Sexual violence occurs repeatedly and continuously, yet few people understand or are sensitive to the issue, and many consider it a mere immoral crime. Immoral crimes are behaviors that deviate from the norms of decency, one of which is sexual harassment.
Grouping news manually is difficult and time-consuming, so a method is needed to classify news on online news portals automatically. Because news articles are usually in text form, automatic classification can be carried out with text mining. This research performs multi-label classification of online news articles using the Multinomial Naive Bayes algorithm, a machine learning algorithm frequently used to classify text documents. Multinomial Naive Bayes was chosen for this study because it is straightforward to implement, can handle small training datasets, and has proven effective in previous research. Earlier work shows that classification using Naive Bayes with TF-IDF word weighting gives a higher accuracy, 87%, than SVM, Decision Tree, KNN, Random Forest, and Logistic Regression with BoW, Doc2Vec, and word2Vec word weighting.
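The Multinomial Naive Bayes classifier described above can be illustrated with a minimal, standard-library Python sketch. It is not the study's implementation: the class name `TinyMultinomialNB`, the toy documents, the neutral topic labels `sports` and `economy`, and the use of raw word counts with add-one (Laplace) smoothing rather than TF-IDF weights are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Illustrative Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {word for doc in docs for word in doc.split()}
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.total_docs = len(docs)
        return self

    def log_score(self, doc, cls):
        # log P(cls) + sum of log P(word | cls) over the document's words
        score = math.log(self.class_counts[cls] / self.total_docs)
        total_words = sum(self.word_counts[cls].values())
        for word in doc.split():
            if word in self.vocab:  # words unseen in training are ignored
                count = self.word_counts[cls][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, doc):
        # Choose the class with the highest log posterior score
        return max(self.classes, key=lambda cls: self.log_score(doc, cls))


# Hypothetical toy corpus with neutral topics
train_docs = [
    "match goal team win",
    "team coach match",
    "market stock price",
    "price inflation market bank",
]
train_labels = ["sports", "sports", "economy", "economy"]

clf = TinyMultinomialNB().fit(train_docs, train_labels)
print(clf.predict("team match goal"))
print(clf.predict("stock price bank"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that never appeared in a class from forcing that class's probability to zero.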
Other research shows that multi-label classification using Multinomial Naive Bayes produces a low Hamming loss of 0.1247. In research on Multi-Label Classification on Indonesian Language News Topics Using Multinomial Naive Bayes, news articles were classified into 13 labels using 177 articles taken from jawapos.com, yielding a Hamming loss of 0.18. This research differs from previous work in that it performs multi-label classification of online news articles about sexual violence crime using Multinomial Naive Bayes with two problem transformation methods, Binary Relevance (BR) and Label Power-set (LP).
This research aims to perform multi-label classification of online news articles on a specific topic, sexual violence, using the Multinomial Naive Bayes algorithm with two problem transformation methods, Binary Relevance and Label Power-set. The two methods are compared to determine which works best with Multinomial Naive Bayes for classifying news from the online news portals detik.com and kompas.com. The labels used in this classification comply with Law No. 12 of 2022, Article 4, Sections 1 and 2. By automatically classifying online news articles according to the type of sexual violence crime they report, this study aims to make related articles easier to group and to produce precise and accurate keywords for those articles.
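The difference between the two problem transformation methods can be sketched in a few lines of standard-library Python. The label names `label_a`, `label_b`, and `label_c` and the four example label sets are hypothetical placeholders, not the labels or data used in the study, and the helper function names are introduced here only for illustration.

```python
# Hypothetical label names; the study's labels follow Law No. 12 of 2022
LABELS = ["label_a", "label_b", "label_c"]


def binary_relevance_targets(label_sets, labels):
    """Binary Relevance: one 0/1 target list per label."""
    return {lab: [int(lab in s) for s in label_sets] for lab in labels}


def label_powerset_targets(label_sets):
    """Label Power-set: each distinct label combination becomes one class id."""
    combos = sorted({tuple(sorted(s)) for s in label_sets})
    combo_to_id = {combo: i for i, combo in enumerate(combos)}
    targets = [combo_to_id[tuple(sorted(s))] for s in label_sets]
    return targets, combos


# Label sets of four hypothetical articles
article_labels = [
    {"label_a"},
    {"label_a", "label_b"},
    {"label_c"},
    {"label_a", "label_b"},
]

br_targets = binary_relevance_targets(article_labels, LABELS)
lp_targets, lp_combos = label_powerset_targets(article_labels)

print(br_targets)
print(lp_targets, lp_combos)
```

With Binary Relevance, one binary classifier is trained per label and the labels are treated independently. With Label Power-set, a single multi-class classifier is trained over label combinations, which preserves label co-occurrence but can only predict combinations seen during training.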
Online news articles on sexual violence were collected by web scraping the online news portals detik.com and kompas.com. The scraping used 20 keywords related to sexual violence. The keywords refer to Law No. 12 of 2022, adjusted to the wording commonly used in news articles, for example ‘catcalling’, which refers to ‘kekerasan seksual non fisik’, and ‘ancaman video seks’, which refers to ‘kekerasan seksual elektronik’. Scraping was performed in page order by upload date, from the newest to the oldest article on the online news page. For each article, the title, date, link, and content were collected. Web scraping yielded 3,499 articles, 1,783 from detik.com and 1,716 from kompas.com.
Because the web scraping also returned articles that contained statements about sexual violence but were not news about sexual violence, the data were cleaned manually by deleting those articles. After cleaning, 2,203 articles were labeled and preprocessed. In this study, the titles and content of the online news articles are used to find the sexual violence labels that match each article, and the resulting data are used for classification.
The best accuracy obtained in this study is 60.309%, which shows that the Multinomial Naive Bayes algorithm with a problem transformation approach is feasible for multi-label classification of news articles about sexual violence. This finding is consistent with previous research, which has demonstrated the efficacy of the Multinomial Naive Bayes algorithm for multi-label text classification. However, further in-depth studies are required to determine the most relevant keywords for news articles about sexual violence and to balance the amount of data on each label in order to improve the accuracy of the results.
Multi-label classification can be performed using the Multinomial Naive Bayes algorithm with the Binary Relevance (BR) and Label Power-set (LP) problem transformation methods. In the experiments, the best accuracy, 60.309%, was obtained using Multinomial Naive Bayes with Label Power-set (LP) on a 90:10 split of training and testing data. On the same 90:10 split, the Binary Relevance (BR) approach reached an accuracy of 51.546%. Suggestions for future research include adding relevant keywords through query expansion so that more data can be classified, balancing the amount of data across labels to avoid significant imbalance, and adding more data with multiple labels.
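The two evaluation measures mentioned in this summary, accuracy and Hamming loss, can be computed as in the following standard-library Python sketch. It assumes that accuracy means exact-match (subset) accuracy, which the source does not state, and the label vectors are hypothetical values chosen for illustration, not results from the study.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return wrong / (len(y_true) * len(y_true[0]))


def subset_accuracy(y_true, y_pred):
    """Fraction of articles whose whole label vector is predicted exactly."""
    exact = sum(t == p for t, p in zip(y_true, y_pred))
    return exact / len(y_true)


# Hypothetical true and predicted label vectors for four articles, three labels
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]

print(hamming_loss(y_true, y_pred))     # 2 wrong slots out of 12
print(subset_accuracy(y_true, y_pred))  # 2 exact matches out of 4
```

Subset accuracy is strict, since one wrong label makes the whole article count as an error, while Hamming loss gives partial credit for the labels that are correct. This is why a classifier can show a low Hamming loss together with a moderate accuracy.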
Author: Badrus Zaman
Source article can be accessed at https://scholar.unair.ac.id/en/publications/a-multi-label-classification-of-online-news-using-multinomial-nai/





