Applying NLP and Machine Learning for Sentiment Classification Across Multiple Twitter Datasets

Ervin Vladić, Benjamin Mehanović, Mirza Novalić, Dino Kečo, Dželila Mehanović

Abstract


Over the last ten years, social networks have become the main platforms for sharing opinions and holding discussions. At the same time, advances in machine learning (ML) and natural language processing (NLP) have enabled new approaches to analyzing the large quantities of data that users create. This research applies data loading, class imbalance handling, text preprocessing and tokenization, sentiment analysis, and model assessment techniques to classify the sentiment of tweets across two datasets. Using accuracy, precision, recall, and F1 score as metrics, the study finds that SVM and Logistic Regression are the most suitable machine learning models for this purpose. For Dataset 1, SVM attained an accuracy of 90% in training and 77% in testing, while Logistic Regression attained 83% in training and 78% in testing. For Dataset 2, SVM attained 98% in training and 86% in testing, while Logistic Regression showed a better balance between training and testing, achieving 93% in training and 87% in testing.
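The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not state how per-class scores were averaged or which sentiment labels were used, so the macro-averaging, the function name `classification_metrics`, and the toy labels below are assumptions made for illustration only, not the authors' implementation.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Macro-averaging (an unweighted mean over classes) is an assumption;
    the source does not say which averaging scheme was used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    n_classes = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1_scores) / n_classes,
    }


# Hypothetical three-class toy example (labels are illustrative only).
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
scores = classification_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Computing the same dictionary on the training split and on the test split, then comparing the two, gives the kind of training-versus-testing comparison reported for SVM and Logistic Regression above.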

Keywords


Social Media Analysis, Classification Models, Data Preprocessing, Sentiment Analysis, Logistic Regression.




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.


IT in Industry @ http://www.it-in-industry.com . ISSN (Online): 2203-1731; ISSN (Print): 2204-0595