
Sentiment Analysis on Social Media with Python


Machine Learning with Python and Applications to Sentiment Analysis in Social Media


Abstract

The main purpose of this experiment is to train the Naive Bayesian model and the Random Forest model for sentiment analysis tasks on Twitter comments and compare their performance. Previous research has validated the effectiveness of both the Naive Bayesian and Random Forest models in the field of sentiment analysis, providing a theoretical foundation for the current study.

The experimental methodology involved data preparation, model building, and model application. The experimental results revealed that the Random Forest model outperformed the Naive Bayesian model in terms of accuracy, precision, recall, and F1-score. However, the Naive Bayesian model offered advantages in terms of smaller model size and faster training speed.

The findings of the experiment indicate that the Random Forest model is more suitable for sentiment analysis tasks on Twitter comments, while the Naive Bayesian model’s lightweight nature may prove beneficial in other domains.

Introduction

The field of sentiment analysis has gained significant attention in recent years due to the proliferation of social media platforms such as Twitter, where users express their opinions on various topics. The analysis of these sentiments can provide valuable insights into public opinion, brand perception, and social trends. Previous research has explored the use of machine learning models for sentiment classification, with various algorithms demonstrating their effectiveness in this domain.

I use two models to analyse the sentiment of Twitter comments: the Naive Bayesian model and the Random Forest model. I hope to learn practical model training and evaluation methods through this experiment, and to compare the performance of these two classification models on sentiment classification tasks.

The experiment’s findings are expected to shed light on the suitability of both the Naive Bayesian and Random Forest models for entity sentiment analysis on Twitter, potentially guiding future research and practice in this area.

Method

I use Google Colab to run the program. My code can be divided into three main parts, as follows:

  1. Data preparation
    First of all, I load the data downloaded from Kaggle Twitter Sentiment Analysis (kaggle.com). For convenience, I combine the validation and training sets into one data frame and preprocess it. The raw data contains a lot of irrelevant information that interferes with model training, so it requires cleaning. The cleaning process includes removing empty and duplicated rows, removing website links and non-words (such as emoji), converting words to lowercase, and eliminating stopwords (like ‘the’, ‘a’). Finally, the words are transformed into TF-IDF vectors so they can be fed to the models (see the first sketch after this list).

  2. Model building
    I use ‘train_test_split’ to split the entire dataset into training and validation sets in a ratio of 8:2. Then I train the two models on the training data and check their performance on the validation data. To evaluate their overall performance, I print the confusion matrix and the classification report, including precision, recall and f1-score.
    In order to use the models later, I use ‘joblib’ to save them as ‘.pkl’ files (see the second sketch after this list).

  3. Model application
    I browsed Twitter and collected some comments myself, then loaded the saved models and tested them on these comments to see whether they met expectations (see the third sketch after this list).
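
Below is a minimal sketch of the data-preparation step. The file names, column layout and parameter values are my own assumptions for illustration and are not taken from the original code.

```python
import re
import pandas as pd
from nltk.corpus import stopwords                     # requires nltk.download("stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine the Kaggle training and validation CSVs into one data frame
# (assumed file names and column layout).
cols = ["id", "entity", "sentiment", "text"]
df = pd.concat([
    pd.read_csv("twitter_training.csv", names=cols),
    pd.read_csv("twitter_validation.csv", names=cols),
], ignore_index=True)

# Remove empty and duplicated rows.
df = df.dropna(subset=["text"]).drop_duplicates()

stop_words = set(stopwords.words("english"))

def clean(text: str) -> str:
    """Strip links and non-letter characters, lowercase, and drop stopwords."""
    text = re.sub(r"http\S+", " ", text)              # remove website links
    text = re.sub(r"[^a-zA-Z\s]", " ", text)          # drop non-words such as emoji
    words = [w for w in text.lower().split() if w not in stop_words]
    return " ".join(words)

df["clean_text"] = df["text"].apply(clean)

# Transform the cleaned text into TF-IDF vectors for the models.
vectorizer = TfidfVectorizer(max_features=5000)       # feature count is an assumption
X = vectorizer.fit_transform(df["clean_text"])
y = df["sentiment"]
```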
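
A sketch of the model-building step, continuing from the `X`, `y` and `vectorizer` variables above; the hyperparameters are illustrative assumptions rather than the values used in the original experiment.

```python
import joblib
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# 8:2 split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    print(name)
    print(confusion_matrix(y_val, y_pred))
    print(classification_report(y_val, y_pred))
    joblib.dump(model, f"{name}.pkl")                  # save the model for later use

# The fitted vectorizer also has to be saved so that new text can be
# transformed in exactly the same way when the models are reused.
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")
```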
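
Finally, a sketch of how a saved model can be applied to new comments. The file names follow the previous sketch, and the example comments are placeholders for illustration, not the actual tweets I collected.

```python
import joblib

# Load the saved model and the fitted TF-IDF vectorizer.
model = joblib.load("random_forest.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")

# Hand-picked comments (placeholders, not the real test tweets).
comments = [
    "I love this game, the new update is amazing!",
    "Worst customer service I have ever had.",
]

# Reuse the clean() helper defined in the first sketch before vectorizing.
features = vectorizer.transform([clean(c) for c in comments])
print(model.predict(features))
```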

Result

1. Naive Bayesian model

Figure 1: Confusion matrix of the Naive Bayesian model

| | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| Irrelevant | 0.38 | 0.63 | 0.47 | 2483 |
| Negative | 0.65 | 0.48 | 0.55 | 4374 |
| Neutral | 0.62 | 0.48 | 0.54 | 3568 |
| Positive | 0.54 | 0.58 | 0.56 | 4003 |
| Accuracy | | | 0.53 | 14428 |
| Macro avg | 0.55 | 0.54 | 0.53 | 14428 |
| Weighted avg | 0.56 | 0.53 | 0.54 | 14428 |

Table 1: Classification report of the Naive Bayesian model

2. Random Forest Model

Figure 2: Confusion matrix of the Random Forest model

| | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| Irrelevant | 0.97 | 0.85 | 0.91 | 2483 |
| Negative | 0.92 | 0.92 | 0.92 | 4374 |
| Neutral | 0.93 | 0.90 | 0.91 | 3568 |
| Positive | 0.85 | 0.94 | 0.89 | 4003 |
| Accuracy | | | 0.91 | 14428 |
| Macro avg | 0.92 | 0.90 | 0.91 | 14428 |
| Weighted avg | 0.91 | 0.91 | 0.91 | 14428 |

Table 2: Classification report of the Random Forest model

Discussion

The results show that the Random Forest model outperformed the Naive Bayesian model in terms of accuracy, achieving an accuracy of 0.91 compared to 0.53 for the Naive Bayesian model. This suggests that the Random Forest model is better suited for sentiment analysis tasks on this dataset.

The high accuracy of the Random Forest model can be attributed to its ability to handle non-linear relationships between features, as well as its robustness to overfitting. In contrast, the Naive Bayesian model assumes independence between features, which may not be the case in sentiment analysis tasks where the presence of certain words or phrases can influence the sentiment of the entire text.

The confusion matrices make the strong performance of the Random Forest model easy to see. It achieved high precision, recall and f1-score for all sentiment classes, indicating that it can accurately classify tweets into the correct sentiment categories.

However, judging the two models by accuracy alone is one-sided. It is worth noting that the .pkl file of the Naive Bayesian model is 1.61 MB, while that of the Random Forest model is about 300 MB, roughly 200 times larger. Also, training the Naive Bayesian model takes only about one minute, whereas the Random Forest model needs more than half an hour to produce its scores. The Naive Bayesian model therefore has the advantages of a small footprint and fast training, and this lightweight nature can be an advantage in other fields.

There is no doubt that the Random Forest model is more suitable for sentiment analysis tasks on Twitter data than the Naive Bayesian model. However, it is important to note that the performance of both models could be further improved by fine-tuning their hyperparameters and using more data. The experiment achieved its expected goal: it greatly improved my programming skills and gave me a deeper understanding of how classification models are trained and used in practice.
