Machine Learning with Python and Applications to Sentiment Analysis in Social Media
Contents:
- Machine Learning with Python and Applications to Sentiment Analysis in Social Media
- Abstract
- Introduction
- Method
- Results
- Discussion
Abstract
The main purpose of this experiment is to train the Naive Bayesian model and the Random Forest model for sentiment analysis tasks on Twitter comments and compare their performance. Previous research has validated the effectiveness of both the Naive Bayesian and Random Forest models in the field of sentiment analysis, providing a theoretical foundation for the current study.
The experimental methodology involved data preparation, model building, and model application. The experimental results revealed that the Random Forest model outperformed the Naive Bayesian model in terms of accuracy, precision, recall, and F1-score. However, the Naive Bayesian model offered advantages in terms of smaller model size and faster training speed.
The findings of the experiment indicate that the Random Forest model is more suitable for sentiment analysis tasks on Twitter comments, while the Naive Bayesian model’s lightweight nature may prove beneficial in other domains.
Introduction
The field of sentiment analysis has gained significant attention in recent years due to the proliferation of social media platforms such as Twitter, where users express their opinions on various topics. The analysis of these sentiments can provide valuable insights into public opinion, brand perception, and social trends. Previous research has explored the use of machine learning models for sentiment classification, with various algorithms demonstrating their effectiveness in this domain.
I use two models to analyse the sentiment of Twitter comments: the Naive Bayesian model and the Random Forest model. Through this experiment, I hope to learn practical model training and evaluation methods and to compare the performance of these two classification models on a sentiment classification task.
The experiment’s findings are expected to shed light on the suitability of both the Naive Bayesian and Random Forest models for entity sentiment analysis on Twitter, potentially guiding future research and practice in this area.
Method
I use Google Colab to run the program. My code can be divided into three main parts, as follows:
- Preparation of the data: First, I load the data downloaded from Kaggle Twitter Sentiment Analysis (kaggle.com). For convenience, I combine the validation and training sets into one data frame and preprocess it. The raw data contains a lot of irrelevant information that interferes with model training, so it needs to be cleaned. The cleaning includes removing empty and duplicated rows, removing website links and non-words (such as emoji), converting the text to lowercase, and eliminating stopwords (such as ‘the’ and ‘a’). Finally, the words are transformed into TF-IDF vectors to feed the models (see the first sketch after this list).
- Building the models: I use ‘train_test_split’ to split the entire dataset into training and validation sets in a ratio of 8:2. Then I train the two models on the training data and check their performance on the validation data. To evaluate their overall performance, I print the confusion matrix and the classification report, which includes precision, recall and f1-score. So that the models can be used later, I save them as ‘.pkl’ files with ‘joblib’ (see the second sketch after this list).
- Using the models: I browsed Twitter and collected some comments myself, loaded the saved models, and ran them on these comments to check whether the predictions met expectations (see the third sketch after this list).
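The first sketch below illustrates the data-preparation step. It is a minimal sketch rather than my exact notebook code: the file names ‘twitter_training.csv’ and ‘twitter_validation.csv’, the column layout, and settings such as max_features are assumptions for illustration.

```python
# Minimal data-preparation sketch (file/column names and settings are assumptions).
import re
import pandas as pd
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer

cols = ["id", "entity", "sentiment", "text"]               # assumed column layout
train = pd.read_csv("twitter_training.csv", names=cols)    # assumed file names
valid = pd.read_csv("twitter_validation.csv", names=cols)
df = pd.concat([train, valid], ignore_index=True)

# Remove empty and duplicated rows.
df = df.dropna(subset=["text"]).drop_duplicates()

stop_words = set(stopwords.words("english"))

def clean(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # strip website links
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # strip emoji and other non-words
    words = text.lower().split()                    # lowercase
    return " ".join(w for w in words if w not in stop_words)  # drop stopwords

df["clean"] = df["text"].apply(clean)

# Turn the cleaned text into TF-IDF vectors to feed the models.
vectorizer = TfidfVectorizer(max_features=5000)     # illustrative setting
X = vectorizer.fit_transform(df["clean"])
y = df["sentiment"]
```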
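The second sketch covers the model-building step, continuing from X, y and vectorizer above. The random_state, n_estimators and the extra step of saving the TF-IDF vectorizer are illustrative assumptions, not necessarily my exact settings.

```python
# Minimal model-building sketch (continues from X, y, vectorizer; settings are illustrative).
import joblib
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Split into 80% training and 20% validation data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    print(name)
    print(confusion_matrix(y_val, pred))            # confusion matrix
    print(classification_report(y_val, pred))       # precision, recall, f1-score
    joblib.dump(model, f"{name}.pkl")               # save the model for later use

joblib.dump(vectorizer, "tfidf.pkl")                # assumption: also save the vectorizer
```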
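The third sketch shows how a saved model can be applied to new comments. The example comments and file names are placeholders, not the actual tweets I collected.

```python
# Minimal sketch of applying a saved model to new comments (placeholder inputs).
import joblib

vectorizer = joblib.load("tfidf.pkl")           # file names match the sketch above
model = joblib.load("random_forest.pkl")

comments = [
    "I love the new update, great job!",        # placeholder examples, not the
    "This game keeps crashing, so annoying.",   # actual comments I collected
]
# In practice the same clean() preprocessing as in the first sketch should be applied first.
features = vectorizer.transform(comments)
print(model.predict(features))                  # prints the predicted sentiment labels
```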
Results
1. Naive Bayesian model
Figure 1: Confusion matrix of the Naive Bayesian model
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Irrelevant | 0.38 | 0.63 | 0.47 | 2483 |
| Negative | 0.65 | 0.48 | 0.55 | 4374 |
| Neutral | 0.62 | 0.48 | 0.54 | 3568 |
| Positive | 0.54 | 0.58 | 0.56 | 4003 |
| Accuracy | | | 0.53 | 14428 |
| Macro avg | 0.55 | 0.54 | 0.53 | 14428 |
| Weighted avg | 0.56 | 0.53 | 0.54 | 14428 |
Table 1: Classification report of the Naive Bayesian model
2. Random Forest Model
Figure 2: Confusion matrix of the Random Forest model
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Irrelevant | 0.97 | 0.85 | 0.91 | 2483 |
| Negative | 0.92 | 0.92 | 0.92 | 4374 |
| Neutral | 0.93 | 0.90 | 0.91 | 3568 |
| Positive | 0.85 | 0.94 | 0.89 | 4003 |
| Accuracy | | | 0.91 | 14428 |
| Macro avg | 0.92 | 0.90 | 0.91 | 14428 |
| Weighted avg | 0.91 | 0.91 | 0.91 | 14428 |
Table 2: Classification report of the Random Forest model
Discussion
The results show that the Random Forest model outperformed the Naive Bayesian model in terms of accuracy, achieving a score of 0.91 compared with 0.53 for the Naive Bayesian model. This suggests that the Random Forest model is better suited to sentiment analysis tasks on this dataset.
The high accuracy of the Random Forest model can be attributed to its ability to handle non-linear relationships between features, as well as its robustness to overfitting. In contrast, the Naive Bayesian model assumes independence between features, which may not be the case in sentiment analysis tasks where the presence of certain words or phrases can influence the sentiment of the entire text.
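For reference, the independence assumption mentioned above is the core simplification of Naive Bayes: the probability of a class given a tweet's words is modelled as the class prior times a product of per-word likelihoods, as in the standard formulation below.

$$P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)$$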
The confusion matrix also makes the strong performance of the Random Forest model easy to see. The model achieved high precision, recall, and f1-score for all sentiment classes, indicating that it accurately assigns tweets to the correct sentiment categories.
However, judging the two models solely by accuracy would be one-sided. It is worth noting that the ‘.pkl’ file of the Naive Bayesian model is 1.61 MB, while that of the Random Forest model is about 300 MB, nearly 200 times larger. Training the Naive Bayesian model also takes only about one minute, whereas the Random Forest model needs more than half an hour to produce its scores. The Naive Bayesian model therefore has the advantages of a small footprint and fast training, and this lightweight nature can be an advantage in other settings.
There is no doubt that the Random Forest model is more suitable for sentiment analysis tasks on Twitter data than the Naive Bayesian model. However, it is important to note that the performance of both models could be further improved by fine-tuning their hyperparameters and by using larger datasets. The experiment achieved its intended goal: it greatly improved my programming skills and gave me a deeper understanding of the practical training and use of classification models.