Commit

Merge branch 'main' of github.com:olafurjohannsson/sentiment-analysis
olafurjohannsson committed Dec 11, 2023
2 parents 6c4b0ea + 95c287a commit 63b2ea5
Showing 5 changed files with 84 additions and 59 deletions.
80 changes: 42 additions & 38 deletions README.md
@@ -4,6 +4,21 @@

# Instructions

## Installation of Dependencies

To run the scripts, you need to install the dependencies. Follow the steps below to set up your environment.

### Prerequisites

- Python 3.x (Make sure Python 3 is installed on your system.)

### Installation Steps

1. Ensure Python 3.x is installed.
2. Install Requirements:
- `pip install -r requirements.txt`
3. Install PyTorch: The CPU version of PyTorch is already specified in the `requirements.txt` file of this project. The GPU version is recommended; to install it, visit the [PyTorch Get Started](https://pytorch.org/get-started/locally/) page, select your preferences, and run the provided installation command.
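
After installation, a quick check can confirm which device PyTorch will actually use. The snippet below is a minimal sketch and not part of this repository: `pick_device` is a hypothetical helper name, and it assumes that falling back to the CPU is acceptable when PyTorch is missing or no CUDA device is visible.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA device, otherwise "cpu"."""
    try:
        # Imported lazily so the check also works before PyTorch is installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Hypothetical usage: decide once, then pass the value wherever a device is expected.
device = pick_device()
print(device)
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build and the local CUDA version are the first things to compare.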

## Machine-translate

This section provides instructions for using the machine translation scripts included in this project: `translate_google.py` and `translate_mideind.py`. These scripts are used for translating text data into Icelandic for sentiment analysis.
@@ -21,20 +36,13 @@ This section provides instructions for using the machine translation scripts inc
- `googletrans` version 3.1.0a0
- Other dependencies: `concurrent.futures`, `threading`, `logging`

##### Installation

1. Ensure Python 3.x is installed.
2. Install the required Python packages:
- pip install pandas
- pip install googletrans==3.1.0a0

##### Usage

1. Run the script:
1. Ensure the `"IMDB-Dataset.csv"` file is located in the `Datasets` directory.
2. Run the script:

- python translate_google.py
- `python translate_google.py`

2. Select the CSV file containing the text to be translated when prompted. The file should have columns named 'review' and 'sentiment'.
3. The script will process the data and output two files in the `Datasets` directory:
- `IMDB-Dataset-GoogleTranslate.csv`: Contains translated reviews and sentiments.
- `failed-IMDB-Dataset-GoogleTranslate.csv`: Logs failed translation attempts.
@@ -48,6 +56,8 @@ To use a different dataset:
- Modify the script if your dataset columns have different names.
- Modify the script's `dataset` variable to match your dataset's filename.
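
Before pointing the script at a different dataset, it can help to confirm that the CSV header contains the expected columns. The following is a minimal sketch using only the standard library; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, not part of the translation scripts, and the file is assumed to be UTF-8 encoded with a header row.

```python
import csv

# Column names the README says the scripts expect (assumed to be exact matches).
REQUIRED_COLUMNS = {"review", "sentiment"}


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Only the first row is read; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    return sorted(set(required) - set(header))
```

An empty list means both columns are present; any returned names are the ones to rename in the dataset or adjust in the script.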

##

#### Using `translate_mideind.py`

##### Overview
@@ -57,23 +67,22 @@ To use a different dataset:
##### Prerequisites

- Python 3.x
- `transformers` and `torch` libraries
- Pandas library
- PyTorch
- `transformers` library
- `Pandas` library
- Other dependencies: `re`, `logging`

##### Installation
###### Note

1. Ensure Python 3.x is installed.
2. Install the required Python packages:
- pip install transformers torch pandas
- If you plan to use GPU acceleration with PyTorch, make sure your CUDA version is compatible with the installed PyTorch version.

##### Usage

1. Run the script:
- python translate_mideind.py
2. Select the folder containing the translation model when prompted.
3. Select the CSV file containing the text to be translated. The file should have columns named 'review' and 'sentiment'.
4. The script will process the data and output two files in the `Datasets` directory:

- `python translate_mideind.py`

2. The script will process the data and output two files in the `Datasets` directory:
- `IMDB-Dataset-MideindTranslate.csv`: Contains translated reviews and sentiments.
- `failed-IMDB-Dataset-MideindTranslate.csv`: Logs failed translation attempts.

@@ -101,10 +110,7 @@ This section provides instructions for using the `process.py` script, which perf

#### Installation

1. Ensure Python 3.x is installed.
2. Install the required Python packages:
- pip install pandas joblib nefnir
3. Download IceNLP from [IceNLP GitHub Repository](https://github.com/hrafnl/icenlp) and extract it.
1. Download IceNLP from [IceNLP GitHub Repository](https://github.com/hrafnl/icenlp) and extract it.

#### Usage

@@ -122,6 +128,8 @@ To use a different dataset:
- The dataset should have 'review' and 'sentiment' columns.
- Modify the `dataset_path` variable in the script to match your dataset's filename.

##

### Processing English Text

This section provides instructions for using the `process_eng.py` script, which performs text normalization and preprocessing for English text.
@@ -135,10 +143,7 @@ This section provides instructions for using the `process_eng.py` script, which

#### Installation

1. Ensure Python 3.x is installed.
2. Install the required Python packages:
- pip install pandas nltk joblib
3. Download necessary NLTK data:
1. Download necessary NLTK data:
- python -m nltk.downloader punkt stopwords wordnet

#### Usage
@@ -185,11 +190,9 @@ This section provides instructions for using the `train.py` script, which trains
- Scikit-learn library
- Other dependencies: `os`, `time`, `numpy`

### Installation
###### Note

1. Ensure Python 3.x is installed.
2. Install the required Python packages:
- pip install transformers torch pandas scikit-learn
- If you plan to use GPU acceleration with PyTorch, make sure your CUDA version is compatible with the installed PyTorch version.

### Usage

@@ -226,15 +229,16 @@ the pandas columns to use as X and y, and whether to return the accuracy or the
1. Import `generate_classification_report.py`: `import generate_classification_report as gcr`
2. Load the CSV file with the data to be tested: `df = pd.read_csv('IMDB-Dataset-GoogleTranslate.csv')`
3. Invoke the function `call_model`, which takes the following parameters:
- X_all: All review columns
- y_all: All sentiment columns
- model: The model to be used (This is a path to a file, something like `'./electra-base-google-batch8-remove-noise-model/'`)
- device: The device to be used (CUDA, cpu)
- accuracy: Whether to return accuracy or return a classification report

- X_all: All review columns
- y_all: All sentiment columns
- model: The model to be used (This is a path to a file, something like `'./electra-base-google-batch8-remove-noise-model/'`)
- device: The device to be used (CUDA, cpu)
- accuracy: Whether to return accuracy or return a classification report
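
To make the `accuracy` option concrete, here is a small plain-Python illustration of what a single accuracy score means. It is a sketch, not the project's code: `encode_sentiment` and `accuracy` are hypothetical helpers. The first mirrors the `Positive` to `1` mapping used in `generate_report.ipynb`, and the second stands in for scikit-learn's `accuracy_score`, which the script calls.

```python
def encode_sentiment(label):
    """Map a sentiment label to 1 ("Positive") or 0 (anything else)."""
    return 1 if label == "Positive" else 0


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)


# Made-up labels and predictions, purely for illustration.
y_true = [encode_sentiment(s) for s in ["Positive", "Negative", "Positive", "Negative"]]
y_pred = [1, 0, 0, 0]
print(accuracy(y_true, y_pred))  # 3 of the 4 labels match
```

In the project itself the corresponding call has the form `gcr.call_model(X_all, y_all, model, device, accuracy=True)`, as used in `generate_report.ipynb`.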

### Example

Example of how to generate a report can be seen in `generate_report.ipynb` - also the `generate_classification_report.py` `eval_files()` function, which is loading multiple models.
Example of how to generate a report can be seen in `generate_report.ipynb` - also the `generate_classification_report.py` `eval_files()` function, which is loading multiple models.

# License

12 changes: 12 additions & 0 deletions requirements.txt
@@ -0,0 +1,12 @@
googletrans==3.1.0a0
joblib==1.2.0
nefnir==1.0.2
nltk==3.8.1
numpy==1.24.2
pandas==1.5.3
scikit_learn==1.2.2
tokenizer==3.4.3
torch==2.0.1
torchvision==0.15.2
torchaudio==2.0.2
transformers==4.34.1
4 changes: 1 addition & 3 deletions src/BaselineClassifiersBinary.ipynb
@@ -356,7 +356,6 @@
" print(lr_pipeline)\n",
" print(classification_report(y_test, predict_lr, digits=4))\n",
"\n",
" \n",
" return (\n",
" (\n",
" {\n",
@@ -396,7 +395,6 @@
" plt.show()\n",
"\n",
"\n",
"\n",
"data2, nb_mideind, svc_mideind, lr_mideind = classify(ICELANDIC_MIDEIND_CSV)\n",
"data3, nb_google, svc_google, lr_google = classify(ICELANDIC_GOOGLE_CSV)\n",
"data1, nb_english, svc_english, lr_english = classify(ENGLISH_CSV)\n",
@@ -631,7 +629,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.10.6"
}
},
"nbformat": 4,
7 changes: 4 additions & 3 deletions src/generate_classification_report.py
@@ -110,11 +110,13 @@ def generate_report(self, accuracy):
prediction = torch.max(outputs.logits, dim=1)
y_true.extend(labels.tolist())
y_pred.extend(prediction.indices.tolist())

if accuracy:
acc = accuracy_score(y_true, y_pred)
return acc
return classification_report(y_true, y_pred, output_dict=True) # NOTE: can use this if you want to print classification report
return classification_report(
y_true, y_pred, output_dict=True
) # NOTE: can use this if you want to print classification report


class DataFrameLoader:
@@ -165,7 +167,6 @@ def generate_report(filename, folder, device):
print("Loading model from folder {} using file {}".format(folder, filename))
dfl = DataFrameLoader(filename)
return call_model(dfl.X_all, dfl.y_all, folder, device)



def eval_files():
40 changes: 25 additions & 15 deletions src/generate_report.ipynb
@@ -64,38 +64,48 @@
"import random\n",
"import generate_classification_report as gcr\n",
"\n",
"PATH1 = ''\n",
"PATH2 = ''\n",
"PATH1 = \"\"\n",
"PATH2 = \"\"\n",
"\n",
"d1 = pd.read_csv(PATH1)\n",
"d2 = pd.read_csv(PATH2)\n",
"d1.drop(['num', 'rating', 'id'], axis=1, inplace=True)\n",
"d2.drop(['movie', 'rating'], axis=1, inplace=True)\n",
"d1.drop([\"num\", \"rating\", \"id\"], axis=1, inplace=True)\n",
"d2.drop([\"movie\", \"rating\"], axis=1, inplace=True)\n",
"\n",
"\n",
"df_orig = pd.merge(d1, d2, how='outer')\n",
"df_orig = pd.merge(d1, d2, how=\"outer\")\n",
"\n",
"device = 'cuda'\n",
"model = './electra-base-google-batch8-remove-noise-model/'\n",
"device = \"cuda\"\n",
"model = \"./electra-base-google-batch8-remove-noise-model/\"\n",
"\n",
"\n",
"total = 0\n",
"for i in range(0, 10):\n",
" r = random.randint(0, 10000)\n",
" \n",
" fifty_negative = df_orig.where(lambda x: x['sentiment'] == 'Negative').dropna().sample(n=50, random_state=r)\n",
" fifty_positive = df_orig.where(lambda x: x['sentiment'] == 'Positive').dropna().sample(n=50, random_state=r)\n",
"\n",
" new_df = pd.merge(fifty_negative, fifty_positive, on=['sentiment', 'review'], how='outer')\n",
" new_df.sentiment = new_df.sentiment.apply(lambda x: 1 if x == 'Positive' else 0)\n",
" fifty_negative = (\n",
" df_orig.where(lambda x: x[\"sentiment\"] == \"Negative\")\n",
" .dropna()\n",
" .sample(n=50, random_state=r)\n",
" )\n",
" fifty_positive = (\n",
" df_orig.where(lambda x: x[\"sentiment\"] == \"Positive\")\n",
" .dropna()\n",
" .sample(n=50, random_state=r)\n",
" )\n",
"\n",
" new_df = pd.merge(\n",
" fifty_negative, fifty_positive, on=[\"sentiment\", \"review\"], how=\"outer\"\n",
" )\n",
" new_df.sentiment = new_df.sentiment.apply(lambda x: 1 if x == \"Positive\" else 0)\n",
" X_all = new_df.review\n",
" y_all = new_df.sentiment\n",
" accuracy = gcr.call_model(X_all, y_all, model, device, accuracy=True)\n",
" total += accuracy\n",
" print('acc: {0:.4f}, seed: {1}, i: {2}'.format(accuracy, r, i))\n",
" print(\"acc: {0:.4f}, seed: {1}, i: {2}\".format(accuracy, r, i))\n",
"\n",
"\n",
" \n",
"print('Average accuracy: ', total/10)\n"
"print(\"Average accuracy: \", total / 10)"
]
},
{