Merge branch 'main' of github.com:olafurjohannsson/sentiment-analysis

cadia-lvl · Dec 11, 2023 · 63b2ea5
2 parents 6c4b0ea + 95c287a
commit 63b2ea5
Showing 5 changed files with 84 additions and 59 deletions.
diff --git a/README.md b/README.md
@@ -4,6 +4,21 @@
 
 # Instructions
 
+## Installation of Dependencies
+
+To run the scripts, you need to install the dependencies. Follow the steps below to set up your environment.
+
+### Prerequisites
+
+-   Python 3.x (Make sure Python 3 is installed on your system.)
+
+### Installation Steps
+
+1. Ensure Python 3.x is installed.
+2. Install Requirements:
+    - `pip install -r requirements.txt`
+3. Install PyTorch: The CPU version of PyTorch is already specified in the `requirements.txt` file of this project. It's recommended that you use the GPU version of PyTorch, visit the [PyTorch Get Started](https://pytorch.org/get-started/locally/) page, select your preferences, and run the provided installation command.
+
 ## Machine-translate
 
 This section provides instructions for using the machine translation scripts included in this project: `translate_google.py` and `translate_mideind.py`. These scripts are used for translating text data into Icelandic for sentiment analysis.
@@ -21,20 +36,13 @@ This section provides instructions for using the machine translation scripts inc
 -   `googletrans` version 3.1.0a0
 -   Other dependencies: `concurrent.futures`, `threading`, `logging`
 
-##### Installation
-
-1. Ensure Python 3.x is installed.
-2. Install the required Python packages:
-    - pip install pandas
-    - pip install googletrans==3.1.0a0
-
 ##### Usage
 
-1. Run the script:
+1. Ensure the `"IMDB-Dataset.csv"` file is located in the `Datasets` directory.
+2. Run the script:
 
-    - python translate_google.py
+    - `python translate_google.py`
 
-2. Select the CSV file containing the text to be translated when prompted. The file should have columns named 'review' and 'sentiment'.
 3. The script will process the data and output two files in the `Datasets` directory:
     - `IMDB-Dataset-GoogleTranslate.csv`: Contains translated reviews and sentiments.
     - `failed-IMDB-Dataset-GoogleTranslate.csv`: Logs failed translation attempts.
@@ -48,6 +56,8 @@ To use a different dataset:
 -   Modify the script if your dataset columns have different names.
 -   Modify the script's `dataset` variable to match your dataset's filename.
 
+##
+
 #### Using `translate_mideind.py`
 
 ##### Overview
@@ -57,23 +67,22 @@ To use a different dataset:
 ##### Prerequisites
 
 -   Python 3.x
--   `transformers` and `torch` libraries
--   Pandas library
+-   PyTorch
+-   `transformers` library
+-   `Pandas` library
 -   Other dependencies: `re`, `logging`
 
-##### Installation
+###### Note
 
-1. Ensure Python 3.x is installed.
-2. Install the required Python packages:
-    - pip install transformers torch pandas
+-   If you plan to use GPU acceleration with PyTorch, make sure your CUDA version is compatible with the installed PyTorch version.
 
 ##### Usage
 
 1. Run the script:
-    - python translate_mideind.py
-2. Select the folder containing the translation model when prompted.
-3. Select the CSV file containing the text to be translated. The file should have columns named 'review' and 'sentiment'.
-4. The script will process the data and output two files in the `Datasets` directory:
+
+    - `python translate_mideind.py`
+
+2. The script will process the data and output two files in the `Datasets` directory:
     - `IMDB-Dataset-MideindTranslate.csv`: Contains translated reviews and sentiments.
     - `failed-IMDB-Dataset-MideindTranslate.csv`: Logs failed translation attempts.
 
@@ -101,10 +110,7 @@ This section provides instructions for using the `process.py` script, which perf
 
 #### Installation
 
-1. Ensure Python 3.x is installed.
-2. Install the required Python packages:
-    - pip install pandas joblib nefnir
-3. Download IceNLP from [IceNLP GitHub Repository](https://github.com/hrafnl/icenlp) and extract it.
+1. Download IceNLP from [IceNLP GitHub Repository](https://github.com/hrafnl/icenlp) and extract it.
 
 #### Usage
 
@@ -122,6 +128,8 @@ To use a different dataset:
 -   The dataset should have 'review' and 'sentiment' columns.
 -   Modify the `dataset_path` variable in the script to match your dataset's filename.
 
+##
+
 ### Processing English Text
 
 This section provides instructions for using the `process_eng.py` script, which performs text normalization and preprocessing for English text.
@@ -135,10 +143,7 @@ This section provides instructions for using the `process_eng.py` script, which
 
 #### Installation
 
-1. Ensure Python 3.x is installed.
-2. Install the required Python packages:
-    - pip install pandas nltk joblib
-3. Download necessary NLTK data:
+1. Download necessary NLTK data:
     - python -m nltk.downloader punkt stopwords wordnet
 
 #### Usage
@@ -185,11 +190,9 @@ This section provides instructions for using the `train.py` script, which trains
 -   Scikit-learn library
 -   Other dependencies: `os`, `time`, `numpy`
 
-### Installation
+###### Note
 
-1. Ensure Python 3.x is installed.
-2. Install the required Python packages:
-    - pip install transformers torch pandas scikit-learn
+-   If you plan to use GPU acceleration with PyTorch, make sure your CUDA version is compatible with the installed PyTorch version.
 
 ### Usage
 
@@ -226,15 +229,16 @@ the pandas columns to use as X and y, and whether to return the accuracy or the
 1. Import generate_classification_report.py `import generate_classification_report as gcr`
 2. Load the CSV file with the data to be tested `df = pd.read_csv('IMDB-Dataset-GoogleTranslate.csv')`
 3. Invoke the function call call_model, which takes the parameters
-- X_all: All review columns
-- y_all: All sentiment columns
-- model: The model to be used (This is a path to a file, something like `'./electra-base-google-batch8-remove-noise-model/'`)
-- device: The device to be used (CUDA, cpu)
-- accuracy: Whether to return accuracy or return a classification report
+
+-   X_all: All review columns
+-   y_all: All sentiment columns
+-   model: The model to be used (This is a path to a file, something like `'./electra-base-google-batch8-remove-noise-model/'`)
+-   device: The device to be used (CUDA, cpu)
+-   accuracy: Whether to return accuracy or return a classification report
 
 ### Example
 
-Example of how to generate a report can be seen in `generate_report.ipynb` - also the `generate_classification_report.py`  `eval_files()` function, which is loading multiple models.
+Example of how to generate a report can be seen in `generate_report.ipynb` - also the `generate_classification_report.py` `eval_files()` function, which is loading multiple models.
 
 # License
 

diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,12 @@
+googletrans==3.1.0a0
+joblib==1.2.0
+nefnir==1.0.2
+nltk==3.8.1
+numpy==1.24.2
+pandas==1.5.3
+scikit_learn==1.2.2
+tokenizer==3.4.3
+torch==2.0.1 
+torchvision==0.15.2 
+torchaudio==2.0.2
+transformers==4.34.1
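
Since `requirements.txt` now pins exact versions, it can help to compare an existing environment against those pins before running the scripts. The following is a minimal sketch using only the standard library and is not part of this commit. The helper names `parse_pins` and `check_pins` are hypothetical, and the parser only handles the simple `name==version` lines used in the file above.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in lines:
        line = raw.strip()  # also drops trailing spaces after a version
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore a local version label such as "+cpu" or "+cu118" (assumption:
        # a GPU build of the same release should count as matching the pin).
        base = installed.split("+", 1)[0] if installed else None
        if base != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Illustrative input: two of the pinned lines from the file above.
example_pins = parse_pins(["torch==2.0.1 ", "transformers==4.34.1"])
for name, (pinned, installed) in check_pins(example_pins).items():
    print(f"{name}: pinned {pinned}, installed {installed or 'missing'}")
```

To check the whole file, pass an open file handle instead of the list, for example `parse_pins(open("requirements.txt"))`. Whether a name written with an underscore (such as `scikit_learn`) is found may depend on the Python version's name normalization, so treat a "missing" result for such a package as a prompt to check manually.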
diff --git a/src/BaselineClassifiersBinary.ipynb b/src/BaselineClassifiersBinary.ipynb
@@ -356,7 +356,6 @@
     "    print(lr_pipeline)\n",
     "    print(classification_report(y_test, predict_lr, digits=4))\n",
     "\n",
-    "    \n",
     "    return (\n",
     "        (\n",
     "            {\n",
@@ -396,7 +395,6 @@
     "    plt.show()\n",
     "\n",
     "\n",
-    "\n",
     "data2, nb_mideind, svc_mideind, lr_mideind = classify(ICELANDIC_MIDEIND_CSV)\n",
     "data3, nb_google, svc_google, lr_google = classify(ICELANDIC_GOOGLE_CSV)\n",
     "data1, nb_english, svc_english, lr_english = classify(ENGLISH_CSV)\n",
@@ -631,7 +629,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.12"
+   "version": "3.10.6"
   }
  },
  "nbformat": 4,

diff --git a/src/generate_classification_report.py b/src/generate_classification_report.py
@@ -110,11 +110,13 @@ def generate_report(self, accuracy):
                 prediction = torch.max(outputs.logits, dim=1)
                 y_true.extend(labels.tolist())
                 y_pred.extend(prediction.indices.tolist())
-        
+
         if accuracy:
             acc = accuracy_score(y_true, y_pred)
             return acc
-        return classification_report(y_true, y_pred, output_dict=True) # NOTE: can use this if you want to print classification report
+        return classification_report(
+            y_true, y_pred, output_dict=True
+        )  # NOTE: can use this if you want to print classification report
 
 
 class DataFrameLoader:
@@ -165,7 +167,6 @@ def generate_report(filename, folder, device):
     print("Loading model from folder {} using file {}".format(folder, filename))
     dfl = DataFrameLoader(filename)
     return call_model(dfl.X_all, dfl.y_all, folder, device)
-
 
 
 def eval_files():
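
The first hunk above converts logits into predicted class indices with `torch.max(outputs.logits, dim=1)` and then returns either an accuracy or a full classification report. As a dependency-free illustration of that flow (not the project's code, which relies on PyTorch and scikit-learn), the sketch below performs the same argmax-then-compare step in plain Python. The helpers `argmax_rows` and `accuracy` are hypothetical names, and the logits are made up for the example.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up logits for four reviews; columns are [negative, positive],
# matching the 0/1 label encoding used in generate_report.ipynb.
logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.4], [0.9, 0.8]]
labels = [0, 1, 0, 0]

y_pred = argmax_rows(logits)
print(y_pred)                    # [0, 1, 1, 0]
print(accuracy(labels, y_pred))  # 0.75
```

In the real module the same two lists (`y_true`, `y_pred`) are passed to scikit-learn's `accuracy_score` or `classification_report`, depending on the `accuracy` flag.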

diff --git a/src/generate_report.ipynb b/src/generate_report.ipynb
@@ -64,38 +64,48 @@
     "import random\n",
     "import generate_classification_report as gcr\n",
     "\n",
-    "PATH1 = ''\n",
-    "PATH2 = ''\n",
+    "PATH1 = \"\"\n",
+    "PATH2 = \"\"\n",
     "\n",
     "d1 = pd.read_csv(PATH1)\n",
     "d2 = pd.read_csv(PATH2)\n",
-    "d1.drop(['num', 'rating', 'id'], axis=1, inplace=True)\n",
-    "d2.drop(['movie', 'rating'], axis=1, inplace=True)\n",
+    "d1.drop([\"num\", \"rating\", \"id\"], axis=1, inplace=True)\n",
+    "d2.drop([\"movie\", \"rating\"], axis=1, inplace=True)\n",
     "\n",
     "\n",
-    "df_orig = pd.merge(d1, d2, how='outer')\n",
+    "df_orig = pd.merge(d1, d2, how=\"outer\")\n",
     "\n",
-    "device = 'cuda'\n",
-    "model = './electra-base-google-batch8-remove-noise-model/'\n",
+    "device = \"cuda\"\n",
+    "model = \"./electra-base-google-batch8-remove-noise-model/\"\n",
     "\n",
     "\n",
     "total = 0\n",
     "for i in range(0, 10):\n",
     "    r = random.randint(0, 10000)\n",
-    "    \n",
-    "    fifty_negative = df_orig.where(lambda x: x['sentiment'] == 'Negative').dropna().sample(n=50, random_state=r)\n",
-    "    fifty_positive = df_orig.where(lambda x: x['sentiment'] == 'Positive').dropna().sample(n=50, random_state=r)\n",
     "\n",
-    "    new_df = pd.merge(fifty_negative, fifty_positive, on=['sentiment', 'review'], how='outer')\n",
-    "    new_df.sentiment = new_df.sentiment.apply(lambda x: 1 if x == 'Positive' else 0)\n",
+    "    fifty_negative = (\n",
+    "        df_orig.where(lambda x: x[\"sentiment\"] == \"Negative\")\n",
+    "        .dropna()\n",
+    "        .sample(n=50, random_state=r)\n",
+    "    )\n",
+    "    fifty_positive = (\n",
+    "        df_orig.where(lambda x: x[\"sentiment\"] == \"Positive\")\n",
+    "        .dropna()\n",
+    "        .sample(n=50, random_state=r)\n",
+    "    )\n",
+    "\n",
+    "    new_df = pd.merge(\n",
+    "        fifty_negative, fifty_positive, on=[\"sentiment\", \"review\"], how=\"outer\"\n",
+    "    )\n",
+    "    new_df.sentiment = new_df.sentiment.apply(lambda x: 1 if x == \"Positive\" else 0)\n",
     "    X_all = new_df.review\n",
     "    y_all = new_df.sentiment\n",
     "    accuracy = gcr.call_model(X_all, y_all, model, device, accuracy=True)\n",
     "    total += accuracy\n",
-    "    print('acc: {0:.4f}, seed: {1}, i: {2}'.format(accuracy, r, i))\n",
+    "    print(\"acc: {0:.4f}, seed: {1}, i: {2}\".format(accuracy, r, i))\n",
+    "\n",
     "\n",
-    "    \n",
-    "print('Average accuracy: ', total/10)\n"
+    "print(\"Average accuracy: \", total / 10)"
    ]
   },
   {