fixing
birkirarndal committed Dec 14, 2023
1 parent 63b2ea5 commit 3ae91d4
Showing 5 changed files with 159 additions and 258 deletions.
77 changes: 59 additions & 18 deletions README.md
@@ -4,6 +4,39 @@

# Instructions

## Setting Up a Virtual Environment

To avoid conflicts with other projects or system-wide Python packages, it's recommended to set up a virtual environment for this project. Here's how to do it:

### Prerequisites

- Python 3.x (Ensure Python 3 is installed on your system.)

### Creating a Virtual Environment

1. **Navigate to Your Project Directory**:
Open a terminal or command prompt and navigate to the root directory of this project.

2. **Create a Virtual Environment**:
Run the following command to create a virtual environment named `env` (you can choose any name you prefer):

```
python -m venv env
```

This command creates a new directory `env` within your project where all dependencies will be installed.

3. **Activate the Virtual Environment**:

- On **Windows**, run:
```
.\env\Scripts\activate
```
- On **macOS or Linux**, run:
```
source env/bin/activate
```

## Installation of Dependencies

To run the scripts, you need to install the dependencies. Follow the steps below to set up your environment.
@@ -16,8 +49,10 @@ To run the scripts, you need to install the dependencies. Follow the steps below

1. Ensure Python 3.x is installed.
2. Install Requirements:
- `pip install -r requirements.txt`
3. Install PyTorch: The CPU version of PyTorch is already specified in the `requirements.txt` file of this project. It's recommended that you use the GPU version of PyTorch, visit the [PyTorch Get Started](https://pytorch.org/get-started/locally/) page, select your preferences, and run the provided installation command.
```
pip install -r requirements.txt
```
3. Install PyTorch: it's **recommended** to use the GPU (CUDA) version of PyTorch. Visit the [PyTorch Get Started](https://pytorch.org/get-started/locally/) page, select your preferences, and run the provided installation command.

## Machine-translate

@@ -41,9 +76,11 @@ This section provides instructions for using the machine translation scripts inc
1. Ensure the `"IMDB-Dataset.csv"` file is located in the `Datasets` directory.
2. Run the script:

- `python translate_google.py`
```
python src/translate_google.py
```

3. The script will process the data and output two files in the `Datasets` directory:
3. The script will translate the data and output two files in the `Datasets` directory:
- `IMDB-Dataset-GoogleTranslate.csv`: Contains translated reviews and sentiments.
- `failed-IMDB-Dataset-GoogleTranslate.csv`: Logs failed translation attempts.
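The translate-and-log-failures flow can be sketched roughly as follows. This is a minimal stdlib-only sketch, not the script's actual code: the real script calls the Google Translate API, and `translate` here is a hypothetical stand-in that simply reverses the text.

```python
import csv

def translate(text: str) -> str:
    # Hypothetical stand-in for the real translation call,
    # which can raise on network or API errors.
    if not text:
        raise ValueError("empty review")
    return text[::-1]  # placeholder "translation"

rows = [{"review": "great movie", "sentiment": "positive"},
        {"review": "", "sentiment": "negative"}]

translated, failed = [], []
for row in rows:
    try:
        translated.append({"review": translate(row["review"]),
                           "sentiment": row["sentiment"]})
    except Exception:
        failed.append(row)

# Write successes and failures to separate CSV files,
# mirroring the two output files described above.
for path, data in [("IMDB-Dataset-GoogleTranslate.csv", translated),
                   ("failed-IMDB-Dataset-GoogleTranslate.csv", failed)]:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["review", "sentiment"])
        writer.writeheader()
        writer.writerows(data)
```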

@@ -80,7 +117,9 @@ To use a different dataset:

1. Run the script:

- `python translate_mideind.py`
```
python src/translate_mideind.py
```

2. The script will process the data and output two files in the `Datasets` directory:
- `IMDB-Dataset-MideindTranslate.csv`: Contains translated reviews and sentiments.
@@ -115,7 +154,9 @@ This section provides instructions for using the `process.py` script, which perf
#### Usage

1. Run the script:
- python process.py
```
python src/process.py
```
2. When prompted, select the `icetagger.bat` file located in the extracted IceNLP directory (`IceNLP-1.5.0\IceNLP\bat\icetagger`).
3. Ensure the dataset file (`IMDB-Dataset-MideindTranslate.csv`) is located in the `Datasets` directory relative to the script.
4. The script will process the dataset and output the processed data to `Datasets/IMDB-Dataset-MideindTranslate-processed-nefnir.csv`.
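Under the hood, the script hands text to the external IceNLP tagger you selected. The call can be sketched roughly like this — an assumption about the mechanism, with `run_icetagger` a hypothetical helper rather than code from this repo:

```python
import subprocess

def run_icetagger(icetagger_path: str, text: str) -> str:
    # Pipe the text through the external tagger executable/script
    # (e.g. the selected icetagger.bat) and capture its output.
    result = subprocess.run([icetagger_path], input=text,
                            capture_output=True, text=True, check=True)
    return result.stdout
```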
@@ -144,13 +185,17 @@ This section provides instructions for using the `process_eng.py` script, which
#### Installation

1. Download necessary NLTK data:
- python -m nltk.downloader punkt stopwords wordnet
```
python -m nltk.downloader punkt stopwords wordnet
```

#### Usage

1. Ensure the dataset file (`IMDB-Dataset.csv`) is located in the `Datasets` directory relative to the script.
1. Ensure the dataset file (`IMDB-Dataset.csv`) is located in the `Datasets` directory.
2. Run the script:
- python process_eng.py
```
python src/process_eng.py
```
3. The script will process the dataset and output the processed data to `Datasets/IMDB-Dataset-Processed.csv`.
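The preprocessing performed by the script can be sketched in plain Python. Note this is a simplified illustration: the real script uses NLTK (punkt, stopwords, wordnet), while the tiny stopword set and `preprocess` helper below are stand-ins, and lemmatization is omitted.

```python
import re
import string

# A tiny stopword list stands in for NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "it"}

def preprocess(review: str) -> str:
    text = review.lower()
    # Strip the HTML line breaks common in IMDB reviews.
    text = re.sub(r"<br\s*/?>", " ", text)
    # Remove punctuation, then drop stopwords.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("The movie is <br /> GREAT, and worth it!"))
```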

#### Custom Dataset
@@ -175,7 +220,7 @@ This section provides instructions for using the `BaselineClassifiersBinary.ipyn

### Usage

Go into `BaselineClassifiersBinary.ipynb` and run the cells. You have to change the `ICELANDIC_GOOGLE_CSV`, `ICELANDIC_MIDEIND_CSV` and `ENGLISH_CSV` variables to point to the correct datasets. The cell will train and print out the classification reports for each model. It will also show a diagram. You can refer to the next cell if you want to print out the most important features, altough this is not necessary.
Open `BaselineClassifiersBinary.ipynb` and run the cells. You have to change the `ICELANDIC_GOOGLE_CSV`, `ICELANDIC_MIDEIND_CSV` and `ENGLISH_CSV` variables to point to the correct datasets. The cell trains each model, prints its classification report, and shows a diagram. The next cell can optionally print out the most important features, although this is not necessary.

## Transformer Models

@@ -199,7 +244,9 @@ This section provides instructions for using the `train.py` script, which trains
1. Place the dataset file (default: `"IMDB-Dataset-GoogleTranslate.csv"`) in the `Datasets` directory relative to the script.
2. Modify the script if you want to use a different pre-trained model or dataset.
3. Run the script:
- python train.py
```
python src/train.py
```
4. The script will train the model using the specified dataset and save the trained model and tokenizer in the `Models` directory.
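The data-preparation step of training (label encoding and a train/test split) can be sketched with the standard library alone. This is an illustrative miniature, not the script's actual code — the real script reads the CSV with pandas and fine-tunes a Hugging Face transformer:

```python
import random

# Hypothetical miniature dataset; the real script reads
# Datasets/IMDB-Dataset-GoogleTranslate.csv.
data = [("frábær mynd", "positive"), ("leiðinleg mynd", "negative"),
        ("mjög góð", "positive"), ("alls ekki góð", "negative")]

LABELS = {"negative": 0, "positive": 1}

random.seed(42)  # fixed seed for a reproducible split
random.shuffle(data)

split = int(0.8 * len(data))
train, test = data[:split], data[split:]

X_train = [text for text, _ in train]
y_train = [LABELS[label] for _, label in train]
print(len(train), len(test))
```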

### Custom Dataset
@@ -212,18 +259,12 @@ To use a different dataset:

## Generating Classification Reports

This section provides instructions for using the `generate_report.py` script, which generates a classification report for a trained model.
This section provides instructions for using the `generate_report.ipynb` notebook, which generates a classification report for a trained model.
This is useful mostly for the transformer models, as the baseline classifiers generate their own reports via the same libraries.

The function loads the model and generates a classification report for it. It expects the path to the model folder, the device to use, the pandas columns to use as X and y, and a flag indicating whether to return the accuracy or the full classification report.
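Since the helper's exact signature isn't shown here, the accuracy-versus-report switch it describes can be sketched in plain Python. This is a hypothetical illustration of the behaviour, not the repo's actual function:

```python
from collections import Counter

def report_or_accuracy(y_true, y_pred, return_accuracy: bool):
    # Either return a plain accuracy score, or a per-class recall
    # summary standing in for a full classification report.
    if return_accuracy:
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return correct / len(y_true)
    counts = Counter(zip(y_true, y_pred))
    return {label: counts[(label, label)] /
                   sum(v for (t, _), v in counts.items() if t == label)
            for label in set(y_true)}

print(report_or_accuracy([1, 0, 1, 1], [1, 0, 0, 1], True))
```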

### Installation

1. Ensure Python 3.x is installed.
2. Install the required Python packages:
- `pip install transformers torch pandas scikit-learn`

### Usage

1. Import `generate_classification_report.py`: `import generate_classification_report as gcr`
3 changes: 0 additions & 3 deletions requirements.txt
@@ -6,7 +6,4 @@ numpy==1.24.2
pandas==1.5.3
scikit_learn==1.2.2
tokenizer==3.4.3
torch==2.0.1
torchvision==0.15.2
torchaudio==2.0.2
transformers==4.34.1
328 changes: 94 additions & 234 deletions src/BaselineClassifiersBinary.ipynb

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions src/generate_report.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 16,
"metadata": {},
"outputs": [
{
@@ -76,7 +76,7 @@
"df_orig = pd.merge(d1, d2, how=\"outer\")\n",
"\n",
"device = \"cuda\"\n",
"model = \"./electra-base-google-batch8-remove-noise-model/\"\n",
"model = \"../Models/electra-base-google-batch8-remove-noise-model/\"\n",
"\n",
"\n",
"total = 0\n",
@@ -130,7 +130,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.10.6"
},
"orig_nbformat": 4
},
3 changes: 3 additions & 0 deletions src/train.py
@@ -21,7 +21,10 @@
LEARNING_RATE = 1e-6
BATCH_SIZE = 8

# model_name = "roberta-base"
model_name = "mideind/IceBERT"
# model_name = "jonfd/electra-base-igc-is"

model_save_dir = "Icebert-google-batch8-test-model"

np.random.seed(RANDOM_SEED)
