Commit

Add more instructions in the README
bryant1410 committed Jun 18, 2023
1 parent a702dd4 commit 1e760ee
Showing 4 changed files with 51 additions and 14 deletions.
60 changes: 48 additions & 12 deletions README.md
@@ -18,24 +18,60 @@ models.
## Obtained Results

Under [results/](results) you can find the detailed results obtained with our method for the 3 different scores tested
(read the paper for more information). They come from the output of running the code in this repository (see below to
reproduce it).

## Reproducing the Results

### Setup
1. With Python >= 3.8, run the following commands:

   ```bash
   pip install -r requirements.txt
   python -c "import nltk; nltk.download(['omw-1.4', 'wordnet'])"
   spacy download en_core_web_trf
   huggingface-cli login
   mkdir data
   ```
2. Compute the CLIP scores for each image-sentence pair and save them to a CSV file. For this step, we used [a Google
   Colab](https://colab.research.google.com/drive/1I10mjHD-_brEtaKdHqhvkHjRFjVzK1hl?usp=sharing). You can see the
   results in [this Google Sheet](https://docs.google.com/spreadsheets/d/1TPYLRk_f6zMm7pYy8xLPeeS6EsybSCoxjnONOrg40vA/edit?usp=sharing).
   This file is available in a CSV format at [TODO](todo). Place it under `data/svo_probes_with_scores.csv`.
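   If you prefer to script this step rather than use the Colab, a minimal sketch with Hugging Face `transformers`
   could look as follows; the checkpoint name is an assumption (the notebook linked above is the authoritative
   version):

   ```python
   import torch
   from PIL import Image
   from transformers import CLIPModel, CLIPProcessor

   # Assumed checkpoint; check the Colab above for the variant actually used.
   model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
   processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


   def clip_score(image_path: str, sentence: str) -> float:
       image = Image.open(image_path)
       inputs = processor(text=[sentence], images=image, return_tensors="pt")
       with torch.no_grad():
           # The image-text logit is CLIP's temperature-scaled cosine similarity.
           return model(**inputs).logits_per_image.item()
   ```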
3. Compute a CSV file that contains the negative sentences for each of the negative triplets. We lost the script for
   this step, but it takes the previous CSV file as input and, for each negative triplet, takes the sentence of the
   row with the same triplet in the `pos_triplet` column (you can use
   [the original SVO-Probes file](https://github.com/deepmind/svo_probes/blob/main/svo_probes.csv) if there are
   missing sentences). This file should have the columns `sentence` and `neg_sentence`, in the same order as the
   `sentence` column from the previous CSV file. We provide this file already processed at [TODO](todo).
   Place it under `data/neg_d.csv`.
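   Since the original script was lost, the following is a rough sketch of how it could be reconstructed with pandas;
   the `neg_triplet` column name is an assumption based on the description above:

   ```python
   import pandas as pd

   df = pd.read_csv("data/svo_probes_with_scores.csv")

   # Map each positive triplet to its sentence (first occurrence wins).
   triplet_to_sentence = df.drop_duplicates("pos_triplet").set_index("pos_triplet")["sentence"]

   # A row's negative sentence is the sentence whose positive triplet equals this row's negative triplet.
   neg_df = pd.DataFrame({
       "sentence": df["sentence"],
       "neg_sentence": df["neg_triplet"].map(triplet_to_sentence),
   })

   neg_df.to_csv("data/neg_d.csv", index=False)
   ```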
4. Merge the information from these two files:

   ```bash
   ./merge_csvs_and_filter.py > data/merged.csv
   ```

   We provide the output of this script at [TODO](todo).
5. Compute word frequencies in a 10M-caption subset of [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/):

   ```bash
   ./compute_word_frequencies.py > data/words_counter_LAION.json
   ```

   We provide the output of this script at [TODO](todo).
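   The core of `compute_word_frequencies.py` appears further down in this diff; the missing piece,
   `load_laion_texts`, could look roughly like this, assuming you downloaded LAION-400M's first metadata Parquet
   file beforehand (the file name here is hypothetical, and the caption column name may differ by release):

   ```python
   from __future__ import annotations

   from collections.abc import Iterable

   import pandas as pd


   def load_laion_texts() -> Iterable[str]:
       # Hypothetical local path to LAION-400M's first metadata Parquet file.
       df = pd.read_parquet("data/laion400m_part_00000.parquet")
       # The caption column is named "TEXT" in the LAION-400M metadata files we have seen; adjust if yours differs.
       return df["TEXT"].dropna()
   ```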
6. TODO: how to obtain Levin.
7. TODO: how to obtain LIWC.
8. TODO: how to obtain the General Inquirer.
9. Run the following:

   ```bash
   ./main.py --dependent-variable-name pos_clip_score --no-neg-features > results/pos_scores.txt
   ./main.py --dependent-variable-name neg_clip_score > results/neg_scores.txt
   ./main.py --dependent-variable-name clip_score_diff > results/score_diff.txt
   ```

   We provide these files already under [results/](results). Run `./main.py --help` to see all the available options.
   We also recommend looking at the code to see what it does. Note that this repository includes code for preliminary
   experiments that we didn't report in the paper (for clarity) but include here in case it's useful.

## Citation
1 change: 1 addition & 0 deletions compute_clip_scores.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python
"""Script to compute CLIP scores for other datasets. Not used for the paper results."""
from __future__ import annotations

import argparse
2 changes: 1 addition & 1 deletion compute_word_frequencies.py
@@ -16,7 +16,7 @@ def load_laion_texts() -> Iterable[str]:


def main() -> None:
    max_count = 10_000_000  # This is a reasonable number, and it's also roughly what LAION's 1st Parquet file has.
    word_counts = Counter(word
                          for text in tqdm(itertools.islice(load_laion_texts(), max_count), total=max_count)
                          for word in text.split())
2 changes: 1 addition & 1 deletion merge_csvs_and_filter.py
@@ -8,7 +8,7 @@

def parse_args() -> argparse.Namespace:
    parser = ArgumentParserWithDefaults()
    parser.add_argument("--probes-path", default="data/svo_probes_with_scores.csv")
    parser.add_argument("--neg-path", default="data/neg_d.csv")
    return parser.parse_args()

