The MS-MARCO dataset consists of queries (sampled from real Bing user queries), relevant passages, and answers. The queries' answer types are taxonomized into: description, numeric, entity, location, and person. I focus on the description queries since they are the most relevant to exploratory knowledge seeking, the application where I see Cross-Lingual IR (CLIR) as most useful.
- Setup: retrieve the correct answer(s) for each query.
- Downloading the datasets:
python download_datasets.py
- Running the zero-shot xlm-roberta baseline (a sketch of this baseline follows the setup below):
python run_baseline.py
- Setup: questions in Arabic, retrieve paragraphs in any of the languages.
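For reference, a minimal sketch of what a zero-shot XLM-RoBERTa retrieval baseline like `run_baseline.py` might do; the checkpoint, mean pooling, and cosine ranking below are assumptions, not necessarily what the script implements:

```python
# Sketch of zero-shot retrieval: mean-pooled XLM-R embeddings, cosine ranking.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base").eval()

def embed(texts, batch_size=32):
    vecs = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, max_length=256, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state      # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)      # mean over real tokens
        vecs.append(torch.nn.functional.normalize(pooled, dim=-1))
    return torch.cat(vecs)

def retrieve(queries, passages, k=5):
    q, p = embed(queries), embed(passages)
    scores = q @ p.T               # cosine similarity (unit vectors)
    return scores.topk(k, dim=-1)  # top-k passage indices per query
```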
- Finding: language bias indeed exists; the retrieved paragraphs are almost always in Arabic, and non-Arabic paragraphs are rarely retrieved.
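One simple way to quantify this bias (a hypothetical helper, not part of the repo) is the fraction of retrieved passages whose language matches the query's, e.g. using the `langdetect` package:

```python
# Hypothetical helper: fraction of retrieved passages whose detected
# language matches the query language (here Arabic).
from langdetect import detect  # pip install langdetect

def language_match_rate(retrieved_texts, query_lang="ar"):
    detected = [detect(t) for t in retrieved_texts]
    return sum(lang == query_lang for lang in detected) / len(detected)
```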
- Make sure you have access to the following dataset on HF: kaust-generative-ai/fineweb-edu-ar
- Generate an access token through: https://huggingface.co/settings/tokens
- Download the parallel Arabic/English FineWeb-Edu paragraphs after setting the access token as an environment variable:
export HF_TOKEN="REPLACE_THIS_WITH_YOUR_ACCESS_TOKEN!"
wget --header="Authorization: Bearer ${HF_TOKEN}" -c "https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar/resolve/main/ar/train/chunk_0_0.zip" -O "chunk_0_0_ar.zip"
wget --header="Authorization: Bearer ${HF_TOKEN}" -c "https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar/resolve/main/en/train/chunk_0_0.zip" -O "chunk_0_0_en.zip"
DATA_DIR="data/fineweb_edu"
mkdir -p ${DATA_DIR}
mv chunk_0_0_{ar,en}.zip ${DATA_DIR}
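`fineweb_edu_retrieval.py` may unpack the archives itself; if not, a minimal (hypothetical) way to extract them:

```python
# Hypothetical: extract the downloaded chunks into per-language folders.
import pathlib
import zipfile

data_dir = pathlib.Path("data/fineweb_edu")
for lang in ("ar", "en"):
    with zipfile.ZipFile(data_dir / f"chunk_0_0_{lang}.zip") as zf:
        zf.extractall(data_dir / lang)
```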
- Run the zero-shot xlm-roberta baseline:
python fineweb_edu_retrieval.py
- Setup: questions in Arabic or English, retrieve fineweb_edu documents in either of the two languages.
- Findings:
  - The similarity scores are really high.
    - This might be an artifact of the long paragraphs.
    - TODO: split paragraphs into smaller segments (initially at random cut points?); see the sketch after this list.
    - TODO: how to align the parallel Arabic/English documents if such segmentation is done?
  - Language bias still exists!
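A possible starting point for the segmentation TODO; this is purely a sketch, and the chunk size and jitter values are arbitrary choices:

```python
# Hypothetical segmentation: split a paragraph into word-level chunks of
# roughly `target` words, with random jitter so cut points are not fixed.
import random

def split_paragraph(text, target=64, jitter=16, seed=None):
    rng = random.Random(seed)
    words = text.split()
    segments, i = [], 0
    while i < len(words):
        step = max(1, target + rng.randint(-jitter, jitter))
        segments.append(" ".join(words[i:i + step]))
        i += step
    return segments
```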
- Preliminary result (qualitative analysis): naive alignment of the parallel documents' embeddings using contrastive learning on a small batch of data deteriorates the quality of the retrieved documents.
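For concreteness, a minimal sketch of what such naive contrastive alignment could look like; the symmetric InfoNCE loss with in-batch negatives and the temperature value are assumptions about the approach, not a description of the actual code:

```python
# Sketch of naive contrastive alignment: each Arabic/English parallel pair
# is a positive; all other in-batch pairs act as negatives (InfoNCE).
import torch
import torch.nn.functional as F

def contrastive_loss(ar_emb, en_emb, temperature=0.05):
    ar = F.normalize(ar_emb, dim=-1)
    en = F.normalize(en_emb, dim=-1)
    logits = ar @ en.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(ar))    # i-th Arabic matches i-th English
    # symmetric loss over both retrieval directions
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```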