This repository contains the code accompanying the resource article *Embedding the Web: An Open Billion-Scale Repository for Knowledge Graph Embeddings*.

The Embedding the Web codebase comprises two main sub-modules:
- Crawler: Responsible for downloading and preprocessing the raw RDF data dumps from the WebDataCommons structured data corpus.
- Training: Handles job scheduling, batch creation, and execution of embedding training (via Slurm on a cluster or locally with `dicee`).
## Repository Structure

```
WHALE-embeddings/
├── Crawler/
│   ├── download_data.sh          # Download raw .nq.gz files into data/raw
│   ├── domain_extraction.py      # Split triples by domain into data/domain_dataset
│   └── run_sed.sh                # Clean domain_dataset files via sed regex
├── Training/
│   ├── schedule_jobs.sh          # Scheduler: create batches and submit or run jobs
│   └── run_embeddings_array.sh   # Job script: run dicee or sbatch array tasks
├── pipeline.sh                   # Top-level orchestration: download, extract, schedule
├── file.list_sample              # Sample URL list fallback
├── .gitignore
└── README.md                     # (this file)
```
## Components

### Crawler

- `download_data.sh`
  - Reads `file.list` (or `file.list_sample` if it is missing) for the list of HTTP URLs to download.
  - Extracts the metadata tag from each URL (e.g. `html-embedded-jsonld`) and organizes downloads into `data/raw/<metadata>/` (see the sketch below).
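A minimal sketch of the per-URL logic, assuming the metadata tag appears in each dump's file name (as in the WebDataCommons listings); the actual handling, naming, and error checks live in `Crawler/download_data.sh`:

```bash
#!/usr/bin/env bash
# Sketch only -- not the real download_data.sh.
# Assumes a tag such as "html-embedded-jsonld" is embedded in each file name.
list=file.list
[ -f "$list" ] || list=file.list_sample

while read -r url; do
  [ -z "$url" ] && continue
  # Pull the metadata tag (e.g. html-embedded-jsonld, html-microdata) out of the name.
  meta=$(basename "$url" | grep -oE 'html-[a-z-]+' | head -n 1)
  meta=${meta:-unknown}
  mkdir -p "data/raw/$meta"
  wget -q -c -P "data/raw/$meta" "$url"
done < "$list"
```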
- `domain_extraction.py`
  - Walks each `data/raw/<metadata>/` folder.
  - Reads every `.gz` file, buckets triples by base-URL domain, and writes per-domain `.txt` files into `data/domain_dataset/<metadata>/` (an illustrative equivalent follows below).
  - Logs triple counts into `data/domain_logs/<metadata>.csv`.
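For illustration, a shell equivalent of one bucketing pass over a single part file (the real implementation is the Python script above). Keying on the host of the last URI of each N-Quad, i.e. the source page URL, and the part file name are assumptions made for this sketch:

```bash
# Illustrative only -- the real logic is in Crawler/domain_extraction.py.
mkdir -p data/domain_dataset/html-embedded-jsonld
zcat data/raw/html-embedded-jsonld/part-00000.nq.gz | awk '
{
  n = split($0, t, "<")
  url = t[n]; sub(/>.*/, "", url)                             # last URI on the line
  host = url
  sub(/^[a-zA-Z]+:\/\//, "", host); sub(/\/.*/, "", host)     # keep only the host part
  if (host != "") {
    f = "data/domain_dataset/html-embedded-jsonld/" host ".txt"
    print $0 >> f
    close(f)                                                  # avoid exhausting file handles
  }
}'
```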
- `run_sed.sh`
  - Iterates over all subfolders under `data/domain_dataset/`.
  - Applies a `sed` regex to each file, replacing blank-node prefixes (`_:id`) with full resource URIs (example below).
  - Shows an in-console progress bar per folder.
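The per-file command is of this general shape; the URI prefix shown here is a placeholder, and the actual pattern and replacement are defined in `Crawler/run_sed.sh`:

```bash
# Example only: rewrite blank-node labels (_:id) into full resource URIs.
# http://example.org/resource/ is a placeholder prefix.
sed -i -E 's|_:([A-Za-z0-9]+)|<http://example.org/resource/\1>|g' \
    data/domain_dataset/html-embedded-jsonld/example.com.txt
```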
### Training

- `schedule_jobs.sh`
  - Loops over all `data/domain_dataset/<dataset>/` folders.
  - Creates size-sorted batch files under `data/batch_files_<dataset>/`.
  - On a cluster: submits Slurm array jobs with `sbatch`, exporting `batch_file`, `dataset`, and `metafolder`, plus resource directives (`--cpus-per-task`, `--mem`, `--output`, `--error`); see the sketch below.
  - Locally: iterates over each batch file and invokes `dicee` directly for each KG file.
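A sketch of one scheduling round on a cluster. The dataset name, chunk size, array range, resource values, and the use of the dataset name as `metafolder` are placeholders; the real values are set in `Training/schedule_jobs.sh`:

```bash
# Build size-sorted batch files (placeholder: 100 KG paths per batch file).
dataset=html-embedded-jsonld
mkdir -p "data/batch_files_${dataset}" logs
find "data/domain_dataset/${dataset}" -name '*.txt' -printf '%s %p\n' \
  | sort -n | cut -d' ' -f2- \
  | split -l 100 - "data/batch_files_${dataset}/batch_"

# Submit one Slurm array job per batch file (resource values are placeholders).
for batch in data/batch_files_${dataset}/batch_*; do
  sbatch --array=0-9 --cpus-per-task=8 --mem=32G \
         --output=logs/%A_%a.out --error=logs/%A_%a.err \
         --export=ALL,batch_file="$batch",dataset="$dataset",metafolder="$dataset" \
         Training/run_embeddings_array.sh
done
```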
- `run_embeddings_array.sh`
  - A Slurm array job script that:
    - Activates the `dice` conda environment.
    - Reads the appropriate slice of `batch_file` (10 lines per task).
    - Runs `dicee` on each data path, logging failures.
    - Archives results from `/dev/shm` into `embeddings/<dataset>/models/<metafolder>_<job>_<task>/`.
  - A sketch of the per-task work follows below.
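The per-task slice could look roughly like this. The `dicee` flags are illustrative (check `dicee --help` for the arguments of your installed version), and the result-moving step is simplified; the real logic is in the job script:

```bash
# Sketch of the per-task work: each array task takes 10 consecutive lines of
# $batch_file and trains one embedding per KG path.
conda activate dice                          # assumes conda is initialised in this shell
start=$(( SLURM_ARRAY_TASK_ID * 10 + 1 ))
end=$(( start + 9 ))

sed -n "${start},${end}p" "$batch_file" | while read -r kg_path; do
  # Illustrative dicee invocation; the actual flags are set in run_embeddings_array.sh.
  dicee --path_single_kg "$kg_path" --storage_path /dev/shm \
    || echo "$kg_path" >> "failed_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.log"
done

# Crude sketch of the archiving step: move results out of node-local /dev/shm.
outdir="embeddings/${dataset}/models/${metafolder}_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
mkdir -p "$outdir"
mv /dev/shm/* "$outdir"/
```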
## Prerequisites

- Git
- Conda (Miniforge or Anaconda)
- `wget`, `grep`, `gzip`, `sed`, `tar` (standard Unix tools)
- On-cluster: Slurm (`sbatch`, `squeue`)
## Installation

```bash
git clone https://github.com/dice-group/WHALE-embeddings.git
cd WHALE-embeddings
chmod +x Crawler/*.sh Training/*.sh pipeline.sh
```
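The training scripts activate a conda environment named `dice` with `dicee` installed. One way to set it up (the Python version here is a suggestion, not a pinned requirement):

```bash
conda create -n dice python=3.10 -y
conda activate dice
pip install dicee
```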
## Usage

Fetch the full list of structured-data dumps (or use the provided sample):

```bash
wget -q -O file.list http://webdatacommons.org/structureddata/2023-12/files/file.list
```

If you skip this step, `pipeline.sh` falls back to `file.list_sample`. Then run the full pipeline:

```bash
./pipeline.sh
```
The pipeline downloads the raw data, extracts domain-specific datasets, and then schedules the embedding training jobs (or runs them locally). Outputs and logs are stored under `embeddings/`.
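Conceptually, `pipeline.sh` chains the stages in roughly this order (a simplification: the bare invocations below are an assumption, and the actual script handles arguments, logging, and error checking):

```bash
./Crawler/download_data.sh           # 1. fetch raw .nq.gz dumps into data/raw/
python Crawler/domain_extraction.py  # 2. split triples by domain into data/domain_dataset/
./Crawler/run_sed.sh                 # 3. rewrite blank-node prefixes
./Training/schedule_jobs.sh          # 4. create batches and submit or run training jobs
```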