[v2] Remove MultilingualTask #1832

Open
Tracked by #1791
Samoed opened this issue Jan 18, 2025 · 1 comment

Comments

@Samoed
Collaborator

Samoed commented Jan 18, 2025

I think we can integrate the MultilingualTask class into the AbsTask class. The most useful function is already there, and the per-subset loading can be easily integrated into load_data:

    self.dataset = {}
    for lang in self.hf_subsets:
        self.dataset[lang] = datasets.load_dataset(
            name=lang,
            **self.metadata.dataset,
        )
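
As a rough illustration of what a single load_data covering both cases might look like, here is a minimal sketch. `SketchTask`, `dataset_kwargs`, `loader` and `fake_loader` are hypothetical names made up for this example and are not part of mteb; in real use the loader would be `datasets.load_dataset`, and it is injected here only so the sketch runs without network access.

```python
from typing import Any, Callable, Optional


class SketchTask:
    """Hypothetical stand-in for AbsTask; not the real mteb class."""

    def __init__(
        self,
        dataset_kwargs: dict,
        loader: Callable[..., Any],
        hf_subsets: Optional[list] = None,
    ) -> None:
        self.dataset_kwargs = dataset_kwargs  # plays the role of self.metadata.dataset
        self.loader = loader  # assumption: datasets.load_dataset in practice
        self.hf_subsets = hf_subsets
        self.dataset: Any = None
        self.data_loaded = False

    def load_data(self) -> None:
        if self.data_loaded:
            return
        if self.hf_subsets:
            # Multilingual case: one loader call per subset, keyed by language.
            self.dataset = {
                lang: self.loader(name=lang, **self.dataset_kwargs)
                for lang in self.hf_subsets
            }
        else:
            # Monolingual case: a single call for the whole dataset.
            self.dataset = self.loader(**self.dataset_kwargs)
        self.data_loaded = True


def fake_loader(**kwargs: Any) -> dict:
    """Test double that simply echoes the arguments it received."""
    return dict(kwargs)


mono = SketchTask({"path": "org/mono"}, fake_loader)
mono.load_data()

multi = SketchTask({"path": "org/multi"}, fake_loader, hf_subsets=["en", "de"])
multi.load_data()
```

With this shape, the only difference between the two kinds of task is whether hf_subsets is set, so a separate MultilingualTask class would no longer be needed for loading.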

Additionally, we should reupload the datasets that use fast_loading, as using parquet files will make them much faster to load:

    self.dataset = {}
    merged_dataset = datasets.load_dataset(
        **self.metadata.dataset
    )  # load "default" subset
    for split in merged_dataset.keys():
        df_split = merged_dataset[split].to_polars()
        df_grouped = dict(df_split.group_by(["lang"]))
        for lang in set(df_split["lang"].unique()) & set(self.hf_subsets):
            self.dataset.setdefault(lang, {})
            self.dataset[lang][split] = datasets.Dataset.from_polars(
                df_grouped[(lang,)].drop("lang")
            )  # Remove lang column and convert back to HF datasets, not strictly necessary but better for compatibility
    for lang, subset in self.dataset.items():
        self.dataset[lang] = datasets.DatasetDict(subset)
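
To make the resulting structure easier to see, here is a minimal pure-Python sketch of the same regrouping, with plain lists of dicts standing in for the polars frames and HF datasets. `split_by_lang` and the sample rows are hypothetical and exist only for illustration; the sketch assumes, as the snippet above does, that every row carries a "lang" column.

```python
from collections import defaultdict


def split_by_lang(merged: dict, hf_subsets: list, lang_column: str = "lang") -> dict:
    """Regroup {split: rows} into {lang: {split: rows}}, keeping only wanted languages."""
    wanted = set(hf_subsets)
    per_lang: dict = defaultdict(dict)
    for split, rows in merged.items():
        grouped = defaultdict(list)
        for row in rows:
            lang = row[lang_column]
            if lang in wanted:
                # Drop the language column, mirroring .drop("lang") above.
                grouped[lang].append(
                    {key: value for key, value in row.items() if key != lang_column}
                )
        for lang, lang_rows in grouped.items():
            per_lang[lang][split] = lang_rows
    return dict(per_lang)


# Made-up sample data: "fr" is present in the data but not requested.
merged = {
    "train": [
        {"lang": "en", "text": "hello"},
        {"lang": "de", "text": "hallo"},
        {"lang": "fr", "text": "bonjour"},
    ],
    "test": [{"lang": "en", "text": "bye"}],
}
result = split_by_lang(merged, ["en", "de"])
```

Note that a language only gets an entry for the splits in which it actually appears, so "de" has no "test" key here. The snippet above behaves the same way because it intersects hf_subsets with the languages found in each split.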

@KennethEnevoldsen
Contributor

completely agree with this
