[WIP] refactor of dataset builder and executor #537

Open: 87 commits to be merged into base branch main
Commits
d11f89c
ignore __dj__produced_data__
cyruszhang Nov 15, 2024
41dea26
add download framework; add wiki support
cyruszhang Nov 19, 2024
50f8d3d
refactor formatter; add dataset_builder
cyruszhang Nov 22, 2024
817caab
merge with master
cyruszhang Nov 25, 2024
a089de4
add config files and test entry
cyruszhang Nov 26, 2024
5a717d7
initial dataset_builder
cyruszhang Dec 2, 2024
9c79844
Merge branch 'main' into feat/cyruszhang/data-downloader
cyruszhang Dec 2, 2024
ffba7e7
add mixture dataset support; type/subtype
cyruszhang Dec 4, 2024
79ae980
RayExecutor with ExecutorBase
cyruszhang Dec 4, 2024
e6a6e71
get rid of subtype for local dataset; depending on ext for proper rou…
cyruszhang Dec 4, 2024
eb300f0
use source instead of sub_type for remote dataset configs
cyruszhang Dec 4, 2024
456eea1
arxiv downloader return Dataset instead of DJDataset
cyruszhang Dec 4, 2024
c25e40f
rewrite CLI datapath with test cases
cyruszhang Dec 5, 2024
75ffe3f
add executor and dataload strategy logic
cyruszhang Dec 6, 2024
4ec1ef9
Merge branch 'main' into feat/cyruszhang/data-downloader
cyruszhang Dec 6, 2024
4fb6e17
add layered load strategies
cyruszhang Dec 6, 2024
84803cd
Merge branch 'main' into feat/cyruszhang/data-downloader
cyruszhang Dec 9, 2024
cb5b80a
fix circular dependency; add dataset config test
cyruszhang Dec 10, 2024
daf7a85
update dataset_path parsing in config
cyruszhang Dec 10, 2024
7c48892
fix download test case; add wildcard matching for load strategy
cyruszhang Dec 11, 2024
940b44d
add test case for load strategy wild card matching
cyruszhang Dec 11, 2024
b80f991
add more test cases for datapath rewrite logic; fix rewrite to handle…
cyruszhang Dec 11, 2024
0d5d4ba
materialize symlinks for duplicates
cyruszhang Dec 11, 2024
f3a4ec4
add load strategy validation framework
cyruszhang Dec 12, 2024
70fffd2
add DataValidator logic
cyruszhang Dec 16, 2024
bbc303d
data validator as separate pre-processing
cyruszhang Dec 16, 2024
4b6065f
update data validator logic and add/fix test cases
cyruszhang Dec 25, 2024
0b153ab
[nit] rename test
cyruszhang Jan 2, 2025
171b361
[nit] rename test again
cyruszhang Jan 2, 2025
6841d19
add builder test cases; update ds config validation logic
cyruszhang Jan 2, 2025
3128d05
[minor] update test case naming
cyruszhang Jan 2, 2025
7b6b2bd
add support for max_sample_num in dataset configs; add tests
cyruszhang Jan 6, 2025
161f059
fix test cases and update dataset builder code
cyruszhang Jan 6, 2025
8cb322f
merge main
cyruszhang Jan 6, 2025
afe906d
handle weights and sample_nums
cyruszhang Jan 8, 2025
1217e61
support ExecutorType enum
cyruszhang Jan 9, 2025
755abca
Merge branch 'main' into feat/cyruszhang/data-downloader
cyruszhang Jan 9, 2025
5dd17fe
flip on DatasetBuilder; replace formatter
cyruszhang Jan 9, 2025
eb3b123
minor fix
cyruszhang Jan 9, 2025
7c171fb
add ExecutorBase to RayExecutor
cyruszhang Jan 9, 2025
195aff8
Merge branch 'main' into feat/cyruszhang/data-downloader
cyruszhang Jan 21, 2025
dd95df0
fix bugs; use str for executor_type
cyruszhang Jan 23, 2025
530efa8
add add_same_content_to_new_column reference
cyruszhang Jan 23, 2025
3b726bd
ray data defaults to json
cyruszhang Jan 24, 2025
cac8e5e
fix dataset_path bug; add ray config test
cyruszhang Jan 24, 2025
a99c9b5
tests video on ray config
cyruszhang Jan 24, 2025
3c9caf5
add default cfg logic; fix data_mixture demo
cyruszhang Jan 24, 2025
b9f6a99
default executor + local data; fix analyzer bug
cyruszhang Jan 27, 2025
e05f146
Merge branch 'main' into feat/cyruszhang/data-downloader
cyruszhang Jan 27, 2025
acccc01
pass through num_proc param for ray executor when loading dataset
cyruszhang Jan 27, 2025
1823cd6
fix bugs for huggingface dataset loading; add sample config
cyruszhang Jan 27, 2025
2963118
fix typo in configs
cyruszhang Jan 29, 2025
4472aef
remove absolute path logic; remove dup test files
cyruszhang Feb 7, 2025
7964867
update .gitignore for dup files in tests
cyruszhang Feb 7, 2025
96207ba
fix RayDataset schema validation issue
cyruszhang Feb 7, 2025
9b1d738
fix wiki downloader tests
cyruszhang Feb 7, 2025
828e7ba
remove mixture formatter; logic captured in dataloader
cyruszhang Feb 7, 2025
4ffb3cf
remove unused mixture formatter
cyruszhang Feb 13, 2025
7c16b23
minor fixes for CR comments
cyruszhang Feb 13, 2025
f73dd41
resolve eager RayExecutor importing
cyruszhang Feb 13, 2025
8aae265
bugfix: handle missing configs
cyruszhang Feb 13, 2025
1d65a3a
add schema support for datasets
cyruszhang Feb 13, 2025
96a4997
bugfix: handle relative path problem in tests
cyruszhang Feb 14, 2025
2f49eec
fix test cases
cyruszhang Feb 14, 2025
643e7d7
add schema support for DJDataset; remove eager Ray imports; add data …
cyruszhang Feb 19, 2025
0412e36
revert relative path for demo multi-modal data
cyruszhang Feb 19, 2025
17e70cc
proper type mapping for HF and ray datasets; add test cases
cyruszhang Feb 23, 2025
2073660
add get method for DJDataset and tests
cyruszhang Feb 24, 2025
f3f5e13
add proper validators and test cases for SwiftMessage and DJ_conversa…
cyruszhang Feb 24, 2025
c5c4d0a
add validation demo; add validators config entry
cyruszhang Feb 24, 2025
4019fd7
fix test bug; _strategies is class variable and could cause dirty dat…
cyruszhang Feb 24, 2025
38204d0
add ray relative path resolution logic, for both config file and data…
cyruszhang Feb 26, 2025
f8849d7
merge master
cyruszhang Feb 26, 2025
2514b1f
revert to lazy loading
cyruszhang Feb 26, 2025
28a969a
remove debug loggins; remove home directory test due to docker issue
cyruszhang Feb 26, 2025
be2dacb
add dataset and validator into config_all.yaml
cyruszhang Feb 26, 2025
a8825cc
merge main
cyruszhang Feb 26, 2025
7d20af4
merge main
cyruszhang Feb 27, 2025
50a208b
add test case for dataset config priority
cyruszhang Feb 27, 2025
810da84
add more documentations
cyruszhang Feb 27, 2025
c1fe973
minor fixes per code review
cyruszhang Mar 3, 2025
80b8a78
minor fixes per code review
cyruszhang Mar 3, 2025
f863c98
fix DefaultExecutor and RayDataset imports
cyruszhang Mar 3, 2025
4a0fd87
merge main
cyruszhang Mar 3, 2025
09e3dde
Merge branch 'feat/cyruszhang/data-downloader' of https://github.com/…
cyruszhang Mar 3, 2025
fa7c82c
merge main
cyruszhang Mar 3, 2025
79fae34
revert test setup
cyruszhang Mar 3, 2025
arxiv downloader return Dataset instead of DJDataset
cyruszhang committed Dec 4, 2024
commit 456eea14e9f69638e9dfee525f6c408810fdcd97
4 changes: 2 additions & 2 deletions data_juicer/download/arxiv.py
@@ -5,10 +5,10 @@
 import tarfile
 import tempfile
 
+from datasets import Dataset
 from downloader import (DocumentDownloader, DocumentExtractor,
                         DocumentIterator, download_and_extract, get_arxiv_urls)
 
-from data_juicer.core.data import DJDataset
 from data_juicer.utils.file_utils import (expand_outdir_and_mkdir,
                                           get_all_files_paths_under)

@@ -355,7 +355,7 @@ def download_arxiv(
     keep_raw_download=False,
     force_download=False,
     url_limit=None,
-) -> DJDataset:
+) -> Dataset:
     """
     Downloads Arxiv tar files and extracts them