
Commit c543fce (parent 847b020), committed Sep 26, 2024

Updated README.md to communicate easier install options

2 files changed: 170 additions, 55 deletions


README.md

Lines changed: 124 additions & 28 deletions
# :lollipop: Epigenetics Dataloader for BigWig files

Fast batched dataloading of BigWig files containing epigenetic track data and corresponding sequences, powered by the GPU, for deep learning applications.

## Quickstart

### Installation with conda/mamba

Bigwig-loader mainly depends on the rapidsai kvikio library and cupy, both of which are best installed using conda/mamba. Bigwig-loader itself can now also be installed using conda/mamba. To create a new environment with bigwig-loader installed:

```shell
mamba create -n my-env -c rapidsai -c conda-forge -c bioconda -c dataloading bigwig-loader
```

Or add this to your environment.yml file:

```yaml
name: my-env
channels:
  - rapidsai
  - conda-forge
  - bioconda
  - dataloading
dependencies:
  - bigwig-loader
```

and update:

```shell
mamba env update -f environment.yml
```

### Installation with pip

Bigwig-loader can also be installed using pip in an environment which already has the rapidsai kvikio library and cupy installed:

```shell
pip install bigwig-loader
```
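If kvikio and cupy are not yet present, they can usually be installed into the environment first. A hedged sketch, assuming the `kvikio` package from the rapidsai channel and `cupy` from conda-forge (pick versions matching your CUDA toolkit):

```shell
# assumed prerequisite step before `pip install bigwig-loader`
mamba install -n my-env -c rapidsai -c conda-forge kvikio cupy
```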
### PyTorch Example

We wrapped the BigWigDataset in a PyTorch iterable dataset that you can use directly:

```python
# examples/pytorch_example.py
import pandas as pd
import torch
from torch.utils.data import DataLoader
from bigwig_loader import config
from bigwig_loader.pytorch import PytorchBigWigDataset
from bigwig_loader.download_example_data import download_example_data

# Download example data to play with
download_example_data()
example_bigwigs_directory = config.bigwig_dir
reference_genome_file = config.reference_genome

train_regions = pd.DataFrame({"chrom": ["chr1", "chr2"], "start": [0, 0], "end": [1000000, 1000000]})

dataset = PytorchBigWigDataset(
    regions_of_interest=train_regions,
    collection=example_bigwigs_directory,
    reference_genome_path=reference_genome_file,
    sequence_length=1000,
    center_bin_to_predict=500,
    window_size=1,
    batch_size=32,
    super_batch_size=1024,
    batches_per_epoch=20,
    maximum_unknown_bases_fraction=0.1,
    sequence_encoder="onehot",
    n_threads=4,
    return_batch_objects=True,
)

# Don't use num_workers > 0 in the DataLoader. The heavy
# lifting/parallelism is done on CUDA streams on the GPU.
dataloader = DataLoader(dataset, num_workers=0, batch_size=None)


class MyTerribleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, batch):
        return self.linear(batch).transpose(1, 2)


model = MyTerribleModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)


def poisson_loss(pred, target):
    return (pred - target * torch.log(pred.clamp(min=1e-8))).mean()


for batch in dataloader:
    # batch.sequences.shape = n_batch (32), sequence_length (1000), onehot encoding (4)
    pred = model(batch.sequences)
    # batch.values.shape = n_batch (32), n_tracks (2), center_bin_to_predict (500)
    loss = poisson_loss(pred[:, :, 250:750], batch.values)
    print(loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
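Depending on your setup, the tensors in `batch` may already be CUDA tensors; the model parameters then need to live on the same device. A hedged one-liner, assuming a CUDA device is available:

```python
model = MyTerribleModel().to("cuda")  # keep the parameters on the same device as the batches
```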

### Other frameworks

A framework-agnostic Dataset object can be imported from `bigwig_loader.dataset`. This dataset object
returns cupy tensors. Cupy tensors adhere to the CUDA array interface and can be zero-copy transformed
to JAX or TensorFlow tensors.

```python
from bigwig_loader.dataset import BigWigDataset

dataset = BigWigDataset(
    regions_of_interest=train_regions,
    collection=example_bigwigs_directory,
    reference_genome_path=reference_genome_file,
    sequence_length=1000,
    center_bin_to_predict=500,
    window_size=1,
    batch_size=32,
    super_batch_size=1024,
    batches_per_epoch=20,
    maximum_unknown_bases_fraction=0.1,
    sequence_encoder="onehot",
)
```
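As an illustration of that zero-copy handoff, the sketch below passes the cupy arrays yielded by the dataset to JAX via DLPack. This is an assumed usage pattern, not part of the bigwig-loader API: the iteration variable names follow the earlier example, and depending on your cupy/JAX versions you may need to exchange an explicit DLPack capsule (`cupy_array.toDlpack()`) instead.

```python
import jax.dlpack

for encoded_sequences, epigenetics_profiles in dataset:
    # Both arrays are cupy arrays living in GPU memory; DLPack lets JAX
    # wrap that same memory without copying it to the host.
    sequences_jax = jax.dlpack.from_dlpack(encoded_sequences)
    profiles_jax = jax.dlpack.from_dlpack(epigenetics_profiles)
    break  # one batch is enough for the illustration
```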

See the examples directory for more examples.

## Background

This library is meant for loading batches of data with the same dimensionality, which allows for some assumptions that can
speed up the loading process. As can be seen from the plot below, when loading a small amount of data, pyBigWig is very fast,
but it does not exploit the batched nature of data loading for machine learning.

In the benchmark below we also created PyTorch dataloaders (with set_start_method('spawn')) using pyBigWig, to compare to
the realistic scenario where multiple CPUs would be used per GPU. We see that the throughput of the CPU dataloader does
not go up linearly with the number of CPUs, and it therefore becomes hard to get the throughput needed to keep the GPU,
which is training the neural network, saturated during the learning steps.
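For context, a CPU baseline of that kind looks roughly like the sketch below. This is a hypothetical illustration rather than the actual benchmark code; the bigwig path and the regions are made up.

```python
import pyBigWig
import torch
from torch.utils.data import DataLoader, Dataset


class PyBigWigWindows(Dataset):
    """Reads one fixed-size window per __getitem__ using pyBigWig (CPU-bound)."""

    def __init__(self, bigwig_path, regions):
        self._path = bigwig_path
        self._regions = regions  # list of (chrom, start, end) tuples
        self._bw = None

    def __len__(self):
        return len(self._regions)

    def __getitem__(self, idx):
        if self._bw is None:
            # Open lazily so every spawned worker gets its own file handle.
            self._bw = pyBigWig.open(self._path)
        chrom, start, end = self._regions[idx]
        values = self._bw.values(chrom, start, end, numpy=True)
        return torch.from_numpy(values)


if __name__ == "__main__":
    torch.multiprocessing.set_start_method("spawn")
    regions = [("chr1", start, start + 1000) for start in range(0, 1_000_000, 1000)]
    loader = DataLoader(
        PyBigWigWindows("example.bw", regions),  # hypothetical file
        batch_size=256,
        num_workers=4,
    )
    for batch in loader:
        pass  # the model's training step would consume `batch` here
```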

![benchmark.png](images%2Fbenchmark.png)

This is the problem bigwig-loader solves; see the Quickstart above for how to use it.

### Installation

1. `git clone git@github.com:pfizer-opensource/bigwig-loader`

examples/pytorch_example.py

Lines changed: 46 additions & 27 deletions
```python
import pandas as pd
import torch
from torch.utils.data import DataLoader

from bigwig_loader import config
from bigwig_loader.download_example_data import download_example_data
from bigwig_loader.pytorch import PytorchBigWigDataset

download_example_data()
example_bigwigs_directory = config.bigwig_dir
reference_genome_file = config.reference_genome

train_regions = pd.DataFrame(
    {"chrom": ["chr1", "chr2"], "start": [0, 0], "end": [1000000, 1000000]}
)

dataset = PytorchBigWigDataset(
    regions_of_interest=train_regions,
    collection=example_bigwigs_directory,
    reference_genome_path=reference_genome_file,
    sequence_length=1000,
    center_bin_to_predict=500,
    window_size=1,
    batch_size=32,
    super_batch_size=1024,
    batches_per_epoch=20,
    maximum_unknown_bases_fraction=0.1,
    sequence_encoder="onehot",
    n_threads=4,
    return_batch_objects=True,
)

dataloader = DataLoader(dataset, num_workers=0, batch_size=None)


class MyTerribleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, batch):
        return self.linear(batch).transpose(1, 2)


model = MyTerribleModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)


def poisson_loss(pred, target):
    return (pred - target * torch.log(pred.clamp(min=1e-8))).mean()


for batch in dataloader:
    # batch.sequences.shape = n_batch (32), sequence_length (1000), onehot encoding (4)
    pred = model(batch.sequences)
    # batch.values.shape = n_batch (32), n_tracks (2), center_bin_to_predict (500)
    loss = poisson_loss(pred[:, :, 250:750], batch.values)
    print(loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
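Assuming bigwig-loader and its GPU dependencies are installed and a CUDA-capable GPU is available, the example can be run from the repository root:

```shell
python examples/pytorch_example.py
```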
