# :lollipop: Epigenetics Dataloader for BigWig files

Fast batched dataloading of BigWig files containing epigenetic track data and corresponding sequences, powered by the GPU,
for deep learning applications.

## Quickstart
### Installation with conda/mamba

Bigwig-loader mainly depends on the rapidsai kvikio library and cupy, both of which are best installed using
conda/mamba. Bigwig-loader itself can now also be installed with conda/mamba. To create a new environment with
bigwig-loader installed:

```shell
mamba create -n my-env -c rapidsai -c conda-forge -c bioconda -c dataloading bigwig-loader
```
Or add this to your environment.yml file:

```yaml
name: my-env
channels:
  - rapidsai
  - conda-forge
  - bioconda
  - dataloading
dependencies:
  - bigwig-loader
```
and update:

```shell
mamba env update -f environment.yml
```
### Installation with pip

Bigwig-loader can also be installed using pip in an environment which already has the rapidsai kvikio library
and cupy installed:

```shell
pip install bigwig-loader
```

### PyTorch Example

We wrapped the BigWigDataset in a PyTorch iterable dataset that you can use directly:
```python
# examples/pytorch_example.py
import pandas as pd
import torch
from torch.utils.data import DataLoader
from bigwig_loader import config
from bigwig_loader.pytorch import PytorchBigWigDataset
from bigwig_loader.download_example_data import download_example_data

# Download example data to play with
download_example_data()
example_bigwigs_directory = config.bigwig_dir
reference_genome_file = config.reference_genome

train_regions = pd.DataFrame({"chrom": ["chr1", "chr2"], "start": [0, 0], "end": [1000000, 1000000]})

dataset = PytorchBigWigDataset(
    regions_of_interest=train_regions,
    collection=example_bigwigs_directory,
    reference_genome_path=reference_genome_file,
    sequence_length=1000,
    center_bin_to_predict=500,
    window_size=1,
    batch_size=32,
    super_batch_size=1024,
    batches_per_epoch=20,
    maximum_unknown_bases_fraction=0.1,
    sequence_encoder="onehot",
    n_threads=4,
    return_batch_objects=True,
)

# Don't use num_workers > 0 in the DataLoader. The heavy
# lifting/parallelism is done on CUDA streams on the GPU.
dataloader = DataLoader(dataset, num_workers=0, batch_size=None)


class MyTerribleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, batch):
        return self.linear(batch).transpose(1, 2)


model = MyTerribleModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def poisson_loss(pred, target):
    return (pred - target * torch.log(pred.clamp(min=1e-8))).mean()

for batch in dataloader:
    # batch.sequences.shape = n_batch (32), sequence_length (1000), onehot encoding (4)
    pred = model(batch.sequences)
    # batch.values.shape = n_batch (32), n_tracks (2), center_bin_to_predict (500)
    loss = poisson_loss(pred[:, :, 250:750], batch.values)
    print(loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
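In a real training setup you would usually not hard-code `train_regions` as above, but load the regions of interest
from a file, for instance the train/val/test intervals produced by `examples/create_train_val_test_intervals.py`
(the file name below is just an illustrative output of that script):

```python
import pandas as pd

# Intervals file with at least the columns "chrom", "start" and "end",
# e.g. created by examples/create_train_val_test_intervals.py
train_regions = pd.read_csv("train_regions.tsv", sep="\t")
```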
### Other frameworks

A framework-agnostic Dataset object can be imported from `bigwig_loader.dataset`. This dataset object
returns CuPy arrays. CuPy arrays adhere to the CUDA Array Interface and can be zero-copy transformed
to JAX or TensorFlow tensors (a sketch of this follows the example below).

```python
from bigwig_loader.dataset import BigWigDataset

dataset = BigWigDataset(
    regions_of_interest=train_regions,
    collection=example_bigwigs_directory,
    reference_genome_path=reference_genome_file,
    sequence_length=1000,
    center_bin_to_predict=500,
    window_size=1,
    batch_size=32,
    super_batch_size=1024,
    batches_per_epoch=20,
    maximum_unknown_bases_fraction=0.1,
    sequence_encoder="onehot",
)
```
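As a minimal sketch of that zero-copy hand-off, assuming JAX is installed (on older JAX versions you may need to pass
DLPack capsules created with `array.toDlpack()` instead of the arrays themselves), the CuPy batches can be consumed
like this:

```python
import jax.dlpack

for encoded_sequences, epigenetics_profiles in dataset:
    # Both are CuPy arrays that already live on the GPU; DLPack hands the
    # same device memory to JAX without copying through the host.
    sequences = jax.dlpack.from_dlpack(encoded_sequences)
    profiles = jax.dlpack.from_dlpack(epigenetics_profiles)
    print(sequences.shape, profiles.shape)
```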
See the examples directory for more examples.

## Background

This library is meant for loading batches of data with the same dimensionality, which allows for some assumptions
that can speed up the loading process. As can be seen from the benchmark plot below, when loading a small amount of
data, pyBigWig is very fast, but it does not exploit the batched nature of data loading for machine learning.

In the benchmark below we also created PyTorch dataloaders (with set_start_method('spawn')) using pyBigWig to compare
to the realistic scenario in which multiple CPUs would be used per GPU. We see that the throughput of the CPU
dataloader does not scale linearly with the number of CPUs, so it becomes hard to reach the throughput needed to keep
the GPU, which is training the neural network, saturated during the learning steps.

*(Benchmark plot: throughput of batched GPU loading with bigwig-loader versus pyBigWig-based CPU dataloaders.)*

This is the problem bigwig-loader solves. See the Quickstart above for an example of how to use it.
### Installation

1. `git clone git@github.com:pfizer-opensource/bigwig-loader`