Error creating WisecondorFF reference #1

Open
dy-lin opened this issue Apr 22, 2022 · 5 comments
dy-lin commented Apr 22, 2022

I'm having some issues building the WisecondorFF reference, but no issues converting from BAM to NPZ. I've run it with the log level set to debug. For context, I'm creating the reference from 97 normal samples (97 NPZ files). Snippet below; the full log file is attached.

...
[INFO - 2022-03-17 15:46:23]: Removed 11512 bins that have < 500 observations
[INFO - 2022-03-17 15:46:23]: Scaling up by a factor of 50 from 5,000 to 250,000.
../WisecondorFF/src/main.py:343: RuntimeWarning: invalid value encountered in true_divide
  all_data = all_data / sum_per_sample
../WisecondorFF/src/main.py:346: RuntimeWarning: invalid value encountered in greater
  mask = sum_per_bin > 0
Traceback (most recent call last):
  File "../WisecondorFF/src/main.py", line 1171, in <module>
    main()
  File "../WisecondorFF/src/main.py", line 39, in wrap
    output = f(*args, **kwargs)
  File "../WisecondorFF/src/main.py", line 1167, in main
    args.func(args)
  File "../WisecondorFF/src/main.py", line 39, in wrap
    output = f(*args, **kwargs)
  File "../WisecondorFF/src/main.py", line 541, in wcr_reference
    args.binsize, args.refsize, fs_samples, fs_total_mask, fs_bins_per_chr, rc=False
  File "../WisecondorFF/src/main.py", line 389, in reference_prep
    pca_corrected_data, pca = train_pca(masked_data)
  File "../WisecondorFF/src/main.py", line 354, in train_pca
    pca.fit(t_data)
  File "/home/dlin/.conda/envs/pegasus2_pipeline/lib/python3.6/site-packages/scikit_learn-0.22.1-py3.6-linux-x86_64.egg/sklearn/decomposition/_pca.py", line 344, in fit
    self._fit(X)
  File "/home/dlin/.conda/envs/pegasus2_pipeline/lib/python3.6/site-packages/scikit_learn-0.22.1-py3.6-linux-x86_64.egg/sklearn/decomposition/_pca.py", line 391, in _fit
    copy=self.copy)
  File "/home/dlin/.conda/envs/pegasus2_pipeline/lib/python3.6/site-packages/scikit_learn-0.22.1-py3.6-linux-x86_64.egg/sklearn/utils/validation.py", line 594, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(97, 0)) while a minimum of 1 is required.

peg2_ref-n519-29682026.txt

tomokveld (Owner) commented

The error ValueError: Found array with 0 feature(s) (shape=(97, 0)) while a minimum of 1 is required. indicates that there are no features left in t_data when attempting to train the PCA. The first call to this function is made when building the fragment-size-specific reference.

A look at the log file reveals the problem: when selecting for fragment size, all regions end up masked. The selection of fragment size regions is based on both the absolute and the normalized read count (in my experience, a read count that is too low within a region yields unreliable fragment size estimates). The absolute read count is used here as a lower bound, whereas the cutoff derived from the normalized count may be higher (it varies per sample).

...
[INFO - 2022-03-17 15:43:17]: Loading: ./reference_npzs/CKXROD.npz
[INFO - 2022-03-17 15:43:18]: Scaling up by a factor of 50 from 5,000 to 250,000.
[INFO - 2022-03-17 15:43:19]: Removed 11512 bins that have < 500 observations
...
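
For what it's worth, here is a minimal NumPy sketch of that kind of two-cutoff masking (the variable names and the exact rule are illustrative, not the actual WisecondorFF code) showing how low-coverage data can leave zero surviving regions, which is exactly the (97, 0) array the PCA then fails on:

import numpy as np

# Illustration only -- not the actual WisecondorFF logic.
rng = np.random.default_rng(0)
rc = rng.poisson(2.0, size=(97, 12000)).astype(float)  # low-coverage read counts per region

RC_CLIP_ABS = 500      # absolute read-count lower bound (default mentioned below)
RC_CLIP_NORM = 1e-4    # normalized read-count lower bound (default mentioned below)

norm = rc / rc.sum(axis=1, keepdims=True)              # per-sample normalization

# keep a region only if it clears both cutoffs in every sample
keep = np.all((rc >= RC_CLIP_ABS) & (norm >= RC_CLIP_NORM), axis=0)
masked = rc[:, keep]

print(masked.shape)    # at this coverage nothing survives -> (97, 0)
# sklearn's PCA.fit() on a (97, 0) array raises exactly the ValueError above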

In any case, I pushed an update that lets you specify the parameters below, so you can tweak the absolute and normalized cutoffs (which naturally depend on the chosen region size and sample coverage). When using a smaller region size (250 kb), it may be necessary to set RC_CLIP_ABS to a lower value (in general I don't think you need to touch RC_CLIP_NORM).

I recommend either setting RC_CLIP_ABS lower, so that enough regions remain to actually build a reference, or using a bigger region size (fragment size estimates are a lot more reliable at larger sizes), which should make this a non-issue.

  -cn RC_CLIP_NORM, --rc_clip_norm RC_CLIP_NORM
                        Lower bound cutoff based on the normalized read count
                        across regions (regions that fall below this cutoff
                        are masked from the fragment size estimation)
                        (default: 0.0001)
  -ca RC_CLIP_ABS, --rc_clip_abs RC_CLIP_ABS
                        Lower bound cutoff based on the absolute read count
                        across regions (regions that fall below this cutoff
                        are masked from the fragment size estimation)
                        (default: 500)
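
As a rough usage sketch (only the -ca/--rc_clip_abs and -cn/--rc_clip_norm flags come from the help text above; the subcommand name, positional arguments, and the value 100 are placeholders, so check the actual --help output for the real interface), lowering the absolute cutoff would look something like:

# placeholders: the subcommand and positional arguments are assumed, not the documented interface
python main.py reference ./reference_npzs/*.npz reference_output.npz --rc_clip_abs 100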


dy-lin commented Apr 28, 2022

The error persists with the new commit. I've reduced RC_CLIP_ABS to 1 and also increased the region size to 1000 kb (sweeping from 1000 kb down to 25 kb). The data does have very low coverage (< 0.16). Are there any other parameters that should be adjusted?
peg2_ref_FF-n139-30653240.txt

tomokveld (Owner) commented

OK, that's rather strange. Did you use paired-end reads? And what aligner was used?

If you can, could you send one of the NPZ files that you're using to build the reference? I can then have a look at the read distributions.


dy-lin commented Jul 4, 2022

This particular dataset was single-end reads aligned with BWA. Unfortunately I cannot send the NPZ files without going through red tape, as it is confidential patient data. However, we recently received data from a new cohort (19 NK reference samples) for which the reference seemed to build without issue, even without specifying RC_CLIP_ABS at the 250 kb region size. This cohort has much better average coverage (0.25-0.6) than the previous one (0.06-0.16), which could explain the earlier issues.

I've attached the new cohort's log file, where you can see that the observation threshold for removal is determined per NPZ file:

[INFO - 2022-07-04 15:01:56]: Loading: ../reference_npzs/176_S1.npz
[INFO - 2022-07-04 15:01:59]: Scaling up by a factor of 50 from 5,000 to 250,000.
[INFO - 2022-07-04 15:02:03]: Removed 9592 bins that have < 794 observations
[INFO - 2022-07-04 15:02:03]: Scaling up by a factor of 50 from 5,000 to 250,000.
[INFO - 2022-07-04 15:02:03]: Loading: ../reference_npzs/177_S2.npz
[INFO - 2022-07-04 15:02:07]: Scaling up by a factor of 50 from 5,000 to 250,000.
[INFO - 2022-07-04 15:02:11]: Removed 8335 bins that have < 866 observations
[INFO - 2022-07-04 15:02:11]: Scaling up by a factor of 50 from 5,000 to 250,000.
[INFO - 2022-07-04 15:02:11]: Loading: ../reference_npzs/179_S3.npz
[INFO - 2022-07-04 15:02:14]: Scaling up by a factor of 50 from 5,000 to 250,000.
[INFO - 2022-07-04 15:02:17]: Removed 8612 bins that have < 559 observations
...
[INFO - 2022-07-04 16:23:41]: Loading: ../reference_npzs/216_S20.npz
[INFO - 2022-07-04 16:23:44]: Scaling up by a factor of 50 from 5,000 to 250,000.
[INFO - 2022-07-04 16:23:48]: Removed 8445 bins that have < 760 observations
[INFO - 2022-07-04 16:23:48]: Scaling up by a factor of 50 from 5,000 to 250,000.
[INFO - 2022-07-04 16:25:12]: Ended: wcr_reference, duration 210.2965271472931s
[INFO - 2022-07-04 16:25:12]: Ended: main, duration 210.30400109291077s

peg2_ref_default.txt

tomokveld (Owner) commented

So you definitely need paired-end reads: the method, in its current form, relies on the distance (insert size) measured between aligned read pairs in order to compare insert size distributions between regions.

I assume the new data is paired-end, since it actually passes the construction phase, and the log looks as you would expect.
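
To make the paired-end requirement concrete, here is a minimal pysam sketch (pysam is an assumption on my part; WisecondorFF's actual BAM parsing may differ) of collecting a per-region insert size distribution. Single-end alignments are unpaired and carry a template length of 0, so there is nothing to estimate a fragment size from:

import pysam  # assumed for illustration; not necessarily what WisecondorFF uses

def region_insert_sizes(bam_path, chrom, start, end):
    """Collect insert sizes (TLEN) for properly paired reads overlapping a region."""
    sizes = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            # single-end reads are unpaired and have template_length == 0,
            # so they contribute nothing to a fragment size distribution
            if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
                sizes.append(read.template_length)
    return sizes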
