Auto-identify bulk data as heavy or light chain #121
Merged
Previously our only bulk data was heavy chain data, so if a pcp file did not distinguish between heavy and light chains (e.g. in its parent sequence columns), we assumed it described heavy chains. Now we have light chain data in the same format, and in theory a single pcp file could contain mixed heavy and light chain bulk data.
To handle this, we process pcp_dfs into a format that always contains heavy/light differentiated `_h` and `_l` columns, but we automatically infer the chain type for each pcp based on the v family name.

If the pcp file already has differentiated `_h` and `_l` columns, as with paired data, then we assume that no inference is necessary. In that case we only check that all necessary columns are present and that all the heavy chain and light chain v families seem to be heavy or light, as claimed.

I also added a more informative error message for when masked parent-child nt pairs are identical, since I moved that filtering step to pre-processing in dnsm-experiments.
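For readers unfamiliar with the idea, here is a minimal sketch of the inference and the consistency check, not the code in this PR. It assumes v family names follow IMGT-style prefixes (`IGH` for heavy, `IGK`/`IGL` for light). The function names, the `parent` and `v_family` column names, and the use of an empty string for the absent chain are all hypothetical, and it works on plain dict rows rather than a DataFrame to stay self-contained.

```python
def infer_chain_type(v_family: str) -> str:
    """Return "h" (heavy) or "l" (light) from a v family name.

    Assumes IMGT-style names: IGH* is heavy, IGK*/IGL* is light.
    """
    name = v_family.upper()
    if name.startswith("IGH"):
        return "h"
    if name.startswith(("IGK", "IGL")):
        return "l"
    raise ValueError(f"Cannot infer chain type from v family {v_family!r}")


def differentiate_row(row: dict) -> dict:
    """Turn an undifferentiated row into one with _h and _l columns.

    Each value goes to the column for the inferred chain; the other
    chain's column is filled with an empty string (an assumed placeholder).
    """
    chain = infer_chain_type(row["v_family"])  # hypothetical column name
    other = "l" if chain == "h" else "h"
    out = {}
    for col, value in row.items():
        out[f"{col}_{chain}"] = value
        out[f"{col}_{other}"] = ""
    return out


def check_claimed_families(row: dict) -> None:
    """For already-differentiated rows, check each v family matches its suffix."""
    for chain in ("h", "l"):
        family = row[f"v_family_{chain}"]
        # An empty family means that chain is absent from this row.
        if family and infer_chain_type(family) != chain:
            raise ValueError(
                f"v_family_{chain} is {family!r}, which does not look like "
                f"a {'heavy' if chain == 'h' else 'light'} chain family"
            )


# Usage: a light chain bulk row ends up in the _l columns.
light_row = differentiate_row({"parent": "ACGTAC", "v_family": "IGKV1"})
check_claimed_families(light_row)
```

Raising on an unrecognized prefix, rather than defaulting to heavy, is a deliberate choice in this sketch: it avoids silently repeating the old heavy-chain assumption on mixed data.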