Auto-identify bulk data as heavy or light chain #121

willdumm · 2025-03-01T05:39:15Z

Previously, our only bulk data was heavy chain data, so when a pcp file did not distinguish between heavy and light chain fields (e.g. parent sequences), we assumed it contained heavy chains. Now we have light chain data in the same format, and in principle a single pcp file could contain mixed heavy and light chain bulk data.

To handle this, we now process pcp_dfs into a format that always contains heavy/light differentiated _h and _l columns, inferring the chain type of each pcp automatically from its v family name.
If the pcp file already has differentiated _h and _l columns, as with paired data, no inference is needed: we only check that all necessary columns are present and that the heavy chain and light chain v families are consistent with the chain type they claim.
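
As a rough illustration of the inference described above, here is a minimal sketch in plain Python. It is not the code from this PR: the column names (`parent`, `child`, `v_family`), the function names, and the choice of empty strings for the absent chain are all assumptions made for the example. The only domain fact it relies on is the IMGT naming convention, where heavy chain V genes start with IGHV and light chain V genes start with IGKV or IGLV.

```python
from typing import Dict

# Hypothetical column names; the real pcp files may use different ones.
SHARED_COLUMNS = ("parent", "child", "v_family")


def infer_chain_type(v_family: str) -> str:
    """Return "heavy" or "light" based on an IMGT-style V family name."""
    name = v_family.upper()
    if name.startswith("IGHV"):
        return "heavy"
    if name.startswith(("IGKV", "IGLV")):
        return "light"
    raise ValueError(f"Cannot infer chain type from V family {v_family!r}")


def differentiate_row(row: Dict[str, str]) -> Dict[str, str]:
    """Expand an undifferentiated row into _h and _l columns.

    Columns for the chain that is absent are filled with empty strings
    (an assumption of this sketch).
    """
    suffix = "_h" if infer_chain_type(row["v_family"]) == "heavy" else "_l"
    out = {f"{col}{s}": "" for col in SHARED_COLUMNS for s in ("_h", "_l")}
    for col in SHARED_COLUMNS:
        out[f"{col}{suffix}"] = row[col]
    return out


def check_differentiated_row(row: Dict[str, str]) -> None:
    """Check that already-differentiated V families match their claimed chain."""
    for suffix, expected in (("_h", "heavy"), ("_l", "light")):
        v_family = row.get(f"v_family{suffix}", "")
        if v_family and infer_chain_type(v_family) != expected:
            raise ValueError(
                f"Column v_family{suffix} holds {v_family!r}, "
                f"which does not look like a {expected} chain V family"
            )
```

With this sketch, a bulk row whose V family is a lambda family ends up with its sequences in the _l columns, while a row that already claims a kappa family in its _h column is rejected by the consistency check.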

I also added a more informative error message for the case where masked parent-child nt pairs are identical, since I moved that filtering step to pre-processing in dnsm-experiments.
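
For readers unfamiliar with this check, the following is a hedged sketch of what such a check could look like. It is not the implementation in this PR: the masking rule (a site is masked in both sequences when either one has an N there), the function names, and the wording of the message are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def mask_pair(parent: str, child: str) -> Tuple[str, str]:
    """Mask a site in both sequences if either sequence has "N" there.

    This masking rule is an assumption of the sketch.
    """
    if len(parent) != len(child):
        raise ValueError("Parent and child sequences must have equal length")
    sites = [("N", "N") if "N" in (p, c) else (p, c) for p, c in zip(parent, child)]
    return "".join(p for p, _ in sites), "".join(c for _, c in sites)


def find_identical_pairs(pairs: Sequence[Tuple[str, str]]) -> List[int]:
    """Return the indices of pairs that are identical after masking."""
    identical = []
    for index, (parent, child) in enumerate(pairs):
        masked_parent, masked_child = mask_pair(parent, child)
        if masked_parent == masked_child:
            identical.append(index)
    return identical


def assert_no_identical_pairs(pairs: Sequence[Tuple[str, str]]) -> None:
    """Raise an informative error if any masked pair is identical."""
    identical = find_identical_pairs(pairs)
    if identical:
        raise ValueError(
            f"{len(identical)} of {len(pairs)} parent-child pairs are identical "
            f"after masking (first indices: {identical[:5]}). "
            "These pairs should be filtered out during pre-processing."
        )
```

The point of reporting the count and the first few indices is that the caller can see at a glance whether the pre-processing filter was skipped entirely or only missed a few rows.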

willdumm added 3 commits February 28, 2025 15:04

infer bulk data type from v gene

d44cc3d

pcp_df handling and better error message

0c1cef6

fix typo and allow missing cdr regions

bb8b854

willdumm marked this pull request as ready for review March 3, 2025 05:52

willdumm requested a review from matsen March 3, 2025 05:52

matsen approved these changes Mar 3, 2025


willdumm merged commit 25a3a56 into main Mar 4, 2025
2 checks passed

willdumm deleted the wd-vanwinkle-data branch March 4, 2025 19:56
