tcrdist draft version implemented (#502)
* tcrdist draft version implemented

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tcrdist tests added

* fixed ir_dist _get_distance_calculator parameter handling

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* handling of empty input sequences fixed

* additional tests for tcrdist added

* tcrdist test with comparison against reference implementation added

* formatting of tcrdist tests improved

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comments for TCRdist added

* code formatting

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* handling of default values for cutoff and n_jobs in _get_distance_calculator adapted

* auto formatting disabled for tcr_dict_distance_matrix

* added data type hints to functions and adapted function comments

* changed testdata import for test cases

* changed __init__ and _nb_tcrdist_mat in TCRdistDistanceCalculator to keywords only

* keywords only for _nb_tcrdist_mat removed, because it doesn't work with numba

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unused control variable changed to _

* creation of numba lookup matrix changed

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update CHANGELOG

* Update docstring

* Update ir_dist docstring

* Update description in tutorial

---------

Co-authored-by: Gregor Sturm <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Apr 21, 2024
1 parent d63483c commit d1db848
Showing 8 changed files with 641 additions and 9 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,10 @@ and this project adheres to [Semantic Versioning][].
 [keep a changelog]: https://keepachangelog.com/en/1.0.0/
 [semantic versioning]: https://semver.org/spec/v2.0.0.html
 
+## Unreleased
+
+- Add "TCRdist" as new metric ([#502](https://github.com/scverse/scirpy/pull/502))
+
 ## v0.16.1
 
 ### Fixes
1 change: 1 addition & 0 deletions docs/api.rst
@@ -302,3 +302,4 @@ distance metrics
 ir_dist.metrics.HammingDistanceCalculator
 ir_dist.metrics.AlignmentDistanceCalculator
 ir_dist.metrics.FastAlignmentDistanceCalculator
+ir_dist.metrics.TCRdistDistanceCalculator
20 changes: 13 additions & 7 deletions docs/tutorials/tutorial_3k_tcr.ipynb
@@ -1049,13 +1049,19 @@
 "For instance, a distance of `10` is equivalent to 2 Rs mutating into N.\n",
 "This appoach was initially proposed as *TCRdist* by Dash et al. {cite}`TCRdist`.\n",
 "\n",
-":::{tip}\n",
-"You can use `metric=\"fastalignment\"` for a faster calculation at the cost of a few false-negatives (i.e. sequence pairs\n",
-"that are actually below the distance cutoff, but are removed during a pre-filtering step). With default parameters, \n",
-"the false-negative rate (of all sequence pairs actually below the cutoff) was ~2% on the {func}`scirpy.datasets.wu2020`\n",
-"dataset. \n",
-"\n",
-"See also {class}`scirpy.ir_dist.metrics.FastAlignmentDistanceCalculator`. \n",
+":::{admonition} Speeding up TCR distance calculation\n",
+":class: tip\n",
+"\n",
+"Scirpy provides alternative distance metrics that are similar to `\"alignment\"`, but a lot faster: \n",
+"\n",
+"* `metric=\"tcrdist\"` is an implementation of [tcrdist3](https://github.com/kmayerb/tcrdist3) within scirpy. The scores\n",
+" are calculated differently, but it gives very similar results compared to `metric=\"alignment\"`.\n",
+" See also {class}`scirpy.ir_dist.metrics.TCRdistDistanceCalculator`.\n",
+"* `metric=\"fastalignment\"` uses a heuristic to speed up the `\"alignment\"` metric at the cost of a few false-negatives (i.e. sequence pairs\n",
+" that are actually below the distance cutoff, but are removed during a pre-filtering step). With default parameters, \n",
+" the false-negative rate (of all sequence pairs actually below the cutoff) was ~2% on the {func}`scirpy.datasets.wu2020`\n",
+" dataset. See also {class}`scirpy.ir_dist.metrics.FastAlignmentDistanceCalculator`.\n",
+" \n",
 ":::\n",
 "\n",
 "All cells with a distance between their CDR3 sequences lower than `cutoff` will be connected in the network.\n"
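The false-negative rate quoted in the tutorial text above is measured relative to the sequence pairs that are truly below the cutoff, not relative to all pairs. A minimal sketch of that calculation follows. It uses hypothetical toy pair sets rather than scirpy output, and the function name `false_negative_rate` is illustrative, not part of scirpy.

```python
def false_negative_rate(exact_hits, heuristic_hits):
    """Fraction of pairs truly below the cutoff that the heuristic failed to report."""
    exact = set(exact_hits)
    if not exact:
        # No true hits means nothing can be missed.
        return 0.0
    missed = exact - set(heuristic_hits)
    return len(missed) / len(exact)


# Hypothetical toy data: index pairs whose exact distance is below the cutoff.
exact_hits = {(0, 1), (0, 2), (1, 3), (2, 3), (4, 5)}
# Pairs still reported after a pre-filtering heuristic; one true hit, (2, 3), is lost.
heuristic_hits = {(0, 1), (0, 2), (1, 3), (4, 5)}

rate = false_negative_rate(exact_hits, heuristic_hits)
print(f"false-negative rate: {rate:.0%}")
```

Dividing by the number of true hits (rather than by all possible pairs) is what makes the figure comparable across datasets of different size.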
7 changes: 6 additions & 1 deletion src/scirpy/ir_dist/__init__.py
@@ -99,6 +99,8 @@ def _get_distance_calculator(metric: MetricType, cutoff: Union[int, None], *, n_
         dist_calc = metrics.LevenshteinDistanceCalculator(n_jobs=n_jobs, **kwargs)
     elif metric == "hamming":
         dist_calc = metrics.HammingDistanceCalculator(n_jobs=n_jobs, **kwargs)
+    elif metric == "tcrdist":
+        dist_calc = metrics.TCRdistDistanceCalculator(n_jobs=n_jobs, **kwargs)
     else:
         raise ValueError("Invalid distance metric.")
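
The branch added above follows a simple pattern: a metric name selects a calculator class, and any extra keyword arguments are handed to its constructor. Below is a self-contained sketch of that pattern, written as a lookup table rather than the `if`/`elif` chain used in the diff. The classes and the `example_option` parameter are hypothetical stand-ins, not scirpy's calculators.

```python
from typing import Any


class HammingCalculator:
    """Hypothetical stand-in for a calculator class."""

    def __init__(self, *, n_jobs: int = 1, **kwargs: Any):
        self.n_jobs = n_jobs
        self.options = kwargs


class TCRdistCalculator(HammingCalculator):
    """Hypothetical stand-in; reuses the constructor above."""


# Lookup table from metric name to calculator class (assumed design, see lead-in).
_CALCULATORS = {"hamming": HammingCalculator, "tcrdist": TCRdistCalculator}


def get_distance_calculator(metric: str, *, n_jobs: int = 1, **kwargs: Any):
    """Return a calculator instance for `metric`, forwarding extra options."""
    try:
        calculator_cls = _CALCULATORS[metric]
    except KeyError:
        raise ValueError("Invalid distance metric.") from None
    return calculator_cls(n_jobs=n_jobs, **kwargs)


calc = get_distance_calculator("tcrdist", n_jobs=2, example_option=4)
print(type(calc).__name__, calc.n_jobs, calc.options)
```

A table keeps the set of supported names in one place, at the cost of losing per-metric argument handling such as passing `cutoff` to only some calculators.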

@@ -122,6 +124,7 @@ def _ir_dist(
     airr_mod_ref: str = "airr",
     airr_key_ref: str = "airr",
     chain_idx_key_ref: str = "chain_indices",
+    **kwargs,
 ) -> Union[dict, None]:
     """\
     Computes a sequence-distance metric between all unique :term:`VJ <Chain locus>`
@@ -171,6 +174,8 @@ def _ir_dist(
         Like `airr_key`, but for `reference`.
     chain_idx_key_ref
         Like `chain_idx_key`, but for `reference`.
+    **kwargs
+        Arguments are passed to the respective :class:`~scirpy.ir_dist.metrics.DistanceCalculator` class.
 
     Returns
     -------
@@ -227,7 +232,7 @@ def _get_unique_seqs(tmp_adata, chain_type):
result[chain_type][tmp_key] = unique_seqs

     # compute distance matrices
-    dist_calc = _get_distance_calculator(metric, cutoff, n_jobs=n_jobs)
+    dist_calc = _get_distance_calculator(metric, cutoff, n_jobs=n_jobs, **kwargs)
     for chain_type in ["VJ", "VDJ"]:
         logging.info(f"Computing sequence x sequence distance matrix for {chain_type} sequences.")  # type: ignore
         result[chain_type]["distances"] = dist_calc.calc_dist_mat(
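
With `**kwargs` threaded through `_ir_dist`, options for a calculator can be set by the caller at the top level. The sketch below imitates that flow with a toy calculator. `ToyCalculator`, `toy_ir_dist`, and `penalty` are hypothetical names, and the toy distance is not TCRdist.

```python
class ToyCalculator:
    """Hypothetical stand-in for a distance calculator with one tunable option."""

    def __init__(self, *, n_jobs=1, penalty=4):
        self.n_jobs = n_jobs
        self.penalty = penalty

    def calc_dist(self, a, b):
        # Toy distance: positional mismatches plus a penalty per extra residue.
        mismatches = sum(x != y for x, y in zip(a, b))
        return mismatches + self.penalty * abs(len(a) - len(b))


def toy_ir_dist(seqs, *, n_jobs=1, **kwargs):
    """Forward unrecognised keyword arguments to the calculator constructor."""
    calc = ToyCalculator(n_jobs=n_jobs, **kwargs)
    return {
        (a, b): calc.calc_dist(a, b)
        for i, a in enumerate(seqs)
        for b in seqs[i + 1 :]
    }


seqs = ["CASS", "CAST", "CASSF"]
default_dists = toy_ir_dist(seqs)            # uses penalty=4
strict_dists = toy_ir_dist(seqs, penalty=8)  # option forwarded via **kwargs
print(default_dists[("CASS", "CASSF")], strict_dists[("CASS", "CASSF")])
```

One consequence of this design is that a misspelled option is only caught when the calculator is constructed, where Python raises a `TypeError` for the unexpected keyword.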
297 changes: 297 additions & 0 deletions src/scirpy/ir_dist/metrics.py

Large diffs are not rendered by default.

Binary file not shown.
Binary file not shown.