---
title: "hlabud"
output:
  md_document:
    standalone: true
    toc: false
    variant: "markdown_github"
  html_document:
    toc: false
    self_contained: true
---
# hlabud <img width="25%" align="right" src="https://github.com/slowkow/hlabud/assets/209714/b39a3f04-c9a8-4867-a3e0-9434f0f9ef20"></img>
[![R-CMD-check](https://github.com/slowkow/hlabud/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/slowkow/hlabud/actions/workflows/R-CMD-check.yaml)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11093557.svg)](https://doi.org/10.5281/zenodo.11093557)
hlabud provides methods to retrieve sequence alignment data from [IMGTHLA] and convert the data into convenient R matrices ready for downstream analysis. See the [usage examples](https://slowkow.github.io/hlabud/articles/examples.html) to learn how to use the data with logistic regression and dimensionality reduction. We also share tips on how to [visualize the 3D molecular structure](https://slowkow.github.io/hlabud/articles/visualize-hla-structure.html) of HLA proteins and highlight specific amino acid residues.
[IMGTHLA]: https://github.com/ANHIG/IMGTHLA
For example, let's consider a simple question about two HLA genotypes.
Which amino acid positions differ between the two genotypes?
```{r}
library(hlabud)
a <- hla_alignments("DRB1")
a$release
dosage(a$onehot, c("DRB1*03:01:05", "DRB1*03:02:03"))
```
Which nucleotides differ?
```{r}
n <- hla_alignments("DRB1", type = "nuc")
n$release
dosage(n$onehot, c("DRB1*03:01:05", "DRB1*03:02:03"))
```
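
To build intuition for what a dosage matrix contains, here is a minimal base R sketch on a toy one-hot matrix. It is not hlabud's implementation: the allele names, the `pos1_A`-style column names, and the comma-separated genotype format are all assumptions made for illustration. The idea is to sum the one-hot rows of the alleles each individual carries, then drop columns that do not vary between individuals.

```r
# Toy one-hot matrix: rows are alleles, columns are position_residue features.
# All names below are invented for illustration.
onehot <- matrix(
  c(1, 0, 1, 0, 1,
    1, 0, 0, 1, 1,
    0, 1, 1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("X*01:01", "X*01:02", "X*02:01"),
    c("pos1_A", "pos1_G", "pos2_K", "pos2_R", "pos3_S")
  )
)

# One entry per individual; alleles are comma-separated (an assumption here).
genotypes <- c("X*01:01,X*01:02", "X*01:01,X*01:01", "X*01:02,X*02:01")

# For each individual, sum the one-hot rows of the alleles they carry.
toy_dosage <- t(sapply(strsplit(genotypes, ","), function(alleles) {
  colSums(onehot[alleles, , drop = FALSE])
}))
rownames(toy_dosage) <- genotypes

# Keep only the columns that differ between individuals.
keep <- apply(toy_dosage, 2, function(x) length(unique(x)) > 1)
toy_dosage <- toy_dosage[, keep, drop = FALSE]
toy_dosage
```

In this toy case `pos3_S` is shared by every allele, so it is dropped, and the remaining columns show where the individuals differ.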
# Installation
The quickest way to get hlabud is to install from GitHub:
```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("slowkow/hlabud")
```
# Examples
See the [usage examples](https://slowkow.github.io/hlabud/articles/examples.html) to get some ideas for how to use hlabud in your analyses.
- [Get a one-hot encoded matrix for all HLA-DRB1 alleles](https://slowkow.github.io/hlabud/articles/examples.html#get-a-one-hot-encoded-matrix-for-all-hla-drb1-alleles)
- [Convert genotypes to a dosage matrix](https://slowkow.github.io/hlabud/articles/examples.html#convert-genotypes-to-a-dosage-matrix)
- [Logistic regression association for amino acid positions](https://slowkow.github.io/hlabud/articles/examples.html#logistic-regression-association-for-amino-acid-positions)
- [UMAP embedding of 3,516 HLA-DRB1 alleles](https://slowkow.github.io/hlabud/articles/examples.html#umap-embedding-of-3516-hla-drb1-alleles)
- [Get HLA allele frequencies from Allele Frequency Net Database (AFND)](https://slowkow.github.io/hlabud/articles/examples.html#get-hla-allele-frequencies-from-allele-frequency-net-database-afnd)
- [Compute HLA divergence with the Grantham distance matrix](https://slowkow.github.io/hlabud/articles/examples.html#compute-hla-divergence-with-the-grantham-distance-matrix)
- [Download and unpack all data from the latest IMGTHLA release](https://slowkow.github.io/hlabud/articles/examples.html#download-and-unpack-all-data-from-the-latest-imgthla-release)
<a href="https://slowkow.github.io/hlabud/articles/examples.html#logistic-regression-association-for-amino-acid-positions">
<img width="49%" src="vignettes/articles/examples_files/figure-html/glm-volcano-1.png">
</a>
<a href="https://slowkow.github.io/hlabud/articles/examples.html#umap-embedding-of-3516-hla-drb1-alleles">
<img width="49%" src="vignettes/articles/examples_files/figure-html/umap-2digit-1.png">
</a>
<a href="https://slowkow.github.io/hlabud/articles/examples.html#get-hla-allele-frequencies-from-allele-frequency-net-database-afnd">
<img width="49%" src="vignettes/articles/examples_files/figure-html/afnd_dqb1_02_01-1.png">
</a>
<a href="https://slowkow.github.io/hlabud/articles/visualize-hla-structure.html">
<img width="49%" src="https://github.com/slowkow/ggrepel/assets/209714/4843a850-a4fd-4832-9600-0b8e9c1bb904">
</a>
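
As a rough sketch of the association analysis idea, the code below fits one logistic regression per column of a dosage matrix using base R's `glm()`. Everything here is simulated: the dosage values, the column names, and the case status are assumptions for illustration, and the linked article may take a different approach.

```r
set.seed(1)
n <- 200

# Simulated dosage matrix (0, 1, or 2 copies); column names are invented.
d <- matrix(
  rbinom(n * 3, size = 2, prob = 0.3),
  nrow = n,
  dimnames = list(NULL, c("pos1_A", "pos2_K", "pos3_S"))
)

# Simulated case status that depends on the first column only.
case <- rbinom(n, size = 1, prob = plogis(-1 + 1 * d[, "pos1_A"]))

# Fit one logistic regression per column; collect the estimate and p-value.
results <- do.call(rbind, lapply(colnames(d), function(feature) {
  fit <- glm(case ~ d[, feature], family = binomial())
  co <- summary(fit)$coefficients
  data.frame(
    feature = feature,
    estimate = co[2, "Estimate"],
    p = co[2, "Pr(>|z|)"]
  )
}))

# Show the features ordered by p-value.
results[order(results$p), ]
```

With real data, the dosage matrix would come from `dosage()` and the case status from your own sample metadata.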
# Citation
`hlabud` provides access to the data in the IMGT/HLA database. Therefore, if you use `hlabud`, please cite the IMGT/HLA paper:
- Robinson J, Barker DJ, Georgiou X, Cooper MA, Flicek P, Marsh SGE. [IPD-IMGT/HLA Database.](https://pubmed.ncbi.nlm.nih.gov/31667505/) Nucleic Acids Res. 2020;48: D948–D955. https://doi.org/10.1093/nar/gkz950
`hlabud` also provides access to the data in the Allele Frequency Net Database (AFND). Therefore, if you use `hlabud::hla_frequencies()`, please cite the AFND paper:
- Gonzalez-Galarza FF, McCabe A, Santos EJMD, Jones J, Takeshita L, Ortega-Rivera ND, et al. [Allele frequency net database (AFND) 2020 update: gold-standard data classification, open access genotype data and new query tools.](https://pubmed.ncbi.nlm.nih.gov/31722398) Nucleic Acids Res. 2020;48: D783–D788. https://doi.org/10.1093/nar/gkz1029
You can also cite the `hlabud` package like this:
- Slowikowski K. hlabud: HLA analysis in R. Zenodo. https://doi.org/10.5281/zenodo.11093557
# Learn more
I recommend this article for anyone new to HLA, because the beautiful figures
help to build intuition:
- La Gruta NL, Gras S, Daley SR, Thomas PG, Rossjohn J. [Understanding the drivers of MHC restriction of T cell receptors.](https://pubmed.ncbi.nlm.nih.gov/29636542/) Nat Rev Immunol. 2018;18: 467–478.
Learn about the conventions for HLA nomenclature:
- Marsh SGE, Albert ED, Bodmer WF, Bontrop RE, Dupont B, Erlich HA, et al. [Nomenclature for factors of the HLA system, 2010.](https://pubmed.ncbi.nlm.nih.gov/20356336/) Tissue Antigens. 2010;75: 291–455.
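
As a small illustration of the naming convention, an allele name such as `DRB1*03:01:05` consists of a gene name, an asterisk, and up to four colon-separated fields: the allele group, the specific protein, synonymous changes in the coding region, and differences in non-coding regions. The helper functions below are hypothetical (they are not part of hlabud) and show one way to split a name and truncate it to a lower resolution in base R. This sketch does not handle optional expression suffixes such as `N` or `L`, which are dropped whenever the last field is truncated away.

```r
# Split an HLA allele name into its gene and its colon-separated fields.
parse_hla <- function(allele) {
  parts <- strsplit(allele, "*", fixed = TRUE)[[1]]
  fields <- strsplit(parts[2], ":", fixed = TRUE)[[1]]
  list(gene = parts[1], fields = fields)
}

# Truncate an allele name to its first n fields (n = 2 gives two-field resolution).
truncate_hla <- function(allele, n = 2) {
  p <- parse_hla(allele)
  paste0(p$gene, "*", paste(head(p$fields, n), collapse = ":"))
}

parse_hla("DRB1*03:01:05")$fields     # "03" "01" "05"
truncate_hla("DRB1*03:01:05")         # "DRB1*03:01"
truncate_hla("DRB1*03:01:05", n = 1)  # "DRB1*03"
```

Truncating names this way is useful when grouping alleles, for example by their first field.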
# Related work
[HLAtools] is an R package that also makes IPD-IMGT/HLA resources available for analysis and works with BIGDAWG data formats.
[HLAtools]: https://github.com/sjmack/HLAtools
For case-control analysis of HLA genotype data, consider the
[BIGDAWG](https://CRAN.R-project.org/package=BIGDAWG) R package available on
CRAN. Here is the related article:
- Pappas DJ, Marin W, Hollenbach JA, Mack SJ. [Bridging ImmunoGenomic Data Analysis Workflow Gaps (BIGDAWG): An integrated case-control analysis pipeline.](https://pubmed.ncbi.nlm.nih.gov/26708359) Hum Immunol. 2016;77: 283–287.
[HATK] is a set of Python scripts for processing and analyzing IMGT-HLA data.
Here is the related article:
- Choi W, Luo Y, Raychaudhuri S, Han B. [HATK: HLA analysis toolkit.](https://pubmed.ncbi.nlm.nih.gov/32735319) Bioinformatics. 2021;37: 416–418. doi:10.1093/bioinformatics/btaa684
[HATK]: https://github.com/WansonChoi/HATK
The HLA divergence code in hlabud is a translation of the
[original Perl code](https://sourceforge.net/projects/granthamdist) by
[Tobias Lenz](https://orcid.org/0000-0002-7203-0044), which was published in:
- Pierini F, Lenz TL. [Divergent allele advantage at human MHC genes: signatures of past and ongoing selection.](https://pubmed.ncbi.nlm.nih.gov/29893875) Mol Biol Evol. 2018. doi:10.1093/molbev/msy116
[HLAdivR] is another R package for calculating HLA divergence.
[HLAdivR]: https://github.com/rbentham/HLAdivR/