# Data
Workspaces store data in two locations: the Workspace Bucket and the Persistent Disk.
The Workspace Bucket is a Google Cloud Storage bucket governed by the built-in AnVIL security policies. This durable and scalable storage location is suitable for both raw data and analysis outputs that need to be preserved or shared.
In contrast, Persistent Disks provide a working directory for Cloud Environments that run Jupyter, RStudio, and Galaxy. Input data can be localized to a Persistent Disk for analysis, while output data can be transferred to the Workspace Bucket for more reliable long-term storage.
```{r, echo=FALSE, fig.alt='Image shows a schematic of the data storage locations in an AnVIL Workspace. The Google Bucket is highlighted with a number "one" and the Persistent Disk is highlighted with a number "two". The `gsutil` command connects the two storage locations and allows users to copy data back and forth. The Persistent Disk is used by RStudio, Jupyter, and Galaxy. Data can also be copied to the Persistent Disk from another Workspace or SRA dataset.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf982a3b800_0_4")
```
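The round trip between the two storage locations can be sketched from a Cloud Environment terminal as follows. This is a minimal sketch, assuming the environment exports the bucket path as `WORKSPACE_BUCKET` (common in Terra-based Jupyter environments, but worth checking in yours). The fallback bucket name, the file names, and the `run` helper are hypothetical. By default the helper only prints each command, so you can review it before anything is copied; set `DRY_RUN=0` to execute the commands.

```shell
# Assumption: WORKSPACE_BUCKET holds the bucket path; otherwise use a placeholder.
BUCKET="${WORKSPACE_BUCKET:-gs://ab5-27x}"

# Hypothetical helper: print a command, and run it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# 1. Localize an input file from the Workspace Bucket to the Persistent Disk.
run gsutil cp "$BUCKET/test.bam" .

# 2. After the analysis, save an output file back to the Workspace Bucket.
run gsutil cp results.txt "$BUCKET/results/"
```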
Data Tables provide a way to organize data and metadata, including URI links to storage buckets. These tables are a convenient way to organize inputs for analyses and to track workflow outputs. More details can be found in the [Terra documentation](https://support.terra.bio/hc/en-us/sections/360004147951).
```{r, echo=FALSE, fig.alt='Image shows a schematic of the data storage locations in an AnVIL Workspace. The Data Table is highlighted with a number "three".'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf982a3c0cd_0_0")
```
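As a rough illustration, a Data Table can be described by a tab-separated file in which each row links an ID to a file in a bucket. The sketch below assumes Terra's load-file convention, in which the first column header takes the form `entity:sample_id`; please confirm the exact requirements in the Terra documentation. The sample IDs, the `bam` column name, and the paths under the placeholder bucket `gs://ab5-27x` are hypothetical.

```shell
# Write a two-row sample table to sample.tsv (all values are placeholders).
# Assumed convention: the first header cell is "entity:sample_id".
printf 'entity:sample_id\tbam\n'               >  sample.tsv
printf 'sample_1\tgs://ab5-27x/sample_1.bam\n' >> sample.tsv
printf 'sample_2\tgs://ab5-27x/sample_2.bam\n' >> sample.tsv

# Show the result; each row links a sample ID to a file in the bucket.
cat sample.tsv
```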
## Bring Your Own Data
### Overview {#bring-data-overview}
The starting point for bringing your own data to AnVIL is the Workspace Dashboard. At the bottom right, you'll find the Google Bucket information for your Workspace. Click the clipboard icon to copy the name of your Workspace Bucket. To see any uploaded files, click the "Open in browser" link.
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Workspace Dashboard. Google Bucket information, including the Google Bucket name, location, and "Open in browser" link, at the bottom right of the screen is highlighted.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf5172664d7_0_142")
```
You can also see any uploaded files by clicking the "Files" directory at the bottom left in the Data Tab.
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Workspace Data tab. The Files directory and link on the bottom left is highlighted.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf55fadc51c_0_3")
```
### Browser: Upload Single Files
Click the "Files" directory at the bottom left of the Data Tab. Then click the "+" button in the bottom right corner of the screen. This opens a file browser on your local machine.
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Workspace Data tab. The plus button on the bottom right corner of the screen is highlighted.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf55fadc51c_0_12")
```
### Browser: Upload Folders
Click the ["Open in browser"](#bring-data-overview) link on the bottom right of the Workspace Dashboard Tab. This will open a new browser window or tab directed to your Workspace's Google Bucket on the Google Cloud Platform.
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Workspace Dashboard. The "Open in browser" link at the bottom right of the screen is highlighted.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf57004a098_0_1")
```
Here, you can upload files and manage your data and folders. You can also upload an entire folder by clicking on "UPLOAD FOLDER".
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Workspace Google Bucket on the Google Cloud Platform. The "UPLOAD FOLDER" button is highlighted.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf57004a098_0_9")
```
### `gsutil`: Local to Cloud
`gsutil` is a Python application that lets you access Cloud Storage from the command line. You can run it in a terminal on your local machine or in the terminal built into the Workspace Cloud Environment.
#### Install `gsutil` on Your Local Computer or Local Server
[Cloud SDK](https://cloud.google.com/sdk/docs) is a set of tools that you can use to manage resources and applications hosted on Google Cloud. These tools include the `gsutil` command-line tool.
1. Ensure you have a terminal available.
    - macOS and Linux users have a terminal application available by default. Terminal applications are also available through third-party software, such as RStudio.
    - Windows users should download a terminal application, such as [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html).
1. Install Cloud SDK following the appropriate link below:
    - [Windows](https://cloud.google.com/sdk/docs/install#windows)
    - [macOS](https://cloud.google.com/sdk/docs/install#mac)
    - [Linux](https://cloud.google.com/sdk/docs/install#linux)
1. Test that Cloud SDK has been successfully installed by typing `gsutil` in the terminal application prompt:
    ```
    gsutil
    ```
    If the installation was successful, you should see information about using `gsutil` that looks like the following:
    ```
    Usage: gsutil [-D] [-DD] [-h header]... [-i service_account] [-m] [-o] [-q] [-u user_project] [command [opts...] args...]
    ```
    If the installation was not successful, you should see a message that `gsutil` was not found. Please return to the installation steps to ensure they have been completed correctly.
    ```
    command not found: gsutil
    ```
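The installation check above can also be scripted. The sketch below uses the standard shell built-in `command -v` to test whether `gsutil` is on your `PATH`; the `STATUS` variable is only for illustration. Once `gsutil` is installed, you will typically also need to authorize the Cloud SDK with your Google account (for example, with `gcloud auth login`) before it can access a Workspace Bucket.

```shell
# Look for gsutil on the PATH without triggering a "command not found" error.
if command -v gsutil >/dev/null 2>&1; then
  STATUS="installed"
  gsutil version   # prints the installed gsutil version
else
  STATUS="missing"
fi
echo "gsutil is $STATUS"
```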
#### Copy Files From Your Local Computer to a Workspace Bucket
The `gsutil cp` command copies data from one location to another. In your local machine's terminal, use the command in the following format:
```
gsutil cp where_to_copy_data_from/filename where_to_copy_data_to
```
**Example:** To copy the file `test.bam` located on your local computer at `users/name/data/` into the Workspace Bucket `gs://ab5-27x` on the cloud:
```
gsutil cp users/name/data/test.bam gs://ab5-27x
```
Remember that you can easily copy the Workspace Bucket ID using the clipboard button on the [Workspace Dashboard](#bring-data-overview). Please see the [`gsutil cp` documentation](https://cloud.google.com/storage/docs/gsutil/commands/cp) for more details, such as how to run parallel (multi-threaded/multi-processing) copies or copy an entire directory tree. The `gsutil cp` command can also be used to copy files from one Workspace Bucket to another (cloud-to-cloud copying).
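
The options mentioned above can be sketched as follows. The `-m` flag requests parallel copying and `-r` copies a directory tree recursively; both are described in the `gsutil` documentation. The second bucket name `gs://cd7-42y` is hypothetical, and the folder and file names reuse the placeholders from the earlier example. The commands are stored in variables and printed first, so nothing is transferred until you paste a printed command into the terminal.

```shell
SRC_DIR="users/name/data"        # local folder from the example above
BUCKET="gs://ab5-27x"            # placeholder Workspace Bucket
OTHER_BUCKET="gs://cd7-42y"      # hypothetical second Workspace Bucket

# Copy the whole folder, in parallel (-m) and recursively (-r).
UPLOAD_CMD="gsutil -m cp -r $SRC_DIR $BUCKET"

# Copy one file directly between buckets (cloud-to-cloud).
TRANSFER_CMD="gsutil cp $BUCKET/test.bam $OTHER_BUCKET/"

# Print both commands for review.
echo "$UPLOAD_CMD"
echo "$TRANSFER_CMD"
```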
<!------------------------------------------>
## Analyze Existing Data
In addition to bringing your own data, you can use existing data on AnVIL. The following resources can help you discover data to use in your analyses.
### AnVIL Data Library
The [Datasets Library](https://anvil.terra.bio/#library/datasets) is a good place to get started and familiarize yourself with existing data. Here, you can find curated datasets from thousands of participants. Some of these are open access (such as the [1000 Genomes dataset](https://anvil.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019)) while others will require you to request access.
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Datasets Library landing page.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf5fa6f264a_0_15")
```
Taking a look at [Featured Workspaces](https://anvil.terra.bio/#library/showcase) can get you started quickly. Remember that when you [clone a Workspace](workspaces.html#clone-workspace), AnVIL automatically cross-links to the original data contained within the Data Tables.
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Featured Workspaces landing page.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf5fa6f264a_0_24")
```
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Featured Workspaces tab on AnVIL. The featured tab is highlighted.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf5fa6f264a_0_7")
```
### AnVIL Dataset Catalog
The [AnVIL Dataset Catalog](https://anvilproject.org/data) displays key NHGRI datasets accessible in AnVIL, such as CCDG (Centers for Common Disease Genomics), CMG (Centers for Mendelian Genomics), and eMERGE (Electronic Medical Records and Genomics), as well as other relevant datasets. You will need to [coordinate access](https://anvilproject.org/learn/accessing-data/requesting-data-access) to controlled data.
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the AnVIL Dataset Catalog website landing page.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf5fa6f264a_0_31")
```
### Gen3 Data Explorer
The [Gen3 Data Explorer and Data Commons](https://gen3.theanvil.io/) provides an API for data queries and downloads, supporting cross-project analyses. Gen3 provides access to open and protected datasets that can be exported to an AnVIL Workspace. For example, users can find the 1000 Genomes dataset on Gen3 and filter by ancestry, age, and other features prior to performing analyses on AnVIL.
```{r, echo=FALSE, fig.alt='Image shows a screenshot of the Gen3 on AnVIL Data Explorer website landing page.'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1H5onDH7cBLK2m7fCcJ6ZodAAQ3wtJO8tNc2rwptrTPM/edit#slide=id.gf5fa6f264a_0_40")
```