Commit 29f1440

Merge pull request #180 from uclahs-cds/nzeltser-update-spark-tempdir
Set default spark tempdir param and add checks
2 parents a3ed825 + 2375e4e commit 29f1440

4 files changed: +16 -6 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -13,9 +13,11 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 - Standardize output and log directory structure
 - Update index file extension from all processes to .bam.bai
 - Standardize config files
+- Remove spark_temp_dir parameter from config template

 ### Added
 - Intermediate file removal
+- Spark tempdir permission checks

 ## [7.3.1] - 2022-01-14
 ### Changed

README.md

Lines changed: 2 additions & 1 deletion
@@ -165,7 +165,8 @@ After marking dup BAM files, the BAM files are then indexed by utilizing Picard
 | `cache_intermediate_pipeline_steps` | yes | boolean | Enable cahcing to resume pipeline and the end of the last successful process completion when a pipeline fails (if true the default submission script must be modified). |
 | `mark_duplicates` | no | boolean | Disable processes which mark duplicates. When false, the pipeline stops at the sorting step, outputting a sorted, indexed, unmerged BAM with unmarked duplicates. Recommended for high coverage targeted panel sequencing datasets. Defaults as true to mark duplicates as usual.|
 | `enable_spark` | yes | boolean | Enable use of Spark processes. When true, `MarkDuplicatesSpark` will be used. When false, `MarkDuplicates` will be used. Default value is true. |
-| `spark_temp_dir` | yes | path | Path to temp dir for Spark processes. Defaults to `/scratch`. |
+| `spark_temp_dir` | no | path | Path to temp dir for Spark processes. When included in the sample config file, Spark intermediate files will be saved to this directory. Defaults to `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively.|
+| `work_dir` | no | path | Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds, the default is `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively. |
 | `max_number_of_parallel_jobs` | no | int | The maximum number of jobs or steps of the pipeline that can be ran in parallel. Default is 1. Be very cautious setting this to any value larger than 1, as it may cause out-of-memory error. It may be helpful when running on a big memory computing node. |
 | `bwa_mem_number_of_cpus` | no | int | Number of cores to use for BWA-MEM2. If not set, this will be calculated to ensure at least 2.5Gb memory per core. |
 | `blcds_registered_dataset_input` | yes | boolean | Input FASTQs are from the Boutros Lab data registry. |

pipeline/config/methods.config

Lines changed: 12 additions & 3 deletions
@@ -217,21 +217,30 @@ methods {
         if (params.ucla_cds) {
             /**
              * By default, if the /scratch directory exists, set it as the Nextflow working directory
+             * and Spark temp directory.
              * If config file specified work_dir, set it as the Nextflow working directory
+             * If config file specified spark_temp_dir, set it as the Spark temp directory
              *
-             * WARNING: changing this directory can lead to high server latency and
-             * potential disk space limitations. Change with caution! The 'workDir'
-             * in Nextflow determines the location of intermediate and temporary files.
+             * WARNING: changing these directories can lead to high server latency and
+             * potential disk space limitations. Change with caution! Handles creation of
+             * directories which don't already exist e.g. '/scratch/test/'
+             * The 'workDir' in Nextflow determines the location of intermediate and temporary files.
              */
             params.work_dir = (params.containsKey('work_dir') && params.work_dir) ? params.work_dir : '/scratch'
             if (methods.check_workdir_permissions(params.work_dir)) {
                 workDir = params.work_dir
             }
+
+            params.spark_temp_dir = (params.containsKey('spark_temp_dir') && params.spark_temp_dir && methods.check_workdir_permissions(params.spark_temp_dir)) ? params.spark_temp_dir : '/scratch'
+
         } else {
             // If work_dir was specified as a param and exists or can be created, set workDir. Otherwise, let Nextflow's default behavior dictate workDir
             if (params.containsKey('work_dir') && params.work_dir && methods.check_workdir_permissions(params.work_dir)) {
                 workDir = params.work_dir
             }
+
+            // If spark_temp_dir was specified as a param and exists or can be created, set as spark tempdir. Otherwise, set as workDir.
+            params.spark_temp_dir = (params.containsKey('spark_temp_dir') && params.spark_temp_dir && methods.check_workdir_permissions(params.spark_temp_dir)) ? params.spark_temp_dir : workDir
         }
     }
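The fallback rule in this hunk can be read in isolation: use `spark_temp_dir` only if it is set, non-empty, and passes the permission check, otherwise use a fallback (`/scratch` under `ucla_cds`, `workDir` otherwise). The standalone Groovy sketch below illustrates that rule under stated assumptions. `checkPermissions` is a hypothetical stand-in for `methods.check_workdir_permissions`, whose implementation is not part of this commit, and unlike the real method (per the comment above) it does not create missing directories. `resolveSparkTempDir` is likewise an illustrative name, not a function in the pipeline.

```groovy
// Hypothetical stand-in for methods.check_workdir_permissions.
// Assumption: a path is usable if its nearest existing ancestor is a
// writable directory. This sketch does NOT create missing directories.
def checkPermissions = { String dir ->
    def f = new File(dir)
    while (f != null && !f.exists()) {
        f = f.parentFile
    }
    return f != null && f.isDirectory() && f.canWrite()
}

// Mirrors the ternary in the diff: key present, value truthy, check passes.
def resolveSparkTempDir = { Map params, String fallback ->
    (params.containsKey('spark_temp_dir')
        && params.spark_temp_dir
        && checkPermissions(params.spark_temp_dir)) ? params.spark_temp_dir : fallback
}

// No spark_temp_dir in the config: the fallback is used.
println resolveSparkTempDir([:], '/scratch')

// An empty string is falsy in Groovy, so the fallback is used here too.
println resolveSparkTempDir([spark_temp_dir: ''], '/scratch')
```

Passing the fallback as an argument keeps the two branches of the diff (`'/scratch'` versus `workDir`) expressible with one helper. The commit itself inlines the expression in each branch.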

pipeline/config/template.config

Lines changed: 0 additions & 2 deletions
@@ -36,8 +36,6 @@ params {
     // Spark options
     // By default, the Spark process MarkDuplicatesSpark will be used. Set to false to disable Spark process and use MarkDuplicates (Picard) instead
     enable_spark = true
-    // Default Spark temp dir is /scratch. Update if necessary
-    spark_temp_dir = "/scratch"

     // set to true if the data input fastq files are registered in the Boutros Lab.
     blcds_registered_dataset_input = false
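
With `spark_temp_dir` removed from the template, a run picks up the default unless the sample config sets the parameter explicitly. As a rough illustration, an override for testing might look like the fragment below. The path `/scratch/test` is only an example (borrowed from the comment in `methods.config`), not a project default, and the README change in this commit advises against pointing either directory at `/hot` or `/tmp`.

```groovy
params {
    enable_spark = true

    // Optional overrides, intended for testing/development only.
    // '/scratch/test' is an illustrative path, not a project default.
    spark_temp_dir = '/scratch/test'
    work_dir = '/scratch/test'
}
```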
