feat: Move functions to package (#2)

* feat: add an R package file structure * build: add chrombinarize R package to renv * feat: move the creation of bed files over to package * refactor: Force wrapper to handle processing of data for create_bins feat: add ability to specify which chromosomes to create blanks for docs: update documentation to accomodate changes * docs: update documentation to be actually correct * refactor: rename function to describe what it does better The main function here doesn't create files, it just creates the data that can be put into files. docs: update docs to reflect refactor * style: fix indentation * feat: add tests for creation of blank bed data * refactor: use a much smaller Rds file to test functionality * test: add test for processing of chromosomes length file * test: use chromosome Y for even smaller RDS file * fix: use new function name * fix: use bin_size * feat: add license for R package * refactor: move functions to package * refactor: set names upon file read * docs: add documentation for error rate estimation function * refactor: remove 'reverse_binomial' function This wasn't really needed. It can all just be done inside of the dplyr pipe * refactor: define column names on reading the file * refactor: prefer read depth over sample size * docs: update documentation * test: add tests for estimate_error_rate * feat: move assessment of dense methylation to package * test: add tests for incorrect file reading refactor: move file reading to separate function * test: Find errors with incorrect file specification feat: add extra checking of column classes * refactor: move verification of column classes to separate file * docs: add docs page for verify column class * feat: add a function that reads methylation data This functionality is used in lots of files. This should help massively with consistency in column names * docs: add documentation to function * feat: upgrade column verification to ensure name is m or h only * docs: add details on why errors are thrown. * refactor: rename name column to mark_name This is more indicative of what it holds * refactor: use read_methylation_data function * feat: add a read bedmethyl function * refactor: consistent function names the function reads in a bedmethyl file * refactor: use new read_comparison_bedmethyl function * refactor: More understandable way of combining "m" with "h" Also the previous way didn't add read depth or account for situations where the simple addition of percentages resulted in something above 100%. * docs: strand is not in comparison_bedmethyl data * feat: add comparison plotting functions to package docs: add Roxygen comments * refactor: Add filtering of data to all plotting functions * refactor: use match args to enforce mark input * fix: use correct column names * refactor: use chrombinarize functionality * fix: remove old filtering step * refactor: move combining 5hmC and 5mC signal to package * style: section naming * fix: update plot axes names to be general Can be BS, WGBS or oxBS. Easiest to just say BS as this encompasses all three * refactor: Make error estimating function more general Now takes methylation data and filters it for you. Instead of forcing the user to create a good reference set first. Obviously, they still have the ability to filter to CGIs, but I can't account for this easily in R. * docs: make return field nicer formatting * feat: add erroneous rate plotting to package * refactor: use chrombinarize functions refactor: remove percent script (no longer needed) * docs: make example output more believeable * feat: move cpg robustness to package feat: acutally find the surrounding methylation in a radius Previously this code just found the methylation of neighbouring CpGs, which ain't good enough. There may be a small performance increase as well (though I don't know how much by). * fix: further filter to chromosome Previously, if a CpG is close in position from `start` across two different chromosomes, they would be compared. This is obviously wrong. * docs: add documentation for CPG robustness functions * feat: add defaults defaults match those used in config file * refactor: use functions from package * docs: add return * test: add initial tests * test: fix tests for estimating error rate * feat: suppress warnings in favour of my own stopping Warnings are messing up tests * fix: R scripts expect "m" or "h" only It is easier this way than to allow "mh" as well. * feat: suppress warnings * test: add tests for reading in bedmethyl files * refactor: remove unnecessary c() calls * docs: update * feat: add default values to be consistent * test: add test for error rate plot data * fix: use or instead of and Previous conditional logic on the start column would always result in an empty data frame * test: add test for cpg robustness functions * test: add tests for adding absolute change columns * feat: add checks for nonsensicle rows Rows should have positive region sizes. No negative numbers should exist and percent methylation should be bounded above by 100 * fix: ensure only numerical columns are checked Certain strings can be less than 0 due to how the comparison works. This was a problem with strand for example as "." < 0 returns TRUE * test: add tests for bad methylation data files Negative values, negative regions and over 100% methylation * chore: add comments to explain why errors are expected * feat: move comments to info argument * refactor: use inheritParams more to reduce repetition * refactor: commit to bedmethyl over methylbed * docs: update docs to say what bedmethyl format is * refactor: read_bedmethyl ensures these columns are numeric These extra lines are therefore useless. * chore: update version * fix: person information given in wrong order * build: add new chrombinarize package to renv * docs: add proper example use case * chore: conform to linter's indentation * docs: convert title to title case * feat: add checks for column types * feat: add check for chromosome name being "chrxyz" * docs: make params consistent with others * docs: clear up inconsistencies and bad wording * feat: ensure max read depth isn't too high If max read depth is too high, you'll just end up with lots of NaN values. This ensures that doesn't happen * docs: clear up bad wording * docs: update man pages * build: add workflow for testing package * fix: workflow could not install devtools * build: update name of task * chore: extra plotting libraries not actually used * refactor: consistent naming * refactor: change shit up according to cmdcheck * build: try again with the workflow * build: try again
sof202 · Sep 19, 2024 · 2fb2dc8 · 2fb2dc8
1 parent 333b588
commit 2fb2dc8
Show file tree

Hide file tree

Showing 96 changed files with 4,074 additions and 790 deletions.
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -0,0 +1,31 @@
+name: test-chrombinarize
+
+on:
+  push:
+    branches: [ "main" ]
+  pull_request:
+    branches: [ "main" ]
+  workflow_dispatch:
+
+permissions:
+  contents: read
+jobs:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up R
+        uses: r-lib/actions/setup-r@v2
+
+      - name: Install dependencies
+        uses: r-lib/actions/setup-r-dependencies@v2
+        working-directory: chrombinarize
+
+      - name: Run tests
+        run: |
+          library(testthat)
+          library(chrombinarize)
+          test_dir("tests/testthat")
+        working-directory: chrombinarize
diff --git a/JobSubmission/BS-Seq/1_binarize_WGBS_data.sh b/JobSubmission/BS-Seq/1_binarize_WGBS_data.sh
@@ -56,8 +56,8 @@ processing_directory="${BASE_DIR}/WGBS_5mc_5hmc"
 rm -rf "${processing_directory}"
 mkdir -p "${processing_directory}"
 
-purification_convertBSBedToMethylBedFormat \
-  "mh" \
+purification_convertBSBedToBedmethylFormat \
+  "m" \
   "${WGBS_bed_file_location}" \
   "${processing_directory}/formatted.bed"
 
@@ -70,21 +70,16 @@ if [[ -n "${use_cpg_islands}" ]]; then
     "${processing_directory}/formatted.bed"
 fi
 
-purification_extractSitesWithLowMethylation \
-  "mh" \
-  "${processing_directory}/formatted.bed" \
-  "${processing_directory}/unmethylated_reads.bed"
-purification_filterOnReadDepth \
-  "mh" \
-  "${processing_directory}/formatted.bed" \
-  "${processing_directory}/filtered_reads.bed"
 purification_calculateSiteMethylationProbability \
-  "${processing_directory}" \
-  "unmethylated_reads.bed" \
-  "filtered_reads.bed" \
+  "m" \
+  "${processing_directory}/formatted.bed" \
   "processed_reads.bed"
-purification_removeDeterminedUnmethylatedSites \
+purification_filterOnReadDepth \
+  "m" \
   "${processing_directory}/processed_reads.bed" \
+  "${processing_directory}/filtered_reads.bed" 
+purification_removeDeterminedUnmethylatedSites \
+  "${processing_directory}/filtered_reads.bed" \
   "${processing_directory}/purified_reads.bed"
 
 ## ======================== ##

diff --git a/JobSubmission/BS-Seq/2_binarize_oxBS_5mC_data.sh b/JobSubmission/BS-Seq/2_binarize_oxBS_5mC_data.sh
@@ -56,7 +56,7 @@ processing_directory="${BASE_DIR}/oxBS_5mc"
 rm -rf "${processing_directory}"
 mkdir -p "${processing_directory}"
 
-purification_convertBSBedToMethylBedFormat \
+purification_convertBSBedToBedMethylFormat \
   "m" \
   "${oxBS_bed_file_location}" \
   "${processing_directory}/formatted.bed"
@@ -70,21 +70,16 @@ if [[ -n "${use_cpg_islands}" ]]; then
     "${processing_directory}/formatted.bed"
 fi
 
-purification_extractSitesWithLowMethylation \
+purification_calculateSiteMethylationProbability \
   "m" \
   "${processing_directory}/formatted.bed" \
-  "${processing_directory}/unmethylated_reads.bed"
+  "processed_reads.bed"
 purification_filterOnReadDepth \
   "m" \
-  "${processing_directory}/formatted.bed" \
-  "${processing_directory}/filtered_reads.bed"
-purification_calculateSiteMethylationProbability \
-  "${processing_directory}" \
-  "unmethylated_reads.bed" \
-  "filtered_reads.bed" \
-  "processed_reads.bed"
-purification_removeDeterminedUnmethylatedSites \
   "${processing_directory}/processed_reads.bed" \
+  "${processing_directory}/filtered_reads.bed" 
+purification_removeDeterminedUnmethylatedSites \
+  "${processing_directory}/filtered_reads.bed" \
   "${processing_directory}/purified_reads.bed"
 
 ## ======================== ##

diff --git a/JobSubmission/BS-Seq/3_binarize_pure_5hmC_data.sh b/JobSubmission/BS-Seq/3_binarize_pure_5hmC_data.sh
@@ -128,7 +128,7 @@ awk \
   "${processing_directory}/WGBS_oxBS_combined.bed" > \
   "${processing_directory}/WGBS_5mC_removed.bed"
 
-purification_convertBSBedToMethylBedFormat \
+purification_convertBSBedToBedMethylFormat \
   "h" \
   "${processing_directory}/WGBS_5mC_removed.bed" \
   "${processing_directory}/purified_reads.bed"

diff --git a/JobSubmission/ONT/1_purify_reads.sh b/JobSubmission/ONT/1_purify_reads.sh
@@ -63,21 +63,16 @@ else
   reference_set_base="${ONT_bed_file_location}"
 fi
 
-purification_extractSitesWithLowMethylation \
+purification_calculateSiteMethylationProbability \
   "m" \
   "${reference_set_base}" \
-  "${processing_directory}/unmethylated_reads.bed"
+  "processed_reads.bed"
 purification_filterOnReadDepth \
   "m" \
-  "${reference_set_base}" \
+  "${processing_directory}/processed_reads.bed" \
   "${processing_directory}/filtered_reads.bed" 
-purification_calculateSiteMethylationProbability \
-  "${processing_directory}" \
-  "unmethylated_reads.bed" \
-  "filtered_reads.bed" \
-  "processed_reads.bed"
 purification_removeDeterminedUnmethylatedSites \
-  "${processing_directory}/processed_reads.bed" \
+  "${processing_directory}/filtered_reads.bed" \
   "${processing_directory}/purified_reads.bed"
 
 ## ===== ##
@@ -96,19 +91,14 @@ else
   reference_set_base="${ONT_bed_file_location}"
 fi
 
-purification_extractSitesWithLowMethylation \
+purification_calculateSiteMethylationProbability \
   "h" \
   "${reference_set_base}" \
-  "${processing_directory}/unmethylated_reads.bed"
+  "processed_reads.bed"
 purification_filterOnReadDepth \
   "h" \
-  "${reference_set_base}" \
+  "${processing_directory}/processed_reads.bed" \
   "${processing_directory}/filtered_reads.bed" 
-purification_calculateSiteMethylationProbability \
-  "${processing_directory}" \
-  "unmethylated_reads.bed" \
-  "filtered_reads.bed" \
-  "processed_reads.bed"
 purification_removeDeterminedUnmethylatedSites \
-  "${processing_directory}/processed_reads.bed" \
+  "${processing_directory}/filtered_reads.bed" \
   "${processing_directory}/purified_reads.bed"
diff --git a/JobSubmission/supplementary/a_erroneous_rate_plot.sh b/JobSubmission/supplementary/a_erroneous_rate_plot.sh
@@ -42,8 +42,9 @@ move_log_files pvalues
 
 number_of_columns=$(awk '{print NF; exit}' "${bed_file_location}")
 
+# If the file has 5 columns, it is not in BEDmethyl format.
 if [[ "${number_of_columns}" -eq 5 ]]; then
-    purification_convertBSBedToMethylBedFormat \
+    purification_convertBSBedToBedMethylFormat \
       "${mark}" \
       "${bed_file_location}" \
       "${BASE_DIR}/converted.bed"
@@ -58,22 +59,12 @@ fi
 mkdir -p "${BASE_DIR}/plots/"
 
 conda activate ChromBinarize-R
-if [[ "${run_type}" == "N" ]]; then
-  Rscript "${RSCRIPT_DIR}/erroneous_rate_plot_N.R" \
-    "${REPO_DIR}" \
-    "${bed_file_location}" \
-    "${mark}" \
-    "${max_N_value}" \
-    "${plot_type}" \
-    "${BASE_DIR}/plots/erroneous_rate_plot_${plot_type}_${run_type}_x_axis.png"
-elif [[ "${run_type}" == "percent" ]]; then
-  Rscript "${RSCRIPT_DIR}/erroneous_rate_plot_percent.R" \
-    "${REPO_DIR}" \
-    "${bed_file_location}" \
-    "${mark}" \
-    "${plot_type}" \
-    "${BASE_DIR}/plots/erroneous_rate_plot_${plot_type}_${run_type}_x_axis.png"
-fi
+Rscript "${RSCRIPT_DIR}/erroneous_rate_plot_N.R" \
+  "${REPO_DIR}" \
+  "${bed_file_location}" \
+  "${mark}" \
+  "${max_N_value}" \
+  "${BASE_DIR}/plots/erroneous_rate_plot.png"
 conda deactivate
 
 if [[ "${number_of_columns}" -eq 5 ]]; then

diff --git a/JobSubmission/supplementary/b_CpG_robustness.sh b/JobSubmission/supplementary/b_CpG_robustness.sh
@@ -47,7 +47,7 @@ mkdir -p "${filtered_reads_directory}"
 
 number_of_columns=$(awk '{print NF; exit}' "${bed_file_location}")
 if [[ "${number_of_columns}" -eq 5 ]]; then
-    purification_convertBSBedToMethylBedFormat \
+    purification_convertBSBedToBedMethylFormat \
       "${mark}" \
       "${bed_file_location}" \
       "${filtered_reads_directory}/converted.bed"

diff --git a/Rscripts/CpG_robustness.R b/Rscripts/CpG_robustness.R
@@ -8,117 +8,16 @@ output_file <- args[5]
 renv::load(renv_environment)
 library(ggplot2)
 
-## ===================== ##
-##   DATA MANIPULATION   ##
-## ===================== ##
-
-read_methylation_data <- function(bed_file_location) {
-  methylation_data <- data.table::fread(bed_file_location)
-  names(methylation_data) <-
-    c("Chr", "start", "end", "name", "size", "strand", "percent_methylation")
-  return(methylation_data)
-}
-
-add_lead_and_lag <- function(methylation_data) {
-  methylation_data <- dplyr::mutate(
-    methylation_data,
-    "leading_percent_methylation" = dplyr::lead(percent_methylation),
-    "leading_start_position" = dplyr::lead(start),
-    "lagging_percent_methylation" = dplyr::lag(percent_methylation),
-    "lagging_start_position" = dplyr::lag(start)
-  )
-  return(methylation_data)
-}
-
-exclude_distant_neighbours <- function(start,
-                                       neighbour_start,
-                                       max_distance,
-                                       percent_methylation) {
-  distance <- abs(start - neighbour_start)
-  return(ifelse(
-    distance < max_distance && distance > min_distance,
-    percent_methylation,
-    NA
-  ))
-}
-
-calculate_mean_methylation <- function(lead_methylation, lag_methylation) {
-  return(mean(c(lead_methylation, lag_methylation), na.rm = TRUE))
-}
-
-## ============ ##
-##   PLOTTING   ##
-## ============ ##
-
-create_cpg_robustness_plot <- function(methylation_data) {
-  methylation_correlation <- cor(
-    methylation_data[["percent_methylation"]],
-    methylation_data[["average_surrounding_methylation"]]
-  )
-  cpg_robustness_plot <-
-    ggplot(
-      methylation_data,
-      aes(x = percent_methylation, y = average_surrounding_methylation)
-    ) +
-    geom_bin2d(bins = 100) +
-    scale_fill_continuous(type = "viridis") +
-    geom_abline(intercept = 0, slope = 1, color = "black") +
-    labs(
-      x = "Percent methylation",
-      y = "Average surrounding percent methylation"
-    ) +
-    scale_fill_gradientn(colors = c("grey", "red", "yellow")) +
-    annotate(
-      "text",
-      x = 95,
-      y = 5,
-      label = paste0("R^2: ", methylation_correlation)
-    ) +
-    theme_bw()
-
-  return(cpg_robustness_plot)
-}
-
 ## ======== ##
 ##   MAIN   ##
 ## ======== ##
 
-methylation_data <- read_methylation_data(bed_file_location)
-
-methylation_data <-
-  add_lead_and_lag(methylation_data) |>
-  dplyr::rowwise() |>
-  dplyr::mutate(
-    "true_lead_percent_methylation" = exclude_distant_neighbours(
-      start,
-      leading_start_position,
-      max_distance,
-      leading_percent_methylation
-    )
-  ) |>
-  dplyr::mutate(
-    "true_lag_percent_methylation" = exclude_distant_neighbours(
-      start,
-      lagging_start_position,
-      max_distance,
-      lagging_percent_methylation
-    )
-  ) |>
-  dplyr::mutate(
-    "average_surrounding_methylation" = calculate_mean_methylation(
-      true_lead_percent_methylation,
-      true_lag_percent_methylation
-    )
-  ) |>
-  dplyr::filter(
-    !is.na(average_surrounding_methylation)
-  )
+methylation_data <- chrombinarize::read_bedmethyl(bed_file_location)
 
-cpg_robustness_plot <- create_cpg_robustness_plot(methylation_data)
-write.table(
+cpg_robustness_plot <- chrombinarize::create_cpg_robustness_plot(
   methylation_data,
-  quote = FALSE,
-  row.names = FALSE
+  min_distance,
+  max_distance
 )
 
 # needed on some servers to actually create png files