Skip to content

Commit

Permalink
feat: Move functions to package (#2)
Browse files Browse the repository at this point in the history
* feat: add an R package file structure

* build: add chrombinarize R package to renv

* feat: move the creation of bed files over to package

* refactor: Force wrapper to handle processing of data for create_bins

feat: add ability to specify which chromosomes to create blanks for

docs: update documentation to accomodate changes

* docs: update documentation to be actually correct

* refactor: rename function to describe what it does better

The main function here doesn't create files, it just creates the data
that can be put into files.

docs: update docs to reflect refactor

* style: fix indentation

* feat: add tests for creation of blank bed data

* refactor: use a much smaller Rds file to test functionality

* test: add test for processing of chromosomes length file

* test: use chromosome Y for even smaller RDS file

* fix: use new function name

* fix: use bin_size

* feat: add license for R package

* refactor: move functions to package

* refactor: set names upon file read

* docs: add documentation for error rate estimation function

* refactor: remove 'reverse_binomial' function

This wasn't really needed. It can all just be done inside of the dplyr
pipe

* refactor: define column names on reading the file

* refactor: prefer read depth over sample size

* docs: update documentation

* test: add tests for estimate_error_rate

* feat: move assessment of dense methylation to package

* test: add tests for incorrect file reading

refactor: move file reading to separate function

* test: Find errors with incorrect file specification

feat: add extra checking of column classes

* refactor: move verification of column classes to separate file

* docs: add docs page for verify column class

* feat: add a function that reads methylation data

This functionality is used in lots of files. This should help massively
with consistency in column names

* docs: add documentation to function

* feat: upgrade column verification to ensure name is m or h only

* docs: add details on why errors are thrown.

* refactor: rename name column to mark_name

This is more indicative of what it holds

* refactor: use read_methylation_data function

* feat: add a read bedmethyl function

* refactor: consistent function names

the function reads in a bedmethyl file

* refactor: use new read_comparison_bedmethyl function

* refactor: More understandable way of combining "m" with "h"

Also the previous way didn't add read depth or account for situations
where the simple addition of percentages resulted in something above
100%.

* docs: strand is not in comparison_bedmethyl data

* feat: add comparison plotting functions to package

docs: add Roxygen comments

* refactor: Add filtering of data to all plotting functions

* refactor: use match args to enforce mark input

* fix: use correct column names

* refactor: use chrombinarize functionality

* fix: remove old filtering step

* refactor: move combining 5hmC and 5mC signal to package

* style: section naming

* fix: update plot axes names to be general

Can be BS, WGBS or oxBS. Easiest to just say BS as this encompasses all
three

* refactor: Make error estimating function more general

Now takes methylation data and filters it for you. Instead of forcing
the user to create a good reference set first.

Obviously, they still have the ability to filter to CGIs, but I can't
account for this easily in R.

* docs: make return field nicer formatting

* feat: add erroneous rate plotting to package

* refactor: use chrombinarize functions

refactor: remove percent script (no longer needed)

* docs: make example output more believeable

* feat: move cpg robustness to package

feat: acutally find the surrounding methylation in a radius

Previously this code just found the methylation of neighbouring CpGs,
which ain't good enough. There may be a small performance increase as
well (though I don't know how much by).

* fix: further filter to chromosome

Previously, if a CpG is close in position from `start` across two
different chromosomes, they would be compared. This is obviously wrong.

* docs: add documentation for CPG robustness functions

* feat: add defaults

defaults match those used in config file

* refactor: use functions from package

* docs: add return

* test: add initial tests

* test: fix tests for estimating error rate

* feat: suppress warnings in favour of my own stopping

Warnings are messing up tests

* fix: R scripts expect "m" or "h" only

It is easier this way than to allow "mh" as well.

* feat: suppress warnings

* test: add tests for reading in bedmethyl files

* refactor: remove unnecessary c() calls

* docs: update

* feat: add default values to be consistent

* test: add test for error rate plot data

* fix: use or instead of and

Previous conditional logic on the start column would always result in an
empty data frame

* test: add test for cpg robustness functions

* test: add tests for adding absolute change columns

* feat: add checks for nonsensicle rows

Rows should have positive region sizes. No negative numbers should exist
and percent methylation should be bounded above by 100

* fix: ensure only numerical columns are checked

Certain strings can be less than 0 due to how the comparison works. This
was a problem with strand for example as "." < 0 returns TRUE

* test: add tests for bad methylation data files

Negative values, negative regions and over 100% methylation

* chore: add comments to explain why errors are expected

* feat: move comments to info argument

* refactor: use inheritParams more to reduce repetition

* refactor: commit to bedmethyl over methylbed

* docs: update docs to say what bedmethyl format is

* refactor: read_bedmethyl ensures these columns are numeric

These extra lines are therefore useless.

* chore: update version

* fix: person information given in wrong order

* build: add new chrombinarize package to renv

* docs: add proper example use case

* chore: conform to linter's indentation

* docs: convert title to title case

* feat: add checks for column types

* feat: add check for chromosome name being "chrxyz"

* docs: make params consistent with others

* docs: clear up inconsistencies and bad wording

* feat: ensure max read depth isn't too high

If max read depth is too high, you'll just end up with lots of NaN
values. This ensures that doesn't happen

* docs: clear up bad wording

* docs: update man pages

* build: add workflow for testing package

* fix: workflow could not install devtools

* build: update name of task

* chore: extra plotting libraries not actually used

* refactor: consistent naming

* refactor: change shit up according to cmdcheck

* build: try again with the workflow

* build: try again
  • Loading branch information
sof202 authored Sep 19, 2024
1 parent 333b588 commit 2fb2dc8
Show file tree
Hide file tree
Showing 96 changed files with 4,074 additions and 790 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
name: test-chrombinarize

on:
push:
branches: [ "main" ]
pull_request:
branches: [ "main" ]
workflow_dispatch:

permissions:
contents: read
jobs:
runs-on: ubuntu-latest

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Set up R
uses: r-lib/actions/setup-r@v2

- name: Install dependencies
uses: r-lib/actions/setup-r-dependencies@v2
working-directory: chrombinarize

- name: Run tests
run: |
library(testthat)
library(chrombinarize)
test_dir("tests/testthat")
working-directory: chrombinarize
23 changes: 9 additions & 14 deletions JobSubmission/BS-Seq/1_binarize_WGBS_data.sh
Original file line number Diff line number Diff line change
Expand Up @@ -56,8 +56,8 @@ processing_directory="${BASE_DIR}/WGBS_5mc_5hmc"
rm -rf "${processing_directory}"
mkdir -p "${processing_directory}"

purification_convertBSBedToMethylBedFormat \
"mh" \
purification_convertBSBedToBedmethylFormat \
"m" \
"${WGBS_bed_file_location}" \
"${processing_directory}/formatted.bed"

Expand All @@ -70,21 +70,16 @@ if [[ -n "${use_cpg_islands}" ]]; then
"${processing_directory}/formatted.bed"
fi

purification_extractSitesWithLowMethylation \
"mh" \
"${processing_directory}/formatted.bed" \
"${processing_directory}/unmethylated_reads.bed"
purification_filterOnReadDepth \
"mh" \
"${processing_directory}/formatted.bed" \
"${processing_directory}/filtered_reads.bed"
purification_calculateSiteMethylationProbability \
"${processing_directory}" \
"unmethylated_reads.bed" \
"filtered_reads.bed" \
"m" \
"${processing_directory}/formatted.bed" \
"processed_reads.bed"
purification_removeDeterminedUnmethylatedSites \
purification_filterOnReadDepth \
"m" \
"${processing_directory}/processed_reads.bed" \
"${processing_directory}/filtered_reads.bed"
purification_removeDeterminedUnmethylatedSites \
"${processing_directory}/filtered_reads.bed" \
"${processing_directory}/purified_reads.bed"

## ======================== ##
Expand Down
17 changes: 6 additions & 11 deletions JobSubmission/BS-Seq/2_binarize_oxBS_5mC_data.sh
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,7 @@ processing_directory="${BASE_DIR}/oxBS_5mc"
rm -rf "${processing_directory}"
mkdir -p "${processing_directory}"

purification_convertBSBedToMethylBedFormat \
purification_convertBSBedToBedMethylFormat \
"m" \
"${oxBS_bed_file_location}" \
"${processing_directory}/formatted.bed"
Expand All @@ -70,21 +70,16 @@ if [[ -n "${use_cpg_islands}" ]]; then
"${processing_directory}/formatted.bed"
fi

purification_extractSitesWithLowMethylation \
purification_calculateSiteMethylationProbability \
"m" \
"${processing_directory}/formatted.bed" \
"${processing_directory}/unmethylated_reads.bed"
"processed_reads.bed"
purification_filterOnReadDepth \
"m" \
"${processing_directory}/formatted.bed" \
"${processing_directory}/filtered_reads.bed"
purification_calculateSiteMethylationProbability \
"${processing_directory}" \
"unmethylated_reads.bed" \
"filtered_reads.bed" \
"processed_reads.bed"
purification_removeDeterminedUnmethylatedSites \
"${processing_directory}/processed_reads.bed" \
"${processing_directory}/filtered_reads.bed"
purification_removeDeterminedUnmethylatedSites \
"${processing_directory}/filtered_reads.bed" \
"${processing_directory}/purified_reads.bed"

## ======================== ##
Expand Down
2 changes: 1 addition & 1 deletion JobSubmission/BS-Seq/3_binarize_pure_5hmC_data.sh
Original file line number Diff line number Diff line change
Expand Up @@ -128,7 +128,7 @@ awk \
"${processing_directory}/WGBS_oxBS_combined.bed" > \
"${processing_directory}/WGBS_5mC_removed.bed"

purification_convertBSBedToMethylBedFormat \
purification_convertBSBedToBedMethylFormat \
"h" \
"${processing_directory}/WGBS_5mC_removed.bed" \
"${processing_directory}/purified_reads.bed"
Expand Down
26 changes: 8 additions & 18 deletions JobSubmission/ONT/1_purify_reads.sh
Original file line number Diff line number Diff line change
Expand Up @@ -63,21 +63,16 @@ else
reference_set_base="${ONT_bed_file_location}"
fi

purification_extractSitesWithLowMethylation \
purification_calculateSiteMethylationProbability \
"m" \
"${reference_set_base}" \
"${processing_directory}/unmethylated_reads.bed"
"processed_reads.bed"
purification_filterOnReadDepth \
"m" \
"${reference_set_base}" \
"${processing_directory}/processed_reads.bed" \
"${processing_directory}/filtered_reads.bed"
purification_calculateSiteMethylationProbability \
"${processing_directory}" \
"unmethylated_reads.bed" \
"filtered_reads.bed" \
"processed_reads.bed"
purification_removeDeterminedUnmethylatedSites \
"${processing_directory}/processed_reads.bed" \
"${processing_directory}/filtered_reads.bed" \
"${processing_directory}/purified_reads.bed"

## ===== ##
Expand All @@ -96,19 +91,14 @@ else
reference_set_base="${ONT_bed_file_location}"
fi

purification_extractSitesWithLowMethylation \
purification_calculateSiteMethylationProbability \
"h" \
"${reference_set_base}" \
"${processing_directory}/unmethylated_reads.bed"
"processed_reads.bed"
purification_filterOnReadDepth \
"h" \
"${reference_set_base}" \
"${processing_directory}/processed_reads.bed" \
"${processing_directory}/filtered_reads.bed"
purification_calculateSiteMethylationProbability \
"${processing_directory}" \
"unmethylated_reads.bed" \
"filtered_reads.bed" \
"processed_reads.bed"
purification_removeDeterminedUnmethylatedSites \
"${processing_directory}/processed_reads.bed" \
"${processing_directory}/filtered_reads.bed" \
"${processing_directory}/purified_reads.bed"
25 changes: 8 additions & 17 deletions JobSubmission/supplementary/a_erroneous_rate_plot.sh
Original file line number Diff line number Diff line change
Expand Up @@ -42,8 +42,9 @@ move_log_files pvalues

number_of_columns=$(awk '{print NF; exit}' "${bed_file_location}")

# If the file has 5 columns, it is not in BEDmethyl format.
if [[ "${number_of_columns}" -eq 5 ]]; then
purification_convertBSBedToMethylBedFormat \
purification_convertBSBedToBedMethylFormat \
"${mark}" \
"${bed_file_location}" \
"${BASE_DIR}/converted.bed"
Expand All @@ -58,22 +59,12 @@ fi
mkdir -p "${BASE_DIR}/plots/"

conda activate ChromBinarize-R
if [[ "${run_type}" == "N" ]]; then
Rscript "${RSCRIPT_DIR}/erroneous_rate_plot_N.R" \
"${REPO_DIR}" \
"${bed_file_location}" \
"${mark}" \
"${max_N_value}" \
"${plot_type}" \
"${BASE_DIR}/plots/erroneous_rate_plot_${plot_type}_${run_type}_x_axis.png"
elif [[ "${run_type}" == "percent" ]]; then
Rscript "${RSCRIPT_DIR}/erroneous_rate_plot_percent.R" \
"${REPO_DIR}" \
"${bed_file_location}" \
"${mark}" \
"${plot_type}" \
"${BASE_DIR}/plots/erroneous_rate_plot_${plot_type}_${run_type}_x_axis.png"
fi
Rscript "${RSCRIPT_DIR}/erroneous_rate_plot_N.R" \
"${REPO_DIR}" \
"${bed_file_location}" \
"${mark}" \
"${max_N_value}" \
"${BASE_DIR}/plots/erroneous_rate_plot.png"
conda deactivate

if [[ "${number_of_columns}" -eq 5 ]]; then
Expand Down
2 changes: 1 addition & 1 deletion JobSubmission/supplementary/b_CpG_robustness.sh
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ mkdir -p "${filtered_reads_directory}"

number_of_columns=$(awk '{print NF; exit}' "${bed_file_location}")
if [[ "${number_of_columns}" -eq 5 ]]; then
purification_convertBSBedToMethylBedFormat \
purification_convertBSBedToBedMethylFormat \
"${mark}" \
"${bed_file_location}" \
"${filtered_reads_directory}/converted.bed"
Expand Down
109 changes: 4 additions & 105 deletions Rscripts/CpG_robustness.R
Original file line number Diff line number Diff line change
Expand Up @@ -8,117 +8,16 @@ output_file <- args[5]
renv::load(renv_environment)
library(ggplot2)

## ===================== ##
## DATA MANIPULATION ##
## ===================== ##

read_methylation_data <- function(bed_file_location) {
methylation_data <- data.table::fread(bed_file_location)
names(methylation_data) <-
c("Chr", "start", "end", "name", "size", "strand", "percent_methylation")
return(methylation_data)
}

add_lead_and_lag <- function(methylation_data) {
methylation_data <- dplyr::mutate(
methylation_data,
"leading_percent_methylation" = dplyr::lead(percent_methylation),
"leading_start_position" = dplyr::lead(start),
"lagging_percent_methylation" = dplyr::lag(percent_methylation),
"lagging_start_position" = dplyr::lag(start)
)
return(methylation_data)
}

exclude_distant_neighbours <- function(start,
neighbour_start,
max_distance,
percent_methylation) {
distance <- abs(start - neighbour_start)
return(ifelse(
distance < max_distance && distance > min_distance,
percent_methylation,
NA
))
}

calculate_mean_methylation <- function(lead_methylation, lag_methylation) {
return(mean(c(lead_methylation, lag_methylation), na.rm = TRUE))
}

## ============ ##
## PLOTTING ##
## ============ ##

create_cpg_robustness_plot <- function(methylation_data) {
methylation_correlation <- cor(
methylation_data[["percent_methylation"]],
methylation_data[["average_surrounding_methylation"]]
)
cpg_robustness_plot <-
ggplot(
methylation_data,
aes(x = percent_methylation, y = average_surrounding_methylation)
) +
geom_bin2d(bins = 100) +
scale_fill_continuous(type = "viridis") +
geom_abline(intercept = 0, slope = 1, color = "black") +
labs(
x = "Percent methylation",
y = "Average surrounding percent methylation"
) +
scale_fill_gradientn(colors = c("grey", "red", "yellow")) +
annotate(
"text",
x = 95,
y = 5,
label = paste0("R^2: ", methylation_correlation)
) +
theme_bw()

return(cpg_robustness_plot)
}

## ======== ##
## MAIN ##
## ======== ##

methylation_data <- read_methylation_data(bed_file_location)

methylation_data <-
add_lead_and_lag(methylation_data) |>
dplyr::rowwise() |>
dplyr::mutate(
"true_lead_percent_methylation" = exclude_distant_neighbours(
start,
leading_start_position,
max_distance,
leading_percent_methylation
)
) |>
dplyr::mutate(
"true_lag_percent_methylation" = exclude_distant_neighbours(
start,
lagging_start_position,
max_distance,
lagging_percent_methylation
)
) |>
dplyr::mutate(
"average_surrounding_methylation" = calculate_mean_methylation(
true_lead_percent_methylation,
true_lag_percent_methylation
)
) |>
dplyr::filter(
!is.na(average_surrounding_methylation)
)
methylation_data <- chrombinarize::read_bedmethyl(bed_file_location)

cpg_robustness_plot <- create_cpg_robustness_plot(methylation_data)
write.table(
cpg_robustness_plot <- chrombinarize::create_cpg_robustness_plot(
methylation_data,
quote = FALSE,
row.names = FALSE
min_distance,
max_distance
)

# needed on some servers to actually create png files
Expand Down
Loading

0 comments on commit 2fb2dc8

Please sign in to comment.