minor edits #15

Open
27 of 33 tasks
tavareshugo opened this issue Nov 28, 2023 · 2 comments

Comments

@tavareshugo
Collaborator

tavareshugo commented Nov 28, 2023

Collecting a few minor issues here, to avoid me pushing now and causing merge conflicts.
These are not urgent and don't need to be fixed imminently - I'm happy to push these changes once development is less active.

A few of these are perhaps higher priority, and I've marked them with 🔴 to highlight this.


setup

  • Missing comma here

01_intro.Rmd

  • Slightly rephrase this one, maybe: "After labelling each sample with a different TMT reagent, the same peptide will have an identical mass but be differentially labelled across samples."

02_import_and_infrastructure.Rmd

  • In the description of rowData and colData (here) we could use the term "table (data frame)" rather than "matrix-like data structure", as participants of this workshop should be familiar with data.frame/tibble objects.
  • Can turn this note into a :::{.callout-note} box. The "assay" nomenclature is rather unfortunate here 😕. Until this issue gets solved in QFeatures, could we maybe use the term "experiment assay" instead? That would fit the fact that you use the function experiments() to fetch an ExperimentList from your object.
  • Missing a verb here: "later see" or "later do" or something like that.
  • Is there a reason to import dplyr/stringr/ggplot2/tibble individually, rather than library("tidyverse")?
  • The working directory (here) could be simply referred to as "the course materials folder". On our training computers that will actually be their default working directory. Although setting up a project might be good practice anyway. I would perhaps even change the advice to be to create a project, rather than changing the working directory. That means their project is portable, as long as they open their Rproj file next time they run the analysis. This is what we advise on our data carpentry and reproducibility courses.
  • Here "the import" should be "then import"
  • Throughout, at some point switch to native pipes |> (which tidyverse folk now also endorse)
  • Here: "packjaje" typo; and rather than "link together" perhaps use "chain together".
  • Here, rather than accessing the slot, could say: "Since a QFeatures is a list of SummarizedExperiment objects, we can use length(cc_qf) to check how many objects we have inside it". Alternatively we can also extract the list of experiments directly using experiments(cc_qf). Generally, it's discouraged to access slots directly using @ as it risks users breaking the object if they decide to do an assignment <-.
  • Order of loading packages might cause issues as clusterProfiler and org.Hs.eg.db mask several dplyr functions.
  • Here and throughout, in pipes use function calls with (), to make it explicit these are functions and also make it compatible with the native pipe |> if we ever switch over.
  • Here and here it seems like using unique() would make more sense, to match with the previous piped answer.
  • Here "using pipes" rather than "using tidyr". Especially now that R has a native pipe implementation, we want to steer away from making it seem that pipes are only a tidyverse thing.
  • 🔴 For this section it might be nicer to import a metadata CSV file (to encourage participants to create that for their experiment) and then add that as colData.
    • 🔴 This is hard, especially the sapply() use. This would be solved if we had a metadata CSV prepared for them. I would argue it's also best-practice to have that metadata samplesheet ahead of time.
  • Here might be worth adding dir.create("output", showWarnings = FALSE) in case the directory doesn't exist.
  • Here, rather than pull() |> table() could just do count(Number.of.Missed.Cleavages).
  • Also in the same exercise, we perhaps need a bit of a clue as to what to do. It's not quite clear to me where this column came from. I assumed it is part of the output from the ThermoFisher software, or was it calculated by the import function somehow? Perhaps the question could be: "One of the pieces of information given by the Proteome Discoverer software used to produce the TMT data is the number of missed cleavages. This is stored in a rowData column named Number.of.Missed.Cleavages. Can you count how many occurrences of missed cleavages there are in our data?"
  • Should the [ subset syntax be introduced here?
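
A minimal sketch of the count() suggestion above, written with the native pipe. The tibble here is made up and only stands in for the rowData of an experiment assay; the column name Number.of.Missed.Cleavages is the one mentioned in the exercise.

```r
library(dplyr)

# Made-up stand-in for the rowData table (illustrative values only)
row_data <- tibble(Number.of.Missed.Cleavages = c(0, 0, 1, 0, 2, 1, 0))

# Current style: pull the column out as a vector, then tabulate it
row_data |>
  pull(Number.of.Missed.Cleavages) |>
  table()

# Suggested style: count() stays in a tibble and adds an `n` column
missed <- row_data |>
  count(Number.of.Missed.Cleavages)

missed
```

Both give the same tallies, but count() returns a tibble, so the result can be piped straight into further dplyr or ggplot2 steps.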

03_data_processing.Rmd

  • Again, I feel a little confused by the nomenclature "assays" used here. To avoid confusion with the assay slots, could we maybe use the term "experiment assay"?
  • Here - Again, since we're in tidyverse land, this could be simplified with count(Search.Engine.Rank). If you wanted to make it more visual, could even pipe to ggplot: ggplot(aes(Search.Engine.Rank, n)) + geom_col() + scale_y_log10()
  • Typo "give" should be "given"
  • Exercise: maybe give a bit of a clue as to which column we should be looking at. You could say something like: "the XX software flags potential contaminant features in the Contaminants column found in the rowData of the experiment assay. For example, we can count how many contaminants there are using cc_qf[["psms_filtered"]] |> rowData() |> as_tibble() |> count(Contaminant). Use the filterCounts() function... etc."

04_normalisation_aggregation.Rmd

  • outputDir = "." could perhaps be outputs, for consistency with where they are saving analysis outputs.

  • Here - I wonder if having a snapshot of the report would be helpful. For example, the QQ plots or the boxplots. Also, I'm not sure how we inferred center.median was the method? In the PDF report it's referred to as median.


05_protein_exploration.Rmd

  • Here I get a max .n of 223 and a median of 2.

06_statistical_analysis.Rmd

  • Revise some of the statistical concepts #16
  • Here maybe use "continuous variable" instead of covariate.
  • Could be worth adding somewhere what the interpretation of FDR is. Something like "The FDR defines the fraction of false discoveries that we are willing to tolerate in our list of differential proteins. For example, an FDR threshold of 0.05 means that around 5% of the differential proteins will be false positives. It is up to you to decide what this threshold should be, but conventionally people use 0.01 or 0.05."
  • Try to be consistent using either BH or FDR. Sometimes one or the other is used.
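
To make the BH/FDR distinction concrete, here is a small base R sketch: BH (Benjamini-Hochberg) is the adjustment method, and the FDR is the quantity it controls. The p-values are made up for illustration and do not come from the course data.

```r
# Made-up p-values, one per protein (illustrative only)
p_values <- c(0.001, 0.008, 0.02, 0.2, 0.6)

# Benjamini-Hochberg adjustment, which controls the FDR
adj_p <- p.adjust(p_values, method = "BH")

# Number of proteins called differential at an FDR threshold of 0.05
n_significant <- sum(adj_p < 0.05)

adj_p
n_significant
```

Adjusted p-values are never smaller than the raw ones, so fewer proteins pass the threshold than would with unadjusted p-values of the same cut-off.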

More general questions:

  • Here "The quantitation data is stored in columns 47 through to 56" --> how would we know this?
  • throughout replace cc_qf@ExperimentList with experiments(cc_qf)
lmsimp added a commit that referenced this issue Nov 29, 2023
lmsimp added a commit that referenced this issue Nov 29, 2023
@lmsimp
Contributor

lmsimp commented Nov 29, 2023

I've started to address these edits for lessons 4 and 5 in commit e868288 and commit 3278941. Thank you very much @tavareshugo for going through these lessons. Please feel free to add more to this issue as you go.

@tavareshugo
Collaborator Author

I've turned them into checkboxes, as it might be easier to keep track of them that way.
