Home
Sveinn Einarsson edited this page Jul 10, 2023 · 4 revisions
Welcome to the meta_ML wiki!
For now let's use this wiki to post our ideas, such as links to papers that we find interesting or proposed project plans. We can also use it to assign/describe roles and tasks once the project gets started.
Thoughts on predicting microbial communities:
- The sample-to-feature ratio is often low. Is there a way we can scrape more samples?
- The papers I've seen use the entire dataset as the training set and then check how well the output correlates with the known samples. I think papers should hold out part of the dataset to test how well the model predicts outside the environmental range the NN was given. For example, if your dataset covers temperatures from 0-100 °C, train only on 0-80 °C and see how well it predicts at 100 °C.
- ASVs are probably not the best unit, as rare ASVs are difficult to predict. Is there another way to get a sense of the community composition without relying on ASVs, i.e. another way to represent the community?
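The hold-out idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic, the field names `temp_c` and `richness` are made up, and a one-variable least-squares line stands in for the NN. The point is the split itself: the test set lies entirely outside the temperature range seen in training.

```python
import random


def extrapolation_split(samples, variable, train_max):
    """Split so that every test sample lies above the training range."""
    train = [s for s in samples if s[variable] <= train_max]
    test = [s for s in samples if s[variable] > train_max]
    return train, test


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, rows, variable, target):
    """Mean absolute error of the fitted line on the given rows."""
    slope, intercept = model
    errors = [abs(slope * r[variable] + intercept - r[target]) for r in rows]
    return sum(errors) / len(errors)


# Synthetic data (hypothetical): richness declines slowly up to 80 C,
# then drops much faster, so the in-range trend does not extrapolate.
rng = random.Random(0)
samples = []
for temp in range(0, 101, 5):
    base = 100 - 0.5 * temp if temp <= 80 else 60 - 2.5 * (temp - 80)
    samples.append({"temp_c": temp, "richness": base + rng.uniform(-2, 2)})

train, test = extrapolation_split(samples, "temp_c", train_max=80)
model = fit_line([s["temp_c"] for s in train], [s["richness"] for s in train])

in_range_error = mean_abs_error(model, train, "temp_c", "richness")
out_of_range_error = mean_abs_error(model, test, "temp_c", "richness")
print(f"in-range MAE: {in_range_error:.2f}")
print(f"out-of-range MAE: {out_of_range_error:.2f}")
```

A model scored only on the data it was trained on would report the in-range error alone, which is why a held-out range gives a more honest picture of how it behaves in unseen conditions.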
Denni:
- I've seen a lot of people duplicate "outliers", or the 80-100 °C data, so as not to overtrain on the rest. Our dataset is probably sparse anyway, so I think we can duplicate some parts of the data and use the "leave one out" training/test method. (The duplication would have to happen inside each training fold, otherwise a copy of the held-out sample ends up in the training set.)
- Can we make an algorithm that tests the model on different initial sub-samplings? For instance, the algorithm would keep testing until the score stops improving, to find which sub-sampling works best, e.g. different temperature ranges (minimum range ~40 °C; see Lei's comment above). Would making the model more fluid like this help find the best feature importance for each range? Basically, this means overtraining the model on multiple ranges of environmental variables, but hopefully finding a pattern of correlations when predicting outside of its training data. I've wondered whether the feature importance changes with the range of data trained on, and whether that signal gets lost when we train the model on the entire dataset.
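A rough sketch of the range-scanning idea, under loud assumptions: the data are synthetic, the feature names (`ph`, `salinity`) and the target (`richness`) are hypothetical, and the absolute Pearson correlation is used as a crude stand-in for whatever feature importance the real model would report. It enumerates every temperature window at least 40 °C wide and computes the per-feature importance inside each one, so that changes across ranges become visible.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def temperature_windows(low, high, min_width, step):
    """Yield every (start, end) window that is at least min_width wide."""
    for start in range(low, high - min_width + 1, step):
        for end in range(start + min_width, high + 1, step):
            yield start, end


def importance_by_window(samples, features, target, windows,
                         range_variable="temp_c"):
    """Per-window importance proxy: |correlation| of each feature with target."""
    results = {}
    for start, end in windows:
        subset = [s for s in samples if start <= s[range_variable] <= end]
        if len(subset) < 3:  # too few samples to say anything
            continue
        ys = [s[target] for s in subset]
        results[(start, end)] = {
            f: abs(pearson([s[f] for s in subset], ys)) for f in features
        }
    return results


# Synthetic data (hypothetical): pH drives richness below 50 C,
# salinity drives it at 50 C and above.
rng = random.Random(1)
samples = []
for _ in range(300):
    temp = rng.uniform(0, 100)
    ph = rng.uniform(4, 9)
    salinity = rng.uniform(0, 35)
    driver = 10 * ph if temp < 50 else 2 * salinity
    samples.append({"temp_c": temp, "ph": ph, "salinity": salinity,
                    "richness": driver + rng.gauss(0, 1)})

windows = list(temperature_windows(0, 100, min_width=40, step=20))
results = importance_by_window(samples, ["ph", "salinity"], "richness", windows)

for (start, end), scores in sorted(results.items()):
    ranked = ", ".join(f"{name}={score:.2f}" for name, score in scores.items())
    print(f"{start:>3}-{end:<3} C: {ranked}")
```

In this toy setup the ranking of the two features flips between the coldest and hottest windows, while the full 0-100 °C window mixes both signals. Whether real community data behave this way is exactly the open question raised above.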