-
Notifications
You must be signed in to change notification settings - Fork 7
WIP: Extend MixtureFinder to codon, binary, multistate, (and amino acid) data #11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
thanks for contributing. @HuaiyanRen can you check? |
Thanks for contributing. I will check it soon once I have time. |
Thank you for your responses. I have realized that I specified inappropriate frequency types for some data types. Sorry for the confusion I may have caused. I am unable to work on it now, but I will fix it as soon as possible, referring to the frequency types tested by default by ModelFinder. |
I have fixed it now. It should work properly, probably? I have updated the tests (HS6986/iqtree3ForkTests) accordingly. |
Hello, I read your repository. I think your extension is correct for amino acid, binary and multistate data. I also never work on codon, I saw your log only consider +F1X4 and +F3X4 but no +FQ or +FO, which I'm not sure this is proper. But thank you very much again for your contribution! |
Thank you for your response. I have never dealt with codon data either, so I was fumbling about this. ModelFinder seems to test the frequency parameters I would deeply appreciate it if you could verify this. Thank you for your time and support. |
F can be applied with a mixture model. MixtureFinder consider +FQ and +FO as default for DNA, if you want +F, you need to specify it by -mset. ModelFinder consider +FQ and +F as default, but since I joined in our team, we developed mixture models (for DNA) with +FO. I didn't ask about the exact reason, by I think one reason could be, for example: {F81+FO,F81+FO} are actually two-class mixture model, although the exchangeabilities are the same in different classes, by the frequencies could be different. However, {F81+F, F81+F} is a meaningless mixture model, because +F is counted based on the alignment, so for each class the +F is the same thing. |
Thank you for your response.
Oh, thank you for pointing it out. Indeed, I would be happy to continue discussing the frequency types for codon data in MixtureFinder. Thank you for your support. |
When user input For codon models, I may ask around in the team to find someone who can answer your question. |
Sorry, I made a mistake when typing before. I want to ask: in your log file |
Thank you for your responses.
Oh, this is good to know, thank you (please ignore my deleted statement about linking or unlinking frequencies, my misunderstanding)!
Thank you!
Sorry, I misread “I also never work on codon, I saw your log only consider +F1X4 and +F3X4 but no +FQ or +FO, which I'm not sure this is proper.” and thought that you were then talking about the frequency types specified for codon data in MixtureFinder, not about my MixtureFinder test for codon data. My apologies. Thank you for your time and support. |
Dear IQ-TREE Developers, I have tested the codon MixtureFinder on data whose genetic code was standard (HS6986/iqtree3ForkTests), thus activating the tests of empirical codon models. However, the analysis stopped with an error message Could you review this issue and fix it when you are available? Thank you for your time and support. P.S. This error seems to be due to the IQ-TREE's inability to handle mixture models with classes whose frequencies are |
With a very simple change, I was able to fix the aforementioned issue, namely that IQ-TREE does not accept |
Hi all, I've also tried to explore the IQ-Tree functionality towards using multistate data recently.
And the multistate alignment ( So @HS6986, the question is: Sorry if I'm misleading here, I haven't dug deep into this topic yet. Best, |
I've just downloaded your code and made a couple of runs with your 4-state test alignment. I added some log messages to model class constructors, and here is what I've got: If I don't use MixtureFinder, but use ModelFinder not specifying the sequence type as with the following command:
However, if I use MixtureFinder instead with the following command:
So the same alignment under the same conditions (i.e. neither seq type, nor model specified) is treated differently by ModelFinder and MixtureFinder! Kinda weird behaviour to me. This situation, in fact, can lead to the following confusion:
It is the alignments like from my previous comment (with states from an arbitrarily large alphabet designated with ints) that can be assigned the existing in the original code, but unused Maybe the good old Best, |
The bug with We didn't deal with this issue because this is not a common case that users specify |
Thank you for your feedback. The reason I created the new model class When analyzing morphological data with probabilistic methods, most empiricists partition data by the number of states in each character (see here) and use Mk models (models with equal rates and frequencies) with ascertainment bias corrections (Lewis, 2001) to model data. Also in IQ-TREE, ModelFinder only considers MK+FQ(+ASC+(rate heterogeneity across characters (e.g., +G))) for morphological data. Although some software programs (MrBayes and RevBayes) implement methods that model heterogeneity of state frequencies in morphological data with mixture models (Wright et al., 2016; here), as morphological data should be partitioned by the number of states and currently ascertainment bias corrections ( On the contrary, multistate data other than morphology, such as recoded amino acid data, can and often should be analyzed in models with unequal rates and/or frequencies (e.g., MK+FO, GTRX+FQ, and GTRX+FO). If we implemented MixtureFinder so that multistate data other than morphology would be handled by However, thinking about it again, it seems that there will be no problem if we replace I will work on this as soon as possible. Thank you very much for your thorough feedback, Stefan! |
Hi @HS6986, Thank you for the extensive explanation and useful links! I think I got your idea. You state that using only the MK model and FQ frequencies by default is problematic:
But I think it is quite the opposite: Maybe the original default behaviour (MK+FQ, thus no MixtureFinder) is fine, and if a user still has a good reason to run MixtureFinder for the |
Thank you for your reply. That makes sense! It seems to me that the best choice is to specify the default models and frequency types for I am going to remove all the pieces of code that have been added in this pull request in relation to Thank you very much for your support and time. |
Yes, good idea! I think you could also suppress the following warnings from the
For example, if the alignment have 6 states, for a GTRX matrix we have to estimate only (6*6 - 6)/2 - 1 = 14 parameters. All the corresponding 14 transition pairs are likely to appear in the alignment numerous times, so there should be no concern of overfitting. However, if someone decided to use the GTRX model for true morphological data, which, just as you mentioned before, can be divided into short partitions, estimating even 14 parameters would be a problem. Hence the check for the partition length. |
Sorry for the late reply. That's an excellent idea! I was also wondering if it is possible to partially suppress the warnings that occur every time you use GTRX. I'll work on this. Thank you for your time and support. |
…if the number of states <= 6 && the number of the patterns in the alignment/partition >= 100
Dear @StefanFlaumberg, @HuaiyanRen, @bqminh, and others, I have completed the following tasks:
I have tested the behavior of this pull request again (HS6986/iqtree3ForkTests), and it seemed to work properly. Please let me know if there are any problems or other areas for improvement. It seems to me that currently no problems exist in this pull request (though I'm still not sure whether FQ should be specified as a default frequency type for codon data in MixtureFinder, but probably the answer is no?). Thank you very much for your time and help. |
Hi @HS6986, A few minor things that bother me (some of them are mere stylistic and can be ignored):
Please notice that I omitted everything associated with the Concerning the default frequency types for codon data:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, generateNestNetwork is only working with DNA, but this function is still called. I'm not sure it's going on if this function is called for other data types. Please check that it's the right behaviour.
@HuaiyanRen Can you make these changes to the pull request? |
…; Temporarily comment out `free(init_state_freq_set);`, which HuaiyanRen added, as they cause an error
Dear All, Sorry for the slow response. I improved the code according to the advice by StefanFlaumberg and bqminh. Thank you for your advice. I have updated the tests (HS6986/iqtree3ForkTests) accordingly, and the updated code seems to work properly, with the issue of printing the state frequencies of Please let me know if I have missed something. Thank you very much for your time and support. |
Dear bqminh, Thank you very much for thoroughly reviewing my code! |
Dear @StefanFlaumberg,
Sorry, I misread your previous message. I'm convinced. Thank you very much for your advice.
I see... That's strange. I've fixed the aforementioned bug using your code while leaving |
Dear @HuaiyanRen, Thank you very much for your changes! But Thank you for your time and support. |
Dear @thomaskf, Sorry for the mention. In this pull request, we extended MixtureFinder to codon, binary, multistate, and amino acid data. In the process, we found a bug that the state frequencies of Thank you very much for your time and help. |
@HS6986 , you are correct. This piece of code is unnecessary. Thank you for pointing it out. |
Dear @HuaiyanRen, Thank you for your reply. I've deleted Thank you very much for your help. |
Hi @HS6986 , I'm sorry, but the BTW |
Dear @StefanFlaumberg, Thank you for your reply.
Sorry, I messed up again. Thank you for pointing it out. I've fixed it.
@HuaiyanRen, could you check this? Thank you very much for your help. |
Hi All, Apologies for weighing in a bit late here - I've been busy with teaching in the last couple of months. First I want to thank @HS6986 and @StefanFlaumberg for all the work they have put into this. It's great to see the community extending the software! The point of this comment is to ask a question about this addition, which I'd like to see clearly addressed before these updates are made available to IQ-TREE users. Hopefully this kicks off a good discussion! The question is: what is the theoretical and practical basis of adding these features, and do we have empirical evidence that they will be useful (i.e. improve inference for IQ-TREE users)? Here's a longer version, focussed on amino acid models which are the ones I am most familiar with. The motivation for MixtureFinder was that perhaps partitioned models are not the best way to account for variation in evolutionary processes among sites. We noticed that mixture models were rarely applied to DNA data, even though most DNA datasets contain plenty of information to estimate mixture models, and the parameterisation of DNA models is such that it's feasible to estimate quite complex DNA mixture models for most users. To this end, the questions we asked (and, I hope, answered!) in the mixturefinder paper was whether these models are really justified (they were, because they fit individual empirical partitions far better than one-class GTR models, suggesting that current methods are limited in terms of their abilty to explain data), and then whether they were useful (they were, though the improvements in inference were sometimes modest). Before we make the current additions available, I'd like to see similar tests done for the other data types. For the amino acid models (and potentially the others, but I'm less sure there because I am far less familiar with them) I'm a little sceptical that a MixtureFinder approach is worthwhile. The main reason is that most amino acid models (like LG, WAG, Q.pfam, etc) are estimated from huge collections of alignments, and represent average replacement rates and frequencies across those collections. Even if you take clade-specific models, the frequency vectors are quite similar to each other, and more similar to each other than the vectors of profile mixture models like C10-C60, precisely because the former are still averages over a large number of sites. So, I'm not really sure what it means to have a mixture of these kinds of models. Sure, it's likely to fit the data a bit better, but I doubt that it's really worth the extra computational cost. The second reason is that we already have good amino acid mixture models in IQ-TREE, both for replacement rates (e.g. LG4X, LG4M) and frequencies (e.g. all the profile mixture models like C10-C60; PMSF models which can build on / summarise Bayesian CAT models; UDM models, etc). These are very much biologically motivated approaches to mixture models, which are likely (I guess, but I don't know!) to be a better way to model amino acid evolution than building complex mixtures of one-class models (like LG, WAG, Q.mammal, etc). On top of that we can now estimate GTR20 replacement matrices under these models (e.g. the GTRpmix paper) and re-estimate the weights of frequency vectors for profile mixture models. We can also estimate a huge range of these kinds of models from scratch in IQ-TREE (not all of those methods are published yet, because they haven't been checked sufficiently). To be clear, I'm not saying that I know that extending MixtureFinder to amino acid models wouldn't be useful. It's just that I have reservations, and I think one would need clear theoretical grounds and empirical demonstrations before this extension is made available to users. The same goes for the morphological and codon models. (I'm not well versed in morphological models, though my intuition is that mixturefinder might be a good approach for codon models). To me, it's really important that methods available in IQ-TREE are backed up by solid evidence of their utility and accuracy. Ultimately I think this means doing a lot of simulations to check their accuracy (in terms of implementation), then finding creative ways to do validate their utility on published empirical datasets. As long as we do this, then users can trust that the methods open to them in IQ-TREE are ones that we expect to be useful, accurate, and practical. The latter is important here, because mixture models imply a lot more computation, so that should be coupled with the payoff of more accuracy. If the MixtureFInder approach proves not to be useful for some datatypes (like amino acids perhaps, if my hunch above is right), that's still very useful to know, and good research! The good news is that I think the MixtureFinder paper (https://academic.oup.com/mbe/article/42/1/msae264/7931682) provides a useful template for doing this on each of the new datatypes. Though I'm sure there are ways to improve our approaches there too. Happy to discuss more! Thanks again for all the work. I'm genuinely excited to see these methods extended, and I hope (and suspect) that the approach will be useful on some datatypes, just perhaps not all of them. Cheers, Rob P.S. We can of course build the methods into IQ-TREE so they are available, but not documented. And we should make sure that methods that have not been validated come with an appropriate warning! I prefer this approach to having too many branches which gradually diverge and create huge headaches down the track. |
Thank you very much for your thorough feedback. Since I am out now, I'll give you my tentative conclusion to your question for now. I'll post details as to why later. In conclusion, your comments have made me realize that the MixtureFinder approach is probably not suitable for amino acid data and their recoded binary or multistate data, but that, for codon data and binary (e.g., genome gene content, microsynteny, RY recoded DNA, I'm sure there are many other examples) and multistate data other than morphology or recoded amino acids, it is at least theoretically justified, could most probably improve phylogenetic inference, and is probably a very sensible way. Thank you for pointing it out. We may consider disabling MixtureFinder for amino acid data or setting warnings in MixtureFinder recommending other mixture models for amino acid data. |
I agree that the trade-off between model fit and computational cost may not favour using MixtureFinder for data types with high number of states and that it is worthwhile to theoretically and experimentally identify the specific situations when the usage of MixtureFinder is justified and present these situations in the IQ-Tree manual to prevent unintended usage. Regarding the warnings, I think a good solution woud be putting something like this at the line 6674 of the
|
I just talked with @thomaskf and @HuaiyanRen about this. One way to go is this:
Make sure that the warning message is always included in either case. That way, users are aware of the problem, as they have to run IQ-TREE twice. Otherwise, if you just let it run, many people won't notice, no matter if there is any warning... |
Dear All, Thank you for all the comments! I'd agree that in most cases the MixtureFinder approach probably wouldn't be a sensible way to model amino acid data, and their recoded binary or multistate data. I'd also agree that, for amino acid data, the MixtureFinder approach should not be explained in the documentation and should be offered with warnings and stopped with error messages by default. However, I think that the MixtureFinder approach would probably be a very sensible way to find the best-fitting model for codon data and binary and multistate data other than morphology and recoded amino acids. I think that for these data types the MixtureFinder approach is well worth documenting, even if accompanied by warnings (or also even default error messages). The following are these reasons: Amino acid data and their recoded binary or multistate dataFor amino acid data, as roblanf pointed out, we already have probably well-behaved mixture frequency vectors CXXs, and they should explain data well. Using the MixtureFinder approach for amino acid data would be inferior to using them in several ways probably in most cases: (1) it would require extra computational cost, (2) the replacement rates and frequencies for each class determined by MixtureFinder often wouldn't be predefined, greatly increasing the number of free parameters and thus making MixtureFinder terminate before the mixture model explains the data sufficiently well. The same applies to recoded binary or multistate data of amino acids because, for these data types, the recoded mixture frequency vectors CXX can be applied (see Redmond & McLysaght, 2021; Najle et al., 2023; https://github.com/xgrau/recoded-mixture-models). Codon dataI'm not familiar with codon models, so I apologize if I have misunderstood something. For codon data, we don't have predefined mixture vectors of replacement rates and/or frequencies, so using the MixtureFinder approach would probably be a sensible way to improve the model fit for the data. It is not empirically validated, but it should probably improve the model fit judging from BIC, AIC, AICc, or likelihood ratio test. As ModelFinder exclusively relies on these criteria without any warnings or error messages, I think that there wouldn't be problems in documenting the MixtureFinder approach for codon data as well, and it might not require even special warnings or default error messages. If mixing multiple empirical codon models would seem strange, users could restrict the models to be applied using Binary and multistate data other than morphology and recoded amino acidsThese data are some sort of genomic information (e.g., genome gene content and synteny) and recoded DNA data in most cases. As for codon data, we don't have predefined mixture vectors of replacement rates and/or frequencies for these data types, so what has been said about codon data also applies to these data. For example, for binary data, MIX{GTR2+FO,GTR2+FO,JC2+FO} would fit better to some data than GTR2+FO or JC2+FQ. I think that MixtureFinder approach cannot be applied to morphological data as it is, but I believe that a MixtureFinder-like algorithm I devised that I've implemented in #35 will work well. Thank you very much for your time and support. I look forward to further discussion. |
Dear All,
This pull request extends MixtureFinder (Ren et al., 2024), which currently works only on DNA data, to codon, binary, multistate, and amino acid data.
For multistate and amino acid data, the frequency parameters
FQ
andFO
are tested by default, and for codon data, the frequency parametersFQ
,F
,F1X4
, andF3X4
are tested by default. As I have never done phylogenetic analyses of codon or amino acid data in my actual research, I apologize if I am doing something wrong.I have tested the modified MixtureFinder on DNA, codon, binary, multistate and amino acid data to see if it works properly, and it seems to work fine. The test data can be found at HS6986/iqtree3ForkTests.
Since I am completely unfamiliar with C++ and have little understanding of the IQ-TREE implementation, I believe that this PR might contain bugs and there are many improvements that could be made. Thus, it almost certainly needs extensive code review by the IQ-TREE developers. If you would like to get access to my repository for editing, please feel free to ask.
This is almost my first PR, so please feel free to let me know if I am doing something wrong.
Thank you very much for your time and support.