WIP: Extend MixtureFinder to codon, binary, multistate, (and amino acid) data #11

HS6986 · 2025-04-06T06:51:26Z

Dear All,

This pull request extends MixtureFinder (Ren et al., 2024), which currently works only on DNA data, to codon, binary, multistate, and amino acid data.

For multistate and amino acid data, the frequency parameters FQ and FO are tested by default, and for codon data, the frequency parameters FQ, F, F1X4, and F3X4 are tested by default. As I have never done phylogenetic analyses of codon or amino acid data in my actual research, I apologize if I am doing something wrong.

I have tested the modified MixtureFinder on DNA, codon, binary, multistate and amino acid data to see if it works properly, and it seems to work fine. The test data can be found at HS6986/iqtree3ForkTests.

Since I am completely unfamiliar with C++ and have little understanding of the IQ-TREE implementation, I believe that this PR might contain bugs and there are many improvements that could be made. Thus, it almost certainly needs extensive code review by the IQ-TREE developers. If you would like to get access to my repository for editing, please feel free to ask.

This is almost my first PR, so please feel free to let me know if I am doing something wrong.

Thank you very much for your time and support.

bqminh · 2025-04-06T11:43:22Z

thanks for contributing. @HuaiyanRen can you check?

HuaiyanRen · 2025-04-08T03:47:51Z

Thanks for contributing. I will check it soon once I have time.

HS6986 · 2025-04-08T09:25:10Z

Thank you for your responses.

I have realized that I specified inappropriate frequency types for some data types. Sorry for the confusion I may have caused. I am unable to work on it now, but I will fix it as soon as possible, referring to the frequency types tested by default by ModelFinder.

HS6986 · 2025-04-08T14:52:42Z

I have fixed it now. It should work properly, probably? I have updated the tests (HS6986/iqtree3ForkTests) accordingly.

HuaiyanRen · 2025-04-11T02:57:08Z

Hello, I read through your repository. I think your extension is correct for amino acid, binary and multistate data. I have also never worked on codon data. I saw that your log only considers +F1X4 and +F3X4 but not +FQ or +FO, and I'm not sure whether this is proper.

But thank you very much again for your contribution!

HS6986 · 2025-04-11T03:19:32Z

Thank you for your response.

I have never dealt with codon data either, so I was fumbling about this.

ModelFinder seems to test the frequency parameters F3X4, F1X4, F, and for codon data by default. With reference to this, I wanted to apply F3X4, F1X4, FO, and for codon data. However, according to the manual (Substitution models), FO cannot be applied to codon data, and it was confirmed in my test analysis. As such, I applied F3X4, F1X4, and for codon data. Am I correct in understanding that F cannot be applied to a class in a mixture model? I am not sure whether FQ should be applied.

I would deeply appreciate it if you could verify this.

Thank you for your time and support.

HuaiyanRen · 2025-04-11T03:30:35Z

F can be applied with a mixture model. MixtureFinder considers +FQ and +FO by default for DNA; if you want +F, you need to specify it with -mset.

ModelFinder considers +FQ and +F by default, but since I joined our team, we have developed mixture models (for DNA) with +FO. I didn't ask about the exact reason, but I think one reason could be, for example:

{F81+FO,F81+FO} is actually a two-class mixture model: although the exchangeabilities are the same in the different classes, the frequencies can differ. However, {F81+F,F81+F} is a meaningless mixture model, because +F is counted from the alignment, so +F is the same for every class.
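
The point about +F can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `empiricalFreqs` is a hypothetical helper, not an IQ-TREE function, and it simply counts A/C/G/T over the whole alignment. Because the counts depend only on the alignment, every class that uses +F receives the same vector.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not IQ-TREE code): empirical base frequencies,
// counted over the whole alignment and normalised to sum to 1.
std::array<double, 4> empiricalFreqs(const std::vector<std::string>& alignment) {
    std::array<double, 4> counts{{0.0, 0.0, 0.0, 0.0}};
    double total = 0.0;
    for (const std::string& seq : alignment) {
        for (char c : seq) {
            int idx = -1;
            switch (c) {
                case 'A': idx = 0; break;
                case 'C': idx = 1; break;
                case 'G': idx = 2; break;
                case 'T': idx = 3; break;
                default: break;  // gaps and ambiguity codes are skipped
            }
            if (idx >= 0) {
                counts[idx] += 1.0;
                total += 1.0;
            }
        }
    }
    assert(total > 0.0 && "alignment contains no A/C/G/T characters");
    for (double& x : counts) x /= total;
    return counts;
}
```

Calling `empiricalFreqs` once per class of {F81+F,F81+F} returns identical vectors, so the two classes cannot differ; with +FO each class would instead carry its own optimised frequencies.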

HS6986 · 2025-04-11T04:15:34Z

Thank you for your response.

F can be applied with a mixture model.

Oh, thank you for pointing it out. Indeed, F can be applied in a mixture model, but that should only be the case if F is applied to all the classes. I think otherwise the mixture model should not work properly and probably causes errors. Is this correct? It seems to me that F can be specified in MixtureFinder only when only F is specified for frequency types and only models with unequal frequencies are specified. As such, F should not be tested in MixtureFinder by default.

I would be happy to continue discussing the frequency types for codon data in MixtureFinder.

Thank you for your support.

HuaiyanRen · 2025-04-11T04:29:24Z

When a user inputs -mset JC,GTR -mfreq F, FO with MixtureFinder, the following models will be considered: JC+FQ, GTR+F, GTR+FO. No errors will be reported. So you don't need to worry about conflicts between the exchangeability matrix types and the frequency types.
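
A rough sketch of how such a candidate list could be built is given below. It is not the MixtureFinder implementation: `hasFixedEqualFreqs` and `candidateModels` are hypothetical names, and the rule "an equal-frequency model always gets +FQ" is an assumption inferred from the JC,GTR / F,FO example above.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Assumption: models listed here have equal state frequencies by definition.
// Only JC is listed, as in the example above.
bool hasFixedEqualFreqs(const std::string& model) {
    static const std::set<std::string> equalFreqModels = {"JC"};
    return equalFreqModels.count(model) > 0;
}

// Hypothetical helper: combine every model with every frequency type,
// forcing +FQ for equal-frequency models and dropping duplicates.
std::vector<std::string> candidateModels(const std::vector<std::string>& mset,
                                         const std::vector<std::string>& mfreq) {
    assert(!mset.empty() && !mfreq.empty());
    std::vector<std::string> out;
    std::set<std::string> seen;
    for (const std::string& model : mset) {
        for (const std::string& freq : mfreq) {
            const std::string f = hasFixedEqualFreqs(model) ? std::string("FQ") : freq;
            const std::string name = model + "+" + f;
            if (seen.insert(name).second) out.push_back(name);  // keep first occurrence only
        }
    }
    return out;
}
```

With `mset = {"JC", "GTR"}` and `mfreq = {"F", "FO"}` this yields JC+FQ, GTR+F, GTR+FO, matching the list above.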

For codon models, I may ask around in the team to find someone who can answer your question.

HuaiyanRen · 2025-04-11T05:00:11Z

Sorry, I made a mistake when typing before.

I want to ask: in your log file Chen2024.COI.Aligned.NT.Replaced.Subsampled.fas.log, it seems +F and are not estimated (ignore the +FO thing, my bad), but you mentioned that you applied , which I couldn't see in the log file.

HS6986 · 2025-04-11T06:12:21Z

Thank you for your responses.

No errors will be reported.

Oh, this is good to know, thank you (please ignore my deleted statement about linking or unlinking frequencies, my misunderstanding)!

For codon models, I may ask around in the team to find someone who can answer your question.

Thank you!

but you mentioned that you applied , which I couldn't see in the log file.

Sorry, I misread “I have also never worked on codon data. I saw that your log only considers +F1X4 and +F3X4 but not +FQ or +FO, and I'm not sure whether this is proper.” and thought that you were talking about the frequency types specified for codon data in MixtureFinder, not about my MixtureFinder test for codon data. My apologies. was specified as a frequency type for codon data in MixtureFinder, but it was automatically skipped in my test analysis: empirical codon models were not tested because the genetic code of the data was not The Standard Code but The Invertebrate Mitochondrial Code (I just learned that The Standard Code must be used for empirical codon models). I will test the codon MixtureFinder again on data whose code is standard.

Thank you for your time and support.

HS6986 · 2025-04-11T07:15:24Z

Dear IQ-TREE Developers,

I have tested the codon MixtureFinder on data whose genetic code was standard (HS6986/iqtree3ForkTests), thus activating the tests of empirical codon models. However, the analysis stopped with an error message ERROR: Frequency mixture name not found U. I suspect that this is due to the original IQ-TREE implementation rather than my implementation.

Could you review this issue and fix it when you are available?

Thank you for your time and support.

P.S. This error seems to be due to the IQ-TREE's inability to handle mixture models with classes whose frequencies are FU. This error also occurs when analyzing DNA data using MixtureFinder in IQ-TREE v3.0.0 with -mfreq FQ,F,FO. Here is the log file of the analysis.

HS6986 · 2025-04-15T06:52:23Z

With a very simple change, I was able to fix the aforementioned issue, namely that IQ-TREE does not accept +FU in mixture models. I have updated the tests (HS6986/iqtree3ForkTests) accordingly.

StefanFlaumberg · 2025-04-16T21:01:57Z

Hi all,

I've also tried to explore the IQ-Tree functionality towards using multistate data recently.
As far as I could get, it seems that any kind of multistate data can readily be handled by the ModelMorphology class just after a few simple code modifications not altering the model itself (like turning off the restriction for 32 max states, etc. -- I can expand on the details in another reply if needed). For example, I could successfully run an analysis for the following test alignment (in .phy format) using states 0-399:

4       5
A       1 25 399 6 7
B       1 25 10 6 7
C       1 35 1 6 7
D       1 25 1 6 17

And the multistate alignment (Khalturin2022.WoGaps.recSR4.Subsampled.fasta) used for testing this pull request, which is a usual 4-state alignment, can be handled by the ModelMorphology class without any modification of the program at all. Thus, maybe not a MixtureFinder analysis, but a plain tree reconstruction under a specified model can be run fine for this alignment using just the original IQ-Tree code.

So @HS6986, the question is:
Do we really need to implement a whole new model class (ModelMultistate), when we already have the functionality of the ModelMorphology class?
To me, any multistate data can always be treated as morphology data, and we already have a model class for it.

Sorry if I'm misleading here, I haven't dug deep into this topic yet.

Best,
Stefan

StefanFlaumberg · 2025-04-16T22:26:16Z

@HS6986,

I've just downloaded your code and made a couple of runs with your 4-state test alignment. I added some log messages to model class constructors, and here is what I've got:

If I don't use MixtureFinder, but use ModelFinder not specifying the sequence type as with the following command:
iqtree3 -s Khalturin2022.WoGaps.recSR4.Subsampled.fasta --tree-fix -pre newtest
the sequence type gets correctly recognized as SEQ_MORPH and the ModelMorphology class is initialized:

Creating fast initial parsimony tree by random order stepwise addition...
0.009 seconds, parsimony score: 170 (based on 40 sites)

!!! SeqType is MORPH !!!
!!! Initialize ModelMorphology class !!!

Perform fast likelihood tree search using MK+I+G model...

However, if I use MixtureFinder instead with the following command:
iqtree3 -s Khalturin2022.WoGaps.recSR4.Subsampled.fasta -m MIX+MF -pre newtest
the sequence type ends up being SEQ_MULTISTATE and the new ModelMultistate class is initialized:

MixtureFinder will also test models with unequal rates and/or frequencies

Creating fast initial parsimony tree by random order stepwise addition...
0.007 seconds, parsimony score: 171 (based on 40 sites)

!!! SeqType is MULTI !!!
!!! Initialize ModelMultistate class !!!

WARNING: GTRX multistate model will estimate 5 substitution rates that might be overfitting!
WARNING: Please only use GTRX with very large data and always test for model fit!
Perform fast likelihood tree search using GTRX+I+G model...

So the same alignment under the same conditions (i.e. neither seq type, nor model specified) is treated differently by ModelFinder and MixtureFinder! Kinda weird behaviour to me.

This situation, in fact, can lead to the following confusion:
The alignments of the following type (with states from the [0-9A-V] alphabet) should probably always be assigned the SEQ_MORPH SeqType, and this SeqType always leads to initializing the ModelMorphology class:

It is alignments like the one in my previous comment (with states from an arbitrarily large alphabet denoted by integers) that can be assigned the SEQ_MULTISTATE SeqType, which exists in the original code but is unused, and they are not actually supported in IQ-Tree currently.
Please see the Alignment::buildStateMap function to get a better idea of SEQ_MORPH vs SEQ_MULTISTATE difference.

Maybe the good old ModelMorphology, usually used by IQ-Tree for the data as in your example, could be used by MixtureFinder as well? :)

Best,
Stefan

HuaiyanRen · 2025-04-17T00:40:03Z

@HS6986

The bug with +FU has existed in MixtureFinder for a long time. Thank you for pointing it out and fixing it!

We didn't deal with this issue because it is not common for users to specify +FQ,FO... and then have GTR+FQ, which is equivalent to SYM, selected.

HS6986 · 2025-04-17T09:53:32Z

@StefanFlaumberg

Thank you for your feedback. The reason I created the new model class ModelMultistate was that models that can be (or are usually) applied to morphological data and multistate data (this may have been a bad expression, I have used “multistate data” here as a term for data represented by [0-9,A-Z] other than morphology), respectively, are different.

When analyzing morphological data with probabilistic methods, most empiricists partition data by the number of states in each character (see here) and use Mk models (models with equal rates and frequencies) with ascertainment bias corrections (Lewis, 2001) to model data. Also in IQ-TREE, ModelFinder only considers MK+FQ(+ASC+(rate heterogeneity across characters (e.g., +G))) for morphological data. Although some software programs (MrBayes and RevBayes) implement methods that model heterogeneity of state frequencies in morphological data with mixture models (Wright et al., 2016; here), as morphological data should be partitioned by the number of states and currently ascertainment bias corrections (+ASC) cannot be applied in mixture models in IQ-TREE, morphological data cannot be analyzed in MixtureFinder, at least currently.

On the contrary, multistate data other than morphology, such as recoded amino acid data, can and often should be analyzed in models with unequal rates and/or frequencies (e.g., MK+FO, GTRX+FQ, and GTRX+FO).

If we implemented MixtureFinder so that multistate data other than morphology would be handled by ModelMorphology, only MK models would be called by getModelSubst() in phylotesting.cpp by default, which would be highly problematic.

However, thinking about it again, it seems that there will be no problem if we replace getModelSubst() for multistate data passed to MixtureFinder with code that calls MK and GTRX as the models to test. By doing this, we can avoid the complicated and frankly messy way of creating the new class ModelMultistate.

I will work on this as soon as possible. Thank you very much for your thorough feedback, Stefan!

StefanFlaumberg · 2025-04-17T15:56:11Z

Hi @HS6986,

Thank you for the extensive explanation and useful links! I think I got your idea.
Here is what I think now:

You state that using only the MK model and FQ frequencies by default is problematic:

If we implemented MixtureFinder so that multistate data other than morphology would be handled by ModelMorphology, only MK models would be called by getModelSubst() by default, which would be highly problematic.

But I think it is quite the opposite:
Assume we are analysing some multistate data with 32 possible states (the current maximum allowed). If the default behaviour is to consider both MK and GTRX models and all three FQ, F and FO frequencies, then ModelFinder or MixtureFinder will have to estimate a 32*32 GTRX matrix from a single alignment, a behaviour just as wrong as it is unnecessary! This will take a lot of time and lead only to overfitting the model parameters.

Maybe the original default behaviour (MK+FQ, thus no MixtureFinder) is fine, and if a user still has a good reason to run MixtureFinder for the SEQ_MORPH seqtype data (as in the case with recoded AA data), they could just add something like -mset "+GTRX" -mfreq "FQ,F,FO" to the command?
This should spare us from having to estimate the many parameters of the GTRX model when unnecessary, yet allow using this model if we are sure we want it. One can implement this based on the ModelMorphology class alone, and the code allowing the user to provide additional models and freqs to test (with the -mset and -mfreq options) is already there.
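
To put a number on the 32-state case, here is a small sketch. The function names are hypothetical; the counts follow from the usual parameterisation of a reversible model (one exchangeability per unordered state pair with one fixed for scaling, and k frequencies constrained to sum to 1).

```cpp
#include <cassert>

// Free exchangeability rates of a reversible k-state model such as GTRX:
// k*(k-1)/2 unordered state pairs, minus one rate fixed for scaling.
constexpr int gtrxFreeRates(int k) { return k * (k - 1) / 2 - 1; }

// Free state frequencies under +FO: k frequencies that must sum to 1.
constexpr int foFreeFreqs(int k) { return k - 1; }

// Hypothetical helper: free parameters of a single GTRX+FO class.
int gtrxFoFreeParams(int k) {
    assert(k >= 2);
    return gtrxFreeRates(k) + foFreeFreqs(k);
}
```

For k = 4 this gives the 5 rates mentioned in the GTRX warning; for k = 32 it gives 495 rates, or 526 free parameters per GTRX+FO class, before any mixture weights or rate-heterogeneity parameters are counted.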

HS6986 · 2025-04-18T12:41:56Z

@StefanFlaumberg

Thank you for your reply.

That makes sense!

It seems to me that the best choice is to specify the default models and frequency types for SEQ_MORPH data in MixtureFinder as MK and FQ, respectively, and to make IQ-TREE generate an error (e.g., Running MixtureFinder only with the MK model and the FQ frequency is completely meaningless. Please provide additional models and/or frequencies, such as GTRX, +F, and +FO, using -mset and/or -mfreq, if you really want to use MixtureFinder for your data.) when users call MixtureFinder for multistate data without supplying either other models or frequencies. What do you think?
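
The proposed check could look roughly like the sketch below. All names are hypothetical and the parsing is simplified: it assumes -mset and -mfreq arrive as comma-separated strings, that an empty string means "not supplied", and that a leading `+` (as in `+GTRX` or `+FO`) can be ignored for the comparison.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated option value, dropping spaces and a leading '+',
// so "+GTRX", " FO" and "FO" are treated alike.
std::vector<std::string> splitCommaList(const std::string& s) {
    std::vector<std::string> tokens;
    std::istringstream in(s);
    std::string tok;
    while (std::getline(in, tok, ',')) {
        std::string clean;
        for (char c : tok)
            if (c != ' ' && !(c == '+' && clean.empty())) clean += c;
        if (!clean.empty()) tokens.push_back(clean);
    }
    return tokens;
}

// True if the user list names anything other than the default.
bool addsSomethingBeyond(const std::string& userList, const std::string& defaultName) {
    assert(!defaultName.empty());
    for (const std::string& t : splitCommaList(userList))
        if (t != defaultName) return true;
    return false;
}

// Hypothetical guard for SEQ_MORPH data: MixtureFinder is only worth running
// if the user adds a model besides MK or a frequency type besides FQ.
bool mixtureFinderIsMeaningful(const std::string& mset, const std::string& mfreq) {
    return addsSomethingBeyond(mset, "MK") || addsSomethingBeyond(mfreq, "FQ");
}
```

When the guard returns false, the caller would stop with the error message suggested above instead of starting the search.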

I am going to remove all the pieces of code that have been added in this pull request in relation to SeqMultistate.

Thank you very much for your support and time.

StefanFlaumberg · 2025-04-18T14:57:33Z

@HS6986,

Yes, good idea!

I think you could also suppress the following warnings from the ModelMorphology class if the number of states is <=6 and the alignment/partition length is >=100:

WARNING: GTRX multistate model will estimate 5 substitution rates that might be overfitting!
WARNING: Please only use GTRX with very large data and always test for model fit!

For example, if the alignment has 6 states, for a GTRX matrix we have to estimate only (6*6 - 6)/2 - 1 = 14 parameters. All the corresponding 14 transition pairs are likely to appear in the alignment numerous times, so there should be no concern of overfitting. However, if someone decided to use the GTRX model for true morphological data, which, just as you mentioned before, can be divided into short partitions, estimating even 14 parameters would be a problem. Hence the check for the partition length.
The best way to check partition length is probably through the number of distinct patterns:
phylo_tree->aln->getNPattern() >= 100
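
Put together, the suppression rule might be expressed as the predicate below. This is a sketch only: `shouldWarnAboutGtrx` is a hypothetical free function, and in the real class the two inputs would come from `num_states` and `phylo_tree->aln->getNPattern()`.

```cpp
#include <cassert>

// Hypothetical predicate: keep the GTRX overfitting warnings unless the data
// have few states AND enough distinct patterns.
bool shouldWarnAboutGtrx(int num_states, int num_patterns) {
    assert(num_states >= 2 && num_patterns >= 0);
    const bool few_states = num_states <= 6;
    const bool enough_patterns = num_patterns >= 100;
    return !(few_states && enough_patterns);
}
```

A 4-state alignment with 500 patterns stays silent, while a 6-state partition with 99 patterns or a 7-state alignment of any length still triggers the warnings.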

HS6986 · 2025-04-24T05:13:42Z

@StefanFlaumberg

Sorry for the late reply.

That's an excellent idea! I was also wondering if it is possible to partially suppress the warnings that occur every time you use GTRX.

I'll work on this.

Thank you for your time and support.

…if the number of states <= 6 && the number of the patterns in the alignment/partition >= 100

HS6986 · 2025-04-29T08:04:31Z

Dear @StefanFlaumberg, @HuaiyanRen, @bqminh, and others,

I have completed the following tasks:

Delete the ModelMultistate class and its related code so that multistate data passed to MixtureFinder would be handled by the ModelMorphology class
Set the default model and frequency type for multistate data in MixtureFinder as MK and FQ, respectively
Make IQ-TREE throw an error Error! Running MixtureFinder only with the MK model and the FQ frequency is completely meaningless. Please provide additional models and/or frequencies, such as GTRX, +F, and +FO, using -mset and/or -mfreq, if you really want to use MixtureFinder for your multistate data. when users call MixtureFinder for multistate data without supplying additional models and/or frequencies
Change the wording of the GTRX warnings
Disable the GTRX warnings if the number of states <= 6 && the number of the patterns in an alignment/partition >= 100, as suggested by @StefanFlaumberg

I have tested the behavior of this pull request again (HS6986/iqtree3ForkTests), and it seemed to work properly.

Please let me know if there are any problems or other areas for improvement. It seems to me that currently no problems exist in this pull request (though I'm still not sure whether FQ should be specified as a default frequency type for codon data in MixtureFinder, but probably the answer is no?).

Thank you very much for your time and help.

StefanFlaumberg · 2025-04-29T22:49:14Z

Hi @HS6986,
Thank you for the changes!

A few minor things that bother me (some of them are merely stylistic and can be ignored):

In ModelMorphology class:
You removed the name = "GTRX"; statement right after the warning block. Its function was to convert a user-specified name (which can be either "GTR" or "GTRX") into "GTRX", for the model to have a single proper name in the end.
Maybe the statement should be there?
In ModelMorphology class:
I think putting the warning condition in the following way would make code more readable:
if (num_states > 6 || phylo_tree->aln->getNPattern() < 100)
In phylotesting.cpp lines 6351 and 6528:
I think freq_set_multistate should be replaced with freq_set_morph for code readability, by analogy with other variables existing in this file like morph_model_names and morph_usual_model.
In one of your example runs, lines 71-73 here:
Despite using the MK+FO model, we get only MK printed without mentioning the +FO{...} state frequencies. This is caused not by your changes, but by a bug of the ModelMorphology class. This can be solved by modifying the ModelMorphology::getNameParams function like below:

string ModelMorphology::getNameParams(bool show_fixed_params) {
    ostringstream retname;
    retname << name;
    if (num_params > 0 || show_fixed_params) {
        retname << '{';
        int nrates = getNumRateEntries();
        for (int i = 0; i < nrates; i++) {
            if (i > 0) retname << ',';
            retname << rates[i];
        }
        retname << '}';
    }
    getNameParamsFreq(retname);
    return retname.str();
}

Please notice that I omitted everything associated with the pos_plus variable, as in no scenario I can expect this->name to include the "+" char. According to the same logic one can remove the ModelMorphology::getName() function, as it always reduces to the ModelMarkov::getName() function of the parent class. (I've tested this, yet it'd better be double-checked).

Concerning the default frequency types for codon data:
By the analogy with DNA data, I would answer "no" to your question:

whether FQ should be specified as a default frequency type for codon data in MixtureFinder

+F1X4 for codon works the same as +FO for DNA and is quite a reasonable assumption in terms of the number of parameters (only 3 free parameters to estimate), while +FQ seems to be too strict for most cases.
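
For readers less familiar with codon frequency types, the sketch below shows how a single nucleotide-frequency vector determines all codon frequencies under F1X4. It is illustrative only: the function names are hypothetical, the standard genetic code is assumed (stop codons TAA, TAG, TGA), and nothing is assumed about how IQ-TREE obtains the nucleotide frequencies themselves.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Stop codons of the standard genetic code (assumed here).
bool isStandardStop(const std::string& codon) {
    return codon == "TAA" || codon == "TAG" || codon == "TGA";
}

// Hypothetical helper: F1X4 codon frequencies from one nucleotide-frequency
// vector (order A, C, G, T) shared by all three codon positions.
std::vector<double> f1x4CodonFreqs(const std::array<double, 4>& nt) {
    const std::string bases = "ACGT";
    std::vector<double> freqs;
    double sum = 0.0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++) {
                const std::string codon{bases[i], bases[j], bases[k]};
                if (isStandardStop(codon)) continue;   // stop codons carry no mass
                const double p = nt[i] * nt[j] * nt[k];
                freqs.push_back(p);
                sum += p;
            }
    assert(sum > 0.0);
    for (double& p : freqs) p /= sum;                  // renormalise over sense codons
    return freqs;
}
```

The 61 sense-codon frequencies depend on only 3 free values (four nucleotide frequencies summing to 1). F3X4 uses a separate nucleotide vector for each codon position, which gives 9 free values.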

bqminh

Also, generateNestNetwork only works with DNA, but this function is still called. I'm not sure what happens if this function is called for other data types. Please check that it's the right behaviour.

bqminh · 2025-05-08T02:14:32Z

@HuaiyanRen Can you make these changes to the pull request?

HuaiyanRen · 2025-05-09T06:25:13Z

@HS6986 I made changes to your code according to @bqminh 's advice. Please confirm, thanks!

…; Temporarily comment out `free(init_state_freq_set);`, which HuaiyanRen added, as they cause an error

HS6986 · 2025-05-11T15:46:35Z

Dear All,

Sorry for the slow response. I improved the code according to the advice by StefanFlaumberg and bqminh. Thank you for your advice. I have updated the tests (HS6986/iqtree3ForkTests) accordingly, and the updated code seems to work properly, with the issue of printing the state frequencies of MK+FO solved.

Please let me know if I have missed something.

Thank you very much for your time and support.

HS6986 · 2025-05-11T15:47:05Z

Dear bqminh,

Thank you very much for thoroughly reviewing my code!

HS6986 · 2025-05-11T15:55:43Z

Dear @StefanFlaumberg,

Yeah, I know) In my answer I implicitly meant that as far as FQ does not appear as a default frequency type for any codon model, we should not force it as it is too strict. And similar to DNA models using FQ as default frequencies, your implementation allows using the default model-specified frequencies for empirical codon models, which is great.
Thus, we agree here.

Sorry, I misread your previous message. I'm convinced. Thank you very much for your advice.

Thank you for pointing out!
The strange thing is I cannot reproduce it. So even without these 2 commits by Thomas everything is working perfectly well...
My tests included printing this->name at the end of the model constructor, and it was always the bare model name printed (like MK or GTRX). Also I couldn't find any difference in the .log and .iqtree outputs, when using single models, partition models, mixture models or AliSim with or without the changes from those commits.

I see... That's strange. I've fixed the aforementioned bug using your code while leaving pos_plus as it is just in case. I'll contact thomaskf to ask him if this part of the code is really necessary.

HS6986 · 2025-05-11T15:59:50Z

Dear @HuaiyanRen,

Thank you very much for your changes!

But the free(init_state_freq_set); statements, which were added in your commits, do not seem to work properly and cause an error, free(): invalid pointer. I temporarily commented them out and it worked fine. Could you confirm whether these statements make sense?

Thank you for your time and support.

HS6986 · 2025-05-11T16:42:27Z

Dear @thomaskf,

Sorry for the mention.

In this pull request, we extended MixtureFinder to codon, binary, multistate, and amino acid data. In the process, we found a bug: the state frequencies of MK+FO are not printed in the .iqtree file. We have fixed this bug by modifying modelmorphology.cpp. However, StefanFlaumberg claims, based on their extensive tests, that ModelMorphology::getName() and pos_plus in modelmorphology.cpp, which were added in the latest two commits by you, do not make sense, as this->name never contains the + character. Could you confirm this?

Thank you very much for your time and help.

HuaiyanRen · 2025-05-11T23:55:14Z

@HS6986, you are correct. This piece of code is unnecessary. Thank you for pointing it out.

HS6986 · 2025-05-12T01:05:13Z

Dear @HuaiyanRen,

Thank you for your reply. I've deleted free(init_state_freq_set).

Thank you very much for your help.

StefanFlaumberg · 2025-05-12T01:07:43Z

Hi @HS6986 ,
Thank you for the changes!

I'm sorry, but the name = "GTRX"; statement in the ModelMorphology class is out of place: it should be right outside the warning block.

BTW free() is only applicable if the memory for the pointer has explicitly been allocated. Here it's not the case, so no free() can be used.
BTW2 I should say that new stuff with the initFreqSet function returning char* pointer to initialize the init_state_freq_set variable on its declaration doesn't look right and safe in terms of memory, the previous version without the function was much cleaner. But I don't know how to improve it here.
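
To illustrate both remarks, here is a minimal sketch. `copyToHeap` and `defaultFreqSet` are hypothetical names and are not the `initFreqSet` function from the pull request; the default lists used are the ones mentioned in this thread (FQ,FO for DNA, FQ for morphology-type data).

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Heap copy: the caller owns the result and must release it with std::free().
char* copyToHeap(const char* text) {
    char* buf = static_cast<char*>(std::malloc(std::strlen(text) + 1));
    assert(buf != nullptr);
    std::strcpy(buf, text);
    return buf;
}

// A string literal has static storage. Passing a pointer to it to free() is
// undefined behaviour and typically aborts with "free(): invalid pointer".
const char* literalFreqSet() { return "FQ,FO"; }

// Hypothetical alternative: return the list by value as std::string, so no
// manual free() is needed at all.
std::string defaultFreqSet(bool is_dna) {
    return is_dna ? "FQ,FO" : "FQ";
}
```

Returning `std::string` lets the object manage its own memory, so the question of whether a given `char*` may be freed does not arise.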

HS6986 · 2025-05-12T01:28:50Z

Dear @StefanFlaumberg,

Thank you for your reply.

I'm sorry, but the name = "GTRX"; statement in the ModelMorphology class is out of place: it should be right outside the warning block.

Sorry, I messed up again. Thank you for pointing it out. I've fixed it.

BTW2 I should say that new stuff with the initFreqSet function returning char* pointer to initialize the init_state_freq_set variable on its declaration doesn't look right and safe in terms of memory, the previous version without the function was much cleaner. But I don't know how to improve it here.

@HuaiyanRen, could you check this?

Thank you very much for your help.

roblanf · 2025-05-12T04:13:36Z

Hi All,

Apologies for weighing in a bit late here - I've been busy with teaching in the last couple of months. First I want to thank @HS6986 and @StefanFlaumberg for all the work they have put into this. It's great to see the community extending the software!

The point of this comment is to ask a question about this addition, which I'd like to see clearly addressed before these updates are made available to IQ-TREE users. Hopefully this kicks off a good discussion!

The question is: what is the theoretical and practical basis of adding these features, and do we have empirical evidence that they will be useful (i.e. improve inference for IQ-TREE users)?

Here's a longer version, focussed on amino acid models which are the ones I am most familiar with. The motivation for MixtureFinder was that perhaps partitioned models are not the best way to account for variation in evolutionary processes among sites. We noticed that mixture models were rarely applied to DNA data, even though most DNA datasets contain plenty of information to estimate mixture models, and the parameterisation of DNA models is such that it's feasible to estimate quite complex DNA mixture models for most users. To this end, the questions we asked (and, I hope, answered!) in the MixtureFinder paper were whether these models are really justified (they were, because they fit individual empirical partitions far better than one-class GTR models, suggesting that current methods are limited in terms of their ability to explain data), and then whether they were useful (they were, though the improvements in inference were sometimes modest). Before we make the current additions available, I'd like to see similar tests done for the other data types.

For the amino acid models (and potentially the others, but I'm less sure there because I am far less familiar with them) I'm a little sceptical that a MixtureFinder approach is worthwhile. The main reason is that most amino acid models (like LG, WAG, Q.pfam, etc) are estimated from huge collections of alignments, and represent average replacement rates and frequencies across those collections. Even if you take clade-specific models, the frequency vectors are quite similar to each other, and more similar to each other than the vectors of profile mixture models like C10-C60, precisely because the former are still averages over a large number of sites. So, I'm not really sure what it means to have a mixture of these kinds of models. Sure, it's likely to fit the data a bit better, but I doubt that it's really worth the extra computational cost. The second reason is that we already have good amino acid mixture models in IQ-TREE, both for replacement rates (e.g. LG4X, LG4M) and frequencies (e.g. all the profile mixture models like C10-C60; PMSF models which can build on / summarise Bayesian CAT models; UDM models, etc). These are very much biologically motivated approaches to mixture models, which are likely (I guess, but I don't know!) to be a better way to model amino acid evolution than building complex mixtures of one-class models (like LG, WAG, Q.mammal, etc). On top of that we can now estimate GTR20 replacement matrices under these models (e.g. the GTRpmix paper) and re-estimate the weights of frequency vectors for profile mixture models. We can also estimate a huge range of these kinds of models from scratch in IQ-TREE (not all of those methods are published yet, because they haven't been checked sufficiently).

To be clear, I'm not saying that I know that extending MixtureFinder to amino acid models wouldn't be useful. It's just that I have reservations, and I think one would need clear theoretical grounds and empirical demonstrations before this extension is made available to users. The same goes for the morphological and codon models. (I'm not well versed in morphological models, though my intuition is that mixturefinder might be a good approach for codon models).

To me, it's really important that methods available in IQ-TREE are backed up by solid evidence of their utility and accuracy. Ultimately I think this means doing a lot of simulations to check their accuracy (in terms of implementation), then finding creative ways to validate their utility on published empirical datasets. As long as we do this, then users can trust that the methods open to them in IQ-TREE are ones that we expect to be useful, accurate, and practical. The latter is important here, because mixture models imply a lot more computation, so that should be coupled with the payoff of more accuracy. If the MixtureFinder approach proves not to be useful for some datatypes (like amino acids perhaps, if my hunch above is right), that's still very useful to know, and good research!

The good news is that I think the MixtureFinder paper (https://academic.oup.com/mbe/article/42/1/msae264/7931682) provides a useful template for doing this on each of the new datatypes. Though I'm sure there are ways to improve our approaches there too.

Happy to discuss more! Thanks again for all the work. I'm genuinely excited to see these methods extended, and I hope (and suspect) that the approach will be useful on some datatypes, just perhaps not all of them.

Cheers,

Rob

P.S. We can of course build the methods into IQ-TREE so they are available, but not documented. And we should make sure that methods that have not been validated come with an appropriate warning! I prefer this approach to having too many branches which gradually diverge and create huge headaches down the track.

HS6986 · 2025-05-12T05:58:01Z

@roblanf

Thank you very much for your thorough feedback.

Since I am out now, I'll give you my tentative conclusion to your question for now. I'll post details as to why later.

In short, your comments have made me realize that the MixtureFinder approach is probably not suitable for amino acid data or for binary or multistate data recoded from amino acids. For codon data, binary data (e.g., genome gene content, microsynteny, RY-recoded DNA, and surely many other examples), and multistate data other than morphology or recoded amino acids, however, it is at least theoretically justified, would most probably improve phylogenetic inference, and is probably a very sensible approach. Thank you for pointing this out. We may consider disabling MixtureFinder for amino acid data, or having MixtureFinder print a warning that recommends other mixture models for amino acid data.

StefanFlaumberg · 2025-05-12T18:11:28Z

Hi @roblanf and @HS6986,

I agree that the trade-off between model fit and computational cost may not favour using MixtureFinder for data types with a high number of states, and that it is worthwhile to identify, theoretically and experimentally, the specific situations in which using MixtureFinder is justified, and to present these situations in the IQ-TREE manual to prevent unintended usage.
However, realistically I do not expect anyone to test it thoroughly anytime soon. And given that the suggested extension of MixtureFinder to non-DNA data types is rather simple code-wise and is also user-friendly, it would be quite unwise to take away from advanced users the chance to experiment with MixtureFinder on their data.
Thus, I'd agree that for now the extension should not be documented (i.e. not explicitly advertised to general users), but should be accepted after adding the relevant warnings.

Regarding the warnings, I think a good solution would be putting something like this at line 6674 of the phylotesting.cpp file (inside the runMixtureFinder function):

if (iqtree->aln->seq_type != SEQ_DNA)
    outWarning("MixtureFinder has not been tested for non-DNA data types. Be cautious about interpreting the results");
if (iqtree->aln->getMaxNumStates() > 6)
    outWarning("Running MixtureFinder for the given data type can take much time. Consider restricting the set of the models to test as much as possible");

bqminh · 2025-05-13T07:59:43Z

I just talked with @thomaskf and @HuaiyanRen about this. One way to go is this:

If MixtureFinder is run on non-DNA data, then in addition to the warning message above, we stop and print an error message, but we also offer the possibility to still run it by telling the user to use the option "--force-non-dna-mixture" (or something like that).
We add this new option to the command line. With this option, IQ-TREE will run MixtureFinder on non-DNA data.

Make sure that the warning message is always included in either case.

That way, users are aware of the problem, as they have to run IQ-TREE twice. Otherwise, if you just let it run, many people won't notice, no matter if there is any warning...
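The flow proposed above can be sketched as standalone C++ rather than as a patch to phylotesting.cpp. Everything here is a hypothetical stand-in: the `SeqKind` enum, the `GateResult` struct, the `gateMixtureFinder` function, and the `force_non_dna_mixture` field are illustrative names, not existing IQ-TREE code. Only the two warning texts and the placeholder option name `--force-non-dna-mixture` come from the comments above, and the error text is invented for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for IQ-TREE's sequence type constants.
enum class SeqKind { DNA, Protein, Codon, Binary, Multistate };

// Outcome of the check: warnings are collected whether or not the run proceeds.
struct GateResult {
    std::vector<std::string> warnings;
    bool may_run = true;
    std::string error;  // non-empty only when may_run is false
};

// Decide whether MixtureFinder may run on this data type.
// force_non_dna_mixture mirrors the proposed command-line option.
GateResult gateMixtureFinder(SeqKind kind, int num_states, bool force_non_dna_mixture) {
    assert(num_states >= 2);
    GateResult result;
    if (kind == SeqKind::DNA)
        return result;  // the validated case: no warning, no error

    // The warning is emitted in either case, forced or not.
    result.warnings.push_back(
        "MixtureFinder has not been tested for non-DNA data types. "
        "Be cautious about interpreting the results");
    if (num_states > 6)
        result.warnings.push_back(
            "Running MixtureFinder for the given data type can take much time. "
            "Consider restricting the set of the models to test as much as possible");

    // Without the explicit opt-in, stop and tell the user how to proceed.
    if (!force_non_dna_mixture) {
        result.may_run = false;
        result.error = "MixtureFinder was run on non-DNA data. "
                       "Use --force-non-dna-mixture to run it anyway";
    }
    return result;
}
```

Returning the warnings alongside the error, instead of throwing, is one way to guarantee that the warning text is printed both on the first (stopped) run and on the second (forced) run.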

HS6986 · 2025-05-16T06:28:19Z

Dear All,

Thank you for all the comments!

I'd agree that in most cases the MixtureFinder approach probably wouldn't be a sensible way to model amino acid data or binary or multistate data recoded from amino acids. I'd also agree that, for amino acid data, the MixtureFinder approach should not be described in the documentation, should come with warnings, and should stop with an error message by default. However, I think that the MixtureFinder approach would probably be a very sensible way to find the best-fitting model for codon data and for binary and multistate data other than morphology and recoded amino acids. I think that for these data types the MixtureFinder approach is well worth documenting, even if accompanied by warnings (or even by default error messages).

My reasons are as follows:

Amino acid data and their recoded binary or multistate data

For amino acid data, as roblanf pointed out, we already have the mixture frequency vectors CXX, which are probably well behaved and should explain the data well. In most cases, using the MixtureFinder approach for amino acid data would probably be inferior to using them in two ways: (1) it would require extra computational cost, and (2) the replacement rates and frequencies that MixtureFinder selects for each class often wouldn't be predefined, which greatly increases the number of free parameters and thus makes MixtureFinder terminate before the mixture model explains the data sufficiently well. The same applies to binary or multistate data recoded from amino acids, because recoded versions of the CXX mixture frequency vectors can be applied to these data types (see Redmond & McLysaght, 2021; Najle et al., 2023; https://github.com/xgrau/recoded-mixture-models).
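Point (2) can be made concrete with a little arithmetic. The sketch below assumes the worst case, in which every class is a fully estimated reversible model on n states: n(n-1)/2 - 1 free exchangeabilities (one fixed for scale) and n - 1 free frequencies, plus k - 1 free class weights for a k-class mixture. It compares this with a mixture whose classes are all predefined, assuming only the weights are free there. The function names are illustrative, and parameters shared by both cases (branch lengths, rate heterogeneity) are ignored.

```cpp
#include <cassert>
#include <cmath>

// Free parameters of one fully estimated reversible class on n states:
// n(n-1)/2 exchangeabilities with one fixed for scale, plus n-1 frequencies.
int freeParamsPerEstimatedClass(int n) {
    assert(n >= 2);
    return (n * (n - 1) / 2 - 1) + (n - 1);
}

// A k-class mixture: per_class free parameters in each class,
// plus k-1 free class weights (the weights sum to one).
int freeParamsMixture(int k, int per_class) {
    assert(k >= 1 && per_class >= 0);
    return k * per_class + (k - 1);
}

// Penalty term of BIC: (number of free parameters) * ln(number of sites).
double bicPenalty(int free_params, int num_sites) {
    assert(num_sites > 0);
    return free_params * std::log(static_cast<double>(num_sites));
}
```

Under these assumptions, a two-class fully estimated amino acid mixture has `freeParamsMixture(2, freeParamsPerEstimatedClass(20))` = 417 free parameters, whereas a 20-class mixture of predefined classes has `freeParamsMixture(20, 0)` = 19. Each added estimated class must therefore buy a much larger likelihood gain before an information criterion accepts it.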

Codon data

I'm not familiar with codon models, so I apologize if I have misunderstood something. For codon data, we don't have predefined mixture vectors of replacement rates and/or frequencies, so using the MixtureFinder approach would probably be a sensible way to improve the model fit. This has not been empirically validated, but it should probably improve the fit as judged by BIC, AIC, AICc, or a likelihood ratio test. Since ModelFinder relies exclusively on these criteria without any warnings or error messages, I think there would be no problem in documenting the MixtureFinder approach for codon data as well, and it might not even require special warnings or default error messages. If mixing multiple empirical codon models seems strange, users can restrict the candidate models with -mset.
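To illustrate what "judged by BIC" means for a growing mixture, here is a minimal sketch that assumes a greedy rule: classes are added one at a time, and the search stops as soon as an extra class no longer lowers BIC. The `Candidate` struct, the function names, and the numbers in the usage note are invented for illustration; this is not the MixtureFinder implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Best model found for a given number of classes (hypothetical summary).
struct Candidate {
    double log_likelihood;
    int free_params;
};

// BIC = -2 lnL + k ln(n), with k free parameters and n sites. Lower is better.
double bic(const Candidate& c, int num_sites) {
    assert(num_sites > 0);
    return -2.0 * c.log_likelihood +
           c.free_params * std::log(static_cast<double>(num_sites));
}

// candidates[i] holds the best model with i+1 classes.
// Returns the number of classes kept under the greedy stopping rule.
std::size_t chooseNumClasses(const std::vector<Candidate>& candidates, int num_sites) {
    assert(!candidates.empty());
    std::size_t chosen = 1;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        // Stop at the first class that fails to improve BIC.
        if (bic(candidates[i], num_sites) >= bic(candidates[i - 1], num_sites))
            break;
        chosen = i + 1;
    }
    return chosen;
}
```

For example, with 1000 sites and made-up candidates of log-likelihood -10000, -9900, and -9895 with 8, 17, and 26 free parameters, the second class lowers BIC but the third does not, so two classes are kept.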

Binary and multistate data other than morphology and recoded amino acids

These data are in most cases some sort of genomic information (e.g., genome gene content and synteny) or recoded DNA data. As with codon data, we don't have predefined mixture vectors of replacement rates and/or frequencies for these data types, so what has been said about codon data also applies here. For example, for binary data, MIX{GTR2+FO,GTR2+FO,JC2+FO} would fit some data better than GTR2+FO or JC2+FQ.

I think that the MixtureFinder approach cannot be applied to morphological data as it is, but I believe that the MixtureFinder-like algorithm I devised and implemented in #35 will work well.

Thank you very much for your time and support. I look forward to further discussion.

extend MixtureFinder to codon, binary, multistate, and amino acid data

6df0555

fix frequency types

d3e3e59

allow FU in mixture models

3a4de2b

HS6986 added 4 commits April 29, 2025 00:38

delete the class ModelMultistate

39c8a32

Merge branch 'master' into feature/HS6986/extend-MixtureFinder

4af564f

Changed the wording of the GTRX warnings; disabled the GTRX warnings if the number of states <= 6 && the number of the patterns in the alignment/partition >= 100

477b8b0

Fix indentation issues

5e95d0e

bqminh reviewed May 6, 2025

HuaiyanRen added 2 commits May 9, 2025 15:42

Only generateNestNetwork for DNA models.

06fffe7

create a function to initialise MixtureFinder frequencies

16fc417

HS6986 added 3 commits May 11, 2025 19:47

Fix conflicts

6bac548

Restore a comment I accidentally deleted

0b417c2

Refine the code according to the advice by StefanFlaumberg and bqminh; Temporarily comment out `free(init_state_freq_set);`, which HuaiyanRen added, as they cause an error

ce2b9a9

Delete free(init_state_freq_set)

727a934

Relocate the misplaced name = "GTRX";

623874a

Restrict isRateTypeNested() only to DNA data

3f01fd9

HS6986 changed the title ~~Extend MixtureFinder to codon, binary, multistate, and amino acid data~~ Extend MixtureFinder to codon, binary, multistate, (and amino acid) data May 12, 2025

bqminh mentioned this pull request May 13, 2025

[Feature Request] Allow +ASC for mixture models #12

Open

HS6986 changed the title ~~Extend MixtureFinder to codon, binary, multistate, (and amino acid) data~~ WIP: Extend MixtureFinder to codon, binary, multistate, (and amino acid) data May 13, 2025

HS6986 mentioned this pull request May 13, 2025

[WIP: Implementation of a New Algorithm] Extension of MixtureFinder to morphological data #35

Draft

HS6986 marked this pull request as draft May 13, 2025 23:06
