I would like to ask whether the computational complexity given in the paper is correct.
Should it be O((SHW) * (F * T/F)) instead of O((SHW) * (S * T/F))?
I think that in RS-MMA there are F audio patches (each of length T/F), and each audio patch is computed against a video patch (of length SHW). Thus, the computational complexity should be O((SHW) * (F * T/F)).
Is my understanding correct? Your comments would be really appreciated.
I think the paper computes the complexity from the sizes of the two sequences: in O((SHW) * (S * T/F)), SHW is the size of the video sequence and S * T/F is the size of the audio sequence. So it should be correct.
However, I am confused because the cross-attention seems to be computed iteratively over all the segments, not only the single segment mentioned in the paper. So I think the complexity should be O((SHW) * (S * T/F) * F/S) = O((SHW) * T), where the extra factor F/S accounts for the F/S iterations.
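To make the comparison concrete, here is a small sketch that just counts query-key pairs for the three expressions discussed in this thread. It is only arithmetic on the formulas as written here, not code from the repository; the function names and the sample sizes (S=4, H=W=8, T=1600, F=16) are made up for illustration, and it assumes F is divisible by S and T is divisible by F.

```python
def pairs_single_window(S, H, W, T, F):
    """Pairs for one window: (S*H*W) video tokens x (S*T/F) audio tokens."""
    return (S * H * W) * (S * (T // F))


def pairs_all_windows(S, H, W, T, F):
    """The single-window count repeated over F/S iterations."""
    return pairs_single_window(S, H, W, T, F) * (F // S)


def pairs_per_audio_patch(S, H, W, T, F):
    """F audio patches of length T/F, each against S*H*W video tokens."""
    return (S * H * W) * (F * (T // F))


# Hypothetical sizes, chosen only so that the divisions are exact.
S, H, W, T, F = 4, 8, 8, 1600, 16

single = pairs_single_window(S, H, W, T, F)
iterated = pairs_all_windows(S, H, W, T, F)
per_patch = pairs_per_audio_patch(S, H, W, T, F)

print(single, iterated, per_patch)
```

With these sizes, the iterated count and the per-audio-patch count both reduce to (S*H*W) * T, so the expression in the original question and the one with the extra F/S factor agree; only the single-window expression is smaller, by a factor of F/S.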