Thanks for your wonderful work! I have a question: how does the visual specialist (e.g., stablevideo) receive both the textual instruction and the task features?
It seems that the textual instruction is a sequence of words, while the task features are matrices or tensors. How are the two combined as input to the visual specialist?