Several questions on the Neural FK. #45

Open
catherineytw opened this issue Apr 24, 2022 · 11 comments

@catherineytw

Dear authors,

Thank you for sharing the code, really nice work!

In the past few days, I've been reading your paper and studying your code carefully, and have several questions on the bone2skeleton function and the neural FK layer.

  1. Why does the bone2skel function (in model.py) reconstruct an unusual skeleton? According to your paper, the output of the S network is the bone lengths of a predefined skeleton (in my opinion, the skeleton only defines the topology), and the real skeleton is reconstructed by bone2skel() with the learned bone lengths and the topology. I visualized the reconstructed skeleton (left); as shown in the figure, it is clearly unusual and the topology (without the end-effectors) is incorrect. In my opinion, it should be the one on the right. (The skeleton was saved when running evaluation on the H36M data using your pre-trained model h36m_gt.pth.)

[image: reconstructed skeleton (left) vs. expected skeleton (right)]

  2. Since the skeleton topology looked wrong, why does the neural FK layer still reconstruct the correct 3D points? Does the neural FK layer compute the 3D joints differently from the traditional FK algorithm?

Any responses will be highly appreciated!

@lucasjinreal

How did you visualize it? Which tool did you use?

@Shimingyi (Owner)

Hi @catherineytw,

Thanks for the question! I think you got the wrong visualization because of a misunderstanding of the skel_in variable. When I wrote the code, I named this variable after the original TensorFlow code. It represents the offsets between parent joints and child joints, rather than positions in global coordinates. I noticed that the knee and foot are located in the same place in your visualization, but based on our code that shouldn't happen, so I wonder if this is the reason.

def bones2skel(bones, bone_mean, bone_std):
    unnorm_bones = bones * bone_std.unsqueeze(0) + bone_mean.repeat(bones.shape[0], 1, 1)
    skel_in = torch.zeros(bones.shape[0], 17, 3).cuda()
    skel_in[:, 1, 0] = -unnorm_bones[:, 0, 0]
    skel_in[:, 4, 0] = unnorm_bones[:, 0, 0]
    skel_in[:, 2, 1] = -unnorm_bones[:, 0, 1]
    skel_in[:, 5, 1] = -unnorm_bones[:, 0, 1]
    skel_in[:, 3, 1] = -unnorm_bones[:, 0, 2]
    skel_in[:, 6, 1] = -unnorm_bones[:, 0, 2]
    skel_in[:, 7, 1] = unnorm_bones[:, 0, 3]
    skel_in[:, 8, 1] = unnorm_bones[:, 0, 4]
    skel_in[:, 9, 1] = unnorm_bones[:, 0, 5]
    skel_in[:, 10, 1] = unnorm_bones[:, 0, 6]
    skel_in[:, 11, 0] = unnorm_bones[:, 0, 7]
    skel_in[:, 12, 0] = unnorm_bones[:, 0, 8]
    skel_in[:, 13, 0] = unnorm_bones[:, 0, 9]
    skel_in[:, 14, 0] = -unnorm_bones[:, 0, 7]
    skel_in[:, 15, 0] = -unnorm_bones[:, 0, 8]
    skel_in[:, 16, 0] = -unnorm_bones[:, 0, 9]
    return skel_in
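
For visualization, these per-joint offsets need to be accumulated along the parent chain before plotting. Below is a minimal sketch of that accumulation; the parent list here is only an assumed standard 17-joint H36M topology for illustration, so please check it against the skeleton definition in the repo:

import torch

# Assumed H36M-style 17-joint parent indices (root = -1); verify against the repo.
H36M_PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def offsets_to_positions(skel_in, parents=H36M_PARENTS):
    # Accumulate each joint's offset onto its parent's position to get
    # rest-pose joint positions in a common frame (no rotations applied).
    positions = torch.zeros_like(skel_in)            # (batch, 17, 3)
    for joint, parent in enumerate(parents):
        if parent < 0:
            continue                                 # root stays at the origin
        positions[:, joint] = positions[:, parent] + skel_in[:, joint]
    return positions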

@catherineytw (Author)

Hi @catherineytw,

Thanks for the question! I think you got the wrong visualization because of a misunderstanding of the skel_in variable. When I wrote the code, I named this variable after the original TensorFlow code. It represents the offsets between parent joints and child joints, rather than positions in global coordinates. I noticed that the knee and foot are located in the same place in your visualization, but based on our code that shouldn't happen, so I wonder if this is the reason.

Thank you for the quick response.
Based on your explanation, the parameters of the FK layer are: the skeletal topology (parents), the joint offset matrix (skel_in), and the joint rotations as quaternions, am I right?

In addition, I am super curious about the adversarial rotation loss.

  1. If the training dataset is aligned, is it possible to minimize the differences between the fake and real rotations (absolute values) instead of their temporal differences?
  2. Is it possible to use CMU mocap data to train the S and Q branches and use the H36M dataset in a semi-supervised style to validate the training?

Global trajectory reconstruction is another subject I'm interested in.
I compared the global trajectories reconstructed using the pre-trained model h36m_gt_t.pth, as shown in the gif: the top row shows the global trajectory reconstructed by VideoPose3D (which I modified a little), the second row is the GT, and the third row was reconstructed by h36m_gt_t. As shown in the video, the man is walking in circles, but the global trajectory reconstructed by h36m_gt_t is linear. Could you kindly explain why?
[gif: S11 trajectory comparison]

Any responses will be highly appreciated!
Best regards.

@catherineytw (Author)

How did you visualize it? Which tool did you use?

I wrote a simple interface using QGLViewer and rendered the 3D skeletons with OpenGL.

@Shimingyi (Owner)

@catherineytw

Thank you for the quick response. Based on your explanation, the parameters of the FK layer are: the skeletal topology (parents), the joint offset matrix (skel_in), and the joint rotations as quaternions, am I right?

Yes, you are right.
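
For reference, a traditional FK pass over those three inputs could look like the minimal sketch below. This is only an illustration, not the fk_layer in the repo: the quaternion helpers, the (w, x, y, z) layout, and the function names are assumptions here.

import torch

def quat_mul(a, b):
    # Hamilton product of quaternions with (w, x, y, z) layout.
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], dim=-1)

def quat_rotate(q, v):
    # Rotate vectors v (..., 3) by unit quaternions q (..., 4).
    w, xyz = q[..., :1], q[..., 1:]
    t = 2.0 * torch.cross(xyz, v, dim=-1)
    return v + w * t + torch.cross(xyz, t, dim=-1)

def forward_kinematics(parents, offsets, quats):
    # parents: parent index per joint (root = -1), topologically ordered
    # offsets: (batch, J, 3) joint offsets (skel_in); quats: (batch, J, 4) local rotations
    positions = [offsets[:, 0]]           # root position (zero offset in skel_in)
    global_q = [quats[:, 0]]
    for j in range(1, offsets.shape[1]):
        p = parents[j]
        # rotate the local offset by the parent's accumulated rotation, then translate
        positions.append(positions[p] + quat_rotate(global_q[p], offsets[:, j]))
        global_q.append(quat_mul(global_q[p], quats[:, j]))
    return torch.stack(positions, dim=1)  # (batch, J, 3) root-relative joint positions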

In addition, I am super curious about the adversarial rotation loss.

  1. If the training dataset is aligned, is it possible to minimize the differences between the fake and real rotations (absolute values) instead of their temporal differences?
  2. Is it possible to use CMU mocap data to train the S and Q branches and use the H36M dataset in a semi-supervised style to validate the training?

  1. When we wrote this paper, we found some potential problems when applying supervision on the rotations directly, and we didn't explore this problem further. Recently I have done more experiments in this field to get a better feel for it, and the main reason is the ambiguity of rotations. Basically, aligning the skeleton is not enough to get stable fake-real paired training, but it is still applicable.
  2. Possible; we also got some results based on CMU supervision. But, as I said, the results are not stable.

Global trajectory reconstruction is another subject I'm interested in.

I didn't convert our prediction into world space, for a few reasons. In your visualization, what kind of transformation did you apply to the global rotation and global trajectory? I have my own visualization script, and it aligns with the GT very well, so I wonder if some steps are missing.

@Shimingyi (Owner)

Here is how we transform the global trajectory; you can insert it after this line and get the correct trajectory in your visualization.

translation_world = (R.T.dot(translation.T) + T).T

Note that this is a trajectory in world space, which cannot be combined with the predicted rotations in camera space. So we can only compare the trajectory itself until we can convert our BVH rotations to world space. I didn't do that because of some technical issues at the time; I am able to fix it now, but allow me some time to implement and test it.
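
To make the shapes explicit, here is how that one-liner reads as a small helper, assuming Human3.6M-style camera parameters where R is the 3x3 world-to-camera rotation and T is the camera position as a (3, 1) column. This is just a sketch of the same formula, not additional code from the repo:

import numpy as np

def camera_to_world(translation, R, T):
    # translation: (N, 3) per-frame root trajectory in camera space
    # R: (3, 3) world-to-camera rotation, T: (3, 1) camera position in world space
    # X_world = R^T @ X_cam + T, applied frame-wise
    return (R.T.dot(translation.T) + T).T  # -> (N, 3)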

@catherineytw (Author)

@catherineytw

I didn't convert our prediction into world space, for a few reasons. In your visualization, what kind of transformation did you apply to the global rotation and global trajectory? I have my own visualization script, and it aligns with the GT very well, so I wonder if some steps are missing.

Thank you for the detailed explanation. Allow me to be more specific: the trajectories in the gif were in camera space. Here is the BVH file I got by running the evaluation code (without changing anything) using h36m_gt_t.pth; the subject doesn't walk in circles, and the motion jitters when he turns around. The trajectory validation example on your project page is perfect, but I cannot replicate it. I am really confused and trying to figure out what goes wrong.
S11_Walking_1_bvh.txt

Any responses will be highly appreciated!
Best regards.

@Shimingyi (Owner)

@catherineytw
The trajectory is in world space in your visualization, but ours is in camera space. That's why you find them different and ours is not circular. If you want to resolve it, you need to take the XYZ trajectory from our method and apply the transformation above. For the trajectory prediction, there are a few points that should be clarified:

  1. We assume it's impossible to recover an absolute trajectory, because going from 2D to 3D is an ill-posed problem. So our pipeline is a little different from VideoPose, which predicts absolute XYZ values. In our solution, we predict a depth factor and then assemble it with the XY movement of the root joint from the original 2D detection to recover an XYZ global trajectory. Also, because all the predictions in our method are relative values, we apply a manual scaling to make the trajectory align better; in the default setting I set it to 8 (code). This can cause unnatural results but is adjustable.
  2. So, how do we evaluate the trajectory? We only need to compare our predicted depth_facter with gt_depth_facter, recover a translation_rec and gt_translation, and then visualize the translations.

In this open-source code, you can easily get it by inserting the following code after L54:

# Recover the trajectory with pre_proj
translation = (translations * np.repeat(pre_proj[0].cpu().numpy(), 3, axis=-1).reshape((-1, 3))) * 12
with open('./paper_figure/root_path_global/%s.rec.txt' % video_name, 'w') as f:
    for index1 in range(translation.shape[0]):
        f.write('%s %s %s\n' % (translation[index1, 0], translation[index1, 1], translation[index1, 2]))

# Recover the trajectory with gt_proj_facters
translation = (translations * np.repeat(proj_facters[0].cpu().numpy(), 3, axis=-1).reshape((-1, 3))) * 12
with open('./paper_figure/root_path_global/%s.gt.txt' % video_name, 'w') as f:
    for index1 in range(translation.shape[0]):
        f.write('%s %s %s\n' % (translation[index1, 0], translation[index1, 1], translation[index1, 2]))

Then you can compare the two trajectories. I re-ran the visualization code and was able to get these results:
[image: recovered vs. ground-truth trajectory comparison]

Regarding the motion jitter problem, I am sorry to say it's hard to avoid with the current framework. Using sparse 2D positions as input loses important information. Our prediction is per-frame, unlike VideoPose, which takes 243 input frames to predict 1, so it is expected to have stronger temporal coherence than ours.

@catherineytw (Author)

Then you can compare the two trajectories. I re-ran the visualization code and was able to get these results.

Thank you for the detailed explanation. I added the code and finally got the trajectories on the left; they match perfectly. However, I still cannot get the circular global trajectory in world space shown on the right. Here is my code, and the world trajectories look exactly like the camera-space ones on the left (as shown in the figure). Could you help me figure out what's wrong?
# ----------------------- Trajectory file --------------------------
if config.arch.translation:
    R, T, f, c, k, p, res_w, res_h = test_data_loader.dataset.cameras[
        (int(video_name.split('_')[0].replace('S', '')), int(video_name.split('_')[-1]))]
    pose_2d_film = (poses_2d_pixel[0, :, :2].cpu().numpy() - c[:, 0]) / f[:, 0]
    translations = np.ones(shape=(pose_2d_film.shape[0], 3))
    translations[:, :2] = pose_2d_film

    # Recover the trajectory with pre_proj
    translation = (translations * np.repeat(pre_proj[0].cpu().numpy(), 3, axis=-1).reshape((-1, 3))) * 12
    np.save('{}/{}_rec.npy'.format(output_trajectory_path, video_name), translation)
    translation_world = (R.T.dot(translation.T) + T).T
    np.save('{}/{}_rec_world.npy'.format(output_trajectory_path, video_name), translation_world)

    # Recover the trajectory with gt_proj_facters
    translation = (translations * np.repeat(proj_facters[0].cpu().numpy(), 3, axis=-1).reshape((-1, 3))) * 12
    np.save('{}/{}_gt.npy'.format(output_trajectory_path, video_name), translation)
    translation_world = (R.T.dot(translation.T) + T).T
    np.save('{}/{}_gt_world.npy'.format(output_trajectory_path, video_name), translation_world)

[images: recovered camera-space trajectories and the converted world-space trajectories]

By the way, I couldn't agree with you more: precisely recovering the 3D global trajectory is impossible due to the depth ambiguity and the unknown camera intrinsic/extrinsic parameters. In my opinion, as long as the projected 3D trajectory matches the 2D trajectory in the video, it is valid.

Regarding the motion jitter problem, I am sorry to say it's hard to avoid with the current framework. Using sparse 2D positions as input loses important information. Our prediction is per-frame, unlike VideoPose, which takes 243 input frames to predict 1, so it is expected to have stronger temporal coherence than ours.

In terms of the motion jitter problem, I have several tentative ideas that I'd like to discuss with you. Is it possible to add rotation angular-velocity and acceleration terms to the adversarial training loss to keep the motion smooth? Or to add joint velocity and acceleration loss terms on the reconstructed skeleton? Since the network takes motion chunks as input, why not make use of the temporal information to mitigate the motion jitter?
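
To make the second idea concrete, here is a minimal sketch of what I mean, assuming the reconstructed joints come out with shape (batch, T, J, 3); the function name and weights are made up for illustration, not taken from your code:

import torch

def temporal_smoothness_loss(joints, w_vel=1.0, w_acc=1.0):
    # joints: (batch, T, J, 3) reconstructed joint positions over a motion chunk
    velocity = joints[:, 1:] - joints[:, :-1]          # finite-difference velocity
    acceleration = velocity[:, 1:] - velocity[:, :-1]  # finite-difference acceleration
    # penalize large frame-to-frame velocity and acceleration
    return w_vel * velocity.pow(2).mean() + w_acc * acceleration.pow(2).mean()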

@Shimingyi (Owner)

@catherineytw
I have no idea what's wrong with the visualization on the right. Would you mind scheduling a meeting so I can learn more about the inference details? Here is my email: [email protected]

In terms of the motion jitter problem, I have several tentative ideas that I'd like to discuss with you. Is it possible to add rotation angular-velocity and acceleration terms to the adversarial training loss to keep the motion smooth? Or to add joint velocity and acceleration loss terms on the reconstructed skeleton? Since the network takes motion chunks as input, why not make use of the temporal information to mitigate the motion jitter?

Adversarial losses need to be designed carefully. Even though we found them helpful in some cases, it is still hard to refine the motion to a better level. We ran some experiments with VIBE, which uses adversarial learning to improve human reconstruction, and there it also brings only a tiny improvement. Adding velocity and acceleration losses looks great; it's also what we are considering, and it works well in some motion synthesis papers such as PFNN. We do use temporal information as input, but the VideoPose architecture underperformed with our data processing, so I gave it up and changed to the current version with adaptive pooling. Bringing neural FK into the learning process is the key idea of our paper; the current network architecture is simply workable and has a lot of room for improvement. I am very happy to see more great ideas based on neural FK, and we can talk about it more in the meeting.

@catherineytw (Author)

Would you mind scheduling a meeting so I can learn more about the inference details? Here is my email: [email protected]

Thank you for your kindness. Is Thursday evening a good time for you? I'm sorry to say I have never used Google Meet; if you don't mind, here is my WeChat ID: Catherineytw. Maybe we can have the conversation on Tencent Meeting?
