Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable more rigorous authorship detection by default #461

Open
damithc opened this issue Jan 7, 2019 · 11 comments
Open

Enable more rigorous authorship detection by default #461

damithc opened this issue Jan 7, 2019 · 11 comments

Comments

@damithc
Copy link
Collaborator

damithc commented Jan 7, 2019

Current: last person modified is identified as the author

Suggested: Use git features to ignore whitespace changes, code movement etc. by default but give a way to switch to the faster (less accurate) method

@damithc
Copy link
Collaborator Author

damithc commented Jan 7, 2019

@yamidark what do you think? I remember I asked you to switch off this feature to make the analysis faster. Can we bring it back?

@yamidark
Copy link
Contributor

yamidark commented Jan 7, 2019

Can we bring it back?

Yes, we can bring this back if we want. Currently, the 'git blame' command we use only ignores whitespace changes, we can track code movement/copied using the -M and -C option in git blame.

This could make the analysis more 'accurate', but would take a (much) larger amount of time depending on the 'level' of code movement we want to track (currently there are 3 levels for the -C option), so we need to decide on this as well (iirc, previously it was at 2).

@damithc
Copy link
Collaborator Author

damithc commented Jan 7, 2019

This could make the analysis more 'accurate', but would take a (much) larger amount of time depending on the 'level' of code movement we want to track (currently there are 3 levels for the -C option), so we need to decide on this as well (iirc, previously it was at 2).

Perhaps we can try different settings on cs2103 dataset to see how much of a time difference it makes?

@damithc
Copy link
Collaborator Author

damithc commented Jan 7, 2019

Perhaps we can try different settings on cs2103 dataset to see how much of a time difference it makes?

We can use a smaller set (e.g., 10 teams) and extrapolate.

@yamidark
Copy link
Contributor

yamidark commented Jan 10, 2019

Tested using these config files(config.zip)

No move/copy detection: ~2min30s
-M (detect within the file): ~3min30s
-M -C (detect between other files within same commit): ~5min
-M -C -C (detect between other files in the commit which created the file): ~21min
-M -C -C -C (detect by searching all commits): ~52min

Seems setting at least -C -C will take up significantly more time, as it has to search other commits if the line was moved/copied, while up to -C is still relatively fast enough (approximately twice the time) since it only searches within a single commit.

Will also try testing the performance on a much larger repo (Teammates)

@damithc
Copy link
Collaborator Author

damithc commented Jan 11, 2019

Thanks for the update @yamidark
Can you roughly quantify the difference in output? i.e., how much of the authorship changes for each setting?

@yamidark
Copy link
Contributor

Can you roughly quantify the difference in output? i.e., how much of the authorship changes for each setting?

Tested with the same config file given:
From no settings to -C -M per level, there was a small~moderate amount of author line contribution changes (about 100++ for 1 author), generally all decreased in this case, since most of the code that are moved/copied were from other contributors of AB4.

The largest amount of authorship line changes occurs from -C -M to -C -C -M(can go up to 1000++ for 1 author), while from -C -C -M to -C -C -C -M it has the least amount of authorship line changes (<100 for 1 author).

Also tested on TEAMMATES repo with -C -C -M, and the analysis took about 5hrs 😱 .

@damithc
Copy link
Collaborator Author

damithc commented Jan 15, 2019

I see.

  1. Which one is taking more time? copy detection or move detection?
  2. Can there be multiple levels of move detection?

@yamidark
Copy link
Contributor

yamidark commented Jan 15, 2019

  1. Which one is taking more time? copy detection or move detection?
  2. Can there be multiple levels of move detection?

Copy and move detection is done together (-M already does both move and copy detection, just only within the file with line changes), and I don't think we can configure it to do either 1 separately. But my guess would be that copy detection is longer, since it would have to check all lines (including untouched ones) in the files, rather than just the removed lines for move detection.

Summary on what each 'level' of copy and move detection can be found in my comment above, more details can be found here on the git blame docs.

@damithc
Copy link
Collaborator Author

damithc commented Jan 15, 2019

For our use case, it is reasonable to not credit the author for moving code but it is also reasonable to credit the author for copied code as the copy could be used for a different purpose.
If we cannot do move detection without copy detection, perhaps we should not enable either by default, but allow users to enable it using a flag with the caveat of slow performance.
What do you guys think?

@yamidark
Copy link
Contributor

For our use case, it is reasonable to not credit the author for moving code but it is also reasonable to credit the author for copied code as the copy could be used for a different purpose.

Yes, I agree we should credit the original author for code that is moved, but copied code could be credited to the person who copies.

If we cannot do move detection without copy detection, perhaps we should not enable either by default, but allow users to enable it using a flag with the caveat of slow performance.

Yes, that sounds good to me.

@dcshzj dcshzj moved this to For existing developers in RepoSense Roadmap Aug 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: For developers
Development

No branches or pull requests

2 participants