-
Notifications
You must be signed in to change notification settings - Fork 371
[php2cpg] fix invalid offset for non utf-8 encoded files #5698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| new String(fileContentBytes.slice(0, phpNode.attributes.startFilePos), fileCharset).length | ||
| val endPos = | ||
| new String(fileContent.get.getBytes.slice(0, phpNode.attributes.endFilePos), StandardCharsets.UTF_8).length | ||
| new String(fileContentBytes.slice(0, phpNode.attributes.endFilePos), fileCharset).length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use this String constructor to avoid the extra slice operation:
public String(byte[] bytes,
int offset,
int length,
[String](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html) charsetName)
| "io.shiftleft" %% "codepropertygraph" % Versions.cpg, | ||
| "com.github.sh4869" %% "semver-parser-scala" % Versions.semverParser, | ||
| "org.scalatest" %% "scalatest" % Versions.scalatest % Test, | ||
| "com.github.albfernandez" % "juniversalchardet" % Versions.juniversalchardet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an open ticket on github which claims that this library does not have support for ISO8859-1. Why does this still seem to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's actually true. I played around more with it and found that it detects the code as Windows-1255. I also gave it more characters from the ISO-8859-1 encoding to see if it'll guess differently but it still guesses Windows-1255. The full list of detectable encodings is on https://github.com/albfernandez/juniversalchardet/blob/b43c5b5e9b4519cfb7ae7702a305572212f7a11b/README.md.
Since most files are in UTF-8, we could opt to first try decoding as UTF-8 then if there are any errors, resort to guessing using the encoding as a best-effort, using the library. What's your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its unclear to me how we would detect errors for the first attempt of loading as UTF-8. I guess using juniversalchardet is fine for now as it solves the problem at hand and does not seem to cause problems in the standard UTF-8 input case.
| "io.shiftleft" %% "codepropertygraph" % Versions.cpg, | ||
| "com.github.sh4869" %% "semver-parser-scala" % Versions.semverParser, | ||
| "org.scalatest" %% "scalatest" % Versions.scalatest % Test, | ||
| "com.github.albfernandez" % "juniversalchardet" % Versions.juniversalchardet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its unclear to me how we would detect errors for the first attempt of loading as UTF-8. I guess using juniversalchardet is fine for now as it solves the problem at hand and does not seem to cause problems in the standard UTF-8 input case.
Fix invalid offset for non-UTF-8 encoded files.
Fixes https://github.com/ShiftLeftSecurity/codescience/issues/8500