Skip to content

Conversation

@TNSelahle
Copy link
Member

Fix invalid offset for non-UTF-8 encoded files.

Fixes https://github.com/ShiftLeftSecurity/codescience/issues/8500

@TNSelahle TNSelahle requested a review from ml86 November 13, 2025 11:11
Comment on lines +34 to +36
new String(fileContentBytes.slice(0, phpNode.attributes.startFilePos), fileCharset).length
val endPos =
new String(fileContent.get.getBytes.slice(0, phpNode.attributes.endFilePos), StandardCharsets.UTF_8).length
new String(fileContentBytes.slice(0, phpNode.attributes.endFilePos), fileCharset).length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use this String constructor to avoid the extra slice operation:

public String​(byte[] bytes,
              int offset,
              int length,
              [String](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html) charsetName)

"io.shiftleft" %% "codepropertygraph" % Versions.cpg,
"com.github.sh4869" %% "semver-parser-scala" % Versions.semverParser,
"org.scalatest" %% "scalatest" % Versions.scalatest % Test,
"com.github.albfernandez" % "juniversalchardet" % Versions.juniversalchardet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an open ticket on github which claims that this library does not have support for ISO8859-1. Why does this still seem to work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's actually true. I played around more with it and found that it detects the code as Windows-1255. I also gave it more characters from the ISO-8859-1 encoding to see if it'll guess differently but it still guesses Windows-1255. The full list of detectable encodings is on https://github.com/albfernandez/juniversalchardet/blob/b43c5b5e9b4519cfb7ae7702a305572212f7a11b/README.md.

Since most files are in UTF-8, we could opt to first try decoding as UTF-8 then if there are any errors, resort to guessing using the encoding as a best-effort, using the library. What's your thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its unclear to me how we would detect errors for the first attempt of loading as UTF-8. I guess using juniversalchardet is fine for now as it solves the problem at hand and does not seem to cause problems in the standard UTF-8 input case.

"io.shiftleft" %% "codepropertygraph" % Versions.cpg,
"com.github.sh4869" %% "semver-parser-scala" % Versions.semverParser,
"org.scalatest" %% "scalatest" % Versions.scalatest % Test,
"com.github.albfernandez" % "juniversalchardet" % Versions.juniversalchardet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its unclear to me how we would detect errors for the first attempt of loading as UTF-8. I guess using juniversalchardet is fine for now as it solves the problem at hand and does not seem to cause problems in the standard UTF-8 input case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants