[php2cpg] fix invalid offset for non utf-8 encoded files #5698

TNSelahle · 2025-11-13T08:10:07Z

Fix invalid offset for non-UTF-8 encoded files.

Fixes https://github.com/ShiftLeftSecurity/codescience/issues/8500

ml86 · 2025-11-13T11:24:57Z

joern-cli/frontends/php2cpg/src/main/scala/io/joern/php2cpg/astcreation/AstCreatorHelper.scala

+        new String(fileContentBytes.slice(0, phpNode.attributes.startFilePos), fileCharset).length
      val endPos =
-        new String(fileContent.get.getBytes.slice(0, phpNode.attributes.endFilePos), StandardCharsets.UTF_8).length
+        new String(fileContentBytes.slice(0, phpNode.attributes.endFilePos), fileCharset).length


Please use this String constructor to avoid the extra slice operation:

public String(byte[] bytes, int offset, int length, String charsetName)

(https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html)
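A minimal sketch of what this suggestion could look like. The diff passes a `Charset` value (`fileCharset`) rather than a charset name, so the sketch assumes the sibling overload `String(byte[] bytes, int offset, int length, Charset charset)`. The names `OffsetSketch` and `charOffset` are hypothetical and not taken from the PR.

```scala
import java.nio.charset.Charset

object OffsetSketch {

  /** Converts a byte offset into a character offset by decoding only the
    * first `byteOffset` bytes in place, without first copying them into a
    * new array via `slice`. Hypothetical helper, not part of the PR.
    */
  def charOffset(bytes: Array[Byte], byteOffset: Int, charset: Charset): Int =
    new String(bytes, 0, byteOffset, charset).length
}
```

With this shape, `startPos` and `endPos` would each be a single call, for example `OffsetSketch.charOffset(fileContentBytes, phpNode.attributes.endFilePos, fileCharset)`, assuming the reported positions are byte offsets as the diff implies.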

ml86 · 2025-11-13T11:35:31Z

joern-cli/frontends/php2cpg/build.sbt

+  "io.shiftleft"      %% "codepropertygraph"      % Versions.cpg,
+  "com.github.sh4869" %% "semver-parser-scala"    % Versions.semverParser,
+  "org.scalatest"     %% "scalatest"              % Versions.scalatest % Test,
+  "com.github.albfernandez" % "juniversalchardet" % Versions.juniversalchardet


There is an open ticket on GitHub claiming that this library does not support ISO-8859-1. Why does this still seem to work?

TNSelahle · 2025-11-14

Hmm, that's actually true. I played around with it some more and found that it detects the code as Windows-1255. I also gave it more characters from the ISO-8859-1 encoding to see whether it would guess differently, but it still guesses Windows-1255. The full list of detectable encodings is at https://github.com/albfernandez/juniversalchardet/blob/b43c5b5e9b4519cfb7ae7702a305572212f7a11b/README.md.

Since most files are in UTF-8, we could first try decoding as UTF-8 and, if there are any errors, fall back to a best-effort guess of the encoding using the library. What are your thoughts?

ml86 · 2025-11-14T09:52:23Z

It's unclear to me how we would detect errors in the first attempt at loading as UTF-8. I guess using juniversalchardet is fine for now, as it solves the problem at hand and does not seem to cause problems in the standard UTF-8 input case.
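For reference, one possible way to detect errors in a first UTF-8 attempt is a strict `CharsetDecoder` from the JDK, which throws on malformed input when configured with `CodingErrorAction.REPORT`. The sketch below is an illustration only, not what the PR implements. `CharsetSketch`, `isValidUtf8`, `chooseCharset`, and the `guess` parameter are hypothetical names, and the ISO-8859-1 default when no guess is available is an assumption.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction, StandardCharsets}

object CharsetSketch {

  /** Returns true if `bytes` is well-formed UTF-8. A decoder configured with
    * CodingErrorAction.REPORT throws on malformed input, whereas
    * `new String(bytes, charset)` silently substitutes a replacement character.
    */
  def isValidUtf8(bytes: Array[Byte]): Boolean = {
    val decoder = StandardCharsets.UTF_8
      .newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try {
      decoder.decode(ByteBuffer.wrap(bytes))
      true
    } catch {
      case _: CharacterCodingException => false
    }
  }

  /** Prefers UTF-8 and consults `guess` only when strict decoding fails.
    * `guess` is a hypothetical stand-in for a detector such as
    * juniversalchardet; falling back to ISO-8859-1 is an assumption.
    */
  def chooseCharset(bytes: Array[Byte], guess: Array[Byte] => Option[Charset]): Charset =
    if (isValidUtf8(bytes)) StandardCharsets.UTF_8
    else guess(bytes).getOrElse(StandardCharsets.ISO_8859_1)
}
```

Passing the detector in as a function keeps the sketch free of external dependencies. Pure ASCII input is valid UTF-8 and decodes identically under ISO-8859-1, so choosing UTF-8 for it does not shift any offsets.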

Tebogo Selahle added 2 commits November 12, 2025 18:11

fix: detect file contents encoding for code offset

6b3f90d

refactor: scalafmt

f3cdc7b

TNSelahle requested a review from ml86 November 13, 2025 11:11

ml86 requested changes Nov 13, 2025

ml86 approved these changes Nov 17, 2025
