Skip to content

C++: Total number of baseline files limit #17743

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
artem-smotrakov opened this issue Oct 11, 2024 · 29 comments
Open

C++: Total number of baseline files limit #17743

artem-smotrakov opened this issue Oct 11, 2024 · 29 comments
Labels
question Further information is requested

Comments

@artem-smotrakov
Copy link
Contributor

Hey friends, I have quite a large C++ database:

codeql database print-baseline -- ${CODEQL_DATABASE_DIR}
Counted a baseline of 27711380 lines of code for cpp.

Before running scans, I normally run some simple diagnostic queries to make sure the database looks fine. The queries look for things like:

  • Files
  • FunctionCalls
  • IfStmts

When I run these queries on this large database, I get this

codeql database analyze ${CODEQL_DATABASE_DIR} --format=sarif-latest --output=calls.sarif ${CODEQL_QUERIES}/qlpacks/cpp-queries/diagnostics/FunctionCalls.ql
Running queries.
[1/1 comp 7.8s] Compiled [...]/qlpacks/cpp-queries/diagnostics/FunctionCalls.ql.
Files.ql: [1/1 eval 36s] Results written to cpp-queries/diagnostics/FunctionCalls.bqrs.
Shutting down query evaluator.
Interpreting results.
Will not interpret file coverage baseline information, since the total number of baseline files is 153738, which is greater than the limit of 50000.

The exit code is 0 but calls.sarif is empty.

When I run queries from the standard C++ pack, I get the same message.

What does this limit mean? Is there any way to increase it? I didn't find anything either in the docs or in this repo unfortunately, may be missing something though. Thanks!

@artem-smotrakov artem-smotrakov added the question Further information is requested label Oct 11, 2024
@redsun82
Copy link
Contributor

👋 @artem-smotrakov, sorry for the late reply!

That limit is meant to avoid generating too large a SARIF file when populating the Tool Status Page for information about how many files were analyzed, hitting the SARIF file size limit. It is currently hard-coded and cannot be configured.

That said, I'm not entirely sure this limit (and the warning) should really cause a custom query like yours to return no results. Could you:

  • share your FunctionCalls.ql, so we can experiment with it a bit?
  • maybe try another output format like cvs to see if the issue is specifically related to the SARIF format?

@artem-smotrakov
Copy link
Contributor Author

artem-smotrakov commented Oct 14, 2024

Hi @redsun82 ! Thanks for your reply!

share your FunctionCalls.ql, so we can experiment with it a bit?

Yeah, sure, it's quite simple

import cpp

from FunctionCall call, Function func
where func = call.getTarget()
select call, "Call " + func + "(" + func.getParameterString() + ")"

maybe try another output format like cvs to see if the issue is specifically related to the SARIF format?

I get the same message for CSV if I use --format=csv --output=calls.csv. The calls.csv file is empty.

It is currently hard-coded and cannot be configured.

Would it be possible to make it configurable in one of the next releases? 🤔

@rvermeulen
Copy link
Contributor

Hi @artem-smotrakov,

The base line information should not influence the result of the query.
Could you run https://github.com/github/codeql/blob/main/cpp/ql/src/Diagnostics/ExtractionWarnings.ql to determine if other issue are influencing the results of the query?

@rvermeulen rvermeulen added the awaiting-response The CodeQL team is awaiting further input or clarification from the original reporter of this issue. label Oct 14, 2024
@artem-smotrakov
Copy link
Contributor Author

Hi @rvermeulen ! Attaching the results of the ExtractionWarnings.ql. I see errors in several files but the codebase has way more C++ files. Also, I got the same limit warning when I ran the query.

extractrion_warnings.sarif.txt

@rvermeulen
Copy link
Contributor

rvermeulen commented Oct 17, 2024

Hi @artem-smotrakov,

Let me forward this to our C/C++ team.
In the mean time, could you share which CodeQL CLI version you are using codeql version --format=json and the build-tracer.log that you can find in the database directory under logs. Before sharing make sure possible sensitive information is redacted (such as the unpack location).

Copy link
Contributor

github-actions bot commented Nov 1, 2024

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 1, 2024
@artem-smotrakov
Copy link
Contributor Author

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

I am working on it, please don't close it.

@jketema jketema removed Stale awaiting-response The CodeQL team is awaiting further input or clarification from the original reporter of this issue. labels Nov 4, 2024
@artem-smotrakov
Copy link
Contributor Author

share which CodeQL CLI version you are using codeql version --format=json


{
  "productName" : "CodeQL",
  "vendor" : "GitHub",
  "version" : "2.19.0",
  "sha" : "9f0ad8ab1f14c2711b9fc2666e8bcdd09ab39ce8",
  "branches" : [
    "codeql-cli-2.19.0"
  ],
  "copyright" : "Copyright (C) 2019-2024 GitHub, Inc.",
  [...]
  "configFileFound" : false,
  "features" : {
    "analysisSummaryV2Default" : true,
    "buildModeOption" : true,
    "bundleSupportsIncludeDiagnostics" : true,
    "bundleSupportsIncludeLogs" : true,
    "databaseInterpretResultsSupportsSarifRunProperty" : true,
    "featuresInVersionResult" : true,
    "indirectTracingSupportsStaticBinaries" : false,
    "informsAboutUnsupportedPathFilters" : true,
    "supportsPython312" : true,
    "mrvaPackCreate" : true,
    "threatModelOption" : true,
    "traceCommandUseBuildMode" : true,
    "v2ramSizing" : true,
    "mrvaPackCreateMultipleQueries" : true,
    "setsCodeqlRunnerEnvVar" : true,
    "sarifMergeRunsFromEqualCategory" : true,
    "forceOverwrite" : true,
    "generateSummarySymbolMap" : true
  }
}

and the build-tracer.log

I am figuring out if I can actually share it.

@artem-smotrakov
Copy link
Contributor Author

build-tracer.log that you can find in the database directory under logs. Before sharing make sure possible sensitive information is redacted

Hi @rvermeulen The build-tracer.log is huge. I am looking at it and trying to redact possible sensitive info, but I am not super well familiar with what it has, and doesn't feel confident if I removed everything that has to be removed.

Is there anything specific that the C++ team might be looking for in this log?

What would be best way to share this log with the team? I would not like attaching it here a publicly visible comment. If that helps, my employer Riot Games is GitHub's customer, I guess there might be some private ticketing/etc services where I could probably post the file?

@redsun82
Copy link
Contributor

redsun82 commented Feb 4, 2025

Hi @artem-smotrakov, I'm so sorry this has slipped through our cracks, and I just figured out only now you were waiting for a reply.

Did you maybe manage to solve this issue on your own in the meantime?

What would be best way to share this log with the team? I would not like attaching it here a publicly visible comment. If that helps, my employer Riot Games is GitHub's customer, I guess there might be some private ticketing/etc services where I could probably post the file?

You (or someone in your company) can create a private support ticket here. More infos are available here. If you do, you can link this public issue in the request to help us sort this out more easily. More sensitive log artifacts can then be shared through that channel.

@redsun82 redsun82 added the awaiting-response The CodeQL team is awaiting further input or clarification from the original reporter of this issue. label Feb 5, 2025
Copy link
Contributor

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 20, 2025
@artem-smotrakov
Copy link
Contributor Author

artem-smotrakov commented Feb 20, 2025

Hi @redsun82

Did you maybe manage to solve this issue on your own in the meantime?

I still see the warning, but looks like the queries run okay with CodeQL 2.20.1. I am not sure what changed between 2.19.0 and 2.20.1, but it might have solved the issue.

This issue aside, I wrote a relatively simple diagnostic query that look for functions that meet certain criteria. The query works well with 2.20.1. It takes several minutes to complete on a relatively large database. However, with the latest 2.20.4, it's been running for a few hours already (update: it took approx 11 hours to complete).

You (or someone in your company) can create a private support ticket here. More infos are available here. If you do, you can link this public issue in the request to help us sort this out more easily. More sensitive log artifacts can then be shared through that channel.

Thanks! If I see this issue again, I'll try to send a support request.

@github-actions github-actions bot removed the Stale label Feb 21, 2025
@jketema
Copy link
Contributor

jketema commented Feb 21, 2025

@artem-smotrakov

However, with the latest 2.20.4, it's been running for a few hours already (update: it took approx 11 hours to complete).

We were forced to change something in the stats file that is used to compute the join-orders. This change happened between 2.20.1 and 2.20.4. We fixed this issues in our own queries that were caused by this, but those fixes might not have been sufficient in general. Would you be able to share your query so we can investigate?

@artem-smotrakov
Copy link
Contributor Author

Hi @jketema

We fixed this issues in our own queries that were caused by this, but those fixes might not have been sufficient in general.

What was the fix? Do you have an example?

Would you be able to share your query so we can investigate?

Here it is

import cpp

/**
 * UFUNCTION(..) annotation.
 */
class UnrealFunctionAnnotation extends MacroInvocation {
  UnrealFunctionAnnotation() { this.getMacroName() = "UFUNCTION" }

  /**
   * Returns arguments passed to this UFUNCTION annotation.
   */
  string getAnArgument() {
    result = this.getExpandedArgument(_) or result = this.getUnexpandedArgument(_)
  }

  /**
   * Holds if `functionDeclaration` is annotated with this UFUNCTION annotation.
   */
  predicate annotates(FunctionDeclarationEntry functionDeclaration) {
    this.getLocation().isBefore(functionDeclaration.getLocation())
  }

  /**
   * Holds if this UFUNCTION annotation marks the function as RPC.
   */
  predicate declaresRpc() {
    this.getAnArgument().toLowerCase().matches(["%server%", "%client%", "%netmulticast%"])
  }
}

/**
 * Unreal RPC function declaration.
 */
class UnrealRpcDeclaration extends FunctionDeclarationEntry {
  UnrealRpcDeclaration() {
    exists(UnrealFunctionAnnotation ufunction | ufunction.annotates(this) | ufunction.declaresRpc())
  }

  /**
   * Returns implementation of the declared RPC.
   *
   * Unreal requires "_Implementation" suffix to be added to the definition of the RPC.
   * For example, if PRC is defined in the SomeClassName class like
   *
   *     UFUNCTION(Server, Reliable)
   *     void ServerDoSomething();
   *
   * Then, the RPC impementation should be
   *
   *     void SomeClassName::ServerDoSomething_Implementation()
   *     {
   *         [...]
   *     }
   */
  Function getImplementation() {
    exists(Function func |
      func.getName() = this.getName() + "_Implementation" and
      func.getADeclarationEntry().isDefinition()
    |
      result = func
    )
  }

  /**
   * Holds if the declared RPC is part of the engine.
   */
  predicate isEngine() {
    getLocation().getFile().getRelativePath().matches(["%Engine/Source%", "%Engine/Plugins%"])
  }

  /**
   * Holds if the declared RPC looks like to have been added for debugging purpose only.
   */
  predicate isDebug() { getName().matches("%Debug%") }
}

from UnrealRpcDeclaration rpcDeclaration
where
  not rpcDeclaration.isEngine() and // exclude RPCs that are part of the engine
  not rpcDeclaration.isDebug() // filter out RPCs for debugging
select rpcDeclaration.getImplementation(), rpcDeclaration.getImplementation().getName()

It looks for PRCs in code that uses Unreal Engine.

To test the query, you'd need a multiplayer game based on Unreal Engine, for example, Lyra. Here is RPC example.

@jketema
Copy link
Contributor

jketema commented Feb 24, 2025

What was the fix? Do you have an example?

It very much depends:

To test the query, you'd need a multiplayer game based on Unreal Engine, for example, Lyra. Here is RPC example.

This is not something I have access to, unfortunately. I can guide you on how to get the numbers you see on some of the PRs from above if that would make sense.

@artem-smotrakov
Copy link
Contributor Author

Thanks for the examples.

Is this something you're planning to address in the future CodeQL releases, or do all impacted queries have to be updated? I have to admit that my knowledge of CodeQL doesn't let me understand what's happening in the PR you linked. I am wondering if I can just stay on 2.20.1 before it's addressed.

This is not something I have access to, unfortunately.

Yes, you'd need to join Epic's org to get access to it, instructions

https://github.com/tomlooman/ActionRoguelike -- that's another project that can be used for testing the query.

@jketema
Copy link
Contributor

jketema commented Feb 24, 2025

Is this something you're planning to address in the future CodeQL releases, or do all impacted queries have to be updated?

All queries that were impacted have been updated at this point.

I have to admit that my knowledge of CodeQL doesn't let me understand what's happening in the PR you linked.

Understandable. Especially since this is not really documented.

I am wondering if I can just stay on 2.20.1 before it's addressed.

Currently I'm not able to address this, as I do not have enough data.

Yes, you'd need to join Epic's org to get access to it, instructions

This is very high-overhead, which means I would need to schedule this, but it can only have very low priority because this is the only report we've received about this so far, and as this doesn't show up in our own testing.

https://github.com/tomlooman/ActionRoguelike -- that's another project that can be used for testing the query.

This still seems to require access to Epic's org to get access to the required tooling?

I can guide you on how to get the numbers you see on some of the PRs from above if that would make sense.

To get the numbers you see on the PRs that I linked above:

  • codeql database cleanup --cache-cleanup=clear ${CODEQL_DATABASE_DIR}
  • codeql database analyze ${CODEQL_DATABASE_DIR} --format=sarif-latest --output=output.sarif --tuple-counting --evaluator-log output.log --rerun ${CODEQL_QUERY}
  • codeql generate log-summary --format=text output.log summary.log

The summary.log file should not contain any of your IP (but please double check), so should be shareable here.

@artem-smotrakov
Copy link
Contributor Author

artem-smotrakov commented Feb 24, 2025

This still seems to require access to Epic's org to get access to the required tooling?

No. You only need to install Unreal Engine and compiler toolkit

https://www.unrealengine.com/en-US/download

No need access to the Epic's org.

I started the query using the instructions you provided. When it's finished, I'll check the log, and post it here if it's okay.

@artem-smotrakov
Copy link
Contributor Author

Attaching summary.log

@jketema
Copy link
Contributor

jketema commented Feb 25, 2025

Thanks. It's completely blowing up in your annotates predicate:

[2025-02-25 03:57:02] Evaluated non-recursive predicate rpc::UnrealFunctionAnnotation.annotates/1#dispred#f5072c6e@37a93epk in 57295162ms (size: 351015697).
Evaluated relational algebra for predicate rpc::UnrealFunctionAnnotation.annotates/1#dispred#f5072c6e@37a93epk with tuple counts:
                   1   ~0%    {0} r1 = CONSTANT()[]
              343222   ~0%    {2}    | JOIN WITH `Location::Location.getEndLine/0#dispred#83af84ae#bf` CARTESIAN PRODUCT OUTPUT Rhs.0, Rhs.1
        745838910322   ~0%    {4}    | JOIN WITH `Location::Location.getStartLine/0#d54f9e6c` CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
                              {4}    | REWRITE WITH TEST InOut.1 < InOut.3
        406186767271   ~5%    {2}    | SCAN OUTPUT In.2, In.0
        279209432197   ~0%    {3}    | JOIN WITH fun_decls_40#join_rhs ON FIRST 1 OUTPUT Lhs.1, Lhs.0, Rhs.1
        279209432197   ~3%    {4}    | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1` ON FIRST 1 OUTPUT Lhs.1, Rhs.1, Lhs.0, Lhs.2
            77263627   ~4%    {2}    | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1` ON FIRST 2 OUTPUT Lhs.2, Lhs.3
         11422410724   ~3%    {2}    | JOIN WITH macroinvocations_20#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.1
         18574408764   ~0%    {2}    | JOIN WITH `Macro::MacroAccess.getOutermostMacroAccess/0#d58b05db_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, Lhs.1
           351015697   ~0%    {2}    | JOIN WITH rpc::UnrealFunctionAnnotation#6f570527 ON FIRST 1 OUTPUT Lhs.0, Lhs.1
                              return r1

[2025-02-25 03:57:14] Evaluated non-recursive predicate rpc::UnrealRpcDeclaration#5b0cb5dc@7bb217ts in 239ms (size: 4637).
Evaluated relational algebra for predicate rpc::UnrealRpcDeclaration#5b0cb5dc@7bb217ts with tuple counts:
        7976950  ~169140%    {1} r1 = JOIN `rpc::UnrealFunctionAnnotation.declaresRpc/0#dispred#aa76493c` WITH `rpc::UnrealFunctionAnnotation.annotates/1#dispred#f5072c6e` ON FIRST 1 OUTPUT Rhs.1
                             return r1

What's not clear to me though is why the same thing doesn't happen with 2.20.1. Would you also be able to produce the same log file but with 2.20.1?

@artem-smotrakov
Copy link
Contributor Author

Would you also be able to produce the same log file but with 2.20.1?

summary_2_20_1.log

jketema added a commit to jketema/codeql that referenced this issue Feb 25, 2025

Verified

This commit was signed with the committer’s verified signature.
jketema Jeroen Ketema
Reported here: github#17743

Without this change on the query provided by the user:
```
[2025-02-25 12:42:01] Evaluated non-recursive predicate quickquery::UnrealFunctionAnnotation.annotates/1#dispred#9cd6c269@c668c8tv in 23846ms (size: 20381473).
Evaluated relational algebra for predicate quickquery::UnrealFunctionAnnotation.annotates/1#dispred#9cd6c269@c668c8tv with tuple counts:
                 1   ~0%    {0} r1 = CONSTANT()[]
             27323   ~0%    {2}    | JOIN WITH `Location::Location.getEndLine/0#dispred#83af84ae#bf` CARTESIAN PRODUCT OUTPUT Rhs.0, Rhs.1
        6162566035   ~0%    {4}    | JOIN WITH `Location::Location.getStartLine/0#d54f9e6c` CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
                            {4}    | REWRITE WITH TEST InOut.1 < InOut.3
        3894825644   ~5%    {2}    | SCAN OUTPUT In.2, In.0
          73148692   ~0%    {3}    | JOIN WITH fun_decls_40#join_rhs ON FIRST 1 OUTPUT Lhs.1, Lhs.0, Rhs.1
          73148692   ~0%    {4}    | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1` ON FIRST 1 OUTPUT Lhs.1, Rhs.1, Lhs.0, Lhs.2
            864579   ~0%    {2}    | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1` ON FIRST 2 OUTPUT Lhs.2, Lhs.3
          13010742   ~1%    {2}    | JOIN WITH macroinvocations_20#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.1
          20653781   ~0%    {3}    | JOIN WITH `Macro::MacroAccess.getOutermostMacroAccess/0#d58b05db_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, _, Lhs.1
          20653781   ~4%    {3}    | REWRITE WITH Out.1 := 1
          20381473   ~8%    {2}    | JOIN WITH macroinvocations_03#join_rhs ON FIRST 2 OUTPUT Lhs.0, Lhs.2
                            return r1
```

With this change:
```
[2025-02-25 12:43:10] Evaluated non-recursive predicate quickquery::UnrealFunctionAnnotation.annotates/1#dispred#9cd6c269@11bf8956 in 928ms (size: 20381473).
Evaluated relational algebra for predicate quickquery::UnrealFunctionAnnotation.annotates/1#dispred#9cd6c269@11bf8956 with tuple counts:
            6873   ~3%    {2} r1 = SCAN fun_decls OUTPUT In.4, In.0
            6857   ~0%    {3}    | JOIN WITH `Location::Location.getStartLine/0#d54f9e6c` ON FIRST 1 OUTPUT Lhs.0, Lhs.1, Rhs.1
            6857   ~2%    {3}    | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1` ON FIRST 1 OUTPUT Rhs.1, Lhs.1, Lhs.2
         6193961   ~0%    {3}    | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, Lhs.1, Lhs.2
        27389714   ~1%    {4}    | JOIN WITH macroinvocations_20#join_rhs ON FIRST 1 OUTPUT Lhs.0, Lhs.1, Lhs.2, Rhs.1
        27389714   ~1%    {4}    | JOIN WITH locations_default ON FIRST 1 OUTPUT Lhs.1, Lhs.2, Lhs.3, Rhs.4
                          {4}    | REWRITE WITH TEST InOut.3 < InOut.1
        13010742   ~1%    {2}    | SCAN OUTPUT In.2, In.0
        20653781   ~0%    {3}    | JOIN WITH `Macro::MacroAccess.getOutermostMacroAccess/0#d58b05db_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, _, Lhs.1
        20653781   ~4%    {3}    | REWRITE WITH Out.1 := 1
        20381473   ~8%    {2}    | JOIN WITH macroinvocations_03#join_rhs ON FIRST 2 OUTPUT Lhs.0, Lhs.2
                          return r1
```
@jketema
Copy link
Contributor

jketema commented Feb 25, 2025

I believe the performance problem should be fixed by #18859

@artem-smotrakov
Copy link
Contributor Author

I ran the query on my database, and the UnrealFunctionAnnotation.annotates() predicate took 56m. This is a significant improvement, thanks!

summary_2_20_4_patched.log

Do you know if there's any chance to reduce it further? The predicate takes about 3m to complete with 2.20.1. I am not sure it would be possible to bring it back to 3m but anything closer would be great!

@jketema
Copy link
Contributor

jketema commented Feb 25, 2025

Do you know if there's any chance to reduce it further?

I don't immediately see an easy way of doing this by just modifying the library, because it's difficult to predict what effect any changes to the library have in other contexts. We believe that #18859 is still relatively safe in that respect.

Let me have a think though about whether something can be done on the side of your query.

@artem-smotrakov
Copy link
Contributor Author

Thanks! I've updated the predicate to this

predicate annotates(FunctionDeclarationEntry functionDeclaration) {
    this.getLocation().getFile() = functionDeclaration.getLocation().getFile() and
    this.getLocation().getStartLine() = functionDeclaration.getLocation().getStartLine() - 1
  }

because I am only interested in UFUNCTION() macros that are located on the previous line from the function declaration. Apparently, this brought query execution time down to approx 21m. I guess it happened because the new condition is a bit stricter that isBefore(), but not 100% sure.

Copy link
Contributor

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 12, 2025
@jketema
Copy link
Contributor

jketema commented Mar 12, 2025

Hi @artem-smotrakov

Let me have a think though about whether something can be done on the side of your query.

With the limited time I had for this, I've unfortunately not been able to come up with anything that make things fast again for you. My apologies.

@artem-smotrakov
Copy link
Contributor Author

No problem @jketema ! Thanks!

@github-actions github-actions bot removed the Stale label Mar 13, 2025
Copy link
Contributor

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 27, 2025
@jketema jketema removed Stale awaiting-response The CodeQL team is awaiting further input or clarification from the original reporter of this issue. labels Mar 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants