
Cherry-pick Do not use immediate restart in regress test #878

Open
wants to merge 25 commits into main
Conversation

yjhjstz
Member

@yjhjstz commented Jan 17, 2025

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


huansong and others added 25 commits January 17, 2025 17:47
There is nothing wrong with doing that for the tests themselves, but it has
implications for bug #16120, where restarting w/o checkpoint can create orphaned
files and causes the gp_check_files test to fail. While that issue is
dealt with separately, for now turn the restart mode to fast to avoid the issue.

We ignore the isolation2 tests for now because gp_check_files does not cover them
(and it would be really hard to cover, given that the many intentional crash/panic or
other ad-hoc scenarios we run in isolation2 could create orphaned
files... I don't see us doing that anytime soon).
The two methods don't cooperate, so set_config_option("search_path",
...) has been ineffective under non-empty overrideStack.  This defect
enabled an attacker having database-level CREATE privilege to execute
arbitrary code as the bootstrap superuser.  While that particular attack
requires v13+ for the trusted extension attribute, other attacks are
feasible in all supported versions.

Standardize on the combination of NewGUCNestLevel() and
set_config_option("search_path", ...).  It is newer than
PushOverrideSearchPath(), more-prevalent, and has no known
disadvantages.  The "override" mechanism remains for now, for
compatibility with out-of-tree code.  Users should update such code,
which likely suffers from the same sort of vulnerability closed here.
Back-patch to v11 (all supported versions).
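
For out-of-tree code that still uses PushOverrideSearchPath(), the replacement pattern looks roughly like the following. This is a minimal sketch of the upstream idiom; the surrounding code and the chosen search_path value are illustrative only.

```
/* Sketch of the standardized pattern: set a transient search_path at a
 * new GUC nest level, then roll back any changes made at that level. */
int         save_nestlevel = NewGUCNestLevel();

(void) set_config_option("search_path", "pg_catalog, pg_temp",
                         PGC_USERSET, PGC_S_SESSION,
                         GUC_ACTION_SAVE, true, 0, false);

/* ... do the work that must run with the restricted search_path ... */

/* Undo any GUC changes made at this nest level. */
AtEOXact_GUC(false, save_nestlevel);
```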

Reported by Alexander Lakhin.

Security: CVE-2023-2454

(cherry-picked from 78119a0bf98edb13e7f79c6402378376cc6b6ca4 with conflicts pg:REL_12_STABLE)
When the optimizer is off.

For qual (NOT(EXISTS_SUBLINK) and EXPR_SUBLINK)
  or qual (NOT(ANY_SUBLINK) AND EXPR_SUBLINK)

In pull_up_sublinks_qual(), after NOT(EXISTS_SUBLINK) is converted to an
Anti-JOIN (or NOT(ANY_SUBLINK) => LASJ_NOTIN-JOIN), if the EXPR_SUBLINK is
also successfully converted to an INNER-JOIN (see convert_EXPR_to_join),
we can encounter a scenario where the qual is postponed past the anti
or lasj_notin join. That triggers an assertion failure:

  Assert(j->jointype == JOIN_INNER || j->jointype == JOIN_SEMI);

> FATAL:  Unexpected internal error (assert.c:48)
> DETAIL:  FailedAssertion("!(j->jointype == JOIN_INNER || j->jointype == JOIN_SEMI)", File: "initsplan.c", Line: 1018)
> server closed the connection unexpectedly
> ...

We can easily reproduce this issue on the master and 6X branches:

```
CREATE TABLE A(id int, tt timestamp)
  DISTRIBUTED BY (id);

SET optimizer TO off;

-- j->jointype == JOIN_LASJ_NOTIN
SELECT
  t1.*
FROM
  A t1
WHERE
  t1.id NOT IN (SELECT id FROM A WHERE id != 4)
  AND
  t1.tt = (SELECT MIN(tt) FROM A WHERE id = t1.id);

-- j->jointype == JOIN_ANTI
SELECT
  t1.*
FROM
  A t1
WHERE
  NOT EXISTS (SELECT id FROM A WHERE id != 4 and id = t1.id)
  AND t1.tt = (SELECT MIN(tt) FROM A WHERE id = t1.id);
```

Add these two join types to the Assert(). I ran several simple
tests and the results are OK. However, as the code comment says, I
am also unsure whether postponing quals past an anti-join/lasj_notin-join
is always semantically correct.
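
The change amounts to widening the assertion in initsplan.c. A sketch of the relaxed check is below; JOIN_LASJ_NOTIN is the Greenplum-specific NOT IN jointype, and the exact surrounding code is assumed.

```
/* Quals may now also be postponed past anti and LASJ_NOTIN joins,
 * not only inner and semi joins. */
Assert(j->jointype == JOIN_INNER || j->jointype == JOIN_SEMI ||
       j->jointype == JOIN_ANTI || j->jointype == JOIN_LASJ_NOTIN);
```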
This commit fixes the test added in commit 50e66c76acf19d09ac7f0fc05cca01e014bb0ab8:
```
begin;
declare c1 cursor for select count(*) from t_15278 group by sqrt(a);
abort;
select pg_cancel_backend(pg_backend_pid());
```
The old version of the test above checks that, after an error is generated on a segment,
InterruptHoldoffCount is reset to 0 on the QD.
However, this is flaky:
the QD does not wait for the result of DECLARE CURSOR, so if the error on the segments
occurs after the DECLARE has finished, the error is processed when ABORT is called.
In some cases ABORT is called before the error is generated on the segment.
That leads to a situation where there are no errors on the segments and ABORT
completes successfully, while the test requires an error in order to check
InterruptHoldoffCount after a query that ends in an error.

The new version of the test uses an existing fault-injection point that generates an error
on a segment during CREATE TABLE. The CREATE TABLE ends with an error, and
InterruptHoldoffCount can be checked correctly.

Co-authored-by: Vasiliy Ivanov <[email protected]>
1. Updated code to use get_stderr() instead of get_result().stderr
2. Added test case for retry count.
The error (NOTICE) message is too long. Move the details into errdetail().

fix partition
We need to call gp_malloc|free() for the newly added MemoryContext types
instead of malloc|free(), so that we can get the VMEM reservation etc.

This was missed during the 12 merge.
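
As a sketch of the idea (the helper name is hypothetical), block allocation for such a context should route through the VMEM-accounted allocator:

```
/* Hypothetical sketch: allocate a context's backing block through
 * gp_malloc() instead of malloc(), so the bytes count against the
 * session's VMEM reservation and can trip the limit checks. */
static void *
context_block_alloc(Size blksize)
{
    return gp_malloc(blksize);      /* was: malloc(blksize) */
}
```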

Reviewed-by: Brent Doil <[email protected]>
Issue:
Some utilities (gprecoverseg, gpstop, etc...) use a lock directory to prevent customers from running
multiple instances of the program simultaneously. The lock directory is created inside the
COORDINATOR_DATA_DIRECTORY at the start of program execution and should be removed when the execution completes.
However, the lock directory is not always removed if the user abruptly kills the program
during execution. Consequently, the next run of the same utility encounters an error, stating
that the lock directory already exists and asks the user to manually delete it, even if
the process holding the lock is already terminated.

Solution:
To address this issue, this commit introduces a modification in mainUtils.py.
When attempting to acquire the lock, the code now checks if the process ID in the PID file,
present in the lock directory, still exists. If the process has been killed, the script
automatically removes the lock directory and proceeds to acquire a new lock for the current run.
This ensures that the utility can continue its execution without being hindered by
orphaned lock files from previous runs.
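
mainUtils.py is Python, but the liveness check it now performs is the classic kill(pid, 0) probe. The helper below sketches that idea in C; the function name and exact semantics are assumptions, not the utility's actual code.

```
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical sketch of the check mainUtils.py performs: does a
 * process with the pid recorded in the lock directory still exist? */
static bool
pid_is_alive(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;            /* process exists */
    return errno != ESRCH;      /* EPERM etc.: exists, owned by another user */
}
```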
The map_old_guc_names facility exists to provide aliases from old GUC names
to new ones. Hence it is best to use that facility instead of hand-rolling
gp_session_role as an alias for gp_role.

Doing this also eliminates the problem fixed via commit
8d52dad.
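
In upstream guc.c the facility is a flat array of old-name/new-name pairs, so the change boils down to adding one pair. A sketch, with the surrounding entries taken from upstream and the placement of the new pair assumed:

```
/* guc.c: alias old GUC names to their current spellings. */
static const char *const map_old_guc_names[] = {
    "sort_mem", "work_mem",
    "vacuum_mem", "maintenance_work_mem",
    "gp_session_role", "gp_role",       /* the pair added here */
    NULL
};
```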
If no gp_role value is specified, "utility" is set internally as the
default. This is mainly for a single-node coordinator or
segments. Change that setting to use PGC_S_DYNAMIC_DEFAULT instead
of PGC_S_OVERRIDE.

The problem with using PGC_S_OVERRIDE is that the GUC value can't be set
later. Since commit 260710, we rely on the assign_gp_role() hook to set
should_reject_connection. If PGC_S_OVERRIDE is used for
SetConfigOption() while setting the default in PostmasterMain(), then
later setting gp_role via PGOPTIONS or any other mechanism will not
call the assign hook and hence fails to set
should_reject_connection. The change is safe because
check_gp_role() only allows "utility" to be set as a value for this
GUC, and other values like "execute" or "dispatch" are already
banned from PGOPTIONS and from postgresql.conf followed by a reload.

This problem was exposed by the src/test/gpdb_pitr/test_gpdb_pitr.sh tests,
which are now fixed.
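
A minimal sketch of the changed call in PostmasterMain(); the exact placement is assumed.

```
/* Set the built-in default for gp_role.  PGC_S_DYNAMIC_DEFAULT, unlike
 * PGC_S_OVERRIDE, lets a later PGOPTIONS/config assignment win and run
 * assign_gp_role(), which sets should_reject_connection. */
SetConfigOption("gp_role", "utility",
                PGC_POSTMASTER, PGC_S_DYNAMIC_DEFAULT);
```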

Reviewed-by: Marbin Tan <[email protected]>
The benefit from this optimization idea is plausible (basically, avoiding one
memcpy of the memtuple), but the changes it would incur are just too
backwards and invasive for the effort to be worthwhile:

1. There is no memtuple to serialize from directly. The `PRIVATE_tts_memtuple`
field in `TupleTableSlot` is gone since the FIXME was added (d20e071e879).
Basically we do not store the memtuple outside of the AM layer anymore (e.g. in
the executor memory). I think bringing `PRIVATE_tts_memtuple` back to
`TupleTableSlot` and utilizing it everywhere is a clearly backward change.

2. Assume we somehow pass the memtuple into `SerializeTuple` as a function
argument. On the deserialization side, we would still need to deserialize
it back to a minimal tuple, because all the deserialization logic and everything
else now works with minimal tuples. The only way to avoid that is to carry the
memtuple around and use it, which is basically the same as bringing
`PRIVATE_tts_memtuple` back to `TupleTableSlot`. We might also have to bring
back the concept of `GenericTuple` and revert a lot of the changes we made to
align GPDB with the AM framework, which is a big milestone in 7X. Not very
worthwhile IMO.
add some resource group test results into .gitignore
Following the merging of changes from pull request #15727, there was an intermittent test failure in
gpstop. This situation occurred when testing how the gpstart process behaves in the presence of a
lock directory created by a previous gpstop operation.

In this specific scenario, the expected behavior was that the lock directory from gpstop shouldn't
disrupt the gpstart process. To examine this, a test was set up where a gpstop process was started
and stopped before completion. This created a lock directory in COORDINATOR_DATA_DIRECTORY.
Subsequently, an attempt was made to clean up any remaining parts of the postgres process on the
coordinator using the 'pkill postgres' command. This was essential to prevent issues where
gpstart might fail with errors like ‘Greenplum instance already exists’. But sometimes, the remaining
parts of the postgres process didn't get cleaned up properly, causing the test to fail with the error
‘gpstart error: Coordinator instance process running’.

To fix this, a step was added that cleans up all the postgres processes on the host by removing
the lock files ‘/tmp/.s.PGSQL.*' and running ‘pkill -9 postgres’, to make sure everything is cleaned up.
If AT SET DISTRIBUTED BY, or AT SET WITH (reorganize=true|false), is run
today along with another subcommand, it results in ERRORs such as:

ALTER TABLE foo ADD COLUMN c int, SET DISTRIBUTED BY(b);
ERROR: attribute number 3 exceeds number of columns 2

This kind of error results when the CTAS for the temp table is
dispatched as part of ATExecSetDistributedBy(). Since SET DISTRIBUTED BY
is run in the last pass, the ADD COLUMN takes effect on the QD and is
considered as part of the dispatched QueryDesc for the CTAS. However,
since the AT command has not been dispatched yet to the QEs (which only
happens after ATController()), the QE hasn't yet applied the results of
the ADD COLUMN operation.  Thus, the QEs cannot reconcile the reference
to the added column, and an ERROR results.

Other ERRORs can also result from having an inconsistent view of the
catalog on the QEs in the CTAS.

Supporting AT SET DISTRIBUTED BY alongside other subcommands would require
abandoning the current approach and aligning it with the regular AT
workflow. Since that is a larger change, we ban such usage for the
moment.
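
The ban amounts to an up-front check in the ALTER TABLE path. A sketch with hypothetical variable names; the real check's location (somewhere around ATController() and the subcommand pass processing) and exact message are assumptions.

```
/* Hypothetical sketch: reject SET DISTRIBUTED BY combined with any
 * other subcommand, since the temp-table CTAS it dispatches would see
 * catalog state that the QEs do not have yet. */
if (has_set_distributed_by && list_length(cmds) > 1)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("cannot combine SET DISTRIBUTED BY with other subcommands")));
```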

Co-authored-by: Andrew Repp <[email protected]>
For a recursive CTE, the planner currently creates a mergejoin plan that
moves the WorkTableScan results to a singleQE using
CdbPathLocus_MakeSingleQE. That introduces a motion between the Union and
the WorkTableScan. The WorkTableScan results should be local to each
segment and shouldn't be broadcast/moved to other segments, so we
shouldn't allow any motion between the Union and the WorkTableScan.
This commit adds a check in the last-resort case to avoid adding that
motion. If there is a WTS on either side in the last-resort case, we
should just fail (no possible plan using mergejoin).

Added a simplified test to force a mergejoin for such a recursive CTE; the
test shouldn't create a mergejoin plan.
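
Sketched against the last-resort branch of cdbpath_motion_for_join() in cdbpath.c; the control flow and variable names here are assumptions.

```
/* Last resort: before forcing one side to a single QE, give up if
 * either side contains a WorkTableScan -- its rows must stay local to
 * each segment, so no motion may be placed above it. */
if (cdbpath_contains_wts(outer.path) || cdbpath_contains_wts(inner.path))
    goto fail;                  /* no mergejoin plan possible */
```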

Co-authored-by: Zhenghua Lyu <[email protected]>
Co-authored-by: Alexandra Wang <[email protected]>
* Fix flakiness of regression test in create_index

A test case in create_index.sql became flaky after
commit b5aeb102946bedd3118916e84d79894af7e9f926.
The issue turned out to be that stats were sometimes not available
for the table used in the query.
So `ANALYZE <table_name>` was added after the insert queries,
so that stats are always available.
A test case in create_index is still failing after commit 50b7996999951d8e85cc4194a22ce7e92e6b44d3.
Failure could be reproduced only about 1 in 5 times.
**Test case:**
`EXPLAIN (COSTS OFF) SELECT * FROM polygon_tbl WHERE f1 ~ '((1,1),(2,2),(2,1))'::polygon ORDER BY (poly_center(f1))[0];`
**RCA:**
Since the ANALYZE on table polygon_tbl was run before all the inserts, row counts were not estimated correctly.
**FIX:**
By moving the ANALYZE command to after all of the table's insert queries,
the faulty plan no longer occurs.
Previously, the only time the timeline history file would be archived
was when a standby already configured for WAL archiving was
promoted. If WAL archiving was set up and started after the standby
was promoted, its current timeline history file would not be found in
the archive. This could cause issues when restoring backups that did
not include the timeline history files (e.g. pg_basebackup
-Xnone). Some example failures include failing to restore with
recovery_target_timeline explicitly set to the control file's timeline
id and failing to create cascade standbys after recovering with
recovery_target_timeline set to 'current' or 'latest'. To prevent
these restore issues, we now ensure that the current timeline history
file has been archived by marking it as ready for archiving after
recovery finishes if it was found that it has not already been
archived or marked ready.
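
A sketch of the post-recovery step, using names from xlog_internal.h and xlogarchive.c; the exact placement in recovery exit is assumed.

```
/* After recovery ends: make sure the current timeline's history file
 * will reach the archive, even if archiving was enabled only after
 * the promotion that created the timeline. */
char        histfname[MAXFNAMELEN];

TLHistoryFileName(histfname, ThisTimeLineID);
if (!XLogArchiveIsReadyOrDone(histfname))
    XLogArchiveNotify(histfname);       /* drop a .ready marker */
```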
work_mem still enjoys abundant use in our codebase. Although we would
like to replace it with operatorMemKB, that kind of change would require
significant investment and a careful strategy. So, bring it out of
deprecation for now.

Discussion: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/GgOxTgaMJsA/m/004-RkHDFAAJ

Reviewed-by: Ashwin Agrawal <[email protected]>
* Fix the false warning on the pgpass file

The existing logic compares file modes using string comparison, which leads to false warnings on Python 3 due to the different mode representations ("0600" vs. "0o600"). To address this issue and enhance code efficiency, this PR proposes changing the comparison to an octal integer comparison. This not only eliminates false warnings but also ensures a more strict and faster comparison method, compatible with tools like "2to3".
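
The utility code is Python, but the same idea in C terms is a numeric (octal) permission test rather than a string compare, in the spirit of the mode check libpq applies to ~/.pgpass. A hedged sketch with a hypothetical helper name:

```
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical sketch: warn only when the permission bits really are
 * wider than 0600, comparing numbers instead of strings. */
static bool
pgpass_mode_ok(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;
    return (st.st_mode & 0777) == 0600;
}
```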
…16410)

Previously, analyzedb did not include materialized views when preparing the list of tables to be analyzed. This caused refreshed materialized views to have poor performance until analyzed.

Solution:
Updated the SQL statements in analyzedb that collect the list of tables to also include relkind='m' from pg_class.
This adds materialized views to the list of tables to be analyzed on each run.
When memory usage has reached the Vmem limit or a resource group limit,
any new allocation will loop in gp_malloc and gp_failed_to_alloc, and then
error out with "ERRORDATA_STACK_SIZE exceeded".

We therefore print the log message header using write_stderr.
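
A sketch of the idea; the message text and the variable are hypothetical. When allocation is already failing, we bypass ereport(), which could itself allocate and push more frames onto the error-data stack, and write the header straight to stderr:

```
/* Hypothetical sketch: report the failed allocation without going
 * through ereport(), which may allocate and recurse while we are
 * already out of memory. */
write_stderr("LOG: failed to allocate %zu bytes under Vmem pressure\n",
             (size_t) requested_bytes);
```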
@yjhjstz marked this pull request as ready for review January 17, 2025 11:20
@yjhjstz added the cherry-pick (cherry-pick upstream commits) label Jan 17, 2025