Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-34529 Shrink the system tablespace when system tablespace contains MDEV-30671 leaked undo pages #3466

Merged
merged 1 commit into from
Oct 17, 2024

Conversation

Thirunarayanan
Copy link
Member

@Thirunarayanan Thirunarayanan commented Aug 16, 2024

  • The Jira issue number for this PR is: MDEV-34529

Description

  • InnoDB fails to shrink the system tablespace when it contains
    the leaked undo log pages caused by MDEV-30671.

  • InnoDB does free the unused segment in system tablespace
    before shrinking the tablespace.

inode_info: Structure to store the inode page and offsets.

fil_space_t::garbage_collect():
Frees the system tablespace unused segment

fsp_free_unused_seg(): Frees the unused segment

fsp_get_sys_used_segment(): Iterates through all default
file segment and index segment present in system tablespace.

How can this PR be tested?

Added the test case innodb.undo_leak which basically leaks the undo segment by purge thread.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the latest MariaDB development branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand it, the current version of this change is tweaking the shrinking of the system tablespace so that any pages that do not belong to index trees that are declared in SYS_INDEXES will be ignored when determining the maximum allocated page number within the system tablespace.

This seems to be both wrong and insufficient.

  1. Some pages of SYS_TABLES, SYS_INDEXES, SYS_COLUMNS and SYS_FIELDS could actually be stored in an area that the current patch would deem safe for truncation.
  2. There could also be other objects that are allocated in the system tablespace, such as undo log pages that are reachable via the TRX_SYS page. (I think that the TRX_SYS page itself as well as the empty change buffer tree would not cause a problem.)
  3. We fail to mark any leaked pages that are below the tweaked upper limit as free.

I think that for development and debugging, it would be good to implement something that pretty-prints the allocated page bitmap, possibly using different characters for representing allocated pages in different roles. For example, `'0' + (index_id & 63)`` for index pages, and some characters in the 0x21 to 0x2f range for anything else. And 0x20 (space) for pages that are marked as free. This should nicely illustrate the page leaks due to MDEV-31234 and allow us to understand if we have missed anything.

I think that it would be better to implement an isolated step that would mark as freed any leaked pages in the system tablespace. It could be independent of the shrinking logic.

Comment on lines 3721 to 3722
err= dict_get_sys_tablespace_indexes(sys_root_pages);
if (err) return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no diagnostic in case an error was reported, and no suggestion how to resolve the corruption (such as displaying the name of the problematic index and table that had been created with innodb_file_per_table=0). We wouldn’t even attempting a normal truncation in this case.

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/handler/ha_innodb.cc Outdated Show resolved Hide resolved
storage/innobase/include/dict0load.h Outdated Show resolved Hide resolved
storage/innobase/dict/dict0load.cc Outdated Show resolved Hide resolved
storage/innobase/dict/dict0load.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
Comment on lines 3660 to 3664
if (n_words == 16)
{
fprintf(stderr, "\n\t");
n_words= 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be good to display the page number in hexadecimal at the start of each line.

Comment on lines 3681 to 3684
ulint type= fil_page_get_type(block->page.frame);
switch (type) {
case FIL_PAGE_INDEX:
fprintf(stderr, "I");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a comment that we can’t trust the FIL_PAGE_TYPE field unless space->full_crc32() holds. For other system tablespaces, the contents can be any garbage. The only guarantee is that if the page belongs to a B-tree, then the FIL_PAGE_TYPE will contain FIL_PAGE_INDEX. The opposite is not guaranteed.

Comment on lines 3727 to 3730
fprintf(stderr, "\nNote the letter description of the pages\n\t");
fprintf(stderr, "I - Index node page\n\t"
"U - Undo log page\n\t"
"Z - Compressed page\n\t"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing formatted in the output; therefore we could as well use fputs or putc.

I was hoping that we would display something like 'A' + ((index_id - 10) % ('~' - 'A' + 1)) for pages that look like index pages. That would help identify them. This would leave the ASCII range ' ' to '?' (0x20 to 0x3f) available for displaying other leaked pages.

Compressed pages can’t exist in the system tablespace, neither page_compressed nor ROW_FORMAT=COMPRESSED.

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/handler/ha_innodb.cc Outdated Show resolved Hide resolved
@Thirunarayanan Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 3 times, most recently from 4d37e0d to 17196d1 Compare September 13, 2024 07:49
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick comments. I don’t think that an additional configuration parameter for enabling this is absolutely necessary.

storage/innobase/dict/dict0load.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
Comment on lines 3616 to 3619
@param used_inode list of inode information used in
system tablespace
@return error code */
dberr_t fsp_get_sys_used_segment(inodeInfo *usedInodes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @param does not match the parameter name.

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/include/dict0mem.h Outdated Show resolved Hide resolved
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks good to me. My comments are mainly about coding style or API efficiency.

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
@Thirunarayanan Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 3 times, most recently from deecdf2 to 007226c Compare September 18, 2024 07:52
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this looks mostly OK. I have a few nitpicks.

storage/innobase/srv/srv0start.cc Outdated Show resolved Hide resolved
storage/innobase/srv/srv0start.cc Outdated Show resolved Hide resolved
storage/innobase/srv/srv0start.cc Outdated Show resolved Hide resolved
storage/innobase/srv/srv0start.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
Comment on lines 3894 to 3896
mtr->memset(block,
TRX_RSEG + TRX_RSEG_UNDO_SLOTS,
TRX_RSEG_N_SLOTS * TRX_RSEG_SLOT_SIZE, 0xff);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment that says which existing function this is mimicing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is wrong. we shouldn't reset the rseg undo slots. Because undo object has already constructed
before this function. If we do that, we need to remove the respective undo objects. I think that is un-necessary.

Comment on lines 3889 to 3893
buf_block_t *block= buf_page_get_gen(rseg->page_id(), 0,
RW_X_LATCH, nullptr,
BUF_GET, mtr, &err);
if (!block)
return err;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we latch all these pages before making any changes? In that way, we would not do anything irreversible before trying to make sure that the operation can succeed?

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/include/dict0load.h Outdated Show resolved Hide resolved
@Thirunarayanan Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 3 times, most recently from d0a5743 to 7256097 Compare September 23, 2024 11:27
Comment on lines 3212 to 3225
if (m_old_inode_entry[page_no])
return DB_SUCCESS;
dberr_t err= DB_SUCCESS;
buf_block_t *block=
buf_page_get_gen(page_id_t{0, page_no}, 0, RW_SX_LATCH,
nullptr, BUF_GET, mtr, &err);
if (!block)
return err;
buf_block_t *old= buf_LRU_get_free_block(have_no_mutex_soft);
if (!old) return DB_OUT_OF_MEMORY;

memcpy_aligned<UNIV_PAGE_SIZE_MIN>(
old->page.frame, block->page.frame, srv_page_size);
m_old_inode_entry[page_no]= old;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the start of this function, the operator[] call already inserted a nullptr to the std::map. We could have saved a reference to that, e.g.,

buf_block_t *&old= m_old_inode_entry[page_no];
if (old)
  return DB_SUCCESS;
//
old= buf_LRU_get_free_block(have_no_mutex_soft);
return old ? DB_OUT_OF_MEMORY : DB_SUCCESS;

However, I do not quite understand the need for storing a copy of the page. Logic like this would make sense when we are shrinking the system tablespace and we expect to free most pages, rewriting only a few surviving pages of the system tablespace. In that case we wanted to minimize the amount of redo log written, so that everything can be done in a single atomic mini-transaction, with at most 2 MiB of log records.

Here we are handling a special case and we are allowed to use as many mini-transactions as it takes.

Can we structure the mini-transactions in such a way that we free only one leaked page per mini-transaction, after performing all necessary consistency checks? In that way, there should be no need to allocate any shadow pages. Yes, we may be modifying the same pages over and over again, but the buffer pool will cache the write-backs. Yes, we may write a lot of log records, but this is a very rarely executed clean-up of corrupted data, and it will not be executed by default.

Comment on lines 3882 to 3887
err= fseg_free_extent(inode, iblock, space,
xdes_get_offset(descr), &mtr);
DBUG_EXECUTE_IF("unused_undo_free_fail_4",
err= DB_CORRUPTION;);
if (err)
goto err_exit;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split fseg_free_extent() to two separate: checking the consistency, and modifying the data? We should first only check the consistency.

Comment on lines 3894 to 3903
for (uint32_t i= 0; i < FSEG_FRAG_ARR_N_SLOTS; i++)
{
uint32_t page_no = fseg_get_nth_frag_page_no(inode, i);
if (page_no != FIL_NULL)
{
err= fseg_free_page_low(inode, iblock, space, page_no, &mtr);
DBUG_EXECUTE_IF("unused_undo_free_fail_5",
err= DB_CORRUPTION;);
if (err)
goto err_exit;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we first execute a loop that checks that all pages that we are going to free have been marked as allocated? And only after all consistency checks have passed, execute the modification?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a simpler approach:

  • We would execute this extra freeing of leaked pages only on startup (not on shutdown) if the :autoshrink attribute is set on the system tablespace.
  • We will structure the mini-transactions normally, with no special low-level error handling.
  • If we notice any error, then we will:
    1. display an error message saying that :autoshrink is being ignored due to an error
    2. invoke log_buffer_flush_to_disk()
    3. execute InnoDB shutdown as if innodb_fast_shutdown=2 had been set
    4. pretend that :autoshrink was not set
    5. restart InnoDB

In this way, all changes right before the current mini-transaction will be recovered, and the server should start up without corrupting a corrupted system tablespace too much further.

Comment on lines 3907 to 3909
/* There is a possiblity that the inode page can move from
FSP_SEG_INODES_FULL to FSP_SEG_INODES_FREE and remove it
from FSEG_SEG_INODES_FREE. So better to store the prev
and next inode pages */
const fil_addr_t prev_addr=
flst_get_prev_addr(iblock->page.frame + FSEG_INODE_PAGE_NODE);
if (prev_addr.page != FIL_NULL)
old_inodes_entry.insert(prev_addr.page, &mtr);

const fil_addr_t next_addr=
flst_get_next_addr(iblock->page.frame + FSEG_INODE_PAGE_NODE);

if (next_addr.page != FIL_NULL)
old_inodes_entry.insert(next_addr.page, &mtr);

fsp_free_seg_inode(space, inode, iblock, &mtr);
mtr.commit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not quite understand the purpose of the old_insert_entry.insert() calls here. Are they really needed? If they are, why are we ignoring the return value?

@Thirunarayanan Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 2 times, most recently from 12a175a to da758da Compare October 2, 2024 01:59
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really close to completion. I did not debug the test case yet, but will do so on my final review.

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
@Thirunarayanan Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 7 times, most recently from c282781 to 536b46c Compare October 7, 2024 05:41
mysql-test/suite/innodb/t/undo_leak_fail.test Outdated Show resolved Hide resolved
mysql-test/suite/innodb/r/undo_leak_fail.result Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Show resolved Hide resolved
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting better, but some error handling could improved further.

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
@Thirunarayanan Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 2 times, most recently from e7dd44d to cb24698 Compare October 9, 2024 05:05
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling is almost there. I’d like to see some clarifying source code comments still.

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
Comment on lines 2986 to 2988
) != DB_SUCCESS_LOCKED_REC) {
return true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could replace if (fseg_free_step_low(…) != DB_SUCCESS_LOCKED_REC) with

	return fseg_free_step_low(…) != DB_SUCCESS_LOCKED_REC;

But, is that what we really want here? Wouldn’t we want == DB_SUCCESS instead? That is, our caller should keep freeing as long there was no corruption and we have not freed everything yet? Our callers are btr_free_but_not_root(), which is followed by btr_free_root(), and trx_purge_free_segment(). I’d suggest to carefully review the logic in both. A source code comment could be useful here.

On a related note, I see that fseg_inode_free() is checking for DB_SUCCESS_LOCKED_REC. It seems to be the right course of action there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the fseg_free_step_low() returns any error code other than DB_SUCCESS_LOCKED_REC then InnoDB
faced corruption or freed the segment successfully. So I think this condition is correct.

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the test execution in a bit more detail. I think that we need to cover the existence of XA PREPARE transactions and improve some error handling and code coverage.

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
@Thirunarayanan Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 2 times, most recently from 0475664 to 278a508 Compare October 15, 2024 14:03
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this still needs some minor refinement.

storage/innobase/trx/trx0rseg.cc Outdated Show resolved Hide resolved
storage/innobase/trx/trx0rseg.cc Outdated Show resolved Hide resolved
Comment on lines 799 to 805
undo= UT_LIST_GET_FIRST(rseg.undo_cached);
while (undo)
{
if (undo->state == TRX_UNDO_PREPARED)
return true;
undo= UT_LIST_GET_NEXT(undo_list, undo);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can undo_cached possibly contain any XA PREPARE transaction, or when can we have anything else than undo->state == TRX_UNDO_CACHED? As far as I can tell, the elements in undo_cached must have undo->size==1 and must carry the state TRX_UNDO_CACHED at offset TRX_UNDO_SEG_HDR + TRX_UNDO_STATE of that single page.

Can you replace the condition with undo->state != TRX_UNDO_CACHED and a debug assertion that would catch this failure?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was wrong. undo_cached must carry the state TRX_UNDO_CACHED and state has been changed during trx_purge_add_undo_to_history() . So we should check only for rseg.undo_list

storage/innobase/include/fsp0sysspace.h Outdated Show resolved Hide resolved
storage/innobase/include/fsp0sysspace.h Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved
…ns MDEV-30671 leaked undo pages

- InnoDB fails to shrink the system tablespace when it contains
the leaked undo log pages caused by MDEV-30671.

- InnoDB does free the unused segment in system tablespace
before shrinking the tablespace. InnoDB fails to free
the unused segment if XA PREPARE transaction exist or
if the previous shutdown was not with innodb_fast_shutdown=0

inode_info: Structure to store the inode page and offsets.

fil_space_t::garbage_collect(): Frees the system tablespace
unused segment

fsp_get_sys_used_segment(): Iterates through all default
file segment and index segment present in system tablespace.

trx_sys_t::is_xa_exist(): Returns true if the XA transaction
exist in the undo logs

fseg_inode_free(): Frees the extents, fragment pages for the
given index node and ignores any error similar to
trx_purge_free_segment()

trx_sys_t::reset_page(): Retain the TRX_SYS_FSEG_HEADER value
in trx_sys page while resetting the page.
@Thirunarayanan Thirunarayanan merged commit 4a1ded6 into 11.2 Oct 17, 2024
14 of 16 checks passed
@Thirunarayanan Thirunarayanan deleted the 11.2-MDEV-34529 branch October 17, 2024 05:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants