MDEV-34529 Shrink the system tablespace when system tablespace contains MDEV-30671 leaked undo pages #3466

Thirunarayanan · 2024-08-16T14:24:19Z

The Jira issue number for this PR is: MDEV-34529

Description

InnoDB fails to shrink the system tablespace when it contains
the leaked undo log pages caused by MDEV-30671.
InnoDB does free the unused segment in system tablespace
before shrinking the tablespace.

inode_info: Structure to store the inode page and offsets.

fil_space_t::garbage_collect():
Frees the system tablespace unused segment

fsp_free_unused_seg(): Frees the unused segment

fsp_get_sys_used_segment(): Iterates through all default
file segment and index segment present in system tablespace.

How can this PR be tested?

Added the test case innodb.undo_leak which basically leaks the undo segment by purge thread.

Basing the PR against the correct MariaDB version

This is a new feature or a refactoring, and the PR is based against the latest MariaDB development branch.
This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

I checked the [CODING_STANDARDS.md](https://github.com/MariaDB/server/blob/-/CODING_STANDARD- (https://github.com/MariaDB/server/blob/-/CODING_STANDARDS.md) file and my PR conforms to this where appropriate.
For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

CLAassistant · 2024-08-16T14:24:25Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

dr-m

As far as I understand it, the current version of this change is tweaking the shrinking of the system tablespace so that any pages that do not belong to index trees that are declared in SYS_INDEXES will be ignored when determining the maximum allocated page number within the system tablespace.

This seems to be both wrong and insufficient.

Some pages of SYS_TABLES, SYS_INDEXES, SYS_COLUMNS and SYS_FIELDS could actually be stored in an area that the current patch would deem safe for truncation.
There could also be other objects that are allocated in the system tablespace, such as undo log pages that are reachable via the TRX_SYS page. (I think that the TRX_SYS page itself as well as the empty change buffer tree would not cause a problem.)
We fail to mark any leaked pages that are below the tweaked upper limit as free.

I think that for development and debugging, it would be good to implement something that pretty-prints the allocated page bitmap, possibly using different characters for representing allocated pages in different roles. For example, `'0' + (index_id & 63)`` for index pages, and some characters in the 0x21 to 0x2f range for anything else. And 0x20 (space) for pages that are marked as free. This should nicely illustrate the page leaks due to MDEV-31234 and allow us to understand if we have missed anything.

I think that it would be better to implement an isolated step that would mark as freed any leaked pages in the system tablespace. It could be independent of the shrinking logic.

dr-m · 2024-08-20T12:01:36Z

storage/innobase/fsp/fsp0fsp.cc

+    err= dict_get_sys_tablespace_indexes(sys_root_pages);
+    if (err) return;


There is no diagnostic in case an error was reported, and no suggestion how to resolve the corruption (such as displaying the name of the problematic index and table that had been created with innodb_file_per_table=0). We wouldn’t even attempting a normal truncation in this case.

storage/innobase/fsp/fsp0fsp.cc

storage/innobase/handler/ha_innodb.cc

storage/innobase/include/dict0load.h

storage/innobase/dict/dict0load.cc

storage/innobase/fsp/fsp0fsp.cc

dr-m · 2024-09-05T12:26:46Z

storage/innobase/fsp/fsp0fsp.cc

+    if (n_words == 16)
+    {
+      fprintf(stderr, "\n\t");
+      n_words= 0;
+    }


It could be good to display the page number in hexadecimal at the start of each line.

dr-m · 2024-09-05T12:28:30Z

storage/innobase/fsp/fsp0fsp.cc

+    ulint type= fil_page_get_type(block->page.frame);
+    switch (type) {
+    case FIL_PAGE_INDEX:
+      fprintf(stderr, "I");


It would be good to add a comment that we can’t trust the FIL_PAGE_TYPE field unless space->full_crc32() holds. For other system tablespaces, the contents can be any garbage. The only guarantee is that if the page belongs to a B-tree, then the FIL_PAGE_TYPE will contain FIL_PAGE_INDEX. The opposite is not guaranteed.

dr-m · 2024-09-05T12:32:51Z

storage/innobase/fsp/fsp0fsp.cc

+  fprintf(stderr, "\nNote the letter description of the pages\n\t");
+  fprintf(stderr, "I - Index node page\n\t"
+                  "U - Undo log page\n\t"
+                  "Z - Compressed page\n\t"


There is nothing formatted in the output; therefore we could as well use fputs or putc.

I was hoping that we would display something like 'A' + ((index_id - 10) % ('~' - 'A' + 1)) for pages that look like index pages. That would help identify them. This would leave the ASCII range ' ' to '?' (0x20 to 0x3f) available for displaying other leaked pages.

Compressed pages can’t exist in the system tablespace, neither page_compressed nor ROW_FORMAT=COMPRESSED.

storage/innobase/fsp/fsp0fsp.cc

storage/innobase/handler/ha_innodb.cc

dr-m

Some quick comments. I don’t think that an additional configuration parameter for enabling this is absolutely necessary.

storage/innobase/dict/dict0load.cc

storage/innobase/fsp/fsp0fsp.cc

dr-m · 2024-09-16T05:17:15Z

storage/innobase/fsp/fsp0fsp.cc

+@param used_inode list of inode information used in
+                  system tablespace
+@return error code */
+dberr_t fsp_get_sys_used_segment(inodeInfo *usedInodes)


The @param does not match the parameter name.

storage/innobase/fsp/fsp0fsp.cc

storage/innobase/include/dict0mem.h

dr-m

The logic looks good to me. My comments are mainly about coding style or API efficiency.

storage/innobase/fsp/fsp0fsp.cc

dr-m

Thank you, this looks mostly OK. I have a few nitpicks.

storage/innobase/srv/srv0start.cc

storage/innobase/fsp/fsp0fsp.cc

dr-m · 2024-09-18T16:12:11Z

storage/innobase/fsp/fsp0fsp.cc

+      mtr->memset(block,
+                  TRX_RSEG + TRX_RSEG_UNDO_SLOTS,
+                  TRX_RSEG_N_SLOTS * TRX_RSEG_SLOT_SIZE, 0xff);


Can you add a comment that says which existing function this is mimicing?

This change is wrong. we shouldn't reset the rseg undo slots. Because undo object has already constructed
before this function. If we do that, we need to remove the respective undo objects. I think that is un-necessary.

dr-m · 2024-09-18T16:14:10Z

storage/innobase/fsp/fsp0fsp.cc

+      buf_block_t *block= buf_page_get_gen(rseg->page_id(), 0,
+                                           RW_X_LATCH, nullptr,
+                                           BUF_GET, mtr, &err);
+      if (!block)
+        return err;


Could we latch all these pages before making any changes? In that way, we would not do anything irreversible before trying to make sure that the operation can succeed?

storage/innobase/fsp/fsp0fsp.cc

storage/innobase/include/dict0load.h

dr-m · 2024-09-23T14:44:02Z

storage/innobase/fsp/fsp0fsp.cc

+    if (m_old_inode_entry[page_no])
+      return DB_SUCCESS;
+    dberr_t err= DB_SUCCESS;
+    buf_block_t *block=
+      buf_page_get_gen(page_id_t{0, page_no}, 0, RW_SX_LATCH,
+                       nullptr, BUF_GET, mtr, &err);
+    if (!block)
+      return err;
+    buf_block_t *old= buf_LRU_get_free_block(have_no_mutex_soft);
+    if (!old) return DB_OUT_OF_MEMORY;
+
+    memcpy_aligned<UNIV_PAGE_SIZE_MIN>(
+      old->page.frame, block->page.frame, srv_page_size);
+    m_old_inode_entry[page_no]= old;


At the start of this function, the operator[] call already inserted a nullptr to the std::map. We could have saved a reference to that, e.g.,

buf_block_t *&old= m_old_inode_entry[page_no]; if (old) return DB_SUCCESS; //… old= buf_LRU_get_free_block(have_no_mutex_soft); return old ? DB_OUT_OF_MEMORY : DB_SUCCESS;

However, I do not quite understand the need for storing a copy of the page. Logic like this would make sense when we are shrinking the system tablespace and we expect to free most pages, rewriting only a few surviving pages of the system tablespace. In that case we wanted to minimize the amount of redo log written, so that everything can be done in a single atomic mini-transaction, with at most 2 MiB of log records.

Here we are handling a special case and we are allowed to use as many mini-transactions as it takes.

Can we structure the mini-transactions in such a way that we free only one leaked page per mini-transaction, after performing all necessary consistency checks? In that way, there should be no need to allocate any shadow pages. Yes, we may be modifying the same pages over and over again, but the buffer pool will cache the write-backs. Yes, we may write a lot of log records, but this is a very rarely executed clean-up of corrupted data, and it will not be executed by default.

dr-m · 2024-09-23T14:46:08Z

storage/innobase/fsp/fsp0fsp.cc

+    err= fseg_free_extent(inode, iblock, space,
+                          xdes_get_offset(descr), &mtr);
+    DBUG_EXECUTE_IF("unused_undo_free_fail_4",
+                    err= DB_CORRUPTION;);
+    if (err)
+      goto err_exit;


Can we split fseg_free_extent() to two separate: checking the consistency, and modifying the data? We should first only check the consistency.

dr-m · 2024-09-23T14:47:33Z

storage/innobase/fsp/fsp0fsp.cc

+  for (uint32_t i= 0; i < FSEG_FRAG_ARR_N_SLOTS; i++)
+  {
+    uint32_t page_no = fseg_get_nth_frag_page_no(inode, i);
+    if (page_no != FIL_NULL)
+    {
+      err= fseg_free_page_low(inode, iblock, space, page_no, &mtr);
+      DBUG_EXECUTE_IF("unused_undo_free_fail_5",
+                      err= DB_CORRUPTION;);
+      if (err)
+        goto err_exit;
+    }
+  }


Can we first execute a loop that checks that all pages that we are going to free have been marked as allocated? And only after all consistency checks have passed, execute the modification?

There seems to be a simpler approach:

We would execute this extra freeing of leaked pages only on startup (not on shutdown) if the :autoshrink attribute is set on the system tablespace.

We will structure the mini-transactions normally, with no special low-level error handling.

If we notice any error, then we will:

display an error message saying that :autoshrink is being ignored due to an error

invoke log_buffer_flush_to_disk()

execute InnoDB shutdown as if innodb_fast_shutdown=2 had been set

pretend that :autoshrink was not set

restart InnoDB

In this way, all changes right before the current mini-transaction will be recovered, and the server should start up without corrupting a corrupted system tablespace too much further.

dr-m · 2024-09-23T14:51:44Z

storage/innobase/fsp/fsp0fsp.cc

+  /* There is a possiblity that the inode page can move from
+  FSP_SEG_INODES_FULL to FSP_SEG_INODES_FREE and remove it
+  from FSEG_SEG_INODES_FREE. So better to store the prev
+  and next inode pages */
+  const fil_addr_t prev_addr=
+    flst_get_prev_addr(iblock->page.frame + FSEG_INODE_PAGE_NODE);
+  if (prev_addr.page != FIL_NULL)
+    old_inodes_entry.insert(prev_addr.page, &mtr);
+
+  const fil_addr_t next_addr=
+    flst_get_next_addr(iblock->page.frame + FSEG_INODE_PAGE_NODE);
+
+  if (next_addr.page != FIL_NULL)
+    old_inodes_entry.insert(next_addr.page, &mtr);
+
+  fsp_free_seg_inode(space, inode, iblock, &mtr);
+  mtr.commit();


I do not quite understand the purpose of the old_insert_entry.insert() calls here. Are they really needed? If they are, why are we ignoring the return value?

dr-m

This is really close to completion. I did not debug the test case yet, but will do so on my final review.

storage/innobase/fsp/fsp0fsp.cc

mysql-test/suite/innodb/t/undo_leak_fail.test

mysql-test/suite/innodb/r/undo_leak_fail.result

storage/innobase/fsp/fsp0fsp.cc

dr-m

This is getting better, but some error handling could improved further.

storage/innobase/fsp/fsp0fsp.cc

dr-m

The error handling is almost there. I’d like to see some clarifying source code comments still.

storage/innobase/fsp/fsp0fsp.cc

dr-m · 2024-10-09T14:33:04Z

storage/innobase/fsp/fsp0fsp.cc

+                               ) != DB_SUCCESS_LOCKED_REC) {
 		return true;
 	}


We could replace if (fseg_free_step_low(…) != DB_SUCCESS_LOCKED_REC) with

return fseg_free_step_low(…) != DB_SUCCESS_LOCKED_REC;

But, is that what we really want here? Wouldn’t we want == DB_SUCCESS instead? That is, our caller should keep freeing as long there was no corruption and we have not freed everything yet? Our callers are btr_free_but_not_root(), which is followed by btr_free_root(), and trx_purge_free_segment(). I’d suggest to carefully review the logic in both. A source code comment could be useful here.

On a related note, I see that fseg_inode_free() is checking for DB_SUCCESS_LOCKED_REC. It seems to be the right course of action there.

If the fseg_free_step_low() returns any error code other than DB_SUCCESS_LOCKED_REC then InnoDB
faced corruption or freed the segment successfully. So I think this condition is correct.

storage/innobase/fsp/fsp0fsp.cc

dr-m

I reviewed the test execution in a bit more detail. I think that we need to cover the existence of XA PREPARE transactions and improve some error handling and code coverage.

storage/innobase/fsp/fsp0fsp.cc

dr-m

Sorry, this still needs some minor refinement.

storage/innobase/trx/trx0rseg.cc

dr-m · 2024-10-16T10:31:09Z

storage/innobase/trx/trx0rseg.cc

+    undo= UT_LIST_GET_FIRST(rseg.undo_cached);
+    while (undo)
+    {
+      if (undo->state == TRX_UNDO_PREPARED)
+        return true;
+      undo= UT_LIST_GET_NEXT(undo_list, undo);
+    }


How can undo_cached possibly contain any XA PREPARE transaction, or when can we have anything else than undo->state == TRX_UNDO_CACHED? As far as I can tell, the elements in undo_cached must have undo->size==1 and must carry the state TRX_UNDO_CACHED at offset TRX_UNDO_SEG_HDR + TRX_UNDO_STATE of that single page.

Can you replace the condition with undo->state != TRX_UNDO_CACHED and a debug assertion that would catch this failure?

Yes, I was wrong. undo_cached must carry the state TRX_UNDO_CACHED and state has been changed during trx_purge_add_undo_to_history() . So we should check only for rseg.undo_list

storage/innobase/include/fsp0sysspace.h

storage/innobase/fsp/fsp0fsp.cc

…ns MDEV-30671 leaked undo pages - InnoDB fails to shrink the system tablespace when it contains the leaked undo log pages caused by MDEV-30671. - InnoDB does free the unused segment in system tablespace before shrinking the tablespace. InnoDB fails to free the unused segment if XA PREPARE transaction exist or if the previous shutdown was not with innodb_fast_shutdown=0 inode_info: Structure to store the inode page and offsets. fil_space_t::garbage_collect(): Frees the system tablespace unused segment fsp_get_sys_used_segment(): Iterates through all default file segment and index segment present in system tablespace. trx_sys_t::is_xa_exist(): Returns true if the XA transaction exist in the undo logs fseg_inode_free(): Frees the extents, fragment pages for the given index node and ignores any error similar to trx_purge_free_segment() trx_sys_t::reset_page(): Retain the TRX_SYS_FSEG_HEADER value in trx_sys page while resetting the page.

Thirunarayanan requested a review from dr-m August 16, 2024 14:24

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch from c76856e to 57b5405 Compare August 20, 2024 06:27

dr-m reviewed Aug 20, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch from 57b5405 to b6cabad Compare September 4, 2024 14:24

dr-m reviewed Sep 5, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 3 times, most recently from 4d37e0d to 17196d1 Compare September 13, 2024 07:49

dr-m reviewed Sep 16, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch from 17196d1 to 9d9f286 Compare September 16, 2024 15:48

dr-m reviewed Sep 17, 2024

View reviewed changes

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

storage/innobase/include/dict0mem.h Outdated Show resolved Hide resolved

dr-m reviewed Sep 17, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 3 times, most recently from deecdf2 to 007226c Compare September 18, 2024 07:52

dr-m requested changes Sep 18, 2024

View reviewed changes

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

dr-m reviewed Sep 18, 2024

View reviewed changes

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

dr-m reviewed Sep 18, 2024

View reviewed changes

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

dr-m reviewed Sep 18, 2024

View reviewed changes

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

dr-m reviewed Sep 18, 2024

View reviewed changes

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

dr-m reviewed Sep 18, 2024

View reviewed changes

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

dr-m reviewed Sep 18, 2024

View reviewed changes

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch from 007226c to 364c13f Compare September 18, 2024 13:38

dr-m reviewed Sep 18, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 3 times, most recently from d0a5743 to 7256097 Compare September 23, 2024 11:27

dr-m reviewed Sep 23, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch from 7256097 to 0c632c1 Compare October 1, 2024 14:48

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 2 times, most recently from 12a175a to da758da Compare October 2, 2024 01:59

dr-m reviewed Oct 2, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 7 times, most recently from c282781 to 536b46c Compare October 7, 2024 05:41

dr-m reviewed Oct 7, 2024

View reviewed changes

mysql-test/suite/innodb/t/undo_leak_fail.test Outdated Show resolved Hide resolved

mysql-test/suite/innodb/r/undo_leak_fail.result Outdated Show resolved Hide resolved

storage/innobase/fsp/fsp0fsp.cc Show resolved Hide resolved

dr-m reviewed Oct 7, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 2 times, most recently from e7dd44d to cb24698 Compare October 9, 2024 05:05

dr-m reviewed Oct 9, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch from cb24698 to c1fc294 Compare October 10, 2024 13:51

dr-m reviewed Oct 14, 2024

View reviewed changes

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

storage/innobase/fsp/fsp0fsp.cc Outdated Show resolved Hide resolved

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch 2 times, most recently from 0475664 to 278a508 Compare October 15, 2024 14:03

dr-m reviewed Oct 16, 2024

View reviewed changes

Thirunarayanan force-pushed the 11.2-MDEV-34529 branch from 278a508 to 4a1ded6 Compare October 16, 2024 16:05

dr-m approved these changes Oct 16, 2024

View reviewed changes

Thirunarayanan merged commit 4a1ded6 into 11.2 Oct 17, 2024
14 of 16 checks passed

Thirunarayanan deleted the 11.2-MDEV-34529 branch October 17, 2024 05:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MDEV-34529 Shrink the system tablespace when system tablespace contains MDEV-30671 leaked undo pages #3466

MDEV-34529 Shrink the system tablespace when system tablespace contains MDEV-30671 leaked undo pages #3466

Thirunarayanan commented Aug 16, 2024 •

edited

Loading

CLAassistant commented Aug 16, 2024

dr-m left a comment

dr-m Aug 20, 2024

dr-m Sep 5, 2024

dr-m Sep 5, 2024

dr-m Sep 5, 2024

dr-m left a comment

dr-m Sep 16, 2024

dr-m left a comment

dr-m left a comment

dr-m Sep 18, 2024

Thirunarayanan Sep 18, 2024

dr-m Sep 18, 2024

dr-m Sep 23, 2024

dr-m Sep 23, 2024

dr-m Sep 23, 2024

dr-m Sep 24, 2024

dr-m Sep 23, 2024

dr-m left a comment

dr-m left a comment

dr-m left a comment

dr-m Oct 9, 2024

Thirunarayanan Oct 10, 2024

dr-m left a comment

dr-m left a comment

dr-m Oct 16, 2024

Thirunarayanan Oct 16, 2024

		err= dict_get_sys_tablespace_indexes(sys_root_pages);
		if (err) return;

MDEV-34529 Shrink the system tablespace when system tablespace contains MDEV-30671 leaked undo pages #3466

MDEV-34529 Shrink the system tablespace when system tablespace contains MDEV-30671 leaked undo pages #3466

Conversation

Thirunarayanan commented Aug 16, 2024 • edited Loading

Description

How can this PR be tested?

Basing the PR against the correct MariaDB version

PR quality check

CLAassistant commented Aug 16, 2024

dr-m left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dr-m left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dr-m left a comment

Choose a reason for hiding this comment

dr-m left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dr-m left a comment

Choose a reason for hiding this comment

dr-m left a comment

Choose a reason for hiding this comment

dr-m left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dr-m left a comment

Choose a reason for hiding this comment

dr-m left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Thirunarayanan commented Aug 16, 2024 •

edited

Loading