-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MDEV-34529 Shrink the system tablespace when system tablespace contains MDEV-30671 leaked undo pages #3466
Conversation
|
c76856e
to
57b5405
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand it, the current version of this change is tweaking the shrinking of the system tablespace so that any pages that do not belong to index trees that are declared in SYS_INDEXES
will be ignored when determining the maximum allocated page number within the system tablespace.
This seems to be both wrong and insufficient.
- Some pages of
SYS_TABLES
,SYS_INDEXES
,SYS_COLUMNS
andSYS_FIELDS
could actually be stored in an area that the current patch would deem safe for truncation. - There could also be other objects that are allocated in the system tablespace, such as undo log pages that are reachable via the
TRX_SYS
page. (I think that theTRX_SYS
page itself as well as the empty change buffer tree would not cause a problem.) - We fail to mark any leaked pages that are below the tweaked upper limit as free.
I think that for development and debugging, it would be good to implement something that pretty-prints the allocated page bitmap, possibly using different characters for representing allocated pages in different roles. For example, `'0' + (index_id & 63)`` for index pages, and some characters in the 0x21 to 0x2f range for anything else. And 0x20 (space) for pages that are marked as free. This should nicely illustrate the page leaks due to MDEV-31234 and allow us to understand if we have missed anything.
I think that it would be better to implement an isolated step that would mark as freed any leaked pages in the system tablespace. It could be independent of the shrinking logic.
storage/innobase/fsp/fsp0fsp.cc
Outdated
err= dict_get_sys_tablespace_indexes(sys_root_pages); | ||
if (err) return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no diagnostic in case an error was reported, and no suggestion how to resolve the corruption (such as displaying the name of the problematic index and table that had been created with innodb_file_per_table=0
). We wouldn’t even attempting a normal truncation in this case.
57b5405
to
b6cabad
Compare
storage/innobase/fsp/fsp0fsp.cc
Outdated
if (n_words == 16) | ||
{ | ||
fprintf(stderr, "\n\t"); | ||
n_words= 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be good to display the page number in hexadecimal at the start of each line.
storage/innobase/fsp/fsp0fsp.cc
Outdated
ulint type= fil_page_get_type(block->page.frame); | ||
switch (type) { | ||
case FIL_PAGE_INDEX: | ||
fprintf(stderr, "I"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add a comment that we can’t trust the FIL_PAGE_TYPE
field unless space->full_crc32()
holds. For other system tablespaces, the contents can be any garbage. The only guarantee is that if the page belongs to a B-tree, then the FIL_PAGE_TYPE
will contain FIL_PAGE_INDEX
. The opposite is not guaranteed.
storage/innobase/fsp/fsp0fsp.cc
Outdated
fprintf(stderr, "\nNote the letter description of the pages\n\t"); | ||
fprintf(stderr, "I - Index node page\n\t" | ||
"U - Undo log page\n\t" | ||
"Z - Compressed page\n\t" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing formatted in the output; therefore we could as well use fputs
or putc
.
I was hoping that we would display something like 'A' + ((index_id - 10) % ('~' - 'A' + 1))
for pages that look like index pages. That would help identify them. This would leave the ASCII range ' ' to '?'
(0x20 to 0x3f) available for displaying other leaked pages.
Compressed pages can’t exist in the system tablespace, neither page_compressed
nor ROW_FORMAT=COMPRESSED
.
4d37e0d
to
17196d1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some quick comments. I don’t think that an additional configuration parameter for enabling this is absolutely necessary.
storage/innobase/fsp/fsp0fsp.cc
Outdated
@param used_inode list of inode information used in | ||
system tablespace | ||
@return error code */ | ||
dberr_t fsp_get_sys_used_segment(inodeInfo *usedInodes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The @param
does not match the parameter name.
17196d1
to
9d9f286
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic looks good to me. My comments are mainly about coding style or API efficiency.
deecdf2
to
007226c
Compare
007226c
to
364c13f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this looks mostly OK. I have a few nitpicks.
storage/innobase/fsp/fsp0fsp.cc
Outdated
mtr->memset(block, | ||
TRX_RSEG + TRX_RSEG_UNDO_SLOTS, | ||
TRX_RSEG_N_SLOTS * TRX_RSEG_SLOT_SIZE, 0xff); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment that says which existing function this is mimicing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is wrong. we shouldn't reset the rseg undo slots. Because undo object has already constructed
before this function. If we do that, we need to remove the respective undo objects. I think that is un-necessary.
storage/innobase/fsp/fsp0fsp.cc
Outdated
buf_block_t *block= buf_page_get_gen(rseg->page_id(), 0, | ||
RW_X_LATCH, nullptr, | ||
BUF_GET, mtr, &err); | ||
if (!block) | ||
return err; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we latch all these pages before making any changes? In that way, we would not do anything irreversible before trying to make sure that the operation can succeed?
d0a5743
to
7256097
Compare
storage/innobase/fsp/fsp0fsp.cc
Outdated
if (m_old_inode_entry[page_no]) | ||
return DB_SUCCESS; | ||
dberr_t err= DB_SUCCESS; | ||
buf_block_t *block= | ||
buf_page_get_gen(page_id_t{0, page_no}, 0, RW_SX_LATCH, | ||
nullptr, BUF_GET, mtr, &err); | ||
if (!block) | ||
return err; | ||
buf_block_t *old= buf_LRU_get_free_block(have_no_mutex_soft); | ||
if (!old) return DB_OUT_OF_MEMORY; | ||
|
||
memcpy_aligned<UNIV_PAGE_SIZE_MIN>( | ||
old->page.frame, block->page.frame, srv_page_size); | ||
m_old_inode_entry[page_no]= old; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the start of this function, the operator[]
call already inserted a nullptr
to the std::map
. We could have saved a reference to that, e.g.,
buf_block_t *&old= m_old_inode_entry[page_no];
if (old)
return DB_SUCCESS;
//…
old= buf_LRU_get_free_block(have_no_mutex_soft);
return old ? DB_OUT_OF_MEMORY : DB_SUCCESS;
However, I do not quite understand the need for storing a copy of the page. Logic like this would make sense when we are shrinking the system tablespace and we expect to free most pages, rewriting only a few surviving pages of the system tablespace. In that case we wanted to minimize the amount of redo log written, so that everything can be done in a single atomic mini-transaction, with at most 2 MiB of log records.
Here we are handling a special case and we are allowed to use as many mini-transactions as it takes.
Can we structure the mini-transactions in such a way that we free only one leaked page per mini-transaction, after performing all necessary consistency checks? In that way, there should be no need to allocate any shadow pages. Yes, we may be modifying the same pages over and over again, but the buffer pool will cache the write-backs. Yes, we may write a lot of log records, but this is a very rarely executed clean-up of corrupted data, and it will not be executed by default.
storage/innobase/fsp/fsp0fsp.cc
Outdated
err= fseg_free_extent(inode, iblock, space, | ||
xdes_get_offset(descr), &mtr); | ||
DBUG_EXECUTE_IF("unused_undo_free_fail_4", | ||
err= DB_CORRUPTION;); | ||
if (err) | ||
goto err_exit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we split fseg_free_extent()
to two separate: checking the consistency, and modifying the data? We should first only check the consistency.
storage/innobase/fsp/fsp0fsp.cc
Outdated
for (uint32_t i= 0; i < FSEG_FRAG_ARR_N_SLOTS; i++) | ||
{ | ||
uint32_t page_no = fseg_get_nth_frag_page_no(inode, i); | ||
if (page_no != FIL_NULL) | ||
{ | ||
err= fseg_free_page_low(inode, iblock, space, page_no, &mtr); | ||
DBUG_EXECUTE_IF("unused_undo_free_fail_5", | ||
err= DB_CORRUPTION;); | ||
if (err) | ||
goto err_exit; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we first execute a loop that checks that all pages that we are going to free have been marked as allocated? And only after all consistency checks have passed, execute the modification?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a simpler approach:
- We would execute this extra freeing of leaked pages only on startup (not on shutdown) if the
:autoshrink
attribute is set on the system tablespace. - We will structure the mini-transactions normally, with no special low-level error handling.
- If we notice any error, then we will:
- display an error message saying that
:autoshrink
is being ignored due to an error - invoke
log_buffer_flush_to_disk()
- execute InnoDB shutdown as if
innodb_fast_shutdown=2
had been set - pretend that
:autoshrink
was not set - restart InnoDB
- display an error message saying that
In this way, all changes right before the current mini-transaction will be recovered, and the server should start up without corrupting a corrupted system tablespace too much further.
storage/innobase/fsp/fsp0fsp.cc
Outdated
/* There is a possiblity that the inode page can move from | ||
FSP_SEG_INODES_FULL to FSP_SEG_INODES_FREE and remove it | ||
from FSEG_SEG_INODES_FREE. So better to store the prev | ||
and next inode pages */ | ||
const fil_addr_t prev_addr= | ||
flst_get_prev_addr(iblock->page.frame + FSEG_INODE_PAGE_NODE); | ||
if (prev_addr.page != FIL_NULL) | ||
old_inodes_entry.insert(prev_addr.page, &mtr); | ||
|
||
const fil_addr_t next_addr= | ||
flst_get_next_addr(iblock->page.frame + FSEG_INODE_PAGE_NODE); | ||
|
||
if (next_addr.page != FIL_NULL) | ||
old_inodes_entry.insert(next_addr.page, &mtr); | ||
|
||
fsp_free_seg_inode(space, inode, iblock, &mtr); | ||
mtr.commit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not quite understand the purpose of the old_insert_entry.insert()
calls here. Are they really needed? If they are, why are we ignoring the return value?
7256097
to
0c632c1
Compare
12a175a
to
da758da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really close to completion. I did not debug the test case yet, but will do so on my final review.
c282781
to
536b46c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting better, but some error handling could improved further.
e7dd44d
to
cb24698
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling is almost there. I’d like to see some clarifying source code comments still.
storage/innobase/fsp/fsp0fsp.cc
Outdated
) != DB_SUCCESS_LOCKED_REC) { | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could replace if (fseg_free_step_low(…) != DB_SUCCESS_LOCKED_REC)
with
return fseg_free_step_low(…) != DB_SUCCESS_LOCKED_REC;
But, is that what we really want here? Wouldn’t we want == DB_SUCCESS
instead? That is, our caller should keep freeing as long there was no corruption and we have not freed everything yet? Our callers are btr_free_but_not_root()
, which is followed by btr_free_root()
, and trx_purge_free_segment()
. I’d suggest to carefully review the logic in both. A source code comment could be useful here.
On a related note, I see that fseg_inode_free()
is checking for DB_SUCCESS_LOCKED_REC
. It seems to be the right course of action there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the fseg_free_step_low()
returns any error code other than DB_SUCCESS_LOCKED_REC
then InnoDB
faced corruption or freed the segment successfully. So I think this condition is correct.
cb24698
to
c1fc294
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the test execution in a bit more detail. I think that we need to cover the existence of XA PREPARE
transactions and improve some error handling and code coverage.
0475664
to
278a508
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this still needs some minor refinement.
storage/innobase/trx/trx0rseg.cc
Outdated
undo= UT_LIST_GET_FIRST(rseg.undo_cached); | ||
while (undo) | ||
{ | ||
if (undo->state == TRX_UNDO_PREPARED) | ||
return true; | ||
undo= UT_LIST_GET_NEXT(undo_list, undo); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can undo_cached
possibly contain any XA PREPARE
transaction, or when can we have anything else than undo->state == TRX_UNDO_CACHED
? As far as I can tell, the elements in undo_cached
must have undo->size==1
and must carry the state TRX_UNDO_CACHED
at offset TRX_UNDO_SEG_HDR + TRX_UNDO_STATE
of that single page.
Can you replace the condition with undo->state != TRX_UNDO_CACHED
and a debug assertion that would catch this failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was wrong. undo_cached
must carry the state TRX_UNDO_CACHED
and state has been changed during trx_purge_add_undo_to_history()
. So we should check only for rseg.undo_list
…ns MDEV-30671 leaked undo pages - InnoDB fails to shrink the system tablespace when it contains the leaked undo log pages caused by MDEV-30671. - InnoDB does free the unused segment in system tablespace before shrinking the tablespace. InnoDB fails to free the unused segment if XA PREPARE transaction exist or if the previous shutdown was not with innodb_fast_shutdown=0 inode_info: Structure to store the inode page and offsets. fil_space_t::garbage_collect(): Frees the system tablespace unused segment fsp_get_sys_used_segment(): Iterates through all default file segment and index segment present in system tablespace. trx_sys_t::is_xa_exist(): Returns true if the XA transaction exist in the undo logs fseg_inode_free(): Frees the extents, fragment pages for the given index node and ignores any error similar to trx_purge_free_segment() trx_sys_t::reset_page(): Retain the TRX_SYS_FSEG_HEADER value in trx_sys page while resetting the page.
278a508
to
4a1ded6
Compare
Description
InnoDB fails to shrink the system tablespace when it contains
the leaked undo log pages caused by MDEV-30671.
InnoDB does free the unused segment in system tablespace
before shrinking the tablespace.
inode_info: Structure to store the inode page and offsets.
fil_space_t::garbage_collect():
Frees the system tablespace unused segment
fsp_free_unused_seg(): Frees the unused segment
fsp_get_sys_used_segment(): Iterates through all default
file segment and index segment present in system tablespace.
How can this PR be tested?
Added the test case innodb.undo_leak which basically leaks the undo segment by purge thread.
Basing the PR against the correct MariaDB version
PR quality check