Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAM debug take 2 #9907

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

DRAM debug take 2 #9907

wants to merge 2 commits into from

Conversation

lyakh
Copy link
Collaborator

@lyakh lyakh commented Mar 18, 2025

Additional DRAM / cold code debugging. This allows to designate and verify and run-time interval as performance-critical and banned for DRAM code. ATM this includes #9850 , will remove once merged

@lyakh lyakh force-pushed the dram-dbg2 branch 3 times, most recently from 8b5e30c to 6c3faba Compare March 19, 2025 11:52
@lyakh
Copy link
Collaborator Author

lyakh commented Mar 19, 2025

empty PTL. retest

@lyakh
Copy link
Collaborator Author

lyakh commented Mar 19, 2025

SOFCI TEST

@lyakh
Copy link
Collaborator Author

lyakh commented Mar 19, 2025

...but in fact the important for DRAM testing platform is currently MTL, because it has CONFIG_COLD_STORE_EXECUTE_DRAM=y in its default configuration, and https://sof-ci.01.org/sofpr/PR9907/build11620/devicetest/index.html was clean (apart from HDA not running)

@lyakh
Copy link
Collaborator Author

lyakh commented Mar 20, 2025

CI: strange pause-release failures on

Cannot reproduce locally so far. Since there's now a conflict, I'll rebase and see if it re-occurs...

Adding all source files in a single, giant zephyr/CMakeLists.txt is
inconvenient and does not scale.

Link: thesofproject#8260
Signed-off-by: Guennadi Liakhovetski <[email protected]>
Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we need to go down to this level when we have the performance testing and the cold assert check now ?

endif()
add_subdirectory(tester)

is_zephyr(it_is)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @kv2019i has improved this check now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could stay just with assert_can_be_cold to not overcomplicate dram things

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgirdwood don't think so, we plan to do this globally but I don't think this has been changed yet, I still see this in "main"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could stay just with assert_can_be_cold to not overcomplicate dram things

@abonislawski that part doesn't change. That's remains the only check that we add to __cold functions. But inside that function it now performs 2 checks: (1) for LL-context, and (2) for any "critical path" violations

@lyakh
Copy link
Collaborator Author

lyakh commented Mar 20, 2025

Not sure we need to go down to this level when we have the performance testing and the cold assert check now ?

@lgirdwood it is very easy to miss functions that can be called on hot paths and assign them as "cold." Without this debugging I'd have missed 10 or more of them. Particularly those, called during trigger IPC handling. And we don't have automated performance checking yet either. So I think we need this. Maybe we could enable it selectively. E.g. on MTL and LNL only? I've run both playback and capture tests on MTL nocodec on various interfaces over 13 seconds. The difference between the first (after timer sync) and the last firmware message with debugging was in most cases 0-4ms longer. Only in one case it was 14ms longer. LL measurements were on average 10us longer. So, I think it should be ok to enable this debugging for all.

@lyakh
Copy link
Collaborator Author

lyakh commented Mar 21, 2025

CI:

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly ok, a couple of note/questions inline.

@@ -242,6 +244,8 @@ enum task_state ipc_platform_do_cmd(struct ipc *ipc)
#endif /* CONFIG_PM */
}

mem_hot_path_stop_watching();

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we power down on L243 and we have context save enabled in the build? Wouldn't that mess the tracker?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kv2019i not sure why it should confuse it? The tracker is just a couple of simple variables, if the context is saved, then those variables are saved too. But in fact if any of those power on / off functions are cold, then it might trigger a bug. Let me reduce the watched scope

@@ -477,6 +477,8 @@ int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *dela
return IPC4_INVALID_REQUEST;
}

mem_hot_path_confirm();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be an excellent place to add a comment why this is considered a hot-path.

DRAM execution is banned in LL task context, but there are also time
intervals during which performance is important. Moreover, sometimes
such critical periods aren't certain until a later time. E.g. when
processing IPCs, only some of them are performance critical. And
those time-critical IPCs should be handled with maximum performance
from IPC-entry till IPC-exit, not only while handling that specific
IPC. Consider this pseudo-code:

void ipc_cmd()
{
	common_pre_processing();

	switch (cmd) {
	case NON_CRITICAL_CMD:
		handle_non_critical();
		break;
	case CRITICAL_CMD:
		handle_critical();
	}

	common_post_processing();
}

In this case ipc_cmd(), common_pre_processing(), handle_critical()
and common_post_processing() cannot be executed in DRAM, while
handle_non_critical() can be executed in DRAM. If we place
start-critical-debugging and stop-critical-debugging to cover all of
ipc_cmd(), we'll be forced to also place handle_non_critical() in
SRAM. OTOH if we only surround handle_critical() with those markers,
we will miss all the common code. To support such cases we use 3
markers:
mem_hot_path_start_watching()
mem_hot_path_stop_watching()
mem_hot_path_confirm()
Then we place start- and stop-watching in the beginning and end of
ipc_cmd(), and mem_hot_path_confirm() under the CRITICAL_CMD case.
Then if we made common_pre_processing() or common_post_processing()
__cold, the fact of them being called while watching will be recorded.
Then mem_hot_path_confirm() will set a confirmation flag. Then when
watching is stopped, we check if the flag was raised and a cold
function was called, in which case we issue an error.

Signed-off-by: Guennadi Liakhovetski <[email protected]>
@lyakh
Copy link
Collaborator Author

lyakh commented Mar 24, 2025

CI:

static const char *cold_path_fn;
static bool hot_path_confirmed;

void mem_cold_path_enter(const char *fn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need to put all debug logic behind a dbg_ API prefix and a Kconfig to enable and disable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgirdwood so far we don't seem to have such a consistent convention. If you grep the source tree for dbg_ you'll see a few internal uses and a few global macros in src/include/sof/debug/debug.h which don't seem to be used anywhere any more, so they can be removed. After that a consistent dbg_* namespace could be established. FWIW other files under src/debug/ don't use it either. One of the files there uses the debug_ prefix

@lyakh lyakh mentioned this pull request Mar 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants