-
Notifications
You must be signed in to change notification settings - Fork 330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRAM debug take 2 #9907
base: main
Are you sure you want to change the base?
DRAM debug take 2 #9907
Conversation
8b5e30c
to
6c3faba
Compare
empty PTL. retest |
SOFCI TEST |
...but in fact the important for DRAM testing platform is currently MTL, because it has |
CI: strange pause-release failures on
Cannot reproduce locally so far. Since there's now a conflict, I'll rebase and see if it re-occurs... |
Adding all source files in a single, giant zephyr/CMakeLists.txt is inconvenient and does not scale. Link: thesofproject#8260 Signed-off-by: Guennadi Liakhovetski <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we need to go down to this level when we have the performance testing and the cold assert check now ?
endif() | ||
add_subdirectory(tester) | ||
|
||
is_zephyr(it_is) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @kv2019i has improved this check now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could stay just with assert_can_be_cold to not overcomplicate dram things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood don't think so, we plan to do this globally but I don't think this has been changed yet, I still see this in "main"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could stay just with assert_can_be_cold to not overcomplicate dram things
@abonislawski that part doesn't change. That's remains the only check that we add to __cold
functions. But inside that function it now performs 2 checks: (1) for LL-context, and (2) for any "critical path" violations
@lgirdwood it is very easy to miss functions that can be called on hot paths and assign them as "cold." Without this debugging I'd have missed 10 or more of them. Particularly those, called during trigger IPC handling. And we don't have automated performance checking yet either. So I think we need this. Maybe we could enable it selectively. E.g. on MTL and LNL only? I've run both playback and capture tests on MTL nocodec on various interfaces over 13 seconds. The difference between the first (after timer sync) and the last firmware message with debugging was in most cases 0-4ms longer. Only in one case it was 14ms longer. LL measurements were on average 10us longer. So, I think it should be ok to enable this debugging for all. |
CI:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly ok, a couple of note/questions inline.
src/ipc/ipc-zephyr.c
Outdated
@@ -242,6 +244,8 @@ enum task_state ipc_platform_do_cmd(struct ipc *ipc) | |||
#endif /* CONFIG_PM */ | |||
} | |||
|
|||
mem_hot_path_stop_watching(); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we power down on L243 and we have context save enabled in the build? Wouldn't that mess the tracker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kv2019i not sure why it should confuse it? The tracker is just a couple of simple variables, if the context is saved, then those variables are saved too. But in fact if any of those power on / off functions are cold, then it might trigger a bug. Let me reduce the watched scope
@@ -477,6 +477,8 @@ int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *dela | |||
return IPC4_INVALID_REQUEST; | |||
} | |||
|
|||
mem_hot_path_confirm(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be an excellent place to add a comment why this is considered a hot-path.
DRAM execution is banned in LL task context, but there are also time intervals during which performance is important. Moreover, sometimes such critical periods aren't certain until a later time. E.g. when processing IPCs, only some of them are performance critical. And those time-critical IPCs should be handled with maximum performance from IPC-entry till IPC-exit, not only while handling that specific IPC. Consider this pseudo-code: void ipc_cmd() { common_pre_processing(); switch (cmd) { case NON_CRITICAL_CMD: handle_non_critical(); break; case CRITICAL_CMD: handle_critical(); } common_post_processing(); } In this case ipc_cmd(), common_pre_processing(), handle_critical() and common_post_processing() cannot be executed in DRAM, while handle_non_critical() can be executed in DRAM. If we place start-critical-debugging and stop-critical-debugging to cover all of ipc_cmd(), we'll be forced to also place handle_non_critical() in SRAM. OTOH if we only surround handle_critical() with those markers, we will miss all the common code. To support such cases we use 3 markers: mem_hot_path_start_watching() mem_hot_path_stop_watching() mem_hot_path_confirm() Then we place start- and stop-watching in the beginning and end of ipc_cmd(), and mem_hot_path_confirm() under the CRITICAL_CMD case. Then if we made common_pre_processing() or common_post_processing() __cold, the fact of them being called while watching will be recorded. Then mem_hot_path_confirm() will set a confirmation flag. Then when watching is stopped, we check if the flag was raised and a cold function was called, in which case we issue an error. Signed-off-by: Guennadi Liakhovetski <[email protected]>
CI:
|
static const char *cold_path_fn; | ||
static bool hot_path_confirmed; | ||
|
||
void mem_cold_path_enter(const char *fn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We really need to put all debug logic behind a dbg_
API prefix and a Kconfig to enable and disable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood so far we don't seem to have such a consistent convention. If you grep the source tree for dbg_
you'll see a few internal uses and a few global macros in src/include/sof/debug/debug.h which don't seem to be used anywhere any more, so they can be removed. After that a consistent dbg_*
namespace could be established. FWIW other files under src/debug/ don't use it either. One of the files there uses the debug_
prefix
Additional DRAM / cold code debugging. This allows to designate and verify and run-time interval as performance-critical and banned for DRAM code. ATM this includes #9850 , will remove once merged