90 changes: 90 additions & 0 deletions content/shmem_malloc_hints.tex
@@ -0,0 +1,90 @@

\apisummary{
Collective memory allocation routine with support for providing hints.
}

\begin{apidefinition}

\begin{Csynopsis}
void *@\FuncDecl{shmem\_malloc\_with\_hints}@(size_t size, long hints);
\end{Csynopsis}

\begin{apiarguments}
\apiargument{IN}{size}{The size, in bytes, of a block to be
allocated from the symmetric heap. This argument is of type \CTYPE{size\_t}.}
\apiargument{IN}{hints}{A bit array of hints provided by the user to the implementation.}
\end{apiarguments}


\apidescription{

The \FUNC{shmem\_malloc\_with\_hints} routine, like \FUNC{shmem\_malloc}, returns a pointer to a block of at least
\VAR{size} bytes, which shall be suitably aligned so that it may be
assigned to a pointer to any type of object. This space is allocated from
the symmetric heap (similar to \FUNC{shmem\_malloc}). When the \VAR{size} is zero,
the \FUNC{shmem\_malloc\_with\_hints} routine performs no action and returns a null pointer.

In addition to the \VAR{size} argument, the user provides the \VAR{hints} argument.
The \VAR{hints} argument describes the expected manner in which the \openshmem program may use the allocated memory.
The valid usage hints are described in Table~\ref{usagehints}. Multiple hints may be requested by combining them with a bitwise \CONST{OR} operation.
A value of zero may be given if no hints are requested.

The implementation uses the information provided by the \VAR{hints} argument to optimize for performance.
If the implementation cannot optimize, the behavior is the same as that of \FUNC{shmem\_malloc}.
If more than one hint is provided, the implementation will make a best effort to use one or more of the hints
to optimize performance.

Contributor

Must all PEs allocate from the same special memory? It's not clear if asymmetry can exist. Does this impose additional implicit synchronization for each subset configuration of hints if it cannot satisfy the entire hint list?

Also, what happens if you OR SHMEM_HINT_NONE with other hint behaviors?

Collaborator Author

Good question.

All use cases that I have thought of require the memory to be symmetric (same kind of memory).

Regarding extra synchronization, it depends on the implementation. If the implementation maintains asymmetric memory sizes on the PEs (say each PE starts with a different amount of special memory), you might need the extra synchronization for agreement. Otherwise, I do not see a need. In a way, it is similar to current DRAM allocations. Also, for the implementation we explored, we did not need extra synchronization.

I’m reluctant to add such a constraint. Without such a constraint, implementations are free to explore either approach (symmetric or asymmetric).

Do you see value in specifying one way or the other?

Contributor

To clarify, what would using long hint = SHMEM_HINT_NONE | SHMEM_HINT_LOW_LAT_MEM | SHMEM_HINT_HIGH_BW_MEM do?

Dropping SHMEM_HINT_NONE, what if the platform could provide SHMEM_HINT_LOW_LAT_MEM or SHMEM_HINT_HIGH_BW_MEM but not both simultaneously? Will we see application code marked up like this because who doesn't want to use low latency and high bandwidth memory for their application? Does the "best effort" default to a platform-specific precedence? The only feedback is that an allocation succeeded or it did not.

Collaborator Author

> To clarify, what would using long hint = SHMEM_HINT_NONE | SHMEM_HINT_LOW_LAT_MEM | SHMEM_HINT_HIGH_BW_MEM do?

Though this is legal usage, it does not make sense. Implementations are allowed to default to shmem_malloc in this case.

> Dropping SHMEM_HINT_NONE, what if the platform could provide SHMEM_HINT_LOW_LAT_MEM or SHMEM_HINT_HIGH_BW_MEM but not both simultaneously? Will we see application code marked up like this because who doesn't want to use low latency and high bandwidth memory for their application? Does the "best effort" default to a platform-specific precedence? The only feedback is that an allocation succeeded or it did not.

“If more than one hint is provided, the implementation will make the best effort to use one or more hints to optimize performance.”

My intention with this statement was to give implementations the flexibility to optimize as they wish when the user provides multiple hints. Obviously, some combinations of hints might not make sense. In such cases, if an implementation wants to give one hint precedence over others, the proposal allows it. Assigning priorities to hints is one way to implement it, but not the only way.
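
As an illustration of the precedence approach mentioned above, here is a minimal single-process sketch in C. Everything in it is an assumption made for illustration: the bit values of the constants, the helper name `resolve_hints`, and the particular precedence order are hypothetical and are not defined by the proposal. It only shows that a zero "no hints" value is an identity under bitwise OR, and how an implementation could pick one hint when it cannot honor all of them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; the proposal does not fix them. */
#define HINT_NONE         0L
#define HINT_LOW_LAT_MEM  (1L << 0)
#define HINT_HIGH_BW_MEM  (1L << 1)

/*
 * Hypothetical helper: choose at most one hint to honor.
 * `requested` is the mask passed by the user, `supported` is the set of
 * hints this platform can satisfy. The precedence order below is an
 * arbitrary example; an implementation may choose any other policy.
 * Returns HINT_NONE when nothing usable was requested, which models
 * falling back to plain shmem_malloc behavior.
 */
static long resolve_hints(long requested, long supported)
{
    static const long precedence[] = { HINT_LOW_LAT_MEM, HINT_HIGH_BW_MEM };
    long usable = requested & supported;

    /* Hints must occupy distinct bits to be combinable with OR. */
    assert((HINT_LOW_LAT_MEM & HINT_HIGH_BW_MEM) == 0);

    for (size_t i = 0; i < sizeof precedence / sizeof precedence[0]; i++) {
        if (usable & precedence[i])
            return precedence[i];
    }
    return HINT_NONE;
}
```

With these assumed values, `HINT_NONE | HINT_LOW_LAT_MEM` is the same mask as `HINT_LOW_LAT_MEM`, and a platform that supports only `HINT_HIGH_BW_MEM` would resolve a request for both hints to `HINT_HIGH_BW_MEM`. The caller still only learns whether the allocation succeeded, which is the concern raised in the question.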

The \FUNC{shmem\_malloc\_with\_hints} routine is provided so that multiple \acp{PE} in a program can allocate symmetric,
remotely accessible memory blocks. When no action is performed, this
routine returns without performing a barrier. Otherwise, the routine will call a barrier on exit.
Contributor

> When no action is performed, these routines return without performing a barrier.

What does this mean? That the function returns NULL for all PEs, no memory has been allocated, and no implicit barrier has occurred? Some applications may use shmem_malloc and friends as implicit barriers. This proposal is for an optimization which seems to be a drop in replacement for vanilla shmem_malloc, but dropping the implicit barrier on function exit would change the behavior of the application.

Collaborator Author

When no allocation is done, “dropping the implicit barrier” is the behavior we have for shmem_malloc in OpenSHMEM 1.4; please refer to page 26, lines 40-41. The proposal aims to maintain the same behavior for that case.

Contributor

Oops. In OpenSHMEM 1.4, that was not the case. This behavior was changed between 1.4 and now in #201.

Collaborator Author

Ah… Thanks for correcting that. It has been so long since we debated this that I did not realize it is relatively new. I was looking at my git copy. :)

Can we consider this issue resolved?
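
For readers following this thread, the control flow that the proposed text describes (no barrier when nothing is allocated, a barrier on exit otherwise) can be modeled in a single process. This is only a sketch under stated assumptions: `malloc_with_hints_model` and `barrier_stub` are hypothetical names, the C library `malloc` stands in for the symmetric heap, and a counter stands in for the collective barrier. It is not how any OpenSHMEM implementation is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Counts calls to the stand-in barrier so the control flow is observable. */
static int barrier_calls = 0;

/* Hypothetical stand-in for the implicit barrier; a real library would
 * synchronize all PEs here. */
static void barrier_stub(void)
{
    barrier_calls++;
}

/*
 * Single-process model of the described behavior:
 *  - size == 0: perform no action, return a null pointer, no barrier;
 *  - otherwise: allocate, then run the barrier on exit.
 * Hints are ignored, which models an implementation that cannot optimize
 * and therefore behaves like shmem_malloc.
 */
static void *malloc_with_hints_model(size_t size, long hints)
{
    (void)hints;

    if (size == 0)
        return NULL;

    void *p = malloc(size);  /* stand-in for a symmetric-heap allocation */
    barrier_stub();          /* barrier on exit, even if allocation failed */
    return p;
}
```

In this model a zero-size call leaves `barrier_calls` unchanged, which is exactly the case an application relying on the allocation routine as an implicit barrier would need to watch for.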

This ensures that all \acp{PE} participate in the memory allocation, and that the memory on other
\acp{PE} can be used as soon as the local \ac{PE} returns. The implicit barrier performed by this routine will quiet the
default context. It is the user's responsibility to ensure that no communication operations involving the given memory block are pending on
other contexts prior to calling the \FUNC{shmem\_free} or \FUNC{shmem\_realloc} routines.
The user is also responsible for calling this routine with identical arguments on all
\acp{PE}; if differing \VAR{size} or \VAR{hints} arguments are used, the behavior of the call
and any subsequent \openshmem calls is undefined.
}

\apireturnvalues{
The \FUNC{shmem\_malloc\_with\_hints} routine returns a pointer to the allocated space if the allocation succeeds;
otherwise, it returns a null pointer.
}

\begin{longtable}{|p{0.45\textwidth}|p{0.5\textwidth}|}
\hline
\textbf{Hints} & \textbf{Usage hint}
\tabularnewline \hline
\endhead
%%
\newline
\CONST{0} &
\newline
Behavior same as \FUNC{shmem\_malloc}
\tabularnewline \hline


\LibConstDecl{SHMEM\_HINT\_ATOMICS\_REMOTE} &
\newline
Memory used for \VAR{atomic} operations
\tabularnewline \hline

\LibConstDecl{SHMEM\_HINT\_SIGNAL} &
\newline
Memory used for \VAR{signal} operations
Contributor

What happens when the flag was not used, but a signal operation was used?

Collaborator Author

Hmm .. Which flag are you referring to, can you please elaborate?

Contributor

Above SHMEM_HINT_SIGNAL

\tabularnewline \hline

\TableCaptionRef{Memory usage hints}
\label{usagehints}
\end{longtable}
Contributor

We should also state that neither alignment requirements nor memory properties, such as cache line size, are impacted by the hint. We shall also state that the memory semantics for local access (load and store) are not impacted by the hint.

Collaborator Author

AFAIU, for the shmem_malloc routine the implementation is free to allocate memory that is either cache-aligned or not. One of the constraints is that it should be word-aligned. Similarly, the memory access model (which is yet to be defined or clarified, see #229) will provide certain access guarantees for the memory allocated by shmem_malloc. In both cases, the proposal intends to follow the semantics of shmem_malloc.

Contributor

The memory eventually gets mapped into the core. The core architecture defines multiple ways in which it can be mapped. Each of the mappings has its own semantics and constraints. For example, the semantics of normal cacheable (write-back) and non-cacheable (WC) memory are very different. Since the user has direct access through the pointer, code that worked on one machine will break on another with an exception or even data corruption.

Contributor

In order to support something like this, you would have to remove shmem_ptr and prohibit any direct access to the memory through load and store semantics. Next, you would have to introduce a shmem_memcpy function to copy data in and out of a region allocated with shmem_malloc.


\apinotes{
\openshmem programs should allocate memory with
\CONST{SHMEM\_HINT\_ATOMICS\_REMOTE} when the majority of
operations performed on this memory are atomic operations, and the origin
and target \acp{PE} of the atomic operations do not share a memory domain,
i.e., symmetric objects on the target \ac{PE} are not accessible using
load/store operations from the origin \ac{PE} or vice versa.
Contributor

Minor: I would consider moving the text under SHMEM_HINT_ATOMICS_REMOTE since this is critical information. BTW, the same is probably true for remote SHMEM_HINT_SIGNAL vs. local SHMEM_HINT_SIGNAL? @naveen-rn

Contributor

If I understand the definition of remote vs. local with respect to this change, I'm not sure whether there is a need to differentiate the local side of the signal. All signals from the source PE are passed by value to the remote signal address buffer. A buffer created with SHMEM_HINT_SIGNAL will mostly be updated by some remote PE.

void shmem_put_signal(shmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems,
uint64_t *sig_addr, uint64_t signal, int sig_op, int pe);

}
\end{apidefinition}
\newpage
4 changes: 4 additions & 0 deletions main_spec.tex
@@ -103,6 +103,10 @@ \subsection{Memory Management Routines}
\subsubsection{\textbf{SHMEM\_MALLOC, SHMEM\_FREE, SHMEM\_REALLOC, SHMEM\_ALIGN}}\label{subsec:shfree}
\input{content/shmem_malloc.tex}

\newpage
\subsubsection{\textbf{SHMEM\_MALLOC\_WITH\_HINTS}}\label{subsec:shmmallochint}
\input{content/shmem_malloc_hints.tex}

\subsubsection{\textbf{SHMEM\_CALLOC}}\label{subsec:shmem_calloc}
\input{content/shmem_calloc.tex}
