%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chapter: Job Allocation Management
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Job Management and Reporting}
\label{chap:api_job_mgmt}
The job management \acp{API} provide an application with the ability to orchestrate its operation in partnership with the \ac{SMS}.
Members of this category include the \refapi{PMIx_Allocation_request}, \refapi{PMIx_Job_control}, and \refapi{PMIx_Process_monitor} \acp{API}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Allocation Requests}
\label{chap:api_job_mgmt:alloc}
This section defines functionality to request new allocations from the \ac{RM}, and request modifications to existing allocations.
These are primarily used in the following scenarios:
\begin{itemize}
\item \textit{Evolving} applications that dynamically request and return resources as they execute.
\item \textit{Malleable} environments where the scheduler redirects resources away from executing applications for higher priority jobs or load balancing.
\item \textit{Resilient} applications that need to request replacement resources in the face of failures.
\item \textit{Rigid} jobs where the user has requested a static allocation of resources for a fixed period of time, but realizes that they underestimated their required time while executing.
\end{itemize}
\ac{PMIx} attempts to address this range of use-cases with a flexible \ac{API}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\code{PMIx_Allocation_request}}
\declareapi{PMIx_Allocation_request}
%%%%
\summary
Request an allocation operation from the host resource manager.
%%%%
\format
\copySignature{PMIx_Allocation_request}{3.0}{
pmix_status_t \\
PMIx_Allocation_request(pmix_alloc_directive_t directive, \\
\hspace*{24\sigspace}pmix_info_t info[], size_t ninfo, \\
\hspace*{24\sigspace}pmix_info_t *results[], size_t *nresults);
}
\begin{arglist}
\argin{directive}{Allocation directive (\refstruct{pmix_alloc_directive_t})}
\argin{info}{Array of \refstruct{pmix_info_t} structures (array of handles)}
\argin{ninfo}{Number of elements in the \refarg{info} array (integer)}
\arginout{results}{Address where a pointer to an array of \refstruct{pmix_info_t} containing the results of the request can be returned (memory reference)}
\arginout{nresults}{Address where the number of elements in \refarg{results} can be returned (handle)}
\end{arglist}
Returns one of the following:
\begin{itemize}
\item \refconst{PMIX_SUCCESS}, indicating that the request was processed and returned \textit{success}
\item a \ac{PMIx} error constant indicating either an error in the input or that the request was refused
\end{itemize}
\reqattrstart
\ac{PMIx} libraries are not required to directly support any attributes for this function. However, any provided attributes must be passed to the host \ac{SMS} daemon for processing, and the \ac{PMIx} library is \textit{required} to add the \refAttributeItem{PMIX_USERID} and the \refAttributeItem{PMIX_GRPID} attributes of the client process making the request.
Host environments that implement support for this operation are required to support the following attributes:
\pasteAttributeItem{PMIX_ALLOC_REQ_ID}
\pasteAttributeItem{PMIX_ALLOC_NUM_NODES}
\pasteAttributeItem{PMIX_ALLOC_NUM_CPUS}
\pasteAttributeItem{PMIX_ALLOC_TIME}
\reqattrend
\optattrstart
The following attributes are optional for host environments that support this operation:
\pasteAttributeItem{PMIX_ALLOC_NODE_LIST}
\pasteAttributeItem{PMIX_ALLOC_NUM_CPU_LIST}
\pasteAttributeItem{PMIX_ALLOC_CPU_LIST}
\pasteAttributeItem{PMIX_ALLOC_MEM_SIZE}
\pasteAttributeItem{PMIX_ALLOC_FABRIC}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_ID}
\pasteAttributeItem{PMIX_ALLOC_BANDWIDTH}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_QOS}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_TYPE}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_PLANE}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_ENDPTS}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_ENDPTS_NODE}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_SEC_KEY}
\optattrend
%%%%
\descr
Request an allocation operation from the host resource manager.
Several broad categories are envisioned, including the ability to:
\begin{compactitem}
%
\item Request allocation of additional resources, including memory, bandwidth, and compute.
This should be accomplished in a non-blocking manner so that the application can continue to progress while waiting for resources to become available.
Note that the new allocation will be disjoint from (i.e., not affiliated with) the allocation of the requester; thus, the termination of one allocation will not impact the other.
%
\item Extend the reservation on currently allocated resources, subject to scheduling availability and priorities.
This includes extending the time limit on current resources, and/or requesting additional resources be allocated to the requesting job.
Any additional allocated resources will be considered as part of the current allocation, and thus will be released at the same time.
%
\item Return no-longer-required resources to the scheduler.
This includes the ``loan'' of resources back to the scheduler with a promise to return them upon subsequent request.
\end{compactitem}
If successful, the returned results for a request for additional resources must include the host resource manager's identifier (\refattr{PMIX_ALLOC_ID}) that the requester can use to specify the resources in, for example, a call to \refapi{PMIx_Spawn}.
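The following is a minimal sketch, not taken from any \ac{PMIx} implementation, of how a caller might assemble the four required attributes and then look up the allocation identifier in the returned results. The \code{demo_info_t} type and the \code{demo_} functions are hypothetical stand-ins for \code{pmix_info_t} and the library's own helpers so that the fragment is self-contained; only the key strings and value types are taken from the attribute definitions in this chapter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for pmix_info_t. A real program would include
 * <pmix.h> and work with genuine pmix_info_t entries instead. */
typedef enum { DEMO_STRING, DEMO_UINT64, DEMO_UINT32 } demo_type_t;

typedef struct {
    const char *key;
    demo_type_t type;
    union {
        const char *str;
        uint64_t u64;
        uint32_t u32;
    } value;
} demo_info_t;

/* Fill the four attributes that a supporting host environment is required
 * to understand. Returns the number of entries written (always 4). */
size_t demo_build_alloc_request(demo_info_t info[4], const char *req_id,
                                uint64_t nnodes, uint64_t ncpus,
                                uint32_t seconds)
{
    assert(info != NULL && req_id != NULL);

    info[0].key = "pmix.alloc.reqid";   /* PMIX_ALLOC_REQ_ID, char* */
    info[0].type = DEMO_STRING;
    info[0].value.str = req_id;

    info[1].key = "pmix.alloc.nnodes";  /* PMIX_ALLOC_NUM_NODES, uint64_t */
    info[1].type = DEMO_UINT64;
    info[1].value.u64 = nnodes;

    info[2].key = "pmix.alloc.ncpus";   /* PMIX_ALLOC_NUM_CPUS, uint64_t */
    info[2].type = DEMO_UINT64;
    info[2].value.u64 = ncpus;

    info[3].key = "pmix.alloc.time";    /* PMIX_ALLOC_TIME, uint32_t seconds */
    info[3].type = DEMO_UINT32;
    info[3].value.u32 = seconds;

    return 4;
}

/* Search a results array for the host-provided allocation identifier
 * (PMIX_ALLOC_ID). Returns NULL if the host did not include one. */
const char *demo_find_alloc_id(const demo_info_t results[], size_t nresults)
{
    size_t i;

    for (i = 0; i < nresults; i++) {
        if (results[i].type == DEMO_STRING &&
            strcmp(results[i].key, "pmix.alloc.id") == 0) {
            return results[i].value.str;
        }
    }
    return NULL;
}
```

In a real caller the array built here would be passed as \refarg{info}, and the lookup would run over \refarg{results} once the call returns \refconst{PMIX_SUCCESS}.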
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\code{PMIx_Allocation_request_nb}}
\declareapi{PMIx_Allocation_request_nb}
%%%%
\summary
Request an allocation operation from the host resource manager.
%%%%
\format
\copySignature{PMIx_Allocation_request_nb}{2.0}{
pmix_status_t \\
PMIx_Allocation_request_nb(pmix_alloc_directive_t directive, \\
\hspace*{27\sigspace}pmix_info_t info[], size_t ninfo, \\
\hspace*{27\sigspace}pmix_info_cbfunc_t cbfunc, void *cbdata);
}
\begin{arglist}
\argin{directive}{Allocation directive (\refstruct{pmix_alloc_directive_t})}
\argin{info}{Array of \refstruct{pmix_info_t} structures (array of handles)}
\argin{ninfo}{Number of elements in the \refarg{info} array (integer)}
\argin{cbfunc}{Callback function \refapi{pmix_info_cbfunc_t} (function reference)}
\argin{cbdata}{Data to be passed to the callback function (memory reference)}
\end{arglist}
Returns one of the following:
\begin{itemize}
\item \refconst{PMIX_SUCCESS}, indicating that the request is being processed by the host environment - result will be returned in the provided \refarg{cbfunc}. Note that the library must not invoke the callback function prior to returning from the \ac{API}.
\item \refconst{PMIX_OPERATION_SUCCEEDED}, indicating that the request was immediately processed and returned \textit{success} - the \refarg{cbfunc} will \textit{not} be called
\item a \ac{PMIx} error constant indicating either an error in the input or that the request was immediately processed and failed - the \refarg{cbfunc} will \textit{not} be called
\end{itemize}
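The three return cases above decide whether the caller should wait for the callback. The sketch below illustrates that decision only; the \code{DEMO_} constants are hypothetical stand-ins for the \ac{PMIx} status codes, and their numeric values are placeholders rather than the values defined by any implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder status codes standing in for PMIX_SUCCESS,
 * PMIX_OPERATION_SUCCEEDED, and an arbitrary PMIx error constant.
 * The numeric values are assumptions made for this sketch. */
typedef int demo_status_t;
#define DEMO_SUCCESS              0
#define DEMO_OPERATION_SUCCEEDED  1
#define DEMO_ERR_BAD_PARAM        (-1)

typedef enum {
    DEMO_WAIT_FOR_CALLBACK, /* request in flight; result arrives in cbfunc */
    DEMO_DONE_NOW,          /* completed immediately; cbfunc is not called */
    DEMO_FAILED_NOW         /* bad input or immediate failure; no cbfunc   */
} demo_next_step_t;

/* Map the return value of a non-blocking request onto the caller's next step. */
demo_next_step_t demo_classify_nb_return(demo_status_t rc)
{
    if (rc == DEMO_SUCCESS) {
        return DEMO_WAIT_FOR_CALLBACK;
    }
    if (rc == DEMO_OPERATION_SUCCEEDED) {
        return DEMO_DONE_NOW;
    }
    return DEMO_FAILED_NOW;
}

/* Only the first case obliges the caller to keep cbdata alive for cbfunc. */
bool demo_callback_expected(demo_status_t rc)
{
    return demo_classify_nb_return(rc) == DEMO_WAIT_FOR_CALLBACK;
}
```

A caller that blocks on a condition variable until the callback fires should do so only when \code{demo_callback_expected} would return true; in the other two cases the wait would never be released.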
\reqattrstart
\ac{PMIx} libraries are not required to directly support any attributes for this function. However, any provided attributes must be passed to the host \ac{SMS} daemon for processing, and the \ac{PMIx} library is \textit{required} to add the \refAttributeItem{PMIX_USERID} and the \refAttributeItem{PMIX_GRPID} attributes of the client process making the request.
Host environments that implement support for this operation are required to support the following attributes:
\pasteAttributeItem{PMIX_ALLOC_REQ_ID}
\pasteAttributeItem{PMIX_ALLOC_NUM_NODES}
\pasteAttributeItem{PMIX_ALLOC_NUM_CPUS}
\pasteAttributeItem{PMIX_ALLOC_TIME}
\reqattrend
\optattrstart
The following attributes are optional for host environments that support this operation:
\pasteAttributeItem{PMIX_ALLOC_NODE_LIST}
\pasteAttributeItem{PMIX_ALLOC_NUM_CPU_LIST}
\pasteAttributeItem{PMIX_ALLOC_CPU_LIST}
\pasteAttributeItem{PMIX_ALLOC_MEM_SIZE}
\pasteAttributeItem{PMIX_ALLOC_FABRIC}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_ID}
\pasteAttributeItem{PMIX_ALLOC_BANDWIDTH}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_QOS}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_TYPE}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_PLANE}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_ENDPTS}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_ENDPTS_NODE}
\pasteAttributeItem{PMIX_ALLOC_FABRIC_SEC_KEY}
\optattrend
%%%%
\descr
Non-blocking form of the \refapi{PMIx_Allocation_request} \ac{API}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Job Allocation attributes}
\label{api:struct:attributes:joballoc}
Attributes used to describe the job allocation - these are values passed to and/or returned by the \refapi{PMIx_Allocation_request_nb} and \refapi{PMIx_Allocation_request} \acp{API} and are not accessed using the \refapi{PMIx_Get} \ac{API}.
%
\declareAttribute{PMIX_ALLOC_REQ_ID}{"pmix.alloc.reqid"}{char*}{
User-provided string identifier for this allocation request which can later be used to query status of the request.
}
%
\declareAttributeNEW{PMIX_ALLOC_ID}{"pmix.alloc.id"}{char*}{
A string identifier (provided by the host environment) for the resulting allocation which can later be used to reference the allocated resources in, for example, a call to \refapi{PMIx_Spawn}.
}
%
\declareAttributeNEW{PMIX_ALLOC_QUEUE}{"pmix.alloc.queue"}{char*}{
Name of the \ac{WLM} queue to which the allocation request is to be directed, or the queue being referenced in a query.
}
%
\declareAttribute{PMIX_ALLOC_NUM_NODES}{"pmix.alloc.nnodes"}{uint64_t}{
The number of nodes being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_NODE_LIST}{"pmix.alloc.nlist"}{char*}{
Regular expression of the specific nodes being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_NUM_CPUS}{"pmix.alloc.ncpus"}{uint64_t}{
Number of \acp{PU} being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_NUM_CPU_LIST}{"pmix.alloc.ncpulist"}{char*}{
Regular expression of the number of \acp{PU} for each node being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_CPU_LIST}{"pmix.alloc.cpulist"}{char*}{
Regular expression of the specific \acp{PU} being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_MEM_SIZE}{"pmix.alloc.msize"}{float}{
Number of Megabytes[base2] of memory (per process) being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_FABRIC}{"pmix.alloc.net"}{array}{
Array of \refstruct{pmix_info_t} describing requested fabric resources. This must include at least: \refattr{PMIX_ALLOC_FABRIC_ID}, \refattr{PMIX_ALLOC_FABRIC_TYPE}, and \refattr{PMIX_ALLOC_FABRIC_ENDPTS}, plus whatever other descriptors are desired.
}
%
\declareAttribute{PMIX_ALLOC_FABRIC_ID}{"pmix.alloc.netid"}{char*}{
The key to be used when accessing this requested fabric allocation. The fabric allocation will be returned/stored as a \refstruct{pmix_data_array_t} of \refstruct{pmix_info_t} whose first element is composed of this key and the allocated resource description.
The type of the included value depends upon the fabric support. For example, a \ac{TCP} allocation might consist of a comma-delimited string of socket ranges such as \code{"32000-32100,\allowbreak 33005,38123-38146"}. Additional array entries will consist of any provided resource request directives, along with their assigned values. Examples include: \refattr{PMIX_ALLOC_FABRIC_TYPE} - the type of resources provided; \refattr{PMIX_ALLOC_FABRIC_PLANE} - if applicable, what plane the resources were assigned from; \refattr{PMIX_ALLOC_FABRIC_QOS} - the assigned QoS; \refattr{PMIX_ALLOC_BANDWIDTH} - the allocated bandwidth; \refattr{PMIX_ALLOC_FABRIC_SEC_KEY} - a security key for the requested fabric allocation. NOTE: the array contents may differ from those requested, especially if \refconst{PMIX_INFO_REQD} was not set in the request.
}
%
\declareAttribute{PMIX_ALLOC_BANDWIDTH}{"pmix.alloc.bw"}{float}{
Fabric bandwidth (in Megabits[base2]/sec) for the job being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_FABRIC_QOS}{"pmix.alloc.netqos"}{char*}{
Fabric quality of service level for the job being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_TIME}{"pmix.alloc.time"}{uint32_t}{
Total session time (in seconds) being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_FABRIC_TYPE}{"pmix.alloc.nettype"}{char*}{
Type of desired transport (e.g., \var{``tcp''}, \var{``udp''}) being requested in an allocation request.
}
%
\declareAttribute{PMIX_ALLOC_FABRIC_PLANE}{"pmix.alloc.netplane"}{char*}{
ID string for the \refterm{fabric plane} to be used for the requested allocation.
}
%
\declareAttribute{PMIX_ALLOC_FABRIC_ENDPTS}{"pmix.alloc.endpts"}{size_t}{
Number of endpoints to allocate per \refterm{process} in the job.
}
%
\declareAttribute{PMIX_ALLOC_FABRIC_ENDPTS_NODE}{"pmix.alloc.endpts.nd"}{size_t}{
Number of endpoints to allocate per \refterm{node} for the job.
}
%
\declareAttribute{PMIX_ALLOC_FABRIC_SEC_KEY}{"pmix.alloc.nsec"}{pmix_byte_object_t}{
Request that the allocation include a fabric security key for the spawned job.
}
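The description of \refattr{PMIX_ALLOC_FABRIC_ID} gives a comma-delimited string of socket ranges as an example of a \ac{TCP} allocation. As an illustration only, the sketch below counts the ports described by such a string. The function name and the rejection rules (no signs, no whitespace, no trailing comma, ports no larger than 65535) are assumptions of this example and are not specified by the standard.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the ports in a comma-delimited list of single ports and inclusive
 * "low-high" ranges. Returns -1 if the string is malformed. */
long demo_count_ports(const char *spec)
{
    const char *p = spec;
    long total = 0;

    if (spec == NULL || *spec == '\0') {
        return -1;
    }
    while (*p != '\0') {
        char *end;
        long lo, hi;

        /* Require a digit here: strtol alone would accept signs and spaces. */
        if (*p < '0' || *p > '9') {
            return -1;
        }
        lo = strtol(p, &end, 10);
        hi = lo;
        if (*end == '-') {
            p = end + 1;
            if (*p < '0' || *p > '9') {
                return -1;
            }
            hi = strtol(p, &end, 10);
        }
        if (hi < lo || hi > 65535) {
            return -1;
        }
        total += hi - lo + 1;

        if (*end == ',') {
            p = end + 1;
            if (*p == '\0') {
                return -1;      /* trailing comma */
            }
        } else if (*end == '\0') {
            p = end;            /* reached the end of the list */
        } else {
            return -1;          /* unexpected character */
        }
    }
    return total;
}
```

For the example string \code{"32000-32100,33005,38123-38146"} the three entries contribute 101, 1, and 24 ports respectively, for a total of 126.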
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Job Allocation Directives}
\declarestruct{pmix_alloc_directive_t}
\versionMarker{2.0}
The \refstruct{pmix_alloc_directive_t} structure is a \code{uint8_t} type that defines the behavior of allocation requests.
The following constants can be used to set a variable of the type \refstruct{pmix_alloc_directive_t}. All definitions were introduced in version 2 of the standard unless otherwise marked.
\begin{constantdesc}
%
\declareconstitem{PMIX_ALLOC_NEW}
A new allocation is being requested.
The resulting allocation will be disjoint (i.e., not connected in a job sense) from the requesting allocation.
%
\declareconstitem{PMIX_ALLOC_EXTEND}
Extend the existing allocation, either in time or as additional resources.
%
\declareconstitem{PMIX_ALLOC_RELEASE}
Release part of the existing allocation.
Attributes in the accompanying \refstruct{pmix_info_t} array may be used to specify permanent release of the identified resources, or ``lending'' of those resources for some period of time.
%
\declareconstitem{PMIX_ALLOC_REAQUIRE}
Reacquire resources that were previously ``lent'' back to the scheduler.
%
\declareconstitem{PMIX_ALLOC_EXTERNAL}
A value boundary above which implementers are free to define their own directive values.
%
\end{constantdesc}
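As a sketch of how a host or tool might interpret a directive value, the fragment below maps each directive to a short description and tests whether a value lies in the implementer-defined range. The \code{DEMO_} constants mirror the directive names above, but their numeric values are placeholders chosen for this example; a real program would use the definitions from the \ac{PMIx} header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for pmix_alloc_directive_t, which is a uint8_t type.
 * The numeric values below are assumptions made for illustration. */
typedef uint8_t demo_alloc_directive_t;
#define DEMO_ALLOC_NEW       1
#define DEMO_ALLOC_EXTEND    2
#define DEMO_ALLOC_RELEASE   3
#define DEMO_ALLOC_REAQUIRE  4
#define DEMO_ALLOC_EXTERNAL  128

/* Values strictly above the boundary are free for implementers to define. */
bool demo_is_implementation_defined(demo_alloc_directive_t d)
{
    return d > DEMO_ALLOC_EXTERNAL;
}

/* Return a short description of the directive, or NULL if the value is
 * neither a standard directive nor in the implementer-defined range. */
const char *demo_describe_directive(demo_alloc_directive_t d)
{
    switch (d) {
    case DEMO_ALLOC_NEW:
        return "new allocation, disjoint from the requesting allocation";
    case DEMO_ALLOC_EXTEND:
        return "extend the existing allocation in time or resources";
    case DEMO_ALLOC_RELEASE:
        return "release or lend part of the existing allocation";
    case DEMO_ALLOC_REAQUIRE:
        return "reacquire resources previously lent to the scheduler";
    case DEMO_ALLOC_EXTERNAL:
        return "boundary value, not itself a directive";
    default:
        return demo_is_implementation_defined(d)
                   ? "implementation-defined directive"
                   : NULL;
    }
}
```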
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Job Control}
\label{chap:api_job_mgmt:jctrl}
This section defines \acp{API} that enable the application and host environment to coordinate the response to failures and other events.
This can include requesting termination of the entire job or a subset of processes within a job, but can
also be used in combination with other \ac{PMIx} capabilities (e.g., allocation support and event notification) for more nuanced responses. For example, an application notified of an incipient over-temperature condition on a node could use the \refapi{PMIx_Allocation_request_nb} interface to request replacement nodes while simultaneously using the \refapi{PMIx_Job_control_nb} interface to direct that a checkpoint event be delivered to all processes in the application. If replacement resources are not available, the application might use the \refapi{PMIx_Job_control_nb} interface to request that the job continue at a lower power setting, perhaps sufficient to avoid the over-temperature failure.
The job control \acp{API} can also be used by an application to register itself as available for preemption when operating in an environment such as a cloud or where incentives, financial or otherwise, are provided to jobs willing to be preempted. Registration can include attributes indicating how many resources are being offered for preemption (e.g., all or only some portion), whether the application will require time to prepare for preemption, etc. Jobs that
request a warning will receive an event notifying them of an impending preemption (possibly including information as to the resources that will be taken away, how much time the application will be given prior to being preempted, whether the preemption will be a suspension or full termination, etc.) so they have an opportunity to save
their work. Once the application is ready, it calls the provided event completion callback function to indicate that
the \ac{SMS} is free to suspend or terminate it, and can include directives regarding any desired restart.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\code{PMIx_Job_control}}
\declareapi{PMIx_Job_control}
%%%%
\summary
Request a job control action.
%%%%
\format
\copySignature{PMIx_Job_control}{3.0}{
pmix_status_t \\
PMIx_Job_control(const pmix_proc_t targets[], size_t ntargets, \\
\hspace*{17\sigspace}const pmix_info_t directives[], size_t ndirs, \\
\hspace*{17\sigspace}pmix_info_t *results[], size_t *nresults);
}
\begin{arglist}
\argin{targets}{Array of proc structures (array of handles)}
\argin{ntargets}{Number of elements in the \refarg{targets} array (integer)}
\argin{directives}{Array of info structures (array of handles)}
\argin{ndirs}{Number of elements in the \refarg{directives} array (integer)}
\arginout{results}{Address where a pointer to an array of \refstruct{pmix_info_t} containing the results of the request can be returned (memory reference)}
\arginout{nresults}{Address where the number of elements in \refarg{results} can be returned (handle)}
\end{arglist}
Returns one of the following:
\begin{itemize}
\item \refconst{PMIX_SUCCESS}, indicating that the request was processed by the host environment and returned \textit{success}. Details of the result will be returned in the \refarg{results} array
\item a \ac{PMIx} error constant indicating either an error in the input or that the request was refused
\end{itemize}
\reqattrstart
\ac{PMIx} libraries are not required to directly support any attributes for this function. However, any provided attributes must be passed to the host \ac{SMS} daemon for processing, and the \ac{PMIx} library is \textit{required} to add the \refAttributeItem{PMIX_USERID} and the \refAttributeItem{PMIX_GRPID} attributes of the client process making the request.
Host environments that implement support for this operation are required to support the following attributes:
\pasteAttributeItem{PMIX_JOB_CTRL_ID}
\pasteAttributeItem{PMIX_JOB_CTRL_PAUSE}
\pasteAttributeItem{PMIX_JOB_CTRL_RESUME}
\pasteAttributeItem{PMIX_JOB_CTRL_KILL}
\pasteAttributeItem{PMIX_JOB_CTRL_SIGNAL}
\pasteAttributeItem{PMIX_JOB_CTRL_TERMINATE}
\pasteAttributeItem{PMIX_REGISTER_CLEANUP}
\pasteAttributeItem{PMIX_REGISTER_CLEANUP_DIR}
\pasteAttributeItem{PMIX_CLEANUP_RECURSIVE}
\pasteAttributeItem{PMIX_CLEANUP_EMPTY}
\pasteAttributeItem{PMIX_CLEANUP_IGNORE}
\pasteAttributeItem{PMIX_CLEANUP_LEAVE_TOPDIR}
\reqattrend
\optattrstart
The following attributes are optional for host environments that support this operation:
\pasteAttributeItem{PMIX_JOB_CTRL_CANCEL}
\pasteAttributeItem{PMIX_JOB_CTRL_RESTART}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT_EVENT}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT_SIGNAL}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT_TIMEOUT}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT_METHOD}
\pasteAttributeItem{PMIX_JOB_CTRL_PROVISION}
\pasteAttributeItem{PMIX_JOB_CTRL_PROVISION_IMAGE}
\pasteAttributeItem{PMIX_JOB_CTRL_PREEMPTIBLE}
\optattrend
%%%%
\descr
Request a job control action.
The \refarg{targets} array identifies the processes to which the requested job control action is to be applied. All \refterm{clones} of an identified process are to have the requested action applied to them.
A \code{NULL} value can be used to indicate all processes in the caller's namespace.
\refconst{PMIX_RANK_WILDCARD} can also be used as the rank of a target to indicate that all processes in the given namespace are to be included.
The directives are provided as \refstruct{pmix_info_t} structures in the \refarg{directives} array.
The returned \refarg{status} indicates whether or not the request was granted, and information as to the reason for any denial of the request shall be returned in the \refarg{results} array.
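The target-selection rules can be sketched as a predicate that decides whether a given process falls under a request. This is an illustration under stated assumptions: \code{demo_proc_t} is a hypothetical stand-in for \code{pmix_proc_t}, the namespace buffer length is arbitrary, and the value chosen for \code{DEMO_RANK_WILDCARD} is a placeholder for \refconst{PMIX_RANK_WILDCARD}.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder for PMIX_RANK_WILDCARD; the value is an assumption. */
#define DEMO_RANK_WILDCARD (UINT32_MAX - 1)

/* Hypothetical stand-in for pmix_proc_t: a namespace plus a rank. */
typedef struct {
    char nspace[64];
    uint32_t rank;
} demo_proc_t;

/* Decide whether `proc` is covered by a job control request.
 *  - targets == NULL selects every process in the caller's namespace.
 *  - A target whose rank is the wildcard selects every process in that
 *    target's namespace.
 *  - Otherwise both namespace and rank must match. */
bool demo_is_targeted(const demo_proc_t targets[], size_t ntargets,
                      const char *caller_nspace, const demo_proc_t *proc)
{
    size_t i;

    assert(caller_nspace != NULL && proc != NULL);

    if (targets == NULL) {
        return strcmp(proc->nspace, caller_nspace) == 0;
    }
    for (i = 0; i < ntargets; i++) {
        if (strcmp(targets[i].nspace, proc->nspace) != 0) {
            continue;
        }
        if (targets[i].rank == DEMO_RANK_WILDCARD ||
            targets[i].rank == proc->rank) {
            return true;
        }
    }
    return false;
}
```

The sketch does not model \refterm{clones}; a host applying the action would additionally extend it to every clone of each selected process.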
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\code{PMIx_Job_control_nb}}
\declareapi{PMIx_Job_control_nb}
%%%%
\summary
Request a job control action.
%%%%
\format
\copySignature{PMIx_Job_control_nb}{2.0}{
pmix_status_t \\
PMIx_Job_control_nb(const pmix_proc_t targets[], size_t ntargets, \\
\hspace*{20\sigspace}const pmix_info_t directives[], size_t ndirs, \\
\hspace*{20\sigspace}pmix_info_cbfunc_t cbfunc, void *cbdata);
}
\begin{arglist}
\argin{targets}{Array of proc structures (array of handles)}
\argin{ntargets}{Number of elements in the \refarg{targets} array (integer)}
\argin{directives}{Array of info structures (array of handles)}
\argin{ndirs}{Number of elements in the \refarg{directives} array (integer)}
\argin{cbfunc}{Callback function \refapi{pmix_info_cbfunc_t} (function reference)}
\argin{cbdata}{Data to be passed to the callback function (memory reference)}
\end{arglist}
Returns one of the following:
\begin{itemize}
\item \refconst{PMIX_SUCCESS}, indicating that the request is being processed by the host environment - result will be returned in the provided \refarg{cbfunc}. Note that the library must not invoke the callback function prior to returning from the \ac{API}.
\item \refconst{PMIX_OPERATION_SUCCEEDED}, indicating that the request was immediately processed and returned \textit{success} - the \refarg{cbfunc} will \textit{not} be called
\item a \ac{PMIx} error constant indicating either an error in the input or that the request was immediately processed and failed - the \refarg{cbfunc} will \textit{not} be called
\end{itemize}
\reqattrstart
\ac{PMIx} libraries are not required to directly support any attributes for this function. However, any provided attributes must be passed to the host \ac{SMS} daemon for processing, and the \ac{PMIx} library is \textit{required} to add the \refAttributeItem{PMIX_USERID} and the \refAttributeItem{PMIX_GRPID} attributes of the client process making the request.
Host environments that implement support for this operation are required to support the following attributes:
\pasteAttributeItem{PMIX_JOB_CTRL_ID}
\pasteAttributeItem{PMIX_JOB_CTRL_PAUSE}
\pasteAttributeItem{PMIX_JOB_CTRL_RESUME}
\pasteAttributeItem{PMIX_JOB_CTRL_KILL}
\pasteAttributeItem{PMIX_JOB_CTRL_SIGNAL}
\pasteAttributeItem{PMIX_JOB_CTRL_TERMINATE}
\pasteAttributeItem{PMIX_REGISTER_CLEANUP}
\pasteAttributeItem{PMIX_REGISTER_CLEANUP_DIR}
\pasteAttributeItem{PMIX_CLEANUP_RECURSIVE}
\pasteAttributeItem{PMIX_CLEANUP_EMPTY}
\pasteAttributeItem{PMIX_CLEANUP_IGNORE}
\pasteAttributeItem{PMIX_CLEANUP_LEAVE_TOPDIR}
\reqattrend
\optattrstart
The following attributes are optional for host environments that support this operation:
\pasteAttributeItem{PMIX_JOB_CTRL_CANCEL}
\pasteAttributeItem{PMIX_JOB_CTRL_RESTART}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT_EVENT}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT_SIGNAL}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT_TIMEOUT}
\pasteAttributeItem{PMIX_JOB_CTRL_CHECKPOINT_METHOD}
\pasteAttributeItem{PMIX_JOB_CTRL_PROVISION}
\pasteAttributeItem{PMIX_JOB_CTRL_PROVISION_IMAGE}
\pasteAttributeItem{PMIX_JOB_CTRL_PREEMPTIBLE}
\optattrend
%%%%
\descr
Non-blocking form of the \refapi{PMIx_Job_control} \ac{API}.
The \refarg{targets} array identifies the processes to which the requested job control action is to be applied. All \refterm{clones} of an identified process are to have the requested action applied to them.
A \code{NULL} value can be used to indicate all processes in the caller's namespace.
\refconst{PMIX_RANK_WILDCARD} can also be used as the rank of a target to indicate that all processes in the given namespace are to be included.
The directives are provided as \refstruct{pmix_info_t} structures in the \refarg{directives} array.
The callback function provides a \refarg{status} indicating whether or not the request was granted; information as to the reason for any denial is returned in the array of \refstruct{pmix_info_t} structures passed to the \refapi{pmix_info_cbfunc_t}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Job control constants}
\label{api:struct:constants:jobcontrol}
The following constants are specifically defined for return by the job control \acp{API}:
\begin{constantdesc}
%
\declareconstitemNEW{PMIX_ERR_CONFLICTING_CLEANUP_DIRECTIVES}
Conflicting directives given for job/process cleanup.
\end{constantdesc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Job control events}
\label{api:struct:events:jobcontrol}
The following job control events may be available for registration, depending upon implementation and host environment support:
\begin{constantdesc}
%
\declareconstitem{PMIX_JCTRL_CHECKPOINT}
Monitored by a \ac{PMIx} client to trigger a checkpoint operation.
%
\declareconstitem{PMIX_JCTRL_CHECKPOINT_COMPLETE}
Sent by a \ac{PMIx} client and monitored by a \ac{PMIx} server to signal that the requested checkpoint operation has completed.
%
\declareconstitem{PMIX_JCTRL_PREEMPT_ALERT}
Monitored by a \ac{PMIx} client to detect that an \ac{RM} intends to preempt the job.
%
\declareconstitem{PMIX_ERR_PROC_RESTART}
Error in process restart.
%
\declareconstitem{PMIX_ERR_PROC_CHECKPOINT}
Error in process checkpoint.
%
\declareconstitem{PMIX_ERR_PROC_MIGRATE}
Error in process migration.
%
\end{constantdesc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Job control attributes}
\label{api:struct:attributes:jobcontrol}
Attributes used to request control operations on an executing application - these are values passed to the job control \acp{API} and are not accessed using the \refapi{PMIx_Get} \ac{API}.
%
\declareAttribute{PMIX_JOB_CTRL_ID}{"pmix.jctrl.id"}{char*}{
Provide a string identifier for this request. The user can provide an identifier for the requested operation, thus allowing them to later request status of the operation or to terminate it. The host, therefore, shall track it with the request for future reference.
}
%
\declareAttribute{PMIX_JOB_CTRL_PAUSE}{"pmix.jctrl.pause"}{bool}{
Pause the specified processes.
}
%
\declareAttribute{PMIX_JOB_CTRL_RESUME}{"pmix.jctrl.resume"}{bool}{
Resume (``un-pause'') the specified processes.
}
%
\declareAttribute{PMIX_JOB_CTRL_CANCEL}{"pmix.jctrl.cancel"}{char*}{
Cancel the specified request - the provided request ID must match the \refattr{PMIX_JOB_CTRL_ID} provided to a previous call to \refapi{PMIx_Job_control}. An ID of \code{NULL} implies cancel all requests from this requester.
}
%
\declareAttribute{PMIX_JOB_CTRL_KILL}{"pmix.jctrl.kill"}{bool}{
Forcibly terminate the specified processes and clean up.
}
%
\declareAttribute{PMIX_JOB_CTRL_RESTART}{"pmix.jctrl.restart"}{char*}{
Restart the specified processes using the given checkpoint ID.
}
%
\declareAttribute{PMIX_JOB_CTRL_CHECKPOINT}{"pmix.jctrl.ckpt"}{char*}{
Checkpoint the specified processes and assign the given ID to the checkpoint.
}
%
\declareAttribute{PMIX_JOB_CTRL_CHECKPOINT_EVENT}{"pmix.jctrl.ckptev"}{bool}{
Use event notification to trigger a process checkpoint.
}
%
\declareAttribute{PMIX_JOB_CTRL_CHECKPOINT_SIGNAL}{"pmix.jctrl.ckptsig"}{int}{
Use the given signal to trigger a process checkpoint.
}
%
\declareAttribute{PMIX_JOB_CTRL_CHECKPOINT_TIMEOUT}{"pmix.jctrl.ckptsig"}{int}{
Time in seconds to wait for a checkpoint to complete.
}
%
\declareAttribute{PMIX_JOB_CTRL_CHECKPOINT_METHOD}{"pmix.jctrl.ckmethod"}{pmix_data_array_t}{
Array of \refstruct{pmix_info_t} declaring each method and value supported by this application.
}
%
\declareAttribute{PMIX_JOB_CTRL_SIGNAL}{"pmix.jctrl.sig"}{int}{
Send given signal to specified processes.
}
%
\declareAttribute{PMIX_JOB_CTRL_PROVISION}{"pmix.jctrl.pvn"}{char*}{
Regular expression identifying nodes that are to be provisioned.
}
%
\declareAttribute{PMIX_JOB_CTRL_PROVISION_IMAGE}{"pmix.jctrl.pvnimg"}{char*}{
Name of the image that is to be provisioned.
}
%
\declareAttribute{PMIX_JOB_CTRL_PREEMPTIBLE}{"pmix.jctrl.preempt"}{bool}{
Indicate that the job can be pre-empted.
}
%
\declareAttribute{PMIX_JOB_CTRL_TERMINATE}{"pmix.jctrl.term"}{bool}{
Politely terminate the specified processes.
}
%
\declareAttribute{PMIX_REGISTER_CLEANUP}{"pmix.reg.cleanup"}{char*}{
Comma-delimited list of files to be removed upon process termination.
}
%
\declareAttribute{PMIX_REGISTER_CLEANUP_DIR}{"pmix.reg.cleanupdir"}{char*}{
Comma-delimited list of directories to be removed upon process termination.
}
%
\declareAttribute{PMIX_CLEANUP_RECURSIVE}{"pmix.clnup.recurse"}{bool}{
Recursively clean up all subdirectories under the specified one(s).
}
%
\declareAttribute{PMIX_CLEANUP_EMPTY}{"pmix.clnup.empty"}{bool}{
Only remove empty subdirectories.
}
%
\declareAttribute{PMIX_CLEANUP_IGNORE}{"pmix.clnup.ignore"}{char*}{
Comma-delimited list of filenames that are not to be removed.
}
%
\declareAttribute{PMIX_CLEANUP_LEAVE_TOPDIR}{"pmix.clnup.lvtop"}{bool}{
When recursively cleaning subdirectories, do not remove the top-level directory (the one given in the cleanup request).
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Process and Job Monitoring}
\label{chap:api_job_mgmt:monitor}
In addition to external faults, a common problem encountered in \ac{HPC} applications is a failure to make
progress due to some internal conflict in the computation. These situations can
result in a significant waste of resources as the \ac{SMS} is unaware of the problem, and thus cannot terminate the
job. Various watchdog methods have been developed for detecting this situation, including requiring a periodic ``heartbeat''
from the application and monitoring a specified file for changes in size and/or modification time.
The following \acp{API} allow applications to request monitoring, directing what is to be monitored, the frequency of the associated check, whether or not the application is to be notified (via the event notification subsystem) of stall detection, and other characteristics of the operation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\code{PMIx_Process_monitor}}
\declareapi{PMIx_Process_monitor}
%%%%
\summary
Request that application processes be monitored.
%%%%
\format
\copySignature{PMIx_Process_monitor}{3.0}{
pmix_status_t \\
PMIx_Process_monitor(const pmix_info_t *monitor, \\
\hspace*{21\sigspace}pmix_status_t error, \\
\hspace*{21\sigspace}const pmix_info_t directives[], size_t ndirs, \\
\hspace*{21\sigspace}pmix_info_t *results[], size_t *nresults);
}
\begin{arglist}
\argin{monitor}{info (handle)}
\argin{error}{status (integer)}
\argin{directives}{Array of info structures (array of handles)}
\argin{ndirs}{Number of elements in the \refarg{directives} array (integer)}
\arginout{results}{Address where a pointer to an array of \refstruct{pmix_info_t} containing the results of the request can be returned (memory reference)}
\arginout{nresults}{Address where the number of elements in \refarg{results} can be returned (memory reference)}
\end{arglist}
Returns one of the following:
\begin{itemize}
\item \refconst{PMIX_SUCCESS}, indicating that the request was processed and returned \textit{success}. Details of the result will be returned in the \refarg{results} array.
\item a PMIx error constant indicating either an error in the input or that the request was refused
\end{itemize}
\optattrstart
The following attributes may be implemented by a \ac{PMIx} library or by the host environment. If supported by the \ac{PMIx} server library, then the library must not pass the supported attributes to the host environment. All attributes not directly supported by the server library must be passed to the host environment if it supports this operation, and the library is \textit{required} to add the \refAttributeItem{PMIX_USERID} and the \refAttributeItem{PMIX_GRPID} attributes of the requesting process:
\pasteAttributeItem{PMIX_MONITOR_ID}
\pasteAttributeItem{PMIX_MONITOR_CANCEL}
\pasteAttributeItem{PMIX_MONITOR_APP_CONTROL}
\pasteAttributeItem{PMIX_MONITOR_HEARTBEAT}
\pasteAttributeItem{PMIX_MONITOR_HEARTBEAT_TIME}
\pasteAttributeItem{PMIX_MONITOR_HEARTBEAT_DROPS}
\pasteAttributeItem{PMIX_MONITOR_FILE}
\pasteAttributeItem{PMIX_MONITOR_FILE_SIZE}
\pasteAttributeItem{PMIX_MONITOR_FILE_ACCESS}
\pasteAttributeItem{PMIX_MONITOR_FILE_MODIFY}
\pasteAttributeItem{PMIX_MONITOR_FILE_CHECK_TIME}
\pasteAttributeItem{PMIX_MONITOR_FILE_DROPS}
\pasteAttributeItem{PMIX_SEND_HEARTBEAT}
\optattrend
%%%%
\descr
Request that application processes be monitored via several possible methods.
For example, the caller may request that the server monitor this process for periodic heartbeats as an indication that the process has not become ``wedged''.
When a monitor detects the specified alarm condition, it will generate an event notification using the provided error code and passing along any available relevant information.
It is up to the caller to register a corresponding event handler.
The \refarg{monitor} argument is an attribute indicating the type of monitor being requested.
For example, \refattr{PMIX_MONITOR_FILE} indicates that the requestor is asking that a file be monitored.
The \refarg{error} argument is the status code to be used when generating an event notification alerting that the monitor has been triggered.
The range of the notification defaults to \refconst{PMIX_RANGE_NAMESPACE}.
This can be changed by providing a \refattr{PMIX_RANGE} directive.
The \refarg{directives} argument characterizes the monitoring request (e.g., monitor file size) and the frequency of checking to be done.
The returned \refarg{status} indicates whether or not the request was granted, and information as to the reason for any denial of the request shall be returned in the \refarg{results} array.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\code{PMIx_Process_monitor_nb}}
\declareapi{PMIx_Process_monitor_nb}
%%%%
\summary
Request that application processes be monitored.
%%%%
\format
\copySignature{PMIx_Process_monitor_nb}{2.0}{
pmix_status_t \\
PMIx_Process_monitor_nb(const pmix_info_t *monitor, \\
\hspace*{24\sigspace}pmix_status_t error, \\
\hspace*{24\sigspace}const pmix_info_t directives[], \\
\hspace*{24\sigspace}size_t ndirs, \\
\hspace*{24\sigspace}pmix_info_cbfunc_t cbfunc, void *cbdata);
}
\begin{arglist}
\argin{monitor}{info (handle)}
\argin{error}{status (integer)}
\argin{directives}{Array of info structures (array of handles)}
\argin{ndirs}{Number of elements in the \refarg{directives} array (integer)}
\argin{cbfunc}{Callback function \refapi{pmix_info_cbfunc_t} (function reference)}
\argin{cbdata}{Data to be passed to the callback function (memory reference)}
\end{arglist}
Returns one of the following:
\begin{itemize}
\item \refconst{PMIX_SUCCESS}, indicating that the request is being processed by the host environment - result will be returned in the provided \refarg{cbfunc}. Note that the library must not invoke the callback function prior to returning from the \ac{API}.
\item \refconst{PMIX_OPERATION_SUCCEEDED}, indicating that the request was immediately processed and returned \textit{success} - the \refarg{cbfunc} will \textit{not} be called.
\item a PMIx error constant indicating either an error in the input or that the request was immediately processed and failed - the \refarg{cbfunc} will \textit{not} be called.
\end{itemize}
\optattrstart
The following attributes may be implemented by a \ac{PMIx} library or by the host environment. If supported by the \ac{PMIx} server library, then the library must not pass the supported attributes to the host environment. All attributes not directly supported by the server library must be passed to the host environment if it supports this operation, and the library is \textit{required} to add the \refAttributeItem{PMIX_USERID} and the \refAttributeItem{PMIX_GRPID} attributes of the requesting process:
\pasteAttributeItem{PMIX_MONITOR_ID}
\pasteAttributeItem{PMIX_MONITOR_CANCEL}
\pasteAttributeItem{PMIX_MONITOR_APP_CONTROL}
\pasteAttributeItem{PMIX_MONITOR_HEARTBEAT}
\pasteAttributeItem{PMIX_MONITOR_HEARTBEAT_TIME}
\pasteAttributeItem{PMIX_MONITOR_HEARTBEAT_DROPS}
\pasteAttributeItem{PMIX_MONITOR_FILE}
\pasteAttributeItem{PMIX_MONITOR_FILE_SIZE}
\pasteAttributeItem{PMIX_MONITOR_FILE_ACCESS}
\pasteAttributeItem{PMIX_MONITOR_FILE_MODIFY}
\pasteAttributeItem{PMIX_MONITOR_FILE_CHECK_TIME}
\pasteAttributeItem{PMIX_MONITOR_FILE_DROPS}
\pasteAttributeItem{PMIX_SEND_HEARTBEAT}
\optattrend
%%%%
\descr
Non-blocking form of the \refapi{PMIx_Process_monitor} \ac{API}. The \refarg{status} passed to the \refarg{cbfunc} function indicates whether or not the request was granted. Information as to the reason for any denial is provided in the array of \refstruct{pmix_info_t} structures passed to the \refapi{pmix_info_cbfunc_t} callback.
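The return-code convention listed above for this \ac{API} can be summarized from the caller's side. The following self-contained C sketch is illustrative only: the \code{demo_} types and values are hypothetical stand-ins for \code{pmix_status_t} and the \ac{PMIx} status constants, whose numeric values are not reproduced here.

```c
#include <assert.h>

/* Stand-in status values for illustration only. Real code would compare
 * against the pmix_status_t constants provided by the PMIx headers. */
typedef enum {
    DEMO_SUCCESS,              /* stands in for PMIX_SUCCESS */
    DEMO_OPERATION_SUCCEEDED,  /* stands in for PMIX_OPERATION_SUCCEEDED */
    DEMO_ERROR                 /* stands in for any PMIx error constant */
} demo_status_t;

/* What the caller should do after the non-blocking call returns. */
typedef enum {
    DEMO_WAIT_FOR_CALLBACK,  /* request in progress, cbfunc will be called */
    DEMO_DONE_OK,            /* completed immediately, cbfunc not called */
    DEMO_DONE_FAILED         /* rejected or failed, cbfunc not called */
} demo_action_t;

/* Map the immediate return status of a non-blocking request to the
 * action the caller should take. */
demo_action_t demo_after_nonblocking_call(demo_status_t rc)
{
    /* Guard against values outside the stand-in enumeration. */
    assert(rc == DEMO_SUCCESS || rc == DEMO_OPERATION_SUCCEEDED ||
           rc == DEMO_ERROR);
    switch (rc) {
    case DEMO_SUCCESS:
        return DEMO_WAIT_FOR_CALLBACK;
    case DEMO_OPERATION_SUCCEEDED:
        return DEMO_DONE_OK;
    default:
        return DEMO_DONE_FAILED;
    }
}
```

The practical consequence is that any resources associated with \refarg{cbdata} should be released by the caller itself in the two cases where the callback will not be invoked.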
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\code{PMIx_Heartbeat}}
\declaremacro{PMIx_Heartbeat}
%%%%
\summary
Send a heartbeat to the \ac{PMIx} server library.
%%%%
\format
\copySignature{PMIx_Heartbeat}{2.0}{
PMIx_Heartbeat();
}
%%%%
\descr
A simplified macro wrapping \refapi{PMIx_Process_monitor_nb} that sends a heartbeat to the \ac{PMIx} server library.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Monitoring events}
\label{api:struct:events:monitor}
The following monitoring events may be available for registration, depending upon implementation and host environment support:
\begin{constantdesc}
%
\declareconstitem{PMIX_MONITOR_HEARTBEAT_ALERT}
Heartbeat failed to arrive within specified window. The process that triggered this alert will be identified in the event.
%
\declareconstitem{PMIX_MONITOR_FILE_ALERT}
File failed its monitoring detection criteria. The file that triggered this alert will be identified in the event.
%
\end{constantdesc}
%%%%%%%%%%%
\subsection{Monitoring attributes}
\label{api:struct:attributes:monitor}
Attributes used to control monitoring of an executing application - these are values passed to the \refapi{PMIx_Process_monitor} and \refapi{PMIx_Process_monitor_nb} \acp{API} and are not accessed using the \refapi{PMIx_Get} \ac{API}.
%
\declareAttribute{PMIX_MONITOR_ID}{"pmix.monitor.id"}{char*}{
Provide a string identifier for this request.
}
%
\declareAttribute{PMIX_MONITOR_CANCEL}{"pmix.monitor.cancel"}{char*}{
Identifier to be canceled (\code{NULL} means cancel all monitoring for this process).
}
%
\declareAttribute{PMIX_MONITOR_APP_CONTROL}{"pmix.monitor.appctrl"}{bool}{
The application desires to control the response to a monitoring event - i.e., the application is requesting that the host environment not take immediate action in response to the event (e.g., terminating the job).
}
%
\declareAttribute{PMIX_MONITOR_HEARTBEAT}{"pmix.monitor.mbeat"}{void}{
Register to have the PMIx server monitor the requestor for heartbeats.
}
%
\declareAttribute{PMIX_SEND_HEARTBEAT}{"pmix.monitor.beat"}{void}{
Send heartbeat to local PMIx server.
}
%
\declareAttribute{PMIX_MONITOR_HEARTBEAT_TIME}{"pmix.monitor.btime"}{uint32_t}{
Time in seconds before declaring heartbeat missed.
}
%
\declareAttribute{PMIX_MONITOR_HEARTBEAT_DROPS}{"pmix.monitor.bdrop"}{uint32_t}{
Number of heartbeats that can be missed before generating the event.
}
%
\declareAttribute{PMIX_MONITOR_FILE}{"pmix.monitor.fmon"}{char*}{
Register to monitor file for signs of life.
}
%
\declareAttribute{PMIX_MONITOR_FILE_SIZE}{"pmix.monitor.fsize"}{bool}{
Monitor whether the size of the given file is growing to determine if the application is running.
}
%
\declareAttribute{PMIX_MONITOR_FILE_ACCESS}{"pmix.monitor.faccess"}{char*}{
Monitor the time since the given file was last accessed to determine if the application is running.
}
%
\declareAttribute{PMIX_MONITOR_FILE_MODIFY}{"pmix.monitor.fmod"}{char*}{
Monitor the time since the given file was last modified to determine if the application is running.
}
%
\declareAttribute{PMIX_MONITOR_FILE_CHECK_TIME}{"pmix.monitor.ftime"}{uint32_t}{
Time in seconds between checking the file.
}
%
\declareAttribute{PMIX_MONITOR_FILE_DROPS}{"pmix.monitor.fdrop"}{uint32_t}{
Number of file checks that can be missed before generating the event.
}
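One plausible reading of how \refattr{PMIX_MONITOR_HEARTBEAT_TIME} and \refattr{PMIX_MONITOR_HEARTBEAT_DROPS} interact is sketched below in self-contained C. The \code{demo_watchdog_t} type and its functions are hypothetical and are not part of \ac{PMIx}. The sketch assumes that misses are counted consecutively, that any heartbeat resets the count, and that the event is generated once the count exceeds the permitted number of drops; implementations may behave differently. The same counting could be applied to file checks using \refattr{PMIX_MONITOR_FILE_CHECK_TIME} and \refattr{PMIX_MONITOR_FILE_DROPS}.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical watchdog state; field names are illustrative only. */
typedef struct {
    uint32_t window_secs;   /* cf. PMIX_MONITOR_HEARTBEAT_TIME */
    uint32_t allowed_drops; /* cf. PMIX_MONITOR_HEARTBEAT_DROPS */
    uint32_t missed;        /* consecutive windows with no heartbeat */
} demo_watchdog_t;

/* Initialize the watchdog with a window length and a drop allowance. */
void demo_watchdog_init(demo_watchdog_t *w, uint32_t window_secs,
                        uint32_t allowed_drops)
{
    assert(w != NULL && window_secs > 0);
    w->window_secs = window_secs;
    w->allowed_drops = allowed_drops;
    w->missed = 0;
}

/* Call once each time a window of `window_secs` seconds has elapsed.
 * `beat_seen` reports whether a heartbeat arrived during that window.
 * Returns true when the alert event should be generated. */
bool demo_watchdog_tick(demo_watchdog_t *w, bool beat_seen)
{
    assert(w != NULL);
    if (beat_seen) {
        w->missed = 0;      /* assumed: any heartbeat resets the count */
        return false;
    }
    w->missed++;
    return w->missed > w->allowed_drops;
}
```

With an allowance of two drops, for example, the third consecutive window without a heartbeat is the first one that triggers the alert in this sketch.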
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Logging}
\label{chap:api_job_mgmt:logging}
The logging interface supports posting information by applications and \ac{SMS} elements to persistent storage. This function is \textit{not} intended for output of computational results, but rather for reporting status and saving state information, such as inserting computation progress reports into the application's \ac{SMS} job log or posting error reports to the local syslog.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\code{PMIx_Log}}
\declareapi{PMIx_Log}
%%%%
\summary
Log data to a data service.
%%%%
\format
\copySignature{PMIx_Log}{3.0}{
pmix_status_t \\
PMIx_Log(const pmix_info_t data[], size_t ndata, \\
\hspace*{9\sigspace}const pmix_info_t directives[], size_t ndirs);
}
\begin{arglist}
\argin{data}{Array of info structures (array of handles)}
\argin{ndata}{Number of elements in the \refarg{data} array (\code{size_t})}
\argin{directives}{Array of info structures (array of handles)}
\argin{ndirs}{Number of elements in the \refarg{directives} array (\code{size_t})}
\end{arglist}
Return codes are one of the following:
\begin{constantdesc}
\item \refconst{PMIX_SUCCESS} The logging request was successful.
\item \refconst{PMIX_ERR_BAD_PARAM} The logging request contains at least one incorrect entry.
\item \refconst{PMIX_ERR_NOT_SUPPORTED} The \ac{PMIx} implementation or host environment does not support this function.
\item other appropriate \ac{PMIx} error code
\end{constantdesc}
\reqattrstart
If the \ac{PMIx} library does not itself perform this operation, then it is required to pass any attributes provided by the client to the host environment for processing. In addition, it must include the following attributes in the passed \refarg{info} array:
\pasteAttributeItem{PMIX_USERID}
\pasteAttributeItem{PMIX_GRPID}
Host environments or \ac{PMIx} libraries that implement support for this operation are required to support the following attributes:
\pasteAttributeItem{PMIX_LOG_STDERR}
\pasteAttributeItem{PMIX_LOG_STDOUT}
\pasteAttributeItem{PMIX_LOG_SYSLOG}
\pasteAttributeItem{PMIX_LOG_LOCAL_SYSLOG}
\pasteAttributeItem{PMIX_LOG_GLOBAL_SYSLOG}
\pasteAttributeItem{PMIX_LOG_SYSLOG_PRI}
\pasteAttributeItem{PMIX_LOG_ONCE}
\reqattrend
\optattrstart
The following attributes are optional for host environments or \ac{PMIx} libraries that support this operation:
\pasteAttributeItem{PMIX_LOG_SOURCE}
\pasteAttributeItem{PMIX_LOG_TIMESTAMP}
\pasteAttributeItem{PMIX_LOG_GENERATE_TIMESTAMP}
\pasteAttributeItem{PMIX_LOG_TAG_OUTPUT}
\pasteAttributeItem{PMIX_LOG_TIMESTAMP_OUTPUT}
\pasteAttributeItem{PMIX_LOG_XML_OUTPUT}
\pasteAttributeItem{PMIX_LOG_EMAIL}
\pasteAttributeItem{PMIX_LOG_EMAIL_ADDR}
\pasteAttributeItem{PMIX_LOG_EMAIL_SENDER_ADDR}
\pasteAttributeItem{PMIX_LOG_EMAIL_SERVER}
\pasteAttributeItem{PMIX_LOG_EMAIL_SRVR_PORT}
\pasteAttributeItem{PMIX_LOG_EMAIL_SUBJECT}
\pasteAttributeItem{PMIX_LOG_EMAIL_MSG}
\pasteAttributeItem{PMIX_LOG_JOB_RECORD}
\pasteAttributeItem{PMIX_LOG_GLOBAL_DATASTORE}
\optattrend
%%%%
\descr
Log data subject to the services offered by the host environment. The data to be logged is provided in the \refarg{data} array. The (optional) \refarg{directives} can be used to direct the choice of logging channel.
\adviceuserstart
It is strongly recommended that the \refapi{PMIx_Log} \ac{API} not be used by applications for streaming data as it is not a ``performant'' transport and can perturb the application since it involves the local \ac{PMIx} server and host \ac{SMS} daemon. Note that a return of \refconst{PMIX_SUCCESS} only denotes that the data was successfully handed to the appropriate system call (for local channels) or to the host environment, and does not indicate receipt at the final destination.
\adviceuserend
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{\code{PMIx_Log_nb}}
\declareapi{PMIx_Log_nb}
%%%%
\summary
Log data to a data service.
%%%%
\format
\copySignature{PMIx_Log_nb}{2.0}{
pmix_status_t \\
PMIx_Log_nb(const pmix_info_t data[], size_t ndata, \\
\hspace*{12\sigspace}const pmix_info_t directives[], size_t ndirs, \\
\hspace*{12\sigspace}pmix_op_cbfunc_t cbfunc, void *cbdata);
}
\begin{arglist}
\argin{data}{Array of info structures (array of handles)}
\argin{ndata}{Number of elements in the \refarg{data} array (\code{size_t})}
\argin{directives}{Array of info structures (array of handles)}
\argin{ndirs}{Number of elements in the \refarg{directives} array (\code{size_t})}
\argin{cbfunc}{Callback function \refapi{pmix_op_cbfunc_t} (function reference)}
\argin{cbdata}{Data to be passed to the callback function (memory reference)}
\end{arglist}
Return codes are one of the following:
\begin{constantdesc}
\item \refconst{PMIX_SUCCESS} The logging request is valid and is being processed. The resulting status from the operation will be provided in the callback function. Note that the library must not invoke the callback function prior to returning from the \ac{API}.
\item \refconst{PMIX_OPERATION_SUCCEEDED} The request was immediately processed and returned \textit{success}. The \refarg{cbfunc} will \textit{not} be called.
\item \refconst{PMIX_ERR_BAD_PARAM} The logging request contains at least one incorrect entry that prevents it from being processed. The callback function will not be called.
\item \refconst{PMIX_ERR_NOT_SUPPORTED} The \ac{PMIx} implementation does not support this function. The callback function will not be called.
\item other appropriate \ac{PMIx} error code - the callback function will not be called.
\end{constantdesc}
\reqattrstart
If the \ac{PMIx} library does not itself perform this operation, then it is required to pass any attributes provided by the client to the host environment for processing. In addition, it must include the following attributes in the passed \refarg{info} array:
\pasteAttributeItem{PMIX_USERID}
\pasteAttributeItem{PMIX_GRPID}
Host environments or \ac{PMIx} libraries that implement support for this operation are required to support the following attributes:
\pasteAttributeItem{PMIX_LOG_STDERR}
\pasteAttributeItem{PMIX_LOG_STDOUT}
\pasteAttributeItem{PMIX_LOG_SYSLOG}