<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>f(x) on f(x) </title>
<generator uri="https://gohugo.io">Hugo</generator>
<link>http://firoyang.org/</link>
<language>en-us</language>
<author>Firo Yang</author>
<updated>Sat, 08 Jun 2019 00:00:00 UTC</updated>
<item>
<title>Linux kernel page allocation</title>
<link>http://firoyang.org/cs/page_allocator/</link>
<pubDate>Sat, 08 Jun 2019 00:00:00 UTC</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/cs/page_allocator/</guid>
<description>
<h1 id="gfp">GFP</h1>
<p><a href="https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html">Memory Allocation Guide</a><br />
<a href="https://www.kernel.org/doc/html/latest/core-api/mm-api.html#memory-allocation-controls">Memory Allocation Controls</a><br />
Also see include/linux/gfp.h</p>
<h2 id="removed-gfp-flags">Removed GFP flags</h2>
<p>__GFP_WAIT: replaced by __GFP_RECLAIM in the commit &ldquo;mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM&rdquo;.</p>
<h2 id="gfp-zone-table-and-gfp-zone-bad">GFP_ZONE_TABLE and GFP_ZONE_BAD</h2>
<p>commit b70d94ee438b3fd9c15c7691d7a932a135c18101<br />
Refs: v2.6.30-5489-gb70d94ee438b<br />
Author: Christoph Lameter<br />
AuthorDate: Tue Jun 16 15:32:46 2009 -0700<br />
page-allocator: use integer fields lookup for gfp_zone and check for errors in flags passed to the page allocator<br />
+ * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the<br />
+ * zone to use given the lowest 4 bits of gfp_t. Entries are ZONE_SHIFT long<br />
+ * and there are 16 of them to cover all possible combinations of<br />
+ * __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE and __GFP_HIGHMEM<br />
+ * The zone fallback order is MOVABLE=&gt;HIGHMEM=&gt;NORMAL=&gt;DMA32=&gt;DMA.<br />
+ * But GFP_MOVABLE is not only a zone specifier but also an allocation<br />
+ * policy. Therefore __GFP_MOVABLE plus another zone selector is valid.<br />
+ * Only 1bit of the lowest 3 bit (DMA,DMA32,HIGHMEM) can be set to &ldquo;1&rdquo;.<br />
+ * bit result<br />
+ * 0x0 =&gt; NORMAL<br />
+ * 0x1 =&gt; DMA or NORMAL<br />
+ * 0x2 =&gt; HIGHMEM or NORMAL<br />
+ * 0x3 =&gt; BAD (DMA+HIGHMEM)<br />
+ * 0x4 =&gt; DMA32 or DMA or NORMAL<br />
+ * 0x5 =&gt; BAD (DMA+DMA32)<br />
+ * 0x6 =&gt; BAD (HIGHMEM+DMA32)<br />
+ * 0x7 =&gt; BAD (HIGHMEM+DMA32+DMA)<br />
+ * 0x8 =&gt; NORMAL (MOVABLE+0)<br />
+ * 0x9 =&gt; DMA or NORMAL (MOVABLE+DMA)<br />
+ * 0xa =&gt; MOVABLE (Movable is valid only if HIGHMEM is set too)<br />
+ * 0xb =&gt; BAD (MOVABLE+HIGHMEM+DMA)<br />
+ * 0xc =&gt; DMA32 (MOVABLE+HIGHMEM+DMA32)<br />
+ * 0xd =&gt; BAD (MOVABLE+DMA32+DMA)<br />
+ * 0xe =&gt; BAD (MOVABLE+DMA32+HIGHMEM)<br />
+ * 0xf =&gt; BAD (MOVABLE+DMA32+HIGHMEM+DMA)</p>
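<p>The table can be checked with a small Python model. The sketch below packs the 16 results into one integer, with ZONES_SHIFT bits per entry as the comment describes. It keeps the BAD combinations in a separate bitmask, similar in spirit to GFP_ZONE_TABLE and GFP_ZONE_BAD. It is not the kernel macro, and several choices are assumptions: the zone numbers, ZONES_SHIFT = 3 and the helper names are hypothetical. Where a row says &ldquo;X or NORMAL&rdquo;, the model assumes zone X is configured. Row 0xc is treated as MOVABLE+DMA32, which is what 0x8 + 0x4 encodes.</p>
<pre><code class="language-python">
```python
# Model of the GFP_ZONE_TABLE idea from the commit comment (not kernel code).
# Hypothetical zone numbers; the kernel's enum zone_type differs per config.
DMA, DMA32, NORMAL, HIGHMEM, MOVABLE = range(5)
ZONES_SHIFT = 3                      # bits per table entry; enough for 5 zones
NAMES = ["DMA", "DMA32", "NORMAL", "HIGHMEM", "MOVABLE"]

# Zone-modifier bits, as in the table: 0x1 DMA, 0x2 HIGHMEM, 0x4 DMA32, 0x8 MOVABLE.
GFP_DMA, GFP_HIGHMEM, GFP_DMA32, GFP_MOVABLE = 0x1, 0x2, 0x4, 0x8

# Result column of the table; None marks a BAD combination.
RESULTS = {
    0x0: NORMAL, 0x1: DMA, 0x2: HIGHMEM, 0x3: None,
    0x4: DMA32, 0x5: None, 0x6: None, 0x7: None,
    0x8: NORMAL, 0x9: DMA, 0xa: MOVABLE, 0xb: None,
    0xc: DMA32, 0xd: None, 0xe: None, 0xf: None,
}

# Pack the valid entries into one word and the bad ones into a bitmask.
ZONE_TABLE = sum(zone * 2 ** (bits * ZONES_SHIFT)
                 for bits, zone in RESULTS.items() if zone is not None)
ZONE_BAD = sum(2 ** bits for bits, zone in RESULTS.items() if zone is None)


def gfp_zone(gfp: int) -> str:
    """Return the highest zone allowed by the low 4 GFP bits."""
    bits = gfp % 16                   # keep only the four zone-modifier bits
    if (ZONE_BAD >> bits) % 2:        # this combination is set in the BAD mask
        raise ValueError(f"bad zone modifier combination {bits:#x}")
    zone = (ZONE_TABLE >> (bits * ZONES_SHIFT)) % 2 ** ZONES_SHIFT
    return NAMES[zone]
```
</code></pre>
<p>For example, gfp_zone(GFP_MOVABLE | GFP_HIGHMEM) gives MOVABLE, while GFP_DMA | GFP_HIGHMEM is rejected. These match the 0xa and 0x3 rows.</p>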
<h1 id="alloc-flags">Alloc flags</h1>
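<p>gfp_to_alloc_flags() translates the GFP flags of a request into the ALLOC_* flags used inside the allocator:<br />
ALLOC_HIGH halves the min watermark in __zone_watermark_ok(): if (alloc_flags &amp; ALLOC_HIGH) min -= min / 2;<br />
ALLOC_HARDER lets rmqueue() try the MIGRATE_HIGHATOMIC reserve first: if (alloc_flags &amp; ALLOC_HARDER) { page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); &hellip; }</p>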
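<p>A toy model of the two flags, assuming the flags are a Python set of names rather than a bitmask. watermark_ok() mirrors only the ALLOC_HIGH line above. It ignores order, lowmem_reserve and the other adjustments made by __zone_watermark_ok(). freelist_search_order() follows the ALLOC_HARDER branch of rmqueue() for order &gt; 0 requests; order-0 requests are served from the per-cpu lists first. Both function names are hypothetical.</p>
<pre><code class="language-python">
```python
# Toy model of ALLOC_HIGH / ALLOC_HARDER; a set of names replaces the kernel bitmask.

def watermark_ok(free_pages: int, min_wmark: int, alloc_flags: set) -> bool:
    """ALLOC_HIGH part of __zone_watermark_ok(), heavily simplified."""
    if "ALLOC_HIGH" in alloc_flags:
        min_wmark -= min_wmark // 2   # C: min -= min / 2 (integer division)
    # The kernel refuses when free_pages is at or below min; succeed only above it.
    return free_pages > min_wmark


def freelist_search_order(migratetype: str, alloc_flags: set) -> list:
    """Free lists rmqueue() tries for an order > 0 request (simplified)."""
    lists = []
    if "ALLOC_HARDER" in alloc_flags:
        lists.append("MIGRATE_HIGHATOMIC")   # __rmqueue_smallest(..., MIGRATE_HIGHATOMIC)
    lists.append(migratetype)                # then the normal __rmqueue() path
    return lists
```
</code></pre>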
<h1 id="pf-memalloc">PF_MEMALLOC</h1>
<p><a href="https://www.kernel.org/doc/gorman/html/understand/understand009.html">Mel&rsquo;s book on PF_MEMALLOC</a><br />
<a href="https://lore.kernel.org/patchwork/cover/178099/">Kill PF_MEMALLOC abuse</a><br />
See get_page_from_freelist() and __ac_get_obj():<br />
* page is set pfmemalloc when ALLOC_NO_WATERMARKS was<br />
* necessary to allocate the page. The expectation is<br />
* that the caller is taking steps that will free more<br />
* memory. The caller should avoid the page being used<br />
* for !PFMEMALLOC purposes.<br />
if (alloc_flags &amp; ALLOC_NO_WATERMARKS)<br />
set_page_pfmemalloc(page);</p>
<h2 id="users-of-pf-memalloc">Users of PF_MEMALLOC</h2>
<p>kswapd and __alloc_pages_direct_reclaim() -&gt; __perform_reclaim() set PF_MEMALLOC.<br />
commit c93bdd0e03e848555d144eb44a1f275b871a8dd5<br />
Author: Mel Gorman<br />
Date: Tue Jul 31 16:44:19 2012 -0700<br />
netvm: allow skb allocation to use PFMEMALLOC reserves</p>
<h1 id="pf-swapwrite-swapwrite-originally-means-swap-space-but-now-stands-for-kswapd-or-zone-reclaim-and-migration">PF_SWAPWRITE - swapwrite originally means swap space but now stands for kswapd or zone reclaim and migration?</h1>
<p><a href="https://lore.kernel.org/linux-mm/[email protected]/#r">Swap Migration V4: Overview</a><br />
<a href="https://lwn.net/Articles/157936/">Swap Migration V5: Overview</a><br />
commit 930d915252edda7042c944ed3c30194a2f9fe163<br />
Refs: v2.6.15-1460-g930d915252ed<br />
Author: Christoph Lameter<br />
AuthorDate: Sun Jan 8 01:00:47 2006 -0800<br />
[PATCH] Swap Migration V5: PF_SWAPWRITE to allow writing to swap<br />
Add PF_SWAPWRITE to control a processes permission to write to swap.<br />
- Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd and pdflush<br />
- Set PF_SWAPWRITE flag for kswapd and pdflush</p>
<h2 id="firo">Firo</h2>
<p>The original migration code: <a href="https://lore.kernel.org/linux-mm/[email protected]/">swap_pages</a><br />
It seems I can remove PF_SWAPWRITE from the migration code, since it is not used while migrating pages.<br />
Could it be removed completely?</p>
<h1 id="zone-lists">Zone lists</h1>
<p>struct zonelist node_zonelists[MAX_ZONELISTS];<br />
* [0] : Zonelist with fallback<br />
* [1] : No fallback (__GFP_THISNODE)<br />
Zonelists are built by build_all_zonelists(), which is called from start_kernel(), on memory hotplug, and when /proc/sys/vm/numa_zonelist_order is written (numa_zonelist_order_handler()).<br />
node_zonelists = {{ # Fallback zones: this zonelist includes all zones from all nodes.<br />
_zonerefs = {{<br />
zone = 0xffff88107ffd5d80, # node 0<br />
zone_idx = 2<br />
zone = 0xffff88107ffd56c0, # node 0<br />
zone_idx = 1<br />
zone = 0xffff88107ffd5000, # node 0<br />
zone_idx = 0<br />
zone = 0xffff88207ffd2d80, # Node 1; fallback.<br />
zone_idx = 2<br />
zone = 0x0,<br />
zone_idx = 0<br />
&hellip;}}}<br />
node_zonelists[1] # Nofallback zones</p>
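<p>A sketch of how the two zonelists are consumed. The allocator walks a zonelist from the front and skips zones above the highest zone index the GFP flags allow; for_each_zone_zonelist_nodemask() receives that index as an argument. The model assumes the x86-64 zone indices seen in the dump (0 = DMA, 1 = DMA32, 2 = NORMAL). It also assumes a two-node machine where node 1 only has a NORMAL zone. FALLBACK, NOFALLBACK and candidate_zones() are illustrative names.</p>
<pre><code class="language-python">
```python
# Entries are (node id, zone_idx), in the order shown in the dump above.
FALLBACK = [(0, 2), (0, 1), (0, 0), (1, 2)]   # node_zonelists[0]
NOFALLBACK = [(0, 2), (0, 1), (0, 0)]         # node_zonelists[1], local node only


def candidate_zones(highest_zoneidx: int, thisnode: bool = False) -> list:
    """Zones tried in order; __GFP_THISNODE selects the no-fallback list."""
    zonelist = NOFALLBACK if thisnode else FALLBACK
    # Skip zones above the highest zone the GFP flags permit.
    return [(nid, zidx) for nid, zidx in zonelist if not zidx > highest_zoneidx]
```
</code></pre>
<p>candidate_zones(2) ends with node 1's NORMAL zone as the last resort. candidate_zones(2, thisnode=True) never leaves node 0.</p>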
<h1 id="lqo">LQO</h1>
<p><a href="https://lwn.net/Articles/22909/">Driver porting: low-level memory allocation</a><br />
<a href="https://lwn.net/Articles/627419/">The &ldquo;too small to fail&rdquo; memory-allocation rule</a><br />
<a href="https://lwn.net/Articles/723317/">Revisiting &ldquo;too small to fail&rdquo;</a></p>
<h1 id="high-order-atomic-allocations">High-order atomic allocations</h1>
<p>commit 0aaa29a56e4fb0fc9e24edb649e2733a672ca099<br />
Author: Mel Gorman<br />
Date: Fri Nov 6 16:28:37 2015 -0800<br />
mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand</p>
<h1 id="hot-and-cold-pages-pcp-list">Hot and cold pages, pcp list</h1>
<p><a href="https://lwn.net/Articles/14768/">Hot and cold pages</a><br />
<a href="https://patchwork.kernel.org/patch/10013971/">mm, Remove cold parameter from free_hot_cold_page*</a></p>
<h1 id="fair-zone-allocation-obsoleted-but-see-gfp-write">Fair-zone allocation - obsoleted but see __GFP_WRITE</h1>
<p><a href="https://lore.kernel.org/patchwork/patch/691300/">mm, page_alloc: Remove fair zone allocation policy</a><br />
<a href="https://lwn.net/Articles/576778/">Configurable fair allocation zone policy</a></p>
<h1 id="compaction-and-reclamation">Compaction and reclamation</h1>
<p>Direct reclaim: do_try_to_free_pages(); counted by the vm_event_item ALLOCSTALL.<br />
Kswapd: balance_pgdat(); counted by PAGEOUTRUN.</p>
<h1 id="buddy-memory-system-1963-1965">Buddy memory system 1963 ~ 1965</h1>
<p><a href="http://sci-hub.tw/https://dl.acm.org/citation.cfm?doid=365628.365655">buddy system 1965 a fast storage allocator.</a><br />
<a href="https://en.wikipedia.org/wiki/Buddy_memory_allocation">Buddy memory allocation</a><br />
<a href="https://dl.acm.org/citation.cfm?id=359626">buddy system variants 1977</a><br />
The following is quoted from the 1965 paper above:<br />
The operations involved in obtaining blocks from and returning them to the free<br />
storage lists are very fast, making this scheme particularly appropriate for list structure operations and for other<br />
situations involving many sizes of blocks which are fixed in size and location. This is in fact the storage bookkeeping<br />
method used in the Bell Telephone Laboratories Low-Level List Language.</p>
<h2 id="osidp">OSIDP</h2>
<p>Both fixed and dynamic partitioning schemes have drawbacks. A fixed partitioning<br />
scheme limits the number of active processes and may use space inefficiently if there is<br />
a poor match between available partition sizes and process sizes. A dynamic partitioning<br />
scheme is more complex to maintain and includes the overhead of compaction. An<br />
interesting compromise is the buddy system.</p>
<h2 id="translations">Translations</h2>
<p>free_area; page_is_buddy; PageBuddy(buddy) &amp;&amp; page_order(buddy)<br />
setup_arch-&gt;x86_init.paging.pagetable_init = native_pagetable_init<br />
sparse_init vmemmap_populate # vmemmap<br />
zone_sizes_init free_area_init_core zone_pcp_init<br />
memmap_init_zone # Memory map: a) mark every page reserved and set each pageblock&rsquo;s migratetype to MIGRATE_MOVABLE; b) store the node and zone in page-&gt;flags via set_page_links()</p>
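<p>The split and merge logic behind free_area and page_is_buddy() can be sketched with page frame numbers (pfns). A free block of order k starting at pfn has its buddy at pfn XOR 2^k; this is the value the kernel&rsquo;s __find_buddy_pfn() computes. Allocation splits a larger block and puts the upper halves back, as expand() does. Freeing keeps merging while the buddy is free at the same order. This is a simplified model with three assumptions: sets of pfns instead of per-migratetype lists, a hypothetical MAX_ORDER of 4, and a page count that is a multiple of 8.</p>
<pre><code class="language-python">
```python
MAX_ORDER = 4            # hypothetical: orders 0..3, blocks of 1..8 pages


class Buddy:
    def __init__(self, nr_pages: int):
        # free_area[order] holds the first pfn of each free block of that order.
        self.free_area = [set() for _ in range(MAX_ORDER)]
        top = MAX_ORDER - 1
        for pfn in range(0, nr_pages, 2 ** top):
            self.free_area[top].add(pfn)

    def alloc(self, order: int):
        for cur in range(order, MAX_ORDER):
            if self.free_area[cur]:
                pfn = min(self.free_area[cur])     # deterministic pick for the demo
                self.free_area[cur].remove(pfn)
                while cur > order:                 # split, like expand()
                    cur -= 1
                    self.free_area[cur].add(pfn + 2 ** cur)   # upper half stays free
                return pfn
        return None                                # no block large enough

    def free(self, pfn: int, order: int) -> None:
        while MAX_ORDER - 1 > order:
            buddy = pfn ^ 2 ** order               # buddy block of the same order
            if buddy not in self.free_area[order]: # the page_is_buddy() test
                break
            self.free_area[order].remove(buddy)
            pfn = min(pfn, buddy)                  # merged block starts at the lower pfn
            order += 1
        self.free_area[order].add(pfn)
```
</code></pre>
<p>With Buddy(16), alloc(0) and alloc(1) return pfns 0 and 2. Freeing both blocks again merges everything back into the two order-3 blocks at pfns 0 and 8.</p>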
<h3 id="buddy-init">Buddy init</h3>
<p>mem_init-&gt; memblock_free_all or free_all_bootmem # /* this will put all low memory onto the freelists */</p>
</description>
</item>
<item>
<title>Kernel memory bug - SLAB's 3 lists are corrupted.</title>
<link>http://firoyang.org/howto/bug_mm_1/</link>
<pubDate>Wed, 02 Jan 2019 00:00:00 UTC</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/howto/bug_mm_1/</guid>
<description>
<p>Recently, I was working on a kernel memory bug.</p>
<p><a href="https://apibugzilla.suse.com/show_bug.cgi?id=1118875">https://apibugzilla.suse.com/show_bug.cgi?id=1118875</a><br />
L3: kernel BUG at ../mm/slab.c:2804! bad LRU list and active values in page structs in possible use-after-free</p>
<p>After digging into the vmcore captured by kdump, I found the following.</p>
<h1 id="node-0">Node 0</h1>
<h2 id="partial">Partial</h2>
<p>list page.lru -H 0xffff8801a7c01348 -s page.lru,s_mem,active,slab_cache,flags &gt;n0p.log<br />
n0p -&gt; n0f=0xffff8801a7c01358</p>
<h2 id="full">Full</h2>
<p>list page.lru -H 0xffff8801a7c01358 -s page.lru,s_mem,active,slab_cache,flags &gt;n0f.log<br />
n0f -&gt;<br />
ffffea0006902380<br />
lru = {<br />
next = 0xffffea0080ed53e0,<br />
prev = 0xffffea00405f8ae0<br />
}<br />
s_mem = 0xffff8801a408e000<br />
active = 16<br />
slab_cache = 0xffff8801a7c00400<br />
flags = 6755398367314048<br />
ffffea0080ed53c0<br />
lru = {<br />
next = 0xffffea00422a34e0,<br />
prev = 0xffffea00069023a0<br />
}<br />
s_mem = 0xffff88203b54f000<br />
active = 7<br />
slab_cache = 0xffff8801a7c00400<br />
flags = 24769796876796032<br />
&hellip; -&gt; n1f = 0xffff881107c00358</p>
<h1 id="node-1">Node 1</h1>
<h2 id="partial-1">Partial</h2>
<p>crash&gt; list page.lru -H 0xffff881107c00348 -s page.lru,s_mem,active,slab_cache,flags &gt;n1p.log<br />
n1p -&gt; SLAB ffffea0043ab74e0 -&gt; 0xffff881107c00348 = n1p<br />
SLAB ffffea0043ab74e0&rsquo;s prev pointing to 0xffff881107c00358</p>
<h2 id="full-1">Full</h2>
<p>crash&gt; list page.lru -H 0xffff881107c00358 -s page.lru,s_mem,active,slab_cache,flags &gt;n1f.log<br />
n1f-&gt; SLAB ffffea0043ab74e0 -&gt; &hellip; -&gt; 0xffff881107c00348 = n1p</p>
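<p>In the node 1 finding, one slab page is reachable from both the partial and the full list, while its prev pointer names the full list head. A next/prev consistency check catches this kind of damage; CONFIG_DEBUG_LIST checks that next-&gt;prev points back in a similar way. The sketch walks a modelled list_head chain the way crash&rsquo;s list -H does and reports broken links. The nodes table is a simplified, hypothetical reconstruction. It reuses the addresses above and leaves out the &hellip; part of the full list, so it is not a dump of the real lists.</p>
<pre><code class="language-python">
```python
def check_list(head: int, nodes: dict) -> list:
    """Walk next pointers from head and report inconsistent links."""
    problems = []
    seen = {head}
    cur = head
    while True:
        nxt = nodes[cur]["next"]
        if nodes[nxt]["prev"] != cur:      # every next->prev must point back
            problems.append(f"{nxt:#x}: prev={nodes[nxt]['prev']:#x}, expected {cur:#x}")
        if nxt == head:                    # back at the head: walk complete
            return problems
        if nxt in seen:                    # looping without reaching our head
            problems.append(f"{nxt:#x}: visited twice, never returns to {head:#x}")
            return problems
        seen.add(nxt)
        cur = nxt


# Hypothetical reconstruction of the node 1 lists (lru addresses from crash).
N1P, N1F, SLAB = 0xffff881107c00348, 0xffff881107c00358, 0xffffea0043ab74e0
nodes = {
    N1P: {"next": SLAB, "prev": SLAB},    # partial list head
    SLAB: {"next": N1P, "prev": N1F},     # prev points at the full list head
    N1F: {"next": SLAB, "prev": SLAB},    # full list head also reaches SLAB
}

for name, head in (("n1p", N1P), ("n1f", N1F)):
    print(name, check_list(head, nodes))
```
</code></pre>
<p>Walking from n1p reports the bad prev pointer of the slab page. Walking from n1f never gets back to n1f, which matches the full list ending at n1p in the crash output.</p>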
<p>This issue occurred on a NUMA system with 2 memory nodes.<br />
The SLAB partial and full lists of both node 0 and node 1 were corrupted. After looking into this issue for a few days, I talked to Vlastimil Babka.<br />
He provided a fix for this issue: commit 7810e6781e0fcbca78b91cf65053f895bf59e85f &ldquo;mm, page_alloc: do not break __GFP_THISNODE by zonelist reset&rdquo;.</p>
<p>Now I have a question: why couldn&rsquo;t I solve this issue myself?</p>
</description>
</item>
<item>
<title>memory mapping</title>
<link>http://firoyang.org/cs/mem_map/</link>
<pubDate>Wed, 22 Aug 2018 21:39:41 CST</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/cs/mem_map/</guid>
<description>
<p>This article is about user-space memory mapping; it is not limited to the mmap(2) system call.<br />
<a href="https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.genprogc/understanding_mem_mapping.htm">Understanding memory mapping</a><br />
TLPI: Chapter 49 and LSP: Chapter 8</p>
<h1 id="history">History</h1>
<p>BSD 4.2<br />
1990 SunOS 4.1<br />
<a href="http://bitsavers.trailing-edge.com/pdf/sun/sunos/4.1/800-3846-10A_System_Services_Overview_199003.pdf">A Must-read: The applications programmer gains access to the facilities of the VM system through several sets of system calls.</a></p>
<h1 id="memory-mappings">Memory mappings</h1>
<p><a href="https://landley.net/writing/memory-faq.txt">What are memory mappings? - Landley</a></p>
<blockquote>
<p>A memory mapping is a set of page table entries describing the properties<br />
of a consecutive virtual address range. Each memory mapping has a<br />
start address and length, permissions (such as whether the program can<br />
read, write, or execute from that memory), and associated resources (such<br />
as physical pages, swap pages, file contents, and so on).</p>
</blockquote>
<h1 id="vma">VMA</h1>
<p>A VMA&rsquo;s granularity is PAGE_SIZE: vm_start and vm_end are page-aligned.</p>
<h2 id="split-vma">split_vma</h2>
<p>new_below<br />
commit 5846fc6c31162234e88bdfd91548b1cf0d2cebbd<br />
Author: Andrew Morton<br />
Date: Tue Sep 17 06:35:47 2002 -0700<br />
[PATCH] consolidate the VMA splitting code<br />
The name new_below is confusing. It says whether the new VMA is the part below addr, which implicitly decides where the old VMA ends up.<br />
0 means the old VMA keeps the head part; 1 means the old VMA keeps the tail part.</p>
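<p>The two outcomes of new_below can be checked with a small model of __split_vma(). The new VMA starts as a copy of the old one. new_below decides whether the copy becomes the part below addr or the part from addr upwards. vm_pgoff is advanced for whichever VMA now starts at addr. The Vma class and split_vma() are simplified, hypothetical stand-ins with no anon_vma, file or mempolicy handling, and the sketch assumes 4 KiB pages.</p>
<pre><code class="language-python">
```python
from dataclasses import dataclass

PAGE_SIZE = 4096   # assumption: 4 KiB pages


@dataclass
class Vma:
    start: int      # vm_start
    end: int        # vm_end (exclusive)
    pgoff: int      # vm_pgoff, offset in pages


def split_vma(vma: Vma, addr: int, new_below: int):
    """Return (old, new) after splitting vma at addr."""
    assert addr % PAGE_SIZE == 0 and addr > vma.start and vma.end > addr
    tail_pgoff = vma.pgoff + (addr - vma.start) // PAGE_SIZE
    head = Vma(vma.start, addr, vma.pgoff)
    tail = Vma(addr, vma.end, tail_pgoff)
    if new_below:
        return tail, head   # new VMA is the lower part; old keeps the tail
    return head, tail       # new VMA is the upper part; old keeps the head
```
</code></pre>
<p>For example, splitting [0x10000, 0x14000) at 0x12000 with new_below = 0 leaves the old VMA on [0x10000, 0x12000) and gives the new one pgoff 2.</p>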
<h1 id="release-memory-resources">Release memory resources</h1>
<p>exit_mm exit_mmap</p>
<h1 id="shared-memory-mapping">Shared memory mapping</h1>
<p><a href="https://www.kernel.org/doc/gorman/html/understand/understand015.html">Chapter 12 Shared Memory Virtual Filesystem:</a></p>
<blockquote>
<p>This is a very clean interface that is conceptually easy to understand but it does not help anonymous pages as there is no file backing. To keep this nice interface, Linux creates an artifical file-backing for anonymous pages using a RAM-based filesystem where each VMA is backed by a “file” in this filesystem. Every inode in the filesystem is placed on a linked list called shmem_inodes so that they may always be easily located. This allows the same file-based interface to be used without treating anonymous pages as a special case.</p>
</blockquote>
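<p>Firo: every time you create a shared anonymous mapping via mmap(2), the kernel creates an inode named dev/zero in the hidden shm_mnt filesystem (shmem_zero_setup()).<br />
The name dev/zero is only a name. The inode does not live on the /dev/zero device from drivers/char/mem.c, although a shared mmap of /dev/zero also ends up in shmem_zero_setup(). /dev/shm is a separate, user-visible tmpfs mount rather than the hidden shm_mnt instance, although both are implemented by mm/shmem.c. POSIX shm_open() creates its objects under /dev/shm.</p>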
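<p>This is easy to observe from user space. The sketch below is a Linux-only Python script using the standard mmap module. It creates a MAP_SHARED | MAP_ANONYMOUS mapping and lets a forked child write into it. It then looks up the mapping in /proc/self/maps, where such mappings usually show up as /dev/zero (deleted). The helper names are hypothetical, and the exact maps line depends on the kernel.</p>
<pre><code class="language-python">
```python
import ctypes
import mmap
import os

PAGE = mmap.PAGESIZE


def shared_anon_demo() -> bytes:
    """A forked child writes into a shared anonymous mapping; the parent reads it."""
    buf = mmap.mmap(-1, PAGE, flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
    pid = os.fork()
    if pid == 0:                        # child: write, then exit without cleanup
        buf[:5] = b"hello"
        os._exit(0)
    os.waitpid(pid, 0)
    data = buf[:5]                      # the child's write is visible: same shmem page
    buf.close()
    return data


def maps_line_for(buf: mmap.mmap) -> str:
    """Return the /proc/self/maps line that covers buf (Linux only)."""
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    with open("/proc/self/maps") as maps:
        for line in maps:
            lo, hi = (int(x, 16) for x in line.split()[0].split("-"))
            if addr >= lo and hi > addr:
                return line.strip()
    return ""


if __name__ == "__main__":
    print(shared_anon_demo())
    if os.path.exists("/proc/self/maps"):
        m = mmap.mmap(-1, PAGE, flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
        print(maps_line_for(m))         # usually ends in "/dev/zero (deleted)"
```
</code></pre>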
<h2 id="shared-anonymouse-mappings">Shared anonymouse mappings</h2>
<p><a href="https://lore.kernel.org/patchwork/patch/174306/">vmscan: limit VM_EXEC protection to file pages</a><br />
Someone could take advantage of the reclaim code with mmap(&hellip;, PROT_EXEC, MAP_SHARED|MAP_ANONYMOUS). OOM may occur, because the old code protected such pages from reclaim by putting them back on the active list. Great patch. However, programs running from tmpfs are also penalized.<br />
page_is_file_cache() is narrower than !PageAnon(): shmem/tmpfs pages are not PageAnon but are swap-backed, so they do not count as file cache.<br />
<a href="https://lwn.net/Articles/452035/">ashmem</a><br />
* onset - mmap<br />
do_mmap -&gt; mmap_region -&gt; vma_link -&gt; (__shmem_file_setup) &amp;&amp; __vma_link_file: into i_mmap interval_tree.<br />
* nucleus - shared fault<br />
Read: do_read_fault<br />
Write: do_shared_fault -&gt; shmem_getpage_gfp shmem_add_to_page_cache<br />
WP: do_wp_page -&gt; wp_page_shared or wp_page_reuse<br />
b) IPC using a shared file mapping</p>
<h2 id="history-1">History</h2>
<p>late 70s - IPC: see TLPI: Chapter 45 INTRODUCTION TO SYSTEM V IPC<br />
they (the System V IPC mechanisms) first appeared together in Columbus UNIX, a Bell UNIX for databases and efficient transaction processing<br />
1983 - IPC: see TLPI or Wikipedia&rsquo;s &ldquo;Shared memory&rdquo;.<br />
they landed together in System V, which made them popular in mainstream UNIXes, hence the name</p>
<p>1983 - BSD mmap with shared vs private memory mapping<br />
BSD 4.2: The system supports sharing of data between processes by allowing pages to be mapped into memory. These mapped pages may be shared with other processes or private to the process.</p>
<p>1984 Jan - BSD mmap with file memory mapping support by SunOS<br />
mmap seems to have been first implemented in <a href="http://bitsavers.trailing-edge.com/pdf/sun/sunos/1.1/800-1108-01E_System_Interface_Manual_for_the_Sun_Workstation_Jan84.pdf">SunOS 1.1</a><br />
N.B. This call is not completely implemented In 4.2(BSD).<br />
More sunos docs: <a href="http://bitsavers.trailing-edge.com/pdf/sun/sunos/">http://bitsavers.trailing-edge.com/pdf/sun/sunos/</a></p>
<p>1988<br />
<a href="https://en.wikipedia.org/wiki/Memory-mapped_file#History">SunOS 4 introduced Unix&rsquo;s mmap, which permitted programs &ldquo;to map files into memory.&rdquo;</a><br />
1989<br />
One paper found in OSTEP: <a href="https://courses.cs.washington.edu/courses/cse551/09sp/papers/memory_coherence.pdf">Memory Coherence in Shared Virtual Memory Systems</a></p>
<h2 id="shared-memory-in-kernel">Shared memory in kernel</h2>
<h3 id="initial-version">Initial version</h3>
<p>history: commit 9cb9f18b5d26bf176e13edbc0c248d121217c6b3<br />
Refs: &lt;0.99.10&gt;<br />
Author: Linus Torvalds<br />
AuthorDate: Fri Nov 23 15:09:11 2007 -0500<br />
[PATCH] Linux-0.99.10 (June 7, 1993)<br />
Firo: search &lsquo;shm_swap&rsquo;</p>
<h3 id="ramfs-based">Ramfs based</h3>
<p>history: commit 4d372877c63baaaf4c1c3325cae43f6b9782e59e<br />
Refs: &lt;2.4.0-test13pre3&gt;<br />
Author: Linus Torvalds<br />
AuthorDate: Fri Nov 23 15:40:55 2007 -0500<br />
[&hellip;]<br />
The shmfs cleanup should be unnoticeable except to users who use SAP with<br />
huge shared memory segments, where Christoph Rohlands work not only<br />
makes the code much more readable, it should also make it dependable..<br />
[&hellip;]<br />
- Christoph Rohland: shmfs for shared memory handling</p>
</description>
</item>
<item>
<title>From the Nihilism</title>
<link>http://firoyang.org/philosophy/nihilism/</link>
<pubDate>Wed, 31 Jan 2018 00:00:00 UTC</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/philosophy/nihilism/</guid>
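<description><p>Nihilism is essentially a logical problem. First, a nihilist cannot prove that the meaning he hopes for exists. But that does not rule out the possibility that such a meaning (of life, of hope, of living, of something to strive for) exists. So nihilists hang in this world by a rope woven from human instinct and from a self-awareness that never stops thinking. One careless moment, and the rope may snap, leading to death.</p>
<p>No matter how desperate a nihilist is, he cannot, on a rational level, deny the possibility that the meaning he hopes for exists.</p>
<p>So meaning may exist in this world; the world is not utterly meaningless, because we cannot prove that everything is meaningless.</p>
<p>From my own experience, nihilists do not really want to do nothing. Deep down they feel that this world is not worthy of them, and they cannot give themselves to it wholeheartedly. As I see it, this world, despite thousands of years of human polishing, is still a wilderness, a wasteland, both materially and spiritually, and human beings have been dropped into this wilderness at random.</p>
<p>Nihilists are simply honest thinkers. It is not that nihilists are empty inside. It is the emptiness of the present world, which has not offered them enough meaning, that turns people into nihilists.</p>
<p>Our emptiness comes from our weakness.</p>
<p>An individual&rsquo;s strength is always limited, and the people affected by nihilism are, after all, a minority in this world. So nihilists should unite and use science to find out why the world exists. Many nihilists will die along the way; that is the sorrow of a short human life. Nihilists ought to have lives spanning millions of years, because we so often give up many worldly pursuits. So we may need a community of nihilists spanning several generations to finally reach this goal. Understanding why the world exists would reveal the origin of everything, that is, the very beginning of meaning, and answer every riddle.</p>
<p>At the same time, there may be some meaning, something a creator hopes we will accomplish, waiting for us in the future.</p>
<p>Nihilism is not entirely a pessimistic evil. At least it rejects authority, which gives us more freedom in real life and in our reasoning, and reduces the suffering caused by certain deceptions.</p>
<p>Inevitably, as human beings and as nihilists, we must live well in order to find some meaning, even though deep down we cannot accept that as our meaning. But this is the gift this earth has given me: freedom, and together with it a by-product, constraint. The constraint of the earth that comes with being human.</p>
<p>Nihilists easily ignore their own feelings. Compared with other people, we are more prone to wronging ourselves. In purely rational, logical thinking, the link between reality and thought is severed, and it is easier to sink into a swamp of thought and be unable to get out. The self is a precondition of any meaning, so understanding oneself is essential. The nihilist&rsquo;s stronghold is rational thought, while feelings that come from the outside world are easily overlooked. The self is made up of at least two things: rational thought and the perception of the outside world.</p>
<p>I do not accept that nihilism is purely a subjective, rational problem, although logic helps us sort the problem out. To some extent, it is a problem of objective reality. Seen this way, nihilism is a problem caused jointly by reason and reality.</p>
<p>Reason can easily express a problem through signifiers, and is not even afraid of any meaning, whereas in the vast ocean of the real world a person vanishes without a trace. This even leads us to believe that nihilism is a purely subjective, rational problem, and so to ignore reality, even to ignore the feelings that come from the outside world. Since those outside feelings may be part of our meaning, we should understand our subjective feelings and bring them into a healthy state, like the healthy critical state we aim for in reason. Of course, we do not know how healthy subjective feelings will help us find meaning; we are only in the process of searching for it.</p>
<p>The so-called healthy state, at the level of thought, is rational, critical, and potentially free. What state should subjective feelings reach? First, they should not restrict or constrain the healthy state of reason.</p>
<p>The foundation of our existence is the self-awareness within our reason: who am I, and what is this &ldquo;I&rdquo;? Individual consciousness keeps forming under the influence of the outside world; one could say that individual consciousness is the consciousness of the world. The world itself is also searching for its will. I am myself, and I am also a part of the world. At the same time, the world is full of contradictions, and different consciousnesses influence one another. An individual&rsquo;s own consciousness makes it possible to follow one&rsquo;s own meaning, so one should protect oneself.</p>
<p>In this world everyone pursues their own meaning, perhaps unconsciously, perhaps with a clear goal; either way, other people&rsquo;s meanings affect us. Competition is so fierce that beneath the sunlight this world hides a great deal of evil, which now and then can no longer be contained and leaks out. So maintaining one&rsquo;s own existence is especially important in life; all meaning rests on it. And the unbearable lightness of life now and then becomes the fuse that sets off our wronging of ourselves.</p>
<p>Feelings from reality are truly necessary for keeping an individual&rsquo;s subjective will healthy. This is not nitpicking.</p>
<p>Achieve this, and a true self emerges. This is the inner content that the meaning of life should express: we secure our own freedom and health to the greatest extent, and our personal will is extended and expressed as fully as possible. This is the self we hope to be. Resist every form of oppression.</p>
<p>Search for the meaning of life. For a person living an unexamined life, the goals and meanings he pursues are largely given by the world. They are expressions of the world&rsquo;s will and of his own instincts, rather than his own true intentions. As Griffith says, some people never know what they live for and in the end slowly drift out of this world. He also speaks of being enslaved by a dream.</p>
<p>Compared with pure rational thought, does life deserve a meaning?</p>
<p>With so many grand meanings around, why can&rsquo;t life be given one?</p>
<p>Life should not have only one meaning.</p>
<p>Pursue freedom and an independent will<br />
Resist exploitation, oppression, and enslavement<br />
Resist and escape what goes unnoticed yet is everywhere, pervading this society, silently and imperceptibly deforming people, stunting the free growth of the individual, leaving them too frightened to speak, and making them give up the natural constraints of life itself</p>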
</description>
</item>
<item>
<title>Memory consistency model</title>
<link>http://firoyang.org/cs/consistency_model/</link>
<pubDate>Sat, 16 Dec 2017 15:46:12 CST</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/cs/consistency_model/</guid>
<description>
<p>When we talk about a memory model, we mean the memory consistency model, also called the memory ordering model.</p>
<h1 id="hisotry">History</h1>
<p>1979<br />
<a href="https://www.microsoft.com/en-us/research/uploads/prod/2016/12/How-to-Make-a-Multiprocessor-Computer-That-Correctly-Executes-Multiprocess-Programs.pdf">How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Progranm</a><br />
1987 ~ 1990<br />
<a href="https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf">Linearizability: A Correctness Condition for Concurrent Objects</a><br />
1989<br />
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.8.3766&amp;rep=rep1&amp;type=pdf">processor consistency: CACHE CONSISTENCY AND SEQUENTIAL CONSISTENCY</a><br />
1990<br />
<a href="https://dl.acm.org/citation.cfm?id=325102">Release consistency: Memory consistency and event ordering in scalable shared-memory multiprocessors</a><br />
1991<br />
<a href="https://dl.acm.org/citation.cfm?id=113406">Proving sequential consistency of high-performance shared memories</a><br />
1992<br />
<a href="https://www.gaisler.com/doc/sparcv8.pdf">TSO Sparc v8: A standard memory model called Total Store Ordering (TSO) is defined for SPARC</a><br />
<a href="https://link.springer.com/chapter/10.1007/978-1-4615-3604-8_2">Formal Specification of Memory Models: and two store ordered models TSO and PSO defined by the Sun Microsystem&rsquo;s SPARC architecture.</a></p>
<p>2001 ~ Present<br />
<a href="https://www.youtube.com/watch?v=WUfvvFD5tAA">IA64 memory ordering</a></p>
<h1 id="purposes">Purposes</h1>
<p><a href="https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/14_relaxedReview.pdf">Motivation: hiding latency</a><br />
▪ Why are we interested in relaxing ordering requirements?<br />
- Performance<br />
- Specifically, hiding memory latency: overlap memory accesses with other operations<br />
- Remember, memory access in a cache coherent system may entail much more than<br />
simply reading bits from memory (finding data, sending invalidations, etc.)</p>
<h2 id="why-tso-it-s-because-that-write-buffer-or-store-buffer-is-not-invisible-any-more-for-multiprocessor-https-www-cis-upenn-edu-devietti-classes-cis601-spring2016-sc-tso-pdf">Why TSO? <a href="https://www.cis.upenn.edu/~devietti/classes/cis601-spring2016/sc_tso.pdf">It&rsquo;s because that write buffer or Store buffer is not invisible any more for multiprocessor</a></h2>
<p>TSO abandons SC to allow the use of a FIFO write buffer.<br />
<a href="https://www.cs.utexas.edu/~bornholt/post/memory-models.html">An example: There’s no reason why performing event (2) (a read from B) needs to wait until event (1) (a write to A) completes. They don’t interfere with each other at all, and so should be allowed to run in parallel. See Memory Consistency Models: A Primer</a><br />
Hide the write latency by putting the data in the store buffer.</p>
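<p>The store-buffer argument can be checked with a toy model of the classic store-buffering (SB) litmus test. CPU 0 runs x = 1; r0 = y and CPU 1 runs y = 1; r1 = x. outcomes() explores every interleaving. With tso=False, a store updates memory at once (SC). With tso=True, a store first enters a per-CPU FIFO store buffer that drains at an arbitrary later point, and a load checks its own buffer before memory (store forwarding). This is an illustrative model, not the formal x86-TSO model.</p>
<pre><code class="language-python">
```python
# Store-buffering litmus test: each op is (kind, variable, value-or-register).
PROGRAM = {
    0: [("store", "x", 1), ("load", "y", "r0")],
    1: [("store", "y", 1), ("load", "x", "r1")],
}


def outcomes(tso: bool) -> set:
    """All final (r0, r1) values reachable under SC (tso=False) or a toy TSO."""
    results = set()

    def step(pcs, bufs, mem, regs):
        moved = False
        for cpu in (0, 1):
            prog = PROGRAM[cpu]
            if len(prog) > pcs[cpu]:                      # next instruction of this CPU
                kind, var, arg = prog[pcs[cpu]]
                npcs = pcs[:cpu] + (pcs[cpu] + 1,) + pcs[cpu + 1:]
                if kind == "store" and tso:               # buffer the store
                    nbufs = bufs[:cpu] + (bufs[cpu] + ((var, arg),),) + bufs[cpu + 1:]
                    step(npcs, nbufs, mem, regs)
                elif kind == "store":                     # SC: write memory now
                    step(npcs, bufs, {**mem, var: arg}, regs)
                else:                                     # load; newest buffered value wins
                    value = mem[var]
                    for bvar, bval in bufs[cpu]:
                        if bvar == var:
                            value = bval
                    step(npcs, bufs, mem, {**regs, arg: value})
                moved = True
            if bufs[cpu]:                                 # drain the oldest store (FIFO)
                (var, val), rest = bufs[cpu][0], bufs[cpu][1:]
                nbufs = bufs[:cpu] + (rest,) + bufs[cpu + 1:]
                step(pcs, nbufs, {**mem, var: val}, regs)
                moved = True
        if not moved:                                     # both CPUs done, buffers empty
            results.add((regs["r0"], regs["r1"]))

    step((0, 0), ((), ()), {"x": 0, "y": 0}, {})
    return results


print("SC :", sorted(outcomes(tso=False)))   # SC : [(0, 1), (1, 0), (1, 1)]
print("TSO:", sorted(outcomes(tso=True)))    # TSO: [(0, 0), (0, 1), (1, 0), (1, 1)]
```
</code></pre>
<p>Only the TSO run produces (0, 0). On x86, ruling that outcome out needs an MFENCE (or a locked instruction) between the store and the load.</p>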
<h3 id="why-not-read-write-reordering">Why not read-write reordering?</h3>
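<p>Reordering a read with a later write makes little sense here. The store buffer only delays stores, so a later store cannot become visible before an earlier load has already read its value.</p>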
<h1 id="formal-cause">Formal cause</h1>
<p>Shared memory<br />
Multiprocessor<br />
Memory access<br />
program order<br />
<a href="https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf">Recommened by CAAQA: Observity in SC, TSO, PC: Paragraph Relaxing the Write to Read Program Order in Shared Memory Consistency Models: A Tutorial</a><br />
<a href="http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf">Memory Barriers: a Hardware View for Software Hackers - must read</a><br />
<a href="http://15418.courses.cs.cmu.edu/spring2013/article/41">&lsquo;A Summary of Relaxed Consistency&rsquo; CMU</a><a href="https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/14_relaxedReview.pdf">Slides</a></p>
<h2 id="sc">SC</h2>
<p><a href="https://www.microsoft.com/en-us/research/uploads/prod/2016/12/How-to-Make-a-Multiprocessor-Computer-That-Correctly-Executes-Multiprocess-Programs.pdf">sequential consistency</a><br />
<a href="https://jepsen.io/consistency/models/sequential#formally">Formal of Sequential Consistency by Jepsen</a></p>
<h2 id="tso">TSO</h2>
<p>Total Store Ordering is defined in Appendix K of the SPARC v8 manual.</p>
<h3 id="tso-in-x86">TSO in x86</h3>
<p><a href="https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tphols.pdf">A Better x86 Memory Model: x86-TSO</a><br />
<a href="https://stackoverflow.com/questions/27595595/when-are-x86-lfence-sfence-and-mfence-instructions-required">When are x86 LFENCE, SFENCE and MFENCE instructions required?</a></p>
<h3 id="tso-vs-pc">TSO vs PC:</h3>
<p><a href="http://15418.courses.cs.cmu.edu/spring2013/article/41">&lsquo;A Summary of Relaxed Consistency&rsquo; CMU</a><a href="https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/14_relaxedReview.pdf">Slides</a></p>
<h3 id="tso-and-peterson-s-algorithm">TSO and Peterson&rsquo;s algorithm</h3>
<p><a href="https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/">Who ordered memory fences on an x86?</a><br />
<a href="https://www.cnblogs.com/caidi/p/6708789.html">共同进入与饥饿</a></p>
<h2 id="pc">PC</h2>
<p><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.8.3766&amp;rep=rep1&amp;type=pdf">processor consistency: CACHE CONSISTENCY AND SEQUENTIAL CONSISTENCY</a></p>
<h2 id="wc">WC</h2>
<p><a href="https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/oldquiz/p434-dubois.pdf">weak consistency: Memory access buffering in multiprocessors</a><br />
They distinguish between ordinary shared accesses and synchronization accesses, where the latter are used to control concurrency<br />
between several processes and to maintain the integrity of ordinary shared data.</p>
<h2 id="rc">RC</h2>
<p><a href="https://dl.acm.org/citation.cfm?id=325102">Firo: a must-read: Release consistency: Memory consistency and event ordering in scalable shared-memory multiprocessors</a><br />
<a href="https://docs.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming?redirectedfrom=MSDN#read-acquire-and-write-release-barriers">Must-read: Lockless Programming Considerations for Xbox 360 and Microsoft Windows</a><br />
At the top right of page 6:<br />
Condition 3.1: Conditions for Release Consistency<br />
(A) before an ordinary load or store access is allowed to perform with respect to any other processor,<br />
all previous acquire accesses must be performed, and<br />
(B) before a release access is allowed to perform with<br />
respect to any other processor, all previous ordinary<br />
load and store accesses must be performed, and<br />
(C) special accesses are processor consistent with respect to one another.<br />
<a href="https://preshing.com/20120913/acquire-and-release-semantics/">Acquire and Release Semantics</a></p>
</description>
</item>
<item>
<title>The definitive guide to Linux x86 entries</title>
<link>http://firoyang.org/cs/entry/</link>
<pubDate>Wed, 26 Apr 2017 21:39:41 CST</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/cs/entry/</guid>
<description>
<h1 id="all-entries">All entries</h1>
<p><a href="https://www.kernel.org/doc/Documentation/x86/entry_64.txt">Documentation/x86/entry_64.txt</a></p>
<h1 id="entry-irq">Entry irq</h1>
<p><a href="http://www.lenky.info/archives/2013/03/2245">A fresh look at hardware interrupts on the Linux x86-64 architecture</a></p>
<h2 id="steps-to-handle-intterrupt">Steps to handle an interrupt</h2>
<p>For logical-to-linear address translation, see Intel SDM Vol. 3A, 3.4 Logical and Linear Addresses.<br />
For stack switching when the CPL changes to a more privileged level, see SDM Vol. 3A, 5.8.5 Stack Switching. On a privilege-level change the processor automatically switches to the stack for the new CPL (the SS:ESP pair for that level stored in the TSS).<br />
For more details on stack switching, see Figure 5-13, Stack Switching During an Interprivilege-Level Call.<br />
For fast system calls, see Vol. 3A, 5.8.7 Performing Fast Calls to System Procedures.<br />
For the TSS and TR, see Vol. 3A, 7.2.<br />
For how Linux handles IRQs, see ULK 3rd edition, Chapter 4: Hardware Handling of Interrupts and Exceptions.</p>
<h1 id="entry-exception">Entry exception</h1>
<h2 id="paranoid-entry">paranoid_entry</h2>
<p>Check Documentation/x86/entry_64.txt</p>
<h2 id="error-entry">error_entry</h2>
<p>tglx: commit 0457d99a336be658cea1a5bdb689de5adb3b382d<br />
Author: Andi Kleen <a href="mailto:[email protected]">[email protected]</a><br />
AuthorDate: Tue Feb 12 20:17:35 2002 -0800<br />
Commit: Linus Torvalds <a href="mailto:[email protected]">[email protected]</a><br />
CommitDate: Tue Feb 12 20:17:35 2002 -0800<br />
[PATCH] x86_64 merge: arch + asm</p>
<h1 id="entry-system-calls">Entry system calls</h1>
<p><a href="https://blog.packagecloud.io/eng/2016/04/05/the-definitive-guide-to-linux-system-calls/">The Definitive Guide to Linux System Calls</a></p>
<h2 id="fast-path">Fast path</h2>
<p>commit 21d375b6b34ff511a507de27bf316b3dde6938d9<br />
Author: Andy Lutomirski <a href="mailto:[email protected]">[email protected]</a><br />
Date: Sun Jan 28 10:38:49 2018 -0800<br />
x86/entry/64: Remove the SYSCALL64 fast path</p>
<h2 id="sysenter-vs-syscall">sysenter vs syscall</h2>
<p><a href="https://groups.google.com/forum/#!topic/comp.arch/CjDs4MJCBow%5B1-25%5D">SYSENTER/SYSEXIT vs.SYSCALL/SYSRET</a><br />
<a href="http://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead">Measurements of system call performance and overhead</a><br />
<a href="https://reverseengineering.stackexchange.com/a/16511/16996">AMD vs Intel and syscall vs sysenter</a><br />
<a href="https://www.codeguru.com/cpp/misc/misc/system/article.php/c8223/System-Call-Optimization-with-the-SYSENTER-Instruction.htm">System Call Optimization with the SYSENTER Instruction</a><br />
<a href="http://articles.manugarg.com/systemcallinlinux2_6.html">Sysenter Based System Call Mechanism in Linux 2.6</a></p>
<h2 id="system-call-restart-mechanism-and-orig-eax">system call restart mechanism and ORIG_EAX</h2>
<p><a href="https://lwn.net/Articles/17744/">A new system call restart mechanism</a><br />
<a href="https://lkml.org/lkml/2006/8/29/350">Why set ORIG_EAX(%esp) to -1 in arch/i386/kernel/entry.S:error_code?</a></p>
<h2 id="kernel-implementations">kernel implementations</h2>
<p>arch/x86/include/asm/proto.h<br />
64-bit long mode: syscall; check syscall_init<br />
64-bit compatible kernel: sysenter, syscall, or int 0x80; check __kernel_vsyscall and def_idts<br />
32-bit kernel: int 0x80, sysenter;</p>
<h3 id="64-bit-without-compat-32-compatible-kernel-support">64-bit without COMPAT_32/compatible kernel support</h3>
<p>./int80<br />
[ 730.583700] traps: int80[1697] general protection ip:4000c4 sp:7ffd84b59730 error:402 in int80[400000+1000]<br />
Segmentation fault (core dumped)</p>
<h2 id="x86-64-rcx-and-r10">x86_64 rcx and r10</h2>
<p>Check the x86_64 ABI (Linux conventions) and the <a href="https://www.felixcloutier.com/x86/syscall">x86 syscall instruction</a>: SYSCALL stores the address of the next instruction (the return RIP) in rcx.<br />
In entry_SYSCALL_64, rcx holds the user RIP until it is pushed onto the kernel stack, so it cannot carry an argument; the 4th argument from userspace is therefore passed in r10.<br />
do_syscall_64 dispatches with <code>regs-&gt;ax = sys_call_table[nr](regs-&gt;di, regs-&gt;si, regs-&gt;dx, regs-&gt;r10, regs-&gt;r8, regs-&gt;r9);</code></p>
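<p>A sketch of a raw six-argument system call on x86-64 Linux using GCC/Clang inline assembly; raw_syscall6 is a hypothetical name. It shows why the 4th argument moves from rcx (C calling convention) to r10: SYSCALL itself overwrites rcx with the return RIP and r11 with RFLAGS, so both are listed as clobbers. On other targets the sketch falls back to libc's syscall(2).</p>

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Issue a system call with up to six arguments. */
long raw_syscall6(long nr, long a1, long a2, long a3,
                  long a4, long a5, long a6)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* Arguments 4..6 go in r10, r8, r9. r10 replaces rcx because SYSCALL
     * writes the return RIP into rcx (and RFLAGS into r11). */
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory");
    return ret;              /* raw result: -errno on failure */
#else
    /* Other targets: libc wrapper (returns -1 and sets errno on failure). */
    return syscall(nr, a1, a2, a3, a4, a5, a6);
#endif
}
```

<p>For example, raw_syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0) returns the same value as getpid(); unlike libc's wrapper, the raw path does not translate errors into errno.</p>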
<h2 id="x86-32-asmlinkage">x86_32 asmlinkage</h2>
<p><a href="https://qr.ae/Ti5MJJ">By default gcc passes parameters on the stack for x86-32 arch, so what is it needed for? It&rsquo;s because linux kernel uses -mregparm=3 option which overrides the default behaviour</a><br />
<a href="https://lwn.net/Articles/67175/">-mregparm=3 enabled: Shrinking the kernel with gcc</a><br />
<a href="https://kernelnewbies.org/FAQ/asmlinkage">What is asmlinkage?</a><br />
However, C functions invoked from assembly code must declare their calling convention explicitly, because the argument-passing code on the assembly side is fixed. To show all predefined macros of your compiler, run <code>gcc -dM -E - &lt; /dev/null</code>.</p>
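<p>A sketch of what asmlinkage amounts to on 32-bit x86, assuming GCC's regparm attribute. In arch/x86/include/asm/linkage.h, asmlinkage adds <code>__attribute__((regparm(0)))</code> for CONFIG_X86_32, so functions entered from assembly take all arguments on the stack even though the rest of the kernel is built with -mregparm=3. The demo_asmlinkage macro and demo_sys_add function are hypothetical.</p>

```c
#if defined(__i386__)
/* Force stack-only argument passing regardless of -mregparm=N. */
#define demo_asmlinkage __attribute__((regparm(0)))
#else
/* On x86-64 the register ABI is fixed; nothing to override here. */
#define demo_asmlinkage
#endif

/* An assembly caller would push c, b, a and then call this function. */
demo_asmlinkage long demo_sys_add(long a, long b, long c)
{
    return a + b + c;
}
```

<p>Compiling with <code>gcc -m32 -mregparm=3 -S</code> and comparing a function with and without the attribute shows the arguments being read from the stack in one case and from eax, edx and ecx in the other.</p>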
<h2 id="hacking">Hacking</h2>
<p><a href="https://www.exploit-db.com/papers/13146">Obtain sys_call_table on amd64(x86_64)</a></p>
<h2 id="vdso">vDSO</h2>
<p><a href="http://www.linuxjournal.com/content/creating-vdso-colonels-other-chicken?page=0,0">Creating a vDSO: the Colonel&rsquo;s Other Chicken</a><br />
<a href="http://www.trilithium.com/johan/2005/08/linux-gate/">What is linux-gate.so.1</a><br />
32-bit: glibc -&gt; AT_SYSINFO -&gt; __kernel_vsyscall -&gt; sysenter/syscall/int 0x80<br />
For the calls implemented in the vDSO itself:<br />
glibc -&gt; AT_SYSINFO_EHDR -&gt; vDSO ELF image<br />
<a href="https://lwn.net/Articles/446528/">On vsyscalls and the vDSO</a><br />
<a href="http://blog.tinola.com/?e=5">linux syscalls on x86 64</a></p>
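<p>A sketch of the AT_SYSINFO_EHDR path seen from user space, assuming Linux and glibc's getauxval(3): the auxiliary vector entry holds the address where the kernel mapped the vDSO, and that mapping begins with an ordinary ELF header. find_vdso_elf is a hypothetical name.</p>

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Return the vDSO's ELF header, or NULL if the kernel provided none. */
const unsigned char *find_vdso_elf(void)
{
    unsigned long base = getauxval(AT_SYSINFO_EHDR);
    if (base == 0)
        return NULL;                       /* no vDSO mapped */
    const unsigned char *ehdr = (const unsigned char *)base;
    /* An ELF image starts with the magic bytes 0x7f 'E' 'L' 'F'. */
    if (memcmp(ehdr, ELFMAG, SELFMAG) != 0)
        return NULL;
    return ehdr;
}
```

<p>A full lookup would go on to walk the dynamic symbol table to find entries such as __vdso_clock_gettime; glibc performs that step internally at startup.</p>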
</description>
</item>
<item>
<title>Softirq of Linux Kernel</title>
<link>http://firoyang.org/cs/softirq/</link>
<pubDate>Mon, 03 Apr 2017 13:09:05 CST</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/cs/softirq/</guid>
<description>
<h1 id="the-old-bottom-half">The old bottom half</h1>
<p>ULK 1st: 4.6.6 Bottom Half<br />
History: commit ad09492558ffa7c67f2b58d23d04dce9ffb9b9dd (tag: 0.99)<br />
Author: Linus Torvalds <a href="mailto:[email protected]">[email protected]</a><br />
Date: Fri Nov 23 15:09:07 2007 -0500<br />
[PATCH] Linux-0.99 (December 13, 1992)<br />
Firo: There aren&rsquo;t many useful comments, but the code is very simple. Search for bh_base.</p>
<h1 id="task-queue">task queue</h1>
<p>history: commit 98606bddf430f0a60d21fba93806f4e3c736b170 (tag: 1.1.13)<br />
Author: Linus Torvalds <a href="mailto:[email protected]">[email protected]</a><br />
Date: Fri Nov 23 15:09:30 2007 -0500<br />
Import 1.1.13<br />
+ * New proposed &ldquo;bottom half&rdquo; handlers:<br />
+ * &copy; 1994 Kai Petzke, [email protected]<br />
+ * Advantages:<br />
+ * - Bottom halfs are implemented as a linked list. You can have as many<br />
+ * of them, as you want.<br />
+ * - No more scanning of a bit field is required upon call of a bottom half.<br />
+ * - Support for chained bottom half lists. The run_task_queue() function can be<br />
+ * used as a bottom half handler. This is for example usefull for bottom<br />
+ * halfs, which want to be delayed until the next clock tick.<br />
+ * Problems:<br />
+ * - The queue_task_irq() inline function is only atomic with respect to itself.<br />
+ * Problems can occur, when queue_task_irq() is called from a normal system<br />
+ * call, and an interrupt comes in. No problems occur, when queue_task_irq()<br />
+ * is called from an interrupt or bottom half, and interrupted, as run_task_queue()<br />
+ * will not be executed/continued before the last interrupt returns. If in<br />
+ * doubt, use queue_task(), not queue_task_irq().<br />
+ * - Bottom halfs are called in the reverse order that they were linked into<br />
+ * the list.<br />
+struct tq_struct {<br />
Check ULK2nd 4.7.3.1 Extending a bottom half for task queues, especially tq_context and keventd<br />
The Old Task Queue Mechanism in LKD 3rd edition. Citation from it below.<br />
<a href="https://lwn.net/Articles/11351/">The end of task queues</a></p>
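<p>A user-space model of the task queue idea from the comment above, loosely based on struct tq_struct; the tq_model type and the *_model functions are hypothetical. Queuing pushes onto the head of a linked list, so there is no bit field to scan, and running the queue therefore calls the handlers in the reverse order they were linked, as the comment warns. Locking and the queue_task_irq() atomicity problem are left out.</p>

```c
#include <stddef.h>

/* Hypothetical task-queue entry (loosely after struct tq_struct). */
struct tq_model {
    struct tq_model *next;
    void (*routine)(void *);
    void *data;
};

/* Link a task at the head of the list: O(1), no bit field to scan. */
void queue_task_model(struct tq_model **list, struct tq_model *task)
{
    task->next = *list;
    *list = task;
}

/* Detach the list, then call every routine. Because tasks were pushed at
 * the head, the most recently queued task runs first. */
void run_task_queue_model(struct tq_model **list)
{
    struct tq_model *t = *list;
    *list = NULL;
    while (t != NULL) {
        struct tq_model *next = t->next;   /* routine may requeue t */
        t->routine(t->data);
        t = next;
    }
}

/* Demo routine: record the int that data points to. */
static int tq_log[8];
static int tq_log_len;

void log_task_id(void *data)
{
    if (tq_log_len < 8)
        tq_log[tq_log_len++] = *(int *)data;
}
```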
<h1 id="softirq">Softirq</h1>
<p><a href="http://www.cs.unca.edu/brock/classes/Spring2013/csci331/notes/paper-1130.pdf">I’ll Do It Later: Softirqs, Tasklets, Bottom Halves, Task Queues, Work Queues and Timers</a><br />
* Softirqs do not nest on the same CPU, but local_bh_disable() can be taken recursively:<br />
adding SOFTIRQ_OFFSET to current-&gt;preempt_count also disables preemption of the current process.<br />
* Hard interrupts stay enabled while softirqs run; softirq handlers cannot sleep.<br />
* not per-CPU</p>
<h1 id="occassions-of-softirq">Occasions of Softirq</h1>
<p>irq_exit()<br />
When softirqs are re-enabled with local_bh_enable()/spin_unlock_bh(), and in code that explicitly checks for and executes pending softirqs (network stack, block I/O).<br />
ksoftirqd</p>
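<p>A user-space model of how pending softirqs are tracked and run, with hypothetical *_model names: raising a softirq only sets a bit in the CPU's pending mask, and whichever of the occasions above runs softirqs walks the mask from the lowest number up. The kernel's __do_softirq() also restarts a bounded number of times and hands the remainder to ksoftirqd; that part is omitted.</p>

```c
#include <stdint.h>

#define NR_SOFTIRQS_MODEL 10   /* assumed; recent kernels define 10 */

typedef void (*softirq_fn)(int nr);

static uint32_t softirq_pending_model;           /* one CPU's pending bits */
static softirq_fn softirq_vec_model[NR_SOFTIRQS_MODEL];

void open_softirq_model(int nr, softirq_fn fn) { softirq_vec_model[nr] = fn; }

/* Raising only marks the softirq pending; nothing runs yet. */
void raise_softirq_model(int nr) { softirq_pending_model |= UINT32_C(1) << nr; }

/* One pass of the handler loop: lowest softirq number first. */
void do_softirq_model(void)
{
    uint32_t pending = softirq_pending_model;
    softirq_pending_model = 0;          /* handlers may raise new ones */
    for (int nr = 0; pending != 0; nr++, pending >>= 1)
        if ((pending & 1) && softirq_vec_model[nr])
            softirq_vec_model[nr](nr);
}

/* Demo handler: remember the order in which softirqs ran. */
static int softirq_order[NR_SOFTIRQS_MODEL];
static int softirq_ran;

void record_softirq(int nr)
{
    if (softirq_ran < NR_SOFTIRQS_MODEL)
        softirq_order[softirq_ran++] = nr;
}
```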
<h1 id="tasklet">Tasklet</h1>
<p>History: commit 6cc120a8e71a8d124bf6411fc6e730a884b82701 (tag: 2.3.43pre7)<br />
Author: Linus Torvalds <a href="mailto:[email protected]">[email protected]</a><br />
Date: Fri Nov 23 15:30:52 2007 -0500<br />
Import 2.3.43pre7<br />
+ Tasklets &mdash; multithreaded analogue of BHs.<br />
+ Main feature differing them of generic softirqs: tasklet<br />
+ is running only on one CPU simultaneously.<br />
+ Main feature differing them of BHs: different tasklets<br />
+ may be run simultaneously on different CPUs.<br />
+ Properties:<br />
+ * If tasklet_schedule() is called, then tasklet is guaranteed<br />
+ to be executed on some cpu at least once after this.<br />
+ * If the tasklet is already scheduled, but its excecution is still not<br />
+ started, it will be executed only once.<br />
+ * If this tasklet is already running on another CPU (or schedule is called<br />
+ from tasklet itself), it is rescheduled for later.<br />
+ * Tasklet is strictly serialized wrt itself, but not<br />
+ wrt another tasklets. If client needs some intertask synchronization,<br />
+ he makes it with spinlocks.</p>
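<p>A model of the two state bits behind the properties listed in the commit, using C11 atomics; the names are hypothetical stand-ins for TASKLET_STATE_SCHED and TASKLET_STATE_RUN. Setting SCHED with an atomic fetch-or makes a second tasklet_schedule() a no-op while the tasklet is still pending, and the RUN bit serializes a tasklet with itself: if it is already running on another CPU, the handler skips it so it can be retried later.</p>

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for TASKLET_STATE_SCHED and TASKLET_STATE_RUN. */
#define TASKLET_SCHED_BIT 1u
#define TASKLET_RUN_BIT   2u

struct tasklet_model {
    atomic_uint state;
    int runs;            /* counts calls of the tasklet function */
};

/* Returns true if this call queued the tasklet, false if already pending. */
bool tasklet_schedule_model(struct tasklet_model *t)
{
    unsigned old = atomic_fetch_or(&t->state, TASKLET_SCHED_BIT);
    return (old & TASKLET_SCHED_BIT) == 0;
}

/* What the softirq handler does for one queued tasklet. Returns false if
 * the tasklet was running elsewhere and has to be retried later. */
bool tasklet_action_model(struct tasklet_model *t)
{
    unsigned old = atomic_fetch_or(&t->state, TASKLET_RUN_BIT);
    if (old & TASKLET_RUN_BIT)
        return false;                            /* serialized wrt itself */
    /* Clear SCHED before running, so a tasklet_schedule() from now on
     * queues it again and it runs at least once more. */
    atomic_fetch_and(&t->state, ~TASKLET_SCHED_BIT);
    t->runs++;                                   /* the tasklet function */
    atomic_fetch_and(&t->state, ~TASKLET_RUN_BIT);
    return true;
}
```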
</description>
</item>
<item>
<title>Softirq of Linux Kernel</title>
<link>http://firoyang.org/dark_ages/softirq/</link>
<pubDate>Mon, 03 Apr 2017 13:09:05 CST</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/dark_ages/softirq/</guid>
<description>
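<h2>softirq</h2>
<p>The same softirq can run simultaneously on different CPUs, so softirq handlers must be reentrant.<br />
* Softirqs do not nest on the same CPU, but local_bh_disable() can be taken recursively:<br />
adding SOFTIRQ_OFFSET to current-&gt;preempt_count also disables preemption of the current process.<br />
* Hard interrupts stay enabled while softirqs run; softirq handlers cannot sleep.<br />
* not per-CPU</p>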
<h2 id="tasklet-and-kernel-timer-is-based-on-softirq">Tasklets and kernel timers are based on softirq</h2>
<p>Adding a new softirq requires recompiling the kernel, so trying a tasklet instead is a good option.<br />
Two tasklets of the same type are not allowed to run at the same time, even on different processors.<br />
* First of all, it&rsquo;s a conglomerate of mostly unrelated jobs,<br />
which run in the context of a randomly chosen victim<br />
w/o the ability to put any control on them. &ndash;Thomas Gleixner</p>
<p>Unlike other softirqs, a given tasklet runs on only one CPU core at a time.<br />
spin_lock_bh() protects more than spin_lock(): it also disables softirqs on the local CPU.</p>
<h3>Time of softirq</h3>
<p>* following a hardirq, in irq_exit()<br />
* when softirqs are re-enabled with local_bh_enable()/spin_unlock_bh(), and in code that explicitly checks for and executes pending softirqs (network stack, block I/O)<br />
* ksoftirqd</p>
<h3>Tasklet</h3>
<p>A tasklet is like a workqueue, a softirq like a kthread. A neat analogy, isn&rsquo;t it?<br />
__tasklet_schedule() queues the tasklet on tasklet_vec.tail, a per-CPU variable of one CPU,<br />
which guarantees that only one CPU executes it at any moment.</p>
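<h1>FAQ</h1>
<h2>When to save irq rather than just disable irq</h2>
<p>local_irq_disable()/local_irq_enable() suit code paths that are known to run with interrupts enabled.<br />
local_irq_save(flags)/local_irq_restore(flags) suit code paths that may already run with interrupts disabled: restore puts back the caller&rsquo;s state instead of unconditionally re-enabling interrupts.</p>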
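<p>A user-space model of why the saved flags matter, with a single boolean standing in for EFLAGS.IF on the local CPU; all names are hypothetical. A helper that uses disable/enable re-enables interrupts behind the back of a caller that had them off, while save/restore returns to whatever state the caller had.</p>

```c
#include <stdbool.h>

static bool irqs_enabled = true;   /* stands in for EFLAGS.IF on this CPU */

static void irq_disable_model(void) { irqs_enabled = false; }
static void irq_enable_model(void)  { irqs_enabled = true; }

/* Like local_irq_save(): remember the current state, then disable. */
#define irq_save_model(flags) \
    do { (flags) = irqs_enabled; irqs_enabled = false; } while (0)
/* Like local_irq_restore(): go back to exactly the saved state. */
#define irq_restore_model(flags) \
    do { irqs_enabled = (flags); } while (0)

/* Safe from any context: leaves the caller's state untouched. */
void helper_with_save(void)
{
    bool flags;
    irq_save_model(flags);
    /* ... critical section ... */
    irq_restore_model(flags);
}

/* Only correct if the caller is known to run with interrupts enabled. */
void helper_with_disable(void)
{
    irq_disable_model();
    /* ... critical section ... */
    irq_enable_model();
}
```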
<h2>What about nested irqs?</h2>
<p><a href="http://lwn.net/Articles/380937/">http://lwn.net/Articles/380937/</a></p>
<p><a href="http://thread.gmane.org/gmane.linux.kernel/1152658">Dealing with PF_MEMALLOC in softirq</a></p>
</description>
</item>
<item>
<title>x86 interrupt and exception</title>
<link>http://firoyang.org/cs/event/</link>
<pubDate>Mon, 03 Apr 2017 13:02:12 CST</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/cs/event/</guid>
<description>
<h1 id="events">Events</h1>
<p>Interrupts: asynchronous (passively received), external<br />
Exceptions: synchronous (actively detected), internal<br />
Software interrupts: traps; int/int3, into, bound.<br />
IPI<br />
<a href="https://www.youtube.com/watch?v=-pehAzaP1eg">IRQs: the Hard, the Soft, the Threaded and the Preemptible</a><br />
<a href="https://www.youtube.com/watch?v=YE8cRHVIM4E">How Dealing with Modern Interrupt Architectures can Affect Your Sanity</a></p>
<h1 id="stack-management">stack management</h1>
<p><a href="https://www.kernel.org/doc/html/latest/x86/kernel-stacks.html">x86_64 IST Stacks in kernel</a><br />
6.14.4 Stack Switching in IA-32e Mode<br />
irq_stack_union</p>
<h2 id="backtrace">backtrace</h2>
<p>commit a2bbe75089d5eb9a3a46d50dd5c215e213790288<br />
x86: Don&rsquo;t use frame pointer to save old stack on irq entry<br />
/* Save previous stack value */<br />
movq %rsp, %rsi<br />
&hellip;<br />
2: /* Store previous stack value */<br />
pushq %rsi<br />
<a href="https://lore.kernel.org/patchwork/patch/736894/">Firo: end of EOI; x86/dumpstack: make stack name tags more comprehensible</a></p>
<h1 id="concurrency-nested">Concurrency, nested?</h1>
<h2 id="mask-exception">Mask exception</h2>
<p>RF in EFLAGS for masking #DB<br />
<a href="https://stackoverflow.com/a/1581729/1025001">Does sti/cli affect software interrupt</a></p>
<h2 id="irq-nested">irq nested?</h2>
<p><a href="http://lwn.net/Articles/380937/">Prevent nested interrupts when the IRQ stack is near overflowing v2</a><br />
<a href="http://www.lenky.info/archives/2013/03/2245">A fresh look at hardware interrupts on the Linux x86-64 architecture</a></p>
<h3 id="firo-clear-the-flags-for-pf-through-interrupt-gate">Firo: clear the flags for PF through interrupt gate</h3>
<p>v3a: 6.12.1 Exception- or Interrupt-Handler Procedures<br />
6.12.1.2 Flag Usage By Exception- or Interrupt-Handler Procedure</p>
<h2 id="synchronization">synchronization</h2>
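<p>local_irq_disable()/local_irq_enable() suit code paths that are known to run with interrupts enabled.<br />
local_irq_save(flags)/local_irq_restore(flags) suit code paths that may already run with interrupts disabled: restore puts back the caller&rsquo;s state instead of unconditionally re-enabling interrupts.</p>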
<h2 id="in-interrupt">in_interrupt</h2>
<pre><code>static inline void tick_irq_exit(void)
{
#ifdef CONFIG_NO_HZ_COMMON
    int cpu = smp_processor_id();

    /* Make sure that timer wheel updates are propagated */
    if ((idle_cpu(cpu) &amp;&amp; !need_resched()) || tick_nohz_full_cpu(cpu)) {
        if (!in_interrupt())
            tick_nohz_irq_exit();
    }
#endif
}

/*
 * Exit an interrupt context. Process softirqs if needed and possible:
 */
void irq_exit(void)
{
#ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
    local_irq_disable();
#else
    lockdep_assert_irqs_disabled();
#endif
    account_irq_exit_time(current);
    preempt_count_sub(HARDIRQ_OFFSET);
    if (!in_interrupt() &amp;&amp; local_softirq_pending())
        invoke_softirq();

    tick_irq_exit();
</code></pre>
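<p>A user-space model of the in_interrupt() test in irq_exit() above. The shift and mask values follow include/linux/preempt.h of roughly v4.x/v5.x kernels but are assumptions here, and the NMI bits are left out; all *_model names are hypothetical. in_interrupt() is true whenever the hardirq or softirq field of preempt_count is non-zero (the softirq field also counts local_bh_disable() sections), so softirqs only run when the exiting irq did not interrupt another irq or a softirq.</p>

```c
#include <stdbool.h>

/* Model only: field layout assumed from include/linux/preempt.h. */
#define SOFTIRQ_SHIFT   8
#define HARDIRQ_SHIFT   16
#define SOFTIRQ_MASK    (0xffu << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK    (0x0fu << HARDIRQ_SHIFT)
#define SOFTIRQ_OFFSET  (1u << SOFTIRQ_SHIFT)
#define HARDIRQ_OFFSET  (1u << HARDIRQ_SHIFT)

static unsigned preempt_count_model;   /* one CPU's preempt_count */
static unsigned softirq_pending_mask;  /* non-zero: softirqs are pending */
static int softirq_batches;            /* how many times softirqs ran */

static bool in_interrupt_model(void)
{
    return (preempt_count_model & (HARDIRQ_MASK | SOFTIRQ_MASK)) != 0;
}

static void invoke_softirq_model(void)
{
    preempt_count_model += SOFTIRQ_OFFSET;  /* "serving softirq" */
    softirq_pending_mask = 0;               /* run every pending handler */
    softirq_batches++;
    preempt_count_model -= SOFTIRQ_OFFSET;
}

void irq_enter_model(void) { preempt_count_model += HARDIRQ_OFFSET; }

void irq_exit_model(void)
{
    preempt_count_model -= HARDIRQ_OFFSET;
    /* Same test as irq_exit(): skip if still inside a hardirq or if this
     * irq interrupted a softirq (or a local_bh_disable() section). */
    if (!in_interrupt_model() && softirq_pending_mask)
        invoke_softirq_model();
}
```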
<h1 id="exceptions">Exceptions</h1>
<p><a href="http://wiki.osdev.org/Exceptions">Exceptions</a><br />
related code:<br />
do_nmi do_int3 debug_stack_usage_inc, debug_idt_descr, debug_idt_table,</p>
<h2 id="faults-a-fault-is-an-exception-that-can-generally-be-corrected-and-that-once-corrected-allows-the-program">Faults</h2>
<p>A fault is an exception that can generally be corrected and that, once corrected, allows the program
to be restarted with no loss of continuity. When a fault is reported, the processor restores the machine state to
the state prior to the beginning of execution of the faulting instruction. The return address (saved contents of
the CS and EIP registers) for the fault handler points to the faulting instruction, rather than to the instruction
following the faulting instruction.</p>
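<p>A user-space sketch of the fault property described above, assuming Linux and POSIX signals; fix_up_page and fault_restart_demo are hypothetical names. A store to a read-only page raises #PF, the kernel delivers SIGSEGV, and the handler makes the page writable and returns. Because the saved instruction pointer still points at the faulting store, the store is re-executed and now succeeds. Calling mprotect() from a signal handler is not formally async-signal-safe, although the trick is widely used.</p>

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *fault_page;
static long fault_page_size;
static volatile sig_atomic_t fault_count;

/* Make the page writable and return; the faulting store then restarts. */
static void fix_up_page(int sig, siginfo_t *si, void *uctx)
{
    (void)uctx;
    char *addr = si->si_addr;
    if (addr < fault_page || addr >= fault_page + fault_page_size) {
        signal(sig, SIG_DFL);            /* not ours: re-fault and die */
        return;
    }
    fault_count++;
    mprotect(fault_page, (size_t)fault_page_size, PROT_READ | PROT_WRITE);
}

/* Store to a read-only page; returns the value read back afterwards. */
int fault_restart_demo(void)
{
    fault_page_size = sysconf(_SC_PAGESIZE);
    fault_page = mmap(NULL, (size_t)fault_page_size, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (fault_page == MAP_FAILED)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fix_up_page;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);        /* some systems report SIGBUS */

    ((volatile char *)fault_page)[0] = 42;   /* faults once, then completes */
    int value = ((volatile char *)fault_page)[0];
    munmap(fault_page, (size_t)fault_page_size);
    return value;
}
```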
<h2 id="traps-a-trap-is-an-exception-that-is-reported-immediately-following-the-execution-of-the-trapping-instruction">Traps</h2>
<p>A trap is an exception that is reported immediately following the execution of the trapping instruction.
Traps allow execution of a program or task to be continued without loss of program continuity. The return
address for the trap handler points to the instruction to be executed after the trapping instruction.</p>
<h2 id="aborts-an-abort-is-an-exception-that-does-not-always-report-the-precise-location-of-the-instruction-causing">Aborts</h2>
<p>An abort is an exception that does not always report the precise location of the instruction causing
the exception and does not allow a restart of the program or task that caused the exception. Aborts are used to
report severe errors, such as hardware errors and inconsistent or illegal values in system tables.</p>
<h2 id="triggering-a-gp-exception">Triggering a #GP exception</h2>
<p>exception_GP_trigger.S</p>
<h2 id="exeception-init">Exception init</h2>
<p>Related code:<br />
idt_setup_early_traps #===&gt; idt_table: ist=0; DB, BP<br />
idt_setup_early_pf #===&gt; idt_table: PF ist=0;<br />
trap_init, idt_setup_traps #===&gt; idt_table: ist=0; DE, 0x80 &hellip; etc.<br />
trap_init-&gt;cpu_init, idt_setup_ist_traps #===&gt; idt_table: ist=1; DB, NMI, BP, DF, MC;<br />
x86_init.irqs.trap_init #===&gt; if !KVM, noop<br />
idt_setup_debugidt_traps #===&gt; debug_idt_table, check debug stack; INTG; #DB debug; #BP int; check arch/x86/entry/entry_64.S</p>
<h1 id="interrupt">Interrupt</h1>
<p>If an interrupt arrives while the CPU is in user mode, the kernel checks on the return path to user space whether a reschedule is needed, so a context switch may happen there.<br />
The Interrupt Descriptor Table (IDT) is a data structure used by the x86 architecture to implement an interrupt vector table.</p>
<h2 id="hardware-interrupts">Hardware interrupts</h2>
<p>are used by devices to communicate that they require attention from the operating system.<br />
more details in init_IRQ() or set_irq() in driver.</p>
<h2 id="software-interrupt">software interrupt</h2>
<p>More details in trap_init().<br />
* Exception or trap:<br />
caused by an exceptional condition in the processor itself, e.g. a divide-by-zero (#DE).<br />
* Special instruction, for example INT 0x80:<br />
an instruction in the instruction set that causes an interrupt when it is executed.</p>
<h2 id="irq-line-number-vs-interrupt-vector">IRQ line number vs interrupt vector</h2>
<p>Sample output of <code>cat /proc/interrupts</code>:</p>
<pre><code>           CPU0       CPU1       CPU2       CPU3
  0:         21          0          0          0  IR-IO-APIC    2-edge      timer
</code></pre>
<p>See SDM Vol. 3A Chapter 6 and ULK 3rd edition Chapter 4, Interrupt vectors.<br />
The 0 in /proc/interrupts is an IRQ line number,<br />
while the 0 for Divide Error is an interrupt vector.</p>
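<p>A small hypothetical parser for a /proc/interrupts data line, to make the distinction concrete: the number before the colon is the Linux IRQ number, not a vector in the IDT. The function name and summing the per-CPU counts are illustrative choices; named rows such as NMI: or LOC: are rejected.</p>

```c
#include <stdlib.h>

/* Parse "  0:   21   0   0   0  IR-IO-APIC 2-edge timer". Returns the IRQ
 * number, or -1 for named rows / malformed input, and stores the interrupt
 * count summed over the first ncpus columns in *total. */
long parse_irq_line(const char *line, int ncpus, unsigned long long *total)
{
    char *end;
    long irq = strtol(line, &end, 10);   /* skips leading blanks */
    if (end == line || *end != ':')
        return -1;
    const char *p = end + 1;
    *total = 0;
    for (int cpu = 0; cpu < ncpus; cpu++) {
        unsigned long long n = strtoull(p, &end, 10);
        if (end == p)
            return -1;                   /* fewer columns than ncpus */
        *total += n;
        p = end;
    }
    return irq;
}
```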
<h2 id="interrupt-init">Interrupt init</h2>
<p>early_irq_init = alloc NR_IRQS_LEGACY irq_desc; - 16 #===&gt; [ 0.000000] NR_IRQS: 65792, nr_irqs: 1024, preallocated irqs: 16<br />
init_IRQ()-&gt;x86_init.irqs.intr_init=native_init_IRQ #===&gt; external interrupt init;<br />
pre_vector_init = init_ISA_irqs #===&gt; 1) legacy_pic-&gt;init(0); init 8259a; 2) link irq_desc in irq_desc_tree with flow handle and chip.<br />
idt_setup_apic_and_irq_gates #===&gt; apic normal(from 32) and system interrupts;</p>
<h2 id="affinity">affinity</h2>
<p>root@snow:/tmp# cat x.sh<br />
echo 1 &gt; /proc/irq/129/smp_affinity<br />
sudo trace-cmd record -p function_graph --max-graph-depth 70 -g __irq_set_affinity -c -F ./x.sh<br />
__irq_set_affinity msi_domain_set_affinity intel_ir_set_affinity apic_set_affinity</p>
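<p>A sketch of how a set of CPUs maps to the hexadecimal mask written to smp_affinity: bit k set allows CPU k, so the <code>echo 1</code> above restricts IRQ 129 to CPU0. affinity_mask_string is a hypothetical helper that handles at most 64 CPUs; on larger machines the kernel prints masks as comma-separated 32-bit groups.</p>

```c
#include <stdint.h>
#include <stdio.h>

/* Format the smp_affinity mask for the given CPU numbers into buf.
 * Returns the snprintf() result, or -1 for a CPU outside 0..63. */
int affinity_mask_string(const int *cpus, int n, char *buf, size_t len)
{
    uint64_t mask = 0;
    for (int i = 0; i < n; i++) {
        if (cpus[i] < 0 || cpus[i] >= 64)
            return -1;
        mask |= UINT64_C(1) << cpus[i];  /* bit k <=> CPU k */
    }
    return snprintf(buf, len, "%llx", (unsigned long long)mask);
}
```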
<p>interrupt balancing<br />
Interrupts not distributed as specified in smp_affinity: <a href="https://www.suse.com/support/kb/doc/?id=000018837">https://www.suse.com/support/kb/doc/?id=000018837</a><br />
De-mystifying interrupt balancing: irqbalance: <a href="https://www.youtube.com/watch?v=hjMWVrqrt2U">https://www.youtube.com/watch?v=hjMWVrqrt2U</a></p>
<h1 id="ipi">IPI</h1>
<p>commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4<br />
Author: Alex Shi <a href="mailto:[email protected]">[email protected]</a><br />
Date: Thu Jun 28 09:02:23 2012 +0800<br />
x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR<br />
ERROR_APIC_VECTOR 0xfe<br />
RESCHEDULE_VECTOR 0xfd<br />
CALL_FUNCTION_VECTOR 0xfc<br />
CALL_FUNCTION_SINGLE_VECTOR 0xfb<br />
THERMAL_APIC_VECTOR 0xfa<br />
THRESHOLD_APIC_VECTOR 0xf9<br />
REBOOT_VECTOR 0xf8</p>
<h1 id="history">History</h1>
<p><a href="https://people.cs.clemson.edu/~mark/interrupts.html">history of interrupts</a><br />
<a href="https://virtualirfan.com/history-of-interrupts">Another History of interrupts with video</a></p>
</description>
</item>
<item>
<title>Scheduling in operating system</title>
<link>http://firoyang.org/cs/sched_/</link>
<pubDate>Wed, 29 Mar 2017 10:49:04 CST</pubDate>
<author>Firo Yang</author>
<guid>http://firoyang.org/cs/sched_/</guid>
<description>
<h1 id="scheduling">scheduling</h1>
<p><a href="https://en.wikipedia.org/wiki/Scheduling_(computing)">Scheduling (computing)</a></p>
<h1 id="context-switch">Context switch</h1>
<p><a href="https://www.maizure.org/projects/evolution_x86_context_switch_linux/index.html">Evolution of the x86 context switch in Linux</a><br />
<a href="https://lwn.net/Articles/520227/">Al Viro&rsquo;s new execve/kernel_thread design</a><br />
commit 0100301bfdf56a2a370c7157b5ab0fbf9313e1cd<br />
Author: Brian Gerst <a href="mailto:[email protected]">[email protected]</a><br />
Date: Sat Aug 13 12:38:19 2016 -0400<br />
sched/x86: Rewrite the switch_to() code<br />
<a href="https://stackoverflow.com/questions/15019986/why-does-switch-to-use-pushjmpret-to-change-eip-instead-of-jmp-directly/15024312">Why does switch_to use push+jmp+ret to change EIP, instead of jmp directly?</a></p>
<h1 id="reference">Reference</h1>
<p>Process scheduling in Linux &ndash; Volker Seeker from University of Edinburgh<br />
<a href="https://tampub.uta.fi/bitstream/handle/10024/96864/GRADU-1428493916.pdf">A complete guide to Linux process scheduling</a><br />
<a href="https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt">https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt</a><br />
<a href="https://helix979.github.io/jkoo/post/os-scheduler/">JINKYU KOO&rsquo;s Linux kernel scheduler</a></p>
<p><a href="http://www.joelfernandes.org/linuxinternals/2016/03/20/tif-need-resched-why-is-it-needed.html">TIF_NEED_RESCHED: why is it needed</a></p>
<h1 id="latency">Latency</h1>
<p><a href="https://lwn.net/Articles/404993/">Improving scheduler latency</a></p>
<h1 id="general-runqueues">General runqueues</h1>
<p>static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);<br />