README - Lock Granularity and Performance

Objectives:
	1. Demonstrate the ability to recognize bottlenecks on large data structures
	2. Experience partitioning a serialized resource to improve parallelism
	3. Experience basic performance measurements and instrumentation
	4. Experience execution profiling
	5. Experience finding, installing, and exploiting new development tools

Included files:
SortedList.h - header file declaring the linked-list interface: insert, delete, lookup, and length (see the interface sketch after this list)
SortedList.c - implementation of the linked-list operations declared above
lab2_list.c - implements command line options including --threads, --iterations, --yield, --sync, --lists
Makefile - file to build deliverable programs
lab_2b_list_x.csv (x=0,1,2,3) - generated csv data
lab_2b_list.csv - generated csv data
lab2b_add.csv - add data from part A
lab2b_x.png (x=1,2,3,4,5) - graphs for throughput, mutex timing, lists performance
lab2_x.gp (x=1,2,3,4) - gnuplot scripts for generating the above graphs with the above csv data
pprof and libprofiler.so.0 - files used by google performance tools to generate profiles
profile_mutex - profile generated by google performance tools to analyze CPU usage of lab2_list with mutex sync
profile_spin - profile generated by google performance tools to analyze CPU usage of lab2_list with spin-lock sync
README.txt - this file
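
For reference, the SortedList interface is a doubly linked sorted list keyed by C strings. A sketch along these lines is what the exerciser links against (the exact field and function names below are assumptions based on the lab handout, not copied verbatim from the shipped SortedList.h):

	/* Sketch of the SortedList interface; names are assumed. */
	struct SortedListElement {
		struct SortedListElement *prev;
		struct SortedListElement *next;
		const char *key;	/* a NULL key can mark the list head */
	};
	typedef struct SortedListElement SortedList_t;
	typedef struct SortedListElement SortedListElement_t;

	void SortedList_insert(SortedList_t *list, SortedListElement_t *element);
	int SortedList_delete(SortedListElement_t *element);
	SortedListElement_t *SortedList_lookup(SortedList_t *list, const char *key);
	int SortedList_length(SortedList_t *list);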

This lab was tested and confirmed to be working on SEASnet lnxsrv09.

*******
files "profile_spin" and "profile_mutex" contain the profiles by google-pprof for spin and mutex respectively.
*******
____________________________

Question 2.3.1 - Cycles in the basic implementation
Q: Where do you believe most of the cycles are spent in the 1 and 2-thread tests (for both add and list)?  Why do you believe these to be the most expensive parts of the code?

A: In the low-thread add tests, most of the cycles are spent in the lock, since the add itself is trivial work. In the low-thread list tests, most of the cycles are spent in the list operations themselves, since traversing a sorted list is comparatively work-intensive.

Q: Where do you believe most of the time/cycles are being spent in the high-thread spin-lock tests?

A: In the high-thread spin-lock tests, most of the time is spent spinning in the locks, since the lock is contended far more often and every thread that fails to acquire it burns cycles while waiting. The split depends on the workload: a heavier critical section, such as the list operations, shifts a larger share of the time into the implementation itself.

Q: Where do you believe most of the time/cycles are being spent in the high-thread mutex tests?

A: Most of the time/cycles are spent in the lock, since more threads are contending to enter the critical section. How large that share is depends on the amount of work done inside the critical section.

Question 2.3.2 - Execution Profiling:
Q: Where (what lines of code) are consuming most of the cycles when the spin-lock version of the list exerciser is run with a large number of threads?

A: The while loops containing the “__sync_lock_test_and…” builtin consume most of the cycles when the spin-lock version of the list exerciser is run with a large number of threads.

Q: Why does this operation become so expensive with large number of threads?

A: Since only one thread at a time can run the critical section, all the other threads busy-wait: they spin in the test-and-set loop, consuming CPU cycles without making progress. With a large number of threads, this wasted spinning dominates the run time.
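
For concreteness, a minimal sketch of the spin-lock pattern those loops implement, using the GCC atomic builtins (variable and function names here are illustrative, not taken from lab2_list.c):

	static int spin_lock = 0;	/* 0 = free, 1 = held */

	static void spin_acquire(void)
	{
		/* Atomically set the flag to 1 and loop while the old value
		 * was already 1, i.e. another thread holds the lock.
		 * This busy-wait loop is where the profiled cycles go. */
		while (__sync_lock_test_and_set(&spin_lock, 1))
			;
	}

	static void spin_release(void)
	{
		__sync_lock_release(&spin_lock);	/* atomically clear the flag */
	}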

Question 2.3.3 - Mutex Wait Time:
Q: Why does the average lock-wait time rise so dramatically with the number of contending threads?

A: Wait time accumulates in parallel: while one thread holds the mutex, every other contending thread is charged wait time simultaneously. Each extra thread waits at least as long as the ones already queued, so the total (and hence the average) lock-wait time grows dramatically with the number of contenders.

Q: Why does the completion time per operation rise (less dramatically) with the number of contending threads?

A: The number of critical sections stays the same, and no matter how many threads contend, only one thread at a time executes an operation. The waiting threads sit blocked rather than adding work, so completion time per operation rises only with the added locking and scheduling overhead, which makes the rise less dramatic.

Q: How is it possible for the wait time per operation to go up faster (or higher) than the completion time per operation?

A: Many threads can be waiting on the lock at the same time while only one thread runs the critical section. Every waiter accumulates wait time over the same interval, so wait time adds up across threads faster than wall-clock time advances. A thread can thus spend more time waiting for the lock than executing the operation it protects, i.e. wait time > time executing the operation.
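
A sketch of how the lock-wait time behind these numbers can be measured, assuming clock_gettime()-based timing around each acquisition (names are illustrative, not lifted from lab2_list.c):

	#include <pthread.h>
	#include <time.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static long long total_wait_ns;	/* summed across all threads */

	static long long now_ns(void)
	{
		struct timespec ts;
		clock_gettime(CLOCK_MONOTONIC, &ts);
		return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
	}

	static void locked_operation(void)
	{
		long long start = now_ns();
		pthread_mutex_lock(&lock);
		total_wait_ns += now_ns() - start;	/* safe: lock is held */
		/* ... critical section (list operation) ... */
		pthread_mutex_unlock(&lock);
	}

Because every waiting thread accumulates wait time over the same held-lock interval, total_wait_ns can exceed the wall-clock run time, which is exactly the wait-per-operation > completion-per-operation effect described above.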

Question 2.3.4 - Performance of Partitioned Lists
Q: Explain the change in performance of the synchronized methods as a function of the number of lists.

A: Throughput increases with the number of lists at every thread count: hashing keys across N sublists, each with its own lock, divides the contention, and each sublist is roughly 1/N as long, so each operation also does less search work.
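
A sketch of the partitioning this question refers to: a key hashes to one of N sublists, each with its own lock, so threads touching different sublists never contend. The hash function and names below are illustrative assumptions, building on the interface sketch earlier (include <pthread.h>):

	#define NLISTS 4	/* set by --lists */

	static SortedList_t heads[NLISTS];	/* sublist heads, set up at startup */
	static pthread_mutex_t locks[NLISTS];	/* one per sublist, via pthread_mutex_init() */

	static unsigned sublist(const char *key)
	{
		unsigned h = 0;
		while (*key)
			h = h * 31 + (unsigned char)*key++;
		return h % NLISTS;
	}

	static void insert_element(SortedListElement_t *e)
	{
		unsigned i = sublist(e->key);
		pthread_mutex_lock(&locks[i]);
		SortedList_insert(&heads[i], e);
		pthread_mutex_unlock(&locks[i]);
	}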

Q:Should the throughput continue increasing as the number of lists is further increased?  If not, explain why not.

A: It depends on the number of threads and elements. Once there are enough lists that every thread can continuously do work without contending for the same sublist, adding more lists will not increase throughput, since the threads are already doing the same amount of work either way.

Q: It seems reasonable to suggest the throughput of an N-way partitioned list should be equivalent to the throughput of a single list with fewer (1/N) threads.  Does this appear to be true in the above curves?  If not, explain why not.

A: Yes, this appears to be true in the above curves. For example, in the spin-lock graph, the throughput of 1 list with 1 thread is almost equivalent to the throughput of 3 lists with 16 threads.
