This repo serves as the report and codebase for Lab 2/3 of the Advanced Computer Architecture course at the School of Electrical and Computer Engineering, Aristotle University of Thessaloniki.
Ανδρονίκου Δημήτρης, 9836
Αλεξανδρίδης Φώτιος, 9953
Default cache configuration:
- L1 instruction cache: size=32768 (32 KB), associativity assoc=2
- L1 data cache: size=65536 (64 KB), associativity assoc=2
- L2 cache: size=2097152 (2 MB), associativity assoc=8
- cache line: cache_line_size=64 (64 B)
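For intuition about these defaults, the short sketch below (an illustration added here, not part of the lab scripts) derives the number of sets and the index/offset address bits that each of these caches implies:

```python
from math import log2

def cache_geometry(size_bytes, assoc, line_bytes):
    """Return (sets, offset_bits, index_bits) of a set-associative cache."""
    sets = size_bytes // (assoc * line_bytes)
    return sets, int(log2(line_bytes)), int(log2(sets))

# Default configuration listed above (line size 64 B everywhere).
for name, size, assoc in [("L1 icache", 32768, 2),
                          ("L1 dcache", 65536, 2),
                          ("L2 cache", 2097152, 8)]:
    sets, offset_bits, index_bits = cache_geometry(size, assoc, 64)
    print(f"{name}: {sets} sets, {offset_bits} offset bits, {index_bits} index bits")
# L1 icache: 256 sets, 6 offset bits, 8 index bits
# L1 dcache: 512 sets, 6 offset bits, 9 index bits
# L2 cache:  4096 sets, 6 offset bits, 12 index bits
```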
With this default configuration, the per-benchmark results are:

401.bzip2:
- execution time: sim_seconds = 0.083982
- CPI: system.cpu.cpi = 1.679650
- total miss rate for the L1 instruction cache: system.cpu.icache.overall_miss_rate::total = 0.000077
- total miss rate for the L1 data cache: system.cpu.dcache.overall_miss_rate::total = 0.014798
- total miss rate for the L2 cache: system.l2.overall_miss_rate::total = 0.282163

429.mcf:
- execution time: sim_seconds = 0.064955
- CPI: system.cpu.cpi = 1.299095
- total miss rate for the L1 instruction cache: system.cpu.icache.overall_miss_rate::total = 0.023612
- total miss rate for the L1 data cache: system.cpu.dcache.overall_miss_rate::total = 0.002108
- total miss rate for the L2 cache: system.l2.overall_miss_rate::total = 0.055046

456.hmmer:
- execution time: sim_seconds = 0.059396
- CPI: system.cpu.cpi = 1.187917
- total miss rate for the L1 instruction cache: system.cpu.icache.overall_miss_rate::total = 0.000221
- total miss rate for the L1 data cache: system.cpu.dcache.overall_miss_rate::total = 0.001637
- total miss rate for the L2 cache: system.l2.overall_miss_rate::total = 0.077760

458.sjeng:
- execution time: sim_seconds = 0.513528
- CPI: system.cpu.cpi = 10.270554
- total miss rate for the L1 instruction cache: system.cpu.icache.overall_miss_rate::total = 0.000020
- total miss rate for the L1 data cache: system.cpu.dcache.overall_miss_rate::total = 0.121831
- total miss rate for the L2 cache: system.l2.overall_miss_rate::total = 0.999972

470.lbm:
- execution time: sim_seconds = 0.174671
- CPI: system.cpu.cpi = 3.493415
- total miss rate for the L1 instruction cache: system.cpu.icache.overall_miss_rate::total = 0.000094
- total miss rate for the L1 data cache: system.cpu.dcache.overall_miss_rate::total = 0.060972
- total miss rate for the L2 cache: system.l2.overall_miss_rate::total = 0.999944
metric | 401.bzip2 | 429.mcf | 456.hmmer | 458.sjeng | 470.lbm |
---|---|---|---|---|---|
execution time (s) | 0.083982 | 0.064955 | 0.059396 | 0.513528 | 0.174671 |
CPI | 1.679650 | 1.299095 | 1.187917 | 10.270554 | 3.493415 |
Icache_miss_rate::total | 0.000077 | 0.023612 | 0.000221 | 0.000020 | 0.000094 |
dcache_miss_rate::total | 0.014798 | 0.002108 | 0.001637 | 0.121831 | 0.060972 |
l2_miss_rate::total | 0.282163 | 0.055046 | 0.077760 | 0.999972 | 0.999944 |
From these results we observe that:

1. The execution time of specsjeng is much larger than that of the other benchmarks.
2. The CPI of specsjeng is much larger than that of the others.
3. The instruction-cache miss rate, however, is largest for specmcf.
4. The data-cache miss rate of specsjeng is much larger than that of the others.
5. The L2 miss rates of specsjeng and speclibm are much larger than those of the others.

Due to observations 4 and 5, it is reasonable for specsjeng to have a high CPI, and hence a long runtime.
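The relation between CPI and runtime can also be checked numerically. Assuming the default 2 GHz CPU clock (the 500-tick period shown in the tables further below) and the same fixed instruction count for every benchmark, execution time = instructions × CPI / frequency; the sketch below shows that every run satisfies sim_seconds ≈ CPI × 0.05, i.e. roughly 10^8 committed instructions per run (an inference from the numbers, not a figure taken from the report):

```python
# Check: execution time = CPI * instructions / frequency.
# Assumes the default 2 GHz CPU clock (500-tick period) and a common
# instruction count per run -- both inferred, not stated in the report text.
runs = {  # benchmark: (sim_seconds, CPI), copied from the table above
    "401.bzip2": (0.083982, 1.679650),
    "429.mcf":   (0.064955, 1.299095),
    "456.hmmer": (0.059396, 1.187917),
    "458.sjeng": (0.513528, 10.270554),
    "470.lbm":   (0.174671, 3.493415),
}
FREQ_HZ = 2e9  # default CPU clock implied by the 500-tick period

for bench, (seconds, cpi) in runs.items():
    instructions = seconds * FREQ_HZ / cpi
    print(f"{bench}: ~{instructions:.3e} instructions")  # ~1e8 for every benchmark
```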
system.clk_domain.clock (clock period in ticks):

entry | 401.bzip2 | 429.mcf | 456.hmmer | 458.sjeng | 470.lbm |
---|---|---|---|---|---|
default | 1000 | 1000 | 1000 | 1000 | 1000 |
1GHz | 1000 | 1000 | 1000 | 1000 | 1000 |
3GHz | 1000 | 1000 | 1000 | 1000 | 1000 |
cpu_cluster.clk_domain.clock (clock period in ticks):

entry | 401.bzip2 | 429.mcf | 456.hmmer | 458.sjeng | 470.lbm |
---|---|---|---|---|---|
default | 500 | 500 | 500 | 500 | 500 |
1GHz | 1000 | 1000 | 1000 | 1000 | 1000 |
3GHz | 333 | 333 | 333 | 333 | 333 |
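The values in these two tables are clock periods expressed in simulation ticks; assuming gem5's default resolution of 1 tick = 1 ps, they convert to frequencies as in the short sketch below (note that 333 ticks is 1/3 ns rounded to a whole number of ticks):

```python
# Convert the clock-period entries above (in ticks) to frequencies,
# assuming gem5's default resolution of 1 tick = 1 ps.
TICK_SECONDS = 1e-12

for label, period_ticks in [("CPU clock, default", 500),
                            ("CPU clock, 1GHz run", 1000),
                            ("CPU clock, 3GHz run", 333),
                            ("system clock", 1000)]:
    freq_ghz = 1.0 / (period_ticks * TICK_SECONDS) / 1e9
    print(f"{label}: {period_ticks} ticks -> {freq_ghz:.3f} GHz")
# CPU clock, default: 500 ticks -> 2.000 GHz
# CPU clock, 1GHz run: 1000 ticks -> 1.000 GHz
# CPU clock, 3GHz run: 333 ticks -> 3.003 GHz
# system clock: 1000 ticks -> 1.000 GHz
```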
To answer the theoretical questions we will focus on the first benchmark, but the same observations are applicable to all benchmarks.
We observe that the system clock entry defines the clock rate for the rest of the system, i.e. the motherboard components, while the CPU clock defines the clock rate for the different CPU components. Because a CPU usually has to perform more operations per unit of time than the other components, the CPU clock is normally defined to be at least as fast as the system clock, and usually faster. For performance and synchronization reasons, we actually want the system clock period in ticks to be an integer multiple of the CPU clock period in ticks. By inspecting the config.json
file of a 1GHz run, we can observe that the different CPU components are clocked according to the CPU clock rate. For a MinorCPU model (selected in the commands we use to run the benchmarks), these components are:
- L1 data cache (dcache)
- L1 instruction cache (icache)
- L2 cache
- Instruction walker cache
- Data walker cache
- Data buses that connect the components mentioned above
If we add another CPU, i.e. another core in the CPU cluster, its clock rate will also be the defined CPU clock rate, in accordance with its architecture.
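As a rough illustration of how such clock domains look in a gem5 configuration script (a simplified sketch, not copied from the lab's own config files; the object names are illustrative):

```python
# Sketch of the two clock domains described above. This fragment is meant to
# run inside gem5's Python environment; it is illustrative, not the lab config.
from m5.objects import System, SrcClockDomain, VoltageDomain, MinorCPU

system = System()

# System (motherboard-level) clock domain.
system.clk_domain = SrcClockDomain(clock='1GHz',
                                   voltage_domain=VoltageDomain())

# Separate, faster clock domain for the CPU cluster; the CPU and its
# caches/buses are clocked from this domain, as seen in config.json.
system.cpu_clk_domain = SrcClockDomain(clock='2GHz',
                                       voltage_domain=VoltageDomain())

system.cpu = MinorCPU()
system.cpu.clk_domain = system.cpu_clk_domain
```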
For the scaling we have the following execution times for the 1GHz and 3GHz simulations:
Simulation time (in seconds)
clock | 401.bzip2 | 429.mcf | 456.hmmer | 458.sjeng | 470.lbm |
---|---|---|---|---|---|
1GHz | 0.165228 | 0.124137 | 0.118530 | 0.329087 | 0.246976 |
3GHz | 0.061589 | 0.043909 | 0.039646 | 0.138401 | 0.135953 |
As we can observe from the table, in some cases the scaling is better than in others (specifically, for 456.hmmer the speedup is closer to 3 than for 470.lbm). Achieving perfect scaling is close to impossible, because system performance depends on many parameters other than the CPU clock (for example, the cache line size).
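The scaling factors follow directly from the table; a minimal sketch of the computation, which shows 456.hmmer closest to the ideal 3x and 470.lbm furthest from it:

```python
# Speedup from 1 GHz to 3 GHz, per benchmark (values copied from the table above).
times = {  # benchmark: (sim_seconds at 1 GHz, sim_seconds at 3 GHz)
    "401.bzip2": (0.165228, 0.061589),
    "429.mcf":   (0.124137, 0.043909),
    "456.hmmer": (0.118530, 0.039646),
    "458.sjeng": (0.329087, 0.138401),
    "470.lbm":   (0.246976, 0.135953),
}
for bench, (t_1ghz, t_3ghz) in times.items():
    print(f"{bench}: speedup = {t_1ghz / t_3ghz:.2f}x (ideal: 3x)")
# 401.bzip2: 2.68x, 429.mcf: 2.83x, 456.hmmer: 2.99x, 458.sjeng: 2.38x, 470.lbm: 1.82x
```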
401.bzip2 with an increased memory clock:
- execution time: sim_seconds = 0.083609
- CPI: system.cpu.cpi = 1.672175
- total miss rate for the L1 instruction cache: system.cpu.icache.overall_miss_rate::total = 0.000077
- total miss rate for the L1 data cache: system.cpu.dcache.overall_miss_rate::total = 0.014795
- total miss rate for the L2 cache: system.l2.overall_miss_rate::total = 0.282159
metric | default | increased memory clock | % change |
---|---|---|---|
execution time (s) | 0.083982 | 0.083609 | -0.4441 |
CPI | 1.679650 | 1.672175 | -0.4465 |
Icache_miss_rate::total | 0.000077 | 0.000077 | 0 |
dcache_miss_rate::total | 0.014798 | 0.014795 | -0.02 |
l2_miss_rate::total | 0.282163 | 0.282159 | -0.0014 |
We observe that increasing the memory frequency slightly decreases the miss rate of the L2 and the L1 data cache. This makes sense because these caches receive data faster, so fewer misses occur in the same amount of time. This in turn affects the CPI, which is reasonable, and as shown before the CPI affects the runtime almost proportionally.
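The percentage changes in the comparison table above are plain relative differences against the default run; a minimal sketch of that calculation:

```python
# Relative change (in %) of each metric versus the default 401.bzip2 run,
# using the values from the comparison table above.
metrics = {  # metric: (default, increased memory clock)
    "execution time":    (0.083982, 0.083609),
    "CPI":               (1.679650, 1.672175),
    "icache miss rate":  (0.000077, 0.000077),
    "dcache miss rate":  (0.014798, 0.014795),
    "l2 miss rate":      (0.282163, 0.282159),
}
for name, (default, faster_mem) in metrics.items():
    change = 100.0 * (faster_mem - default) / default
    print(f"{name}: {change:+.4f}%")  # e.g. execution time: -0.4441%
```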
- In a set-associative cache, the block size does not affect the cache tag in any way (for a fixed cache size and associativity; illustrated in the sketch after this list).
- A smaller cache tag leads to a lower cache hit time.
- A smaller cache block incurs a lower cache miss penalty.
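To see the first statement concretely, assume (purely for illustration) 32-bit addresses and a cache of fixed total size and associativity: the index and offset bits together always cover log2(size/associativity), so the tag width does not change when only the block size does.

```python
from math import log2

def tag_bits(addr_bits, size_bytes, assoc, block_bytes):
    """Tag width of a set-associative cache: address bits minus index and offset bits."""
    sets = size_bytes // (assoc * block_bytes)
    return addr_bits - int(log2(sets)) - int(log2(block_bytes))

# 64 KB, 2-way cache with 32-bit addresses (illustrative numbers):
for block in (32, 64, 128):
    print(f"{block} B line -> {tag_bits(32, 65536, 2, block)} tag bits")
# Every block size gives 17 tag bits: the tag is unaffected by the block size.
```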
Also:
- Increasing the cache line size:
  - reduces compulsory misses,
  - reduces the miss rate,
  - increases capacity and conflict misses.
- Increasing the cache size:
  - increases the hit time and the power consumption,
  - reduces the miss rate (more bytes can be kept in the cache).
- Higher associativity:
  - reduces conflict misses,
  - reduces the miss rate,
  - increases the hit time and the power consumption.
- More cache levels:
  - reduce the overall memory access time,
  - reduce the miss penalty.
So our test runs are these:

Run # | L1 dcache size (kB) | L1 icache size (kB) | L2 cache size (MB) | L1 icache assoc. | L1 dcache assoc. | L2 cache assoc. | Cache line size (B) |
---|---|---|---|---|---|---|---|
1 | 64 | 64 | 1 | 1 | 1 | 2 | 128 |
2 | 64 | 128 | 2 | 1 | 1 | 2 | 32 |
3 | 128 | 64 | 2 | 1 | 1 | 2 | 32 |
4 | 128 | 128 | 4 | 1 | 1 | 2 | 128 |
5 | 128 | 128 | 4 | 2 | 1 | 2 | 128 |
6 | 128 | 128 | 4 | 4 | 1 | 2 | 128 |
7 | 128 | 128 | 4 | 4 | 2 | 2 | 128 |
8 | 128 | 128 | 4 | 4 | 4 | 2 | 64 |
9 | 128 | 128 | 4 | 4 | 4 | 4 | 64 |
10 | 128 | 128 | 4 | 4 | 4 | 8 | 128 |
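For reference, a row of this table maps onto gem5's classic se.py cache options roughly as follows; this is a sketch under the assumption that the standard se.py flags are used, and the gem5 binary path and benchmark command are placeholders rather than the lab's actual ones:

```python
# Sketch: how run 1 of the table above maps onto gem5's se.py cache options.
# The gem5 binary path and the benchmark binary are placeholders.
run1 = {
    "--l1d_size": "64kB", "--l1i_size": "64kB", "--l2_size": "1MB",
    "--l1i_assoc": "1", "--l1d_assoc": "1", "--l2_assoc": "2",
    "--cacheline_size": "128",
}
cmd = ["./build/ARM/gem5.opt", "configs/example/se.py",
       "--cpu-type=MinorCPU", "--caches", "--l2cache",
       "-c", "path/to/benchmark_binary"]
cmd += [f"{flag}={value}" for flag, value in run1.items()]
print(" ".join(cmd))
```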
The results (one table per benchmark, in the same order as above):

401.bzip2:

Run | CPI | L1 dcache miss rate | L1 icache miss rate | L2 miss rate |
---|---|---|---|---|
run1 | 1.551542 | 0.009703 | 0.000046 | 0.204023 |
run2 | 1.790627 | 0.018864 | 0.000069 | 0.384215 |
run3 | 1.756071 | 0.015868 | 0.000070 | 0.465377 |
run4 | 1.574223 | 0.011891 | 0.000047 | 0.166513 |
run5 | 1.574144 | 0.011891 | 0.000046 | 0.166506 |
run6 | 1.574144 | 0.011891 | 0.000046 | 0.166510 |
run7 | 1.556731 | 0.010273 | 0.000046 | 0.192764 |
run8 | 1.597712 | 0.010896 | 0.000057 | 0.312559 |
run9 | 1.597876 | 0.010896 | 0.000057 | 0.312993 |
run10 | 1.551542 | 0.009703 | 0.000046 | 0.204023 |

429.mcf:

Run | CPI | L1 dcache miss rate | L1 icache miss rate | L2 miss rate |
---|---|---|---|---|
run1 | 1.181221 | 0.001966 | 0.000016 | 0.758297 |
run2 | 1.335364 | 0.018105 | 0.000055 | 0.316779 |
run3 | 2.143559 | 0.005244 | 0.103163 | 0.042156 |
run4 | 1.184956 | 0.002541 | 0.000036 | 0.578237 |
run5 | 1.184904 | 0.002541 | 0.000023 | 0.581209 |
run6 | 1.184816 | 0.002541 | 0.000016 | 0.583029 |
run7 | 1.182096 | 0.002097 | 0.000016 | 0.705200 |
run8 | 1.203137 | 0.003132 | 0.000022 | 0.863710 |
run9 | 1.203047 | 0.003132 | 0.000022 | 0.863262 |
run10 | 1.181221 | 0.001966 | 0.000016 | 0.758297 |

456.hmmer:

Run | CPI | L1 dcache miss rate | L1 icache miss rate | L2 miss rate |
---|---|---|---|---|
run1 | 1.178110 | 0.000367 | 0.000056 | 0.200911 |
run2 | 1.210302 | 0.004346 | 0.000401 | 0.054981 |
run3 | 1.194226 | 0.002259 | 0.000420 | 0.102704 |
run4 | 1.186298 | 0.001133 | 0.000334 | 0.056418 |
run5 | 1.185588 | 0.001133 | 0.000057 | 0.061388 |
run6 | 1.185588 | 0.001133 | 0.000056 | 0.061400 |
run7 | 1.178295 | 0.000387 | 0.000056 | 0.187128 |
run8 | 1.182888 | 0.000662 | 0.000078 | 0.208062 |
run9 | 1.182888 | 0.000662 | 0.000078 | 0.208062 |
run10 | 1.178110 | 0.000367 | 0.000056 | 0.200911 |

458.sjeng:

Run | CPI | L1 dcache miss rate | L1 icache miss rate | L2 miss rate |
---|---|---|---|---|
run1 | 3.348858 | 0.248364 | 0.011330 | 0.044292 |
run2 | 5.605595 | 0.393710 | 0.003321 | 0.290642 |
run3 | 5.684481 | 0.392898 | 0.012986 | 0.123246 |
run4 | 3.261793 | 0.243757 | 0.002443 | 0.166720 |
run5 | 3.251312 | 0.243745 | 0.000578 | 0.288207 |
run6 | 3.247843 | 0.243744 | 0.000115 | 0.352064 |
run7 | 3.240343 | 0.242518 | 0.000114 | 0.702941 |
run8 | 3.173673 | 0.243469 | 0.000108 | 0.894077 |
run9 | 3.173692 | 0.243468 | 0.000108 | 0.894172 |
run10 | 3.239587 | 0.242393 | 0.000114 | 0.780786 |

470.lbm:

Run | CPI | L1 dcache miss rate | L1 icache miss rate | L2 miss rate |
---|---|---|---|---|
run1 | 1.883187 | 0.032168 | 0.000102 | 0.944590 |
run2 | 3.648170 | 0.123694 | 0.000071 | 0.997075 |
run3 | 3.648098 | 0.123577 | 0.000075 | 0.998511 |
run4 | 1.874379 | 0.031517 | 0.000089 | 0.971545 |
run5 | 1.874379 | 0.031517 | 0.000087 | 0.971559 |
run6 | 1.874379 | 0.031517 | 0.000086 | 0.971562 |
run7 | 1.865695 | 0.030867 | 0.000086 | 0.999997 |
run8 | 2.467162 | 0.061731 | 0.000085 | 0.999999 |
run9 | 2.467162 | 0.061731 | 0.000085 | 0.999999 |
run10 | 1.865695 | 0.030867 | 0.000086 | 0.999997 |
Overall, we can see that we achieve the lowest CPI consistently across all benchmarks in runs 1, 7 and 10. We can definitely conclude that a larger cache line size is vital to lowering the CPI, since the runs with the smallest line size (32 bytes, runs 2 and 3) consistently show the highest CPI in every benchmark. Apart from that, increasing associativity has only a minimal effect on the CPI, as seen by comparing runs that differ only in associativity (runs 4-6, which vary the L1 instruction cache associativity, and runs 8-9, which vary the L2 associativity). Cache size does not impact the CPI much either.
As such, we can conclude that the CPI is largely determined by the cache line size (larger -> better CPI), affected somewhat by associativity (higher -> better CPI) and only marginally affected by the cache sizes (larger -> better CPI in some cases).
We want a metric of the form Performance/Cost, where Performance = 1/CPI, so:

F = (1/CPI) / Cost = 1 / (CPI * Cost)

We want to maximize this function, which is equivalent to minimizing CPI * Cost.
The cost is:

Cost = 10*cost_L1_data + 10*cost_L1_instr + cost_L2 + 10*cost_L1_data_asso + 10*cost_L1_inst_asso + cost_L2_asso + cost_line_s

where:
- cost_L1_data = size of the L1 data cache (in kB)
- cost_L1_instr = size of the L1 instruction cache (in kB)
- cost_L2 = size of the L2 cache (in kB, so the MB value x1000)
- cost_L1_data_asso = the L1 data cache associativity
- cost_L1_inst_asso = the L1 instruction cache associativity
- cost_L2_asso = the L2 cache associativity
- cost_line_s = the cache line size (in bytes)

We use the x10 factor because the cost of L1 is much higher than the cost of L2.
Run | Cost |
---|---|
run1 | 2430 |
run2 | 3974 |
run3 | 3974 |
run4 | 6710 |
run5 | 6720 |
run6 | 6740 |
run7 | 6750 |
run8 | 6706 |
run9 | 6708 |
run10 | 6776 |
We then use Cost = Cost / 1000, so that it is of the same order of magnitude as the CPI.
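Putting the cost formula and the normalization together, a short sketch that reproduces the cost of run 1 and the corresponding F value (using run 1's CPI from the first results table above as a check):

```python
def cost(l1d_kb, l1i_kb, l2_mb, l1i_assoc, l1d_assoc, l2_assoc, line_bytes):
    """Cost function defined above; the L2 size is converted from MB to kB (x1000)."""
    return (10 * l1d_kb + 10 * l1i_kb + l2_mb * 1000
            + 10 * l1d_assoc + 10 * l1i_assoc + l2_assoc + line_bytes)

# Run 1 configuration from the table above.
run1_cost = cost(64, 64, 1, 1, 1, 2, 128)                 # -> 2430
f_value = 1.0 / (1.551542 * (run1_cost / 1000.0))         # run 1's CPI on the first benchmark
print(run1_cost, round(f_value, 4))                       # 2430 0.2652, matching the tables
```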
The resulting values of F = 1/(CPI * Cost), using the normalized cost, per benchmark (higher is better):

401.bzip2:

Run | F |
---|---|
run1 | 0.265234 |
run2 | 0.140529 |
run3 | 0.143295 |
run4 | 0.09467 |
run5 | 0.094534 |
run6 | 0.094253 |
run7 | 0.095166 |
run8 | 0.093333 |
run9 | 0.093296 |
run10 | 0.095119 |

429.mcf:

Run | F |
---|---|
run1 | 0.34839 |
run2 | 0.18844 |
run3 | 0.11739 |
run4 | 0.12577 |
run5 | 0.12559 |
run6 | 0.12522 |
run7 | 0.12533 |
run8 | 0.12394 |
run9 | 0.12392 |
run10 | 0.12494 |

456.hmmer:

Run | F |
---|---|
run1 | 0.34931 |
run2 | 0.20791 |
run3 | 0.21071 |
run4 | 0.12563 |
run5 | 0.12552 |
run6 | 0.12514 |
run7 | 0.12573 |
run8 | 0.12606 |
run9 | 0.12603 |
run10 | 0.12527 |

458.sjeng:

Run | F |
---|---|
run1 | 0.12288 |
run2 | 0.04489 |
run3 | 0.04427 |
run4 | 0.04569 |
run5 | 0.04577 |
run6 | 0.04568 |
run7 | 0.04572 |
run8 | 0.04699 |
run9 | 0.04697 |
run10 | 0.04556 |

470.lbm:

Run | F |
---|---|
run1 | 0.21852 |
run2 | 0.06898 |
run3 | 0.06898 |
run4 | 0.07951 |
run5 | 0.07939 |
run6 | 0.07916 |
run7 | 0.07941 |
run8 | 0.06044 |
run9 | 0.06042 |
run10 | 0.0791 |
So the best choice for every benchmark is run 1, mainly because of its very low cost combined with a good CPI; although the other configurations have more cache memory, they do not achieve a correspondingly better CPI.