name: inverse
layout: true
class: center, middle, inverse


Profiling and code optimization

Radovan Bast

Licensed under CC BY 4.0. Code examples: OSI-approved MIT license.


layout: false

When do we want to optimize the code?

  • Code is too slow
  • Memory demand is too high


Things we can do about it

  • Buy faster computer
  • Buy more memory
  • Help the compiler
  • Change algorithm
  • Change data structures (optimize data access)
  • Port to a "faster language"
  • Parallelize
  • Port to another platform (GPU, accelerators)

Profile first, optimize later

  • Before you invest money or time, identify the problem
  • "Premature optimization is the root of all evil" (Donald Knuth)
  • Before solving the problem, verify whether it is worth it

Profiling

  • Benchmark and profile before optimizing

Profiling

  • Where does the program spend most time?
  • Dynamic program analysis (runtime)
  • Statistical sampling or event-based
  • Profiling perturbs the calculation
  • Profiled calculation should be representative

Profilers

Statistical (sampling)

  • Collect data at regular intervals
  • Typically small overhead
  • Not all code is seen
  • Results may differ between executions
  • Show bottlenecks in production code
  • Can be used for monitoring over long execution runs
  • Reports on external library functions

Event-based (tracing)

  • Collect data (traces) for pre-defined events (entering or leaving functions, object allocation, communication, disk I/O)
  • Requires modifying (instrumenting) the code
  • Often large overhead
  • Selective

Profilers

  • "Home use", free

    • Gprof
    • Callgrind (Valgrind)
    • Python: cProfile, profile
    • OProfile
  • HPC, free

    • HPCToolkit
    • DynInst
    • Extrae/Paraver
    • Scalasca
    • TAU
  • Commercial

    • Vampir
    • Allinea Performance Reports
    • Allinea MAP
    • CrayPat
    • VTune Amplifier (Intel)

template: inverse

How would you profile if stranded on an island?


How would you profile if stranded on an island?

  • Insert timers!
  • Everybody can do that
  • In any language
  • It is good to have timings in the output anyway
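
A minimal sketch of the "insert timers" approach using only the standard library; the function name and label below are made up for illustration:

```python
import time


def expensive_step(n):
    # stand-in for the code section you want to time
    return sum(i * i for i in range(n))


t0 = time.perf_counter()
result = expensive_step(10_000_000)
t1 = time.perf_counter()

# keep timings as part of the normal output
print(f"expensive_step: {t1 - t0:.3f} s")
```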

Before solving the problem, verify whether it is worth it

  • Human time often more valuable than CPU time
  • We often spend more time reading and writing code than running it
  • Compare your programming time multiplied by your salary to the cost of CPU time that you will save
  • Is a 20% faster code worth 2 months of work?
  • Does it really matter whether the run takes 1 minute or 2 minutes?
  • It probably matters if the computation takes 3 months
  • If your code is used by 5000 users then 20% speedup may be worth it

Performance vs. code obfuscation

  • When you optimize for speed/memory, typically you increase complexity
  • Imagine your code runs 40% faster but nobody understands it
  • If you have to complicate the code, keep the complexity localized
  • Define your priorities
  • Performance is typically not portable

Complexity complexity complexity

  • We do not know the languages of tomorrow
  • We do not know the hardware of tomorrow
  • Code complexity may prevent us from porting our codes to soft- and hardware of tomorrow
  • Keep it simple as long as you can

Buy faster computer

  • Moore's law
  • Frequency race
  • "The party isn't exactly over, but the police have arrived, and the music has been turned way down" (P. Kogge)
  • Today we need to think about parallelization

Buy more memory

  • Number of cores/threads scales faster than memory
  • Today we need to think about parallelization
  • Linear scaling requires localization of memory access
  • Use tools to identify memory bottlenecks

Help the compiler

  • Restrict types (Cython)
  • Use pragmas
  • Tell the compiler where the bottlenecks are
  • Tell the compiler where vectorization is possible
  • Compiler flags
  • Instruction set

Change algorithm

  • Consider increase in complexity
  • Pre-factor vs. scaling
  • "Slow" algorithm may catch the worm
  • Easy algorithm first
  • YAGNI
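
A hedged sketch of the pre-factor vs. scaling trade-off (function names are illustrative): the quadratic check is the easiest to write and fine for small inputs; the sorting-based variant scales better once the input grows.

```python
# easy algorithm first: O(n^2), fine for small n
def has_duplicates_quadratic(items):
    return any(items[i] == items[j]
               for i in range(len(items))
               for j in range(i + 1, len(items)))


# better scaling: O(n log n), wins for large n
def has_duplicates_sorted(items):
    s = sorted(items)
    return any(a == b for a, b in zip(s, s[1:]))


assert has_duplicates_quadratic([3, 1, 4, 1]) == has_duplicates_sorted([3, 1, 4, 1])
```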

Change data structures (optimize data access)
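
The slide gives no code; as a hedged illustration (field and variable names are made up), replacing a list of records that is scanned on every access with a dictionary keyed by the field you look up turns O(n) access into O(1):

```python
records = [{"id": i, "value": i * i} for i in range(100_000)]


# list of records: every lookup scans the whole list, O(n)
def value_from_list(record_id):
    for r in records:
        if r["id"] == record_id:
            return r["value"]


# dictionary keyed by id: one-time build, then O(1) lookups
by_id = {r["id"]: r["value"] for r in records}


def value_from_dict(record_id):
    return by_id[record_id]


assert value_from_list(123) == value_from_dict(123)
```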


Refactoring


Recomputing vs. remembering

  • Sometimes recomputing is faster than reading from disk
  • Memoization
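
A minimal memoization sketch with the standard library's functools.lru_cache; the expensive function is just a placeholder:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive(n):
    # placeholder for a costly computation; the decorator
    # remembers results, so repeated calls are served from the cache
    return sum(i ** 2 for i in range(n))


expensive(1_000_000)  # computed
expensive(1_000_000)  # returned from the cache
```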

Port to a "faster language"

  • Every language allows you to write slow code
  • Pre-factor vs. scaling
  • "Slow" language may catch the worm
  • Combine languages (FFI)
  • Functional programming vs. OOP
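
The "combine languages (FFI)" point as a small ctypes sketch, calling cos from the system C math library; the library lookup is platform-dependent and may need adjusting:

```python
import ctypes
from ctypes.util import find_library

# locate the C math library (platform-dependent; may fail on some systems)
libm = ctypes.CDLL(find_library("m"))

# declare the C signature: double cos(double)
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0
```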

Prototype first with "easy" language

  • Matlab
  • Python
  • Julia

Parallelize

  • Amdahl's law

  • Serial and parallel parts scale differently w.r.t. system size
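
Amdahl's law as a small sketch: if a fraction p of the runtime can be parallelized over n workers, the best possible speedup is 1 / ((1 - p) + p / n).

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup with parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)


# even with 1000 workers, a 5% serial part caps the speedup near 20x
print(amdahl_speedup(0.95, 1000))  # ~19.6
```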

Typical problems with calculations that do not scale

  • Parallel region too small
  • Serialization of parallel tasks
  • I/O heavy code
  • Communication overhead
  • Work load imbalance
  • Resource imbalance
  • Memory-bound code
  • Sub-optimal use of libraries
  • System size too small
  • Wrong code

Port to another platform (GPU, accelerators)

  • Moving target
  • Trend goes towards multi-processor and multi-thread
  • If you rewrite your data structures for GPU/accelerators, the effort is most probably not wasted

Before porting

  • Restructure mathematical formulation
  • Innovate at algorithm level
  • Then maybe port

Do not reinvent the wheel

  • Use libraries if you can
  • Libraries are dependencies

General

  • Use elemental functions
  • Use pure functions/subroutines (Fortran)
  • Use intrinsics
  • Beware of nested loops
  • Do not allocate/deallocate inside nested loops
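
A hedged NumPy sketch of the allocation advice (assumes numpy is available; the workload is made up): allocate the work array once outside the loop instead of creating new arrays in every iteration.

```python
import numpy as np

n_steps, size = 1000, 100_000


# wasteful: new temporary arrays are allocated in every iteration
def accumulate_naive():
    total = np.zeros(size)
    for step in range(n_steps):
        total = total + np.full(size, step)  # two fresh arrays per step
    return total


# better: allocate once, update in place
def accumulate_preallocated():
    total = np.zeros(size)
    for step in range(n_steps):
        total += step  # in-place update, no per-iteration allocation
    return total
```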

Example: profiling performance and memory in Python
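
The slide leaves the example empty; here is a minimal sketch using only the standard library, with cProfile for runtime hot spots and tracemalloc for memory (the workload function is made up):

```python
import cProfile
import pstats
import tracemalloc


def workload():
    # placeholder for the code you want to profile
    data = [i ** 2 for i in range(1_000_000)]
    return sum(data)


# runtime profile: where is the time spent?
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)

# memory profile: where is memory allocated?
tracemalloc.start()
workload()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

From the command line, `python -m cProfile -s cumulative script.py` gives the same kind of runtime report without modifying the code.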


Conclusions

  • Profile first, optimize later
  • When you scale the system, new bottlenecks will appear