name: inverse
layout: true
class: center, middle, inverse
Licensed under CC BY 4.0. Code examples: OSI-approved MIT license.
layout: false
- Code is too slow
- Memory demand is too high
- Code is too slow
- Memory demand is too high
- Buy faster computer
- Buy more memory
- Help the compiler
- Change algorithm
- Change data structures (optimize data access)
- Port to a "faster language"
- Parallelize
- Port to another platform (GPU, accelerators)
- Before you invest money or time, identify the problem
- "Premature optimization is the root of all evil." (Donald Knuth)
- Before solving the problem, verify whether it is worth it
- Benchmark and profile before optimizing
- Where does the program spend most time?
- Dynamic program analysis (runtime)
- Statistical sampling or event-based
- Profiling perturbs the calculation
- Profiled calculation should be representative
- Collect data at regular intervals
- Typically small overhead
- Not all code is seen
- Results may differ between executions
- Show bottlenecks in production code
- Can be used for monitoring over long execution runs
- Reports on external library functions
- Collect data (traces) for pre-defined events (entering or leaving functions, object allocation, communication, disk I/O)
- Modify the code
- Often large overhead
- Selective
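A minimal sketch of event-based instrumentation in Python, using the standard `sys.setprofile` hook to record function entry and exit events (the function names below are illustrative):

```python
import sys

events = []

def tracer(frame, event, arg):
    # Record only Python-level function entry/exit events;
    # C-function events ("c_call", "c_return") are ignored here
    if event in ("call", "return"):
        events.append((event, frame.f_code.co_name))

def work():
    return sum(range(10))

sys.setprofile(tracer)  # install the hook
work()
sys.setprofile(None)    # remove it again

print(events)
```

Note the overhead: the hook fires on every call, which is why event-based tools are usually selective about what they instrument.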
**"Home use", free**
- Gprof
- Callgrind (Valgrind)
- Python: cProfile, profile
- OProfile
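For example, Python's built-in cProfile can be run from the command line (`python -m cProfile -s cumtime myscript.py`) or from code; a minimal in-code sketch:

```python
import cProfile
import io
import pstats

def slow(n):
    # Deliberately quadratic work so it shows up in the profile
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

profiler = cProfile.Profile()
profiler.enable()
slow(200)
profiler.disable()

# Print the most expensive functions, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```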
**HPC, free**
- HPCToolkit
- DynInst
- Extrae/Paraver
- Scalasca
- TAU
**Commercial**
- Vampir
- Allinea Performance Reports
- Allinea MAP
- CrayPat
- VTune Amplifier (Intel)
template: inverse
- Insert timers!
- Everybody can do that
- In any language
- It is good to have timings anyhow in the output
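A simple timer sketched as a Python context manager around `time.perf_counter` (the `timer` helper is just illustrative; the same idea works in any language):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # perf_counter is monotonic and has the highest available resolution
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")

with timer("sum of squares"):
    s = sum(i * i for i in range(100_000))
```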
- Human time often more valuable than CPU time
- We often spend more time reading and writing code than running it
- Compare your programming time multiplied by your salary to the cost of the CPU time you will save
- Is a 20% faster code worth 2 months of work?
- Does it really matter whether the run takes 1 minute or 2 minutes?
- It probably matters if the computation takes 3 months
- If your code is used by 5000 users then 20% speedup may be worth it
- When you optimize for speed/memory, typically you increase complexity
- Imagine your code runs 40% faster but nobody understands it
- If you have to complicate the code, keep the complexity localized
- Define your priorities
- Performance is typically not portable
- We do not know the languages of tomorrow
- We do not know the hardware of tomorrow
- Code complexity may prevent us from porting our codes to soft- and hardware of tomorrow
- Keep it simple as long as you can
- Moore's law
- Frequency race
- "The party isn't exactly over, but the police has arrived, and the music has been turned way down" (P. Kogge)
- Today we need to think about parallelization
- Number of cores/threads scales faster than memory
- Today we need to think about parallelization
- Linear scaling requires localization of memory access
- Use tools to identify memory bottlenecks
- Restrict types (Cython)
- Use pragmas
- Tell the compiler where the bottlenecks are
- Tell the compiler where vectorization is possible
- Compiler flags
- Instruction set
- Consider increase in complexity
- Pre-factor vs. scaling
- "Slow" algorithm may catch the worm
- Easy algorithm first
- YAGNI
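Pre-factor vs. scaling in one example: a "slow" O(n) linear scan has a tiny constant factor, so on small inputs it can beat an O(log n) binary search; both give the same answer, so start with the easy one:

```python
import bisect

def linear_search(xs, target):
    # O(n), but with a very small per-element cost
    for i, x in enumerate(xs):
        if x == target:
            return i
    return -1

def binary_search(xs, target):
    # O(log n), but with more bookkeeping per step
    i = bisect.bisect_left(xs, target)
    if i < len(xs) and xs[i] == target:
        return i
    return -1

data = list(range(0, 1000, 2))
# Both agree; for tiny inputs the "slow" one is often faster in practice.
print(linear_search(data, 42), binary_search(data, 42))
```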
- Latency: https://gist.github.com/jboner/2841832
- Cache hierarchy
- Cache line
- Gorilla
- MPI and scaling (disk access serializing)
- Swapping
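Cache lines and locality in action: traversing a 2-D array along rows (contiguous in memory) touches far fewer cache lines than traversing along columns, even though the same data is read. A sketch, assuming NumPy is available:

```python
import time

import numpy as np

a = np.ones((2000, 2000))

t0 = time.perf_counter()
row_sum = sum(a[i, :].sum() for i in range(a.shape[0]))  # contiguous rows
t1 = time.perf_counter()
col_sum = sum(a[:, j].sum() for j in range(a.shape[1]))  # strided columns
t2 = time.perf_counter()

# Same result, different memory-access pattern
print(f"row-wise: {t1 - t0:.3f} s, column-wise: {t2 - t1:.3f} s")
```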
- Natural process
- "In the long run every program becomes rococo - then rubble."
- http://pu.inf.uni-tuebingen.de/users/klaeren/epigrams.html
- Needs tests
- Sometimes recompute faster than reading from disk
- Memoization
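Memoization is one line in Python with the standard-library `functools.lru_cache`: each value is computed once and served from the cache afterwards, turning the exponential naive recursion into linear work:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Without the cache this recursion is O(2^n); with it, O(n)
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(80))
```

`fib.cache_info()` reports hits and misses, which is useful when checking whether caching actually pays off versus recomputing.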
- Every language allows you to write slow code
- Pre-factor vs. scaling
- "Slow" language may catch the worm
- Combine languages (FFI)
- Functional programming vs. OOP
- Matlab
- Python
- Julia
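Combining languages via an FFI, sketched with Python's ctypes calling `sqrt` from the C math library (assumes a Unix-like system with glibc; the fallback soname is a Linux-specific guess):

```python
import ctypes
import ctypes.util

# Locate and load the C math library
libname = ctypes.util.find_library("m") or "libm.so.6"
libm = ctypes.CDLL(libname)

# Declare the C signature: double sqrt(double)
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))
```

The same pattern scales up: keep the high-level driver in Python and push the hot inner kernels into a compiled library.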
- Amdahl's law
- Serial and parallel part scale differently w.r.t. system size
- Parallel region too small
- Serialization of parallel tasks
- I/O heavy code
- Communication overhead
- Work load imbalance
- Resource imbalance
- Memory-bound code
- Sub-optimal use of libraries
- System size too small
- Wrong code
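Amdahl's law from the list above as a quick calculation: with parallel fraction p on n workers, speedup = 1 / ((1 - p) + p / n), so even 95% parallel code saturates below 20x no matter how many cores you add:

```python
def amdahl_speedup(p, n):
    # p: parallelizable fraction of the runtime, n: number of workers
    return 1.0 / ((1.0 - p) + p / n)

for n in (1, 16, 256, 4096):
    print(n, round(amdahl_speedup(0.95, n), 2))
```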
- Moving target
- Trend goes towards multi-processor and multi-thread
- If you rewrite your data structures for GPU/accelerators, the effort is most probably not wasted
- Restructure mathematical formulation
- Innovate at algorithm level
- Then maybe port
- Use libraries if you can
- Libraries are dependencies
- Use elemental functions
- Use pure functions/subroutines (Fortran)
- Use intrinsics
- Beware of multiple loops
- Do not allocate/deallocate inside multiple loops
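Moving allocation out of a loop, sketched with NumPy (assumes NumPy is available): the scratch buffer is allocated once and reused instead of being re-created on every iteration:

```python
import numpy as np

n, steps = 1000, 50

# Bad: a fresh temporary array is allocated on every iteration
def with_allocation():
    total = 0.0
    for _ in range(steps):
        buf = np.zeros(n)      # allocation inside the loop
        buf += 1.0
        total += buf.sum()
    return total

# Better: allocate the scratch buffer once, outside the loop
def preallocated():
    total = 0.0
    buf = np.empty(n)
    for _ in range(steps):
        buf.fill(1.0)          # reuse, no new allocation
        total += buf.sum()
    return total

print(with_allocation(), preallocated())
```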
- Profile first, optimize later
- When you scale the system, new bottlenecks will appear