TornadoVM is a Java library for hardware acceleration of Java and JVM applications. It has a JIT compiler, a runtime system, and several backends that offload, manage memory and handle execution on GPUs, FPGAs, and multicore CPUs transparently.
This repository contains a few examples for demonstration purposes.
Note: Examples using TornadoVM v1.0.9-dev
Outline:
Topic | Link |
---|---|
Install TornadoVM | link |
Setup the examples | link |
Mandelbrot demo | link |
Blur Filter demo | link |
Multi-Image Processing demo | link |
KMeans Clustering demo | link |
Live Task Migration demo | link |
Matrix Multiplication | link |
To run the examples, first build TornadoVM with any backend (OpenCL, PTX and/or SPIR-V).
Important: If you do not have an NVIDIA GPU and the CUDA SDK installed, do not use the flag --ptx
in the following command.
Similarly, if your device/system does not support SPIR-V, do not use the --spirv
flag.
To install TornadoVM, it requires as prerequisite:
- The driver installed (e.g., NVIDIA + CUDA Driver for NVIDIA GPUs, or oneAPI for Intel platforms).
- Maven
TornadoVM includes an easy installer script for Linux and OSx:
$ git clone https://github.com/beehive-lab/TornadoVM
$ cd TornadoVM
## Choose the backend/s that applies to your system. You can install multiple ones
$ ./bin/tornadovm-installer --jdk jdk21 --backend=opencl,ptx,spirv
$ source setvars.sh
tornado --devices
Number of Tornado drivers: 2
Total number of PTX devices : 1
Tornado device=0:0
PTX -- NVIDIA GeForce RTX 2060 with Max-Q Design
Global Memory Size: 5.8 GB
Local Memory Size: 48.0 KB
Workgroup Dimensions: 3
Max WorkGroup Configuration: [1024, 1024, 64]
Device OpenCL C version: N/A
Total number of OpenCL devices : 3
Tornado device=1:0
NVIDIA CUDA -- NVIDIA GeForce RTX 2060 with Max-Q Design
Global Memory Size: 5.8 GB
Local Memory Size: 48.0 KB
Workgroup Dimensions: 3
Max WorkGroup Configuration: [1024, 1024, 64]
Device OpenCL C version: OpenCL C 1.2
Tornado device=1:1
Intel(R) OpenCL HD Graphics -- Intel(R) UHD Graphics [0x9bc4]
Global Memory Size: 24.9 GB
Local Memory Size: 64.0 KB
Workgroup Dimensions: 3
Max WorkGroup Configuration: [256, 256, 256]
Device OpenCL C version: OpenCL C 3.0
Tornado device=1:2
Intel(R) CPU Runtime for OpenCL(TM) Applications -- Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
Global Memory Size: 31.1 GB
Local Memory Size: 32.0 KB
Workgroup Dimensions: 3
Max WorkGroup Configuration: [8192, 8192, 8192]
Device OpenCL C version: OpenCL C 2.0
Note that, depending on the devices you have and the drivers installed (e.g., NVIDIA CUDA, OpenCL, SPIR-V), you will see different implementations.
git clone https://github.com/jjfumero/tornadovm-examples
cd tornadovm-examples
source /path-to-your-Tornado-DIR/source.sh
export TORNADO_SDK=/path-to-your-Tornado-DIR/bin/sdk
mvn clean package
## List all devices and backends available
tornado --devices
## Run the multi-thread version for reference
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.Mandelbrot mt
## Run the TornadoVM Version (it will select the device 0:0 by default)
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.Mandelbrot tornado
## Print the device and thread information in which the application is running
tornado --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.Mandelbrot tornado
## Get the SPIRV code (assuming the SPIRV backend is installed in the device 0:0)
tornado --debug --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.Mandelbrot tornado
spirv-dis <spirv-binary>
## Change the device
tornado --threadInfo --jvm="-Dfractal.mandelbrot.device=1:1" -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.Mandelbrot tornado
## Run with the tornadoVM profiler
tornado --enableProfiler console --threadInfo --jvm="-Dfractal.mandelbrot.device=0:0" -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.Mandelbrot tornado
This examples shows a blur effect in a photo. Example of computational photography.
Place an JPEG image in ./image.jpg
or feel free to change the path your images.
## List all devices and backends available
tornado --devices
## Run the Java Parallel Stream Version on CPU for reference
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter mt
## Run the Accelerated Version on the default device
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado
## Run the Accelerated Version on the default device with info about the accelerator
tornado --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado
## Print the generated kernel
tornado --printKernel --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado
## Run in another device
tornado --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=1:1
## Run with concurrent multi-devices
tornado --threadInfo --enableConcurrentDevices --jvm=" -Dblur.red.device=1:0 -Dblur.green.device=2:0 -Dblur.blue.device=1:2" -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado
Demonstration of a Task-Graph to compute:
- Black and White filter
- Blur Filer
using multiple GPUs (or accelerators) at the same time for each task.
## Run the Java Parallel Stream Version on CPU for reference
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.MultiImageProcessor mt
## Run the Accelerated Version on the default device
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.MultiImageProcessor tornado
The examples is created based on the following device setup:
## List all devices and backends available
$ tornado --devices
Number of Tornado drivers: 3
Driver: SPIRV
Total number of SPIRV devices : 1
Tornado device=0:0 (DEFAULT)
SPIRV -- SPIRV LevelZero - Intel(R) UHD Graphics 770
Global Memory Size: 24.9 GB
Local Memory Size: 64.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [512]
Max WorkGroup Configuration: [512, 512, 512]
Device OpenCL C version: (LEVEL ZERO) 1.3
Driver: OpenCL
Total number of OpenCL devices : 4
Tornado device=1:0
OPENCL -- [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070
Global Memory Size: 7.8 GB
Local Memory Size: 48.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [1024]
Max WorkGroup Configuration: [1024, 1024, 64]
Device OpenCL C version: OpenCL C 1.2
Tornado device=1:1
OPENCL -- [Intel(R) OpenCL Graphics] -- Intel(R) UHD Graphics 770
Global Memory Size: 24.9 GB
Local Memory Size: 64.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [512]
Max WorkGroup Configuration: [512, 512, 512]
Device OpenCL C version: OpenCL C 1.2
Tornado device=1:2
OPENCL -- [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i7-12700K
Global Memory Size: 31.1 GB
Local Memory Size: 32.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [8192]
Max WorkGroup Configuration: [8192, 8192, 8192]
Device OpenCL C version: OpenCL C 3.0
Tornado device=1:3
OPENCL -- [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
Global Memory Size: 31.1 GB
Local Memory Size: 256.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [67108864]
Max WorkGroup Configuration: [67108864, 67108864, 67108864]
Device OpenCL C version: OpenCL C 1.2
Driver: PTX
Total number of PTX devices : 1
Tornado device=2:0
PTX -- PTX -- NVIDIA GeForce RTX 3070
Global Memory Size: 7.8 GB
Local Memory Size: 48.0 KB
Workgroup Dimensions: 3
Total Number of Block Threads: [2147483647, 65535, 65535]
Max WorkGroup Configuration: [1024, 1024, 64]
Device OpenCL C version: N/A
To change the accelerator, use the following instructions:
TornadoDevice device0 = TornadoExecutionPlan.getDevice(0, 0);
TornadoDevice device1 = TornadoExecutionPlan.getDevice(1, 0);
TornadoDevice device2 = TornadoExecutionPlan.getDevice(1, 1);
TornadoDevice device3 = TornadoExecutionPlan.getDevice(1, 2);
TornadoDevice device4 = TornadoExecutionPlan.getDevice(2, 0);
executionPlan.withConcurrentDevices() //
.withDevice("imageProcessor.blackAndWhite", device0) //
.withDevice("imageProcessor.blurRed", device1) //
.withDevice("imageProcessor.blurGreen", device2) //
.withDevice("imageProcessor.blurBlue", device4);
Full KMeans in which the assign-cluster function is expressed with TornadoVM.
# For example use 1000000 data points and classify them into 10 clusters.
# Points are selected randomly. This is just for quick experiments
## Sequential
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.KMeans seq 1000000 10
# TornadoVM version
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.KMeans tornado 1000000 10
## Run Server in one terminal
./runServer.sh
## Client in another terminal
./runClient.sh ## Change device during runtime
## Note: the application selects the backend 0:0 (default backend)
# type different <backend:deviceNumber> version from the client.
# Examples:
# 0:1
# 1:0
## etc
This application shows different implementation for Matrix Multiplications for:
- Single-core Sequential Implementation using Panama Segments
- Multi-core Parallel Streams Using Panama Segments
- Multi-core Java Threads using Panama Segments
- Single-core Sequential implementation using Panama Vector Types for the reduction part (dotProduct)
- Multi-core Parallel Stream implementation using Panama Vector Types for the reduction part (dotProduct)
- TornadoVM to run on the default device (usually a GPU if a GPU backends is available and installed)
How to run?
mvn clean install
## Run Java Sequential only
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.MatrixMultiplication onlyJavaSeq
## Run all Java implementations
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.MatrixMultiplication onlyJava
## Run only TornadoVM. It also runs 1 iteration of the sequential implementation to check results
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.MatrixMultiplication onlyTornadoVM
## Run all implementations, including TornadoVM
tornado -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.MatrixMultiplication