diff --git a/Benchmarks.md b/Benchmarks.md
index 495fdd9..5547993 100644
--- a/Benchmarks.md
+++ b/Benchmarks.md
@@ -1,5 +1,6 @@
-All tests are performed on i7-4770 with 2-channel DDR3-1600 memory, employing all CPU cores.
+All tests are performed on [i7-4770](https://ark.intel.com/products/75122/Intel-Core-i7-4770-Processor-8M-Cache-up-to-3_90-GHz)
+with 2-channel DDR3-1600 memory, employing all CPU cores.
 Speeds are measured in MiB/s (mebibytes/second), add 5% to convert into MB/s (megabytes/second).
 
 Executables are compiled by (-DSIMD selects vectorizable code path):
 
@@ -19,7 +20,7 @@ For NTT(2^20), we expect speed of 1 GB/s for SSE2 version, and 2 GB/s for AVX2 v
 
 ### Reed-Solomon encoding
 
-Reed-Solomon encoding (2^19 source blocks => 2^19 ECC blocks, 2052 bytes each) in GF(0xFFF00001):
+Reed-Solomon encoding (2^19 data blocks => 2^19 parity blocks, 2052 bytes each) in GF(0xFFF00001):
 ```
 rs64g-avx2: 1766 ms = 1162 MiB/s, cpu 12932 ms = 732%, os 31 ms
 rs64g-sse2: 2354 ms = 872 MiB/s, cpu 16677 ms = 708%, os 62 ms
diff --git a/GF.md b/GF.md
index f01b896..db69e16 100644
--- a/GF.md
+++ b/GF.md
@@ -8,7 +8,7 @@ This program is useful for researching ring properties, in particular maximal or
 
 ### Lucky number: choosing the best base for computations
 
-Since GF(2^n) doesn't have much roots of unity, NTT-based Reed-Solomon codes implementation can't perform computations in this field.
+Since GF(2^n) doesn't have much roots of unity, efficient NTT-based Reed-Solomon codes implementation can't perform computations in this field.
 Instead, we need to use other Galois Field, or even Ring modulo some number.
 GF(p^n) has a maximal order of p^n-1. For rings, the maximal order is defined by complex formula
 that you can find in chapter `39.7 Composite modulus: the ring Z=mZ` of [FxtBook](http://www.jjj.de/fxt/fxtbook.pdf).
@@ -85,7 +85,7 @@ After the flag 0, remaining input items contains values of remaining output elem
 Once input (source) data are recoded in this way, we need to store the extra bit in the way which ensure that
 the bit can be restored in any situation when the data block can be restored.
 The best way to ensure this, that I found, is to save the extra bit as one more (1025'th)
-source word. So, all operations are performed on 4100-byte blocks, and ECC sectors stored are 4100-byte long. Sad, but I don't see better choice.
+source word. So, all operations are performed on 4100-byte blocks, and parity sectors stored are 4100-byte long. Sad, but I don't see better choice.
 Remaining bits of the extra word can be used to store block checksum, although i don't see much gain in that.
 Of course, when 64-bit base and/or GF(p^2) field are used, extra data will be increased to 8-16 bytes.
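
Editor's note on the hunks above (illustration only, not part of the patch): both the benchmarks and the `Lucky number` discussion rely on the base 0xFFF00001 = 2^32 - 2^20 + 1, whose group order p-1 is divisible by 2^20, which is what permits power-of-two NTTs up to 2^20 points. A minimal standalone C++ check of that property, with no names taken from FastECC itself, could look like:

```cpp
// Standalone illustration (not FastECC code): why 0xFFF00001 is a convenient base.
// The largest power-of-two NTT size available in GF(p) is the largest 2^k dividing p-1.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t p = 0xFFF00001;            // base used in the benchmarks above (2^32 - 2^20 + 1)
    uint64_t odd = p - 1;                     // order of the multiplicative group
    unsigned k = 0;
    while (odd % 2 == 0) { odd /= 2; ++k; }   // strip factors of two
    // Prints: p-1 = 2^20 * 4095, i.e. power-of-two NTTs up to 2^20 points are possible
    std::printf("p-1 = 2^%u * %llu\n", k, (unsigned long long)odd);
    return 0;
}
```
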
diff --git a/Overview.md b/Overview.md
index 9be8b82..d275001 100644
--- a/Overview.md
+++ b/Overview.md
@@ -11,7 +11,7 @@ Topics:
 * [Discrete Fourier transform](https://en.wikipedia.org/wiki/Discrete_Fourier_transform)
 * [Fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) and in particular
 [Cooley–Tukey FFT algorithm](https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm) as O(N*log(N)) algorithm implementing DFT
-* [Number-theoretic transform](https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general)) as modified FFT
+* [Fast Number-Theoretic Transform](https://en.wikipedia.org/wiki/Discrete_Fourier_transform_(general)) as modified FFT
 employing the same add/sub/mul operations and unity roots, but in Galois Field
 
 Once you grasped all these topics, you can grab some FFT implementation and convert it to NTT.
@@ -51,7 +51,7 @@ So we just multiply source `A` vector by Vandermonde `(n+k)*n` matrix generated
 It's guaranteed that any `n` different `a[i]` numbers form an invertible Vandermonde matrix, so we can restore from any `n` remaining words after a loss.
 * [Plank proposed](http://web.eecs.utk.edu/~plank/plank/papers/SPE-04.html) to start with Vandermonde `(n+k)*n` matrix and then apply
 the [Gaussian elimination](https://en.wikipedia.org/wiki/Gaussian_elimination) in order to convert it to some `(I,M)` matrix.
-As far as we perform this operation only once per a lot of ECC computations, we can ignore the time required by this operation.
+As far as we perform this operation only once per a lot of parity computations, we can ignore the time required by this operation.
 * PAR2 format employs `(I,V)` encoding matrix, i.e. it employs Vandermonde `k*n` matrix to compute `k` ecc words while employing the systematic code.
 Despite of special form of `a[i]` used in their Vandermonde matrix, the restoration matrix is sometimes non-invertible.
 But it seems to be a good compromise between the speed/complexity of computations and recovery strength.
diff --git a/README.md b/README.md
index 518447a..2417beb 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-FastECC implements O(N*log(N)) [Reed-Solomon coder], running at 1.2 GB/s on i7-4770 in (2^20, 2^19) config,
+FastECC implements O(N*log(N)) [Reed-Solomon coder], running at [1.2 GB/s] on [i7-4770] in (2^20, 2^19) config,
 i.e. calculating 524288 parity blocks from 524288 data blocks.
 
 Version 0.1 implements only encoding, so it isn't yet ready for real use.
@@ -16,11 +16,11 @@ And computations in GF(2^32), implemented in the same way, will build one millio
 The only exception is closed-source [RSC32 by persicum] with O(N*log(N)) speed,
 i.e. it spends O(log(N)) time per parity block.
 Its speed with million parity blocks is 100 MB/s, i.e. it computes one million of 4 KB parity blocks
 from one million of data blocks (processing 8 GB overall) in just 80 seconds.
-Note that all speeds mentioned here are measured on i7-4770, employing all features available in a particular program -
+Note that all speeds mentioned here are measured on [i7-4770], employing all features available in a particular program -
 including multi-threading, SIMD and x64 support.
 
 FastECC is open-source library implementing O(N*log(N)) encoding algorithm.
-It computes million parity blocks at 1.2 GB/s.
+It computes million parity blocks at [1.2 GB/s].
 Future versions will implement decoding that's also `O(N*log(N))`, although 1.5-3 times slower than encoding.
 Current implementation is limited to 2^20 blocks, removing this limit is the main priority for future work aside of decoder implementation.
@@ -41,7 +41,7 @@ Moreover, it works with binary data, so no need for [recoding](GF.md#data-packin
 ## How
 
 All O(N*log(N)) Reed-Solomon implementations I'm aware of, use fast transforms like FFT or FWT.
-FastECC employs Number-Theoretic Transform that is just an FFT over integer field or ring.
+FastECC employs fast Number-Theoretic Transform that is just an FFT over integer field or ring.
 Let's see how it works. Note that below by `length-N polynomial` I mean any polynomial with order < N.
 
 For any given set of N points, only one length-N polynomial may go through all these points.
@@ -184,6 +184,8 @@ So, overall, FastECC should replace any use of 16-bit RS codecs, while LDPC and
 
 - [Hacker News story](https://news.ycombinator.com/item?id=14290617)
 
+[1.2 GB/s]: Benchmarks.md#reed-solomon-encoding
+[i7-4770]: https://ark.intel.com/products/75122/Intel-Core-i7-4770-Processor-8M-Cache-up-to-3_90-GHz
 [Reed-Solomon coder]: https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction
 [MultiPar]: https://www.livebusinesschat.com/smf/index.php?board=396.0
 [RSC32 by persicum]: https://www.livebusinesschat.com/smf/index.php?board=399.0
diff --git a/RS.md b/RS.md
index e4c7a70..5d299ee 100644
--- a/RS.md
+++ b/RS.md
@@ -7,7 +7,7 @@
 ### Prior art
 
 The encoding and decoding algorithms implemented by FastECC were described in the paper
-[An Efficient (n,k) Information Dispersal Algorithm based on Fermat Number Transforms](http://ieeexplore.ieee.org/document/6545355/)
+[An Efficient (n,k) Information Dispersal Algorithm based on Fermat Number Transforms](https://pdfs.semanticscholar.org/141d/c4ee4cca45b4ed1c07f890f758e427597db8.pdf)
 published in 2013 by Sian-Jheng Lin and Wei-Ho Chung.
 
 The following is my own investigations written prior to reading this great paper :)
diff --git a/compile.cmd b/compile.cmd
index 0d31ebb..62cd722 100644
--- a/compile.cmd
+++ b/compile.cmd
@@ -27,6 +27,6 @@ cl -Fe%name%32m.exe -Fa%name%32.asm -arch:SSE2 %options_ms_cl% %options_ms_x86%
 ::g++ -std=c++14 -m32 -O3 %main% -static -fopenmp -o%name%32g-sse2 -msse2 -DSIMD=SSE2 -Xlinker --large-address-aware
 ::g++ -std=c++14 -m32 -O3 %main% -static -fopenmp -o%name%32g -mmmx -Xlinker --large-address-aware
 
-::cl -Feprime.exe -O2 -EHsc prime.cpp
+::cl -Feprime.exe -O2 -EHsc prime.cpp -link %options_ms_x86%
 
 del *.exe.bak *.obj *.res >nul 2>nul
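
Editor's note on the `## How` hunk in README.md above (illustration only, not part of the patch): FastECC is described as employing a fast Number-Theoretic Transform, i.e. an FFT over an integer field. Below is a minimal recursive radix-2 NTT over GF(0xFFF00001); it only sketches the transform, is not FastECC's implementation or API, and assumes the caller supplies a primitive N-th root of unity (e.g. `pow_mod(g, (P-1)/N)` for a generator `g`, which is left to the caller).

```cpp
// Illustration only: recursive radix-2 NTT over GF(p), p = 0xFFF00001.
// O(N*log N) modular multiplications; N must be a power of two dividing p-1 (so N <= 2^20).
#include <cstdint>
#include <vector>

static const uint64_t P = 0xFFF00001;

static uint64_t mul_mod(uint64_t a, uint64_t b) { return a * b % P; }  // a,b < 2^32, so a*b fits in 64 bits

// a^e mod P by square-and-multiply; usable to derive root = pow_mod(g, (P-1)/N) from a generator g
static uint64_t pow_mod(uint64_t a, uint64_t e) {
    uint64_t r = 1;
    for (; e; e >>= 1, a = mul_mod(a, a))
        if (e & 1) r = mul_mod(r, a);
    return r;
}

// In-place NTT of a (size N): a[i] becomes the value of the polynomial with coefficients a[] at root^i
static void ntt(std::vector<uint64_t>& a, uint64_t root) {
    const size_t n = a.size();
    if (n == 1) return;
    std::vector<uint64_t> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) { even[i] = a[2 * i]; odd[i] = a[2 * i + 1]; }
    const uint64_t root2 = mul_mod(root, root);   // primitive (N/2)-th root of unity
    ntt(even, root2);
    ntt(odd, root2);
    uint64_t w = 1;                               // butterfly: combine the two half-size transforms
    for (size_t i = 0; i < n / 2; ++i) {
        const uint64_t t = mul_mod(w, odd[i]);
        a[i]         = (even[i] + t) % P;
        a[i + n / 2] = (even[i] + P - t) % P;
        w = mul_mod(w, root);
    }
}
```

In the terms of the `## How` section above, encoding then amounts to interpolating the length-N polynomial through the data points (one inverse transform) and re-evaluating it at additional points (forward transforms).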