Suite of small tools for different tasks.
Built on `stufflib` headers.
Dataset transformation from raw, downloaded files into `stufflib_record` binary files.
./build/debug/tools/dataset cifar_to_png dataset_path output_path [-v]
./build/debug/tools/dataset spambase dataset_path output_path [-v]
./build/debug/tools/dataset rcv1 dataset_path output_path [-v]
Download the datasets to a directory, referred to as `dataset_path` in this readme.
- CIFAR-10 dataset homepage, last accessed 2024-06-01
- download the `.tar.gz` file and uncompress it into a directory, for example `cifar-10`
- after extraction, the directory should look like this
cifar-10
├── batches.meta.txt
├── data_batch_1.bin
├── data_batch_2.bin
├── data_batch_3.bin
├── data_batch_4.bin
├── data_batch_5.bin
├── readme.html
└── test_batch.bin
- Spambase dataset homepage, last accessed 2024-06-09
- download the `.zip` file and unzip it into a directory, for example `spambase`
- after extraction, the directory should look like this
spambase
├── spambase.data
├── spambase.DOCUMENTATION
└── spambase.names
- RCV1 homepage, last accessed 2024-06-23
- download the compressed data files into a directory, for example `rcv1`
- after extraction, the directory should look like this
rcv1
├── lyrl2004_vectors_test_pt0.dat
├── lyrl2004_vectors_test_pt1.dat
├── lyrl2004_vectors_test_pt2.dat
├── lyrl2004_vectors_test_pt3.dat
├── lyrl2004_vectors_train.dat
├── rcv1-v2.topics.qrels
└── rcv1v2-ids.dat
- NOTE: Shalev-Shwartz et al. (2011) seem to use the test set for training and the training set for testing.
Simple PNG decoder.
./build/debug/tools/png info png_path
./build/debug/tools/png dump_raw png_path block_type [block_types...]
./build/debug/tools/png segment png_src_path png_dst_path [--threshold-percent=N] [-v]
Decode a PNG image and output information in JSON.
This example requires `jq` for formatting the output.
If you don't want to install `jq`, remove `| jq .` from the example below to get the unformatted JSON on a single line.
./build/debug/tools/png info ./docs/img/tokyo.png | jq .
stdout:
{
"chunks": {
"IHDR": 1,
"IDAT": 13,
"IEND": 1,
"bKGD": 1,
"cHRM": 1,
"gAMA": 1,
"pHYs": 1,
"tEXt": 11,
"tIME": 1
},
"header": {
"width": 500,
"height": 500,
"bit depth": 8,
"color type": "rgb",
"compression": 0,
"filter": 0,
"interlace": 0
},
"data": {
"length": 756012,
"filters": {
"Sub": 31,
"Average": 228,
"Paeth": 241
}
}
}
Apply mean segmentation on PNG images.
Merges adjacent image segments by comparing the Euclidean distance between the average RGB pixel of each segment, where each RGB pixel (3 bytes) is interpreted as a vector of length 3: `[R, G, B]`.
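The merge criterion described above can be sketched in C as follows. This is a minimal illustration, not the tool's actual implementation: the names `rgb_mean`, `rgb_distance_squared`, and `should_merge` are hypothetical, and it assumes that `--threshold-percent` is a percentage of the largest possible distance between two 8-bit RGB colours (black to white). Comparing squared distances gives the same ordering as comparing Euclidean distances and avoids a square root.

```c
#include <assert.h>
#include <stdbool.h>

// Mean colour of a segment, one component per RGB channel.
struct rgb_mean {
  double r, g, b;
};

// Squared Euclidean distance between two mean colours treated as 3-vectors.
double rgb_distance_squared(struct rgb_mean a, struct rgb_mean b) {
  const double dr = a.r - b.r;
  const double dg = a.g - b.g;
  const double db = a.b - b.b;
  return dr * dr + dg * dg + db * db;
}

// Assumption: the threshold is a percentage of the black-to-white distance,
// which is sqrt(3 * 255^2) for 8-bit channels.
bool should_merge(struct rgb_mean a, struct rgb_mean b,
                  double threshold_percent) {
  assert(0.0 <= threshold_percent && threshold_percent <= 100.0);
  const double max_squared = 3.0 * 255.0 * 255.0;
  const double fraction = threshold_percent / 100.0;
  // d <= max * fraction is equivalent to d^2 <= max^2 * fraction^2.
  return rgb_distance_squared(a, b) <= max_squared * fraction * fraction;
}
```

Under this assumption, a larger threshold lets segments with more dissimilar mean colours merge, which would yield fewer and larger segments.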
./build/debug/tools/png segment \
--threshold-percent=10 \
./docs/img/tokyo.png \
./docs/img/tokyo_segmented_10p.png
./build/debug/tools/png segment \
--threshold-percent=20 \
./docs/img/tokyo.png \
./docs/img/tokyo_segmented_20p.png
./build/debug/tools/png segment \
--threshold-percent=30 \
./docs/img/tokyo.png \
./docs/img/tokyo_segmented_30p.png
Decode a PNG image into chunks and write raw chunk data to stdout. Use positional arguments to filter a subset of chunk types.
This example requires `xxd`.
./build/debug/tools/png dump_raw ./test-data/png/ff0000-1x1-rgb-fixed.png IHDR IDAT | xxd -b
stdout:
00000000: 00000000 00000000 00000000 00000001 00000000 00000000 ......
00000006: 00000000 00000001 00001000 00000010 00000000 00000000 ......
0000000c: 00000000 00001000 00011101 01100011 11111000 11001111 ...c..
00000012: 11000000 00000000 00000000 00000011 00000001 00000001 ......
00000018: 00000000 .
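The first 13 bytes of the dump above are the IHDR chunk data. As a rough sketch of how they map to the header fields printed by `png info`, the following C code decodes them according to the PNG chunk layout (big-endian 32-bit width and height, followed by five single-byte fields). The names `ihdr`, `read_u32_be`, `decode_ihdr`, and `example_ihdr` are hypothetical and not taken from `stufflib`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Fields of a PNG IHDR chunk, in the order they appear in the chunk data.
struct ihdr {
  uint32_t width;
  uint32_t height;
  uint8_t bit_depth;
  uint8_t color_type;
  uint8_t compression;
  uint8_t filter;
  uint8_t interlace;
};

// PNG stores multi-byte integers in big-endian (network) byte order.
uint32_t read_u32_be(const uint8_t* p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

// Decode the 13 bytes of IHDR chunk data.
struct ihdr decode_ihdr(const uint8_t* data, size_t len) {
  assert(data && len >= 13);
  return (struct ihdr){
      .width = read_u32_be(data),
      .height = read_u32_be(data + 4),
      .bit_depth = data[8],
      .color_type = data[9],
      .compression = data[10],
      .filter = data[11],
      .interlace = data[12],
  };
}

// The IHDR bytes of the 1x1 test image, copied from the xxd output above.
const uint8_t example_ihdr[13] = {0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00,
                                  0x01, 0x08, 0x02, 0x00, 0x00, 0x00};
```

Calling `decode_ihdr(example_ihdr, sizeof example_ihdr)` gives a 1x1 image with bit depth 8 and colour type 2, which is truecolour RGB in the PNG specification. The remaining 12 bytes of the dump are the IDAT chunk data.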
Simple line sorting.
./build/debug/tools/sort { numeric | ascii } [--reverse] path
Create data (on macOS, use `gfind`) by calculating the size of each input file used during testing:
find ./test-data/png -name '*.png' -printf '%s\n' > test-data-sizes.txt
./build/debug/tools/sort numeric ./test-data-sizes.txt
stdout:
69
69
69
72
72
72
72
160
237
238
238
1554
2970
4096
11223
24733
./build/debug/tools/sort ascii ./test-data-sizes.txt
stdout:
11223
1554
160
237
238
238
24733
2970
4096
69
69
69
72
72
72
72
./build/debug/tools/sort numeric --reverse ./test-data-sizes.txt
stdout:
24733
11223
4096
2970
1554
238
238
237
160
72
72
72
72
69
69
69
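The difference between the `numeric` and `ascii` orderings shown above comes down to the comparison function. The following C sketch reproduces both orderings with the standard `qsort`. It is an illustration only and makes no claim about how the tool itself sorts; `compare_ascii`, `compare_numeric`, and `sort_lines` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// qsort comparator: order lines by their raw bytes, so "11223" < "69".
int compare_ascii(const void* lhs, const void* rhs) {
  const char* a = *(const char* const*)lhs;
  const char* b = *(const char* const*)rhs;
  return strcmp(a, b);
}

// qsort comparator: parse each line as a base-10 integer and order by value.
int compare_numeric(const void* lhs, const void* rhs) {
  const long long a = strtoll(*(const char* const*)lhs, NULL, 10);
  const long long b = strtoll(*(const char* const*)rhs, NULL, 10);
  // Avoids the overflow that returning a - b could cause.
  return (a > b) - (a < b);
}

// Sort an array of line pointers in place with the given comparator.
void sort_lines(const char** lines, size_t count,
                int (*compare)(const void*, const void*)) {
  assert(lines && count > 0);
  qsort(lines, count, sizeof lines[0], compare);
}
```

With the lines `69`, `1554`, `238`, and `11223`, `compare_numeric` orders them as 69, 238, 1554, 11223, while `compare_ascii` orders them as 11223, 1554, 238, 69, because the comparison stops at the first differing byte.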
./build/debug/tools/svm experiment dataset_dir [-v]
- linear SVM
- all features min-max rescaled to `[-1, 1]`
- random train-test split with 2000 test samples
- learning rate `1e-9`
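Min-max rescaling to `[-1, 1]` maps the smallest value of a feature to -1 and the largest to 1, with everything else placed linearly in between. A minimal sketch for a single feature column is shown below. The function name `rescale_min_max` is hypothetical, and mapping a constant feature to 0 is an assumption, since the behaviour of the tool in that case is not documented here.

```c
#include <assert.h>
#include <stddef.h>

// Rescale one feature column in place so that min maps to -1 and max to 1.
// Assumption: a constant feature (max == min) is mapped to 0.
void rescale_min_max(double* values, size_t n) {
  assert(values && n > 0);
  double lo = values[0];
  double hi = values[0];
  for (size_t i = 1; i < n; ++i) {
    if (values[i] < lo) lo = values[i];
    if (values[i] > hi) hi = values[i];
  }
  const double range = hi - lo;
  for (size_t i = 0; i < n; ++i) {
    // (x - lo) / range lies in [0, 1]; stretch and shift it to [-1, 1].
    values[i] = range > 0.0 ? 2.0 * (values[i] - lo) / range - 1.0 : 0.0;
  }
}
```

For example, the column `{0, 5, 10}` becomes `{-1, 0, 1}`.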
./build/debug/tools/svm spambase out -v |& jq .
output
{
"level": "info",
"file": "src/tools/svm.c",
"line": 20,
"msg": "training SVM on spambase dataset from 'out'"
}
{
"level": "info",
"file": "src/tools/svm.c",
"line": 94,
"msg": "spambase dataset, random train set, linear SVM"
}
{
"tp": 854,
"tn": 1459,
"fp": 210,
"fn": 78,
"accuracy": 0.889,
"precision": 0.803,
"recall": 0.916,
"f1_score": 0.856
}
{
"level": "info",
"file": "src/tools/svm.c",
"line": 105,
"msg": "spambase dataset, random test set, linear SVM"
}
{
"tp": 605,
"tn": 1197,
"fp": 144,
"fn": 54,
"accuracy": 0.901,
"precision": 0.808,
"recall": 0.918,
"f1_score": 0.859
}
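The four scores in each result object follow from the confusion-matrix counts (`tp`, `tn`, `fp`, `fn`) by the standard definitions. The sketch below recomputes them; the type and function names are hypothetical, and reporting 0 for a ratio with a zero denominator is an assumption.

```c
#include <assert.h>

// Counts of true/false positives and negatives.
struct confusion {
  long tp, tn, fp, fn;
};

struct metrics {
  double accuracy, precision, recall, f1_score;
};

// Assumption: a ratio with a zero denominator is reported as 0.
static double ratio(double numerator, double denominator) {
  return denominator > 0.0 ? numerator / denominator : 0.0;
}

struct metrics compute_metrics(struct confusion c) {
  assert(c.tp >= 0 && c.tn >= 0 && c.fp >= 0 && c.fn >= 0);
  const double tp = (double)c.tp;
  const double tn = (double)c.tn;
  const double fp = (double)c.fp;
  const double fn = (double)c.fn;
  return (struct metrics){
      .accuracy = ratio(tp + tn, tp + tn + fp + fn),
      .precision = ratio(tp, tp + fp),
      .recall = ratio(tp, tp + fn),
      // Harmonic mean of precision and recall, simplified.
      .f1_score = ratio(2.0 * tp, 2.0 * tp + fp + fn),
  };
}
```

Plugging in the test-set counts above (`tp` 605, `tn` 1197, `fp` 144, `fn` 54) gives 1802/2000 = 0.901 accuracy, 605/749 ≈ 0.808 precision, 605/659 ≈ 0.918 recall, and 1210/1408 ≈ 0.859 F1, matching the reported values.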
./build/debug/tools/txt concat path [paths...]
./build/debug/tools/txt count pattern path
./build/debug/tools/txt slicelines begin count path
./build/debug/tools/txt replace pattern replacement path
./build/debug/tools/txt linefreq path
./build/debug/tools/txt concat ./test-data/txt/wikipedia/water_{ja,is,hi,zh}.txt
stdout:
水(みず、(英: water、他言語呼称は「他言語での呼称」の項を参照)とは、化学式 H2O で表される、水素と酸素の化合物である。日本語においては特に湯と対比して用いられ、液体ではあるが温度が低く、かつ凝固して氷にはなっていない物を言う。また、液状の物全般を指す。
Vatn er ólífrænn lyktar-, bragð- og nær litlaus vökvi sem er lífsnauðsynlegur öllum þekktum lífverum, þrátt fyrir að gefa þeim hvorki fæðu, orku né næringarefni. Vatnssameindin er samsett úr tveimur vetnisfrumeindum og einni súrefnisfrumeind sem tengjast með samgildistengi og hefur efnaformúluna H2O. Vatn er uppistaðan í vatnshvolfi jarðar. Orðið „vatn“ á við um efnið eins og það kemur fyrir við staðalhita og staðalþrýsting.
जल या पानी एक आम रासायनिक पदार्थ है जिसका अणु दो हाइड्रोजन परमाणु और एक प्राणवायु परमाणु से बना है - H2O। यह सारे प्राणियों के जीवन का आधार है। आमतौर पर जल शब्द का प्रयोग द्रव अवस्था के लिए उपयोग में लाया जाता है पर यह ठोस अवस्था (बर्फ) और गैसीय अवस्था (भाप या जल वाष्प) में भी पाया जाता है। पानी जल-आत्मीय सतहों पर तरल-क्रिस्टल के रूप में भी पाया जाता है।
水是地球上最常见的物质之一,是由氢、氧两种元素經過化學反應後组成的无机化合物(分子式:H2O),在常温常压下为无色无味的透明液体。
./build/debug/tools/txt count 'struct' src/tools/txt.c
./build/debug/tools/txt count '##' README.md
./build/debug/tools/txt count 'ある' README.md
./build/debug/tools/txt count 'ið' README.md
stdout:
36
43
3
4
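Counting a multi-byte pattern such as `ある` does not require decoding the text, because a valid UTF-8 pattern can only match at character boundaries in valid UTF-8 text. The following sketch counts occurrences with a plain byte search. It is not the tool's implementation: `count_pattern` is a hypothetical name, and counting non-overlapping matches is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Count occurrences of pattern in text using a byte-wise search.
// Assumption: matches do not overlap, so "aa" occurs twice in "aaaa".
size_t count_pattern(const char* text, const char* pattern) {
  assert(text && pattern);
  const size_t pattern_len = strlen(pattern);
  if (pattern_len == 0) {
    return 0;  // An empty pattern is treated as having no matches.
  }
  size_t count = 0;
  for (const char* match = strstr(text, pattern); match;
       match = strstr(match + pattern_len, pattern)) {
    ++count;
  }
  return count;
}
```

For instance, `count_pattern("水素と酸素", "素")` returns 2 when the source file is encoded as UTF-8.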
./build/debug/tools/txt slicelines 315 10 ./src/tools/txt.c
stdout:
int main(int argc, char* const argv[argc + 1]) {
struct sl_args args = {.argc = argc, .argv = argv};
bool ok = false;
char* command = sl_args_get_positional(&args, 0);
if (command) {
if (strcmp(command, "concat") == 0) {
ok = concat(&args);
} else if (strcmp(command, "count") == 0) {
ok = count(&args);
} else if (strcmp(command, "slicelines") == 0) {
./build/debug/tools/txt replace '水' water ./test-data/txt/wikipedia/water_ja.txt
stdout:
water(みず、(英: water、他言語呼称は「他言語での呼称」の項を参照)とは、化学式 H2O で表される、water素と酸素の化合物である。日本語においては特に湯と対比して用いられ、液体ではあるが温度が低く、かつ凝固して氷にはなっていない物を言う。また、液状の物全般を指す。
clang-18 -std=c23 -E -I./include ./src/tools/txt.c \
| ./build/debug/tools/txt replace ' ' '' /dev/stdin \
| ./build/debug/tools/txt replace $'\n ' $'\n' /dev/stdin \
| ./build/debug/tools/txt linefreq /dev/stdin \
| ./build/debug/tools/sort numeric --reverse /dev/stdin \
| ./build/debug/tools/txt slicelines 0 25 /dev/stdin
stdout:
249 }
19 };
18 __attribute__ ((__const__));
17 {
13 goto done;
11 } break;
10 __extension__
9 return false;
9 return dst;
9 __attribute__ ((__nothrow__ )) __attribute__ ((__pure__)) __attribute__ ((__nonnull__ (1, 2)));
7 for (size_t i = 0; i < n; ++i) {
6 } else {
6 __attribute__ ((__nothrow__ )) __attribute__ ((__nonnull__ (1)));
6 if (0x80 <= byte && byte <= 0xbf) {
6 done:
6 # 1 "/usr/include/aarch64-linux-gnu/bits/libc-header-start.h" 1 3 4
5 struct sl_string content = sl_string_from_file(path);
5 bool is_done = false;
5 # 1 "/usr/include/aarch64-linux-gnu/bits/wordsize.h" 1 3 4
5 const int args_count = sl_args_count_positional(args) - 1;
5 # 1 "/usr/include/assert.h" 1 3 4
5 ;
5 is_done = true;
5 sl_string_destroy(&content);
5 "\n"
./build/debug/tools/png dump_raw ./docs/img/tokyo.png tEXt \
| ./build/debug/tools/txt replace date: $'\n'date= /dev/stdin \
| ./build/debug/tools/txt replace exif: $'\n'exif= /dev/stdin \
| ./build/debug/tools/txt replace 0x00 ': ' /dev/stdin \
&& echo
stdout:
date=create: 2023-01-23T21:22:19+00:00
date=modify: 2023-01-23T21:22:19+00:00
exif=ColorSpace: 1
exif=ComponentsConfiguration: 1, 2, 3, 0
exif=ExifOffset: 90
exif=ExifVersion: 48, 50, 50, 49
exif=FlashPixVersion: 48, 49, 48, 48
exif=PixelXDimension: 2100
exif=PixelYDimension: 2100
exif=SceneCaptureType: 0
exif=YCbCrPositioning: 1: :