Tools

Suite of small tools for different tasks. Built on stufflib headers.

dataset

Dataset transformation from raw, downloaded files into stufflib_record binary files.

Usage

./build/debug/tools/dataset cifar_to_png dataset_path output_path [-v]
./build/debug/tools/dataset spambase dataset_path output_path [-v]
./build/debug/tools/dataset rcv1 dataset_path output_path [-v]

Dataset preparation

Download datasets to some directory, which will be referred to as dataset_path in this readme.

CIFAR-10

Dataset homepage, last accessed 2024-06-01.
Download the .tar.gz file and extract it into a directory, for example cifar-10.
After extraction, the directory should look like this:

cifar-10
├── batches.meta.txt
├── data_batch_1.bin
├── data_batch_2.bin
├── data_batch_3.bin
├── data_batch_4.bin
├── data_batch_5.bin
├── readme.html
└── test_batch.bin

spambase

Dataset homepage, last accessed 2024-06-09.
Download the .zip file and unzip it into a directory, for example spambase.
After extraction, the directory should look like this:

spambase
├── spambase.data
├── spambase.DOCUMENTATION
└── spambase.names

RCV1

RCV1 homepage, last accessed 2024-06-23.
Download the compressed data files and extract them into a directory, for example rcv1.
After extraction, the directory should look like this:

rcv1
├── lyrl2004_vectors_test_pt0.dat
├── lyrl2004_vectors_test_pt1.dat
├── lyrl2004_vectors_test_pt2.dat
├── lyrl2004_vectors_test_pt3.dat
├── lyrl2004_vectors_train.dat
├── rcv1-v2.topics.qrels
└── rcv1v2-ids.dat

NOTE: Shalev-Shwartz et al. (2011) seem to use the test set for training and the training set for testing.

png

source

Simple PNG decoder.

Usage

./build/debug/tools/png info png_path
./build/debug/tools/png dump_raw png_path block_type [block_types...]
./build/debug/tools/png segment png_src_path png_dst_path [--threshold-percent=N] [-v]

info

Decode a PNG image and output information in JSON.

This example requires jq for formatting the output. If you don't want to install jq, remove `| jq .` from the example below to get the unformatted JSON on a single line.

Input

./build/debug/tools/png info ./docs/img/tokyo.png | jq .

stdout:

{
  "chunks": {
    "IHDR": 1,
    "IDAT": 13,
    "IEND": 1,
    "bKGD": 1,
    "cHRM": 1,
    "gAMA": 1,
    "pHYs": 1,
    "tEXt": 11,
    "tIME": 1
  },
  "header": {
    "width": 500,
    "height": 500,
    "bit depth": 8,
    "color type": "rgb",
    "compression": 0,
    "filter": 0,
    "interlace": 0
  },
  "data": {
    "length": 756012,
    "filters": {
      "Sub": 31,
      "Average": 228,
      "Paeth": 241
    }
  }
}

Image segmentation

Apply mean segmentation on PNG images.

Merges adjacent image segments by comparing the Euclidean distance between the average RGB pixels of the two segments, where each RGB pixel (3 bytes) is interpreted as a vector of length 3: [R, G, B].
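
The merge criterion can be sketched in C as follows. This is a minimal sketch and not the code in the tool's source: the names `rgb_mean`, `rgb_distance_squared` and `should_merge` are hypothetical, and it assumes that the threshold percentage is measured against the largest possible distance between two RGB vectors (from [0, 0, 0] to [255, 255, 255]), which this readme does not confirm.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical type: the average colour of one segment as a vector [R, G, B].
struct rgb_mean {
  double r;
  double g;
  double b;
};

// Squared Euclidean distance between two average colours.
// Comparing squared distances gives the same ordering as comparing distances
// and avoids a square root.
static double rgb_distance_squared(struct rgb_mean a, struct rgb_mean b) {
  const double dr = a.r - b.r;
  const double dg = a.g - b.g;
  const double db = a.b - b.b;
  return dr * dr + dg * dg + db * db;
}

// ASSUMPTION: two segments are merged when the distance between their average
// colours is at most threshold_percent of the maximum possible distance.
static bool should_merge(struct rgb_mean a,
                         struct rgb_mean b,
                         double threshold_percent) {
  assert(threshold_percent >= 0.0 && threshold_percent <= 100.0);
  const double max_distance_squared = 3.0 * 255.0 * 255.0;
  const double ratio = threshold_percent / 100.0;
  return rgb_distance_squared(a, b) <= ratio * ratio * max_distance_squared;
}
```

Under this assumption, a larger threshold merges more segments, which is the effect the three threshold examples below are meant to show.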

Threshold 10%

./build/debug/tools/png segment \
  --threshold-percent=10 \
  ./docs/img/tokyo.png \
  ./docs/img/tokyo_segmented_10p.png

Threshold 20%

./build/debug/tools/png segment \
  --threshold-percent=20 \
  ./docs/img/tokyo.png \
  ./docs/img/tokyo_segmented_20p.png

Threshold 30%

./build/debug/tools/png segment \
  --threshold-percent=30 \
  ./docs/img/tokyo.png \
  ./docs/img/tokyo_segmented_30p.png

Dump raw chunks

Decode a PNG image into chunks and write raw chunk data to stdout. Pass chunk types as positional arguments to select which chunks are written.

Example: dump IHDR and IDAT contents of a single red pixel

This example requires xxd.

./build/debug/tools/png dump_raw ./test-data/png/ff0000-1x1-rgb-fixed.png IHDR IDAT | xxd -b

stdout:

00000000: 00000000 00000000 00000000 00000001 00000000 00000000  ......
00000006: 00000000 00000001 00001000 00000010 00000000 00000000  ......
0000000c: 00000000 00001000 00011101 01100011 11111000 11001111  ...c..
00000012: 11000000 00000000 00000000 00000011 00000001 00000001  ......
00000018: 00000000                                               .
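
The first 13 bytes of the dump above are the IHDR chunk data. The PNG specification fixes their layout: width and height as 4-byte big-endian integers, followed by one byte each for bit depth, color type, compression method, filter method and interlace method. The following is a minimal sketch of decoding them; `struct ihdr`, `read_be32` and `parse_ihdr` are hypothetical names and are not taken from the tool's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical container for the 13 bytes of IHDR chunk data.
struct ihdr {
  uint32_t width;
  uint32_t height;
  uint8_t bit_depth;
  uint8_t color_type;
  uint8_t compression;
  uint8_t filter;
  uint8_t interlace;
};

// Read a 4-byte big-endian unsigned integer.
static uint32_t read_be32(const unsigned char* p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

// Decode IHDR chunk data. The caller must pass at least 13 readable bytes.
static struct ihdr parse_ihdr(const unsigned char* data) {
  assert(data != NULL);
  struct ihdr header;
  header.width = read_be32(data);
  header.height = read_be32(data + 4);
  header.bit_depth = data[8];
  header.color_type = data[9];
  header.compression = data[10];
  header.filter = data[11];
  header.interlace = data[12];
  return header;
}
```

Applied to the bytes `00 00 00 01 00 00 00 01 08 02 00 00 00` from the dump, this gives width 1, height 1, bit depth 8 and color type 2 (truecolor RGB). The remaining 12 bytes are the IDAT data, which is a zlib stream.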

sort

source

Simple line sorting.

Usage

./build/debug/tools/sort { numeric | ascii } path
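
The two modes order the same lines differently: as numbers, 9 comes before 100, while as ASCII strings "100" comes before "9". The following minimal sketch illustrates the difference with the standard `qsort`. It is not the implementation in the tool's source: `compare_ascii`, `compare_numeric` and `sort_lines` are hypothetical names, and it assumes that every line holds a base-10 integer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Compare two lines byte by byte, as ASCII strings.
static int compare_ascii(const void* lhs, const void* rhs) {
  const char* const* a = lhs;
  const char* const* b = rhs;
  return strcmp(*a, *b);
}

// Compare two lines by their integer value.
// ASSUMPTION: each line is a base-10 integer that fits in long long.
static int compare_numeric(const void* lhs, const void* rhs) {
  const char* const* a = lhs;
  const char* const* b = rhs;
  const long long x = strtoll(*a, NULL, 10);
  const long long y = strtoll(*b, NULL, 10);
  return (x > y) - (x < y);
}

// Sort an array of lines in place with the given comparison function.
static void sort_lines(const char* lines[],
                       size_t count,
                       int (*compare)(const void*, const void*)) {
  assert(lines != NULL || count == 0);
  if (count > 0) {
    qsort(lines, count, sizeof lines[0], compare);
  }
}
```

The numeric comparator returns `(x > y) - (x < y)` rather than `x - y` so that the result cannot overflow.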

Example

Create the input data by writing the size in bytes of each PNG test file to a text file, one size per line (on macOS, use gfind instead of find):

find ./test-data/png -name '*.png' -printf '%s\n' > test-data-sizes.txt

Sort lines as numbers

./build/debug/tools/sort numeric ./test-data-sizes.txt

stdout:

Sort lines as ASCII strings

./build/debug/tools/sort ascii ./test-data-sizes.txt

stdout:

Sort lines as numbers in descending order

./build/debug/tools/sort numeric --reverse ./test-data-sizes.txt

stdout:

svm

source

./build/debug/tools/svm experiment dataset_dir [-v]

spambase

linear SVM
all features min-max rescaled to [-1, 1]
random train-test split with 2000 test samples
learning rate 1e-9
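
The min-max rescaling listed above maps each feature value x to 2 * (x - min) / (max - min) - 1, where min and max are taken over that feature. A minimal sketch for one feature column follows. `rescale_feature` is a hypothetical name, and mapping a constant feature (max equal to min) to 0 is an assumption, not something this readme specifies.

```c
#include <assert.h>
#include <stddef.h>

// Rescale one feature column in place to the range [-1, 1].
static void rescale_feature(double values[], size_t count) {
  assert(values != NULL || count == 0);
  if (count == 0) {
    return;
  }
  // Find the smallest and largest value of the feature.
  double lo = values[0];
  double hi = values[0];
  for (size_t i = 1; i < count; ++i) {
    if (values[i] < lo) {
      lo = values[i];
    }
    if (values[i] > hi) {
      hi = values[i];
    }
  }
  for (size_t i = 0; i < count; ++i) {
    // ASSUMPTION: a constant feature carries no information and maps to 0.
    values[i] = hi > lo ? 2.0 * (values[i] - lo) / (hi - lo) - 1.0 : 0.0;
  }
}
```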

./build/debug/tools/svm spambase out -v 2>&1 | jq .

output

{
  "level": "info",
  "file": "src/tools/svm.c",
  "line": 20,
  "msg": "training SVM on spambase dataset from 'out'"
}
{
  "level": "info",
  "file": "src/tools/svm.c",
  "line": 94,
  "msg": "spambase dataset, random train set, linear SVM"
}
{
  "tp": 854,
  "tn": 1459,
  "fp": 210,
  "fn": 78,
  "accuracy": 0.889,
  "precision": 0.803,
  "recall": 0.916,
  "f1_score": 0.856
}
{
  "level": "info",
  "file": "src/tools/svm.c",
  "line": 105,
  "msg": "spambase dataset, random test set, linear SVM"
}
{
  "tp": 605,
  "tn": 1197,
  "fp": 144,
  "fn": 54,
  "accuracy": 0.901,
  "precision": 0.808,
  "recall": 0.918,
  "f1_score": 0.859
}
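
The reported metrics follow from the confusion-matrix counts by the standard definitions: accuracy = (tp + tn) / total, precision = tp / (tp + fp), recall = tp / (tp + fn), and F1 is the harmonic mean of precision and recall. A minimal sketch for checking the numbers above follows. The type and function names are hypothetical, and returning 0 for a zero denominator is an assumption.

```c
#include <assert.h>

// Hypothetical types for the counts and the derived metrics.
struct confusion {
  unsigned long tp;
  unsigned long tn;
  unsigned long fp;
  unsigned long fn;
};

struct metrics {
  double accuracy;
  double precision;
  double recall;
  double f1_score;
};

// ASSUMPTION: a ratio with a zero denominator is reported as 0.
static double safe_ratio(double numerator, double denominator) {
  return denominator > 0.0 ? numerator / denominator : 0.0;
}

static struct metrics compute_metrics(struct confusion c) {
  assert(c.tp + c.tn + c.fp + c.fn > 0);
  const double tp = (double)c.tp;
  const double tn = (double)c.tn;
  const double fp = (double)c.fp;
  const double fn = (double)c.fn;
  struct metrics m;
  m.accuracy = safe_ratio(tp + tn, tp + tn + fp + fn);
  m.precision = safe_ratio(tp, tp + fp);
  m.recall = safe_ratio(tp, tp + fn);
  // Harmonic mean of precision and recall.
  m.f1_score = safe_ratio(2.0 * m.precision * m.recall,
                          m.precision + m.recall);
  return m;
}
```

For the test-set counts above (tp 605, tn 1197, fp 144, fn 54), this reproduces the reported values after rounding to three decimals.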

txt

source

Usage

./build/debug/tools/txt concat path [paths...]
./build/debug/tools/txt count pattern path
./build/debug/tools/txt slicelines begin count path
./build/debug/tools/txt replace pattern replacement path
./build/debug/tools/txt linefreq path

Examples

Concatenate

./build/debug/tools/txt concat ./test-data/txt/wikipedia/water_{ja,is,hi,zh}.txt

stdout:

水（みず、（英: water、他言語呼称は「他言語での呼称」の項を参照）とは、化学式 H2O で表される、水素と酸素の化合物である。日本語においては特に湯と対比して用いられ、液体ではあるが温度が低く、かつ凝固して氷にはなっていない物を言う。また、液状の物全般を指す。
Vatn er ólífrænn lyktar-, bragð- og nær litlaus vökvi sem er lífsnauðsynlegur öllum þekktum lífverum, þrátt fyrir að gefa þeim hvorki fæðu, orku né næringarefni. Vatnssameindin er samsett úr tveimur vetnisfrumeindum og einni súrefnisfrumeind sem tengjast með samgildistengi og hefur efnaformúluna H2O. Vatn er uppistaðan í vatnshvolfi jarðar. Orðið „vatn“ á við um efnið eins og það kemur fyrir við staðalhita og staðalþrýsting.
जल या पानी एक आम रासायनिक पदार्थ है जिसका अणु दो हाइड्रोजन परमाणु और एक प्राणवायु परमाणु से बना है - H2O। यह सारे प्राणियों के जीवन का आधार है। आमतौर पर जल शब्द का प्रयोग द्रव अवस्था के लिए उपयोग में लाया जाता है पर यह ठोस अवस्था (बर्फ) और गैसीय अवस्था (भाप या जल वाष्प) में भी पाया जाता है। पानी जल-आत्मीय सतहों पर तरल-क्रिस्टल के रूप में भी पाया जाता है।
水是地球上最常见的物质之一，是由氢、氧两种元素經過化學反應後组成的无机化合物（分子式：H2O），在常温常压下为无色无味的透明液体。

Count pattern occurrence

./build/debug/tools/txt count 'struct' src/tools/txt.c
./build/debug/tools/txt count '##' README.md
./build/debug/tools/txt count 'ある' README.md
./build/debug/tools/txt count 'ið' README.md

stdout:

Slice lines

./build/debug/tools/txt slicelines 315 10 ./src/tools/txt.c

stdout:

int main(int argc, char* const argv[argc + 1]) {
  struct sl_args args = {.argc = argc, .argv = argv};
  bool ok = false;
  char* command = sl_args_get_positional(&args, 0);
  if (command) {
    if (strcmp(command, "concat") == 0) {
      ok = concat(&args);
    } else if (strcmp(command, "count") == 0) {
      ok = count(&args);
    } else if (strcmp(command, "slicelines") == 0) {

Replace pattern

./build/debug/tools/txt replace '水' water ./test-data/txt/wikipedia/water_ja.txt

stdout:

water（みず、（英: water、他言語呼称は「他言語での呼称」の項を参照）とは、化学式 H2O で表される、water素と酸素の化合物である。日本語においては特に湯と対比して用いられ、液体ではあるが温度が低く、かつ凝固して氷にはなっていない物を言う。また、液状の物全般を指す。
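
Because UTF-8 is self-synchronizing, a byte-wise search for a valid UTF-8 pattern only matches at character boundaries, which is why replacing 水 also turns 水素 into water素 in the output above. The following is a minimal sketch of such a replacement on NUL-terminated strings. `replace_all` is a hypothetical name, and the tool's source may work differently. Since C strings cannot hold embedded NUL bytes, this sketch would not cover a pattern such as `0x00`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Return a newly allocated copy of text in which every non-overlapping
// occurrence of pattern is replaced, scanning from left to right.
// The caller frees the result. Returns NULL if allocation fails.
static char* replace_all(const char* text,
                         const char* pattern,
                         const char* replacement) {
  assert(text != NULL && pattern != NULL && replacement != NULL);
  const size_t text_len = strlen(text);
  const size_t pattern_len = strlen(pattern);
  const size_t replacement_len = strlen(replacement);

  // First pass: count matches so the output size is known up front.
  size_t matches = 0;
  if (pattern_len > 0) {
    for (const char* hit = strstr(text, pattern); hit != NULL;
         hit = strstr(hit + pattern_len, pattern)) {
      ++matches;
    }
  }
  const size_t out_len =
      text_len - matches * pattern_len + matches * replacement_len;
  char* out = malloc(out_len + 1);
  if (out == NULL) {
    return NULL;
  }

  // Second pass: copy the text before each match, then the replacement.
  char* dst = out;
  const char* src = text;
  for (size_t i = 0; i < matches; ++i) {
    const char* hit = strstr(src, pattern);
    const size_t prefix_len = (size_t)(hit - src);
    memcpy(dst, src, prefix_len);
    dst += prefix_len;
    memcpy(dst, replacement, replacement_len);
    dst += replacement_len;
    src = hit + pattern_len;
  }
  // Copy the remainder, including the terminating NUL.
  strcpy(dst, src);
  return out;
}
```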

Combine commands by using `/dev/stdin` as input path

Run the preprocessor on a source file and list the 25 most common lines

clang-18 -std=c23 -E -I./include ./src/tools/txt.c \
  | ./build/debug/tools/txt replace '  ' '' /dev/stdin \
  | ./build/debug/tools/txt replace $'\n ' $'\n' /dev/stdin \
  | ./build/debug/tools/txt linefreq /dev/stdin \
  | ./build/debug/tools/sort numeric --reverse /dev/stdin \
  | ./build/debug/tools/txt slicelines 0 25 /dev/stdin

stdout:

249 }
19 };
18 __attribute__ ((__const__));
17 {
13 goto done;
11 } break;
10 __extension__
9 return false;
9 return dst;
9 __attribute__ ((__nothrow__ )) __attribute__ ((__pure__)) __attribute__ ((__nonnull__ (1, 2)));
7 for (size_t i = 0; i < n; ++i) {
6 } else {
6 __attribute__ ((__nothrow__ )) __attribute__ ((__nonnull__ (1)));
6 if (0x80 <= byte && byte <= 0xbf) {
6 done:
6 # 1 "/usr/include/aarch64-linux-gnu/bits/libc-header-start.h" 1 3 4
5 struct sl_string content = sl_string_from_file(path);
5 bool is_done = false;
5 # 1 "/usr/include/aarch64-linux-gnu/bits/wordsize.h" 1 3 4
5 const int args_count = sl_args_count_positional(args) - 1;
5 # 1 "/usr/include/assert.h" 1 3 4
5 ;
5 is_done = true;
5 sl_string_destroy(&content);
5 "\n"

Format `NUL`-separated metadata fields in a PNG `tEXt` block

./build/debug/tools/png dump_raw ./docs/img/tokyo.png tEXt \
  | ./build/debug/tools/txt replace date: $'\n'date= /dev/stdin \
  | ./build/debug/tools/txt replace exif: $'\n'exif= /dev/stdin \
  | ./build/debug/tools/txt replace 0x00 ': ' /dev/stdin \
  && echo

stdout:


date=create: 2023-01-23T21:22:19+00:00
date=modify: 2023-01-23T21:22:19+00:00
exif=ColorSpace: 1
exif=ComponentsConfiguration: 1, 2, 3, 0
exif=ExifOffset: 90
exif=ExifVersion: 48, 50, 50, 49
exif=FlashPixVersion: 48, 49, 48, 48
exif=PixelXDimension: 2100
exif=PixelYDimension: 2100
exif=SceneCaptureType: 0
exif=YCbCrPositioning: 1: :
