icubaby

A C++ Baby Library to Immediately Convert Unicode. icubaby is a portable, header-only, dependency-free library for C++ 17 or later. It is fast, minimal, and easy to use for converting sequences of text between any of the Unicode UTF encodings. It does not allocate dynamic memory and neither throws nor catches exceptions.

icubaby is in no way related to the International Components for Unicode library and has no connection to any Intensive Care Unit!

Status

Category	Badges
License
Continuous Integration
Static Analysis
Runtime Analysis
OpenSSF

Introduction

C++ 17 deprecated the standard library's <codecvt> header file, which contained its Unicode conversion facets. Those features weren’t easy to use correctly, but without them code is forced to look to other libraries. icubaby is one such library: it fulfills the role of converting between the Unicode encodings. It is simple to use and exceptionally simple to integrate into a project.

The library offers an API which converts to and from the UTF-8, UTF-16, and UTF-32 encodings. It can also consume a byte stream in which an optional byte order mark at the start of the stream identifies both the source encoding and the byte order.

Installation

icubaby is entirely contained within a single header file. Installation can be as simple as copying that file (include/icubaby/icubaby.hpp) into your project. It has no dependencies and self-configures to your environment.

Usage

Check out the project documentation: https://paulhuggett-icubaby.readthedocs.io/en

icubaby uses four different types to express the different Unicode encodings that it supports:

Type	Meaning
`std::byte`	Encoding and byte order are determined by the stream's byte order mark
`icubaby::char8`	UTF-8. `icubaby::char8` is defined as `char8_t` when the native type is available and `char` otherwise
`char16_t`	UTF-16 host-native endian
`char32_t`	UTF-32 host-native endian
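
The `icubaby::char8` row describes a conditional alias. One way such an alias could be written is sketched here, using the standard `__cpp_char8_t` feature-test macro. This is an assumption about the general technique, not a copy of the library's source, which is why it lives in a separate, made-up `sketch` namespace:

```cpp
namespace sketch {  // hypothetical namespace, not part of icubaby

#if defined(__cpp_char8_t)
using char8 = char8_t;  // native UTF-8 code unit type (C++20 and later)
#else
using char8 = char;     // fallback when char8_t is unavailable (e.g. C++17)
#endif

}  // namespace sketch

// Either way, a UTF-8 code unit occupies exactly one byte.
static_assert(sizeof(sketch::char8) == 1, "char8 must be one byte wide");
```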

There are three ways to use the icubaby library depending on your needs:

C++ 20 range adaptor
Output Iterator interface
Converting one code-unit at a time

1. C++ 20 Range Adaptor

C++ 20 introduced the ranges library for composable and less error-prone interaction with iterators and containers. In icubaby, we can transform a range of input values from one Unicode encoding to another using a single range adaptor:

auto const in = std::array{char32_t{0x1F600}};
auto r = in | icubaby::views::transcode<char32_t, char16_t>;
std::vector<char16_t> out;
std::ranges::copy(r, std::back_inserter(out));

This code converts a single Unicode code-point 😀 (U+1F600 GRINNING FACE) from UTF-32 to UTF-16 and will copy two UTF-16 code-units (0xD83D and 0xDE00) into the out vector.
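
The two code units quoted above follow from the UTF-16 surrogate-pair arithmetic. A minimal sketch of that calculation is given here for reference. It is independent of icubaby, and the helper name `append_utf16` is made up for illustration:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not icubaby's API): appends the UTF-16 encoding of a
// single Unicode code point to `out`.
inline void append_utf16(char32_t cp, std::vector<char16_t>& out) {
  // Precondition of this sketch: cp is a Unicode scalar value.
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  if (cp < 0x10000) {
    out.push_back(static_cast<char16_t>(cp));  // BMP: one code unit
    return;
  }
  char32_t const v = cp - 0x10000;  // 20 significant bits remain
  out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate: top 10 bits
  out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate: bottom 10 bits
}
```

For U+1F600, `v` is 0xF600, whose top ten bits are 0x3D and bottom ten bits are 0x200, giving 0xD83D and 0xDE00.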

auto const in = std::array{std::byte{0xFE}, std::byte{0xFF}, std::byte{0x00},
                           std::byte{'A'},  std::byte{0x00}, std::byte{'b'}};
auto r = in | icubaby::views::transcode<std::byte, icubaby::char8>;
std::vector<icubaby::char8> out;
std::ranges::copy(r, std::back_inserter(out));

This snippet converts “Ab” (U+0041 LATIN CAPITAL LETTER A, U+0062 LATIN SMALL LETTER B) from big-endian UTF-16 to UTF-8.
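
The leading bytes 0xFE 0xFF in that input are the byte order mark that selects big-endian UTF-16. As a rough illustration of the sniffing step, here is a sketch that classifies the start of a byte stream. The names `bom_encoding` and `detect_bom` are hypothetical, and this is not how icubaby itself is written:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names for this sketch; none of this is icubaby's API.
enum class bom_encoding { none, utf8, utf16be, utf16le, utf32be, utf32le };

inline bom_encoding detect_bom(std::vector<std::byte> const& in) {
  // Returns the byte at index i, or 0x100 (which matches nothing) past the end.
  auto const at = [&in](std::size_t i) -> unsigned {
    return i < in.size() ? std::to_integer<unsigned>(in[i]) : 0x100U;
  };
  // The UTF-32 LE mark begins with the UTF-16 LE mark, so it is tested first.
  // (This sketch therefore reads FF FE 00 00 as UTF-32 LE.)
  if (at(0) == 0xFF && at(1) == 0xFE && at(2) == 0x00 && at(3) == 0x00) return bom_encoding::utf32le;
  if (at(0) == 0x00 && at(1) == 0x00 && at(2) == 0xFE && at(3) == 0xFF) return bom_encoding::utf32be;
  if (at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF) return bom_encoding::utf8;
  if (at(0) == 0xFE && at(1) == 0xFF) return bom_encoding::utf16be;
  if (at(0) == 0xFF && at(1) == 0xFE) return bom_encoding::utf16le;
  return bom_encoding::none;
}

// Usage: the input from the example above starts with FE FF.
inline void detect_bom_example() {
  std::vector<std::byte> const in{std::byte{0xFE}, std::byte{0xFF}, std::byte{0x00},
                                  std::byte{'A'},  std::byte{0x00}, std::byte{'b'}};
  assert(detect_bom(in) == bom_encoding::utf16be);
}
```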

See the C++20 Range Adaptor documentation for more details.

2. The Output Iterator Interface

auto const in = std::vector{char8_t{0xF0}, char8_t{0x9F}, char8_t{0x98}, char8_t{0x80}};
std::vector<char16_t> out;
icubaby::t8_16 t;
auto it = icubaby::iterator{&t, std::back_inserter (out)};
for (auto cu: in) {
  *(it++) = cu;
}
it = t.end_cp (it);

The icubaby::iterator<> class offers a familiar output iterator for using a transcoder. Each code unit from the input encoding is written to the iterator, which in turn writes the output encoding to a second iterator. This enables us to use standard algorithms such as std::copy with the library.
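
To show the general shape of such an adaptor, here is a self-contained sketch that does not use icubaby at all. `toy_32_to_16` is a stand-in transcoder and `feeding_iterator` a stand-in adaptor. Both names are invented for this example and make no claim about how icubaby::iterator is implemented:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Toy stand-in for a transcoder (hypothetical): UTF-32 in, UTF-16 out.
struct toy_32_to_16 {
  template <typename OutputIt>
  OutputIt operator()(char32_t cp, OutputIt out) const {
    assert(cp <= 0x10FFFF);  // precondition of this sketch
    if (cp < 0x10000) {
      *out++ = static_cast<char16_t>(cp);
      return out;
    }
    char32_t const v = cp - 0x10000;
    *out++ = static_cast<char16_t>(0xD800 + (v >> 10));
    *out++ = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    return out;
  }
};

// Hypothetical output iterator: assigning a code unit feeds the transcoder,
// which writes its result to the wrapped downstream iterator.
template <typename Transcoder, typename OutputIt>
class feeding_iterator {
public:
  using iterator_category = std::output_iterator_tag;
  using value_type = void;
  using difference_type = std::ptrdiff_t;
  using pointer = void;
  using reference = void;

  feeding_iterator(Transcoder* t, OutputIt out) : t_{t}, out_{out} {}

  feeding_iterator& operator=(char32_t cu) {
    out_ = (*t_)(cu, out_);
    return *this;
  }
  // Dereference and increment are no-ops, as with std::back_insert_iterator.
  feeding_iterator& operator*() { return *this; }
  feeding_iterator& operator++() { return *this; }
  feeding_iterator operator++(int) { return *this; }

  OutputIt base() const { return out_; }

private:
  Transcoder* t_;  // not owned; must outlive the iterator
  OutputIt out_;
};
```

With these definitions, `std::copy(in.begin(), in.end(), feeding_iterator{&t, std::back_inserter(out)})` pushes every element of a `char32_t` range through `t`. The adaptor keeps only a pointer to the transcoder so that copies of the iterator, which standard algorithms make freely, share one conversion state.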

3. Converting One Code-Unit at a Time

Let’s try converting a single Unicode emoji character 😀 (U+1F600 GRINNING FACE) expressed as four UTF-8 code units (0xF0, 0x9F, 0x98, 0x80) to UTF-16 (where it is the surrogate pair 0xD83D, 0xDE00).

std::vector<char16_t> out;
auto it = std::back_inserter (out);
icubaby::t8_16 t;
for (auto cu: {0xF0, 0x9F, 0x98, 0x80}) {
  it = t (static_cast<icubaby::char8> (cu), it); // make the int-to-code-unit conversion explicit
}
it = t.end_cp (it);

The out vector will contain two UTF-16 code units, 0xD83D and 0xDE00. See the explicit conversion documentation for more details.
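
The call pattern above (feed one code unit, then signal the end of input) implies that the transcoder carries state between calls. The following toy UTF-8 decoder sketches that idea under stated assumptions: it is not icubaby's implementation, the name `toy_utf8_decoder` is invented, emitting U+FFFD for malformed input is a choice made for this sketch, and overlong forms, encoded surrogates, and values above U+10FFFF are deliberately not rejected.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

constexpr char32_t replacement_char = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Hypothetical decoder for illustration only.
class toy_utf8_decoder {
public:
  // Consumes one UTF-8 code unit; writes a code point once one is complete.
  template <typename OutputIt>
  OutputIt operator()(std::uint8_t cu, OutputIt out) {
    if (pending_ == 0) {
      if (cu < 0x80) {
        *out++ = static_cast<char32_t>(cu);  // ASCII: complete immediately
      } else if ((cu & 0xE0) == 0xC0) {
        start(cu & 0x1F, 1);  // lead byte of a 2-unit sequence
      } else if ((cu & 0xF0) == 0xE0) {
        start(cu & 0x0F, 2);  // lead byte of a 3-unit sequence
      } else if ((cu & 0xF8) == 0xF0) {
        start(cu & 0x07, 3);  // lead byte of a 4-unit sequence
      } else {
        *out++ = replacement_char;  // stray continuation or invalid lead byte
      }
      return out;
    }
    assert(pending_ <= 3);
    if ((cu & 0xC0) != 0x80) {
      pending_ = 0;  // sequence cut short; this sketch drops the offending unit
      *out++ = replacement_char;
      return out;
    }
    cp_ = (cp_ << 6) | static_cast<char32_t>(cu & 0x3F);  // append 6 payload bits
    if (--pending_ == 0) {
      *out++ = cp_;
    }
    return out;
  }

  // Call after the last code unit: flags a sequence that was left incomplete.
  template <typename OutputIt>
  OutputIt end_cp(OutputIt out) {
    if (pending_ != 0) {
      pending_ = 0;
      *out++ = replacement_char;
    }
    return out;
  }

private:
  void start(int payload, unsigned continuation_units) {
    cp_ = static_cast<char32_t>(payload);
    pending_ = continuation_units;
  }
  char32_t cp_ = 0;       // code point accumulated so far
  unsigned pending_ = 0;  // continuation units still expected
};
```

Feeding 0xF0, 0x9F, 0x98, 0x80 to this decoder produces nothing for the first three calls and U+1F600 on the fourth. Without the final `end_cp` call, input that stops in the middle of a sequence would go unnoticed.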
