Add transliteration option. #219

Open: wants to merge 25 commits into base: master

Commits
9acd8cd
Preliminary work to support UTF-32.
kshetline Jun 24, 2019
ace3618
Finish off general UTF-32 (auto LE or BE), and add UCS-4 aliases.
kshetline Jun 24, 2019
f0f9624
Fix typo in unit test.
kshetline Jun 24, 2019
8d104ad
Fix uses of Buffer.from() that caused compatibility problems with old…
kshetline Jun 24, 2019
25e0413
Updated README.md to include UTF-32 options.
kshetline Jun 24, 2019
9b84cb3
Get rid of package-lock.json.
kshetline Jun 25, 2019
325c0fe
Merging Utf32-LE and-BE codec into a single set of classes with an is…
kshetline Jun 25, 2019
56a4754
Added all-codepoint unit tests for UTF-32.
kshetline Jun 25, 2019
cfe04d6
Disable some unit tests on older versions of Node.
kshetline Jun 25, 2019
1f56e00
Fixes for working correctly with older versions of Node.
kshetline Jun 25, 2019
af9cff7
Add comparison to node-iconv, and possible speed improvement.
kshetline Jun 26, 2019
8e038b5
Merge branch 'master' of https://github.com/ashtuchkin/iconv-lite
kshetline Jun 26, 2019
7d6f955
Preliminary work for transliteration support.
kshetline Jun 28, 2019
a6f91b4
Changing computers check-in.
Jun 28, 2019
a550b78
First version of transliteration support.
kshetline Jun 29, 2019
f574282
Odd... my Node.js environment was fine with a "catch" without a varia…
kshetline Jun 29, 2019
7b13689
Update documentation for transliteration.
kshetline Jun 29, 2019
198d7e2
Much to my surprise, a regex global replace turns out to be much fast…
kshetline Jun 30, 2019
5797fb5
Add typings for transliteration. Add ability to deal with smart spaci…
kshetline Jul 1, 2019
bb0a2ea
Make extra unit testing for smart spacing and German dependent on whe…
kshetline Jul 1, 2019
84f01e4
Update with unidecode-plus 0.0.0-alpha.1.
kshetline Jul 1, 2019
cac5bc8
Update with new unidecode plus and related documentation.
Jul 1, 2019
c69bdb5
Update with unidecode-1.0.1.
kshetline Jul 2, 2019
6d2d25a
Transliteration in German mode now works with combining diaeresis. Ad…
kshetline Jul 13, 2019
4e4f4da
Transliteration in German mode now works with combining diaeresis. Ad…
kshetline Jul 13, 2019

4 changes: 3 additions & 1 deletion .gitignore
@@ -4,4 +4,6 @@ wiki
*sublime-*
coverage
/.idea
package-lock.json
/package-lock.json
/.npmrc
/.vscode/settings.json
1 change: 1 addition & 0 deletions .npmignore
@@ -5,3 +5,4 @@ test
wiki
coverage
.travis.yml
/.npmrc
39 changes: 35 additions & 4 deletions README.md
@@ -10,6 +10,7 @@
* In-browser usage via [Browserify](https://github.com/substack/node-browserify) (~180k gzip compressed with Buffer shim included).
* Typescript [type definition file](https://github.com/ashtuchkin/iconv-lite/blob/master/lib/index.d.ts) included.
* React Native is supported (need to explicitly `npm install` two more modules: `buffer` and `stream`).
* Transliteration option is available when either [unidecode-plus](https://www.npmjs.com/package/unidecode-plus) or [unidecode](https://www.npmjs.com/package/unidecode) is added to your project.
* License: MIT.

[![NPM Stats](https://nodei.co/npm/iconv-lite.png)](https://npmjs.org/package/iconv-lite/)
@@ -30,7 +31,10 @@ str = iconv.decode(Buffer.from([0x68, 0x65, 0x6c, 0x6c, 0x6f]), 'win1251');
buf = iconv.encode("Sample input string", 'win1251');

// Check if encoding is supported
iconv.encodingExists("us-ascii")
iconv.encodingExists("us-ascii");

// Convert from js string to an encoded buffer, keeping accented characters like "é", but transliterating Chinese.
buf2 = iconv.encode("Café 北京", 'iso-8859-1', { transliterate: true });
```

### Streaming API (Node v0.10+)
@@ -143,11 +147,38 @@ This library supports UTF-32LE, UTF-32BE and UTF-32 encodings. Like the UTF-16 e
* The default of UTF-32LE can be overridden with the `defaultEncoding: 'utf-32be'` option. Strips BOM unless `stripBOM: false`.
* Encoding: uses UTF-32LE and writes BOM by default. Use `addBOM: false` to override. (`defaultEncoding: 'utf-32be'` can also be used here to change encoding.)

## Transliteration

If either [unidecode-plus](https://www.npmjs.com/package/unidecode-plus) or [unidecode](https://www.npmjs.com/package/unidecode) is added to your project (`npm install unidecode-plus` or `npm install unidecode`), an option becomes available to transliterate characters that cannot be represented in the target encoding. Transliterations are always plain ASCII. Unlike using unidecode directly, which converts *all* non-ASCII characters, iconv transliterates only the characters that the target encoding lacks.

In this example:
```
buf = iconv.encode("Café 北京", 'iso-8859-1', { transliterate: true });
```
The output is `<Buffer 43 61 66 e9 20 42 65 69 20 4a 69 6e 67 20>`. Converted back into ISO-8859-1 text, this is "Café Bei Jing ", preserving the accented "é", and only transliterating the Chinese characters.
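
The selective behavior shown here can be sketched without iconv-lite or unidecode at all. What follows is a minimal illustration, not the code in this pull request: `TRANSLIT_TABLE` and `encodeLatin1WithTranslit` are hypothetical names, the two-entry table merely stands in for a real transliteration library, and the range check relies on ISO-8859-1 mapping code points U+0000 through U+00FF directly onto single bytes.

```javascript
// Hypothetical, hand-written table standing in for a transliteration library.
const TRANSLIT_TABLE = new Map([
  ['\u5317', 'Bei '],  // 北
  ['\u4eac', 'Jing '], // 京
]);

// ISO-8859-1 can represent exactly the code points 0x00..0xFF.
function isLatin1(ch) {
  return ch.codePointAt(0) <= 0xff;
}

// Keep representable characters as-is; transliterate only the rest.
// Characters missing from the table fall back to "?".
function encodeLatin1WithTranslit(input) {
  let text = '';
  for (const ch of input) { // for...of iterates by code point
    if (isLatin1(ch)) {
      text += ch;
    } else {
      text += TRANSLIT_TABLE.has(ch) ? TRANSLIT_TABLE.get(ch) : '?';
    }
  }
  return Buffer.from(text, 'latin1');
}

console.log(encodeLatin1WithTranslit('Caf\u00e9 \u5317\u4eac'));
// <Buffer 43 61 66 e9 20 42 65 69 20 4a 69 6e 67 20>
```

The "é" survives as the single byte `e9` because it exists in ISO-8859-1; only the two Chinese characters go through the table.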

Transliteration to a string instead of a buffer can also be done directly, like this:
```
str = iconv.transliterate("Café 北京", 'iso-8859-1');
```
When encoding to create a buffer, the node-iconv style of requesting transliteration can also be used:
```
buf = iconv.encode("Café 北京", 'iso-8859-1//translit');
```
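
One way such a suffix could be recognized is to split the encoding string on `//` and look for a `translit` flag. This is only a sketch under that assumption: `parseEncodingSpec` is a hypothetical helper, not a function exported by iconv-lite, and the match is case-insensitive because iconv-style flags are commonly written in upper case (`//TRANSLIT`).

```javascript
// Hypothetical helper: separates an encoding name from "//flag" suffixes.
function parseEncodingSpec(spec) {
  const [encoding, ...flags] = String(spec).split('//');
  const transliterate = flags.some((flag) => flag.toLowerCase() === 'translit');
  return { encoding, transliterate };
}

console.log(parseEncodingSpec('iso-8859-1//translit'));
// { encoding: 'iso-8859-1', transliterate: true }
```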

If you use `unidecode-plus` instead of `unidecode`, you get two additional transliteration options: `german`, and `smartSpacing`.

The `german` option transliterates `Ä`, `ä`, `Ö`, `ö`, `Ü`, and `ü` to `AE`, `ae`, `OE`, `oe`, `UE`, and `ue`, respectively, instead of just removing the umlauts.
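
The mapping can be illustrated in a few lines of plain JavaScript. This is a sketch of the idea only, not unidecode-plus's implementation; `GERMAN_UMLAUTS` and `transliterateGermanUmlauts` are hypothetical names. It assumes that normalizing to NFC first is an acceptable way to catch a base letter followed by a combining diaeresis (U+0308).

```javascript
// The six letters listed above, written as escapes so the source file's
// own normalization form cannot affect the lookup.
const GERMAN_UMLAUTS = {
  '\u00c4': 'AE', // Ä
  '\u00e4': 'ae', // ä
  '\u00d6': 'OE', // Ö
  '\u00f6': 'oe', // ö
  '\u00dc': 'UE', // Ü
  '\u00fc': 'ue', // ü
};

function transliterateGermanUmlauts(text) {
  // NFC folds "u" + U+0308 into the precomposed "ü" before the lookup.
  return text
    .normalize('NFC')
    .replace(/[\u00c4\u00e4\u00d6\u00f6\u00dc\u00fc]/g, (ch) => GERMAN_UMLAUTS[ch]);
}

console.log(transliterateGermanUmlauts('Mu\u0308ller')); // Mueller
```

Other characters, including "ß", pass through untouched here and would be left to the general transliteration step.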

The `smartSpacing` option improves the formatting of transliterated text by removing some unnecessary spaces and adding others for clarity. For example, "Café 北京, 鞋 size 10½" becomes "Cafe Bei Jing, Xie size 10 1/2" using `smartSpacing`. Without it, you get "Cafe Bei Jing , Xie size 101/2". (See the [unidecode-plus site](https://github.com/kshetline/unidecode-plus/blob/master/README.md) for more detail.)

Note that transliteration affects only encoding, not decoding.

## Other notes

When decoding, be sure to supply a Buffer to decode() method, otherwise [bad things usually happen](https://github.com/ashtuchkin/iconv-lite/wiki/Use-Buffers-when-decoding).
Untranslatable characters are set to � or ?. No transliteration is currently supported.
Node versions 0.10.31 and 0.11.13 are buggy, don't use them (see #65, #77).
* When decoding, be sure to supply a Buffer to decode() method, otherwise [bad things usually happen](https://github.com/ashtuchkin/iconv-lite/wiki/Use-Buffers-when-decoding).
* Untranslatable characters are set to � or ? unless using transliteration.
* Node versions 0.10.31 and 0.11.13 are buggy, don't use them (see #65, #77).

## Testing
