Update README.md
kirbyj committed May 15, 2020
1 parent 9448902 commit e3deafc
Showing 1 changed file with 22 additions and 21 deletions.
43 changes: 22 additions & 21 deletions README.md
@@ -1,6 +1,6 @@
# vPhon: a Vietnamese phonetizer

Package: vPhon version 2.0.0 (for Python 3)
Package: vPhon version 2.0.0

Author: James Kirby <[email protected]>

@@ -31,9 +31,9 @@ Segmental correspondences follow Thompson (1965: 98-103), Cao (1997: 126). vPhon

### Finals

By default, vPhon does not recognize final palatal segments [c ɲ], as their values may be predicted from the preceding vocalic segments. However, the `-p` flag causes palatal [c ɲ] codas to be output (Hoàng 1989: 172 *ff*.; cf. Cao 1998: 88-102).
By default, vPhon does not recognize final palatal segments [c ɲ], as their values may be predicted from the preceding vocalic segments. However, the `-p, --palatals` flag causes palatal [c ɲ] codas to be output (Hoàng 1989: 172 *ff*.; cf. Cao 1998: 88-102).

As of version 0.2.2, final labialized allophones of /ŋ k/ are represented as [ŋ͡m k͡p].
Final labialized allophones of /ŋ k/ are represented as [ŋ͡m k͡p].

### Tones

@@ -60,27 +60,31 @@ vPhon also provides an option (given the `-6` flag) to return integer values for
| sắc | 5 |
| nặng | 6 |

If passed the `-8` flag, *sắc* and *nặng* tones in closed syllables are returned as 7 and 8, respectively (Cao 1998; Michaud 2004; Phạm 2001). Note that these were returned as 5b and 6b in vPhon v.1.0.0 and earlier.
If passed the `-8` flag, *sắc* and *nặng* tones in closed syllables are returned as 7 and 8, respectively. These were returned as 5b and 6b in vPhon v.1.0.0 and earlier.

Note that for the Central and Southern dialects, the relationship of tone to number is slightly different. Orthographic *hỏi* and *ngã* tones are both phonetized as 4 when vPhon is passed the `-6` or `-8` flag, reflecting the phonological merger of these tones in those dialects (Hoàng 1989: 212 *ff*.).
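
The numbering described in this section can be sketched as a small lookup. This is an illustration only and is not taken from vPhon's source: the function `tone_number`, the dialect codes `"N"`, `"C"`, and `"S"`, and the use of orthographic stop codas (`p`, `t`, `c`, `ch`) to detect closed syllables are all assumptions, and only the tone values stated in this section are included.

```python
# Hypothetical sketch of the tone numbering; not vPhon's actual code.
SIX_TONE = {"sắc": 5, "nặng": 6}    # values stated in the table
STOP_CODAS = {"p", "t", "c", "ch"}  # assumed test for a closed syllable
MERGING_DIALECTS = {"C", "S"}       # assumed codes for Central and Southern


def tone_number(tone, coda="", eight=False, dialect="N"):
    """Return an integer tone label, or None for tones not covered here."""
    # hỏi and ngã merge to 4 in the Central and Southern dialects.
    if dialect in MERGING_DIALECTS and tone in ("hỏi", "ngã"):
        return 4
    number = SIX_TONE.get(tone)
    # With eight-tone numbering, sắc and nặng in closed syllables become 7 and 8.
    if eight and number in (5, 6) and coda in STOP_CODAS:
        return number + 2
    return number
```

Under these assumptions, `tone_number("sắc", coda="t", eight=True)` gives 7, while the same call without `eight=True` gives 5.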

## Installation

No installation is required. You must have a working version of Python 3 installed and in your path. vPhon has been tested only with Python 3.7.1. It requires the `sys`, `string`, `re`, `io`, and `argparse` modules, all of which should come standard with Python >= 3.5.

If you need to use vPhon with Python 2, see [v1.0.0](https://github.com/kirbyj/vPhon/releases/tag/v1.0).
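
To confirm that an interpreter meets the requirements listed in this section, a check along the following lines may help. It is a minimal sketch rather than part of vPhon, and the name `environment_ok` is made up for illustration.

```python
import importlib
import sys

# Modules and minimum version named in the Installation notes.
REQUIRED_MODULES = ("sys", "string", "re", "io", "argparse")
MIN_VERSION = (3, 5)


def environment_ok():
    """Return True if this interpreter looks able to run vPhon."""
    if sys.version_info < MIN_VERSION:
        return False
    for name in REQUIRED_MODULES:
        try:
            importlib.import_module(name)
        except ImportError:
            return False
    return True
```

Calling `environment_ok()` with the same `python` that will run `vPhon.py` should return `True` on any standard Python 3.5 or later.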

## Usage

From v2.0.0, vPhon does not take any obligatory arguments. If a `-d, --dialect` option is not specified, it defaults to using the standard (Northern) dialect correspondence set. The correspondence files may be found in the `Rules/` directory.

If no argument is supplied on the command line, vPhon will enter an interactive mode allowing you to enter UTF-8 Vietnamese orthography on the command line. When you are done, send `EOF` (Ctrl-D) to get the output. By default, output is sent to STDOUT.

Otherwise, you can send vPhon a stream of UTF-8 text to be phonetized. If you have a file called `tuoi.txt`, for example, and want to create Southern-dialect IPA from it, either of the following will work:
vPhon was designed to work in a manner similar to that of Unix command line utilities, and as such it expects to receive UTF-8 text via STDIN. If you have a file called `tuoi.txt`, for example, and want to create Southern-dialect IPA from it, either of the following will work:

```
[user@terminal]$ python vPhon.py -d S < tuoi.txt
[user@terminal]$ cat tuoi.txt | python vPhon.py --dialect Southern
[user@terminal]$ cat tuoi.txt | python vPhon.py --dialect s
```

If no input is provided, vPhon will enter an interactive mode allowing you to enter UTF-8 Vietnamese orthography on the command line. When you are done, send `EOF` (Ctrl-D) to get the output. By default, output is sent to STDOUT.
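
Because vPhon reads from STDIN and writes to STDOUT, it can also be driven from another Python script. The sketch below assumes that `vPhon.py` is in the current working directory; `build_command` and `phonetize` are hypothetical helper names, and only the `-d` option shown in the examples is used.

```python
import subprocess
import sys


def build_command(script="vPhon.py", dialect=None, extra=()):
    """Build the argument list for running vPhon with this interpreter."""
    cmd = [sys.executable, script]
    if dialect is not None:
        cmd += ["-d", dialect]
    cmd += list(extra)
    return cmd


def phonetize(text, **kwargs):
    """Send UTF-8 text to vPhon on STDIN and return what it prints."""
    result = subprocess.run(
        build_command(**kwargs),
        input=text,
        stdout=subprocess.PIPE,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if vPhon exits with an error
    )
    return result.stdout
```

For example, `phonetize(open("tuoi.txt", encoding="utf-8").read(), dialect="S")` should be equivalent to the first shell command, provided the script path assumption holds.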

## Options

The full list of options can be seen by using the `-h, --help` flag:

```
@@ -110,22 +114,21 @@ optional arguments:
The `--tokenize` flag is useful if you are processing an older source in which morphemes are separated by hyphens, and you wish to retain the hyphens in your output, or if you are processing the output of e.g. [vnTokenizer](http://mim.hus.vnu.edu.vn/phuonglh/softwares/vnTokenizer):

```
[user@terminal]$ python vPhon.py -d N -t test/tokenized.txt
[user@terminal]$ python vPhon.py -t test/tokenized.txt
căw24 oŋ͡m33_ta3 kuŋ͡m35g viən33 cɯə33 biət45
```

The `--delimit` flag will produce output in which each phonetic symbol is separated by a user-specified delimiter. If you use this flag, you must also specify a delimiter, e.g.

```
[user@terminal]$ python vPhon.py -m . -8 -d N < test/tokenized.txt
[user@terminal]$ python vPhon.py -m . -8 < test/tokenized.txt
.c.ă.w.5. [ông_ta] .k.u.ŋ.3. .v.iə.n.1. .c.ɯə.1. .b.iə.t.5b.
```


The `--ortho` flag will output the orthographic input followed by a user-specified delimiter, followed by the phonetized output. If you use this flag, you must also specify a delimiter, e.g.

```
[user@terminal]$ python vPhon.py -d N -o , < test/wordlist-top.txt
[user@terminal]$ python vPhon.py -o , < test/wordlist-top.txt
a dua,a33 zuə33
a ha,a33 ha33
a hoàn,a33 hwan32
@@ -134,7 +137,7 @@ a-lô,[a-lô]
```

```
[user@terminal]$ python vPhon.py -d N -o $'\n' < test/wordlist-top.txt
[user@terminal]$ python vPhon.py -o $'\n' < test/wordlist-top.txt
a dua
a33 zuə33
a ha
@@ -154,23 +157,21 @@ All non-alphanumeric characters in the input are stripped prior to processing (u
Any input containing non-Vietnamese orthography, or any series of characters not conforming to Vietnamese phonotactics, will be enclosed in square brackets in the output, e.g.

```
[These] [are] [not] [licit] [words] [20mi] [10-15km] [etc]
[user@terminal]$ cat test/tuoi.txt | python vPhon.py
[tt] [hacao] [linux] la32 mot21 he21g diəw32 hɛŋ32 ɲɔ312 ɣɔn21g tok͡p45 do21g dɤ̆j32 du312 tiən21g ŋi33 miən35g fi24 ma35g ŋuən32 mɤ312 dɯək21 taw21g za33 zɛŋ32 cɔ33 tɤ̆t45 ka312 mɔj21g ŋɯəj32
[hacao] [linux] kɔ24 tʰe312 căj21g cɯk21 tiəp45 cen33 [cd] ma32 xoŋ͡m33 lam32 ɛŋ312 hɯəŋ312 den24 he21g tʰoŋ͡m24 măj24 tiŋ24 hiən21g taj21g ŋwaj32 za33 [hacao] [linux] kɔn32 kɔ24 tʰe312 kaj32 dăt21 vaw32 kak45 tʰɛŋ33 ɲɤ24 [usb] hăj33 kaj32 tʰăŋ312 vaw32 o312 kɯŋ24 măj24 tiŋ24
hiən21g năj33 [hacao] [linux] kɔ24 haj33 fiən33 ban312
fiən33 ban312 ɲɔ312 ɣɔn21g tok͡p45 do21g zɛŋ32 cɔ33 ɲɯŋ35g ŋɯəj32 iəw33 tʰik45 sɯ21g ɲɔ312 ɣɔn21g [79,5mb]
```

You could then extract just these items, e.g.
You could also extract just these items, e.g.

```
[user@terminal]$ cat test/tuoi.txt | python vPhon.py | awk -F '[][]' '{for (i=2; i<=NF; i+=2) {printf "%s ", $i}; print ""}'
tt hacao linux
hacao linux cd hacao linux usb
hacao linux
79,5mb
174mb
windows cd hacao linux cd hacao linux
hacao linux cd
openoffice 2.03 stardict click see
word excel powerpoint windows
stardict logic web
```
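
The same extraction can be done in Python if `awk` is not available. This is a minimal sketch that assumes bracketed items never contain a nested `]`; the name `bracketed_items` is made up for illustration.

```python
import re

# Capture whatever sits between "[" and the next "]".
BRACKETED = re.compile(r"\[([^\]]*)\]")


def bracketed_items(line):
    """Return the bracketed (unphonetized) items found in one output line."""
    return BRACKETED.findall(line)
```

Applied to each line of vPhon's output, it returns a list such as `["tt", "hacao", "linux"]` for a line that starts with `[tt] [hacao] [linux]`, and an empty list for a line with no bracketed items.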

Try running the examples in the `test/` directory to get a better idea of this behavior.