Current Updates
===============
Scribblings on cihai dev.
Configuration
~~~~~~~~~~~~~
cihai can accept a custom configuration file on the command line with ``-c``:

.. code-block:: bash

    $ python -m cihai -c myconfig.yml

Your configuration file overrides the default settings. You can see
the default settings in the ``cihai`` package as ``config.yml``.
Developers may use ``dev/config.yml``. The TestCase uses
``test_config.yml``.
.. code-block:: bash
$ python -m cihai
This starts cihai with the normal configuration settings. A configuration
file may also be used.
.. code-block:: bash
$ python -m cihai -c dev/config.yml
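The override behaviour described above can be sketched as a recursive dictionary merge. This is only an illustration, assuming both YAML files have already been parsed into plain dictionaries; ``merge_config`` and the setting names are hypothetical and not part of cihai.

```python
import copy


def merge_config(defaults, overrides):
    """Return a new dict with ``overrides`` layered over ``defaults``.

    Nested dictionaries are merged recursively; any other value in
    ``overrides`` replaces the default outright.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical settings standing in for the parsed config.yml files.
default_settings = {
    'debug': False,
    'database': {'url': 'sqlite:///cihai.db', 'echo': False},
}
user_settings = {'database': {'echo': True}}

settings = merge_config(default_settings, user_settings)
# settings keeps 'debug' and 'database.url' from the defaults and takes
# 'database.echo' from the user file; default_settings is left unchanged.
```

Merging recursively means a custom file only needs to list the keys it changes.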
History of CJK libraries
========================
Unihan
------
Unihan, which is short for "Han Unification", is a standard published by the
Unicode Consortium for CJK ideographs (also interchangeably referred to as
"glyphs", "characters", "chars").
`Unihan's History`_ goes into greater detail on this. The first electronic
release was in July 1995 as `CJKXREF.TXT`_ (961 kB). The second release, which
resembles the formatting used in modern versions, was released in July 1996
with Unicode 2.0 as `Unihan-1.txt`_. By accident, the ``Unihan-1.txt``
(7.9MB) file was missing the final pieces after ``U+8BC1``; no corrected
version was made available. In May 1998, `Unihan-2.txt`_ was released with
Unicode 2.1.2.
Unicode, Inc. is the center of the universe for all glyphs. Those who study
Egyptian hieroglyphs, which are still mysterious, will find them covered in
Unicode block `U+13000..U+1342F`_.
.. _U+13000..U+1342F: http://en.wikipedia.org/wiki/Egyptian_Hieroglyphs_(Unicode_block)
.. _Unihan's History: http://www.unicode.org/reports/tr38/#History
.. _CJKXREF.TXT: http://www.unicode.org/Public/1.1-Update/CJKXREF.TXT
.. _Unihan-1.txt: http://www.unicode.org/Public/2.0-Update/Unihan-1.txt
.. _Unihan-2.txt: http://www.unicode.org/Public/2.1-Update/Unihan-2.txt
cjklib
------
`cjklib`_ is a major Python library created by Christoph Burgmer for Han
character research.
"Cjklib provides language routines related to Han characters (characters based
on Chinese characters named Hanzi, Kanji, Hanja and chu Han respectively) used
in writing of the Chinese, the Japanese, infrequently the Korean and formerly
the Vietnamese language(s). Functionality is included for character
pronunciations, radicals, glyph components, stroke decomposition and variant
information. Cjklib is implemented in Python."
.. _cjklib: https://code.google.com/p/cjklib/
Cihai Pre-alpha
---------------
Early iterations of Cihai focused on the external API first. Every data set
was to be a plugin.
The idea was that `Hanzi`_, a similar project in nodejs, could share a similar
API so that datasets could be universal. The potential would be to provide two
high-quality libraries for Python and node, which are extendable to new data
sets and reduce duplication.
However, it is better to first take the time to discover the variable nature
of datasets and how they interconnect.
Current
-------
The next iteration of cihai is to gain an understanding of:
- what different data sets look like, and how they return data
- whether there is commonality between all of them
- how their results can elicit deeper research and exploration of Chinese
characters
This is an exploration phase.
External API
============
Cihai Spec
----------
Both Cihai and Hanzi libraries can use a similar API.
- Reduce duplicated effort
- Provide a main, tested CJK library to Python and node
- Collaborate to ensure both projects have access to open data sets and
Chinese character techniques.
Larger charter:
- Workgroup to develop a specification for core, pluggable CJK library
across various programming languages.
- follow best practices.
- documentation
- unit tests / ci
- consistent with coding idioms / pragmas (pythonic / pocoo / Reitz,
connect / underscore / node)
- be available on package archives (npm, pypi).
- Across languages, core tools should have similar API method names,
creating instance of data retrieval object
- Extendable to new datasets as middleware.
- Documentation for creating a new middleware.
- Find more data sets and encourage data providers / data owners to use an
open data license.
- Find more libraries across various programming languages with a CJK tool.
- If project is a duplicate effort, notify that there is another
effort underway and they can participate.
- If project is a new tool:
- see if they have a dataset. If they do, check whether it is licensed
under ODC / ODbL.
- see if their library is BSD or MIT. If not, see if they're willing to
license it as such. *
- see if they are willing to use the Workgroup's API specification.
- If willing, but no time, offer to patch.
- If not interested at all, create an adapter for the project as a
separate effort.
* If the library is GPL, it can cause conflict down the road: if the
project author does not have the time / interest to adopt the
specification, even creating an adapter to their project could trigger
the GPL.
Licensing
---------
Core software
"""""""""""""
BSD or MIT. The Core apps should be BSD 3-clause to protect the name of
the app (Cihai or Hanzi).
Extensions / Contrib licensing
""""""""""""""""""""""""""""""
Middleware can be included in the project as officially supported.
Contrib and third party plugins can be available under BSD or MIT.
Data sets
"""""""""
Data for Chinese should be available under the most permissive license
possible.
What data is being accessed / looked up
---------------------------------------
How should data be looked up?
-----------------------------
I would like to try to encourage use of a single, simple hook,
``.get``.
After ``.get`` is used, the arguments may then be passed through
middleware classes / methods.
The same principle applies for ``.reverse`` matches.
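A minimal sketch of this single-hook idea follows, in which ``.get`` and ``.reverse`` share one dispatcher that passes the argument to every registered middleware. The ``Core`` and ``Decomposition`` classes and the toy data are hypothetical and only illustrate the shape of the hook; they are not the actual Cihai or Hanzi API.

```python
class Core(object):
    """Hypothetical core object exposing only ``get`` and ``reverse``."""

    def __init__(self, middleware):
        self.middleware = list(middleware)

    def _dispatch(self, method_name, request):
        """Call ``method_name`` on every middleware that implements it."""
        results = {}
        for mw in self.middleware:
            handler = getattr(mw, method_name, None)
            if callable(handler):
                results[mw.name] = handler(request)
        return results

    def get(self, request):
        return self._dispatch('get', request)

    def reverse(self, request):
        return self._dispatch('reverse', request)


class Decomposition(object):
    """Toy dataset mapping a character to its components."""

    name = 'decomposition'
    data = {'好': ['女', '子']}

    def get(self, char):
        return self.data.get(char)

    def reverse(self, component):
        return [c for c, parts in self.data.items() if component in parts]


core = Core([Decomposition()])
forward = core.get('好')       # {'decomposition': ['女', '子']}
backward = core.reverse('女')  # {'decomposition': ['好']}
```

Because both hooks go through the same dispatcher, a middleware that has no reverse index can simply omit ``reverse``.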
Chinese character
"""""""""""""""""
Currently, Hanzi uses:
.. code-block:: javascript

    hanzi.decompose('爱');
    // transition to:
    hanzi.get('爱');
    hanzi.reverse('爱'); // to look up any indices / decompositions / words
                         // where 爱 may match.
Currently cjklib uses:
.. code-block:: python
cjk.getStrokeOrder(u'说')
# transition to:
cjk.get('说')
.. code-block:: python
Cihai.get('好')
String of Chinese Characters
""""""""""""""""""""""""""""
Use ``.get`` too. This may seem problematic, but checking the
``.length`` or ``len()`` of the argument can suffice.
.. code-block:: javascript

    var decomposition = hanzi.decomposeMany('爱橄黃');
    // transition to
    var decomposition = hanzi.get('爱橄黃');
.. code-block:: python
Cihai.get('爱橄黃')
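The length check mentioned above could look like the sketch below. It assumes Python 3, where ``len()`` on a ``str`` counts code points rather than bytes. The function and the shape of the returned dictionary are hypothetical; a real implementation would hand off to middleware instead of returning a tagged dictionary.

```python
def get(text):
    """Route a lookup based on the length of the argument."""
    if len(text) == 0:
        raise ValueError('get() requires at least one character')
    if len(text) == 1:
        # Single character: look it up directly.
        return {'type': 'character', 'query': text}
    # Longer input: look up each character individually.
    return {
        'type': 'string',
        'query': text,
        'characters': [get(char) for char in text],
    }


single = get('爱')
many = get('爱橄黃')
```

Here ``single['type']`` is ``'character'``, while ``many['characters']`` holds one entry per character of the input string.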
How should data returned look? Schema.
--------------------------------------
Questions:
- Is there already an open standard that can be adopted?
- Should ``.get`` return a raw object / dict or a result object::

    c = c.get('你')  # return a ResultObject / Backbone.Model / mongoose
                     # document type of object.
    c.toJSON()  # backbone / sqlalchemy style
The data should follow the same schema. What would an API response for
these possibilities look like?
If something generic like ``.get()`` is entered, the results could draw on:
- character decomposition
- a unihan field ('kDefinition', 'kStrokes', 'kFrequency', ...)
- https://github.com/tsroten/zhon
- https://github.com/fxsjy/jieba
If ``.get`` is the only way to retrieve hits, more possibilities
exist.
For hanzi/node:
.. code-block:: javascript
results = hanzi.get('你好。怎么样?')
or for cihai/python:
.. code-block:: python
results = cihai.get('你好。怎么样?')
This may return hits from a jieba middleware (jieba doesn't exist in node yet)::
results.words = [
'你好',
'怎么样'
]
The user may then drill down further:
.. code-block:: python
for word in results.words:
print(cihai.get(word))
or
.. code-block:: javascript

    _.each(results.words, function(word) {
        console.log(hanzi.get(word));
    });
.. warning::

    If dictionaries / datasets are extensible, there may be collisions
    if they can reserve keys in the official result namespace.

    Two plugins could try to reserve ``.words`` as a name. Many
    dictionaries would want to reserve ``.definition`` as a name.
To counteract this, a namespace can be adopted for middleware, or the Core
can resolve the conflict:
1. Append underscore + number on conflict, etc.
(``c.definition_1``, ``c.definition_2``):
The first middleware using ``words`` gets ``results.words``. The
middleware called after it gets ``results.words_1``.
This is seen in `SQLAlchemy's labels`_ to `avoid label collisions`_.
2. Middleware / datasets use namespace with ``_``
(``c.unihan_kDefinition``):
Pros:
- iterable access via ``c.keys()`` in Python and ``for (var key in dict)``
in JS.
- all data returned can be accessed without nesting into dotted
namespaces.
Cons:
- ``result.unihan_kDefinition_these_things_getlong``
- extension name and word separation can be confused.
3. Middleware may use dot namespace (``c.unihan.kDefinition``)
Pros:
- Internal Core API is far simpler and lighter
- Easier to look at
- More common practice, `aws_cli`_.
- Middleware is a package module, symbolically ``.``'s are used to
separate modules and packages (java, python, informally in JS).
.. _SQLAlchemy's labels: https://github.com/zzzeek/sqlalchemy/blob/347e89044ce53ef0ec8d07937cd8279e9c4e5226/lib/sqlalchemy/sql/elements.py#L2393
.. _avoid label collisions: https://github.com/zzzeek/sqlalchemy/blob/347e89044ce53ef0ec8d07937cd8279e9c4e5226/test/sql/test_compiler.py#L2549
.. _aws_cli: https://github.com/aws/aws-cli
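To make the trade-off between the options above concrete, here is a small sketch of option 1 (numbered suffixes on conflict) next to option 3 (one nested namespace per extension). The helper names and the sample values are hypothetical placeholders, not an agreed part of the specification.

```python
def merge_with_suffix(results, key, value):
    """Option 1: store under ``key``, or ``key_1``, ``key_2``, ... on conflict."""
    if key not in results:
        results[key] = value
        return key
    n = 1
    while '%s_%d' % (key, n) in results:
        n += 1
    label = '%s_%d' % (key, n)
    results[label] = value
    return label


def merge_with_namespace(results, extension, key, value):
    """Option 3: nest each extension's keys under the extension's name."""
    results.setdefault(extension, {})[key] = value


# Two middleware both want the name ``words``.
flat = {}
first = merge_with_suffix(flat, 'words', ['你好', '怎么样'])
second = merge_with_suffix(flat, 'words', ['你', '好'])

# With namespaces, each extension owns its own keys.
nested = {}
merge_with_namespace(nested, 'jieba', 'words', ['你好', '怎么样'])
merge_with_namespace(nested, 'unihan', 'kDefinition', 'good')
```

With suffixes, which middleware ends up owning ``words`` depends on registration order. With nested namespaces, the key for a given extension stays stable regardless of order.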
Extension philosophy
--------------------
The middleware approach is a proven way to get the job done.
`Connect`_ in node is a well-known example of plugin architecture in
JS. Middleware is added as a way to provide a light, dead-simple framework.
Cihai / Hanzi can take a similar approach.
Hanzi can follow connect's approach directly. It is clean and
proven. Cihai can note that middleware is already used in Django, and that
packages can be maintained using the pattern for Flask and Sphinx extensions.
Flask already has experience / lessons learned from packaging and namespacing
extensions.
Both projects can use the same data sets, a similar API and the same
extension strategy.
.. _Connect: https://github.com/senchalabs/connect
Accessing extensions directly?
------------------------------
Perhaps extensions can also be searched directly::
c.unihan.get('好')
Third-party APIs can specify optional extra arguments. For instance,
unihan may allow searching by one field::
c.unihan.get('好', 'kDefinition')
This allows a simple way to "drill down" cjk data across extensions.
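One way such an extension-level ``get`` with optional field arguments could behave is sketched below. The ``UnihanExtension`` class and its in-memory rows are hypothetical placeholders for illustration; a real extension would query the Unihan data set.

```python
class UnihanExtension(object):
    """Hypothetical stand-in for a Unihan extension."""

    # Illustrative placeholder rows, keyed by character.
    rows = {'好': {'kDefinition': 'good', 'kStrokes': '6'}}

    def get(self, char, *fields):
        """Return every field for ``char``, or only the requested ``fields``."""
        row = self.rows.get(char, {})
        if not fields:
            return dict(row)
        return {field: row[field] for field in fields if field in row}


unihan = UnihanExtension()
everything = unihan.get('好')
only_definition = unihan.get('好', 'kDefinition')
```

Accepting the fields as ``*fields`` keeps the one-argument call identical to the core ``.get`` while still allowing the drill-down form.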
API examples
------------
Example:
.. code-block:: python

    # Retrieves all rows and creates a keyed object.
    >>> obj = unihan.get('好')
    >>> obj.kDefinition
    >>> obj['kDefinition']
    >>> obj.keys()
    ['kDefinition',]

    # Restrict the lookup to specific fields.
    >>> obj = unihan.get('好', 'kDefinition', ...)
    >>> obj.kDefinition
    good
    >>> obj.kStrokes
    None
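A keyed object that supports both ``obj.kDefinition`` and ``obj['kDefinition']`` could be sketched as a small ``dict`` subclass. The ``Result`` class is a hypothetical helper, and returning ``None`` for fields that were not retrieved is an assumption taken from the ``obj.kStrokes`` example above.

```python
class Result(dict):
    """Dictionary whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith('__'):
            # Keep special-method lookups (copying, pickling) well-behaved.
            raise AttributeError(name)
        # Assumption: fields that were not retrieved read as None.
        return self.get(name)


obj = Result(kDefinition='good')
by_attribute = obj.kDefinition  # 'good'
by_item = obj['kDefinition']    # 'good'
missing = obj.kStrokes          # None
```

Subclassing ``dict`` means ``obj.keys()`` and iteration work without extra code, and the result can be passed to ``json.dumps`` unchanged.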
Creating a cihai plugin
-----------------------
.. code-block:: python

    class Unihan(Cihai.Contrib):
        """
        Utilizing a parent class can allow raising ``NotImplementedError``
        errors. Further, this can provide access to a ``db``.

        However, ultimately, the only thing that's really required is::

            class Example(object):

                def get(self, char):
                    return {
                        'char': char
                    }
        """

        def get(self, char):
            pass

        def install(self):
            pass

    cihai = Cihai()
    cihai.use(Unihan)  # register the middleware with cihai
    c = cihai.get('好')

    >>> c.keys()
    ['unihan']
    >>> c.get('好')
    <Cihai.Contrib.Unihan>
    >>> print(c.get('好'))
    >>> print(c.get('好').parent)
    # Below this point, libunihan splits into subplugins for its libraries.
    >>> print(dict(c.get('好')))
Cihai will allow extension with new dictionaries, vocabularies and data.
Middleware allows an arbitrary plugin to make data available.
By default, ``Cihai()`` creates an instance of Cihai with access to :meth:`Cihai.get`.
However, since no middleware is included with Cihai, no results are returned.
With ``Cihai(middleware=[Cihai.Unihan])``, or with ``c = Cihai()`` followed by
``c.use(Cihai.Unihan)``, the Cihai_Unihan middleware is available. What is
Cihai_Unihan? Simply a class::

    class Unihan(Cihai.Contrib):
        pass
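The two registration styles described above can be sketched in runnable form as follows. Every class here is a hypothetical stand-in written for illustration: the real ``Cihai`` and ``Cihai.Contrib`` may differ, and keying results by a ``name`` attribute is an assumption.

```python
class Contrib(object):
    """Hypothetical base class for middleware."""

    name = None

    def get(self, char):
        raise NotImplementedError(
            '%s must implement get()' % type(self).__name__)


class Unihan(Contrib):
    name = 'unihan'

    def get(self, char):
        return {'char': char}


class Cihai(object):
    def __init__(self, middleware=None):
        self.middleware = []
        for middleware_class in middleware or []:
            self.use(middleware_class)

    def use(self, middleware_class):
        """Instantiate and register a middleware class."""
        self.middleware.append(middleware_class())
        return self

    def get(self, char):
        # With no middleware registered this is an empty dict.
        return {mw.name: mw.get(char) for mw in self.middleware}


bare = Cihai()
via_constructor = Cihai(middleware=[Unihan])
via_use = Cihai()
via_use.use(Unihan)
```

In this sketch ``bare.get('好')`` returns an empty dictionary, while both other instances return the same result keyed by ``'unihan'``.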