
Commit 5b7779d

docs: enhance documents
Signed-off-by: Mingzhuo Yin <[email protected]>
1 parent: b361595

15 files changed: +905 -294 lines

Cargo.toml

Lines changed: 1 addition & 4 deletions
@@ -17,7 +17,7 @@ pg14 = ["pgrx/pg14", "pgrx-tests/pg14"]
 pg15 = ["pgrx/pg15", "pgrx-tests/pg15"]
 pg16 = ["pgrx/pg16", "pgrx-tests/pg16"]
 pg17 = ["pgrx/pg17", "pgrx-tests/pg17"]
-pg_test = ["insta"]
+pg_test = []
 lindera-ipadic = ["lindera/ipadic"]
 lindera-ipadic-neologd = ["lindera/ipadic-neologd"]
 lindera-unidic = ["lindera/unidic"]
@@ -38,10 +38,7 @@ toml = "0.8.20"
 unicode-normalization = "0.1.24"
 unicode-segmentation = "1.12.0"
 validator = { version = "0.20.0", features = ["derive"] }
-
-# dependencies for testing
 anyhow = "1.0.97"
-insta = { version = "1.42.1", optional = true }
 serde_json = "1.0.139"
 
 [dev-dependencies]

README.md

Lines changed: 16 additions & 236 deletions
@@ -2,249 +2,29 @@
 
 A PostgreSQL extension that provides tokenizers for full-text search.
 
-## Example
+## Quick Start
 
-```sql
-SELECT tokenizer_catalog.create_tokenizer('tokenizer1', $$
-model = "bert_base_uncased"
-pre_tokenizer.regex = '(?u)\b\w\w+\b'
-[[character_filters]]
-to_lowercase = {}
-[[token_filters]]
-stopwords = "nltk"
-[[token_filters]]
-stemmer = "english_porter2"
-$$);
-
-SELECT tokenizer_catalog.tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');
-----
-{2015:1, 2095:1, 2128:1, 2140:1, 2278:1, 2291:1, 2321:1, 2330:1, 2373:1, 2552:1, 2695:1, 2951:1, 4160:1, 4503:1, 4874:1, 12848:1, 14768:1, 17603:1, 20051:1, 22083:1}
-```
-
-```sql
-CREATE TABLE documents (
-    id SERIAL PRIMARY KEY,
-    passage TEXT,
-    embedding INT[]
-);
-
-SELECT create_text_analyzer('text_analyzer1', $$
-pre_tokenizer = "unicode_segmentation"
-[[character_filters]]
-to_lowercase = {}
-[[character_filters]]
-unicode_normalization = "nfkd"
-[[token_filters]]
-skip_non_alphanumeric = {}
-[[token_filters]]
-stopwords = "nltk"
-[[token_filters]]
-stemmer = "english_porter2"
-$$);
+TODO
 
-SELECT create_custom_model_tokenizer_and_trigger(
-    tokenizer_name => 'tokenizer1',
-    model_name => 'model1',
-    text_analyzer_name => 'text_analyzer1',
-    table_name => 'documents',
-    source_column => 'passage',
-    target_column => 'embedding'
-);
-
-INSERT INTO documents (passage) VALUES
-('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.'),
-('Full-text search is a technique for searching in plain-text documents or textual database fields. PostgreSQL supports this with tsvector.'),
-('BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query.'),
-('PostgreSQL provides many advanced features like full-text search, window functions, and more.'),
-('Search and ranking in databases are important in building effective information retrieval systems.'),
-('The BM25 ranking algorithm is derived from the probabilistic retrieval framework.'),
-('Full-text search indexes documents to allow fast text queries. PostgreSQL supports this through its GIN and GiST indexes.'),
-('The PostgreSQL community is active and regularly improves the database system.'),
-('Relational databases such as PostgreSQL can handle both structured and unstructured data.'),
-('Effective search ranking algorithms, such as BM25, improve search results by understanding relevance.');
-
-SELECT embedding FROM documents ORDER BY id;
-----
-{1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1}
-{1:1, 7:1, 13:1, 14:2, 15:2, 16:1, 18:1, 20:1, 21:1, 23:1, 25:1, 26:1}
-{15:2, 20:1, 27:1, 28:1, 29:1, 30:1, 32:1, 33:1, 34:1, 36:1, 38:1}
-{1:1, 13:1, 14:1, 15:1, 29:1, 40:1, 41:1, 42:1, 43:1, 44:1, 48:1}
-{7:1, 8:1, 15:1, 28:1, 53:1, 54:1, 55:1, 56:1, 57:1}
-{27:1, 28:1, 57:1, 61:1, 62:1, 63:1, 65:1}
-{1:1, 13:1, 14:2, 15:1, 20:1, 25:1, 38:1, 69:2, 71:1, 72:1, 77:1, 78:1}
-{1:1, 7:1, 8:1, 11:1, 81:1, 83:1, 84:1}
-{1:1, 6:1, 7:1, 90:1, 91:1, 92:1, 93:1}
-{15:2, 27:1, 28:1, 34:1, 55:1, 61:1, 84:1, 101:1, 102:1}
-```
+## Example
 
 ```sql
-SELECT tokenizer_catalog.create_text_analyzer('test_german', $$
-pre_tokenizer = "unicode_segmentation"
-[[token_filters]]
-pg_dict = "german_stem"
+SELECT create_tokenizer('tokenizer1', $$
+model = "llmlingua2"
 $$);
 
-SELECT tokenizer_catalog.apply_text_analyzer('Aus so krummen Holze, als woraus der Mensch gemacht ist, kann nichts ganz Gerades gezimmert werden.', 'test_german');
-----
-{krumm,holz,woraus,mensch,gemacht,ganz,gerad,gezimmert}
-```
-
-## Installation
-
-First, you need to add `pg_tokenizer` to your `shared_preload_libraries` in `postgresql.conf`.
-
-```sql
--- Note: A restart is required for this setting to take effect.
-ALTER SYSTEM SET shared_preload_libraries = 'pg_tokenizer.so';
+SELECT tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');
 ```
 
-Then, you can install the extension in the database.
-
-```bash
-CREATE EXTENSION pg_tokenizer;
-```
-
-## Usage
-
-The extension is mainly composed by 2 parts, `text analyzer` and `model`. `text analyzer` is used to parse the text and generate token arrays, which has similar functionality as `tsvector`. While `model` is used to generate token embeddings(token id array, can be casted to `bm25vector`), which is used for similarity search.
-
-### Text Analyzer
-
-`text analyzer` has 3 main components, `character filters`, `pre tokenizer`, `token filters`.
-
-- character_filters: It is used to filter out some characters before tokenization. e.g. `to_lowercase`, `unicode_normalization`.
-- pre-tokenizer: It is used to split the text into tokens. For example, `unicode segmentation` will split texts on Grapheme Cluster, Word or Sentence boundaries, according to the [`Unicode Standard Annex #29`](https://unicode.org/reports/tr29/)
-- token_filters: It is used to filter out some tokens after tokenization. e.g. `stopwords`, `stemmer`.
-
-### Model
-
-`model` has 3 main types, `pre-trained`, `custom`, `external`.
-- `pre-trained` models have pre-trained vocab lists and some pre-defined tokenization rules. e.g. [`bert_base_uncased`](https://huggingface.co/google-bert/bert-base-uncased), `wiki_tocken`(https://huggingface.co/datasets/iohadrubin/wikitext-103-raw-v1).
-- `custom` models will maintain their own vocab mapping. You can build a custom model based on your own corpus easily.
-- There are some useful tokenizers that cannot be decoupled into `text_analyzer` and `model`. We provide them as `external` models, and customize all configurations in the `model` part.
-
-## Reference
-
-### Functions
-
-#### Text Analyzer
-- `tokenizer_catalog.create_text_analyzer(name TEXT, config TEXT)`: Create a text analyzer.
-- `tokenizer_catalog.drop_text_analyzer(name TEXT)`: Drop a text analyzer.
-- `tokenizer_catalog.apply_text_analyzer(text TEXT, text_analyzer_name TEXT) RETURNS TEXT[]`: Apply a text analyzer to a text.
-
-<br/>
-
-- `tokenizer_catalog.create_stopwords(name TEXT, config TEXT)`: Create a stopwords dictionary.
-- `tokenizer_catalog.drop_stopwords(name TEXT)`: Drop a stopwords dictionary.
-
-<br/>
-
-- `tokenizer_catalog.create_synonym(name TEXT, config TEXT)`: Create a synonym dictionary.
-- `tokenizer_catalog.drop_synonym(name TEXT)`: Drop a synonym dictionary.
-
-#### Model
-
-- `tokenizer_catalog.create_custom_model(name TEXT, config TEXT)`: Create a custom model.
-- `tokenizer_catalog.create_custom_model_tokenizer_and_trigger(tokenizer_name TEXT, model_name TEXT, text_analyzer_name TEXT, table_name TEXT, source_column TEXT, target_column TEXT)`: Create a custom model tokenizer and trigger to update the target column automatically.
-- `tokenizer_catalog.drop_custom_model(name TEXT)`: Drop a custom model.
-
-<br/>
-
-- `tokenizer_catalog.add_preload_model(name TEXT)`: Add a model to the preload list.
-- `tokenizer_catalog.remove_preload_model(name TEXT)`: Remove a model from the preload list.
-- `tokenizer_catalog.list_preload_models() RETURNS TEXT[]`: List all preload models.
-
-<br/>
-
-- `tokenizer_catalog.create_lindera_model(name TEXT, config TEXT)`: Create a lindera model.
-- `tokenizer_catalog.drop_lindera_model(name TEXT)`: Drop a lindera model.
-
-<br/>
-
-- `tokenizer_catalog.create_huggingface_model(name TEXT, config TEXT)`: Create a huggingface model.
-- `tokenizer_catalog.drop_huggingface_model(name TEXT)`: Drop a huggingface model.
-
-#### Tokenizer
-
-- `tokenizer_catalog.create_tokenizer(name TEXT, config TEXT)`: Create a tokenizer.
-- `tokenizer_catalog.drop_tokenizer(name TEXT)`: Drop a tokenizer.
-- `tokenizer_catalog.tokenize(text TEXT, tokenizer_name TEXT) RETURNS INT[]`: Tokenize a text.
-
-
-## Configuration
-
-> We utilize `TOML` syntax to express all configurations.
-
-### Options for `text analyzer`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| character_filters | Array of Table | Character filters, see [Options for `character_filter`](#options-for-character_filter) |
-| pre_tokenizer | Table | Pre-tokenizer, see [Options for `pre_tokenizer`](#options-for-pre_tokenizer) |
-| token_filters | Array of Table | Token filters, see [Options for `token_filter`](#options-for-token_filter) |
-
-### Options for `character_filter`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| to_lowercase | Empty Table | Convert all characters to lowercase |
-| unicode_normalization | String | Unicode normalization form, see [Unicode Normalization Forms](https://unicode.org/reports/tr15/), supported values: `nfkc`, `nfkd`, `nfc`, `nfd` |
-
-You can choose only one of the above options for each character filter.
-
-### Options for `pre_tokenizer`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| regex | String | It will generate all tokens that match the regex pattern |
-| unicode_segmentation | Empty Table | Split the text into tokens based on the Unicode Standard Annex #29 |
-| jieba | Table | Split the text into tokens based on the Jieba Chinese tokenizer, see [Options for `jieba`](#options-for-jieba) |
-
-#### Options for `jieba`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| mode | String | Jieba tokenizer mode, supported values: `full`, `precise`, `search`. default: `search` |
-| enable_hmm | Boolean | Whether to enable HMM, default: `true` |
-
-### Options for `token_filter`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| skip_non_alphanumeric | Empty Table | Skip tokens that all characters are non-alphanumeric |
-| stemmer | String | Stemmer, see [Supported values for `stemmer`](#supported-values-for-stemmer) |
-| stopwords | String | Stopwords name, builtin: `lucene_english`, `nltk_english`, `iso_english` |
-| synonym | String | Synonym name |
-| pg_dict | String | Using [postgres text search dictionary](https://www.postgresql.org/docs/current/textsearch-dictionaries.html). We currently support all dictionaries except `Thesaurus Dictionary`. |
-
-You can choose only one of the above options for each token filter.
-
-#### Supported values for `stemmer`
-
-arabic, armenian, basque, catalan, danish, dutch, english_porter, english_porter2, estonian, finnish, french, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, portuguese, romanian, russian, serbian, spanish, swedish, tamil, turkish, yiddish
-
-### Options for `tokenizer`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| text_analyzer | String | Text analyzer name. If you are using an external model, you can just ignore this. |
-| model | String | Model name. We have some builtin models, see [Builtin models](#builtin-models) |
-
-#### Builtin models
-
-- [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
-- [wiki_tocken](https://huggingface.co/datasets/iohadrubin/wikitext-103-raw-v1)
-- [gemma2b](https://huggingface.co/google/gemma-2b)
-- [llmlingua2](https://huggingface.co/microsoft/llmlingua-2-xlm-roberta-large-meetingbank)
-
-### Options for `custom model`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| table | String | Table name. |
-| column | String | Column name. |
-| text_analyzer | String | Text analyzer name. |
+More examples can be found in [docs/03-examples.md](docs/03-examples.md).
 
-### Options for `lindera model`
+## Documentation
 
-It's totally the same as lindera tokenizer configs, see [Lindera](https://github.com/lindera/lindera).
+- [Installation](docs/01-installation.md)
+- [Development](docs/02-development.md)
+- [Examples](docs/03-examples.md)
+- [Usage](docs/04-usage.md)
+- [Text Analyzer](docs/05-text-analyzer.md)
+- [Model](docs/06-model.md)
+- [Limitation](docs/07-limitation.md)
+- [Reference](docs/00-reference.md)
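Read together, the diff above already contains an end-to-end flow even though the new Quick Start section is still marked TODO. The sketch below only stitches together statements that appear verbatim in this commit's diff (the removed Installation section plus the new Example); treat it as an illustration of the changed docs, not as the project's official quick start.

```sql
-- From the removed Installation section: preload the library, then restart PostgreSQL.
ALTER SYSTEM SET shared_preload_libraries = 'pg_tokenizer.so';

-- After the restart, install the extension in the target database.
CREATE EXTENSION pg_tokenizer;

-- From the new Example section: create a tokenizer backed by the builtin
-- llmlingua2 model and tokenize a sentence into token ids.
SELECT create_tokenizer('tokenizer1', $$
model = "llmlingua2"
$$);

SELECT tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');
```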
