
A PostgreSQL extension that provides tokenizers for full-text search.

-## Example
+## Quick Start

-```sql
-SELECT tokenizer_catalog.create_tokenizer('tokenizer1', $$
-model = "bert_base_uncased"
-pre_tokenizer.regex = '(?u)\b\w\w+\b'
-[[character_filters]]
-to_lowercase = {}
-[[token_filters]]
-stopwords = "nltk"
-[[token_filters]]
-stemmer = "english_porter2"
-$$);
-
-SELECT tokenizer_catalog.tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');
----
-{2015:1, 2095:1, 2128:1, 2140:1, 2278:1, 2291:1, 2321:1, 2330:1, 2373:1, 2552:1, 2695:1, 2951:1, 4160:1, 4503:1, 4874:1, 12848:1, 14768:1, 17603:1, 20051:1, 22083:1}
-```
-
-```sql
-CREATE TABLE documents (
-    id SERIAL PRIMARY KEY,
-    passage TEXT,
-    embedding INT[]
-);
-
-SELECT create_text_analyzer('text_analyzer1', $$
-pre_tokenizer = "unicode_segmentation"
-[[character_filters]]
-to_lowercase = {}
-[[character_filters]]
-unicode_normalization = "nfkd"
-[[token_filters]]
-skip_non_alphanumeric = {}
-[[token_filters]]
-stopwords = "nltk"
-[[token_filters]]
-stemmer = "english_porter2"
-$$);
+TODO

-SELECT create_custom_model_tokenizer_and_trigger(
-    tokenizer_name => 'tokenizer1',
-    model_name => 'model1',
-    text_analyzer_name => 'text_analyzer1',
-    table_name => 'documents',
-    source_column => 'passage',
-    target_column => 'embedding'
-);
-
-INSERT INTO documents (passage) VALUES
-('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.'),
-('Full-text search is a technique for searching in plain-text documents or textual database fields. PostgreSQL supports this with tsvector.'),
-('BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query.'),
-('PostgreSQL provides many advanced features like full-text search, window functions, and more.'),
-('Search and ranking in databases are important in building effective information retrieval systems.'),
-('The BM25 ranking algorithm is derived from the probabilistic retrieval framework.'),
-('Full-text search indexes documents to allow fast text queries. PostgreSQL supports this through its GIN and GiST indexes.'),
-('The PostgreSQL community is active and regularly improves the database system.'),
-('Relational databases such as PostgreSQL can handle both structured and unstructured data.'),
-('Effective search ranking algorithms, such as BM25, improve search results by understanding relevance.');
-
-SELECT embedding FROM documents ORDER BY id;
----
-{1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1}
-{1:1, 7:1, 13:1, 14:2, 15:2, 16:1, 18:1, 20:1, 21:1, 23:1, 25:1, 26:1}
-{15:2, 20:1, 27:1, 28:1, 29:1, 30:1, 32:1, 33:1, 34:1, 36:1, 38:1}
-{1:1, 13:1, 14:1, 15:1, 29:1, 40:1, 41:1, 42:1, 43:1, 44:1, 48:1}
-{7:1, 8:1, 15:1, 28:1, 53:1, 54:1, 55:1, 56:1, 57:1}
-{27:1, 28:1, 57:1, 61:1, 62:1, 63:1, 65:1}
-{1:1, 13:1, 14:2, 15:1, 20:1, 25:1, 38:1, 69:2, 71:1, 72:1, 77:1, 78:1}
-{1:1, 7:1, 8:1, 11:1, 81:1, 83:1, 84:1}
-{1:1, 6:1, 7:1, 90:1, 91:1, 92:1, 93:1}
-{15:2, 27:1, 28:1, 34:1, 55:1, 61:1, 84:1, 101:1, 102:1}
-```
+## Example

```sql
-SELECT tokenizer_catalog.create_text_analyzer('test_german', $$
-pre_tokenizer = "unicode_segmentation"
-[[token_filters]]
-pg_dict = "german_stem"
+SELECT create_tokenizer('tokenizer1', $$
+model = "llmlingua2"
$$);

-SELECT tokenizer_catalog.apply_text_analyzer('Aus so krummen Holze, als woraus der Mensch gemacht ist, kann nichts ganz Gerades gezimmert werden.', 'test_german');
----
-{krumm,holz,woraus,mensch,gemacht,ganz,gerad,gezimmert}
-```
-
-## Installation
-
-First, add `pg_tokenizer` to `shared_preload_libraries` in `postgresql.conf`.
-
-```sql
--- Note: A restart is required for this setting to take effect.
-ALTER SYSTEM SET shared_preload_libraries = 'pg_tokenizer.so';
+SELECT tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');
```

-Then install the extension in the database.
-
-```sql
-CREATE EXTENSION pg_tokenizer;
-```
-
-## Usage
-
-The extension is composed of two main parts: `text analyzer` and `model`. A `text analyzer` parses text and produces a token array, similar in functionality to `tsvector`, while a `model` produces token embeddings (a token ID array that can be cast to `bm25vector`) for similarity search, as sketched below.
-
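For instance, a minimal sketch of the two parts side by side, using only calls that appear elsewhere in this README (the `demo_*` names are illustrative):

```sql
-- Text analyzer: raw text -> token array (comparable to tsvector).
SELECT tokenizer_catalog.create_text_analyzer('demo_analyzer', $$
pre_tokenizer = "unicode_segmentation"
[[token_filters]]
stemmer = "english_porter2"
$$);

SELECT tokenizer_catalog.apply_text_analyzer('Tokenizers improve full-text search.', 'demo_analyzer');

-- Model: tokens -> token IDs, an INT[] that can be cast to bm25vector.
SELECT tokenizer_catalog.create_tokenizer('demo_tokenizer', $$
model = "llmlingua2"
$$);

SELECT tokenizer_catalog.tokenize('Tokenizers improve full-text search.', 'demo_tokenizer');
```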
-### Text Analyzer
-
-A `text analyzer` has three main components, applied in order: `character_filters`, `pre_tokenizer`, and `token_filters` (a combined example follows this list).
-
-- character_filters: Filter or transform characters before tokenization, e.g. `to_lowercase`, `unicode_normalization`.
-- pre_tokenizer: Split the text into tokens. For example, `unicode_segmentation` splits text on grapheme cluster, word, or sentence boundaries according to [Unicode Standard Annex #29](https://unicode.org/reports/tr29/).
-- token_filters: Filter or transform tokens after tokenization, e.g. `stopwords`, `stemmer`.
-
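A sketch that combines all three stages; it reuses only options already shown in this README, and the analyzer name is illustrative (results omitted):

```sql
-- Character filters run first, then the pre-tokenizer, then the token filters.
SELECT tokenizer_catalog.create_text_analyzer('pipeline_demo', $$
pre_tokenizer.regex = '(?u)\b\w\w+\b'
[[character_filters]]
unicode_normalization = "nfkd"
[[character_filters]]
to_lowercase = {}
[[token_filters]]
stopwords = "nltk"
$$);

SELECT tokenizer_catalog.apply_text_analyzer('Hello, World! PostgreSQL is Fast.', 'pipeline_demo');
```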
-### Model
-
-A `model` comes in three types: `pre-trained`, `custom`, and `external` (see the sketch after this list).
-- `pre-trained` models ship with a pre-trained vocabulary and pre-defined tokenization rules, e.g. [`bert_base_uncased`](https://huggingface.co/google-bert/bert-base-uncased) and [`wiki_tocken`](https://huggingface.co/datasets/iohadrubin/wikitext-103-raw-v1).
-- `custom` models maintain their own vocabulary mapping, so you can easily build a model from your own corpus.
-- Some useful tokenizers cannot be decoupled into a `text_analyzer` and a `model`; we provide these as `external` models and keep all of their configuration in the `model` part.
-
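For example, a pre-trained model can be used on its own, as in the Example section above. A hedged sketch, assuming the other builtin models accept the same model-only config (the tokenizer name is illustrative):

```sql
-- Pre-trained model: the vocabulary ships with the extension, so only the model name is given.
SELECT tokenizer_catalog.create_tokenizer('gemma_tokenizer', $$
model = "gemma2b"
$$);

SELECT tokenizer_catalog.tokenize('full-text search needs good tokenization', 'gemma_tokenizer');
```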
-## Reference
-
-### Functions
-
-#### Text Analyzer
-- `tokenizer_catalog.create_text_analyzer(name TEXT, config TEXT)`: Create a text analyzer.
-- `tokenizer_catalog.drop_text_analyzer(name TEXT)`: Drop a text analyzer.
-- `tokenizer_catalog.apply_text_analyzer(text TEXT, text_analyzer_name TEXT) RETURNS TEXT[]`: Apply a text analyzer to a text.
-
-<br/>
-
-- `tokenizer_catalog.create_stopwords(name TEXT, config TEXT)`: Create a stopwords dictionary.
-- `tokenizer_catalog.drop_stopwords(name TEXT)`: Drop a stopwords dictionary.
-
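The config format for stopwords dictionaries is not shown in this README; the sketch below is hypothetical and assumes one word per line. Referencing the dictionary by name from a `stopwords` token filter follows the documented options:

```sql
-- Hypothetical: assumes the stopwords config is a newline-separated word list.
SELECT tokenizer_catalog.create_stopwords('demo_stopwords', $$
the
a
an
$$);

-- Use the custom dictionary by name in a token filter.
SELECT tokenizer_catalog.create_text_analyzer('no_articles', $$
pre_tokenizer = "unicode_segmentation"
[[token_filters]]
stopwords = "demo_stopwords"
$$);
```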
-<br/>
-
-- `tokenizer_catalog.create_synonym(name TEXT, config TEXT)`: Create a synonym dictionary.
-- `tokenizer_catalog.drop_synonym(name TEXT)`: Drop a synonym dictionary.
-
-#### Model
-
-- `tokenizer_catalog.create_custom_model(name TEXT, config TEXT)`: Create a custom model.
-- `tokenizer_catalog.create_custom_model_tokenizer_and_trigger(tokenizer_name TEXT, model_name TEXT, text_analyzer_name TEXT, table_name TEXT, source_column TEXT, target_column TEXT)`: Create a custom model, a tokenizer, and a trigger that updates the target column automatically.
-- `tokenizer_catalog.drop_custom_model(name TEXT)`: Drop a custom model.
-
-<br/>
-
-- `tokenizer_catalog.add_preload_model(name TEXT)`: Add a model to the preload list.
-- `tokenizer_catalog.remove_preload_model(name TEXT)`: Remove a model from the preload list.
-- `tokenizer_catalog.list_preload_models() RETURNS TEXT[]`: List all preload models.
-
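A short sketch of the preload-list calls; `llmlingua2` is one of the builtin models listed below, and the preload list presumably keeps the model loaded so the first `tokenize` call does not pay the loading cost:

```sql
SELECT tokenizer_catalog.add_preload_model('llmlingua2');
SELECT tokenizer_catalog.list_preload_models();
SELECT tokenizer_catalog.remove_preload_model('llmlingua2');
```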
-<br/>
-
-- `tokenizer_catalog.create_lindera_model(name TEXT, config TEXT)`: Create a Lindera model.
-- `tokenizer_catalog.drop_lindera_model(name TEXT)`: Drop a Lindera model.
-
-<br/>
-
-- `tokenizer_catalog.create_huggingface_model(name TEXT, config TEXT)`: Create a Hugging Face model.
-- `tokenizer_catalog.drop_huggingface_model(name TEXT)`: Drop a Hugging Face model.
-
-#### Tokenizer
-
-- `tokenizer_catalog.create_tokenizer(name TEXT, config TEXT)`: Create a tokenizer.
-- `tokenizer_catalog.drop_tokenizer(name TEXT)`: Drop a tokenizer.
-- `tokenizer_catalog.tokenize(text TEXT, tokenizer_name TEXT) RETURNS INT[]`: Tokenize a text.
-
-
-## Configuration
-
-> We use `TOML` syntax to express all configurations.
-
-### Options for `text analyzer`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| character_filters | Array of Table | Character filters, see [Options for `character_filter`](#options-for-character_filter) |
-| pre_tokenizer | Table | Pre-tokenizer, see [Options for `pre_tokenizer`](#options-for-pre_tokenizer) |
-| token_filters | Array of Table | Token filters, see [Options for `token_filter`](#options-for-token_filter) |
-
-### Options for `character_filter`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| to_lowercase | Empty Table | Convert all characters to lowercase |
-| unicode_normalization | String | Unicode normalization form, see [Unicode Normalization Forms](https://unicode.org/reports/tr15/), supported values: `nfkc`, `nfkd`, `nfc`, `nfd` |
-
-You can choose only one of the above options for each character filter.
-
-### Options for `pre_tokenizer`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| regex | String | Generate all tokens that match the regex pattern |
-| unicode_segmentation | Empty Table | Split the text into tokens on boundaries defined by Unicode Standard Annex #29 |
-| jieba | Table | Split the text into tokens with the Jieba Chinese tokenizer, see [Options for `jieba`](#options-for-jieba) |
-
-#### Options for `jieba`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| mode | String | Jieba tokenizer mode; supported values: `full`, `precise`, `search`; default: `search` |
-| enable_hmm | Boolean | Whether to enable HMM; default: `true` |
-
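A hedged sketch of a Jieba-based analyzer; the analyzer name is illustrative, and nesting the `jieba` options under `pre_tokenizer` is inferred from the tables above rather than taken from a documented example:

```sql
SELECT tokenizer_catalog.create_text_analyzer('chinese_demo', $$
[pre_tokenizer.jieba]
mode = "search"
enable_hmm = true
$$);

SELECT tokenizer_catalog.apply_text_analyzer('我们都有光明的前途', 'chinese_demo');
```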
-### Options for `token_filter`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| skip_non_alphanumeric | Empty Table | Skip tokens in which every character is non-alphanumeric |
-| stemmer | String | Stemmer, see [Supported values for `stemmer`](#supported-values-for-stemmer) |
-| stopwords | String | Stopwords name, builtin: `lucene_english`, `nltk_english`, `iso_english` |
-| synonym | String | Synonym name |
-| pg_dict | String | Use a [PostgreSQL text search dictionary](https://www.postgresql.org/docs/current/textsearch-dictionaries.html). All dictionary types except the thesaurus dictionary are currently supported. |
-
-You can choose only one of the above options for each token filter.
-
-#### Supported values for `stemmer`
-
-arabic, armenian, basque, catalan, danish, dutch, english_porter, english_porter2, estonian, finnish, french, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, portuguese, romanian, russian, serbian, spanish, swedish, tamil, turkish, yiddish
-
-### Options for `tokenizer`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| text_analyzer | String | Text analyzer name. Ignore this when using an external model. |
-| model | String | Model name. We provide several builtin models, see [Builtin models](#builtin-models) |
-
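Putting the two keys together, a sketch of a tokenizer built from a named text analyzer and a custom model; it assumes the `text_analyzer1` and `model1` objects from the example at the top of this README already exist:

```sql
SELECT tokenizer_catalog.create_tokenizer('tokenizer2', $$
text_analyzer = "text_analyzer1"
model = "model1"
$$);

SELECT tokenizer_catalog.tokenize('full-text search in PostgreSQL', 'tokenizer2');
```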
-#### Builtin models
-
-- [bert_base_uncased](https://huggingface.co/google-bert/bert-base-uncased)
-- [wiki_tocken](https://huggingface.co/datasets/iohadrubin/wikitext-103-raw-v1)
-- [gemma2b](https://huggingface.co/google/gemma-2b)
-- [llmlingua2](https://huggingface.co/microsoft/llmlingua-2-xlm-roberta-large-meetingbank)
-
-### Options for `custom model`
-
-| Key | Type | Description |
-| --- | --- | --- |
-| table | String | Table name. |
-| column | String | Column name. |
-| text_analyzer | String | Text analyzer name. |
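For example, a sketch of creating a custom model directly from these three keys, reusing the table and analyzer names from the example at the top of this README (`create_custom_model_tokenizer_and_trigger` appears to bundle this step with tokenizer and trigger creation):

```sql
SELECT tokenizer_catalog.create_custom_model('model1', $$
table = "documents"
column = "passage"
text_analyzer = "text_analyzer1"
$$);
```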
+More examples can be found in [docs/03-examples.md](docs/03-examples.md).

-### Options for `lindera model`
+## Documentation

-It is exactly the same as the Lindera tokenizer configuration; see [Lindera](https://github.com/lindera/lindera).
+- [Installation](docs/01-installation.md)
+- [Development](docs/02-development.md)
+- [Examples](docs/03-examples.md)
+- [Usage](docs/04-usage.md)
+- [Text Analyzer](docs/05-text-analyzer.md)
+- [Model](docs/06-model.md)
+- [Limitation](docs/07-limitation.md)
+- [Reference](docs/00-reference.md)