README.md: 226 additions & 1 deletion
@@ -22,6 +22,231 @@ bin/rake db:create db:migrate
22
22
bin/rake wordnet:import
23
23
```
24
24
25
+
## Project overview
26
+
27
+
[Słowosieć][1] is a Polish equivalent of Princeton Wordnet, a lexical database of word senses and relations between them.
28
+
29
+
This document describes a successful effort to make the web interface of Polish Wordnet more performant and user-friendly. In particular, it covers the architecture, the components used, and the database design.
30
+
31
+
The front-end and back-end of the application were rebuilt from scratch. As a result, browsing latency dropped from as much as 30 seconds in some cases to 110 ms on average.
32
+
33
+
## Architecture
34
+
35
+
The following decisions have been made:
36
+
37
+
* Data is stored in normalised form using relational database
38
+
* Data is indexed and queried using graph database
39
+
* Data is rendered on client-side using templates
40
+
* Data is loaded through a well-crafted API endpoint
41
+
42
+
Given [multiple issues with the MySQL database][2] and [performance issues with handling UUIDs][17], [PostgreSQL][3] was chosen as the relational database backend. This has the additional advantage of allowing data to be stored in Hstore and Array types (where sensible), avoiding unnecessary `JOIN` statements during data retrieval.
43
+
44
+
[Neo4J][4] was chosen as the graph database backend. The main reasons were that it is an open-source, mature, and reliable graph store. Neo4J is one of the few graph databases that provide a declarative way of querying data, using the [Cypher][5] language (similar in some ways to SQL).
45
+
46
+
On the front-end, the [Angular.js][6] framework is used. It is a relatively new but popular product developed and maintained by Google. It allows for easy decoupling of application logic and template rendering through its concepts of [directives, services, and controllers][7].
47
+
48
+
The [Rails 4][8] web framework is used both for the API endpoint and for serving the front-end. Rails is mature software that allows for robust development of modern web applications. Being written in [Ruby][9], it lets us use tens of thousands of [Ruby Gems][10], which significantly speeds up development.
49
+
50
+
The API allows the front-end and back-end to be developed independently.
51
+
52
+
## Other technologies used
53
+
54
+
Experience led us to choose the following set of tools for application development:
55
+
56
+
* [CoffeeScript][11] replacing plain JavaScript
57
+
* [SASS][12] replacing plain CSS stylesheets
58
+
* [SLIM][13] for rendering front-end HTML markup
59
+
60
+
## Definitions
61
+
62
+
- [Lexeme][14] - a unit of lexical meaning that exists regardless of the number of inflectional endings it may have or the number of words it may contain (e.g. run, ran, runs)
63
+
- [Lemma][15] - the particular form of a lexeme that is chosen by convention as its canonical form (e.g. run)
64
+
- [Sense][16] - a Lexeme associated with a particular meaning. Each Lexeme can have multiple Senses. In Wordnet each Sense is given a number so that it can be easily distinguished (e.g. `run 4` means an unbroken series of events, and `run 5` means the act of running)
65
+
- [Synset](https://en.wikipedia.org/wiki/Synonym_ring) - a set of Senses (not Lexemes) with similar meaning, i.e. synonyms (e.g. `run 2` forms a Synset with the following Senses: `bunk 3`, `escape 6`, `turn tail 1`).
66
+
- [Sense Relation](https://academic.cuesta.edu/acasupp/as/507.HTM) - a relationship between two Senses, i.e. a relationship between two particular meanings of words (e.g. `big 1` is an antonym of `little 1`)
67
+
- Synset Relation - a relationship between two Synsets, i.e. relationship between two groups of Senses (e.g. `Synset { act 10, play 25 }` is hyponym of `Synset { overact 1, overplay 1 }`).
68
+
- Relation Type - each SenseRelation and SynsetRelation has a type, which can be, among others: antonym, hyponym, hyperonym, meronym, ...
69
+
70
+
In summary: Each Lexeme is represented by a Lemma. Each Lexeme has multiple Senses. Each Sense forms a Synset with other Senses. Each Sense can be in a SenseRelation with other Senses. Each Synset can be in a SynsetRelation with other Synsets. Each Relation has its own RelationType.
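
The summary above can be sketched in plain Ruby. This is only an illustration built from the examples in the definitions; the `Sense`, `Synset`, and `Relation` structs, the `label` helper, and the `synset-N` identifiers are hypothetical and are not the application's actual models.

```ruby
# Hypothetical plain-Ruby model of the concepts above (not the application's classes).
Sense = Struct.new(:lemma, :sense_index, :synset_id)
Synset = Struct.new(:id, :senses)
Relation = Struct.new(:parent, :child, :type)

# Display form of a Sense, such as "run 2".
def label(sense)
  "#{sense.lemma} #{sense.sense_index}"
end

# A Synset groups Senses (not Lexemes) that share a meaning.
flee = Synset.new("synset-1", [])
[["run", 2], ["bunk", 3], ["escape", 6], ["turn tail", 1]].each do |lemma, index|
  flee.senses << Sense.new(lemma, index, flee.id)
end

# A SenseRelation links two particular Senses and carries a relation type.
big = Sense.new("big", 1, "synset-2")
little = Sense.new("little", 1, "synset-3")
antonymy = Relation.new(big, little, "antonym")

synonyms = flee.senses.map { |sense| label(sense) }
```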
71
+
72
+
The Wordnet concepts above are modelled in the application as follows:
73
+
74
+

75
+
76
+
77
+
## Relational Database
78
+
79
+
Introducing a relational database as the primary store had two purposes:
80
+
1. Reliably and economically storing data in normalised form
81
+
2. Ability to use de-normalised graph database as index
82
+
83
+
The data is imported into normalised form from Polish Wordnet, but the process allows for importing an arbitrary Wordnet-like database.
84
+
85
+
Unconventionally, the primary keys of the database tables are UUIDs instead of auto-incrementing values. This has a few advantages:
86
+
- Plays well with graph databases, each node has its own unique ID
87
+
- UUIDs for records can be generated by application code, which makes inserting interconnected data into the database easier and faster.
88
+
- Makes replication of relational database trivial
89
+
- Allows for easy merging of two databases with same schema
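
As a minimal sketch of the second advantage, the snippet below generates UUIDs in application code so that related rows can reference each other before anything is written. The row hashes and sample values are assumptions made for illustration; only `SecureRandom.uuid` from Ruby's standard library is used.

```ruby
require "securerandom"

# Assumption: rows are plain hashes keyed by column name.
synset = { id: SecureRandom.uuid, comment: "example synset" }

# The synset's ID is known before any write happens, so the senses can
# reference it immediately and all rows can be sent in a single batch.
senses = [["car", 1], ["auto", 1]].map do |lemma, index|
  { id: SecureRandom.uuid, synset_id: synset[:id], lemma: lemma, sense_index: index }
end
```

With auto-incrementing keys, the synset would have to be inserted first and its generated ID read back before the senses could be built.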
90
+
91
+
The overall schema closely resembles the concepts described earlier:
92
+
93
+
### senses
94
+
95
+
* `id`: The UUID identifier
96
+
* `synset_id`: The UUID of the connected Synset
97
+
* `external_id`: The ID from the external database, used for importing
98
+
* `lemma`: The lemma of the Lexeme that the Sense belongs to (e.g. car)
99
+
* `sense_index`: The index of the sense in the context of its Synset (e.g. 1)
100
+
* `comment`: The short comment, used in the UI (e.g. transporting machine)
101
+
* `language`: Currently can be `en_GB` or `pl_PL`
102
+
* `part_of_speech`: The part of speech of the Sense (noun etc.)
103
+
* `domain_id`: The ID of the Domain of the Sense (not used yet)
104
+
105
+
### synsets
106
+
107
+
* `id`: The UUID identifier
108
+
* `external_id`: The ID from the external database, used for importing
109
+
* `comment`: The short comment by Słowosieć, used in the UI
110
+
* `definition`: The short comment by Princeton Wordnet, used in the UI
111
+
* `examples`: The examples of usage of the synset from Princeton Wordnet
112
+
113
+
### relation_types
114
+
115
+
* `name`: Name of the relation
116
+
* `reverse_relation`: Name of the reverse relation (see: normalisation)
117
+
* `parent_id`: Name of the parent RelationType (inheritance-like)
118
+
* `priority`: Used for sorting relation types in the UI (lower is better)
119
+
* `description`: Description of the relation (not used yet)
120
+
121
+
### sense\_relations and synset\_relations
122
+
123
+
* `parent_id`: UUID of the base sense (or synset)
124
+
* `child_id`: UUID of the related sense (or synset)
125
+
* `relation_id`: UUID of the relation that the child has toward the parent (e.g. the UUID of the hyponymy relation means the child is a hyponym of the parent)
126
+
127
+
### Normalisation of Relations
128
+
Imported relations are normalised in a few ways:
129
+
130
+
1. For reverse relation types we keep only one relation type (by convention the one where there are more children than parents, e.g. hyponyms, not hyperonyms).
131
+
2. The name of the removed reverse relation is assigned to `reverse_name`
132
+
3. `name` and `reverse_name` are in plural form for UI purposes
133
+
4. Even if a relation type has a parent, its name gives the full relation type name (for example “Meronymes (place)”, not “place”)
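
The first rule could look roughly like the sketch below. It is an illustration under assumptions, not the importer's actual code: the `REVERSE_OF` mapping, the `normalise` helper, the relation hashes, and the idea of swapping parent and child when a relation is rewritten are all hypothetical.

```ruby
# Hypothetical mapping: the relation type that is dropped => the type that is kept.
REVERSE_OF = { "hyperonymy" => "hyponymy" }.freeze

# Rewrites a relation of a dropped type as the kept type, with parent and child
# swapped so that it still describes the same pair of senses (or synsets).
def normalise(relation)
  kept = REVERSE_OF[relation[:type]]
  return relation unless kept

  { parent_id: relation[:child_id], child_id: relation[:parent_id], type: kept }
end

relations = [
  { parent_id: "dog", child_id: "animal", type: "hyperonymy" }, # animal is a hyperonym of dog
  { parent_id: "animal", child_id: "dog", type: "hyponymy" }    # dog is a hyponym of animal
]

# Both input rows describe the same fact, so only one row remains.
normalised = relations.map { |relation| normalise(relation) }.uniq
```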
134
+
135
+

136
+
137
+
## Graph Database
138
+
139
+

140
+
The graph database has a slightly different structure than the relational database. Most importantly, Sense and Synset nodes don’t contain any data except their IDs. Relationships of type `relation` exist only between Synset and Sense nodes. All data displayed in UI columns is held in Data nodes.
141
+
142
+
In the UI, each Synset and each Sense is represented by its connected Data node.
143
+
144
+
A Data node holds the following data from the Sense model:
145
+
* lemma
146
+
* sense_index
147
+
* comment
148
+
* language
149
+
* part_of_speech
150
+
* domain_id
151
+
152
+
## Importing data from external Wordnets
153
+
154
+
Wordnet uses an internal, normalised representation of the database. The normalised structure is defined in the Relational Database section.
155
+
156
+
The data mapping is done by five classes that inherit from the Importer class:
157
+
158
+
* WordnetPl::RelationType
159
+
* WordnetPl::Sense
160
+
* WordnetPl::Synset
161
+
* WordnetPl::SenseRelation
162
+
* WordnetPl::SynsetRelation
163
+
164
+
Each class is responsible for importing data into the corresponding model.
165
+
166
+
The Importer class processes data in batches for performance reasons. It handles rendering the progress bar, parallelising the import process, and synchronising writes. It expects the following methods to be defined in its descendants:
167
+
168
+
* `total_count`: The total count of items to be imported
169
+
* `load_entities(limit, offset)`: This method should load `limit` records from the external database starting at the given `offset` and return a hash that is later consumed by the `process_entities!` method
170
+
* `process_entities!(entities)`: This method is responsible for processing the data returned from `load_entities` and passing it to the `persist_entities!` method described below
171
+
172
+
`persist_entities!(table_name, collection, unique_attributes)` uses the [Upsert][18] method to insert or update data in the database efficiently. It accepts the name of the table where the records should be inserted or updated, and the actual `collection` of records as an array of hashes where keys are column names (see the relational database schema) and values are row values. `unique_attributes` is an array of column names that the upsert method uses to select the data to merge (usually `"id"`, but it can be, for example, `["parent_id", "child_id"]` for relations).
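
The contract described above can be illustrated with a heavily simplified, hypothetical stand-in. `SketchImporter`, `SketchSenseImporter`, the `import!` method, the `SOURCE` data, and the in-memory `persisted` array are all assumptions made for this sketch: there is no progress bar, no parallelism, and no real upsert. `load_entities` also returns an array of hashes here rather than the structure the real importers use.

```ruby
# Hypothetical, simplified stand-in for the Importer base class.
class SketchImporter
  BATCH_SIZE = 2

  attr_reader :persisted

  def initialize
    @persisted = []
  end

  # Walks through the source in batches and hands each batch to the subclass.
  def import!
    (0...total_count).step(BATCH_SIZE) do |offset|
      process_entities!(load_entities(BATCH_SIZE, offset))
    end
  end

  # In-memory imitation of upsert semantics: a row that matches an existing
  # row on unique_attributes replaces it instead of being added again.
  def persist_entities!(table_name, collection, unique_attributes)
    collection.each do |row|
      key = row.values_at(*unique_attributes)
      @persisted.reject! do |table, existing|
        table == table_name && existing.values_at(*unique_attributes) == key
      end
      @persisted << [table_name, row]
    end
  end
end

# Hypothetical descendant defining the three expected methods.
class SketchSenseImporter < SketchImporter
  SOURCE = [
    { "ID" => 1, "word" => "car" },
    { "ID" => 2, "word" => "auto" },
    { "ID" => 3, "word" => "run" }
  ].freeze

  def total_count
    SOURCE.size
  end

  def load_entities(limit, offset)
    SOURCE[offset, limit]
  end

  def process_entities!(entities)
    rows = entities.map { |e| { "external_id" => e["ID"], "lemma" => e["word"] } }
    persist_entities!("senses", rows, ["external_id"])
  end
end

importer = SketchSenseImporter.new
importer.import!
```

Running `import!` a second time leaves the same three rows in place, which is the behaviour the upsert-based approach is meant to provide.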
173
+
174
+
The import process can be triggered by issuing the command:
175
+
176
+
```
177
+
bin/rake wordnet:import
178
+
```
179
+
180
+
The source database defaults to `mysql2://root@localhost/wordnet`, but you can change it by setting the `SOURCE_URL` environment variable.
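
A minimal sketch of how such a default could be resolved is shown below. The `source_url` helper is hypothetical and the way the rake task actually reads the variable is an assumption; only the variable name and the default value come from the text above.

```ruby
require "uri"

# Hypothetical helper: returns SOURCE_URL if set, otherwise the documented default.
# The env parameter accepts ENV or any Hash, which keeps the helper easy to test.
def source_url(env = ENV)
  env.fetch("SOURCE_URL", "mysql2://root@localhost/wordnet")
end

# Splitting the default URL into the parts a database adapter would need.
source = URI.parse(source_url({}))
adapter = source.scheme
username = source.user
host = source.host
database = source.path.delete_prefix("/")
```

For example, `SOURCE_URL=mysql2://user@db.example/plwordnet bin/rake wordnet:import` would point the import at a different source; the host and database name in that command are placeholders.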
181
+
182
+
## Exporting to Neo4J index
183
+
184
+
Just as importer classes inherit from Importer, exporter classes inherit from Exporter. There are only four exporter classes:
185
+
186
+
* Neo4J::Sense
187
+
* Neo4J::Synset
188
+
* Neo4J::SenseRelation
189
+
* Neo4J::SynsetRelation
190
+
191
+
Each exporter is expected to define two methods:
192
+
193
+
* `export_index!`: ensures at the beginning of the export that the proper indexes are present in the Neo4J database
194
+
* `process_batch(entities)`: accepts an array of entity hashes, just like `process_entities!`, and returns an array of queries to be executed in a batch request by the [Neography][19] gem.
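
As a rough illustration of `process_batch`, the sketch below turns entity hashes into an array of batch operations. The `[:execute_query, cypher, params]` shape is an assumption about the batch format accepted by Neography and should be checked against the gem's documentation; the `Sense` label and the Cypher statement are likewise hypothetical. The sketch only builds the array and does not talk to Neo4J.

```ruby
# Hypothetical exporter sketch: one batch operation per entity.
# Sense nodes carry only their ID, as described in the Graph Database section.
def process_batch(entities)
  entities.map do |entity|
    [
      :execute_query,                    # assumed Neography batch operation name
      "MERGE (s:Sense { id: {id} })",    # hypothetical Cypher with an {id} parameter
      { id: entity["id"] }
    ]
  end
end

queries = process_batch([{ "id" => "sense-1" }, { "id" => "sense-2" }])
```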
195
+
196
+
The export process can be triggered by issuing the command:
197
+
198
+
```
199
+
bin/rake wordnet:export
200
+
```
201
+
202
+
The destination defaults to `http://127.0.0.1:7474`, but you can change it by setting the `NEO4J_URL` environment variable.
203
+
204
+
## Deployment
205
+
206
+
The application is meant to run on at least three servers:
207
+
208
+
1. Application server
209
+
2. PostgreSQL server
210
+
3. Neo4J server
211
+
212
+
The Rails application should be deployed on the application server, using any method. At minimum, Node.js, Ruby 2.0, and the PostgreSQL and MySQL development libraries must be installed on the system.
213
+
214
+
The address of the Neo4J database is passed through the `NEO4J_URL` environment variable, and the PostgreSQL connection is configured in `config/database.yml`.
215
+
216
+
The assets need to be precompiled before deploying the app to production:
217
+
218
+
```
219
+
RAILS_ENV=production bin/rake assets:precompile
220
+
```
221
+
222
+
The server can be started by hand with:
223
+
224
+
```
225
+
RAILS_ENV=production bin/rails server --port 80
226
+
```
227
+
228
+
Or with a tool of your choice (Capistrano or another).
229
+
25
230
## License
26
231
27
-
As Rails, this project is [MIT-licensed](http://opensource.org/licenses/mit-license.php). As usual, you are awesome.