Server Add Languages
PxStat may be implemented in any number of languages. Additionally, it may be used in bilingual or multi-lingual settings. In a multi-lingual setup, data, products and subjects may be represented in all the chosen languages simultaneously. Multi-lingual implementations affect the following areas of the API application:
- Search optimisation.
- Returning data where it’s available in a preferred language.
- Viewing Products and Services in other languages.
Please note that this document applies only to the APIs; front-end language support is a separate topic. In this document, the term “[iso]” is used as a neutral placeholder for the ISO code of whatever language is in question. So, for example, if you are dealing with the German language, you should replace “[iso]” with “de”. A full list of ISO codes can be found at https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. Use the two-letter (ISO 639-1) version of the code.
This section lists the entities whose data can depend on the language code supplied in the API call. The general approach described in this document applies to these entities.
- Cube.
- Matrix.
- Collection.
- Subject.
- Product.
- Logging.
- Alert.
- Navigation.
- Keywords.
A PxStat implementation must have a default language. This sets the following rules for the implementation:
- All languages must be set up in Settings/Language. Typically this is done via the front end; the corresponding API method is PxStat.System.Settings.Language_API.Create.
- The default language is set in the APP_DEFAULT_LANGUAGE setting in Static.config.
- Any dataset must be represented in the default language at minimum. The same data may also be represented in other languages if necessary.
- Subjects and Products must be represented in the default language but may also be represented as translations in other languages.
An unlimited number of secondary languages can be supported in the API along with the default language. This will require some additional configuration and some language specific development. This additional development is described further on in this document.
In PX files, there can be multiple language versions of the metadata for the same data. The first language in the file, which typically does not have a language label on its metadata tags, is known as the Top Language. This may or may not be the same as the system’s Default Language.
The general approach to languages in the API can be summarised as “Give me the data in my preferred language, but if it’s not available in my preferred language then give me the data in the default language”. This applies to searches, reads, etc.
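This fallback rule can be sketched as follows. This is an illustrative Python sketch, not the PxStat implementation; the function name, store structure, and matrix code are hypothetical:

```python
def read_metadata(store: dict, matrix: str, preferred_lng: str, default_lng: str):
    """Return metadata in the preferred language, falling back to the default."""
    versions = store.get(matrix, {})
    if preferred_lng in versions:
        return versions[preferred_lng]
    # Not available in the preferred language: fall back to the default language
    return versions.get(default_lng)

# Hypothetical store: one matrix held in English (the default) and Irish
store = {"TM01": {"en": "Population by age", "ga": "Daonra de réir aoise"}}

print(read_metadata(store, "TM01", "ga", "en"))  # Irish version is available
print(read_metadata(store, "TM01", "de", "en"))  # German missing: English returned
```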
Singularization means that all variations of a word are represented as a single word when we are creating keywords or generating them for a search. In English, for example, this typically means representing a noun in its singular form.
In PxStat, searching is optimised by applying keywords to the following entities:
- Release.
- Subject.
- Product.
Keywords are used to optimise the search process. A set of keywords is associated with a particular matrix, and finding those keywords in a search means that the referenced matrices are returned. Mandatory keywords are created when these entities are created (and deleted when the corresponding entities are deleted). The process is summarised as follows:
- Obtain the individual words from the entity and related entities.
- Remove unwanted characters. This is controlled by the regex in the keyword_[iso].json file.
- Remove unwanted words as listed in the keyword_[iso].json file. Typically, these are articles, prepositions etc.
- Remove any duplicates.
- Singularize the keywords. This means that all variations of a word are represented in one standard form. So, for example, “children” is represented as “child”, etc.
- Store the keywords against the entity.
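The steps above can be sketched as a simple pipeline. This is an illustrative Python sketch, not the PxStat implementation; the regex, stop-word set, and singularisation lookup stand in for the contents of the keyword_[iso].json file:

```python
import re

# Stand-ins for the keyword_[iso].json configuration
EXCLUDED_REGEX = r"[^\w\s]"          # regex for removing unwanted characters
EXCLUDED_WORDS = {"the", "a", "of"}  # articles, prepositions, etc.
SINGULAR = {"children": "child", "persons": "person"}  # singularisation lookup

def generate_keywords(text: str) -> set:
    # Obtain the individual words and remove unwanted characters
    words = re.sub(EXCLUDED_REGEX, " ", text.lower()).split()
    # Remove unwanted words
    words = [w for w in words if w not in EXCLUDED_WORDS]
    # Singularise; storing in a set removes duplicates at the same time
    return {SINGULAR.get(w, w) for w in words}

print(generate_keywords("The children of the persons!"))  # {'child', 'person'}
```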
Users may also create optional keywords. This occurs where there is a logical connection between words, but it is not possible for an automated process or a config file to represent this. For example, we might like to ensure a dataset containing “wheat” will be found with the search term “agriculture”. In this case the user can create an optional keyword via the front-end. When creating an optional keyword, the user may flag it as “Acronym”. This means that it will not be singularised.
The search process is summarised as follows:
- User enters a search term as a list of space-separated words.
- API breaks this into a list of individual words.
- API cleanses unwanted characters from the words. This is controlled by the regex in the keyword_[iso].json file.
- API removes any forbidden words (e.g. prepositions, articles) from the list. These words are listed in the keyword_[iso].json file.
- API searches for synonyms for each of the words in the search list.
- API adds the singularized version of each of the words to the search list.
- Search is run and the found data is given a score. The returned data is ordered by its relevance score.
The process for running the search is:
- Any dataset to be returned from a search must hold all the search words (apart from those removed by cleansing).
- In the case of synonyms, either the original word or the synonym is sufficient for a match.
- In the case of keywords that are flagged as acronyms, the existence of the keyword is optional.
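Under one reading of these rules, the matching logic can be sketched as follows. This is an illustrative Python sketch with hypothetical names, not the PxStat implementation:

```python
def matches(search_words, keywords, synonyms, acronyms):
    """A dataset matches only if every search word is satisfied.

    search_words: cleansed words entered by the user
    keywords: keywords stored against the dataset
    synonyms: map of word -> lemma (either form counts as a match)
    acronyms: keywords flagged as acronyms; their presence is optional
    """
    for word in search_words:
        candidates = {word, synonyms.get(word, word)}
        if candidates & keywords:
            continue  # this search word is satisfied directly or via a synonym
        if word in acronyms:
            continue  # acronym keywords are optional
        return False  # an ordinary search word was not found
    return True

keywords = {"person", "population"}
synonyms = {"individual": "person"}
print(matches(["individual", "population"], keywords, synonyms, set()))  # True
print(matches(["individual", "wheat"], keywords, synonyms, set()))       # False
```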
Each language in the system must have a configuration file. This is used to internationalise log messages, errors, emails, etc.
Each language will have an [iso].json file, where [iso] is the ISO language code. So, for example, if a system is to include English and Irish, there must be an en.json and a ga.json file. These are located in \Resources\Internationalisation. If you are implementing a specific language, please copy the default language version and supply translations for each item in your new file.
Keywords are important from the point of view of searching. However, for nouns in particular, there may be considerable variation in how a word is spelt, depending on context, singular/plural etc. A keyword_[iso].json file is used for each language in order to store settings for keyword variation.
Additionally, it will be necessary to develop a keyword_[iso] class for each language. This conforms to a specific interface and will apply the word changing rules for the language.
Searches should also be sensitive to synonyms, that is, words that effectively mean the same thing. For example, a search for “Car” should be equivalent to a search for “Vehicle”. For this reason, there is a synonym list for each language, i.e. synonym_[iso].json.
Some languages are more highly inflected than others. So, for example, in English, nouns typically change only between singular and plural forms. This means that we can use a Singularize DLL for switching between versions of words. However, the more inflected languages may need a dictionary of the various inflections. The different versions of these words will be held in dictionary_[iso].json.
The keyword_[iso].json files, dictionary_[iso].json and the keyword_[iso].cs files will be held in the \Resources\Internationalisation\keyword\ folder.
These instructions must be followed for any *.json file that is to be added to the application.
- Insert the file in the appropriate location.
- Find the Resources visual designer in Visual Studio under the Properties of the project in Solution Explorer. Right-click on Resource.resx and select “View Designer”.
- On the Designer, click on the “Add Resource” button and add the new file.
- Right-click on Resource.resx in Solution Explorer and select “View Code”.
- Ensure that the type of the new item is System.String. Change it if necessary.
- Go to Resources.Designer.cs. Ensure that the property declaration of the new item (a) returns a string and (b) uses the ResourceManager.GetString() method.
You must complete the following steps to implement a new language.
- Add the new language to the system (this can be done via the front-end).
- Create a translated set of system messages in that language. This is stored as \Resources\Internationalisation\[iso].json.
- Create a dictionary of synonyms for that language. This is stored as \Resources\Internationalisation\synonym_[iso].json.
- Create a json configuration to store additional information about keywords. This is stored as \Resources\Internationalisation\keyword\keyword_[iso].json.
- Create a dictionary of word equivalence for singularization (i.e. a morphology database), stored as dictionary_[iso].json in \Resources\Internationalisation. This will be used in cases where singularization needs to search for equivalents. Where another singularization method is used (e.g. in the case of English), you must still create an empty version of this file.
- Write a class called Keyword_[iso]. This must implement the IKeywordExtractor interface. While you must implement the functions of the interface, the workings of each function must depend on the rules of the specific language.
This is a translation of the general system messages for the application. These may either be returned to the user or recorded in logging. To create a version in the new language, save the existing English language version (en.json) as [iso].json. Then translate the values (not the keys!) to the new language.
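For example, a fragment of en.json and its translated German counterpart might look like this. The key shown is illustrative, not an actual PxStat message key; note that only the value is translated, never the key.

en.json (fragment):

```json
{ "error.dataset-not-found": "The requested dataset was not found." }
```

de.json (fragment):

```json
{ "error.dataset-not-found": "Der angeforderte Datensatz wurde nicht gefunden." }
```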
This tells us how individual words are treated both from the point of view of creating search keywords and matching search words to them. The entries are described below:
- Excluded – Excluded words. These will not become keywords and are removed from search terms. They are broken into suggested categories outlined below.
Excluded:
- article – a list of definite and indefinite articles.
- preposition – list of prepositions.
- interrogative – list of interrogative terms.
- miscellaneous – any other word or words that must be excluded.
Additionally, there is a regex entry in this category.
- regex – regex for removing unwanted characters.
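Putting the entries above together, a minimal keyword_[iso].json for English might look like this. The word lists are abbreviated examples and the regex is illustrative; consult the shipped English file for the exact structure.

```json
{
  "excluded": {
    "article": ["a", "an", "the"],
    "preposition": ["of", "in", "on", "by"],
    "interrogative": ["who", "what", "where", "when"],
    "miscellaneous": ["etc"],
    "regex": "[^a-zA-Z0-9 ]"
  }
}
```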
Where singularization is done via a lookup process, this file holds lookup pairs of words. Some Irish language example entries are shown below:

```json
{"úlloird":"úllord","úllord":"úllord"}
```

In the case above, we are saying that a search for either "úlloird" (orchards) or "úllord" (orchard) will result in only "úllord" being searched. Where singularization is done by other means (e.g. a DLL or a rules-based system), you must still create this file; it will simply contain no entries.
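Lookup-based singularization using the example pairs above can be sketched as follows (an illustrative Python sketch; the PxStat implementation lives in the Keyword_[iso] class):

```python
# Lookup pairs as held in dictionary_ga.json:
# every inflected form maps to one standard form
morphology = {"úlloird": "úllord", "úllord": "úllord"}

def singularize(word: str) -> str:
    # Unknown words fall through unchanged
    return morphology.get(word, word)

print(singularize("úlloird"))  # úllord
print(singularize("úllord"))   # úllord
```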
This is a list of synonyms. They are shown as a list of match:lemma pairs:

```json
[
  {"match":"individual","lemma":"person"},
  {"match":"mortal","lemma":"person"},
  {"match":"person","lemma":"person"},
  {"match":"somebody","lemma":"person"},
  {"match":"someone","lemma":"person"},
  {"match":"altogether","lemma":"whole"},
  {"match":"completely","lemma":"whole"},
  {"match":"entirely","lemma":"whole"}
]
```
In the example above, the first five entries will resolve to “person” while the last three will resolve to “whole”. To construct a list of synonyms for a language, a good approach is to use the appropriate WordNet. This process is described later in this document.
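Resolving search words through the match:lemma list above can be sketched as follows (an illustrative Python sketch, not the PxStat implementation):

```python
synonyms = [
    {"match": "individual", "lemma": "person"},
    {"match": "mortal", "lemma": "person"},
    {"match": "somebody", "lemma": "person"},
    {"match": "altogether", "lemma": "whole"},
]

# Build a direct lookup from the match:lemma pairs
lookup = {s["match"]: s["lemma"] for s in synonyms}

def resolve(word: str) -> str:
    # Words with no synonym entry resolve to themselves
    return lookup.get(word, word)

print(resolve("individual"))  # person
print(resolve("altogether"))  # whole
print(resolve("wheat"))       # wheat
```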
You must write the language-specific code to enable keywords to function in the target language. Replace 'xx' with the ISO code of the language you are adding.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

namespace PxStat.Resources
{
    public class Keyword_xx : IKeywordExtractor
    {
        readonly Keyword keyword;
        readonly Dictionary<string, string> nounDictionary;

        public List<Synonym> SynonymList { get; set; }
        public string LngIsoCode { get; set; }
        public CultureInfo cultureInfo { get; set; }

        public Keyword_xx()
        {
            LngIsoCode = "xx"; // insert the language ISO code here; you may wish to use a config item
            keyword = new Keyword(LngIsoCode);
            nounDictionary = Keyword_BSO_ResourceFactory.GetMorphology(LngIsoCode);
            this.SynonymList = Keyword_BSO_ResourceFactory.GetSynonyms(LngIsoCode);
        }

        public string Sanitize(string words)
        {
            // Remove unwanted characters, as defined by the regex in keyword_[iso].json
            return Regex.Replace(words, keyword.Get("excluded.regex"), " ");
        }

        public List<string> ExtractSplit(string readString)
        {
            return readString.Split(' ').ToList();
        }

        public bool IsDoNotAmend(string word)
        {
            return false;
        }

        public string Pluralize(string word)
        {
            throw new NotImplementedException();
        }

        public bool IsPlural(string word)
        {
            throw new NotImplementedException();
        }

        public bool IsSingular(string word)
        {
            throw new NotImplementedException();
        }

        public string Singularize(string word)
        {
            // Return a standard version of a given word for all variations on that word,
            // including plurals and other inflections. The dictionary lookup below is one
            // possible approach; replace it with whatever suits the rules of your language.
            return nounDictionary.TryGetValue(word, out string singular) ? singular : word;
        }
    }
}
```
At minimum, all you need to do is complete the Singularize method and insert the language ISO code where it is required in the constructor. You are free to create private methods, etc., as you see fit.
Wordnet is described at https://wordnet.princeton.edu/. It is a set of tables that outline relationships among words. These relationships include synonyms, hypernyms, meronyms, etc. Many languages have their own Wordnet database. A comprehensive list is found at http://globalwordnet.org/resources/wordnets-in-the-world/. Note that while many sets are free of charge, others are licensed and require payment. Wordnets usually consist of XML datasets. They may also be published as databases, typically SQLite or MySQL.
This assumes that we are (a) only interested in nouns, (b) accepting of hypernyms and hyponyms as synonyms and (c) not interested in multi-word phrases. If this is not the case, then amend the instructions accordingly.
- Find the Lemma table. This will be the main word list.
- Get a list of nouns by filtering this to entries where partOfSpeech == "n".
- For each noun, get a list of Sense table entries that have the same value of LexicalEntry id.
- Get the other entries in the Sense table that have the same “synset” entry.
- For each of these entries, link back to the Lemma table via LexicalEntry id.
- Get the “Lemma” as the original noun and the “Match” as the newly found entry.
- Run through the entire dataset like this until we have entries for as many words as possible.
- Remove any duplicates.
- Output this dataset to a json file as described above.
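The walk described above can be sketched over toy tables as follows. This is an illustrative Python sketch with invented sample data; the table and column names loosely follow the Lemma/Sense layout described above:

```python
# Toy WordNet-style tables
lemmas = [
    {"id": "w1", "writtenForm": "person", "partOfSpeech": "n"},
    {"id": "w2", "writtenForm": "individual", "partOfSpeech": "n"},
    {"id": "w3", "writtenForm": "run", "partOfSpeech": "v"},  # filtered out: not a noun
]
senses = [
    {"lexicalEntry": "w1", "synset": "s100"},
    {"lexicalEntry": "w2", "synset": "s100"},
]

def extract_pairs(lemmas, senses):
    # Step 1: nouns only
    nouns = {l["id"]: l["writtenForm"] for l in lemmas if l["partOfSpeech"] == "n"}
    # Step 2: group sense entries sharing the same synset
    synset_of = {}
    for s in senses:
        synset_of.setdefault(s["synset"], []).append(s["lexicalEntry"])
    # Step 3: link members of each synset back to the lemma table; dedupe via a set
    pairs = set()
    for members in synset_of.values():
        for a in members:
            for b in members:
                if a != b and a in nouns and b in nouns:
                    pairs.add((nouns[b], nouns[a]))  # (match, lemma)
    return pairs

print(extract_pairs(lemmas, senses))
```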
An English language Wordnet is found at https://wordnet.princeton.edu/. You have a number of options regarding the physical representation of this data. A useful strategy for users with database experience is to download an SQLite database version of the data. You may do this at https://sourceforge.net/p/wnsql/activity/?page=0&limit=100#572012f15fcbc95c29dadd6d. If you can't find it there, do a Google search for "sqlite-31_snapshot.db". Other versions may be released from time to time.
If you don't have (or don't want to have) an SQLite data browser, you may use an online version. An example can be found at https://sqliteonline.com/.
We assume that:
- We are interested not just in the synonyms but in all logical relationships between words.
- We are not interested in phrases or compound words. Hyphenated words are allowed.
If we need to reduce the types of relationships, this can be done by specifying them in the join conditions when querying the LINKTYPES table.
For a full set, run the following query:
```sql
SELECT DISTINCT w1.lemma, w2.lemma AS [Match]
FROM lexlinks ll
INNER JOIN words w1 ON ll.word1id = w1.wordid
INNER JOIN words w2 ON ll.word2id = w2.wordid
INNER JOIN linktypes lt ON ll.linkid = lt.linkid
WHERE w1.lemma NOT LIKE '% %'
  AND w2.lemma NOT LIKE '% %'
UNION
SELECT DISTINCT d1.lemma AS [match], d2.lemma AS lemma
FROM words d1
INNER JOIN senses s1 ON d1.wordid = s1.wordid
INNER JOIN senses s2 ON s1.synsetid = s2.synsetid
INNER JOIN words d2 ON s2.wordid = d2.wordid
WHERE d2.lemma NOT LIKE '% %'
  AND d1.lemma NOT LIKE '% %'
UNION
SELECT DISTINCT d2.lemma AS [match], d1.lemma AS lemma
FROM words d1
INNER JOIN senses s1 ON d1.wordid = s1.wordid
INNER JOIN senses s2 ON s1.synsetid = s2.synsetid
INNER JOIN words d2 ON s2.wordid = d2.wordid
WHERE d2.lemma NOT LIKE '% %'
  AND d1.lemma NOT LIKE '% %'
```
Once this query has run, you have enough information to create a synonym_en.json file. If you are using https://sqliteonline.com/, you can do this easily by running the query and then selecting the "Export" > "JSON" option.
Once the file has been created, place it in the PxStat\Resources\Internationalisation\keyword folder.
PxStat can store and use any character as UTF-8. This means that all characters and alphabets are permissible in the system. No further configuration of the system is required for this.