This section describes the steps for processing the data, extracting features, and training our models for the profession and nationality relations.
- Indexing of Wikipedia sentences
wp_sentence_index.py
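The actual indexing logic lives in wp_sentence_index.py; as a rough sketch of the idea (the real script may differ), the index maps each Wikipedia page title to its list of sentences:

```python
def build_sentence_index(pages):
    """Map each page title to its list of sentences.

    Hypothetical sketch: `pages` maps title -> raw article text, and
    sentences are split naively on '. ' (the real script presumably
    uses a proper sentence tokenizer)."""
    index = {}
    for title, text in pages.items():
        index[title] = [s.strip() for s in text.split(". ") if s.strip()]
    return index
```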
- Extracting 1st Wikipedia sentences and paragraphs
extract_fst_wp_sentences.py
- For each <snippet> in (sentences, paragraphs):
<input: config_json/fst_wp_sentences_config.json>
<output: first_wp_sentences/persons_without_<snippet>.txt>
<output: first_wp_sentences/first_wp_<snippet>.tsv>
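Conceptually, extract_fst_wp_sentences.py pulls the first sentence or first paragraph of each article. A minimal sketch, assuming paragraphs are separated by blank lines and sentences split on '. ':

```python
def first_snippet(article_text, unit="sentences"):
    """Return the first sentence or first paragraph of an article.

    Simplified sketch; the real extraction is configured via
    config_json/fst_wp_sentences_config.json and may behave differently."""
    paragraphs = [p.strip() for p in article_text.split("\n\n") if p.strip()]
    if not paragraphs:
        return None  # this person would end up in persons_without_<snippet>.txt
    if unit == "paragraphs":
        return paragraphs[0]
    return paragraphs[0].split(". ")[0]
```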
- Getting the to-id mappings from persons and from items. The functions are parameterized and the code was developed to be used as a library, so command-line support is generally not provided; check the code documentation for details.
make_item_ids.py
functions:
make_item_ids.make_persons_fb_ids
<input: persons>
<output: persons_ids.tsv>
make_item_ids.make_relation_item_ids
<input: professions>
<output: professions_ids.tsv>
make_item_ids.make_relation_item_ids
<input: nationalities>
<output: nationalities_ids.tsv>
make_item_ids.make_professions_kb_translation
<input: persons_ids.tsv; professions_ids.tsv; professions.kb>
<output: profession_translations.kb>
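The role these functions play can be sketched as follows: assign a stable integer id to each person/item, then rewrite the KB rows in terms of those ids. This is a hypothetical illustration, not the actual implementation:

```python
def make_item_ids(items):
    """Assign a stable integer id to each distinct item, in sorted order
    (sketch of what make_item_ids.make_relation_item_ids produces)."""
    return {item: i for i, item in enumerate(sorted(set(items)))}

def translate_kb(kb_rows, person_ids, item_ids):
    """Rewrite (person, item) KB rows using the id mappings
    (sketch of make_item_ids.make_professions_kb_translation)."""
    return [(person_ids[p], item_ids[v]) for p, v in kb_rows]
```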
- Generating tfidf weights
prof_stats.py
<output:tf_idf.tsv>
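As an illustration of the kind of weights written to tf_idf.tsv (the exact weighting scheme in prof_stats.py may differ), here is a plain tf-idf computation with raw term counts and idf = log(N / df):

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute tf-idf weights per (doc index, term).

    Sketch only: tf is the raw in-document count and idf = log(N / df),
    with no smoothing."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # document frequency: one count per document
    weights = []
    for i, doc in enumerate(documents):
        for term, count in Counter(doc).items():
            weights.append((i, term, count * math.log(n / df[term])))
    return weights
```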
- Generating the term-statistics features sumProfTerms and simCos
feat_termstats.py
<input: tf_idf.tsv>
<output: profession_translations.kb; features_termstats_prof.tsv>
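The two features can be understood roughly as follows (a sketch under assumed semantics, not the exact code of feat_termstats.py): sumProfTerms sums the tf-idf weights of profession terms found in the person's text, and simCos is the cosine similarity between two sparse term-weight vectors.

```python
import math

def sum_prof_terms(text_terms, profession_terms, weights):
    """sumProfTerms (sketch): sum of tf-idf weights of profession terms
    that occur in the person's text."""
    return sum(weights.get(t, 0.0) for t in text_terms if t in profession_terms)

def sim_cos(vec_a, vec_b):
    """simCos (sketch): cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in set(vec_a) & set(vec_b))
    na = math.sqrt(sum(v * v for v in vec_a.values()))
    nb = math.sqrt(sum(v * v for v in vec_b.values()))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)
```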
- Generating features: isProfWPSent, isProfWPPar, isFirstProfWPSent, and isFirstProfWPPar
feat_fst_wp_sentences.py
- For each <snippet> in (sentences, paragraphs):
<input: professions; first_wp_sentences/first_wp_<snippet>.tsv; professions.kb; "profession">
<output: is_profession_in_first_wp_<snippet>.tsv>
<output: is_1st_profession_in_first_wp_<snippet>.tsv>
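These are binary membership features. As a sketch (the matching in feat_fst_wp_sentences.py may be more elaborate, e.g. lemmatized): the feature is 1 if one of the person's professions is mentioned in the first sentence/paragraph, and the is_1st_* variants check only the first-listed profession.

```python
def is_prof_in_snippet(snippet, professions, first_only=False):
    """Sketch of isProfWPSent/isProfWPPar: 1 if any of the person's
    professions occurs in the snippet, else 0. With first_only=True only
    the first-listed profession is checked (the is_1st_* variants).
    Hypothetical semantics; matching here is plain lowercase substring."""
    text = snippet.lower()
    candidates = professions[:1] if first_only else professions
    return int(any(p.lower() in text for p in candidates))
```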
- Generating simCosW2VPar feature
feat_w2v_sim_approx.py
<input: profession_translations.kb>
<output: profession-approx-w2v_aggr_cos_sim.tsv>
- Generating simCosW2V feature
feat_w2v_sim.py
<input: profession_translations.kb>
<output: profession_w2v_aggr_cos_sim.tsv>
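Both w2v features follow the same idea: aggregate word vectors (here by averaging) on each side, then take the cosine between the aggregates. The sketch below uses plain lists instead of a real word2vec model, so it only illustrates the aggregation-then-cosine step:

```python
import math

def aggregate(vectors):
    """Average a list of equal-length word vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cos(a, b):
    """Cosine similarity of two dense vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def sim_cos_w2v(person_word_vecs, profession_word_vecs):
    """simCosW2V (sketch): cosine between the averaged word vectors of the
    person's text and of the profession label. The *Par/approx variant is
    assumed to do the same over paragraph text."""
    return cos(aggregate(person_word_vecs), aggregate(profession_word_vecs))
```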
- Generating person-nationality input files
make_fre_input.py
<input: nationalities_ids.tsv; nationality_adjectives.tsv; persons_ids.tsv; nationality.kb; nationality_translations.kb>
<output: nationality_translations2.kb>
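One detail specific to nationalities is that mentions often appear as adjectives. A hypothetical sketch of the role played by nationality_adjectives.tsv (assumed to hold adjective/noun pairs such as "French"/"France"):

```python
def load_nationality_adjectives(rows):
    """Build an adjective -> nationality-noun map from (adjective, noun)
    pairs. Hypothetical sketch of how nationality_adjectives.tsv is used."""
    return {adj.lower(): noun for adj, noun in rows}

def translate_nationality(mention, adj_map):
    """Normalize an adjectival mention ('French') to its nationality noun;
    mentions not in the map are returned unchanged."""
    return adj_map.get(mention.lower(), mention)
```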
Generating features of isNatWPSent, isNatWPPar, isFirstNatWPSent, and isFirstNatWPPar
-
`feat_fst_up_sentences_nationality.py
<input:nationality_adjectives.tsv;
nationality_translations2.kb;
first_wp_sentences.txt;
first_wp_paragraphs.txt>
<output:is_1st_nationality_in_first_wp_paragraphs_Adj.tsv;
is_1st_nationality_in_first_wp_paragraphs_Noun.tsv;
is_1st_nationality_in_first_wp_sentences_Adj.tsv;
is_1st_nationality_in_first_wp_sentences_Noun.tsv;
is_nationality_in_first_wp_paragraphs_Adj.tsv;
is_nationality_in_first_wp_paragraphs_Noun.tsv;
is_nationality_in_first_wp_sentences_Adj.tsv;
is_nationality_in_first_wp_sentences_Noun.tsv;>
- Generating the freqPerNat feature
feat_freq.py
<input: nationality_translations.kb>
<output: nat_features_freq_Noun2.tsv; nat_features_freq_Adj2.tsv>
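The intuition behind a frequency feature like freqPerNat is how common each nationality is across the KB. A sketch under assumed semantics (the exact normalization in feat_freq.py may differ):

```python
from collections import Counter

def freq_per_nat(kb_rows):
    """freqPerNat (sketch): for each (person, nationality) KB row, the
    relative frequency of that nationality over all rows."""
    counts = Counter(nat for _, nat in kb_rows)
    total = len(kb_rows)
    return {(p, n): counts[n] / total for p, n in kb_rows}
```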
- Aggregating all features
feat_agg.py [-t]
(use -t for the training dataset)
<output: RELATION_train.json; RELATION_all.json>
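Aggregation amounts to joining the per-feature tables on the (person, item) key into one record per pair, attaching labels when building the training set. A hypothetical sketch of that join (feat_agg.py may structure its output differently):

```python
import json

def aggregate_features(feature_tables, training=False, labels=None):
    """Join per-feature {(person, item): value} dicts into one record per
    pair; missing values default to 0.0. With training=True the known
    label is attached (the role of the -t flag). Sketch only."""
    keys = set()
    for table in feature_tables.values():
        keys.update(table)
    records = []
    for key in sorted(keys):
        rec = {"person": key[0], "item": key[1]}
        for name, table in feature_tables.items():
            rec[name] = table.get(key, 0.0)
        if training and labels is not None:
            rec["label"] = labels[key]
        records.append(rec)
    return json.dumps(records)
```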
- Training/test/cross-validation
uis_software -i <input data> -o <output/path>
- Main software entry point
uis_software.py -i <input> -o <output>