
# Train the L2L model

First get `COCA.zip` from our local DARPA-ARC Teams Files area. Users not at NPS will have to request it from the COCA developer. Save it to `$CORPORA/COCA.zip` and unzip it. We need to generate word co-occurrence data from COCA. Run

```
java -Xmx40g -classpath $ONTOLOGYPORTAL_GIT/sigmanlp/build/sigmanlp.jar:$ONTOLOGYPORTAL_GIT/sigmanlp/build/lib/* com.articulate.nlp.corpora.COCA -a
```

This will save the files `nouns.txt` and `verbs.txt` in `$CORPORA/COCA`.

Then generate language-logic pairs with GenSimpTestData; the final argument is the number of pairs. For 10M pairs, run

```
java -Xmx40g -classpath $ONTOLOGYPORTAL_GIT/sigmanlp/build/sigmanlp.jar:$ONTOLOGYPORTAL_GIT/sigmanlp/build/lib/* com.articulate.nlp.GenSimpTestData -s out 10000000
```

which will generate `out-eng.txt` and `out-log.txt`. Then generate language-logic pairs from all SUMO relations and all SUMO axioms with:

```
java -Xmx14g -classpath $ONTOLOGYPORTAL_GIT/sigmanlp/lib/*:$ONTOLOGYPORTAL_GIT/sigmanlp/build/classes com.articulate.nlp.GenSimpTestData -a allAxioms
java -Xmx60g -classpath $ONTOLOGYPORTAL_GIT/sigmanlp/lib/*:$ONTOLOGYPORTAL_GIT/sigmanlp/build/classes com.articulate.nlp.GenSimpTestData -g groundRelations
```

Concatenate these files together:

```
cat allAxioms-eng.txt groundRelations-eng.txt out-eng.txt > combined-eng.txt
cat allAxioms-log.txt groundRelations-log.txt out-log.txt > combined-log.txt
```
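Each English line must stay aligned with the logical form on the same line of the other file, so a quick line-count check after concatenation can catch a truncated or interrupted generation run. This is a hypothetical helper, not part of the repo:

```python
# Hypothetical sanity check: the -eng and -log files are assumed to be
# line-aligned, so the combined files must have equal line counts.
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

eng = count_lines("combined-eng.txt")
log = count_lines("combined-log.txt")
print(f"eng: {eng} lines, log: {log} lines")
assert eng == log, "combined files are misaligned"
```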

## Execution time

On a fast 16-core laptop:

- 10,000 pairs takes 1.5 minutes
- 100,000 pairs takes 12.5 minutes
- 1,000,000 pairs takes 122 minutes

Then tokenize the data by calling `sumonlp/src/l2l/train/tokenize_data.py` with

```python
tokenize_data('combined-eng.txt', 'combined-log.txt', 'tokenized_data.json')
```

This line appears in train.py, commented out; uncomment it and change the filenames to the combined files generated above.
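The exact behavior is defined in `tokenize_data.py`; the following is only a minimal sketch of the pairing step, assuming the two input files are line-aligned and the output is a JSON list of input/output pairs (the real script may also apply a subword tokenizer):

```python
import json

def tokenize_data(eng_path, log_path, out_path):
    # Sketch only: pair each English sentence with its logical form
    # and dump the pairs as JSON.
    with open(eng_path, encoding="utf-8") as f:
        sentences = [line.rstrip("\n") for line in f]
    with open(log_path, encoding="utf-8") as f:
        logical = [line.rstrip("\n") for line in f]
    assert len(sentences) == len(logical), "files must be line-aligned"
    pairs = [{"input": s, "output": t} for s, t in zip(sentences, logical)]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(pairs, f)
```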

Put these files in a `data` subdirectory below the directory from which you run train.py, renaming `combined-eng.txt` to `data/input_sentences.txt` and `combined-log.txt` to `data/output_logical.txt`.
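A small helper (hypothetical, with filenames taken from the expected layout above) can set up that directory:

```python
from pathlib import Path
import shutil

# Hypothetical setup script: copy the combined files into the layout
# that train.py expects under data/.
data = Path("data")
data.mkdir(exist_ok=True)
shutil.copy("combined-eng.txt", data / "input_sentences.txt")
shutil.copy("combined-log.txt", data / "output_logical.txt")
```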

A CUDA out-of-memory error meant we changed line 35 of train.py to batch_size=16 instead of 32.
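Rather than editing the source each time memory runs short, the batch size could be read from the environment. This sketch assumes train.py builds a standard PyTorch DataLoader; the dataset and the `L2L_BATCH_SIZE` variable name are illustrative stand-ins, not existing code:

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real tokenized dataset built in train.py.
train_dataset = TensorDataset(torch.zeros(64, 8, dtype=torch.long))

# Default down from 32 to 16 to avoid CUDA out-of-memory errors;
# override with the (hypothetical) L2L_BATCH_SIZE environment variable.
batch_size = int(os.environ.get("L2L_BATCH_SIZE", "16"))
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
```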

Training on 1M pairs for 3 epochs took 10 hours and reached an average loss of 0.0085.

The result is a 242 MB model file.

The model is saved under the same directory as the input/output pairs and tokens, as `full_12m_sentences/t5_model_3_epochs`. We should consider changing the directory and file names, for example making the hard-coded name "full_12m_sentences" a parameter in model.py.
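As a sketch of that change, the hard-coded names could become command-line arguments; the flag names here are illustrative, not existing options in model.py:

```python
import argparse

# Sketch only: replace the hard-coded "full_12m_sentences" directory
# and model name in model.py with command-line parameters.
parser = argparse.ArgumentParser()
parser.add_argument("--model-dir", default="full_12m_sentences",
                    help="directory for saved models")
parser.add_argument("--model-name", default="t5_model_3_epochs",
                    help="name of the saved model")
args = parser.parse_args()

save_path = f"{args.model_dir}/{args.model_name}"
print(f"model will be saved to {save_path}")
```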