#1527 adds a script to convert ElectraForPretrain parameters to the ElectraModel format. The SQuAD script doesn't allow specifying the generator dimension and layer scaling factors, so the checkpoints still can't be loaded there.
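For readers unfamiliar with what such a conversion involves, the sketch below shows the general idea: keep only the discriminator-backbone parameters from the pretraining checkpoint and strip their prefix so the names match a bare ElectraModel. This is not the actual #1527 script; the `backbone_model.` prefix and the flat `mx.nd.save` checkpoint layout are assumptions made for illustration.

```python
# Hypothetical sketch of converting an ElectraForPretrain checkpoint into
# parameters loadable by ElectraModel. The "backbone_model." prefix and the
# flat mx.nd.save layout are assumptions, not the layout used by #1527.
import mxnet as mx

def convert_pretrain_to_backbone(src_path, dst_path, prefix='backbone_model.'):
    params = mx.nd.load(src_path)          # {parameter name: NDArray}
    backbone = {}
    for name, value in params.items():
        if name.startswith(prefix):
            # Keep only the discriminator backbone, with the prefix stripped,
            # so the names line up with a standalone ElectraModel.
            backbone[name[len(prefix):]] = value
        # Generator and RTD-head parameters are dropped.
    mx.nd.save(dst_path, backbone)

convert_pretrain_to_backbone('electra_pretrain.params', 'electra_backbone.params')
```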
Labels: bug (Something isn't working), enhancement (New feature or request)
Description
As part of #1413, I was running the ELECTRA-base model and found several issues along the way.
- dataloader KeyError (error message)
- The pre-training script can't resume from the last checkpoint (Pre-training scripts should allow resuming from checkpoints #1526); a rough sketch of such resume logic follows this list
- SQuAD parameter loading (error message)
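The resume support #1526 asks for usually amounts to persisting the model parameters, the trainer (optimizer) states, and the step counter together, then picking up the newest checkpoint at startup. Below is a minimal sketch assuming a Gluon model and Trainer; the checkpoint directory layout and the `step-N` file naming are assumptions, not gluon-nlp's actual format.

```python
# Rough sketch of checkpoint save/resume for a Gluon training loop; the
# directory layout and "step-N" naming are assumptions for illustration.
import os
import glob

def save_checkpoint(ckpt_dir, step, model, trainer):
    # With Horovod one would typically call this from rank 0 only.
    os.makedirs(ckpt_dir, exist_ok=True)
    model.save_parameters(os.path.join(ckpt_dir, f'step-{step}.params'))
    trainer.save_states(os.path.join(ckpt_dir, f'step-{step}.states'))

def load_latest_checkpoint(ckpt_dir, model, trainer):
    """Return the step to resume from (0 if no checkpoint exists)."""
    ckpts = glob.glob(os.path.join(ckpt_dir, 'step-*.params'))
    if not ckpts:
        return 0
    step_of = lambda p: int(p.split('step-')[-1].split('.')[0])
    latest = max(ckpts, key=step_of)
    model.load_parameters(latest)
    trainer.load_states(latest.replace('.params', '.states'))
    return step_of(latest)
```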
To Reproduce
Follow the steps in https://github.com/dmlc/gluon-nlp/blob/09f343564e4f735df52e212df87ca073a824e829/scripts/pretraining/README.md. See below for the exact commands I used.
Steps to reproduce
Environment
I ran both scripts on a p4dn.24xlarge instance with an environment bootstrapped by this CloudFormation template. Details on some important dependencies:
Horovod 0.21.1, installed with:
HOROVOD_WITH_MXNET=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_GLOO=1 python3 -m pip install --no-cache-dir horovod
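If the Horovod build is in question, running `horovodrun --check-build` should list which frameworks (e.g. MXNet) and collective backends (e.g. NCCL, Gloo) were compiled in, which is a quick way to confirm the flags above took effect.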