
Data Commons JSON upload error #16

Open
egrace479 opened this issue May 2, 2023 · 7 comments
@egrace479 (Member)

I attempted to upload a 4.45 GB zip file to Data Commons using
dva upload <zipfile> <doi>
After about an hour of the terminal showing Uploading <zipfile>, it printed the following error:

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/bin/dva", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/dva/cli.py", line 93, in main
    cli()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/dva/cli.py", line 74, in upload
    api.upload_file(doi, path)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/dva/api.py", line 65, in upload_file
    status = resp.json()["status"]
             ^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
@johnbradley (Collaborator) commented May 4, 2023

The JSON decode error is likely due to the website returning an error response that is not JSON, perhaps HTML.
When trying to reproduce, I received an OverflowError: string longer than 2147483647 bytes error that already has an upstream issue: gdcc/pyDataverse#137.
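
A minimal sketch (not the actual dva code) of how the upload_file call could guard against a non-JSON error body before parsing; only resp.json()["status"] is taken from the traceback above, everything else is a hypothetical helper:

import requests

def check_upload_response(resp: requests.Response) -> str:
    """Hypothetical helper: surface HTTP errors instead of blindly parsing JSON."""
    # Raise a clear error for non-2xx responses (e.g. a 500 with an HTML body).
    resp.raise_for_status()
    try:
        return resp.json()["status"]
    except requests.exceptions.JSONDecodeError:
        # The server answered 2xx but with a non-JSON body; include a snippet for debugging.
        raise RuntimeError(f"Expected JSON from {resp.url}, got: {resp.text[:200]!r}")

That would at least report the server's status code and body instead of the bare "Expecting value" traceback.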

I also tried uploading a file with curl, based on the docs: https://guides.dataverse.org/en/latest/api/native-api.html#add-a-file-to-a-dataset
I uploaded a 50 MB file, but it was really slow (565.8 KB/s).
Here is the code I used.

export API_TOKEN=<YOURTOKEN>
export FILENAME='<FILETOUPLOAD>'
export SERVER_URL=https://datacommons.tdai.osu.edu/
export PERSISTENT_ID=<YOURPID>

curl  -w 'Speed: %{speed_upload}\n' -v -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME" -F 'jsonData={"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"false", "tabIngest":"false"}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

Next I am trying to upload a 2 GB file, but based on the earlier performance this could take an hour.

@johnbradley (Collaborator)

I tested uploading a 50 MB file on OSC with curl. It uploaded much faster: 8 MB/s.
I then tried uploading a 2 GB file on OSC, but received a 500 error. I created the file with truncate -s 2G bigfile.dat,
then uploaded it using the code from my previous comment.
The response HTML was rather verbose, but no useful details were included:

Internal Server Error - An unexpected error was encountered, no more information is available.

@johnbradley (Collaborator)

The 500 error seems hit or miss. I tried uploading the same 2 GB file on OSC and it uploaded fine.
I then tried a 5 GB file, which failed after a minute. I retried the 5 GB upload and it failed immediately two more times.

@johnbradley (Collaborator)

Code that often reproduces the 500 error (after filling in <TODO> with your Data Commons token):

export API_TOKEN=<TODO>
export FILENAME='bigfile.dat'
export SERVER_URL=https://datacommons.tdai.osu.edu/
export PERSISTENT_ID=doi:10.5072/FK2/BZATJO
truncate -s 4G $FILENAME
curl -w 'Speed: %{speed_upload}\n' -v -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME" -F 'jsonData={"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"false", "tabIngest":"false"}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

@thompsonmj

So a 500 error is the server's fault, right? Would it make sense to split files over 1 GB into smaller parts, so each part takes less time to upload and the chance of a 500 mid-transfer is reduced, and to add error handling that retries when a 500 occurs?
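
A rough sketch of the retry idea, assuming the upload goes through requests (the wrapper name and its arguments are hypothetical, not current dva code):

import time
import requests

def upload_with_retries(url, file_path, headers, json_data, max_attempts=3):
    """Hypothetical retry wrapper: re-send the upload when the server returns a 5xx."""
    for attempt in range(1, max_attempts + 1):
        with open(file_path, "rb") as f:
            resp = requests.post(
                url,
                headers=headers,
                files={"file": f},
                data={"jsonData": json_data},
            )
        if resp.status_code < 500:
            return resp
        # Back off before retrying a server-side failure.
        time.sleep(2 ** attempt)
    return resp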

@johnbradley (Collaborator)

> So a 500 error is the server's fault, right? Would it make sense to split files over 1 GB into smaller parts, so each part takes less time to upload and the chance of a 500 mid-transfer is reduced, and to add error handling that retries when a 500 occurs?

Wouldn't that mean a user would need to merge the file parts back together when downloading?

@thompsonmj

I think that could be integrated into dva to avoid extra steps for the user.

If the upload command auto-splits files based on size (and computes a checksum before splitting), the parts could be named in a way that a download command can identify and automatically recombine them (and then compare the recombined checksum to the pre-split one).

Or might it be better to check with Data Commons support about what is causing the 500 error and see if that root issue can be sorted out? It doesn't feel like this should be normal, but maybe there's a fundamental technical limitation on their side that we should be prepared to deal with. I don't know enough about networking to know what is normal here.
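
A rough sketch of the split/recombine idea with checksums (all names and the 1 GB threshold are hypothetical, not existing dva behavior):

import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 ** 3  # assumed 1 GiB part size; each part is held in memory for simplicity

def sha256_of(path: Path) -> str:
    """Checksum of the original file, computed before splitting."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    return h.hexdigest()

def split_file(path: Path) -> list[Path]:
    """Split path into numbered .partNNN files so a download command can find and rejoin them."""
    parts = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            part = Path(f"{path}.part{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts

def combine_parts(parts: list[Path], target: Path, expected_sha256: str) -> None:
    """Concatenate parts in order and verify the recombined checksum against the pre-split one."""
    with open(target, "wb") as out:
        for part in sorted(parts):
            out.write(part.read_bytes())
    if sha256_of(target) != expected_sha256:
        raise ValueError("Checksum mismatch after recombining parts")

The zero-padded part numbers keep the pieces in order when sorted, so the download side can rebuild the file without any extra manifest beyond the original checksum.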
