
Data Commons JSON upload error #16

Open
egrace479 opened this issue May 2, 2023 · 7 comments
@egrace479 (Member)

I attempted to upload a 4.45 GB zip file to Data Commons using
dva upload <zipfile> <doi>
After about an hour of the terminal showing Uploading <zipfile>, it printed the following error:

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/bin/dva", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/dva/cli.py", line 93, in main
    cli()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/dva/cli.py", line 74, in upload
    api.upload_file(doi, path)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/dva/api.py", line 65, in upload_file
    status = resp.json()["status"]
             ^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dataverse/lib/python3.11/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
@johnbradley (Collaborator) commented May 4, 2023

The JSON decode error is likely due to the website returning an error response that is not JSON, perhaps HTML.
When trying to reproduce, I received an OverflowError: string longer than 2147483647 bytes error that already has an upstream issue: gdcc/pyDataverse#137.
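
A minimal sketch (not the actual dva code) of how the upload_file call could guard against a non-JSON error body before parsing; only resp.json()["status"] is taken from the traceback above, everything else is a hypothetical helper:

import requests

def check_upload_response(resp: requests.Response) -> str:
    """Hypothetical helper: surface HTTP errors instead of blindly parsing JSON."""
    # Raise a clear error for non-2xx responses (e.g. a 500 with an HTML body).
    resp.raise_for_status()
    try:
        return resp.json()["status"]
    except requests.exceptions.JSONDecodeError:
        # The server answered 2xx but with a non-JSON body; include a snippet for debugging.
        raise RuntimeError(f"Expected JSON from {resp.url}, got: {resp.text[:200]!r}")

That would at least report the server's status code and body instead of the bare "Expecting value" traceback.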

I also tried uploading a file with curl, based on the docs: https://guides.dataverse.org/en/latest/api/native-api.html#add-a-file-to-a-dataset
I uploaded a 50 MB file, but it was really slow (565.8 KB/s).
Here is the code I used.

export API_TOKEN=<YOURTOKEN>
export FILENAME='<FILETOUPLOAD>'
export SERVER_URL=https://datacommons.tdai.osu.edu/
export PERSISTENT_ID=<YOURPID>

curl  -w 'Speed: %{speed_upload}\n' -v -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME" -F 'jsonData={"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"false", "tabIngest":"false"}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

Next I am trying to upload a 2 GB file, but based on the earlier performance this could take an hour.

@johnbradley (Collaborator)

I tested uploading a 50 MB file on OSC with curl. It uploaded much faster: 8 MB/s.
I then tried uploading a 2 GB file on OSC, but received a 500 error. I created the file with truncate -s 2G bigfile.dat,
then uploaded it using the code from my previous comment.
The response HTML was rather verbose, but no useful details were included:

Internal Server Error - An unexpected error was encountered, no more information is available.

@johnbradley (Collaborator)

The 500 error seems hit or miss. I tried uploading the same 2 GB file on OSC and it uploaded fine.
I then tried a 5 GB file, which failed after a minute. I retried the 5 GB upload and it failed immediately two more times.

@johnbradley (Collaborator)

Code that often reproduces the 500 error (after filling in <TODO> with your Data Commons token):

export API_TOKEN=<TODO>
export FILENAME='bigfile.dat'
export SERVER_URL=https://datacommons.tdai.osu.edu/
export PERSISTENT_ID=doi:10.5072/FK2/BZATJO
truncate -s 4G $FILENAME
curl -w 'Speed: %{speed_upload}\n' -v -H X-Dataverse-key:$API_TOKEN -X POST -F "file=@$FILENAME" -F 'jsonData={"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"false", "tabIngest":"false"}' "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

@thompsonmj

So a 500 error is the server's fault, right? Would it make sense to split files over 1 GB into smaller parts, so each part takes less time to upload and the chance of a 500 mid-transfer is reduced, and to add error handling that retries when a 500 occurs?
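
A rough sketch of the retry idea, assuming the upload goes through requests (the wrapper name and its arguments are hypothetical, not current dva code):

import time
import requests

def upload_with_retries(url, file_path, headers, json_data, max_attempts=3):
    """Hypothetical retry wrapper: re-send the upload when the server returns a 5xx."""
    for attempt in range(1, max_attempts + 1):
        with open(file_path, "rb") as f:
            resp = requests.post(
                url,
                headers=headers,
                files={"file": f},
                data={"jsonData": json_data},
            )
        if resp.status_code < 500:
            return resp
        # Back off before retrying a server-side failure.
        time.sleep(2 ** attempt)
    return resp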

@johnbradley (Collaborator)

> So a 500 error is the server's fault, right? Would it make sense to split files over 1 GB into smaller parts, so each part takes less time to upload and the chance of a 500 mid-transfer is reduced, and to add error handling that retries when a 500 occurs?

Wouldn't that mean a user would need to merge the file parts back together when downloading?

@thompsonmj

I think that could be integrated into dva to avoid extra steps for the user.

If the upload command auto-splits files based on size (and computes a checksum before splitting), the parts could be named in a way that a download command can identify and automatically recombine them (and then compare the recombined checksum to the pre-split one).

Or might it be better to check with Data Commons support about what is causing the 500 error and see if that root issue can be sorted out? It doesn't feel like this should be normal, but maybe there's a fundamental technical limitation on their side that we should be prepared to deal with. I don't know enough about networking to know what is normal here.
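
A rough sketch of the split/recombine idea with checksums (all names and the 1 GB threshold are hypothetical, not existing dva behavior):

import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 ** 3  # assumed 1 GiB part size; each part is held in memory for simplicity

def sha256_of(path: Path) -> str:
    """Checksum of the original file, computed before splitting."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    return h.hexdigest()

def split_file(path: Path) -> list[Path]:
    """Split path into numbered .partNNN files so a download command can find and rejoin them."""
    parts = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            part = Path(f"{path}.part{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts

def combine_parts(parts: list[Path], target: Path, expected_sha256: str) -> None:
    """Concatenate parts in order and verify the recombined checksum against the pre-split one."""
    with open(target, "wb") as out:
        for part in sorted(parts):
            out.write(part.read_bytes())
    if sha256_of(target) != expected_sha256:
        raise ValueError("Checksum mismatch after recombining parts")

The zero-padded part numbers keep the pieces in order when sorted, so the download side can rebuild the file without any extra manifest beyond the original checksum.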
