
DVUploader, a Command line Bulk Uploader for Dataverse

qqmyers edited this page Nov 8, 2024 · 21 revisions

Motivation:

Dataverse supports file uploads through its web interface. However, that interface has a limit of 1000 files per upload session and, since it displays uploaded files in a single long list, it can be difficult to use even with fewer files than that (e.g. hundreds). Where supported, the web interface's ability to unzip uploaded zip files is one way to simplify the process: files can be pre-zipped and uploaded as one larger zip file, but the interface still shows a long list of the included files.

The Dataverse community has a number of initiatives underway to support upload of larger files (greater than a few GB) and/or large numbers of files. Many of these involve configuring external storage and/or data transfer software. One, whose development was initially supported by TDL, is a relatively simple application (DVUploader) that can be downloaded by users. It uses the existing Dataverse application programming interface (API) to upload files from a specified directory into a specified Dataset. It can be a useful alternative to the web interface when:

  • there are hundreds or thousands of files to upload,
  • files reside in multiple subfolders and keeping these paths in Dataverse is desired,
  • automatic verification of error-free and complete upload of files is desired,
  • new files are being generated/added to a directory and Dataverse needs to be updated with just the new files, or
  • uploading of files needs to be automated, e.g. added to an instrument or analysis script or program.
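As an example of the last point, an analysis script can invoke the DVUploader as a subprocess. The following Python sketch builds the command line; the key, DOI, server, and directory name are placeholders, not real values:

```python
import subprocess

# Placeholder values: substitute your own API key, dataset DOI, and server.
cmd = [
    "java", "-jar", "DVUploader-v1.2.1.jar",
    "-key=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "-did=doi:10.5072/FK2/EXAMPLE",
    "-server=https://demo.dataverse.org",
    "results",  # directory of newly generated files to upload
]
# Uncomment once the jar and a real key/DOI are in place:
# subprocess.run(cmd, check=True)
print(" ".join(cmd))
```

Because DVUploader only uploads files that are not already in the Dataset, re-running this step after each analysis adds just the new files.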

The DVUploader does need to be installed and, as a command-line tool, may not be as intuitive as the Dataverse web interface. However, unlike other bulk tools being developed, it works with any Dataverse installation without any server-side changes. (Since it uploads and stores data via Dataverse, it shares the basic performance characteristics and limitations of Dataverse's web interface, although many Dataverse installations interested in supporting larger data now support 'direct upload', which significantly improves performance and scaling. Other tools bypass Dataverse to handle larger data, or do not move data from its remote location at all and simply reference it in a Dataverse Dataset.) DVUploader can thus be a useful tool for individuals and for Dataverse installations interested in supporting larger numbers of files.

Since the development of DVUploader, the DVWebloader Plugin has been created. It provides similar functionality to DVUploader (using the same Dataverse APIs) and might be a useful alternative for manually uploading large numbers of files from a folder tree if it has been installed.

Installation

The DVUploader is a Java application packaged as a single jar file. For the current version (v1.2.1):

Step 1: Install Java (if needed). DVUploader requires Java 8 or greater; downloads for most operating systems are available from https://java.com/en/download/. Mac users may need to download a Java JDK from, e.g., https://adoptopenjdk.net/ or https://www.oracle.com/technetwork/java/javase/downloads/index.html instead. (See https://anas.pk/2015/09/02/solution-no-java-runtime-present-mac-yosemite/ for a discussion of the Mac issue some users are seeing.) You may wish to ask your local IT support organization which Java download they recommend. Any warning about Java not being able to run in the browser (MS Edge is one browser that shows this warning) can be ignored, as the DVUploader does not run in the browser.

Step 2: Download the DVUploader-v1.2.1.jar file to a directory on your computer.

Uploading Files

To prepare, log in to Dataverse and:

  • find the DOI for the dataset you wish to add files to, and
  • find or generate an API key for yourself in the Dataverse instance you are using (from the popup menu under your profile).

The simplest way to run the DVUploader is to place the jar file into the directory containing a subdirectory with the files intended for upload. (The DVUploader can be placed anywhere on disk and can upload files from any directory, but this requires adding these paths to the command line and/or configuration of Java's classpath.)

REQUIRED: Run the jar with the following command line:

java -jar DVUploader-v1.2.1.jar -key=<api key> -did=<dataset doi> -server=<server URL> <dir or file names>

where

<api key> is replaced with the API Key generated by the user in Dataverse,

<dataset doi> is replaced with the DOI of the target Dataset

<server URL> is replaced by the URL of the Dataverse server being used (with no trailing '/', and without any path to a specific Dataverse on the server), and

<dir or file names> is replaced by the name of a directory and/or a list of individual files to upload.

These four arguments are always required. There are additional options listed below. **Note:** For a first test, adding -listonly is useful: it makes the DVUploader list what it would do without performing any uploads.

For example, java -jar DVUploader-v1.2.1.jar -key=8599b802-659e-49ef-823c-20abd8efc05c -did=doi:10.5072/FK2/TUNNVE -server=https://dataverse.tdl.org testdir would upload all of the files in the 'testdir' directory (relative to the current directory where the java command is run) to the Dataset at https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.5072/FK2/TUNNVE (if it existed: the dataset in this example is not real).

The output from the DVUploader looks like:

Dataverse Mode: Uploading files to a Dataverse instance
Using apiKey: 8599b802-659e-49ef-823c-20abd8efc05c
Adding content to: doi:10.5072/FK2/TUNNVE
Using server: https://dataverse.tdl.org
Request to upload: testdir

PROCESSING(C): testdir
Found as: doi:10.5072/FK2/TUNNVE

PROCESSING(D): testdir\Capture3.JPG
Does not yet exist on server.
UPLOADED as: MD5:b2d8726f4ddba30705259143dbb283e3
CURRENT TOTAL: 1 files :9506 bytes

PROCESSING(D): testdir\Capture4.GIF
Does not yet exist on server.
UPLOADED as: MD5:3b9b536bd0abaf9c2677846f62d77ed9
CURRENT TOTAL: 2 files :23973 bytes

PROCESSING(D): testdir\Capture5.PNG
Does not yet exist on server.
UPLOADED as: MD5:ce26585c19bd1470b7229b2cfcc879f0
CURRENT TOTAL: 3 files :35448 bytes

(The same information is written into a log file.)

Optional Parameters

The full set of available command-line arguments is shown in the example below.

java -jar DVUploader-v1.2.1.jar -key=<api key> -did=<dataset doi> -server=<server URL> <-listonly> <-limit=<X>> <-ex=<ext>> <-verify> <-recurse> <-maxlockwait=<X>> <-uploadviaserver> <-trustall> <-singlefile> <-skip=<X>> <dir or file names>

(Note all combinations should work, but not all have been tested together.)

-listonly: write information about what would/would not be transferred without doing any uploads. Useful as a testing/debugging option and in combination with the -verify flag as discussed below

-limit=<X>: limit this run to at most <X> data file uploads. Repeatedly running the uploader with, for example -limit=5, will upload five more files at a time. This can also be useful for testing, or as a way to break uploads into chunks as part of an automated workflow.

-ex=<ext>: exclude any file that matches the provided regular expression pattern, e.g. -ex=^\..* (exclude files that start with a period) or -ex=.*\.txt (exclude all files ending in .txt). This flag can be repeated to exclude files based on multiple patterns. A common use for this flag is to avoid uploading resource files (which start with a period) on MacOS.
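Note that -ex takes regular expressions rather than shell globs, so "all files ending in .txt" is written .*\.txt rather than the glob *.txt. A quick Python check (illustrative only; this is not DVUploader's own matching code) of how such patterns behave:

```python
import re

# The two example patterns for the -ex flag.
hidden = re.compile(r"^\..*")    # files starting with a period
text = re.compile(r".*\.txt")    # files ending in .txt (a regex, not the glob *.txt)

files = [".DS_Store", "._resource", "notes.txt", "data.csv"]
excluded = [f for f in files if hidden.match(f) or text.match(f)]
print(excluded)  # ['.DS_Store', '._resource', 'notes.txt']
```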

-verify : use the cryptographic hash generated by Dataverse (usually MD5, but now configurable to SHA-1 and in the future to SHA-256 or SHA-512) and verify that the corresponding hash of the local file matches. This can be used to verify transfers as they occur, or, combined with the -listonly flag, in a second pass to verify that all previously uploaded files match the current file system contents.
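Conceptually, the verification step computes the hash of the local file and compares it with the hash Dataverse reports for the uploaded copy. A simplified Python sketch (assuming MD5; this is an illustration, not DVUploader's actual code):

```python
import hashlib

def local_md5(path, chunk_size=8192):
    """Compute the MD5 of a local file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in for the hash Dataverse would report for the uploaded file:
server_md5 = hashlib.md5(b"content").hexdigest()

with open("example.dat", "wb") as f:  # create a sample local file
    f.write(b"content")

print(local_md5("example.dat") == server_md5)  # True when the copies match
```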

-recurse : Upload files from subdirectories of the listed directory(ies). Now that Dataverse supports folders, your data files will be uploaded with path information into the Dataset. Paths are relative but include the name of the top-level directory specified on the command line.
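The path rule described above can be illustrated with a short Python sketch (not DVUploader's code; it just shows what directoryLabel each file would get when -recurse is used with a directory named testdir):

```python
import os
from pathlib import PurePosixPath

# Suppose the command line named the directory "testdir" and it contains:
local_files = [
    os.path.join("testdir", "a.txt"),
    os.path.join("testdir", "subdir", "b.txt"),
]

# With -recurse, the relative path (including the top-level directory name)
# becomes the file's folder path in the Dataset:
labels = [str(PurePosixPath(*f.split(os.sep)).parent) for f in local_files]
print(labels)  # ['testdir', 'testdir/subdir']
```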

-maxlockwait=<X> : the maximum time to wait (in seconds) for a Dataset lock (e.g. while the last uploaded file is being ingested) to expire (default 60 seconds).
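The lock wait behaves like a poll-until-unlocked loop with a timeout. A simplified Python sketch, where dataset_is_locked is a hypothetical stand-in for the actual call to the Dataverse locks API:

```python
import time

def wait_for_unlock(dataset_is_locked, max_lock_wait=60, poll_interval=1):
    """Poll until the dataset lock clears or max_lock_wait seconds elapse.

    dataset_is_locked: callable returning True while the dataset is locked
    (a stand-in for DVUploader's check against the Dataverse locks API).
    """
    deadline = time.monotonic() + max_lock_wait
    while time.monotonic() < deadline:
        if not dataset_is_locked():
            return True   # lock cleared; safe to upload the next file
        time.sleep(poll_interval)
    return False          # timed out; an error would be reported

# Simulate a lock that clears on the third check:
checks = iter([True, True, False])
print(wait_for_unlock(lambda: next(checks), max_lock_wait=5, poll_interval=0))
```

Increasing -maxlockwait corresponds to extending the deadline in this loop, which is why it helps when ingest of large tabular files takes longer than 60 seconds.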

-uploadviaserver : By default, DVUploader assumes direct upload is supported and will use it. It will fail with a warning, and a note about this flag, if direct upload is not supported for the specified dataset. Setting this flag makes DVUploader send the files to Dataverse itself, which then manages sending them to their final storage location. This is less efficient and not recommended for large files or many files.

-trustall : trust any/all certificates. This is useful for working with test servers that have a self-signed certificate, but should not generally be used with production servers.

-singlefile : By default, DVUploader uses the /addFiles endpoint added to Dataverse in v5.6. This makes fewer calls to Dataverse, registering multiple directly uploaded files at once and minimizing the server load when uploading many files. This flag reverts to updating the dataset on Dataverse after every file upload, which is not recommended with many files.

-skip=<X> : skip the first <X> files before trying to upload. This is potentially useful if you know that the first <X> files have already been uploaded successfully. Using -skip avoids DVUploader checking whether these files exist on the server (and validating them by checksum if that is enabled) before new uploads begin.

Dataverse Requirements:

DVUploader uses the native API of Dataverse and will work with v4.8.4 through v5.14+. Until the file upload API changes, this version of DVUploader should continue to work with newer Dataverse versions. With Dataverse 4.9+, DVUploader uses the newer locks API to more robustly wait for the lock during ingest to expire.

Frequently Asked Questions:

Can I upload a whole directory tree?

Yes, using the -recurse flag described above. Now that Dataverse supports folders within datasets, the Uploader sends the relative path of files when using -recurse. If you supply a directory name (e.g. test) which contains a sub-directory, e.g. test/subdir, any files in test/subdir will be ignored by default. However, if you run the uploader with the -recurse flag, those files will be uploaded and given the 'directoryLabel' of 'test/subdir' in Dataverse. To upload only some subdirectories, don't specify -recurse and instead provide a list, e.g. java … testdir testdir/subdir1 testdir/subdir2. Note that this uploads files without path information, and if there are files with the same name or content in these directories, Dataverse may fail to upload them or may modify their names (e.g. file_1.txt). (In this case the DVUploader behaves the same as if you had uploaded all of the files via the Dataverse web application.)

Can I run more than one copy to upload to the same Dataset?

No! DVUploader reads the current list of files in a dataset at the start and then decides whether a local file needs to be uploaded based on that list. If another instance, or someone using the Dataverse web interface directly, uploads a file, the DVUploader won't know and will try to upload the same file (failing, and currently retrying several times). (Also, the DVUploader already uploads in parallel, so there isn't a big performance reason to run multiple copies. The parallelism could be increased if there is evidence that it would help; add an issue if you think this is a problem for you.)

Java cannot find the DVUploader / Can I put the jar file in one place and not move it to upload different directories?

Yes. The DVUploader is a standard Java application, so as long as you provide the path to the jar file and use paths when identifying the directories to upload, you can put the DVUploader jar where you like and run it from any directory. For example, in the command-line examples given above, you would change -jar DVUploader-v1.2.1.jar to -jar /path/to/DVUploader-v1.2.1.jar (using the actual location of the jar).

The DVUploader was stopped before it finished, what do I do?

The DVUploader can just be restarted. It will scan through the existing files and start uploading when it finds the first one that does not exist in the Dataset.

Is the DVUploader Open Source?

Yes. The DVUploader is distributed under the Apache2 Open Source License. The source code has been posted to GitHub and is distributed as a Dataverse community product.

I see 'waiting' messages or errors from Dataverse!

Problems with the apikey, server URL, or Dataset DOI should be discovered and reported early. If a required argument is missing, the DVUploader will display usage information.

Any problems occurring as files are uploaded should relate to the specifics of that file or, if you see 'Waiting' messages, to the one before it. The DVUploader uses the Dataverse API to upload files, so any problems that could occur using the web interface can occur for the DVUploader as well: issues related to data size (upload size limits), network connections (failures, connections timing out), or Dataverse-specific operations, such as two files with the same content not being allowed. When uploading files such as zip files or spreadsheets that are further processed by Dataverse, you may see errors such as the file already existing (e.g. if you upload an Excel file for which a .tab file has already been uploaded).

Further, when Dataverse ingests a file, it places a lock on the Dataset until the processing is done. DVUploader attempts to wait for such a lock to be removed before proceeding to upload the next file, but it only waits for 60 seconds by default. (On 4.8.x instances of Dataverse, it also cannot tell whether the Dataset is locked due to ingest or for another cause such as being 'in review'.) If you see an error uploading a file after one where the DVUploader was 'waiting', try increasing the -maxlockwait setting. In all cases, it can be useful to try uploading any file for which the DVUploader reported an error through Dataverse's web interface.

I'd like the DVUploader to do 'X'...

Great! Tell your Dataverse administrators who can help communicate your request to the larger community, or help develop it yourself. (The DVUploader leverages code originally developed as part of the SEAD project and there are a number of features that have not yet been ported to Dataverse including the ability to create a new Dataset and upload metadata that would not have to be built from scratch.)