Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk import of identifier (and other?) table data from files #62

Open
seasidesparrow opened this issue Mar 27, 2024 · 0 comments
Open

Comments

@seasidesparrow
Copy link
Member

JournalsDB tables were initially populated by loading existing classic files into tables. Subsequent to this, additions to the tables are made by exporting the tables to Google Sheets, making edits, and reimporting the data. The export/import process records the editorial history and enforces some data requirements, but it is time-intensive if more than a few records are being added.

We need to implement a process in run.py that will trigger the bulk import of data from files, but modified so that the data undergo validity checks prior to db insert. This discussion will focus on the bulk import of ISSNs.

Starting from the assumption that the bibcodes for new records already exist in the master table, we need a task that:

  • will read a formatted text file containing a bibstem, an ISSN_type, and an ISSN
  • verify that the bibstems are already in journalsdb (i.e. have a masterid)
  • verify that the ISSN isn't already in journalsdb.idents
  • check whether there is an existing ISSN of the given ISSN_type for that bibstem

If the given ISSN is truly new, the task should import the masterid, ISSN_type, and ISSN into idents. If it is not new, or if there is some other issue, it should provide actionable information to the user about what is wrong.

The procedure for this will be something like:

  • read the file
  • search journalsdb for the bibstem, and whether that bibstem already has any ISSNs
  • search journalsdb for the ISSN, and return a warning if it has already been assigned to another bibstem, or the same bibstem but with a different ISSN_type
  • if bibstem exists, ISSN does not, and ISSN_type has not already been assigned to bibstem, insert (masterid, ISSN, ISSN_type)
  • if bibstem exists, ISSN does not, but ISSN_type has already been assigned (and you're sure the new record is a valid alternate ISSN for this bibstem), insert (masterid, ISSN, ISSN_type_altN), where the altN is generated from existing table data for alternate ISSNs
  • if bibstem does not exist, reject the row and warn the user
  • if bibstem exists and ISSN exists for the same bibstem, reject the row and notify the user the ISSN is already present
  • if bibstem exists, and ISSN exists but for a different bibstem and/or ISSN_type, reject the row and warn the user there's a problematic record that requires investigation

The code for inserting an identifier of given type already exists in journalsmanager.tasks.task_db_load_identifier, so the new coding required is for reading the file, doing the required sqlalchemy queries to validate the incoming data, and generating logging messages of the appropriate level (info/warning) that will direct the curator on what needs to be resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant