Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a FOLIO Workflow that identifies duplicate instances #350

Open
rmathew1011 opened this issue Apr 25, 2024 · 3 comments · Fixed by #394
Open

Create a FOLIO Workflow that identifies duplicate instances #350

rmathew1011 opened this issue Apr 25, 2024 · 3 comments · Fixed by #394
Assignees
Labels
enhancement New feature or request

Comments

@rmathew1011
Copy link
Contributor

rmathew1011 commented Apr 25, 2024

This workflow should identify duplicate instances based on the following comparisons:

  • OCLC Match
  • ISBN Match
  • ISSN Match
  • LCCN Match
  • Call Number Match

The output of this workflow should be a report in the following format:
Screenshot 2024-09-11 at 10 26 02 AM

Each match should apply the following criteria:

OCLC Match

  • The OCLC numbers will need to be normalized.
    • Only treat values with the (OCLoC) prefix as an OCLC number. Ignore all fields lacking this prefix.
    • All non numerical characters shoud be removed.
    • Trim all leading 0s
  • A Normalized OCLC number is considered to be a match when:
    • The Normalized OCLC number from the field:035 ord:1 sf:a of the source record matches the Normalized OCLC number from the field:035 ord:1 sf:a of the target record
    • The Normalized OCLC number from the field:035 ord:1 sf:a of the source record matches any of the Normalized OCLC numbers within the field:035 sf:z fields of the target record.
  • The srs_marctab table should be used for this comparison

ISBN Match

  • The ISBN will need to be normalized as follows:
    • Trim everything after and including the first white space
    • Take the last 10 characters
    • Of those last 10 characters, remove the last character
    • 9780134553351 (foo) : $16.50 -> 9780134553351 -> ~978~ 013455335 ~1~ -> 013455335
  • The Normalized ISBN of the source instance is considered a match if it is identical to the Normalized ISBN of the target instance.
  • The instance_identifiers table should be used for this comparison

ISSN Match

  • The ISSNs do not need to be normalized
  • The ISSN is considered to be a match when the field:022 ord:0 sf:a or the source instance is identical the field:022 ord:0 sf:a of the target record.
  • The srs_marctab table should be used for this comparison

LCCN Match

  • The LCCN's do not need to be normalized
  • The LCCN is considered to be a match if the source LCCN is identical to the target LCCN
  • The instance_identifiers table should be used for this comparison

Call Number Match

  • The call numbers do not need to be normalized
  • An source instance is considered to match a target instance if any of the source instance's holding's call numbers match any of the target instance's holding's call numbers.
  • This comparison should be done using the call_number, call_number_prefix, call_number_suffix columns of the holdings_ext table in folio reporting.

This needs a schedule worklow, at an anual cadence.

You can create working tables in the mis schema of the LDP.

The results should be emailed to a variable email address.

Rows should only be included if at least one of the matches is true.

Original Text

This script should identify duplicate instances. The specific criteria for to determine that instances are duplicates will be provided, and will most likely be a comparison of multiple data points on the two instances.

The script should combine the two instances by keeping the oldest of the two instances, and removing the newest. All holdings and items from the newest instance should be moved to the oldest instance.

Update:
Create a workflow to accomplish the above report - sent as an email (csv as an attachment)

Additional requirements:

Add title and author field for both matching instances.

Report columns as:

hrid, hrid2, oclc, isbn, lccn, issn, call_number, title, title2, author, author2
@rmathew1011 rmathew1011 added the enhancement New feature or request label Apr 25, 2024
@Dbreck-TAMU
Copy link
Contributor

Working on the identification of duplicate records first to then better determine what the definition of a 'duplicate' record is. We may want to venture into a plan involved heuristics to determine what a duplicate is.

This workflow is meant to be the first step in a large, overarching workflow that will help the librarians maintain and keep their data clean at scale.

Utilizing the LDP data for now.

@Dbreck-TAMU Dbreck-TAMU self-assigned this Apr 29, 2024
@rmathew1011
Copy link
Contributor Author

rmathew1011 commented Apr 30, 2024

Implementation Strategy:

  • A CTE that creates a cached table with two columns: hrid and oclcs
    • for each hrid we will use the public.srs_marctab table to create a concatenated string of all normalized OCLC numbers.
    • The OCLC numbers that will be concatenated are the field:035 ord:1 sf:a, and all field:035 sf:z
    • Normalization will retain all integers, and remove all alphanumerics, special characters, and spaces
  • The deliverable report will be a two column csv, with hrid and duplicates as the column names
    • For each hrid, we will identify every hrid from the CTE where the OCLC (field:035 ord:1 sf:a from public.srs_marctab) is contained within the oclcs string
    • Since there may be multiple duplicates for each hrid, the duplicates column should be a comma separated list of all hrids that are identified as duplicates.

@rmathew1011
Copy link
Contributor Author

IdentifyDuplicateInstances

@Dbreck-TAMU Dbreck-TAMU removed their assignment Sep 12, 2024
@wwelling wwelling self-assigned this Sep 16, 2024
wwelling added a commit that referenced this issue Sep 19, 2024
wwelling added a commit that referenced this issue Sep 19, 2024
@wwelling wwelling linked a pull request Sep 23, 2024 that will close this issue
24 tasks
wwelling added a commit that referenced this issue Sep 26, 2024
…report

[Issue #350]: Duplicate Instance Report Workflow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: 🏃 On Sprint
Development

Successfully merging a pull request may close this issue.

3 participants