Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue #350]: Duplicate Instance Report Workflow #394

Merged
merged 1 commit into from
Sep 26, 2024

Conversation

wwelling
Copy link
Contributor

@wwelling wwelling commented Sep 18, 2024

This workflow should identify duplicate instances based on the following comparisons:

  • OCLC Match
  • ISBN Match
  • ISSN Match
  • LCCN Match
  • Call Number Match

The output of this workflow should be a report in the following format:
Screenshot 2024-09-11 at 10 26 02 AM

This is closely followed excluding the Match in the column headers and call number column will not be included. Additional columns comparing title and author have been added as well.

OCLC Match

  • 1. The OCLC numbers will need to be normalized.
    • a. Only treat values with the (OCLoC) prefix as an OCLC number. Ignore all fields lacking this prefix.
    • b. All non numerical characters shoud be removed.
    • c. Trim all leading 0s
  • 2. A Normalized OCLC number is considered to be a match when:
    • a. The Normalized OCLC number from the field:035 ord:1 sf:a of the source record matches the Normalized OCLC number from the field:035 ord:1 sf:a of the target record
    • b. The Normalized OCLC number from the field:035 ord:1 sf:a of the source record matches any of the Normalized OCLC numbers within the field:035 sf:z fields of the target record.
    • c. The srs_marctab table should be used for this comparison

ISBN Match

  • 1. The ISBN will need to be normalized as follows:
    • a. Trim everything after and including the first white space
    • b. Take the last 10 characters
    • c. Of those last 10 characters, remove the last character
    • d. 9780134553351 (foo) : $16.50 -> 9780134553351 -> ~978~ 013455335 ~1~ -> 013455335
  • 2. The Normalized ISBN of the source instance is considered a match if it is identical to the Normalized ISBN of the target instance.
  • 3. The instance_identifiers table should be used for this comparison

ISSN Match

  • 1. The ISSNs do not need to be normalized
  • 2. The ISSN is considered to be a match when the field:022 ord:0 sf:a or the source instance is identical the field:022 ord:0 sf:a of the target record.
  • 3. The srs_marctab table should be used for this comparison

LCCN Match

  • 1. The LCCN's do not need to be normalized
  • 2. The LCCN is considered to be a match if the source LCCN is identical to the target LCCN
  • 3. The instance_identifiers table should be used for this comparison

Call Number Match

  • 1. The call numbers do not need to be normalized
  • 2. An source instance is considered to match a target instance if any of the source instance's holding's call numbers match any of the target instance's holding's call numbers.
  • 3. This comparison should be done using the call_number, call_number_prefix, call_number_suffix columns of the holdings_ext table in folio reporting.

Caveats:

  1. There are far too many call number matches to include in this report
  2. ISBN require additional normalization to handle semi-colon and dollar sign
  3. ISSN only had content for ord:1

Here is the instance duplicate report query excluding matches on call number:

WITH oclc_with_title AS (
  SELECT ie.instance_hrid, LTRIM(REGEXP_REPLACE(SUBSTRING(sm.content FROM 8), '[^0-9]', '', 'g'), '0') AS oclc, ie.title
  FROM folio_reporting.instance_ext ie
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE sm.field = '035' AND sm.ord = 1 AND sm.sf IN ('a', 'z') AND sm.content LIKE '(OCoLC)%'
),
oclc AS (
  SELECT oclcwt.instance_hrid, oclcwt.oclc, oclcwt.title, sm.content AS author
  FROM oclc_with_title oclcwt
  JOIN public.srs_marctab sm ON oclcwt.instance_hrid = sm.instance_hrid
  WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
oclc_matches AS (
  SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'oclc' AS match_type
  FROM oclc l
  JOIN oclc r ON l.oclc = r.oclc AND l.instance_hrid < r.instance_hrid
),
isbn AS (
  SELECT ie.instance_hrid, NULLIF(LEFT(RIGHT(REGEXP_REPLACE(ii.identifier, ' .*', ''), 10), 9), ':') AS isbn, ie.title, sm.content AS author
  FROM folio_reporting.instance_ext ie
  JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE ii.identifier_type_name = 'ISBN' AND ii.identifier NOT SIMILAR TO '(:|$)%'
    AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
isbn_matches AS (
  SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'isbn' AS match_type
  FROM isbn l
  JOIN isbn r ON l.isbn = r.isbn AND l.instance_hrid < r.instance_hrid
),
lccn AS (
  SELECT ie.instance_hrid, ii.identifier AS lccn, ie.title, sm.content AS author
  FROM folio_reporting.instance_ext ie
  JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE ii.identifier_type_name = 'LCCN'
    AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
lccn_matches AS (
  SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'lccn' AS match_type
  FROM lccn l
  JOIN lccn r ON l.lccn = r.lccn  AND l.instance_hrid < r.instance_hrid
),
issn_with_title AS (
  SELECT ie.instance_hrid, sm.content AS issn, ie.title
  FROM folio_reporting.instance_ext ie
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE field = '022' AND ord = 1 AND sf = 'a'
),
issn AS (
  SELECT issnwt.instance_hrid, issnwt.issn, issnwt.title, sm.content AS author
  FROM issn_with_title issnwt
  JOIN public.srs_marctab sm ON issnwt.instance_hrid = sm.instance_hrid
  WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
issn_matches AS (
  SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'issn' AS match_type
  FROM issn l
  JOIN issn r ON l.issn = r.issn AND l.instance_hrid < r.instance_hrid
),
all_matches AS (
  SELECT hrid, hrid2, title, title2, author, author2, match_type FROM oclc_matches
  UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM isbn_matches
  UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM lccn_matches
  UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM issn_matches
)
SELECT
  hrid AS HRID,
  hrid2 AS HRID2,
  MAX(CASE WHEN match_type = 'oclc' THEN 'T' END) AS OCLC,
  MAX(CASE WHEN match_type = 'isbn' THEN 'T' END) AS ISBN,
  MAX(CASE WHEN match_type = 'issn' THEN 'T' END) AS ISSN,
  MAX(CASE WHEN match_type = 'lccn' THEN 'T' END) AS LCCN,
  '"' || REPLACE(title, '"', '""') || '"' AS TITLE,
  '"' || REPLACE(title2, '"', '""') || '"' AS TITLE2,
  '"' || REPLACE(author, '"', '""') || '"' AS AUTHOR,
  '"' || REPLACE(author2, '"', '""') || '"' AS AUTHOR2
FROM all_matches
GROUP BY hrid, hrid2, title, title2, author, author2;

Here is the complete instance duplication report query with call number matches:

very slow
very sparse report

WITH filtered_holdings AS (
    SELECT instance_id, call_number, call_number_prefix, call_number_suffix
    FROM folio_reporting.holdings_ext
    WHERE he.call_number IS NOT NULL AND he.call_number !~ '^\s*$' AND he.call_number_type_id IS NOT NULL
),
filtered_marctab AS (
    SELECT instance_hrid, content
    FROM public.srs_marctab
    WHERE field = '100' AND ord = 1 AND sf = 'a'
),
call_number AS (
  SELECT ie.instance_hrid, TRIM(CONCAT_WS(' ', he.call_number_prefix, he.call_number, he.call_number_suffix)) AS full_call_number, ie.title, sm.content AS author
  FROM folio_reporting.instance_ext ie
  JOIN filtered_holdings he ON ie.instance_id = he.instance_id
  JOIN filtered_marctab sm ON ie.instance_hrid = sm.instance_hrid
),
call_number_matches AS (
  SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'call_number' AS match_type
  FROM call_number l
  JOIN call_number r ON l.full_call_number = r.full_call_number AND l.instance_hrid < r.instance_hrid
),
oclc_with_title AS (
  SELECT ie.instance_hrid, LTRIM(REGEXP_REPLACE(SUBSTRING(sm.content FROM 8), '[^0-9]', '', 'g'), '0') AS oclc, ie.title
  FROM folio_reporting.instance_ext ie
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE sm.field = '035' AND sm.ord = 1 AND sm.sf IN ('a', 'z') AND sm.content LIKE '(OCoLC)%'
),
oclc AS (
  SELECT oclcwt.instance_hrid, oclcwt.oclc, oclcwt.title, sm.content AS author
  FROM oclc_with_title oclcwt
  JOIN public.srs_marctab sm ON oclcwt.instance_hrid = sm.instance_hrid
  WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
oclc_matches AS (
  SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'oclc' AS match_type
  FROM oclc l
  JOIN oclc r ON l.oclc = r.oclc AND l.instance_hrid < r.instance_hrid
),
isbn AS (
  SELECT ie.instance_hrid, NULLIF(LEFT(RIGHT(REGEXP_REPLACE(ii.identifier, ' .*', ''), 10), 9), ':') AS isbn, ie.title, sm.content AS author
  FROM folio_reporting.instance_ext ie
  JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE ii.identifier_type_name = 'ISBN' AND ii.identifier NOT SIMILAR TO '(:|$)%'
    AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
isbn_matches AS (
  SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'isbn' AS match_type
  FROM isbn l
  JOIN isbn r ON l.isbn = r.isbn AND l.instance_hrid < r.instance_hrid
),
lccn AS (
  SELECT ie.instance_hrid, ii.identifier AS lccn, ie.title, sm.content AS author
  FROM folio_reporting.instance_ext ie
  JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE ii.identifier_type_name = 'LCCN'
    AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
lccn_matches AS (
  SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'lccn' AS match_type
  FROM lccn l
  JOIN lccn r ON l.lccn = r.lccn  AND l.instance_hrid < r.instance_hrid
),
issn_with_title AS (
  SELECT ie.instance_hrid, sm.content AS issn, ie.title
  FROM folio_reporting.instance_ext ie
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE field = '022' AND ord = 1 AND sf = 'a'
),
issn AS (
  SELECT issnwt.instance_hrid, issnwt.issn, issnwt.title, sm.content AS author
  FROM issn_with_title issnwt
  JOIN public.srs_marctab sm ON issnwt.instance_hrid = sm.instance_hrid
  WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
),
issn_matches AS (
  SELECT l.instance_hrid AS hrid, r.instance_hrid AS hrid2, l.title AS title, r.title AS title2, l.author AS author, r.author AS author2, 'issn' AS match_type
  FROM issn l
  JOIN issn r ON l.issn = r.issn AND l.instance_hrid < r.instance_hrid
),
all_matches AS (
  SELECT hrid, hrid2, title, title2, author, author2, match_type FROM call_number_matches
  UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM oclc_matches
  UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM isbn_matches
  UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM lccn_matches
  UNION ALL SELECT hrid, hrid2, title, title2, author, author2, match_type FROM issn_matches
)
SELECT
  hrid AS HRID,
  hrid2 AS HRID2,
  MAX(CASE WHEN match_type = 'oclc' THEN 'T' END) AS OCLC,
  MAX(CASE WHEN match_type = 'isbn' THEN 'T' END) AS ISBN,
  MAX(CASE WHEN match_type = 'issn' THEN 'T' END) AS ISSN,
  MAX(CASE WHEN match_type = 'call_number' THEN 'T' END) AS CALL_NUMBER,
  MAX(CASE WHEN match_type = 'lccn' THEN 'T' END) AS LCCN,
  '"' || REPLACE(title, '"', '""') || '"' AS TITLE,
  '"' || REPLACE(title2, '"', '""') || '"' AS TITLE2,
  '"' || REPLACE(author, '"', '""') || '"' AS AUTHOR,
  '"' || REPLACE(author2, '"', '""') || '"' AS AUTHOR2
FROM all_matches
GROUP BY hrid, hrid2, title, title2, author, author2;

Here are the individual identifier match queries:

Call Number

WITH call_number AS (
  SELECT ie.instance_hrid, he.call_number_type_id, he.call_number_type_name, he.call_number, TRIM(CONCAT_WS(' ', he.call_number_prefix, he.call_number, he.call_number_suffix)) AS full_call_number, ie.title, sm.content AS author
  FROM folio_reporting.instance_ext ie
  JOIN folio_reporting.holdings_ext he ON ie.instance_id = he.instance_id
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE he.call_number IS NOT NULL AND he.call_number !~ '^\s*$' AND he.call_number_type_id IS NOT NULL
    AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
  l.instance_hrid AS hrid,
  r.instance_hrid AS hrid2,
  l.call_number AS call_number,
  r.call_number AS call_number2,
  l.full_call_number,
  '"' || REPLACE(l.title, '"', '""') || '"' AS title,
  '"' || REPLACE(r.title, '"', '""') || '"' AS title2,
  '"' || REPLACE(l.author, '"', '""') || '"' AS author,
  '"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM call_number l
JOIN call_number r ON l.full_call_number = r.full_call_number AND l.instance_hrid < r.instance_hrid;

ISBN

WITH isbn AS (
  SELECT ie.instance_hrid, ii.identifier, NULLIF(LEFT(RIGHT(REGEXP_REPLACE(ii.identifier, ' .*', ''), 10), 9), ':') AS isbn, ie.title, sm.content AS author
  FROM folio_reporting.instance_ext ie
  JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE ii.identifier_type_name = 'ISBN' AND ii.identifier NOT SIMILAR TO '(:|$)%'
    AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
  l.instance_hrid AS hrid,
  r.instance_hrid AS hrid2,
  l.identifier,
  l.isbn,
  '"' || REPLACE(l.title, '"', '""') || '"' AS title,
  '"' || REPLACE(r.title, '"', '""') || '"' AS title2,
  '"' || REPLACE(l.author, '"', '""') || '"' AS author,
  '"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM isbn l
JOIN isbn r ON l.isbn = r.isbn AND l.instance_hrid < r.instance_hrid;

LCCN

WITH lccn AS (
  SELECT ie.instance_hrid, ii.identifier AS lccn, ie.title, sm.content AS author
  FROM folio_reporting.instance_ext ie
  JOIN folio_reporting.instance_identifiers ii ON ie.instance_hrid = ii.instance_hrid
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE ii.identifier_type_name = 'LCCN'
    AND sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
  l.instance_hrid AS hrid,
  r.instance_hrid AS hrid2,
  l.lccn,
  '"' || REPLACE(l.title, '"', '""') || '"' AS title,
  '"' || REPLACE(r.title, '"', '""') || '"' AS title2,
  '"' || REPLACE(l.author, '"', '""') || '"' AS author,
  '"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM lccn l
JOIN lccn r ON l.lccn = r.lccn  AND l.instance_hrid < r.instance_hrid;

ISSN

WITH issn_with_title AS (
  SELECT ie.instance_hrid, sm.content AS issn, ie.title
  FROM folio_reporting.instance_ext ie
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE field = '022' AND ord = 1 AND sf = 'a'
),
issn AS (
  SELECT issnwt.instance_hrid, issnwt.issn, issnwt.title, sm.content AS author
  FROM issn_with_title issnwt
  JOIN public.srs_marctab sm ON issnwt.instance_hrid = sm.instance_hrid
  WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
  l.instance_hrid AS hrid,
  r.instance_hrid AS hrid2,
  l.issn,
  '"' || REPLACE(l.title, '"', '""') || '"' AS title,
  '"' || REPLACE(r.title, '"', '""') || '"' AS title2,
  '"' || REPLACE(l.author, '"', '""') || '"' AS author,
  '"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM issn l
JOIN issn r ON l.issn = r.issn AND l.instance_hrid < r.instance_hrid;

OCLC

WITH oclc_with_title AS (
  SELECT ie.instance_hrid, LTRIM(REGEXP_REPLACE(SUBSTRING(sm.content FROM 8), '[^0-9]', '', 'g'), '0') AS oclc, ie.title
  FROM folio_reporting.instance_ext ie
  JOIN public.srs_marctab sm ON ie.instance_hrid = sm.instance_hrid
  WHERE sm.field = '035' AND sm.ord = 1 AND sm.sf IN ('a', 'z') AND sm.content LIKE '(OCoLC)%'
),
oclc AS (
  SELECT oclcwt.instance_hrid, oclcwt.oclc, oclcwt.title, sm.content AS author
  FROM oclc_with_title oclcwt
  JOIN public.srs_marctab sm ON oclcwt.instance_hrid = sm.instance_hrid
  WHERE sm.field = '100' AND sm.ord = 1 AND sm.sf = 'a'
)
SELECT
  l.instance_hrid AS hrid,
  r.instance_hrid AS hrid2,
  l.oclc,
  '"' || REPLACE(l.title, '"', '""') || '"' AS title,
  '"' || REPLACE(r.title, '"', '""') || '"' AS title2,
  '"' || REPLACE(l.author, '"', '""') || '"' AS author,
  '"' || REPLACE(r.author, '"', '""') || '"' AS author2
FROM oclc l
JOIN oclc r ON l.oclc = r.oclc AND l.instance_hrid < r.instance_hrid;

image

@wwelling wwelling force-pushed the spritn20-350-duplicate-instance-report branch 5 times, most recently from cb32f9a to 2ea0e99 Compare September 21, 2024 02:32
@wwelling wwelling marked this pull request as ready for review September 21, 2024 02:35
@wwelling wwelling force-pushed the spritn20-350-duplicate-instance-report branch 8 times, most recently from cef031e to 13bf408 Compare September 21, 2024 04:26
@wwelling wwelling linked an issue Sep 23, 2024 that may be closed by this pull request
@wwelling wwelling marked this pull request as draft September 24, 2024 13:48
@wwelling wwelling force-pushed the spritn20-350-duplicate-instance-report branch 3 times, most recently from 1f751c9 to 6891dad Compare September 24, 2024 18:33
@wwelling wwelling marked this pull request as ready for review September 24, 2024 18:38
@wwelling wwelling changed the base branch from main to sprint20-staging September 24, 2024 18:38
@wwelling wwelling force-pushed the spritn20-350-duplicate-instance-report branch 3 times, most recently from 8ed11b2 to ca3082c Compare September 24, 2024 19:34
@wwelling wwelling force-pushed the spritn20-350-duplicate-instance-report branch 3 times, most recently from 9d5fb40 to 3d973f0 Compare September 25, 2024 17:23
@wwelling wwelling force-pushed the spritn20-350-duplicate-instance-report branch from 3d973f0 to 62c711a Compare September 25, 2024 17:30
@wwelling wwelling merged commit 8a162f6 into sprint20-staging Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a FOLIO Workflow that identifies duplicate instances
3 participants