Skip to content

A Python tool that enriches InCites Benchmarking & Analytics exports with Article Processing Charge (APC) data from the OpenAlex API, including OA status and optional discount calculations.

License

Notifications You must be signed in to change notification settings

clarivate/incites-to-openalex-apc

Repository files navigation

OpenAlex APC from Excel

Given an input Excel file with DOI in column B and Title in column D, this script queries the OpenAlex API for each row and appends Article Processing Charge (APC) fields and Open Access status to a new workbook.

This was created to take an output file from InCites Benchmarking & Analytics, and add additional useful information in addition to all the current metrics from that database.

It's important to know how OpenAlex sources the APC paid and APC list price. See their technical documentation for definitions and sources:

https://docs.openalex.org/api-entities/works/work-object#apc_list

What it does

  • Looks up each row by DOI first, then title as a fallback.

  • Appends these columns to your sheet:

    • APC (preferred) — the paid APC if present, otherwise the listed APC (includes currency, ≈USD if provided, and provenance)
    • APC (preferred) USD — numeric USD value for analysis/summing (blanked for duplicates and for non-allowed OA statuses)
    • APC USD discounted(optional; appears only if a discount parameter is provided) numeric USD value after applying the discount percentage
    • Discount (X%)(optional) the dollar amount discounted for each row, where X is the percentage you specified
    • APC (paid) — formatted string for the paid APC
    • APC (list price) — formatted string for the listed APC
    • OpenAlex ID — canonical OpenAlex work id
    • OpenAlex oa_status — Open Access status from OpenAlex
    • Lookup methoddoi or title
    • Notes — e.g., Duplicate, OA Status not Diamond/Gold/Hybrid
  • Duplicates by OpenAlex ID
    The first occurrence counts; subsequent occurrences keep text APC fields and oa_status but have APC (preferred) USD blanked and Notes = "Duplicate".

  • OA status gating for totals
    Only rows with OpenAlex oa_status in {diamond, gold, hybrid} are included in the numeric USD total.
    If an APC is present but oa_status is not one of those, the script keeps the text APC fields, blanks APC (preferred) USD, and adds Notes = "OA Status not Diamond/Gold/Hybrid".

  • Optional Discount parameter
    If you supply a discount percentage (e.g., --discount 10), the script will:

    • Add the APC USD discounted and Discount (10%) columns.
    • Populate them based on the discount value.
    • Include totals for both APC USD discounted and Discount (10%) at the bottom.
    • Leave these columns out entirely if no discount is provided.
  • Progress output every 50 rows.

  • Suppresses the openpyxl "no default style" warning.

  • Lightweight retry/backoff for transient HTTP errors (429/5xx).

  • Optional polite API contact via --mailto [email protected].

  • Optional request throttling via --rate-limit (requests per second).

  • Sets Column W and Column X widths to ~150px (≈ 21.5), and applies the same width to discount-related columns when present.

  • Freezes the top row of the output workbook.

Definition of Preferred APC

  • Preferred APC = the paid APC (apc_paid) if available in OpenAlex, otherwise the listed APC (apc_list).

Installation

pip install -r requirements.txt

Usage

Minimal (uses defaults from the PARAMETERS dict at the top of the script, or pass via CLI):

python openalex_apc_from_excel.py

Typical:

python openalex_apc_from_excel.py "Web of Science Documents (16).xlsx" --mailto [email protected]

Specify sheet and output path:

python openalex_apc_from_excel.py input.xlsx --sheet savedrecs -o output_with_apc.xlsx

Respectful rate limiting (2 req/s):

python openalex_apc_from_excel.py input.xlsx --mailto [email protected] --rate-limit 2

Include a discount calculation (10%):

python openalex_apc_from_excel_discount.py input.xlsx --mailto [email protected] --discount 10

CLI options

  • input_xlsx (positional, optional): Path to input workbook (defaults from PARAMETERS if omitted)
  • -o, --output-xlsx: Output workbook path (defaults to <input>_with_apc.xlsx)
  • --sheet: Worksheet name (defaults to first sheet)
  • --header-row: Zero-based header row index (default 0)
  • --doi-col: Zero-based DOI column index (default 1 → column B)
  • --title-col: Zero-based Title column index (default 3 → column D)
  • --mailto: Email for OpenAlex (adds mailto= query param)
  • --rate-limit: Requests per second (float). Defaults to ~0.2s between requests when omitted.
  • --discount: (optional) Percentage discount to apply to APC values (e.g., --discount 10)

Output details

  • The script appends the new columns to the right of your existing sheet.
  • It inserts OpenAlex oa_status at Excel column AB (0-based index 27). If your sheet has fewer than 28 columns, it will be appended at the end.
  • The workbook ends with three summary rows:
    1. Total row showing:
      • Total APC (preferred) USD
      • Total APC USD discounted (if discount used)
      • Total Discount (X%) (if discount used)
    2. Unable to pull APC information for X papers (excludes rows marked Duplicate and OA Status not Diamond/Gold/Hybrid)
    3. X entries were duplicates
  • Column W, Column X, and any discount-related columns are set to ~150px (≈ 21.5 width units).
  • The top row is frozen for easier navigation in Excel.

Notes & Tips

  • If both DOI and Title are missing in a row, APC fields are blank.
  • Currency conversions rely on OpenAlex fields. If neither value_usd nor a native USD value is present in the API result, the USD column is left blank.
  • Use --mailto so OpenAlex can contact you if needed.
  • You can add your own DOIs to the spreadsheet before calling OpenAlex for any additional papers not on the spreadsheet. The script will remove duplicates.

Troubleshooting

  • OpenPyXL UserWarning about default style: already suppressed by the script.
  • HTTP 429 / 5xx: the script retries with backoff.
  • Multiple sheets: use --sheet or set a default; otherwise the first sheet is used.
  • Header row: default is 0. Use --header-row to adjust if needed.
  • Duplicates included in “Unable to pull” count: make sure you’re using the updated script (duplicates and OA-status-excluded rows are now excluded from that count).

Legend for Notes column

Note value Meaning
Duplicate This record’s OpenAlex ID was already retrieved earlier in the file. The first instance contributes to totals; this one does not.
OA Status not Diamond/Gold/Hybrid The record has an APC value but its oa_status is not diamond, gold, or hybrid — therefore excluded from the USD total.
(blank) Either an APC was successfully included in the total or no APC data was available for this record.

License

MIT — see LICENSE.

About

A Python tool that enriches InCites Benchmarking & Analytics exports with Article Processing Charge (APC) data from the OpenAlex API, including OA status and optional discount calculations.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages