Releases: opensanctions/yente
v3.7.3
- Improvements to matching of company names
- Disable phonetic matching on names that do not use a Western-style alphabet
- Fix a race condition in the indexer which can delete the active index
Full Changelog: v3.7.2...v3.7.3
v3.7.2
This release is very focussed on improving the scoring quality of the matcher system. Four areas in particular have seen work:
- Improvements to the candidate generation system which finds possible matches using ElasticSearch. Candidate generation is the step before result scores are computed: it pre-selects possible matches from the OpenSanctions database. It has been re-worked to assign higher scores to literal name matches, and to weight the individual terms in a company or person name in more detail (in particular, considering company type information less strongly).
- We've made the `logic-v1` matching implementations for Jaro-Winkler and Metaphone more precise in their ratings, meaning they score higher for close matches but also decrease in score for invalid candidates.
- We've introduced a method to assign custom weights to the features in the `logic-v1` algorithm, allowing API users to fine-tune the scoring system to their needs. More information: https://www.opensanctions.org/docs/api/scoring/#tuning
- We've re-introduced the Jaro-Winkler and Soundex implementations from `yente` 3.6.1 and frozen those in place, providing stability to any adopters.
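As a rough illustration of the custom weights mentioned above, the sketch below builds a JSON body for a `/match` request. It assumes that weights are sent as a `weights` mapping next to `queries`, and the feature name `name_literal_match` is used purely as a placeholder; both should be checked against the tuning documentation linked above before use.

```python
import json


def build_match_body(name, weights):
    """Build a JSON request body with one person query and custom weights.

    Assumption: the API accepts a top-level "weights" mapping of
    feature name -> weight. Feature names are placeholders here.
    """
    body = {
        "queries": {
            # "q1" is an arbitrary key chosen by the client.
            "q1": {"schema": "Person", "properties": {"name": [name]}},
        },
        "weights": dict(weights),
    }
    return json.dumps(body, sort_keys=True)


# Hypothetical feature name, for illustration only.
example_body = build_match_body("Jane Doe", {"name_literal_match": 1.0})
```

The function only serialises the payload, so it can be inspected or logged before being sent with any HTTP client.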
What's Changed
- Add schema facet and option to specify which facets are included in the response by @jbothma in #332
- Bump jellyfish from 1.0.0 to 1.0.1 by @dependabot in #333
- Bump elasticsearch[async] from 8.9.0 to 8.10.0 by @dependabot in #334
- Bump fastapi from 0.103.1 to 0.103.2 by @dependabot in #336
Full Changelog: v3.7.0...v3.7.2
v3.7.0
- Introduces an improved scoring system for the `/match` API, see: https://www.opensanctions.org/articles/2023-09-18-scoring-rules/
- Limits phonetic name searches to searches targeted at people
- Fixes a bug regarding the preview popups shown in OpenRefine
What's Changed
- Bump followthemoney from 3.5.3 to 3.5.4 by @dependabot in #331
Full Changelog: v3.6.2...v3.7.0
v3.6.2
This is mainly a maintenance release that updates software components. It introduces two new features:
- The `changed_since` query parameter on both the /match and /search endpoints constrains results to only entities which have changed since the given ISO timestamp.
- The API now has CORS access enabled, which is used by the OpenRefine reconciliation API.
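A minimal sketch of how a client might use `changed_since` on the /search endpoint is shown below. The base URL, the `default` dataset name and the `q` search parameter are assumptions made for illustration; only `changed_since` is taken from the notes above.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumption: a locally running instance; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def search_url(dataset, query, changed_since):
    """Build a /search URL restricted to entities changed since a timestamp."""
    params = {
        "q": query,
        # ISO 8601 timestamp, e.g. 2023-09-01T00:00:00
        "changed_since": changed_since.isoformat(timespec="seconds"),
    }
    return f"{BASE_URL}/search/{dataset}?{urlencode(params)}"


url = search_url("default", "acme", datetime(2023, 9, 1))
```

Keeping the timestamp of the last successful run and passing it on the next one would give an incremental scan that only revisits changed entities.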
What's Changed
- Bump aiofiles from 23.1.0 to 23.2.1 by @dependabot in #309
- Bump fingerprints from 1.1.0 to 1.1.1 by @dependabot in #313
- Bump fastapi from 0.101.0 to 0.101.1 by @dependabot in #314
- Bump orjson from 3.9.4 to 3.9.5 by @dependabot in #316
- Update types-aiofiles requirement from <23.2,>=23.1.0.4 to >=23.1.0.4,<23.3 by @dependabot in #317
- support since parameter for incremental scans by @everplays in #315
- Bump fastapi from 0.101.1 to 0.103.0 by @dependabot in #319
- Bump fastapi from 0.103.0 to 0.103.1 by @dependabot in #321
- Bump actions/checkout from 3 to 4 by @dependabot in #322
- Bump orjson from 3.9.5 to 3.9.7 by @dependabot in #324
- Bump followthemoney from 3.5.2 to 3.5.3 by @dependabot in #325
- Bump docker/setup-qemu-action from 2 to 3 by @dependabot in #326
- Bump docker/metadata-action from 4 to 5 by @dependabot in #327
- Bump docker/login-action from 2 to 3 by @dependabot in #328
- Bump docker/setup-buildx-action from 2 to 3 by @dependabot in #329
- Bump docker/build-push-action from 4 to 5 by @dependabot in #330
New Contributors
- @everplays made their first contribution in #315
Full Changelog: v3.6.1...v3.6.2
v3.6.1
This version includes a lot of small changes based on customer feedback. In particular:
- Introduce an `exclude_dataset` query parameter to `/match` and `/search` to remove a single dataset from results.
- Make the maximal result count of `/match` configurable via the server variable `YENTE_MAX_MATCHES`
- The index freshness check now tests if the new index has the given alias assigned, not just if it exists. This should handle partial indexing more gracefully.
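The freshness rule described in the last item can be sketched with plain data structures. This is not the indexer's actual code: the function name, the index names and the in-memory mapping of index name to aliases are all hypothetical stand-ins for what would normally be queried from ElasticSearch.

```python
def index_is_fresh(aliases_by_index, index_name, alias):
    """Return True only if the index exists AND has the alias assigned.

    aliases_by_index: hypothetical mapping of index name -> set of aliases.
    """
    aliases = aliases_by_index.get(index_name)
    if aliases is None:
        # The index does not exist at all.
        return False
    # An index that exists without the alias may be partially built.
    return alias in aliases


# Hypothetical cluster state: the second index was created, but indexing
# stopped before the alias was moved over to it.
state = {
    "entities-20230801": {"entities"},
    "entities-20230802": set(),
}
fresh_old = index_is_fresh(state, "entities-20230801", "entities")
fresh_new = index_is_fresh(state, "entities-20230802", "entities")
```

Under an existence-only check the second index would be reported as up to date; checking the alias lets a later run pick the work up again.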
What's Changed
- Bump elasticsearch[async] from 8.8.2 to 8.9.0 by @dependabot in #302
- Bump uvicorn[standard] from 0.23.1 to 0.23.2 by @dependabot in #305
- Bump fastapi from 0.100.0 to 0.100.1 by @dependabot in #303
- Bump countrynames from 1.15.1 to 1.15.2 by @dependabot in #304
- Bump fastapi from 0.100.1 to 0.101.0 by @dependabot in #307
- Bump orjson from 3.9.2 to 3.9.3 by @dependabot in #306
- Bump orjson from 3.9.3 to 3.9.4 by @dependabot in #308
Full Changelog: v3.6.0...v3.6.1
v.3.6.0
This release includes improved metadata handling for datasets, introduces some new entity types in the `followthemoney` data model and allows for less performance-heavy matching queries using the `fuzzy` flag. In detail:
- We've introduced several new entity types in the `followthemoney` data model which will be used to provide more detailed information regarding politically exposed persons. We advise all users to update the API now so that the new entity types will be reflected correctly.
- Using the `/match` API on a very large dataset can cause heavy load on the ElasticSearch index because of the Levenshtein-based fuzzy matching it uses. In this version, we've introduced a `fuzzy=` query parameter, which lets users disable that functionality. Please note that this doesn't affect the scores generated by the API, but it may lead to lower recall on very specific queries.
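The following sketch prepares a `/match` request with fuzzy matching switched off. It only constructs the request object and does not send it. The base URL, the dataset name and the `false` spelling of the flag value are assumptions; the request body follows the usual shape of a `queries` mapping.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def match_request(base_url, dataset, queries, fuzzy=False):
    """Prepare (but do not send) a POST /match request with the fuzzy flag."""
    # Assumption: the flag is passed as the literal string true/false.
    params = urlencode({"fuzzy": "true" if fuzzy else "false"})
    return Request(
        f"{base_url}/match/{dataset}?{params}",
        data=json.dumps({"queries": queries}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


queries = {"q1": {"schema": "Company", "properties": {"name": ["Acme Ltd"]}}}
req = match_request("http://localhost:8000", "default", queries)
```

Passing `req` to `urllib.request.urlopen` would send it; disabling the flag is mostly of interest for bulk screening jobs where index load matters more than catching unusual spellings.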
What's Changed
- Pydantic 2 by @pudo in #287
- Bump elasticsearch[async] from 8.8.0 to 8.8.2 by @dependabot in #280
- Bump uvicorn[standard] from 0.23.0 to 0.23.1 by @dependabot in #293
Full Changelog: v3.5.0...v3.6.0
v.3.5.0
This is a simple maintenance release that should improve performance and memory consumption of the application.
What's Changed
- Bump orjson from 3.9.0 to 3.9.1 by @dependabot in #269
- Bump fastapi from 0.96.0 to 0.97.0 by @dependabot in #270
- Bump fastapi from 0.97.0 to 0.98.0 by @dependabot in #273
- Bump types-aiofiles from 23.1.0.3 to 23.1.0.4 by @dependabot in #268
- Bump fastapi from 0.98.0 to 0.99.1 by @dependabot in #278
- Bump nomenklatura from 3.0.3 to 3.1.0 by @dependabot in #279
Full Changelog: v3.4.1...v3.5.0
v3.4.1
This release tries to improve error handling, and to avoid some situations where async processing gets locked out by slow blocking operations.
The default manifest file included in the container now indexes the `default` collection instead of `all`.
What's Changed
- Bump orjson from 3.8.11 to 3.8.12 by @dependabot in #253
- Bump countrynames from 1.14.3 to 1.15.0 by @dependabot in #254
- Bump fastapi from 0.95.1 to 0.95.2 by @dependabot in #255
- Bump types-aiofiles from 23.1.0.2 to 23.1.0.3 by @dependabot in #257
- Bump followthemoney from 3.3.0 to 3.4.0 by @dependabot in #258
- Bump ubuntu from 23.04 to 23.10 by @dependabot in #256
- Bump orjson from 3.8.12 to 3.9.0 by @dependabot in #264
- Bump asyncstdlib from 3.10.7 to 3.10.8 by @dependabot in #265
- Bump fastapi from 0.95.2 to 0.96.0 by @dependabot in #266
- Bump elasticsearch[async] from 8.7.0 to 8.8.0 by @dependabot in #260
- Bump nomenklatura from 2.11.0 to 2.14.0 by @dependabot in #267
Full Changelog: v3.4.0...v3.4.1
v3.4.0
This release completely re-works the way in which the OpenSanctions API will score matches in the `/match` API.
Until now, the API has used a simple statistical model to assign a match quality score to each result it has returned. With the new release of `yente 3.4`, we've made that mechanism more flexible: clients can now select one of a set of supported algorithms to optimise the behaviour of the API for their use case.
With the new release, we've added three new scoring systems to augment the existing model (now called `regression-v1`):
- `regression-v2` is a new statistical model for matching people and companies. Unlike `regression-v1` it uses pronunciation-based (phonetic/soundex) comparison for entity names, and it has reduced the impact of birthdates as a decision criterion. The new model will generally produce much lower scores for results, so you may want to reduce your matching `threshold` parameter in the API to 0.5 or 0.6.
- `name-based` is a simple scoring mechanism based on name similarity only. It uses two criteria, the Jaro-Winkler string distance mechanism and the Soundex phonetic algorithm. This can be a useful tool to conduct matching on data where you only have entity names, and no other details such as birth dates, nationalities, etc.
- `name-qualified` uses the score from the `name-based` mechanism but then considers other criteria, such as birth dates, nationalities, tax and registration identifiers. If any of these mismatch between the query and the result, the score is lowered. This attempts to anticipate a simple review process that a human analyst might otherwise undertake when a result is found.
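To make the idea behind `name-qualified` concrete, here is a toy sketch of lowering a name score when qualifying properties disagree. It is not the algorithm shipped in this release: the flat penalty of 0.1, the property list and the rule that a mismatch means both sides have values with none in common are all assumptions chosen for illustration.

```python
# Properties treated as qualifiers in this sketch (illustrative selection).
QUALIFIERS = ("birthDate", "nationality", "taxNumber", "registrationNumber")


def qualify_score(name_score, query_props, result_props, penalty=0.1):
    """Lower a name-only score for each qualifier that clearly mismatches.

    A qualifier counts as a mismatch only when both sides provide values
    and no value is shared. Missing data is never penalised.
    """
    score = name_score
    for prop in QUALIFIERS:
        query_values = set(query_props.get(prop, []))
        result_values = set(result_props.get(prop, []))
        if query_values and result_values and query_values.isdisjoint(result_values):
            score -= penalty  # hypothetical flat penalty
    return max(score, 0.0)


query = {"birthDate": ["1970-01-01"], "nationality": ["de"]}
result = {"birthDate": ["1980-05-05"], "nationality": ["de", "at"]}
lowered = qualify_score(0.9, query, result)
```

In this example only the birth date disagrees, so a single penalty is applied; the shared nationality leaves the score untouched.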
What's Changed
- Bump asyncstdlib from 3.10.6 to 3.10.7 by @dependabot in #250
- Bump types-aiofiles from 23.1.0.1 to 23.1.0.2 by @dependabot in #249
- Bump orjson from 3.8.10 to 3.8.11 by @dependabot in #246
- Bump uvicorn[standard] from 0.21.1 to 0.22.0 by @dependabot in #247
- Multiple scoring algorithms by @pudo in #251
- Stable patches by @pudo in #248
Full Changelog: v3.3.1...v3.4.0
v3.3.1
Updates nomenklatura to a new version that is fully statement-based.
What's Changed
- Bump aiocsv from 1.2.3 to 1.2.4 by @dependabot in #238
- Bump orjson from 3.8.8 to 3.8.10 by @dependabot in #237
- Bump elasticsearch[async] from 8.6.2 to 8.7.0 by @dependabot in #235
- Bump types-aiofiles from 23.1.0.0 to 23.1.0.1 by @dependabot in #232
- Bump structlog from 22.3.0 to 23.1.0 by @dependabot in #236
Full Changelog: v3.3.0...v3.3.1