Harmonization of substances #2

Open
tomlue opened this issue Dec 27, 2022 · 0 comments

tomlue commented Dec 27, 2022

After integrating three databases in #1, we need to start harmonizing things. Let's begin with chemicals. One process is to:

  1. Reshape the substances table to SID, KEY, VALUE
  2. Match substance ids that share the same identifiers
  3. Look for substances where some identifiers match but others don't. These require conflict resolution (a later step)

Example:

ksjdlk-ksdfj-ksdjf {cas:"123-1243-1233"}
qweret-qwer-ksdjf {cas:"123-1243-1233"}

becomes

ksjdlk-ksdfj-ksdjf cas 123-1243-1233
qweret-qwer-ksdjf cas 123-1243-1233

and we can match rows 1 and 2 because they share a CAS number.
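
A minimal Python sketch of the three steps above, using the example substances. The function names (`reshape`, `match`, `conflicts`) and the in-memory dict layout are assumptions made for illustration; the real substances table may be stored and queried differently.

```python
from collections import defaultdict

# Hypothetical in-memory form of the substances table: SID -> identifier dict.
substances = {
    "ksjdlk-ksdfj-ksdjf": {"cas": "123-1243-1233"},
    "qweret-qwer-ksdjf": {"cas": "123-1243-1233"},
}


def reshape(substances):
    """Step 1: flatten {SID: {KEY: VALUE}} into (SID, KEY, VALUE) rows."""
    return [
        (sid, key, value)
        for sid, identifiers in substances.items()
        for key, value in identifiers.items()
    ]


def match(rows):
    """Step 2: group SIDs that share the same (KEY, VALUE) identifier."""
    by_identifier = defaultdict(set)
    for sid, key, value in rows:
        by_identifier[(key, value)].add(sid)
    # Keep only identifiers shared by more than one substance.
    return {ident: sids for ident, sids in by_identifier.items() if len(sids) > 1}


def conflicts(rows, matches):
    """Step 3: within each matched group, report keys whose values disagree."""
    lookup = defaultdict(dict)  # SID -> {KEY: VALUE}
    for sid, key, value in rows:
        lookup[sid][key] = value
    found = []
    for sids in matches.values():
        all_keys = set().union(*(lookup[sid].keys() for sid in sids))
        for key in sorted(all_keys):
            distinct = {lookup[sid][key] for sid in sids if key in lookup[sid]}
            if len(distinct) > 1:
                found.append((tuple(sorted(sids)), key, sorted(distinct)))
    return found


rows = reshape(substances)
matches = match(rows)
```

Here `matches` maps `("cas", "123-1243-1233")` to both substance ids, and `conflicts(rows, matches)` is empty because the two substances carry no disagreeing identifiers. If groups overlap, this sketch may report the same conflict more than once.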

Note:
We might want to add a column to the substance table identifying the source(s) of each substance. This might allow us to improve chemical matching and curation. For example, one source might use the key 'casrn' to identify the CAS number while another might use 'cas'. In general, we should try to normalize the keys used by different sources, and knowing which sources use which keys is one step in that process.
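
One way the key normalization could look, as a rough sketch. The `KEY_ALIASES` table, the source names, and the (SID, SOURCE, KEY, VALUE) row layout are all assumptions; only the 'casrn' versus 'cas' case comes from the note above.

```python
from collections import defaultdict

# Hypothetical alias table mapping source-specific keys to one canonical key.
# Only the casrn -> cas entry is taken from the issue text.
KEY_ALIASES = {"casrn": "cas"}


def normalize_key(key):
    """Lowercase and trim a key, then map known aliases to the canonical key."""
    cleaned = key.strip().lower()
    return KEY_ALIASES.get(cleaned, cleaned)


def keys_by_source(rows):
    """Record which raw keys each source uses, to help build the alias table."""
    usage = defaultdict(set)
    for _sid, source, key, _value in rows:
        usage[source].add(key)
    return dict(usage)


def normalize_rows(rows):
    """Rewrite (SID, SOURCE, KEY, VALUE) rows so KEY is canonical."""
    return [
        (sid, source, normalize_key(key), value)
        for sid, source, key, value in rows
    ]


# Hypothetical rows that carry the proposed source column.
source_rows = [
    ("ksjdlk-ksdfj-ksdjf", "source_a", "casrn", "123-1243-1233"),
    ("qweret-qwer-ksdjf", "source_b", "cas", "123-1243-1233"),
]
normalized = normalize_rows(source_rows)
```

After normalization both rows use the key `cas`, so they can be matched on their shared CAS number even though the sources named the key differently.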

tomlue added a commit that referenced this issue Apr 21, 2024