Treatment of statements is inconsistent #150

Closed
markwhiting opened this issue Jun 11, 2024 · 2 comments · May be fixed by Watts-Lab/commonsense-statements#24
Labels
bug Something isn't working

Comments

@markwhiting
Member

This likely has 2 layers:

  1. some statements have trailing punctuation and capitalization (so we aren't cleaning them correctly somewhere in the pipeline)
  2. when statements do not have those things, they are displayed without them on the platform.

I think the behavior we would like is that statements should be stored in a consistent state (i.e., all have trailing punctuation and capitalization or all not have those things) and displayed in a consistent way (i.e., always shown with the first letter capitalized and a period at the end).

Additionally, I think we want to clean up any existing statements that have these issues without breaking IDs.

To consider this finished, we should have a test in place that checks that statements are correctly formatted after ingestion and at render time.
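
A minimal sketch of what the render-time check could look like, in Python. The names `to_display_form` and `is_display_form` are hypothetical and not taken from the codebase, and keeping an existing `?` or `!` instead of appending a period is an assumption rather than something decided in this issue.

```python
TERMINAL_PUNCTUATION = (".", "!", "?")


def to_display_form(statement: str) -> str:
    """Capitalize the first character and make sure the text ends with punctuation."""
    text = statement.strip()
    if not text:
        return text
    # Upper-case only the first character; str.capitalize() would lower-case the rest.
    text = text[0].upper() + text[1:]
    # Assumption: an existing "?" or "!" is kept; otherwise a period is appended.
    if not text.endswith(TERMINAL_PUNCTUATION):
        text += "."
    return text


def is_display_form(statement: str) -> bool:
    """True when formatting the statement again would not change it."""
    return statement == to_display_form(statement)


def test_render_formatting() -> None:
    raw = ["birds can fly", "Birds can fly.", "  is the sky blue?  "]
    rendered = [to_display_form(s) for s in raw]
    assert rendered == ["Birds can fly.", "Birds can fly.", "Is the sky blue?"]
    # Formatting is idempotent, so rendering twice is harmless.
    assert all(is_display_form(s) for s in rendered)
```

A similar test against the stored form would cover the ingestion side once we pick which stored form we want.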

@JamesPHoughton
JamesPHoughton commented Jul 15, 2024

Steps:

  • add a utility function that cleans statements
  • clean the entire db once, as a PR to the statements repo. Use raw statements for translation, then clean the English-language and foreign-language statements after translation. The cleaning rules may depend on the language.
  • set this up as part of the pipeline for statement ingestion
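
A rough sketch of the first two steps, in Python. Everything here is illustrative: the record shape (`id` and `statement` keys) and the function names are assumptions, and the choice to store statements without a trailing period is only one of the two options discussed above. Capitalization is left alone because lower-casing could damage proper nouns.

```python
import re


def clean_statement(text: str) -> str:
    """Collapse whitespace and drop trailing periods (assumed stored form)."""
    text = re.sub(r"\s+", " ", text).strip()
    # Caveat: this also strips the period from a trailing abbreviation like "etc."
    return text.rstrip(".").rstrip()


def clean_records(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return cleaned copies of the records plus a log of changes; IDs are untouched."""
    cleaned, changes = [], []
    for record in records:
        new_text = clean_statement(record["statement"])
        if new_text != record["statement"]:
            # The change log makes the one-off PR easy to review.
            changes.append(
                {"id": record["id"], "before": record["statement"], "after": new_text}
            )
        cleaned.append({**record, "statement": new_text})
    return cleaned, changes
```

Because only the `statement` field is rewritten and each record keeps its `id`, the one-off clean should not break existing references. The same `clean_statement` could then be called from the ingestion step.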

Blockers:

  • understanding how the local, dev, and prod dbs relate, and how we can move changes to prod
  • @dankim444 how should things look in other languages? Some languages don't use capitalization and punctuation in the same way.
  • Decide where to do this. Should we specify this in the cleaning utility function? Probably not render, but maybe some object that describes the mutations?
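
One way to make the language question concrete is a small table of per-language rules that the formatting step consults. This is a sketch only: the `LanguageRule` type, the language codes chosen, and the fallback to English-style rules are all assumptions, and the real list of languages and rules would need review by someone who knows them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageRule:
    terminal_mark: str       # sentence-final mark to append when missing
    has_letter_case: bool    # whether capitalizing the first letter makes sense


# Illustrative entries: Chinese and Japanese end sentences with an ideographic
# full stop, Hindi with a danda, and none of the three scripts has letter case.
RULES = {
    "en": LanguageRule(".", True),
    "zh": LanguageRule("。", False),
    "ja": LanguageRule("。", False),
    "hi": LanguageRule("।", False),
}
DEFAULT_RULE = LanguageRule(".", True)  # assumed fallback for unlisted languages


def format_for_language(statement: str, language: str) -> str:
    """Apply the display rules for the given language code."""
    rule = RULES.get(language, DEFAULT_RULE)
    text = statement.strip()
    if not text:
        return text
    if rule.has_letter_case:
        text = text[0].upper() + text[1:]
    # Leave existing question and exclamation marks (ASCII or full-width) alone.
    if not text.endswith((rule.terminal_mark, "?", "!", "？", "！")):
        text += rule.terminal_mark
    return text
```

Keeping the rules in data rather than in the function body would also give us the "object that describes the mutations" mentioned above.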

@dankim444 dankim444 self-assigned this Jul 19, 2024
@JamesPHoughton
JamesPHoughton commented Aug 5, 2024

The goal is to make the presentation consistent, more than to conform to any one specific format.

  • currently dealing with edge cases that have to do with "escaping" differences between different services that touch the statements (e.g. A string with a quote in it, and maybe some apostrophes, and a cursed backtick)
  • starting with NLP libraries to check
  • will complete by Friday (hopefully)
  • One idea is to run it through a language model, which should handle most situations.
  • Think about types of errors that we have, and figure out ways to detect and fix them specifically.
  • Count the number of times we see each type of issue and decide whether to make a rule or just manually fix them depending on how many there are.
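
The last two points could start from something like the Python sketch below, which tallies how many statements show each kind of issue. The issue names and detection rules are hypothetical examples, not an agreed list; the point is only that each rule is a small predicate and the counts tell us which ones are worth automating.

```python
import re
from collections import Counter

# Each check takes a raw statement and returns True when the issue is present.
ISSUE_CHECKS = {
    "leading_or_trailing_whitespace": lambda s: s != s.strip(),
    "repeated_whitespace": lambda s: re.search(r"\s{2,}", s.strip()) is not None,
    "trailing_punctuation": lambda s: s.strip().endswith((".", "!", "?")),
    "lowercase_start": lambda s: s.strip()[:1].islower(),
    "curly_quotes": lambda s: any(ch in s for ch in "\u2018\u2019\u201c\u201d"),
    "backtick": lambda s: "`" in s,
    # Escaping residue such as &amp; or &#39; left behind by another service.
    "html_entity": lambda s: re.search(r"&(?:[a-zA-Z]+|#\d+);", s) is not None,
}


def count_issues(statements: list[str]) -> Counter:
    """Count how many statements trigger each check."""
    counts: Counter = Counter()
    for statement in statements:
        for name, check in ISSUE_CHECKS.items():
            if check(statement):
                counts[name] += 1
    return counts
```

Running `count_issues` over the full statement list and sorting with `most_common()` would show which issue types are frequent enough to deserve a rule and which are rare enough to fix by hand.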
