Why are entities extracted from examples, and how can I avoid them #233
Comments
Hi @NaiveteYaYa, this is likely because the LLM is making a mistake. Could you share an example of the code you're using? I may be able to help you improve the results.
Facing the same issue!
@NaiveteYaYa / @ahmed-bhs please share a snippet of code. It's hard to know how to help without context :) What code are you using, and which LLM?
Absent other information, folks bumping into this issue should refer to the guidelines: https://eyurtsev.github.io/kor/guidelines.html
@eyurtsev thank you for your help, I really appreciate it. For context, I'm extracting chemical information from raw text. This is my prompt; you can test it using ChatGPT:

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript
information: { // Extracting chemichal informations for an fds file. Please filter and remove duplicate items.
 product_name: string // Nom du produit chimique
 manufacturer_name: string // Le nom du fornisseur
 phrase_euh: Array<string> // 6-character code for additional information statements (EUH phrases). pattern: ^EUH\d{3}$'
 phrase_h: Array<string> // Code à 4 caractères pour les mention de danger (phrases H). pattern: ^H\d{3}$
 phrase_ghs: Array<string> // Code à 5 caractères pour les pictogrammes de danger (phrases GHS). pattern: ^GHS\d{3}$
 phrase_p: Array<string> // Code à 4 caractères pour les conseils de prudence (phrases P). pattern: ^(P\d+(\s*\+\s*P\d+)*)$
 substances: Array<{ // les paires code CAS et code EC des substances.
  ec: string //
  cas: string //
 }>
 warning_notice: string // Mention d’avertissement, par défaut NA. pattern: '/DANGER|ATTENTION|NA/
}
```

Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.

Input: [user input]
Output:

This is the [user input] (it is in French):

This is the output result. As you can see, "phrase_euh" and "phrase_ghs" contain duplicated items:

```json
{
  "information": {
    "product_name": "TUV IDROTOP VELOURS",
    "manufacturer_name": "CROMOLOGY SERVICES",
    "phrase_euh": ["EUH208", "EUH208", "EUH208", "EUH211"],
    "phrase_h": [],
    "phrase_ghs": ["GHS08", "GHS05", "GHS07", "GHS09", "GHS05", "GHS09", "GHS06", "GHS05", "GHS09", "GHS06", "GHS05", "GHS09"],
    "phrase_p": ["P102", "P271", "P273", "P501"],
    "substances": [
      {"ec": "236-675-5", "cas": "13463-67-7"},
      {"ec": "220-120-9", "cas": "2634-33-5"},
      {"ec": "220-239-6", "cas": "2682-20-4"}
    ],
    "warning_notice": "NA"
  }
}
```

This is part of my code:

```python
from kor import create_extraction_chain, Object, Text
from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Raw strings (r"...") keep the regex backslashes in the descriptions intact.
schema = Object(
    id="information",
    description=(
        "Extracting chemichal informations for an fds file. "
        "Please filter and remove duplicate items."
    ),
    attributes=[
        Text(id="product_name", description="Nom du produit chimique", default="NA"),
        Text(id="manufacturer_name", description="Le nom du fornisseur", default="NA"),
        Text(
            id="phrase_euh",
            description=r"6-character code for additional information statements (EUH phrases). pattern: ^EUH\d{3}$'",
            examples=[],
            default=[],
            many=True,
        ),
        Text(
            id="phrase_h",
            description=r"Code à 4 caractères pour les mention de danger (phrases H). pattern: ^H\d{3}$",
            examples=[],
            default=[],
            many=True,
        ),
        Text(
            id="phrase_ghs",
            description=r"Code à 5 caractères pour les pictogrammes de danger (phrases GHS). pattern: ^GHS\d{3}$",
            examples=[],
            default=[],
            many=True,
        ),
        Text(
            id="phrase_p",
            description=r"Code à 4 caractères pour les conseils de prudence (phrases P). pattern: ^(P\d+(\s*\+\s*P\d+)*)$",
            examples=[],
            default=[],
            many=True,
        ),
        Object(
            id="substances",
            description="les paires code CAS et code EC des substances. ",
            attributes=[
                Text(id="ec"),
                Text(id="cas"),
            ],
            default=[],
            many=True,
        ),
        Text(
            id="warning_notice",
            description="Mention d’avertissement, par défaut NA. pattern: '/DANGER|ATTENTION|NA/",
        ),
    ],
    many=False,
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a language detection bot specialized in extracting information from FDS text."),
    ]
)

llm = ChatOpenAI(
    model="gpt-3.5-turbo-16k",
    temperature=0.2,
    model_kwargs={"frequency_penalty": 0, "presence_penalty": 0, "top_p": 0.1},
)

chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="json", verbose=False)

# user_input_text holds the raw FDS text (not shown here).
with get_openai_callback() as cb:
    response = chain.run(user_input_text)
    print(response)
```

Cheers!
Follow the guidelines here: https://eyurtsev.github.io/kor/guidelines.html
Also, if the codes have a standardized format, consider mixing in approaches with regular expressions. Eugene
@eyurtsev thank you for your answer. Could you give an example of how I can mix in approaches with regular expressions?
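One way to read the regular-expression suggestion is to let the LLM handle the free-text fields (product name, manufacturer) and pull the standardized codes straight from the raw text with `re`. The sketch below is only an illustration under assumptions: `CODE_PATTERNS`, `extract_codes`, and `merge_with_llm_output` are hypothetical names, not part of Kor, and the patterns are guesses based on the codes shown in this thread (for example, two digits for GHS, since the sample output contains `GHS08` rather than a three-digit code).

```python
import re

# Hypothetical patterns inferred from the codes in this thread; adjust them
# to the formats that actually occur in your safety data sheets.
CODE_PATTERNS = {
    "phrase_euh": re.compile(r"\bEUH\d{3}\b"),
    "phrase_h": re.compile(r"\bH\d{3}\b"),
    "phrase_ghs": re.compile(r"\bGHS\d{2}\b"),
    "phrase_p": re.compile(r"\bP\d{3}(?:\s*\+\s*P\d{3})*\b"),
}


def extract_codes(text):
    """Find every code in the raw text, de-duplicated in order of appearance."""
    result = {}
    for field, pattern in CODE_PATTERNS.items():
        # Drop inner whitespace so "P301 + P310" and "P301+P310" compare equal.
        matches = (re.sub(r"\s+", "", m.group(0)) for m in pattern.finditer(text))
        # dict.fromkeys removes duplicates while preserving insertion order.
        result[field] = list(dict.fromkeys(matches))
    return result


def merge_with_llm_output(llm_info, text):
    """Keep the LLM's free-text fields, but take the code fields from the regexes."""
    merged = dict(llm_info)
    merged.update(extract_codes(text))
    return merged
```

Here `llm_info` is assumed to be the inner `information` dictionary of the extraction result, and `text` the same raw input that was passed to the chain. Because the codes come from the input itself, this also avoids picking up codes that only appear in prompt examples.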
You could experiment by adding more guardrails as well to avoid the duplication issue.
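A guardrail of this kind can be as small as a post-processing step on the parsed result. The following is a minimal sketch, not a Kor feature: `VALIDATORS` and `clean_codes` are hypothetical names, the patterns are assumptions based on the sample output above, and the optional `source_text` check is one possible way to drop codes that the model copied from prompt examples instead of the input.

```python
import re

# Assumed formats, based on the sample output in this thread.
VALIDATORS = {
    "phrase_euh": re.compile(r"EUH\d{3}"),
    "phrase_h": re.compile(r"H\d{3}"),
    "phrase_ghs": re.compile(r"GHS\d{2}"),
}


def clean_codes(info, source_text=None):
    """Return a copy of `info` with malformed and duplicate codes removed.

    If `source_text` is given, codes that do not literally occur in it are
    dropped as well.
    """
    cleaned = dict(info)
    for field, pattern in VALIDATORS.items():
        kept = []
        for code in info.get(field, []):
            code = code.strip().upper()
            if not pattern.fullmatch(code):
                continue  # wrong format
            if source_text is not None and code not in source_text:
                continue  # not present in the input text
            if code not in kept:
                kept.append(code)  # first occurrence only, order preserved
        cleaned[field] = kept
    return cleaned
```

Applied to the `information` dictionary from the output posted above, this would reduce `phrase_ghs` to its five distinct codes and `phrase_euh` to two, leaving the other fields untouched.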