Modify validity check + clean up creds #209

wendy-aw · 2024-08-13T09:23:12Z

Modify translate script to test SQL validity on the existing DBs instead of creating empty test DBs. This allows us to flag translated SQL that is invalid because it returns empty results.
Also took the chance to modify 1 question-SQL pair that had empty results, and remove a redundant alternate query.
Cleaned up the creds in dialects.py to always refer to db_creds_all in utils.creds
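
For reviewers, here is a minimal sketch of the kind of check described above. It is not the code in translate_sql_dialect.py. It uses Python's built-in sqlite3 module, a hypothetical helper named `check_sql`, and a small in-memory table standing in for an existing, populated DB. A query is flagged if it raises an error or returns no rows.

```python
import sqlite3


def check_sql(conn, sql):
    """Hypothetical helper: run sql on a populated DB and return (is_valid, reason)."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as e:
        # Syntax errors, missing tables/columns, etc.
        return False, f"error: {e}"
    if not rows:
        # The query runs but returns nothing; an empty test DB could never reveal this.
        return False, "empty result"
    return True, "ok"


# In-memory stand-in for an existing database (assumption, for illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, sale_price REAL)")
conn.execute("INSERT INTO sales (sale_price) VALUES (100.0), (250.0)")

results = {
    "good": check_sql(conn, "SELECT SUM(sale_price) FROM sales"),
    "empty": check_sql(conn, "SELECT id FROM sales WHERE sale_price > 1000"),
    "broken": check_sql(conn, "SELECT nonexistent FROM sales"),
}
for name, outcome in results.items():
    print(name, outcome)
```

The point of running against real data is the middle case: on an empty test DB every query returns zero rows, so a translated query that silently filters everything out looks the same as a correct one.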

rishsriv · 2024-08-13T09:24:51Z

translate_sql_dialect.py

 from tqdm import tqdm
 from eval.eval import get_all_minimal_queries
 import os

 tqdm.pandas()

 dataset_file = (
-    "data/instruct_advanced_postgres.csv"  # Postgres dataset file to translate
+    "data/instruct_advanced_postgresTEST.csv"  # Postgres dataset file to translate


Typo spotted!

thanks! i'll revert to the original

rishsriv

Thanks Wendy! Looks good to me – thanks for the big changes in dialects.py in particular, that was a very useful blind spot to look at!

wongjingping

just 1 small comment!

wongjingping · 2024-08-13T09:28:56Z

data/instruct_advanced_sqlite.csv

@@ -54,7 +54,7 @@ Last 30 days = DATE('now', '-30 days') to DATE('now')
 PMSPS = per month salesperson signups
 To get the total sales amount per salesperson, join the salespersons and sales tables, group by salesperson, and sum the sale_price. Always order results with NULLS last.
 Truncate date to week for aggregation."
-car_dealership,sqlite,instructions_cte_join,"WITH recent_sales AS (SELECT sp.id, sp.first_name, sp.last_name, COUNT(s.id) AS num_sales FROM salespersons AS sp LEFT JOIN sales AS s ON sp.id = s.salesperson_id WHERE s.sale_date >= DATE('now', '-30 days') GROUP BY sp.id) SELECT id, first_name, last_name, num_sales FROM recent_sales ORDER BY num_sales DESC;WITH recent_sales AS (SELECT sp.id, sp.first_name, sp.last_name, COUNT(s.id) AS num_sales FROM salespersons AS sp LEFT JOIN sales AS s ON sp.id = s.salesperson_id AND s.sale_date >= DATE('now', '-30 days') GROUP BY sp.id, sp.first_name, sp.last_name) SELECT id, first_name, last_name, num_sales FROM recent_sales ORDER BY num_sales DESC;","How many sales did each salesperson make in the past 30 days, inclusive of today's date? Return their ID, first name, last name and number of sales made, ordered from most to least sales.","To get the number of sales made by each salesperson in the past 30 days, join the salespersons and sales tables and filter for sales in the last 30 days.","When using car makes, model names, engine_type, and vin_number, ensure matching is case-insensitive and allows for partial matches using LIKE with wildcards.
+car_dealership,sqlite,instructions_cte_join,"WITH recent_sales AS (SELECT sp.id, sp.first_name, sp.last_name, COUNT(s.id) AS num_sales FROM salespersons AS sp LEFT JOIN sales AS s ON sp.id = s.salesperson_id AND s.sale_date >= DATE('now', '-30 days') GROUP BY sp.id) SELECT id, first_name, last_name, num_sales FROM recent_sales ORDER BY num_sales DESC;","How many sales did each salesperson make in the past 30 days, inclusive of today's date? Return their ID, first name, last name and number of sales made, ordered from most to least sales.","To get the number of sales made by each salesperson in the past 30 days, join the salespersons and sales tables and filter for sales in the last 30 days.","When using car makes, model names, engine_type, and vin_number, ensure matching is case-insensitive and allows for partial matches using LIKE with wildcards.


Should it be GROUP BY sp.id, sp.first_name, sp.last_name) instead of GROUP BY sp.id)? Since we're selecting 3 columns

Eh sorry i made a mistake. Shouldn't have removed the original because it provided an alternative where zero counts were not included.

On the grouping, I found that grouping by id alone produced the exact same results, so I thought the other grouping cols were redundant. Is there a fundamental difference in meaning that I'm missing?

Answering my own question: Ohh it's cos of postgres' flexi rules that it still works with grouping only by id. Should be better practice to add all selected cols in GROUP BY. Lemme revert to original!
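
A small sketch of the two points in this thread, run on SQLite through Python's sqlite3 module with made-up rows (the table contents and the `run` helper are assumptions for illustration, not the project's data). It shows that putting the date filter in WHERE drops salespersons with zero sales while putting it in the ON clause keeps them, and that grouping by id alone returns the same rows as grouping by all selected columns when id is the primary key. SQLite permits bare columns in aggregate queries; Postgres accepts the shorter GROUP BY only because the other columns are functionally dependent on the primary key, and stricter dialects may reject it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salespersons (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE sales (id INTEGER PRIMARY KEY, salesperson_id INTEGER, sale_date TEXT);
INSERT INTO salespersons VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Tan');
-- Ann has two recent sales, Bob has none
INSERT INTO sales (salesperson_id, sale_date)
VALUES (1, DATE('now', '-1 days')), (1, DATE('now', '-2 days'));
""")

QUERY = """
SELECT sp.id, sp.first_name, sp.last_name, COUNT(s.id) AS num_sales
FROM salespersons AS sp
LEFT JOIN sales AS s ON sp.id = s.salesperson_id {date_filter}
GROUP BY {group_cols}
ORDER BY num_sales DESC
"""
DATE_COND = "s.sale_date >= DATE('now', '-30 days')"
ALL_COLS = "sp.id, sp.first_name, sp.last_name"


def run(date_filter, group_cols):
    """Hypothetical helper: fill in the query template and fetch all rows."""
    return conn.execute(
        QUERY.format(date_filter=date_filter, group_cols=group_cols)
    ).fetchall()


# Filter in WHERE: rows with no matching sale have NULL sale_date and are dropped
filter_in_where = run("WHERE " + DATE_COND, ALL_COLS)
# Filter in ON: the LEFT JOIN keeps every salesperson, with a count of 0 if needed
filter_in_on = run("AND " + DATE_COND, ALL_COLS)
# Same as above but grouping only by the primary key
group_by_id_only = run("AND " + DATE_COND, "sp.id")

print(filter_in_where)   # only Ann
print(filter_in_on)      # Ann and Bob (Bob with 0)
print(group_by_id_only == filter_in_on)
```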

wendy-aw added 6 commits August 13, 2024 11:58

modify creds

5c6f3eb

clean up creds

93a5942

validity check on existing DBs

2a83cac

remove redundant query alternative

620eee7

correct qn/sql to give non-empty result

269a22a

remove unused functions

37c7f91

wendy-aw requested review from wongjingping and rishsriv August 13, 2024 09:23

rishsriv reviewed Aug 13, 2024


wendy-aw added 2 commits August 13, 2024 17:25

lint

94e7db7

change dataset file name

627611e

rishsriv approved these changes Aug 13, 2024


wongjingping approved these changes Aug 13, 2024


revert to original sql

f548b8a

wendy-aw merged commit d00ffdf into main Aug 13, 2024
2 checks passed

wendy-aw deleted the wendy/test_validity branch August 13, 2024 10:01

This pull request was closed.
