wrong primary keys in GeoQuery schema #49
Comments
Thanks for the question! The primary keys can be multi-part, so in that case the key is actually ( |
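As a rough illustration of what a multi-part key means in practice, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The key columns named in the comment above are cut off, so the pair (mountain_name, state_name) used here is only an assumption, and the rows are made-up placeholders rather than GeoQuery data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy version of a mountain table. The composite key below is an
# assumption for illustration, not the schema confirmed in this thread.
conn.execute(
    """
    CREATE TABLE mountain (
        mountain_name TEXT,
        state_name    TEXT,
        PRIMARY KEY (mountain_name, state_name)
    )
    """
)

# Placeholder rows: two different mountains in the same state.
# This is allowed because state_name is only one part of the key.
conn.execute("INSERT INTO mountain VALUES ('peak_a', 'state_x')")
conn.execute("INSERT INTO mountain VALUES ('peak_b', 'state_x')")


def try_insert(row):
    """Return True if the row is accepted, False if it violates the key."""
    try:
        conn.execute("INSERT INTO mountain VALUES (?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Repeating the full (mountain_name, state_name) pair is rejected.
duplicate_accepted = try_insert(("peak_a", "state_x"))
row_count = conn.execute("SELECT COUNT(*) FROM mountain").fetchone()[0]
```

Under this reading, state_name on its own is not unique in the table; it only identifies a row together with the other key column.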
Wow, that was a quick response, thanks! Is this also the reason why primary key constraints are not declared in https://github.com/jkkummerfeld/text2sql-data/blob/master/data/geography-db.sql?
Lucky timing with when I took a break to check email :) On not being declared, that's an interesting point. I'm not sure what the MySQL default behaviour is (use all fields?) but it certainly would be better to have the DB file and our text schema match. I'll label this issue as an enhancement and keep it open.
Following up on this for my own future reference. Suhr et al.'s work also explored improvements to the schemas (see the data at https://github.com/google-research/language/tree/master/language/xsp). There is some subtlety to the process, though, as raised in this issue: google-research/language#80
Great to see that you want to get things right. Let me also share some extra analysis here. In principle, the purpose of primary and foreign key constraints is to inform the database software and allow it to perform consistency checks. But in the context of Lang2SQL translation they start playing a new role. Especially when we consider 0-shot evaluation on an unseen database, foreign key links let the model guess how to join tables in the query. The question typically contains close to no information on how tables should be joined. The column names sometimes give a hint in the right direction, but sometimes they are not informative (check out e.g. So it is important to get the foreign key links right in the database in order for Lang2SQL on an unseen database to be feasible. But when the database schema is not fully normalized, this can be non-trivial. For example, in the IMDB database the The few options available in such a situation are as follows:
As far as I understand, Suhr et al. chose the option of adding "virtual" foreign key links (see e.g. the tuple So far I have performed a detailed audit of GeoQuery, Restaurants, IMDB, Yelp and Academic. Yelp and Academic are perfectly normalized, and legitimate foreign key constraints are easy to define (although it would be nice if they were defined in the MySQL dump). I have described the situation with IMDB above. As for GeoQuery and Restaurants, even though these databases are not normalized at all, it is quite clear which foreign key links are sensible (both as constraints on DB content and as hints to Lang2SQL models).
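To make the "foreign keys as join hints" idea concrete, here is a minimal sketch, assuming a toy two-table schema and SQLite instead of the MySQL dump discussed in this thread. The table layout and the helper name join_hints are hypothetical; the only library feature relied on is SQLite's PRAGMA foreign_key_list, which reports the foreign keys declared on a table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy schema (an assumption for illustration, not the real GeoQuery dump).
conn.executescript(
    """
    CREATE TABLE state (
        state_name TEXT PRIMARY KEY
    );
    CREATE TABLE city (
        city_name  TEXT,
        state_name TEXT REFERENCES state (state_name),
        PRIMARY KEY (city_name, state_name)
    );
    """
)


def join_hints(conn, table):
    """Turn the declared foreign keys of `table` into candidate join conditions."""
    hints = []
    # Each row is (id, seq, referenced_table, from_column, to_column, ...).
    # The table name is interpolated directly, so pass trusted names only.
    for row in conn.execute(f"PRAGMA foreign_key_list({table})"):
        ref_table, from_col, to_col = row[2], row[3], row[4]
        hints.append(f"{table}.{from_col} = {ref_table}.{to_col}")
    return hints


city_hints = join_hints(conn, "city")    # one declared link to state
state_hints = join_hints(conn, "state")  # no declared links
```

If the dump declares no foreign keys, as is the case for tables like state here, the list comes back empty and a model has only column names to go on, which is the problem described above.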
Interesting, and tricky! You may want to contact the authors of https://arxiv.org/abs/2010.02840 as I spoke to them and they had to do some similar work (in their case, adapting the schemas to match the format of Spider).
Also, I should note, I don't expect either Cathy or myself to have the time to make this update in the near future (6 months at least). She is busy with her work at IBM and I am on the job market this cycle. We'd be happy to review and accept a pull request, though, or will hopefully get to it down the line.
It appears to me that the primary key information in geography-schema.csv is wrong. For example, how can state_name be the primary key in the mountain table?