- 
                Notifications
    
You must be signed in to change notification settings  - Fork 1k
 
Description
Environment
python version
Python 3.12.8
sqlglot commit
cb9338ad8ed35eca80d57509c1de049b0110a9d8
Problem Description
The SQL optimizer incorrectly alters the logic of a query that uses CTEs and nested JOIN operations.
The original query is designed to start with a core dataset (an INNER JOIN of t0 and t1), and then enrich it by LEFT JOIN-ing two other data sources: t3 and a pre-joined result of t4 and t5. The key intent is that records from the core dataset should be preserved even if they don't have a corresponding match in t3 or in the (t4 INNER JOIN t5) subquery.
However, the optimized query flattens the JOIN structure and incorrectly changes the final join to t5 into an INNER JOIN (written as JOIN). This causes any record where t4.id is not found (resulting in a NULL from the LEFT JOIN t4) or where t4.country_code does not exist in t5 to be filtered out from the final result set. This behavior contradicts the logic of the original query and leads to unintended data loss.
Fully reproducible code snippet
import sqlglot
import sqlglot.optimizer as sgopt
dialect = "spark"
sql = '''
WITH `res` AS (
  SELECT
    `t2`.`id`,
    CAST(IF(NOT `t3`.`id` IS NULL, 1, 0) AS BIGINT) AS `id_active`,
    `t6`.`country_id` AS `cid`,
    `t2`.`uip`,
  FROM (
    SELECT
      `t0`.`id`,
      `t0`.`uip`,
    FROM `t0`
    INNER JOIN `t1`
      ON `t0`.`id` = `t1`.`id`
  ) AS `t2`
  LEFT OUTER JOIN `t3`
    ON `t2`.`id` = `t3`.`id`
  LEFT OUTER JOIN (
    SELECT
      `t4`.`id`,
      `t5`.`country_code`,
      `t5`.`country_id`
    FROM `t4`
    INNER JOIN `t5`
      ON `t4`.`country_code` = `t5`.`country_code`
  ) AS `t6`
    ON `t2`.`id` = `t6`.`id`
)
INSERT OVERWRITE TABLE `res_table`
SELECT
  *
FROM `res`
'''
print(sgopt.optimize(sqlglot.parse_one(sql, dialect=dialect)).sql(dialect=dialect, pretty=True))result is
INSERT OVERWRITE TABLE `res_table`
SELECT
  `t0`.`id` AS `id`,
  CAST(IF(NOT `t3`.`id` IS NULL, 1, 0) AS BIGINT) AS `id_active`,
  `t5`.`country_id` AS `cid`,
  `t0`.`uip` AS `uip`
FROM `t0` AS `t0`
JOIN `t1` AS `t1`
  ON `t0`.`id` = `t1`.`id`
LEFT JOIN `t3` AS `t3`
  ON `t0`.`id` = `t3`.`id`
LEFT JOIN `t4` AS `t4`
  ON `t0`.`id` = `t4`.`id`
JOIN `t5` AS `t5`
  ON `t4`.`country_code` = `t5`.`country_code`however, in the optimized sql code. the third join condition should be left join not inner join. in the original SQL code, the condition should be  (t0 inner join t1) left join (t3) left join (t4 inner join t5). but the optimized code remove the brackets.
In the final result, the optimized code will remove all NULL country_code from the table, which is not expected.
here is the expected sql
SELECT
  t0.id AS id,
  CAST(IF(t3.id IS NOT NULL, 1, 0) AS BIGINT) AS id_active,
  t5.country_id AS cid,
  t0.uip AS uip
FROM t0
JOIN t1 ON t0.id = t1.id
LEFT JOIN t3 ON t0.id = t3.id
LEFT JOIN t4 ON t0.id = t4.id
-- CRITICAL FIX: This must be a LEFT JOIN to preserve records that don't match in t5.
LEFT JOIN t5 ON t4.country_code = t5.country_code
ORDER BY id;Official Documentation
N/A