[syntax-errors] Remaining syntax errors raised by the compiler

## Summary

A number of semantic or compile-time syntax errors have already been implemented
in #11934, but there are still many remaining ones. The purpose of this issue is
to track progress on this remainder. The initial source for most of these is
[this comment], which reports fuzzer results comparing Python's compiler and
ruff, but others may be included over time.

The next section lists the syntax errors that still need to be implemented, and
the sections after that include some more information on getting started with an
implementation.

## Syntax Errors

### Errors mapping to current rules

These might be the easiest to start with because the existing rules have working
implementations and corresponding tests. However, they could possibly require
new context methods (see below), which could complicate things.

- [ ] `return` with value in async generator (B901)
  - https://github.com/astral-sh/ruff/pull/21200
  - partial overlap with B901, which also applies to sync generators. This is
    only a syntax error in the async case
- [x] no binding for nonlocal (PLE0117)
  - https://github.com/astral-sh/ruff/pull/21032
- [x] `break` outside loop (F701)
  - https://github.com/astral-sh/ruff/pull/20556
  - https://github.com/astral-sh/ruff/pull/20944
- [x] `continue` not properly in loop (F702)
  - https://github.com/astral-sh/ruff/pull/20869
  - https://github.com/astral-sh/ruff/pull/20944
- [x] `yield from` in async function (PLE1700)
  - https://github.com/astral-sh/ruff/pull/20051
- [x] future feature is not defined (F407)
  - https://github.com/astral-sh/ruff/pull/20554
- [x] import `*` only allowed at module level (F406)
  - https://github.com/astral-sh/ruff/pull/20166
- [x] multiple starred expressions in assignment (F622)
  - https://github.com/astral-sh/ruff/pull/20243

### Errors mapping to current parse errors

These are also straightforward and will be similar to #17131.

- duplicate keyword argument
  ```python
  def foo(x): ...
  
  foo(x=1, x=2)  # SyntaxError: keyword argument repeated: x
  ```

### New errors

- [ ] name is parameter and global
  ``` python
  def f(a): global a
  ```
- [ ] name is parameter and nonlocal
  ``` python
  def f(a): nonlocal a
  ```
- [ ] annotated name can't be global
  ``` python
  x: int = 1

  def f():
      global x
      x: str = "foo"  # SyntaxError: annotated name 'x' can't be global
  ```
- [x] alternative patterns bind different names (https://github.com/astral-sh/ruff/pull/20682)
    ```python
    match 42:
        case [x] | [y]: ...
    ```
    This can probably be handled in the `MatchPatternVisitor` along with the other `match`-related errors.

- [x] can't use starred expression here (https://github.com/astral-sh/ruff/pull/17624)
  ``` python
  x = *value
  ```
- [x] nonlocal declaration not allowed at module level (https://github.com/astral-sh/ruff/pull/17559)
  ``` python
  nonlocal x
  ```

### Extensions to existing syntax errors

- Assigning to and deleting attributes named `__debug__` is also a syntax error
  ```python
  __debug__ = False  # Currently caught by ruff
  x.__debug__ = False  # Not currently caught, but also a syntax error
  ```

### Encoding issues

These syntax errors are all related to #6791 and should only be implemented if
we decide to respect these alternate encodings in general.

These can also be a bit tricky to put into a file, so most of these have shell
code snippets for generating the input rather than Python code directly.

- ascii codec can't decode byte `...` in position `...`: ordinal not in `range(128)`

  ```shell
  printf '# -*- coding: ascii -*-\n\xc2' > example.py
  ```

- charmap codec can't decode byte ...

  ```shell
  printf '# -*- coding: cp1252 -*-\n\x8d' > example.py
  ```

- utf7 codec can't decode byte 0xc3

  ```shell
  printf '# -*- coding: utf-7 -*-\n\xc3' > example.py
  ```

  This one reports an `E902` error, but only if it's enabled. Otherwise we only print a warning about invalid UTF-8.

- encoding problem: `...` with BOM

  ``` shell
  printf '\ufeff# -*- coding: ascii -*-' > example.py
  ```

  I think any encoding other than UTF-8 with a BOM (`\uFEFF`) will cause this issue.

- invalid character
- invalid non-printable character
  - both this and `invalid character` mostly seem to happen with the
    `iso-8859-1` encoding and a unicode character

- unknown encoding

  ```python
  # coding: not-a-real-encoding
  ```
- source code string cannot contain null bytes
  ```shell
  printf '# \x00' > example.py
  ```
  This mostly seems like an issue in comments. This also isn't exactly an encoding error, but it's more closely related to these. It's also raised by the CPython parser, not the compiler.

## Implementation

Issue #11934 contains links to each PR implementing a new semantic syntax error.
These can be used as examples of what a new implementation looks like, but this
section has a few more general notes.

All of the semantic syntax errors currently live in [`semantic_errors.rs`] and
correspond to a variant of the [`SemanticSyntaxErrorKind`] enum. The
[`SemanticSyntaxChecker`] is responsible for emitting these errors, but unlike
other AST visitors, it does not generally traverse the AST on its owns[^1]. Instead,
it relies on a parent visitor to call its `visit_stmt` and `visit_expr` methods
in the parent visitor's `visit_stmt` or `visit_expr` methods.

The [`SemanticSyntaxChecker`] also tracks very little of its own state. Instead,
it again defers to the parent visitor via the [`SemanticSyntaxContext`] trait.
Examples of methods on this trait include:
* `python_version` -- returns the target Python version for linting
* `in_async_context` -- returns whether or not async code like `await` or `async
  for` loops are allowed in the current scope

Typically the parent visitor, like the [`Checker`] from the `ruff_linter` crate,
will implement [`SemanticSyntaxContext`] and pass itself to any
[`SemanticSyntaxChecker`] method requiring a context.

Another important context method is `report_semantic_error`, which may require
special handling of new [`SemanticSyntaxErrorKind`] variants. For example, some
syntax errors emitted by CPython overlap with existing ruff rules (e.g.
[`await-outside-async`
(PLE1142)](https://docs.astral.sh/ruff/rules/await-outside-async/)). These need
to be transformed into normal ruff [`Diagnostic`]s instead of being reported
directly as syntax errors. This happens within the
[`Checker::report_semantic_error`] method.

## Testing

There are two main ways of testing new semantic errors. The first, and easier,
method is to write inline parser tests using `test_ok` and `test_err` comments,
for example:

https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_python_parser/src/semantic_errors.rs#L80-L95

These are automatically extracted from the source code by the parser's
[`generate_inline_tests`] function and then run like normal snapshot tests.
These work very well for simple errors that don't require detailed information
from the [`SemanticSyntaxContext`].

For errors that do require such contextual information tracked by the parent
visitor, a better alternative is to use integration tests, such as those found
in `linter.rs` in the `ruff_linter` crate:

https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_linter/src/linter.rs#L1015-L1025

https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_linter/src/linter.rs#L1064-L1066

The second of these can be reused for other semantic errors that correspond to
ruff rules, while something like the first example can be used for new
syntax-error-specific tests. In both cases, these integration tests will take
advantage of the existing context tracking done by the real [`Checker`] instead
of trying to duplicate that tracking in the parser's test
[`SemanticSyntaxCheckerVisitor`].

## Comparing to the fuzzer results

Again, the initial errors in this issue were extracted from [this comment] in
#7633 by following this procedure:
1. Download and unzip the fuzzer results from [here](https://github.com/qarmin/Automated-Fuzzer/releases/download/test/Cpython.no.parsable.parsable.by.ruff.zip)
2. Run the Python script below to get a partially-deduplicated list of errors
3. Double check these errors against ruff and CPython's parser and compiler

<details>
<summary>Python script for processing fuzzer results</summary>

```python
import argparse
import compileall
import contextlib
import dataclasses
import io
import os
import re
import subprocess
from collections import Counter

FILE = re.compile(r'File "([^"]+)", line (\d+)')
MESSAGE = re.compile(r"SyntaxError: (.*)")


@dataclasses.dataclass
class Error:
    file: str
    line_number: int
    message: str


# Ruff rules already re-implemented as semantic syntax errors, not just rules
# that match CPython syntax errors.
RULES = [
    "F404",
    "F704",
    "F706",
    "PLE0118",
    "PLE1142",
]


def run_ruff(ruff_bin, path):
    return subprocess.run(
        [
            ruff_bin,
            "check",
            "--no-cache",
            "--silent",
            "--preview",
            "--config",
            f"lint.select={RULES}",
            path,
        ]
    ).returncode


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--ruff-bin", default="ruff")

    args = parser.parse_args()

    with (
        contextlib.redirect_stdout(io.StringIO()) as out,
        contextlib.redirect_stderr(os.devnull),
    ):
        compileall.compile_dir(".", force=True, quiet=1)

    seen_messages = Counter()
    errors = list()

    file, line_number = None, None
    for line in out.getvalue().splitlines():
        if m := FILE.search(line):
            file, line_number = m.groups()
        if m := MESSAGE.search(line):
            message = m[1]
            # combine repetitive messages into single entries
            for prefix in [
                "unknown encoding",
                "invalid escape sequence",
                "'ascii' codec can't decode",
                "future feature",
                "invalid character",
                "invalid non-printable character",
                "no binding for nonlocal",
            ]:
                if message.startswith(prefix):
                    message = prefix
                    break
            if (
                message not in seen_messages
                and run_ruff(args.ruff_bin, file) == 0
            ):
                errors.append(Error(file, line_number, message))

            seen_messages[message] += 1

    for error in sorted(errors, key=lambda e: e.message):
        count = seen_messages[error.message]
        print(f"{error.file}:{error.line_number}: {error.message} ({count})")


if __name__ == "__main__":
    main()
```
</p>
</details>

[^1]: There are some local violations of this principle. For example, the
    [`ReboundComprehensionVisitor`] recursively visits the expressions in a
    comprehension looking for any expressions rebinding one of the
    comprehension's iteration variables. However, this is a highly-localized AST
    traversal and won't have the same kind of performance implications as an
    additional full AST traversal.

[this comment]: https://github.com/astral-sh/ruff/issues/7633#issuecomment-1740424031
[`semantic_errors.rs`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_python_parser/src/semantic_errors.rs
[`SemanticSyntaxErrorKind`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_python_parser/src/semantic_errors.rs#L891
[`SemanticSyntaxChecker`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_python_parser/src/semantic_errors.rs#L19
[`ReboundComprehensionVisitor`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_python_parser/src/semantic_errors.rs#L1291
[`SemanticSyntaxContext`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_python_parser/src/semantic_errors.rs#L1586C11-L1586C32
[`Checker`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_linter/src/checkers/ast/mod.rs#L182
[`Diagnostic`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_diagnostics/src/diagnostic.rs#L22
[`Checker::report_semantic_error`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_linter/src/checkers/ast/mod.rs#L572
[`generate_inline_tests`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_python_parser/tests/generate_inline_tests.rs#L24
[`SemanticSyntaxCheckerVisitor`]: https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_python_parser/tests/fixtures.rs#L470



	Stmt::Assign(ast::StmtAssign { targets, .. }) => {
	if let [Expr::Starred(ast::ExprStarred { range, .. })] = targets.as_slice() {
	// test_ok single_starred_assignment_target
	// (*a,) = (1,)
	// *a, = (1,)
	// [*a] = (1,)

	// test_err single_starred_assignment_target
	// *a = (1,)
	Self::add_error(
	ctx,
	SemanticSyntaxErrorKind::SingleStarredAssignment,
	*range,
	);
	}
	}

	#[test_case(
	"deferred_function_body",
	"
	async def f(): [x for x in foo()] and [x async for x in foo()]
	async def f():
	def g(): ...
	[x async for x in foo()]
	",
	PythonVersion::PY310
	)]
	fn test_async_comprehension_in_sync_comprehension(

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[syntax-errors] Remaining syntax errors raised by the compiler #17412

Summary

Syntax Errors

Errors mapping to current rules

Errors mapping to current parse errors

New errors

Extensions to existing syntax errors

Encoding issues

Implementation

Testing

Comparing to the fuzzer results

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

	#[test_case(Rule::YieldOutsideFunction, Path::new("yield_scope.py"))]
	#[test_case(Rule::ReturnOutsideFunction, Path::new("return_outside_function.py"))]
	fn test_syntax_errors(rule: Rule, path: &Path) -> Result<()> {

[syntax-errors] Remaining syntax errors raised by the compiler #17412

Description

Summary

Syntax Errors

Errors mapping to current rules

Errors mapping to current parse errors

New errors

Extensions to existing syntax errors

Encoding issues

Implementation

Testing

Comparing to the fuzzer results

Footnotes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions