
Lexical Structure

dylanjtuttle edited this page Aug 11, 2022 · 4 revisions

Encoding

soup accepts only ASCII-encoded source files, although not every ASCII character can appear in a valid token.

Ignored

Whitespace

"Whitespace" consists of any combination of one or more of the following four ASCII characters:

  • ' ': ASCII character 0x20, also known as "Space"
  • '\t': ASCII character 0x9, also known as "Horizontal Tab"
  • '\n': ASCII character 0xA, also known as "Line Feed" or "New Line"
  • '\r': ASCII character 0xD, also known as "Carriage Return"
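
As a rough illustration of the definition above, a scanner could skip whitespace as follows. This is a minimal Python sketch, not soup's actual implementation; the names `WHITESPACE` and `skip_whitespace` are hypothetical.

```python
# The four ASCII characters that the list above defines as whitespace.
WHITESPACE = {" ", "\t", "\n", "\r"}


def skip_whitespace(source: str, pos: int) -> int:
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(source) and source[pos] in WHITESPACE:
        pos += 1
    return pos
```

Other control characters, such as vertical tab, are not in the list and so are not skipped by this sketch.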

Comments

"Comments" begin with two consecutive forward slashes // and continue until the end of the line (i.e. the next "New Line" \n). A comment can begin at any point in a file and takes precedence over every other construct in the language. For example, a string cannot contain //, because the rest of the line after the slashes is ignored.

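The precedence rule for comments can be sketched in Python as below. The function name `strip_comment` is hypothetical, and the inputs in the usage note are illustrative strings rather than guaranteed-valid soup code.

```python
def strip_comment(line: str) -> str:
    """Drop everything from the first '//' to the end of the line.

    No attempt is made to detect string literals, which mirrors the rule
    that a comment takes precedence over every other construct.
    """
    index = line.find("//")
    return line if index == -1 else line[:index]
```

With this sketch, `strip_comment('x = 1; // note')` keeps only `'x = 1; '`, and a line containing `"http://example"` is cut off right after `"http:`, leaving the string unclosed.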
Tokens

Identifiers

An "Identifier" is an unlimited-length sequence of alphabetic characters, digits, and underscores. An identifier cannot start with a digit. soup also has 11 reserved words (see Reserved Words) plus the boolean literals true and false, and none of these can be used as identifiers. The following are examples of valid identifiers:

  • identifier
  • iden_tifier
  • _ID
  • iD
  • _
  • id_1
  • I2d

And the following are examples of invalid identifiers:

  • 3id
  • return
  • false
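
The identifier rules above can be checked with a short Python sketch. It assumes "alphabetic characters" means the ASCII letters A-Z and a-z, consistent with the ASCII-only encoding; the names `IDENTIFIER_PATTERN` and `is_valid_identifier` are hypothetical.

```python
import re

# Reserved words and boolean literals as listed on this page.
RESERVED_WORDS = {
    "int", "bool", "void", "if", "else", "while",
    "break", "return", "func", "returns", "main",
}
BOOLEAN_LITERALS = {"true", "false"}

# A letter or underscore, followed by any number of letters, digits, or underscores.
IDENTIFIER_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_identifier(text: str) -> bool:
    """Return True if text has identifier shape and is not reserved."""
    if IDENTIFIER_PATTERN.fullmatch(text) is None:
        return False
    return text not in RESERVED_WORDS and text not in BOOLEAN_LITERALS
```

Applied to the examples above, the sketch accepts `identifier`, `_`, and `I2d`, and rejects `3id`, `return`, and `false`.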

Integer Literals

An "Integer Literal" consists of one or more digits and can only be represented in base 10. For example, an integer literal of two or more digits beginning with 0 is not an octal number; it is simply the decimal number with the leading zero(s) ignored, so 02 is the decimal number 2. The following are examples of integer literals:

  • 0
  • 02
  • 1309242463024963

Note that the definition of an integer literal does not include an optional hyphen - to indicate negative numbers. Negative numbers are represented in the compiler as a unary minus operator applied to an integer literal.
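
A minimal Python sketch of scanning an integer literal under these rules is shown below. The function name `scan_integer` and its return shape are assumptions made for illustration, not part of soup's compiler.

```python
DIGITS = "0123456789"


def scan_integer(source: str, pos: int) -> tuple:
    """Scan a base-10 integer literal starting at pos.

    Returns (value, index just past the literal). Raises ValueError if
    source[pos] is not a digit, for example when it is a hyphen.
    """
    end = pos
    while end < len(source) and source[end] in DIGITS:
        end += 1
    if end == pos:
        raise ValueError(f"no integer literal at position {pos}")
    # int() reads the digits as decimal, so leading zeros are ignored.
    return int(source[pos:end]), end
```

Because the hyphen is not part of the literal, scanning `-5` from position 0 fails in this sketch; a scanner would first emit the `-` operator and then scan `5` from position 1.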

String Literals

A "String Literal" consists of zero or more characters contained within two quotation marks ". The following escape characters are supported:

  • \b
  • \f
  • \t
  • \r
  • \n
  • \'
  • \"
  • \\

As mentioned above, two consecutive forward slashes // cannot be contained in a string, because the compiler will interpret them as the beginning of a comment.
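
The escape list above could be decoded along the following lines. This Python sketch assumes each escape has its conventional C-style meaning and that any other character after a backslash is an error; neither point is spelled out on this page. The names `ESCAPES` and `decode_string_body` are hypothetical.

```python
# Maps the character after a backslash to the character it stands for
# (assumed C-style meanings).
ESCAPES = {
    "b": "\b", "f": "\f", "t": "\t", "r": "\r", "n": "\n",
    "'": "'", '"': '"', "\\": "\\",
}


def decode_string_body(body: str) -> str:
    """Decode the text between the quotation marks of a string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # A backslash must be followed by one of the supported characters.
            if i + 1 >= len(body) or body[i + 1] not in ESCAPES:
                raise ValueError(f"unsupported escape at position {i}")
            out.append(ESCAPES[body[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```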

Boolean Literals

A "Boolean Literal" is one of the following, formed from ASCII characters:

  • true
  • false

Reserved Words

A "Reserved Word" is one of the following, formed from ASCII characters:

  • int
  • bool
  • void
  • if
  • else
  • while
  • break
  • return
  • func
  • returns
  • main

Operators

An "Operator" is one of the following, formed from ASCII characters:

  • +
  • +=
  • -
  • -=
  • *
  • *=
  • /
  • /=
  • %
  • %=
  • <
  • <=
  • >
  • >=
  • =
  • ==
  • !
  • !=
  • &&
  • ||
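
Several operators share a first character (for example < and <=), so a scanner has to decide how many characters to consume. One common approach, assumed here rather than confirmed by this page, is to prefer the longest match. A minimal Python sketch with the hypothetical name `match_operator`:

```python
OPERATORS = {
    "+", "+=", "-", "-=", "*", "*=", "/", "/=", "%", "%=",
    "<", "<=", ">", ">=", "=", "==", "!", "!=", "&&", "||",
}


def match_operator(source: str, pos: int):
    """Return the longest operator starting at pos, or None if there is none."""
    # Try the two-character candidates before the one-character ones.
    for length in (2, 1):
        candidate = source[pos:pos + length]
        if candidate in OPERATORS:
            return candidate
    return None
```

Note that a single & or | is not in the operator list, so the sketch returns None for them.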

Separators

A "Separator" is one of the following ASCII characters:

  • (
  • )
  • {
  • }
  • ;
  • ,

End Of File

An "End Of File" token does not correspond to an ASCII character. It is simply appended to the list of tokens at the very end of the scanning phase of compilation, so that errors such as an unclosed string can be detected.
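
The following Python sketch shows one way a scanner might append an End Of File token and report an unclosed string. How soup's scanner actually performs this check is not described here, so the token shapes, the name `scan`, and the error type are all assumptions; escape sequences and every token kind other than strings are left out to keep the sketch short.

```python
def scan(source: str) -> list:
    """Collect string literal tokens, then append an End Of File token."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos] == '"':
            closing = source.find('"', pos + 1)
            if closing == -1:
                # The end of the input was reached before the closing quote.
                raise SyntaxError("reached end of file inside a string literal")
            tokens.append(("STRING", source[pos + 1:closing]))
            pos = closing + 1
        else:
            pos += 1  # other token kinds omitted in this sketch
    # The End Of File token is always the last token in the list.
    tokens.append(("EOF", ""))
    return tokens
```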