Interpret assembly syntax #165

Merged
merged 23 commits into from
Dec 7, 2024
1 change: 1 addition & 0 deletions .clang-format
@@ -37,3 +37,4 @@ WhitespaceSensitiveMacros:
- testcase
- testgroup
- TEST_ITEM
- DESTRINGIFY_NAME
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file. See [Keep a

## Unreleased

### Added

- The interpreter can now handle assembly syntax, with the `-A` flag on the command line and a checkbox in the online interpreter. Interactive debugging of assembly is not yet supported in the web interpreter.

### Changed

- Fixed C++23 support
31 changes: 31 additions & 0 deletions README.md
@@ -17,6 +17,7 @@ Trilangle is a 2-D, stack-based programming language inspired by [Hexagony].
> - [Interpreter flags](#interpreter-flags)
> - [Exit codes](#exit-codes)
> - [The disassembler](#the-disassembler)
> - [Assembly syntax](#assembly-syntax)
> - [C compiler](#c-compiler)
> - [Sample programs](#sample-programs)
> - [cat](#cat)
@@ -225,6 +226,36 @@ For example, when passing [the cat program below](#cat) with the flags `-Dn`, th
2.2: EXT
```

### Assembly syntax

In addition to producing this syntax, Trilangle can interpret it. The output of `--hide-nops` is currently not guaranteed to be interpretable, as it may be missing jump targets. Each line may contain at most a label, an instruction, and a comment, in that order. The syntax is described by the following extended Backus-Naur form:

```
program = line, {newline, line};
line = [label, [":"]], [multiple_whitespace, instruction], {whitespace}, [comment];

newline = ? U+000A END OF LINE ?;
tab = ? U+0009 CHARACTER TABULATION ?;
whitespace = " " | ? U+000D CARRIAGE RETURN ? | tab;
non_whitespace = ? Any single unicode character not in 'newline' or 'whitespace' ?;
multiple_whitespace = whitespace, {whitespace};

label = non_whitespace, {non_whitespace};
comment = ";", {non_whitespace | whitespace};

instruction = instruction_with_target | instruction_with_argument | plain_instruction;
instruction_with_target = ("BNG" | "TSP" | "JMP"), multiple_whitespace, label;
instruction_with_argument = ("PSI" | "PSC"), multiple_whitespace, number_literal;
plain_instruction = ? Any three-character instruction besides the five already covered ?;

number_literal = character_literal | decimal_literal | hex_literal;
character_literal = "'", (non_whitespace | tab), "'";
decimal_literal = "#", decimal_digit;
hex_literal = "0x", hex_digit, {hex_digit};
decimal_digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
hex_digit = "a" | "b" | "c" | "d" | "e" | "f" | "A" | "B" | "C" | "D" | "E" | "F" | decimal_digit;
```
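
As an illustration of the `number_literal` rule, the sketch below classifies the three literal forms in C++17. It is not Trilangle's own parser: `parse_number_literal` is a hypothetical name, and for brevity it accepts only single-byte (ASCII) character literals, whereas the grammar allows any single Unicode character.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper, not part of Trilangle. Returns the value of a
// number_literal, or std::nullopt if the text matches none of the three forms.
// Assumption: character literals are restricted to one ASCII byte.
std::optional<std::uint32_t> parse_number_literal(std::string_view text) {
    // character_literal = "'", (non_whitespace | tab), "'"
    if (text.size() == 3 && text.front() == '\'' && text.back() == '\'') {
        const unsigned char c = static_cast<unsigned char>(text[1]);
        if (c == ' ' || c == '\r' || c == '\n' || c >= 0x80) {
            return std::nullopt;
        }
        return std::uint32_t{c};
    }
    // decimal_literal = "#", decimal_digit (exactly one digit)
    if (text.size() == 2 && text[0] == '#' && text[1] >= '0' && text[1] <= '9') {
        return static_cast<std::uint32_t>(text[1] - '0');
    }
    // hex_literal = "0x", hex_digit, {hex_digit}
    if (text.size() > 2 && text.substr(0, 2) == "0x") {
        std::uint32_t value = 0;
        for (char c : text.substr(2)) {
            std::uint32_t digit = 0;
            if (c >= '0' && c <= '9') {
                digit = static_cast<std::uint32_t>(c - '0');
            } else if (c >= 'a' && c <= 'f') {
                digit = static_cast<std::uint32_t>(c - 'a' + 10);
            } else if (c >= 'A' && c <= 'F') {
                digit = static_cast<std::uint32_t>(c - 'A' + 10);
            } else {
                return std::nullopt;
            }
            if (value > 0x0FFFFFFFu) {
                return std::nullopt;  // another digit would overflow 32 bits
            }
            value = value * 16 + digit;
        }
        return value;
    }
    return std::nullopt;
}
```

Under these assumptions, `'A'`, `#7`, and `0xff` are accepted, while `#12` (more than one decimal digit) and a bare `0x` are rejected, matching the grammar.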

## C compiler

When using the `-c` flag, the input program will be translated into C code. The C code is not meant to be idiomatic or easy to read, as it is a literal translation of the input. Optimizers such as those used by clang and GCC tend to do a good job of improving the program, in some cases removing the heap allocation altogether; MSVC does not.
2 changes: 1 addition & 1 deletion qdeql/disassembly.txt
@@ -296,7 +296,7 @@ ct_bgn_not_end:
ct_bgn_not_end_cleanup:
POP ; Remove the uop from the stack
DEC ; Advance the PC
JMP ct_find_end_loop
JMP ct_bgn_find_end_loop
ct_bgn_found_end:
; If PC is pointing at the end uop, the stack layout is this:
; +======+------+----+
13 changes: 13 additions & 0 deletions src/any_program_holder.hh
@@ -0,0 +1,13 @@
#pragma once

#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include "instruction.hh"

template<typename IP>
class any_program_holder {
public:
virtual void advance(IP& ip, std::function<bool()> go_left) = 0;
virtual instruction at(const IP& ip) = 0;
virtual std::string raw_at(const IP& ip) = 0;
virtual std::pair<size_t, size_t> get_coords(const IP& ip) const = 0;
};
293 changes: 293 additions & 0 deletions src/assembly_scanner.cpp
@@ -0,0 +1,293 @@
#include "assembly_scanner.hh"
#include <cinttypes>
#include <cstring>
#include <iostream>
#include <set>
#include <sstream>

using std::cerr;
using std::endl;
using std::string;
using std::string_view;
using std::string_view_literals::operator""sv;

#define WHITESPACE " \n\r\t"

[[noreturn]] static void invalid_literal(const string& argument) {
cerr << "Invalid format for literal: " << argument << endl;
exit(EXIT_FAILURE);
}

const std::vector<instruction>& assembly_scanner::get_instructions() {
if (!m_instructions.empty()) {
return m_instructions;
}

// We need to do two passes: one to resolve labels, and one to assign targets to jumps. During the first pass, the
// fragments are actually constructed. However, jumps may not have valid targets yet, so we need some way to store
// the label's name inside an IP. This code relies on the following assumption:
static_assert(sizeof(NONNULL_PTR(const string)) <= sizeof(IP), "Cannot fit string pointer inside IP");
// Using an ordered set over any other container so that references are not invalidated after insertion
std::set<string> label_names;

auto label_to_fake_location = [&](const string& name) -> IP {
auto iter = label_names.find(name);
if (iter == label_names.end()) {
auto p = label_names.insert(name);
iter = p.first;
}
NONNULL_PTR(const string) ptr = &*iter;
return reinterpret_cast<uintptr_t>(ptr);
};

// First pass
size_t line_end = 0;
while (true) {
size_t line_start = m_program.find_first_not_of('\n', line_end);
if (line_start >= m_program.size()) {
break;
}
line_end = m_program.find_first_of('\n', line_start);
string_view curr_line = string_view(m_program).substr(line_start, line_end - line_start);

// Unquoted semicolons are comments. Remove them.
size_t i;
for (i = 0; i < curr_line.size(); ++i) {
if (curr_line[i] != ';') {
continue;
}
if (i == 0 || curr_line[i - 1] != '\'') {
break;
}
}
if (i < curr_line.size()) {
curr_line.remove_suffix(curr_line.size() - i);
}
// If the line is only a comment, move on
if (curr_line.empty()) {
continue;
}
// Remove trailing whitespace. If there's only whitespace, skip this line
i = curr_line.find_last_not_of(WHITESPACE);
if (i == string::npos) {
continue;
} else {
curr_line.remove_suffix(curr_line.size() - i - 1);
}

// Look for labels (non-whitespace in the first column)
i = curr_line.find_first_not_of(WHITESPACE ":");
assert(i != string::npos);
if (i == 0) {
// Label, find end
i = curr_line.find_first_of(WHITESPACE ":");
if (i == string::npos) {
i = curr_line.size();
}
string label(curr_line.substr(0, i));

[[maybe_unused]] auto _0 = label_names.insert(label);
auto [_, inserted] = m_label_locations.insert({ label, m_instructions.size() });

if (!inserted) {
cerr << "Label '" << label << "' appears twice" << endl;
exit(EXIT_FAILURE);
}

// Set i to the first non-whitespace character after the label
i = curr_line.find_first_not_of(WHITESPACE ":", i + 1);
if (i == string::npos) {
// Line was only a label
continue;
}
}

// Remove leading whitespace, and label if there is one
curr_line.remove_prefix(i);

// Line should only be the opcode and, if there is one, the argument
if (curr_line.size() < 3) {
cerr << "Instruction too short: " << curr_line << endl;
exit(EXIT_FAILURE);
}
if (curr_line.size() > 3 && !strchr(WHITESPACE, curr_line[3])) {
cerr << "Instruction too long: " << curr_line << endl;
exit(EXIT_FAILURE);
}

string_view instruction_name(curr_line.data(), 3);
instruction::operation opcode = assembly_scanner::opcode_for_name(instruction_name);
switch (opcode) {
case instruction::operation::JMP: {
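size_t label_start = curr_line.find_first_not_of(WHITESPACE, 3);
// A bare opcode has no label; substr(npos) would throw std::out_of_range
if (label_start == string::npos) {
cerr << "Missing label for jump instruction" << endl;
exit(EXIT_FAILURE);
}
instruction::argument arg;
string label(curr_line.substr(label_start));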
arg.next = { SIZE_C(0), label_to_fake_location(label) };
m_instructions.push_back({ opcode, arg });
m_slices.push_back(curr_line);
break;
}
case instruction::operation::BNG:
case instruction::operation::TSP: {
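size_t label_start = curr_line.find_first_not_of(WHITESPACE, 3);
// A bare opcode has no label; substr(npos) would throw std::out_of_range
if (label_start == string::npos) {
cerr << "Missing label for branch instruction" << endl;
exit(EXIT_FAILURE);
}
instruction::argument arg;
string label(curr_line.substr(label_start));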
arg.choice = { { SIZE_C(0), m_instructions.size() + 1 }, { SIZE_C(0), label_to_fake_location(label) } };
m_instructions.push_back({ opcode, arg });
m_slices.push_back(curr_line);
break;
}
case instruction::operation::PSI:
case instruction::operation::PSC: {
size_t arg_start = curr_line.find_first_not_of(WHITESPACE, 3);
if (arg_start == string::npos) {
cerr << "Missing argument for push instruction" << endl;
exit(EXIT_FAILURE);
}

string argument(curr_line.substr(arg_start));
int24_t arg_value;
// Should be in one of three formats:
// - 'c' (single UTF-8 character)
// - 0xff (arbitrary length hex number)
// - #9 (single decimal digit)
if (argument[0] == '\'' && argument.back() == '\'') {
if (argument.size() < 3 || argument.size() > 6) {
// One UTF-8 character is 1 to 4 bytes
invalid_literal(argument);
}
i = 1;
arg_value = parse_unichar([&]() { return argument[i++]; });
if (arg_value < INT24_C(0) || i != argument.size() - 1) {
invalid_literal(argument);
}
} else if (argument[0] == '0' && argument[1] == 'x') {
char* last = nullptr;
unsigned long ul = strtoul(argument.c_str(), &last, 16);
if (*last != '\0' || ul > 0x1f'ffffUL) {
invalid_literal(argument);
}
arg_value = static_cast<int24_t>(ul);
} else if (argument[0] == '#' && argument.size() == 2) {
arg_value = static_cast<int24_t>(argument[1] - '0');
if (arg_value < INT24_C(0) || arg_value > INT24_C(9)) {
invalid_literal(argument);
}
} else {
invalid_literal(argument);
}

if (opcode == instruction::operation::PSI) {
// PSI expects to be given the digit, not the actual value
auto p = arg_value.add_with_overflow('0');
assert(!p.first);
arg_value = p.second;
}

instruction::argument arg;
arg.number = arg_value;
m_instructions.push_back({ opcode, arg });
m_slices.push_back(curr_line);
break;
}
default:
m_instructions.push_back({ opcode, instruction::argument() });
m_slices.push_back(curr_line);
break;
}
}

if (!m_instructions.empty()) {
const auto& last_instr = m_instructions.back();
if (!(last_instr.is_exit() || last_instr.m_op == instruction::operation::JMP)) {
cerr << "Program does not end in an exit instruction or loop" << endl;
exit(EXIT_FAILURE);
}
}

// Second pass
for (auto& instr : m_instructions) {
if (instr.m_op == instruction::operation::JMP) {
fake_location_to_real(instr.m_arg.next);
} else if (instr.m_op == instruction::operation::TSP || instr.m_op == instruction::operation::BNG) {
fake_location_to_real(instr.m_arg.choice.second);
}
}

return m_instructions;
}

void assembly_scanner::advance(IP& ip, std::function<bool()> go_left) {
instruction i = at(ip);

if (i.get_op() == instruction::operation::JMP) {
ip = i.get_arg().next.second;
return;
}

const auto* to_left = i.second_if_branch();
if (to_left != nullptr && go_left()) {
ip = to_left->second;
return;
}

ip++;
}

void assembly_scanner::fake_location_to_real(std::pair<size_t, size_t>& p) const {
uintptr_t reconstructed = static_cast<uintptr_t>(p.second);
auto ptr = reinterpret_cast<NONNULL_PTR(const string)>(reconstructed);
const string& str = *ptr;
auto loc = m_label_locations.find(str);
if (loc == m_label_locations.end()) {
cerr << "Undeclared label '" << str << "'" << endl;
exit(EXIT_FAILURE);
}
p = { SIZE_C(0), loc->second };
}

#define DESTRINGIFY_NAME(op) \
if (name == #op##sv) \
return instruction::operation::op

instruction::operation assembly_scanner::opcode_for_name(const string_view& name) noexcept {
DESTRINGIFY_NAME(BNG);
DESTRINGIFY_NAME(JMP);
DESTRINGIFY_NAME(TKL);
DESTRINGIFY_NAME(TSP);
DESTRINGIFY_NAME(TJN);
DESTRINGIFY_NAME(NOP);
DESTRINGIFY_NAME(ADD);
DESTRINGIFY_NAME(SUB);
DESTRINGIFY_NAME(MUL);
DESTRINGIFY_NAME(DIV);
DESTRINGIFY_NAME(UDV);
DESTRINGIFY_NAME(MOD);
DESTRINGIFY_NAME(PSI);
DESTRINGIFY_NAME(PSC);
DESTRINGIFY_NAME(POP);
DESTRINGIFY_NAME(EXT);
DESTRINGIFY_NAME(INC);
DESTRINGIFY_NAME(DEC);
DESTRINGIFY_NAME(AND);
DESTRINGIFY_NAME(IOR);
DESTRINGIFY_NAME(XOR);
DESTRINGIFY_NAME(NOT);
DESTRINGIFY_NAME(GTC);
DESTRINGIFY_NAME(PTC);
DESTRINGIFY_NAME(GTI);
DESTRINGIFY_NAME(PTI);
DESTRINGIFY_NAME(PTU);
DESTRINGIFY_NAME(IDX);
DESTRINGIFY_NAME(DUP);
DESTRINGIFY_NAME(DP2);
DESTRINGIFY_NAME(RND);
DESTRINGIFY_NAME(EXP);
DESTRINGIFY_NAME(SWP);
DESTRINGIFY_NAME(GTM);
DESTRINGIFY_NAME(GDT);

cerr << "Unrecognized opcode '" << name << '\'' << endl;
exit(EXIT_FAILURE);
}
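
The two-pass label resolution in `get_instructions` can be illustrated in isolation. The following is a simplified C++17 sketch, not the code from this PR: `source_line` and `resolve_labels` are hypothetical names, each line counts as one instruction, and errors are reported by returning `std::nullopt` instead of exiting.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for one assembly line.
struct source_line {
    std::string label;   // label defined on this line; empty if none
    std::string target;  // label this line jumps to; empty if not a jump
};

// For each line, yields the index it jumps to (nullopt for non-jumps).
// Yields nullopt for the whole program on a duplicate or undeclared label.
std::optional<std::vector<std::optional<std::size_t>>> resolve_labels(
    const std::vector<source_line>& lines) {
    // First pass: record where every label is defined.
    std::unordered_map<std::string, std::size_t> locations;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (lines[i].label.empty()) {
            continue;
        }
        if (!locations.emplace(lines[i].label, i).second) {
            return std::nullopt;  // label appears twice
        }
    }

    // Second pass: all labels are known, so forward references resolve too.
    std::vector<std::optional<std::size_t>> targets(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (lines[i].target.empty()) {
            continue;
        }
        auto found = locations.find(lines[i].target);
        if (found == locations.end()) {
            return std::nullopt;  // undeclared label
        }
        targets[i] = found->second;
    }
    return targets;
}
```

Splitting the work this way lets a jump refer to a label that is defined later in the file, which a single pass could not resolve at the point where the jump is read.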