LLM Compiler - First Impressions

Compute Heavy Industries has believed for some time that machine language modeling is a vastly undertapped area of deep learning research. Needless to say, we were very excited about the release of Meta AI's LLM Compiler.

We are in a unique position to dive into this model, with a deep background in application security and more recent work around the automation of code generation models.

That being said, this is a cursory review with a small sample size.

LLM Compiler

Meta's LLM Compiler is a llama-architecture model trained on LLVM intermediate representation (IR) and (primarily) Linux x86-64 assembly code. It is an open model and has a 16k token context window.

If you are unfamiliar with LLVM, it is a compiler framework consisting of frontends that convert the source code of different programming languages (e.g. C, C++, Rust, etc.) to LLVM IR, and backends that convert LLVM IR to various assembly architectures (e.g. x86-64, ARM, WASM, etc.).

LLVM, as a compiler framework, allows you to write passes over the LLVM IR to analyze and optimize it. This is a pretty attractive concept for security researchers as they can write analyses for LLVM IR and get audit capabilities for many different programming languages. Further, there is research around lifting assembly code into LLVM IR under the same premise. If you can lift different compiled targets into LLVM IR, you only need to write your analyses for LLVM IR and you get many architectures for free.

Back to the LLM Compiler model. There are four modes of operation. The two primary modes take LLVM IR and output LLVM IR or x86-64 assembly given some optimization passes as parameters. Additionally, Meta made a finetune that lifts code from assembly to LLVM IR (very cool!) and can also select optimization passes.

The work highlighted in this post focuses solely on the translation of LLVM IR to assembly.

Setup

Since LLM Compiler targets x86-64 Linux assembly, we used LLVM in Windows Subsystem for Linux (WSL). Specifically, we used LLVM 18.

To run the model we used llama.cpp. We initially tried with HuggingFace but it did not work, likely because Python is bad. We converted the models for llama.cpp and the model ran without issue.

There are four versions of the model. The base model comes in 7b and 13b parameter sizes. The *-ftd models of the same sizes are the fine-tunes for flag tuning and for lifting assembly to LLVM IR.

For the hardware, we used an RTX 3090. The 7b model ran performantly at ~40 tok/s. It appeared that the 13b model exceeded 24 GB of GPU memory during inference. For this reason, we ran the 7b base model.
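A quick back-of-envelope calculation explains the 13b result: at fp16, the weights alone take two bytes per parameter, before any KV cache or runtime overhead. A rough sketch of that arithmetic:

```python
# Rough fp16 memory estimate: 2 bytes per parameter, weights only.
# Ignores KV cache, activations, and runtime overhead, which only add to this.
def fp16_weight_gb(params_billions: float) -> float:
    return params_billions * 2  # billions of params * 2 bytes/param = GB

print(fp16_weight_gb(7))   # 14 GB of weights -> fits a 24 GB card
print(fp16_weight_gb(13))  # 26 GB of weights -> exceeds 24 GB before overhead
```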

The Prompts

As mentioned above, there are four prompts provided in the demo code that was released with the model.

The first translates from LLVM IR to LLVM IR given some optimization passes.

[INST] Give the LLVM-IR for the following code when optimized using opt -p '{passes}':

<code>{ir}</code>

The input code has instruction count {ir_count} and binary size {bin_size} bytes. [/INST]
            

The second translates from LLVM IR to ASM given some optimization passes. This was the prompt we focused on for this evaluation.

[INST] Give the assembly for the following code when optimized using opt -p '{passes}':

<code>{ir}</code>

The input code has instruction count {ir_count} and binary size {bin_size} bytes. [/INST]
            

The third is about code optimizations. We did not test this so we cannot speak to whether it provides a list of optimization passes or actually describes what they do to the LLVM IR.

[INST] Tell me how to optimize this LLVM-IR for object file size:

<code>{ir}</code> [/INST]
            

The fourth is for lifting assembly to LLVM IR. We also did not run this functionality though it is an exciting prospect.

[INST] Disassemble this code to LLVM-IR:

<code>{asm}</code> [/INST]
            

Our Test Program

Our evaluation program parses a diff of a Git LFS pointer file in order to extract the old and new object hashes. The goal in writing this program was to have some simple parsing logic that would exercise a variety of low-level instructions. Additionally, we wanted the program to be self-contained so we would not have to fix up APIs or debug inlined library functions (e.g. strcmp).

Glancing through the paper we noticed that the model was trained on a maximum function size of 8k tokens. This was done to have 8k for the prompt and 8k for the output. We initially had everything in the parse_diff function and, upon evaluation, this function exceeded 8k tokens. We started breaking out functionality until we had four test functions that were all under 8k tokens.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

size_t consume_line(char *in_buf, size_t in_size, size_t idx) {
while (idx < in_size && in_buf[idx] != '\n') {
idx++;
}
idx++;

return idx;
}

bool copy_64_hex(char *in_buf, char *out_buf, size_t idx) {
for (int i = 0; i < 64; i++) {
if (!(in_buf[idx+i] >= 0x30 && in_buf[idx+i] <= 0x39 ||
in_buf[idx+i] >= 0x61 && in_buf[idx+i] <= 0x66 ||
in_buf[idx+i] >= 0x41 && in_buf[idx+i] <= 0x46))
{
return false;
}

out_buf[i] = in_buf[idx+i];
}

return true;
}

bool validate(char *in_buf, size_t in_size, size_t *idx) {
// Check starts with "diff".
if (*(uint32_t *)(in_buf+*idx) != 0x66666964) {
return false;
}

*idx += 4;

*idx = consume_line(in_buf, in_size, *idx);
if (*idx >= in_size) {
return false;
}

*idx = consume_line(in_buf, in_size, *idx);
if (*idx >= in_size) {
return false;
}

// Check line starts with "--- ".
if (in_size - *idx < 4) {
return false;
}

if (in_buf[*idx] != '-') {
return false;
}

(*idx)++;

if (in_buf[*idx] != '-') {
return false;
}

(*idx)++;

if (in_buf[*idx] != '-') {
return false;
}

(*idx)++;

if (in_buf[*idx] != ' ') {
return false;
}

(*idx)++;

*idx = consume_line(in_buf, in_size, *idx);
if (*idx >= in_size) {
return false;
}

// Check line starts with "+++ ".
if (in_size - *idx < 4) {
return false;
}

char *check = "+++ ";
for (int i = 0; i < 4; i++) {
if (in_buf[*idx+i] != check[i]) {
return false;
}
}

*idx += 4;

return true;
}

bool parse_diff(char *in_buf, size_t in_size, char *out_buf, size_t out_size) {
if (in_size < 4) {
return false;
}

if (out_size < 128) {
return false;
}

size_t idx = 0;

bool ok = validate(in_buf, in_size, &idx);

if (!ok) {
return false;
}

idx = consume_line(in_buf, in_size, idx);
if (idx >= in_size) {
return false;
}

// Check line matches "@@ -1,3 +1,3 @@\n".
if (in_size - idx < 16) {
return false;
}

if (*(uint64_t *)(in_buf+idx) != 0x20332c312d204040) {
return false;
}

idx += 8;

if (*(uint64_t *)(in_buf+idx) != 0xa404020332c312b) {
return false;
}

idx += 8;

// Check version line.
size_t check_size = 44;
char *check = " version https://git-lfs.github.com/spec/v1\n";

if (in_size - idx < check_size) {
return false;
}

for (int i = 0; i < check_size; i++) {
if (in_buf[idx+i] != check[i]) {
return false;
}
}

idx += check_size;

// Parse old hash.
if (in_size - idx < 76) {
return false;
}

uint32_t oid_check = 0x64696f2d;
if (*(uint32_t *)(in_buf+idx) != oid_check) {
return false;
}

idx += 4;

uint64_t sha256_check_u = 0x3a36353261687320;
if (*(uint64_t *)(in_buf+idx) != sha256_check_u) {
return false;
}

idx += 8;

ok = copy_64_hex(in_buf, out_buf, idx);
if (!ok) {
return false;
}

idx += 64;
idx++;

idx = consume_line(in_buf, in_size, idx);
if (idx >= in_size) {
return false;
}

// Parse new hash.
if (in_size - idx < 76) {
return false;
}

oid_check ^= 0b110;
if (*(uint32_t *)(in_buf+idx) != oid_check) {
return false;
}

idx += 4;

char *sha256_check_s = " sha256:";
if (*(uint64_t *)(in_buf+idx) != *(uint64_t *)sha256_check_s) {
return false;
}

idx += 8;

ok = copy_64_hex(in_buf, out_buf+64, idx);
if (!ok) {
return false;
}

idx += 64;
idx++;

return true;
}

int main() {
char *diff = "diff --git a/objects/AcademicDarkvioletCoqui.jpg "
"b/objects/AcademicDarkvioletCoqui.jpg\n"
"index a1c4d61..bec8408 100644\n"
"--- a/objects/AcademicDarkvioletCoqui.jpg\n"
"+++ b/objects/AcademicDarkvioletCoqui.jpg\n"
"@@ -1,3 +1,3 @@\n"
" version https://git-lfs.github.com/spec/v1\n"
"-oid sha256:"
"8751496787dd63ed263ba24afd9cfdec599650af13879735d656af3fd81ccdf5\n"
"-size 550008\n"
"+oid sha256:"
"f356316203a3818c48018990d476395d38d25550619224c35a7416393e520f52\n"
"+size 48722\n";

// printf("%s", diff);

char out_buf[129] = { 0 };
size_t out_size = 128;

size_t diff_size = strlen(diff);
bool ok = parse_diff(diff, diff_size, out_buf, out_size);

if (!ok) {
printf("FALSE\n");
return 1;
}

char *check =
"8751496787dd63ed263ba24afd9cfdec599650af13879735d656af3fd81ccdf5"
"f356316203a3818c48018990d476395d38d25550619224c35a7416393e520f52";

if (memcmp(out_buf, check, 128) != 0) {
printf("BAD\n");
return 1;
}

printf("OK\n");
}
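As an aside, the magic integer constants in the parser above are just the target strings encoded as little-endian integers. A quick sanity check of the ones used (this is illustrative, not part of the test program):

```python
import struct

# On little-endian x86-64, comparing a 4- or 8-byte load against an integer
# constant is equivalent to comparing the raw ASCII bytes.
assert struct.unpack("<I", b"diff")[0] == 0x66666964      # "diff" check in validate
assert struct.unpack("<I", b"-oid")[0] == 0x64696f2d      # oid_check in parse_diff
assert struct.unpack("<Q", b" sha256:")[0] == 0x3a36353261687320  # sha256_check_u
# XORing with 0b110 flips '-' (0x2d) to '+' (0x2b), turning "-oid" into "+oid".
assert 0x2d ^ 0b110 == ord("+")
```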

Prompt Parameters

As you might have noticed above, the prompt for converting LLVM IR to assembly takes a few parameters. The first is a list of optimization passes. We are not so interested in optimization so we just put O0 in this field.

Following this is the IR. To get the IR for a specific function, we first compiled our test program using clang, emitting LLVM IR. The output of this command will be main.ll.

clang-18 -S -emit-llvm -Xclang -disable-O0-optnone main.c

Additionally, we prevent the optnone decoration from being added to the LLVM IR functions. This decoration prevents optimization passes from running on the code, which will be important in a couple of paragraphs when we try to get the instruction count.

Once we have our main.ll, the full file of LLVM IR, we need to extract a specific function. Using consume_line as our example, we can extract that function from the LLVM IR to get an LLVM bitcode file (binary representation of the human readable IR) containing only that function.

llvm-extract-18 --func=consume_line main.ll -o consume_line.bc

For use in our prompt, we now convert the bitcode file back to human-readable LLVM IR.

llvm-dis-18 consume_line.bc

Now there are two more parameters in the prompt: the count of instructions in the LLVM IR and the nebulous binary size. To automate getting the instruction count, we can use opt, which runs optimization passes over LLVM IR. Specifically, we want to run the instcount pass.

opt-18 -stats --passes=instcount -o /dev/null -debug-pass-manager consume_line.bc

This little command had us recompiling LLVM all weekend. The first issue is that the instcount pass outputs statistics. It is our understanding that statistics do not get output if you have a release build. Once you've rebuilt LLVM to get a debug build and that pass still does not work, you will find that passes do not run if your functions are annotated with optnone, as mentioned above. The way to discover this is with the switch -debug-pass-manager, which will show you that each function is being skipped due to optnone. Once you sort that, you will see some nice detailed statistics. Now, maybe you have accidentally switched back to a release build when going from your Windows LLVM build to the LLVM apt distribution on WSL. In this case, you will see no statistics, but if you again use the -debug-pass-manager switch you can see a single statistic reporting total instruction count. Exactly what we need!
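If you only need a ballpark number and want to sidestep the statistics dance entirely, a crude textual count over the .ll file gets surprisingly close. This is our own heuristic sketch (it skips blanks, comments, and block labels), not a substitute for the instcount pass:

```python
def count_ir_instructions(ir_text: str) -> int:
    """Rough instruction count for textual LLVM IR: counts non-blank,
    non-comment, non-label lines inside define ... } bodies."""
    count, in_func = 0, False
    for line in ir_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("define "):
            in_func = True
            continue
        if not in_func:
            continue
        if stripped == "}":
            in_func = False
            continue
        code = stripped.split(";")[0].strip()  # drop trailing ; comments
        if not code or code.endswith(":"):     # skip blanks and labels like "7:"
            continue
        count += 1
    return count
```

For the consume_line function used in this post, this heuristic agrees with the instcount result of 29.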

For binary size, we turn to the LLM Compiler paper.

We use two metrics for code size: the number of IR instructions, and binary size. Binary size is computed by summing the size of the .TEXT and .DATA sections of the IR or assembly after lowering to an object file; we exclude .BSS section from our binary size metric since it does not affect on-disk size.

What they do not mention is that the utility llvm-size can report the different section sizes for you if you assemble the LLVM IR to an object file (.o).

llc-18 consume_line.ll -filetype=obj
llvm-size-18 consume_line.o

With that, we have all of the parameters needed for a prompt. We repeat the process for each of the four functions that we wish to test: consume_line, copy_64_hex, validate, and parse_diff. We create files with the formatted prompt for each of these functions.
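Filling the template is then just string formatting. A minimal sketch of a prompt builder (the template text is from the released demo prompt; the helper name and placeholder IR are our own):

```python
# The IR-to-assembly prompt template from the demo code, as a format string.
TEMPLATE = (
    "[INST] Give the assembly for the following code when optimized "
    "using opt -p '{passes}':\n"
    "\n"
    "<code>{ir}</code>\n"
    "\n"
    "The input code has instruction count {ir_count} and binary size "
    "{bin_size} bytes. [/INST]"
)

def build_prompt(ir: str, passes: str, ir_count: int, bin_size: int) -> str:
    return TEMPLATE.format(passes=passes, ir=ir, ir_count=ir_count,
                           bin_size=bin_size)

# Example with placeholder IR; in practice we pasted in the .ll contents and
# the numbers gathered from instcount and llvm-size.
prompt = build_prompt("define i64 @f() {\n  ret i64 0\n}", "O0", 1, 16)
```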

Inference

Llama.cpp works extremely well so this part is straightforward. You can convert a model from HuggingFace format to GGUF using the convert.py script in the llama.cpp repo. You will then build llama.cpp for your system, making sure to do a CUDA build or the fastest option your system supports.

Note: You can run llama.cpp as a server and there are scripts to expose an OpenAI-compatible API if you are looking to do inference programmatically.

Our run command is as follows.

llama-cli -m models/llm-compiler-7b/ggml-model-f16.gguf -ngl 33 -f prompts/ir-to-asm.txt

We point the command to the converted model. We tell it to load all 33 layers onto the GPU (-ngl). If you don't know the number of layers you can give -ngl a big number (e.g. 50) and it will just load them all and tell you how many it loaded. Finally, we provide a file containing our prompt.

The result, with our prompt, is below.

[INST] Give the x86-64 assembly for the following code when optimized using opt -p 'O0':

<code>
; ModuleID = 'consume_line.bc'
source_filename = "main.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

; Function Attrs: noinline nounwind uwtable
define dso_local i64 @consume_line(ptr noundef %0, i64 noundef %1, i64 noundef %2) #0 {
%4 = alloca ptr, align 8
%5 = alloca i64, align 8
%6 = alloca i64, align 8
store ptr %0, ptr %4, align 8
store i64 %1, ptr %5, align 8
store i64 %2, ptr %6, align 8
br label %7

7: ; preds = %20, %3
%8 = load i64, ptr %6, align 8
%9 = load i64, ptr %5, align 8
%10 = icmp ult i64 %8, %9
br i1 %10, label %11, label %18

11: ; preds = %7
%12 = load ptr, ptr %4, align 8
%13 = load i64, ptr %6, align 8
%14 = getelementptr inbounds i8, ptr %12, i64 %13
%15 = load i8, ptr %14, align 1
%16 = sext i8 %15 to i32
%17 = icmp ne i32 %16, 10
br label %18

18: ; preds = %11, %7
%19 = phi i1 [ false, %7 ], [ %17, %11 ]
br i1 %19, label %20, label %23

20: ; preds = %18
%21 = load i64, ptr %6, align 8
%22 = add i64 %21, 1
store i64 %22, ptr %6, align 8
br label %7, !llvm.loop !6

23: ; preds = %18
%24 = load i64, ptr %6, align 8
%25 = add i64 %24, 1
store i64 %25, ptr %6, align 8
%26 = load i64, ptr %6, align 8
ret i64 %26
}

attributes #0 = {
noinline nounwind uwtable
"frame-pointer"="all"
"min-legal-vector-width"="0"
"no-trapping-math"="true"
"stack-protector-buffer-size"="8"
"target-cpu"="x86-64"
"target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87"
"tune-cpu"="generic"
}

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 8, !"PIC Level", i32 2}
!2 = !{i32 7, !"PIE Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{i32 7, !"frame-pointer", i32 2}
!5 = !{!"Ubuntu clang version 18.1.8 (++20240615103753+3b5b5c1ec4a3-1~exp1~20240615223858.136)"}
!6 = distinct !{!6, !7}
!7 = !{!"llvm.loop.mustprogress"}

</code>

The input code has instruction count 29 and binary size 129 bytes. [/INST]

The assembly will have instruction count 29 and binary sise 129 bytes:

<code> .text
.file "-"
.globl consume_line # -- Begin function consume_line
.type consume_line,@function
consume_line: # @consume_line
.cfi_startproc
# %bb.0:
movq %rdi, -8(%rsp)
movq %rsi, -16(%rsp)
movq %rdx, -24(%rsp)
.LBB0_1: # =>This Inner Loop Header: Depth=1
movq -24(%rsp), %rcx
xorl %eax, %eax
# kill: def $al killed $al killed $eax
cmpq -16(%rsp), %rcx
movb %al, -25(%rsp) # 1-byte Spill
jae .LBB0_3
# %bb.2: # in Loop: Header=BB0_1 Depth=1
movq -8(%rsp), %rax
movq -24(%rsp), %rcx
movsbl (%rax,%rcx), %eax
cmpl $10, %eax
setne %al
movb %al, -25(%rsp) # 1-byte Spill
.LBB0_3: # in Loop: Header=BB0_1 Depth=1
movb -25(%rsp), %al # 1-byte Reload
testb $1, %al
jne .LBB0_4
jmp .LBB0_5
.LBB0_4: # in Loop: Header=BB0_1 Depth=1
movq -24(%rsp), %rax
addq $1, %rax
movq %rax, -24(%rsp)
jmp .LBB0_1
.LBB0_5:
movq -24(%rsp), %rax
addq $1, %rax
movq %rax, -24(%rsp)
movq -24(%rsp), %rax
retq
.Lfunc_end0:
.size consume_line, .Lfunc_end0-consume_line
.cfi_endproc
# -- End function
.ident "clang version 17.0.6 (git@github.com:fairinternal/CodeGen.git b05db9bbf7a92019267416c1bb9996fe6134e3f1)"
.section ".note.GNU-stack","",@progbits
.addrsig
</code>

As you might have noticed, Meta has decided to output AT&T syntax for the x86-64 assembly. This is truly an unfortunate choice. There is also a very consistent typo, "sise", at the start of the response.

The prompt for consume_line was 1176 tokens, resulting in an output of 738 tokens.

The prompt for copy_64_hex was 2496 tokens, resulting in an output of 1415 tokens.

The prompt for validate was 4226 tokens, resulting in an output of 2570 tokens.

The prompt for parse_diff was 6391 tokens, resulting in an output of 3344 tokens.

It is pretty interesting that, while the two are similar in line count, the AT&T-syntax x86-64 assembly output is significantly fewer tokens than the LLVM IR input. This means that training could push a bit higher than 8k tokens on the function size and comfortably remain within the context window.
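Putting numbers to that observation, here are the output-to-input token ratios from the four runs above (note that the prompt count includes template boilerplate, so these slightly overstate the IR side):

```python
# (prompt tokens, output tokens) as reported above for each function.
counts = {
    "consume_line": (1176, 738),
    "copy_64_hex":  (2496, 1415),
    "validate":     (4226, 2570),
    "parse_diff":   (6391, 3344),
}
for name, (prompt_toks, out_toks) in counts.items():
    print(f"{name}: {out_toks / prompt_toks:.2f}")
# Every ratio lands between roughly 0.5 and 0.65.
```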

Evaluation

Ideally, we would evaluate the model by generating a number of samples of each function and seeing what percentage maintained the original functionality. In the relatively short time window in which this experiment was performed, we could not find an elegant way to link the generated assembly code with the LLVM IR in order to automate this process.

We plan on looking into this more in the future, but if anyone has insight into linking the original LLVM IR program (with the target function removed) against the assembled output of LLM Compiler, please let us know!

For this reason, we generated one of each function (Top 1), assembled it into an object file, then patched the bytes of that assembled function into the original binary using Binary Ninja.
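The byte-copying step itself is mechanical; outside of Binary Ninja it amounts to overwriting bytes at a known file offset. A generic sketch of that step (the helper is ours; finding the offset and extracting the generated function's bytes from the object file are still manual steps):

```python
def patch_bytes(path: str, file_offset: int, new_bytes: bytes) -> None:
    """Overwrite len(new_bytes) bytes in-place at file_offset in the binary."""
    with open(path, "r+b") as f:
        f.seek(file_offset)
        f.write(new_bytes)
```

Call targets and data references inside the patched bytes still have to be fixed up afterward, of course.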

This method poses a few challenges. The first is function size. Since we are copying machine code bytes from an object file into the original executable, we need to have enough space to do so. All of the generated functions were smaller than their respective originals except copy_64_hex. Thankfully, copy_64_hex was only 1 byte larger than the original, so it overran into the codecave between functions but was still able to be patched in.

The second is that function calls need to be fixed up. For example, validate from an isolated object file does not know where consume_line is in the original executable. Both validate and parse_diff call other functions, so after we patched them in, we had to fix their calls.

The third issue is around data references. Similar to function calls, the isolated function in a standalone object has no way to know where a specific string is in the data section. Only parse_diff references variables in the data section, so those two string references need to be fixed post-patch.

The test program we wrote is self-testing. After a patch there are three failure modes: the application can crash, the parse can return false, or the parse can return true with an incorrect result, which is caught by the check in main. After patching in each generated function we run the executable to see how LLM Compiler performed.

Results

Consume Line

Success! For this relatively short and simple function, the model generated correct assembly and the program ran nominally. The generated assembly for this function is nearly identical to the original. The only difference is that the function prologue (push rbp, mov rbp, rsp) and epilogue (pop rbp) are skipped in the generated version.

Note: All of the generations had trailing .addrsig statements that our version of llvm-as did not recognize. These were removed.

Copy 64 Hex

Success! The same situation for copy_64_hex. This function is a bit longer but the variations from the original are minor. The generated version skips setting up a stack frame for the function and so it addresses stack variables relative to rsp as opposed to rbp in the original. Additionally, the order of the OR'd conditionals in the if statement is different.

Validate

We get our first bug with validate. The program crashed after patching this function in so we ran it under gdb to see the crashing location. Below is the first block of the original function followed by the first block of the generated function.

000012f0 55 push rbp {__saved_rbp}
000012f1 4889e5 mov rbp, rsp {__saved_rbp}
000012f4 4883ec30 sub rsp, 0x30
000012f8 48897df0 mov qword [rbp-0x10 {var_18}], rdi
000012fc 488975e8 mov qword [rbp-0x18 {var_20}], rsi
00001300 488955e0 mov qword [rbp-0x20 {var_28}], rdx
00001304 488b45f0 mov rax, qword [rbp-0x10 {var_18}]
00001308 488b4de0 mov rcx, qword [rbp-0x20 {var_28}]
0000130c 488b09 mov rcx, qword [rcx]
0000130f 813c0864696666 cmp dword [rax+rcx], 0x66666964
00001316 0f8409000000 je 0x1325
00000000 4883ec38 sub rsp, 0x38
00000004 48897c2428 mov qword [rsp+0x28 {var_10}], rdi
00000009 4889742420 mov qword [rsp+0x20 {var_18}], rsi
0000000e 4889542418 mov qword [rsp+0x18 {var_20}], rdx
00000013 488b442428 mov rax, qword [rsp+0x28 {var_10}]
00000018 488b4c2418 mov rcx, qword [rsp+0x18 {var_20}]
0000001d 8b0408 mov eax, dword [rax+rcx]
00000020 3d64696666 cmp eax, 0x66666964
00000025 740a je 0x31

We are looking at the assembly for this if statement.

bool validate(char *in_buf, size_t in_size, size_t *idx) {
// Check starts with "diff".
if (*(uint32_t *)(in_buf+*idx) != 0x66666964) {
return false;
}

The issue is that the generated function skips the dereference of rcx (the pointer idx) before using rcx in the dereference [rax+rcx]. The original correctly retrieves the value stored at the pointer (the index into in_buf), while the generated version uses the pointer itself as the index. When your buffer is set up to be accessed at a relatively small offset from its base and you instead use an ASLR'd pointer as the index, you end up with an out of bounds memory access and a crash.

Parse Diff

Parse diff did pretty well for being the largest function. Was it correct? No. It was close though, which is pretty cool.

We noticed the first issue when patching the data references. The strings for the LFS version and for SHA256 are in the data section. We patched the address for those accesses and then the program crashed. For both accesses, LLM Compiler generated mov instructions to load the value at those addresses into a register instead of lea (load effective address) instructions to move the address into a register. The original uses lea. As you can imagine, using a value as an address will generally result in an out of bounds memory access.

This is an issue for which we wish we had more than N=1 samples. We speculate that the way functions were selected for training caused an underrepresentation of data references that get lea'd. We would love to do a future experiment where we generate (e.g.) N=100 samples, automate patching and testing, and establish whether behavior like this is a consistent trend in the model.

We noticed the second issue while cleaning up the function addresses that are called by parse_diff. The original function calls copy_64_hex twice. We were quite confused when, looking for an address that was called twice at the end of the generated function in order to identify which calls were copy_64_hex, we found that no address was called twice. It turns out LLM Compiler just left off the second call.

Our speculation here is that the average training data size might cause the LLM Compiler to conclude early when generating a longer function. Again though, we have no insight into the training data so this is purely conjecture.

While the two mov's did cause crashes, if you patch those to be lea's and fix up the calls and data references, the function runs to completion, albeit incorrectly.

The last issue is with the end of the function. Ignoring that the second call to copy_64_hex is skipped, the function concludes with the string-based check against sha256:. If the values do not match, it sets the return value to true (setne al). Conversely, this means that when they do match, it returns false, indicating a parse error even though all the checks passed.

000002ed 488b442438 mov rax, qword [rsp+0x38 {var_30}]
000002f2 4883c004 add rax, 0x4
000002f6 4889442438 mov qword [rsp+0x38 {var_30}], rax
000002fb 488b042580030000 mov rax, qword [??_C@_08NLHCIFNC@?5sha256?3?$AA@]
00000303 4889442408 mov qword [rsp+0x8 {var_60_1}], rax
00000308 488b442440 mov rax, qword [rsp+0x40 {var_28}]
0000030d 488b4c2438 mov rcx, qword [rsp+0x38 {var_30}]
00000312 488b0408 mov rax, qword [rax+rcx]
00000316 488b4c2408 mov rcx, qword [rsp+0x8 {var_60_1}]
0000031b 483b01 cmp rax, qword [rcx]
0000031e 0f95c0 setne al
00000321 2401 and al, 0x1
00000323 88442467 mov byte [rsp+0x67 {var_1}], al

00000327 8a442467 mov al, byte [rsp+0x67 {var_1}]
0000032b 2401 and al, 0x1
0000032d 0fb6c0 movzx eax, al
00000330 4883c468 add rsp, 0x68
00000334 c3 retn {__return_addr}

Finally, patching the setne to a sete so that the function returns true, we get a rejection in main because both hashes were not successfully copied.

Thoughts

Even though the task of translating IR to assembly is already performed (generally, correctly) by ubiquitous compilers, we are stoked to see deep learning research in this area.

The ergonomics of the prompts and outputs were rough. The apparent requirement for instruction count and binary size added a lot of steps before being able to generate code. It would be great to establish whether the instruction count and binary size parameters have any impact on generation quality, with the intention of dropping them from the prompt.

For the outputs, it would be great to focus on a translation that is easy to validate. Hand patching object files into executables does not scale; having LLVM link or merge an assembly file with an LLVM IR file would. Now, it might be our less-than-expert knowledge of LLVM, but if linking the output format is not possible without significant engineering, it might be good to focus on outputs or translations (source to IR, perchance) that can be more easily situated into a larger project for validation.

This opinion applies to other code generation models as well. The output formats were selected with the assumption that there is a human in the loop to copy and paste code or to place a cursor in an IDE; they are generally not well suited to being programmatically situated in a larger code project.

The results were surprisingly good. This is the first model in a niche domain. That the smaller functions worked on the first generation demonstrates the viability of this research.

In our experience, when you start pushing generation to the edge of the training data's normal token count you get weird results even if you have a larger context window, so the end of parse_diff going off the rails is not surprising or damning. The model might simply have a trained preference for a certain output length.

It would be awesome to see the model expanded to Windows (MASM) and have it be able to switch between OS-specific calling conventions. Additionally, a transition to Intel syntax would be most welcome here.

Overall, a very cool release.

Post by: @TACIXAT