e-mail: laurie@tratt.net   twitter: laurencetratt

What's the Most Portable Way to Include Binary Blobs in an Executable?

July 25 2022

I recently needed to include an arbitrary blob of data in an executable, in a manner that's easily ported across platforms. I soon discovered that there are various solutions to including blobs, but finding out what the trade-offs are has been a case of trial and error [1]. In this post I'm going to try and document the portability (or lack thereof...) of the solutions I've tried, give a rough idea of performance, and then explain why I'll probably use a combination of several solutions in the future.

Outlining the problem

Let's assume that I want to embed an arbitrary null-terminated string into an ELF executable and that I'm OK with that string having the fixed symbol name string_blob. My C program may then look as simple as:

#include <stdio.h>
extern char string_blob[];
int main() {
    printf("%s\n", string_blob);
    return 0;
}
Let's compile my C program ex1.c into an object file (i.e. a '.o' file):
$ cc -c -o ex1.o ex1.c
Here's my actual blob of data:
$ printf 'blobby blobby blobby\0' > string_blob.txt
What do I do now? Well, I need to produce a second object file that contains my blob of data. On both my OpenBSD and Linux amd64 machines I can use objcopy to convert a blob into an object file:
$ objcopy -I binary -O elf64-x86-64 -B i386:x86-64 \
    string_blob.txt string_blob.o
Then I can link the two files together and run them:
$ cc -o ex1 ex1.o string_blob.o
ld: error: undefined symbol: string_blob
>>> referenced by ex1.c
>>>               ex1.o:(main)
cc: error: linker command failed with exit code 1 (use -v to see invocation)
Perhaps unsurprisingly this has failed, as objcopy hasn't created a symbol called string_blob. Let's see what symbols string_blob.o actually defines:
$ readelf -Ws string_blob.o

Symbol table '.symtab' contains 5 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
     2: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    1 _binary_string_blob_txt_start
     3: 0000000000000015     0 NOTYPE  GLOBAL DEFAULT    1 _binary_string_blob_txt_end
     4: 0000000000000015     0 NOTYPE  GLOBAL DEFAULT  ABS _binary_string_blob_txt_size
It turns out that objcopy creates three symbol names, with the one I care about being _binary_escaped_file_name_start [2]. Let's rewrite my C program to use the correct symbol name:
#include <stdio.h>
extern char _binary_string_blob_txt_start[];
int main() {
    printf("%s\n", _binary_string_blob_txt_start);
    return 0;
}
I'll call that version ex2.c and try again:
$ cc -c -o ex2.o ex2.c
$ cc -o ex2 ex2.o string_blob.o
$ ./ex2
blobby blobby blobby
Success!
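
Since the symbol name is derived from the file name, the guess can at least be made mechanical. The following is a minimal sketch which assumes that the rule llvm-objcopy documents (every non-alphanumeric character in the input file name becomes _) also holds for GNU objcopy; the function name blob_symbol is hypothetical:

```shell
# blob_symbol FILE SUFFIX: print the symbol objcopy is expected to create
# for FILE, where SUFFIX is one of start, end, or size.
# Assumption: every non-alphanumeric character in FILE becomes '_'.
blob_symbol() {
    escaped=$(printf '%s' "$1" | LC_ALL=C tr -c 'A-Za-z0-9' '_')
    printf '_binary_%s_%s\n' "$escaped" "$2"
}
```

For example, blob_symbol string_blob.txt start prints _binary_string_blob_txt_start. The name is derived from the path as it is passed to objcopy, so sub/blob.txt and blob.txt should give different symbols.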

Trying to solve objcopy's ugliness

There are at least two pieces of ugliness in my solution above. Let's look again at the objcopy command-line:
$ objcopy -I binary -O elf64-x86-64 -B i386:x86-64 \
    string_blob.txt string_blob.o
Where did the values elf64-x86-64 [3] and i386:x86-64 come from? Well, for me, they came from the tooth fairy, also known in the land of programming as StackOverflow. As happy as I am to lean heavily on search engines to peer into StackOverflow, that's not going to help me work out the right values for platforms I don't know about: what happens if someone tries to run my example on, say, an Arm box?

Is there a portable way to automatically determine the right values? On Linux, with GNU's ld I can easily get the right value for -O with:

$ ld --print-output-format
elf64-x86-64
but getting the -B value is a bit fiddlier [4]:
$ ld --verbose | grep OUTPUT_ARCH \
    | sed -E "s/OUTPUT_ARCH.(.*)./\\1/g"
i386:x86-64
Unfortunately on OpenBSD, which uses LLVM's lld linker:
$ ld --print-output-format
ld: error: unknown argument '--print-output-format'
$ ld --verbose
ld: error: no input files
GNU ld and lld aren't the only linkers I tend to encounter. gold (another GNU linker, different than the "classic" BFD-based ld) and mold (a new performance-focussed linker, broadly in the spirit of lld) are a mixed bag [5]:
$ gold --print-output-format
elf64-x86-64
$ gold --verbose
gold: fatal error: no input files
$ mold --print-output-format
mold: fatal: unknown command line option: --print-output-format
$ mold --verbose
mold: fatal: option -m: argument missing
In short, there doesn't seem to be a portable way of discovering the right values to pass to objcopy. But it's OK, because newer versions of GNU objcopy will create object files in the way I want simply with:
$ objcopy --version | head -n 1
GNU objcopy (GNU Binutils for Debian) 2.35.2
$ objcopy -I binary -O default string_blob.txt string_blob.o
$ cc -o ex2 ex2.o string_blob.o
$ ./ex2
blobby blobby blobby
Unfortunately OpenBSD's much older version of objcopy creates object files which don't seem usable:
$ objcopy --version | head -n 1
GNU objcopy 2.17
$ objcopy -I binary -O default string_blob.txt \
    string_blob.o
$ cc -o ex2 ex2.o string_blob.o
ld: error: string_blob.o is incompatible with /usr/lib/crt0.o
cc: error: linker command failed with exit code 1 (use -v to see invocation)
The key difference seems to be that the object file produced by newer GNU objcopy has a sensible Machine value:
$ objcopy -I binary -O default string_blob.txt \
    string_blob.o
$ readelf -h string_blob.o|grep Machine
  Machine:                           Advanced Micro Devices X86-64
whereas OpenBSD's older GNU objcopy produces an object file with no Machine at all:
$ objcopy -I binary -O default string_blob.txt \
    string_blob.o
$ readelf -h string_blob.o|grep Machine
  Machine:                           None
On OpenBSD I have to specify -B to fix this:
$ objcopy -I binary -O default -B i386:x86-64 string_blob.txt string_blob.o
$ readelf -h string_blob.o|grep Machine
  Machine:                           Advanced Micro Devices X86-64
which is unfortunate as it's not obvious to me, at least, how to interrogate the compiler toolchain to find out what the right value to pass to -B might be.
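
The Machine check above is easy to script, which at least lets a build notice when -B is needed, even if it cannot work out the right value. This is a rough sketch; the function name machine_is_set is hypothetical, and it assumes readelf prints the field in the same "Machine: ..." form shown above:

```shell
# machine_is_set: read `readelf -h` output on stdin and succeed only if
# the ELF header names a machine, i.e. the Machine field is not "None".
machine_is_set() {
    awk '/Machine:/ { found = 1; if ($NF == "None") bad = 1 }
         END { exit !(found && !bad) }'
}
```

A possible use is readelf -h string_blob.o | machine_is_set || echo "objcopy needs -B here".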

But there is an alternative! LLVM has a completely different, but mostly compatible, objcopy called llvm-objcopy and most boxes I have access to have a copy. It certainly works fine if I give it complete values for -O and -B:

$ llvm-objcopy -I binary -O elf64-x86-64 -B i386:x86-64 \
    string_blob.txt string_blob.o
$ cc -o ex2 ex2.o string_blob.o
$ ./ex2
blobby blobby blobby
It's a promising start, but -O doesn't support default as a value:
$ llvm-objcopy -I binary -O default -B i386:x86-64 string_blob.txt string_blob.o
llvm-objcopy: error: invalid output format: 'default'
However, and unlike GNU objcopy, I can leave -B out:
$ llvm-objcopy -I binary -O elf64-x86-64 \
    string_blob.txt string_blob.o
$ cc -o ex2 ex2.o string_blob.o
$ ./ex2
blobby blobby blobby
So, in conclusion it seems that the situation with the various versions of objcopy is:
  1. I have to assume some operating systems have quite an old version of GNU objcopy.
  2. For old versions of GNU objcopy I have to specify -B.
  3. For llvm-objcopy I have to specify -O.
  4. There is no portable way of automatically determining the right values for -O or -B.

Put another way: objcopy doesn't seem to satisfy my initial constraints when it comes to portability.

Using ld

On Debian, which uses the GNU linker, I can use the linker to perform the same task as objcopy without specifying any tricky arguments:
$ ld --version | head -n 1
GNU ld (GNU Binutils for Debian) 2.35.2
$ ld -r -o string_blob.o -b binary string_blob.txt
$ cc -o ex2 ex2.o string_blob.o
$ ./ex2
blobby blobby blobby
However lld isn't as forgiving:
$ ld --version
LLD 13.0.0 (compatible with GNU linkers)
$ ld -r -o string_blob.o -b binary string_blob.txt
ld: error: target emulation unknown: -m or at least one .o file required
I have to use -m to pass lld a value similar to the one objcopy's -O parameter takes (but note that, for reasons unknown to me, hyphens have now become underscores) for it to work:
$ ld --version
$ ld -r -m elf_x86_64 -o string_blob.o \
    -b binary string_blob.txt
$ cc -o ex2 ex2.o string_blob.o
$ ./ex2
blobby blobby blobby
OpenBSD does include GNU's ld (though called ld.bfd because it uses GNU's BFD library) which is of a similar vintage to its version of objcopy but, surprisingly, it's less pernickety than its version of objcopy:
$ ld.bfd --version | head -n 1
GNU ld version 2.17
$ ld.bfd -r -o string_blob.o -b binary \
    string_blob.txt
$ cc -o ex2 ex2.o string_blob.o
$ ./ex2
blobby blobby blobby
gold works as well as GNU ld:
$ gold -r -o string_blob.o -b binary \
    string_blob.txt
$ cc -o ex2 ex2.o string_blob.o
$ ./ex2
blobby blobby blobby
but mold points me back to objcopy which, as we know from above, isn't viable:
$ mold -r -o out.o -b binary LICENSE
mold: fatal: mold does not support `-b binary`. If you want to convert a binary
file into an object file, use `objcopy -I binary -O default <input-file>
<output-file.o>` instead.
At least for Linux and OpenBSD the situation with linkers is thus [6]:
  1. GNU ld and gold work fine.
  2. lld only works with similar limitations to llvm-objcopy.
  3. mold doesn't work at all.

At least at the moment, it seems that I can reasonably expect to find a copy of GNU ld on many systems (which is good) but it might not be the linker the user wants me to use (which is bad). I'm also reasonably sure that some systems (e.g. OS X?) only have lld. Furthermore, because it's so much faster than any other linker I've tried, it seems possible that some systems will make mold their default linker in the future. In summary, I don't really think I can rely on using a linker for my task.

Assembler tricks

Many assemblers support the GNU .incbin directive, which allows us to embed an arbitrary binary blob. Given the following assembly file (which I'll call string_blob.S):
    .global string_blob
string_blob:
    .incbin "string_blob.txt"
and a C file ex3.c using the (under my control!) symbol name string_blob:
#include <stdio.h>
extern char string_blob[];
int main() {
    printf("%s\n", string_blob);
    return 0;
}
everything works nicely on Linux and OpenBSD:
$ cc -c -o string_blob.o string_blob.S
$ cc -c -o ex3.o ex3.c
$ cc -o ex3 ex3.o string_blob.o
$ ./ex3
blobby blobby blobby
It looks like I have a winner! However, .incbin isn't supported by all assemblers; some call it incbin (without the leading '.'); some don't seem to have it at all. The incbin C library does an excellent job of hiding away most of these portability horrors (at least until one comes to MSVC, at which point there's more work involved). Unfortunately it doesn't seem to be available as a package on (at least) Debian or OpenBSD and, since there is no widely agreed upon package manager for C, that means slurping its source code into your repository, which you may or may not be keen on doing.
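
The example blob is null-terminated, so C code can find its end, but arbitrary blobs generally are not. One common way around this is to emit a second label straight after the .incbin so that C code can compute the size as the difference of the two addresses. The following is a sketch of a generator for such a file; the function name gen_incbin_asm and the _end suffix are my own choices, and it assumes GNU-style assembler syntax:

```shell
# gen_incbin_asm SYMBOL FILE: print GNU-style assembly which embeds FILE
# between the labels SYMBOL and SYMBOL_end.
# Assumption: the assembler accepts .global and .incbin as spelled here.
gen_incbin_asm() {
    sym=$1
    file=$2
    printf '    .global %s\n    .global %s_end\n' "$sym" "$sym"
    printf '%s:\n    .incbin "%s"\n%s_end:\n' "$sym" "$file" "$sym"
}
```

Running gen_incbin_asm string_blob string_blob.txt > string_blob.S gives a file which can be assembled as before; on the C side, declaring extern char string_blob_end[]; lets the size be computed as string_blob_end - string_blob.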

Preprocessing

The "obvious" way of including binary blobs is to convert them into C source code and compile that into an object file. One possibility is to use xxd, which generates exactly the sort of C source code I want:
$ xxd -i string_blob.txt
unsigned char string_blob_txt[] = {
  0x62, 0x6c, 0x6f, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6c, 0x6f, 0x62, 0x62,
  0x79, 0x20, 0x62, 0x6c, 0x6f, 0x62, 0x62, 0x79, 0x00
};
unsigned int string_blob_txt_len = 21;
However, perhaps surprisingly, xxd is part of Vim (but not Neovim?), which is a rather heavyweight (and odd) dependency to require just to include a binary blob. I've also seen references to various other programs which supposedly do the same job, but they don't seem widely available as OS-level packages.

Fortunately, we can make use of the venerable and widely available hexdump tool. Although little used these days, its -e parameter allows us to format its output in a manner of our choosing. It's not difficult to get it to produce output that looks very similar to the core of xxd's output:

$ hexdump -v -e '"0x" 1/1 "%02X" ", "' string_blob.txt
0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x00
It's then trivial to use echo to make this into a valid C source file:
$ echo "unsigned char string_blob[] = {" \
    > string_blob.c
$ hexdump -v -e '"0x" 1/1 "%02X" ", "' \
    string_blob.txt >> string_blob.c
$ printf '\n};\n' >> string_blob.c
which produces this:
unsigned char string_blob[] = {
0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x00, 
};
which I can then compile:
$ cc -c -o string_blob.o string_blob.c
$ cc -o ex3 ex3.o string_blob.o
$ ./ex3
blobby blobby blobby
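
The three commands above can be wrapped into a single function which also emits a length variable, much as xxd -i does. This is only a sketch: the function name blob_to_c and the _len suffix are hypothetical, and an empty input file would produce an empty initialiser, which C compilers may reject:

```shell
# blob_to_c SYMBOL FILE: print C source defining the array SYMBOL, which
# holds the bytes of FILE, and SYMBOL_len, which holds its length.
blob_to_c() {
    sym=$1
    file=$2
    # The arithmetic expansion strips any padding wc adds around the count.
    len=$(($(wc -c < "$file")))
    printf 'unsigned char %s[] = {\n' "$sym"
    hexdump -v -e '"0x" 1/1 "%02X" ", "' "$file"
    printf '\n};\nunsigned long %s_len = %s;\n' "$sym" "$len"
}
```

With this, blob_to_c string_blob string_blob.txt > string_blob.c should produce a file equivalent to the one above, plus the extra string_blob_len definition.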

However, this route is not fast, and for large binary blobs, especially if they frequently change, it would be a definite bottleneck. As a quick test, I took an 82MiB input file on my desktop machine: hexdump took about 15 seconds to produce the C output; and clang took about 90 seconds, and used just under 10GiB of RAM at its peak, to produce an object file. That's more than two orders of magnitude longer than the 0.2 seconds it took objcopy and the 0.8 seconds it took using incbin in assembler!

A variant of this approach is to use hexdump to produce output suitable for an assembler:

$ printf '.global string_blob\nstring_blob:\n' \
    > string_blob.S
$ hexdump -v -e '".byte 0x" 1/1 "%02X" "\n"' \
    string_blob.txt >> string_blob.S
$ as -o string_blob.o string_blob.S
$ cc -o ex3 ex3.o string_blob.o
$ ./ex3
blobby blobby blobby
The good news is that while hexdump still takes about 15 seconds to convert my huge file, GNU as takes only 17 seconds and uses a peak of just under 100MiB RAM. That's a lot better than when using clang! However, I'm not sure whether there are any modern assemblers that support .byte that don't support .incbin. In other words, if I've become desperate enough to use hexdump, I suspect it's because I've found myself in a situation where the assembler is expecting a syntax I don't know about.

Summary

There are almost certainly other ways of achieving what I want [7] and in the future it looks like we might finally have compilers which can include blobs directly with a #embed directive. However, realistically it will take many years before I can rely on every compiler I come across supporting this. In the interim, I feel like the routes I've outlined above give us a reasonable spread of options. What would I actually use in practice? Well, if I'm only dealing with small binary blobs, I'd probably use the hexdump route because it works easily on every platform I have access to [8].

If performance was an issue, I would be forced to interrogate the system to see if one of the faster routes worked, gradually falling back on slower routes otherwise. For example, in a configure script I would, in order:

  1. test whether objcopy -I binary -O default text.txt out.o (where text.txt is a small file whose contents are irrelevant) produces an object file which can be linked to produce an executable.
  2. test whether the assembler works with .incbin or incbin.
  3. otherwise use the hexdump-into-C route.
At least on Linux and OpenBSD (and, I suspect, on most other modern Unices including OS X) one of the first two routes (which are roughly equivalently fast) would succeed. But if they didn't, I'd be fairly confident that hexdump would either be available or easily installed by the user. As a pleasant bonus the hexdump route will work equally well on non-ELF platforms (though I couldn't be entirely sure that all compilers would cope with huge binary blobs).
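
The three steps above can be sketched as a shell function suitable for a configure script. This is an illustration rather than a tested-everywhere implementation: the function name probe_blob_route and the file names are hypothetical, it only tries the .incbin spelling (not incbin), and it assumes mktemp -d is available:

```shell
# probe_blob_route: print the fastest blob-embedding route that appears
# to work on this machine: objcopy, incbin, or hexdump.
probe_blob_route() {
    cc=${CC:-cc}
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        printf 'probe' > text.txt
        printf 'int main(void) { return 0; }\n' > main.c

        # Route 1: objcopy's "default" output format, checked by linking.
        if objcopy -I binary -O default text.txt out.o 2>/dev/null \
            && "$cc" -o probe1 main.c out.o 2>/dev/null; then
            echo objcopy
            exit 0
        fi

        # Route 2: the assembler's .incbin directive, checked by linking.
        printf '    .global blob\nblob:\n    .incbin "text.txt"\n' > blob.S
        if "$cc" -c -o blob.o blob.S 2>/dev/null \
            && "$cc" -o probe2 main.c blob.o 2>/dev/null; then
            echo incbin
            exit 0
        fi

        # Route 3: fall back on the slow but widely available route.
        echo hexdump
    )
    status=$?
    rm -rf "$dir"
    return $status
}
```

A configure script could then branch on route=$(probe_blob_route). Linking the probe object into a real executable matters because, as the OpenBSD example showed, objcopy can succeed while still producing an object file that the linker rejects.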

Update (2022-07-25): David Chisnall points out that (at least) clang can process blobs-in-C-source-code faster if they're embedded as a string (but watch out for the null byte at the end!).
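
One possible way of producing such a string is sketched below. It swaps hexdump for the POSIX od and awk tools, purely because octal escapes are easy to generate with them; the function name blob_to_c_string is hypothetical. It gives the array an explicit length so that the string literal's implicit trailing null byte is dropped, which is valid C but not valid C++, and it does not handle empty files:

```shell
# blob_to_c_string SYMBOL FILE: print C source defining SYMBOL as an array
# initialised from string literals, with one literal per 16 bytes of FILE.
blob_to_c_string() {
    sym=$1
    file=$2
    # The arithmetic expansion strips any padding wc adds around the count.
    len=$(($(wc -c < "$file")))
    # The explicit length stops the compiler appending a trailing null byte.
    printf 'unsigned char %s[%s] =\n' "$sym" "$len"
    # od prints each byte as three octal digits; awk turns "142" into "\142".
    od -A n -v -t o1 "$file" | awk '{
        printf "    \""
        for (i = 1; i <= NF; i++) printf "\\%s", $i
        printf "\"\n"
    }'
    printf ';\n'
}
```

Always using three octal digits keeps each escape unambiguous, whatever byte follows it.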

Acknowledgements: thanks to Edd Barrett, Stephen Kell, and Davin McCall for comments.


Footnotes

[1] After I'd put most of the post together, I discovered that C23 will include an #embed directive. In the long term that will probably end up being the easiest way of achieving what I want — but it will take quite a while before I can rely on compilers on random boxes supporting it.
[2] As far as I know, GNU objcopy doesn't specify what the file name escaping rules are though llvm-objcopy says that "non-alphanumeric characters [are] converted to _".

One can also use objcopy to rename symbols using objcopy --redefine-sym "_binary_string_blob_txt_start=string_blob" string_blob.o. Although objcopy has the -w switch to allow wildcards to be specified, none of the 3 versions of objcopy I'm using in this post supports that syntax with --redefine-sym, so you have to work out the "full" name yourself.
[3] In the GNU toolchain this is the BFDName. You can see a list of those supported by GNU objcopy with --info, though without any indication of what the "native" BFDName is. llvm-objcopy does not support --info.
[4] Amusingly if I take the same approach to get the value of -O, I find that ld --verbose likes elf64-x86-64 so much that it specifies it thrice:

OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64",
              "elf64-x86-64")

[5] Stephen Kell points out that, since linker scripts are optional in gold and mold, there is no concept of a default script.
[6] After I wrote this I stumbled across this mind-boggling example of how hard it is to deal with GNU ld, as well as OS X's and mingw's approach (though, if you want to try it out, I think the command-line given for ld is missing -b binary immediately after the -r).
[7] For example, I have only lightly looked into linker scripts because they don't seem to promise greater portability than other routes. Based on a pointer from some brave souls, the best I managed was:
TARGET(binary)
OUTPUT_FORMAT("elf32-i386")
OUTPUT(string_blob.o)
INPUT (string_blob.txt)
with GNU ld, which works, but isn't an improvement over what I could manage via the command line.

In other situations I have used Rust's include_bytes macro, but as rapidly as Rust is growing, I still wouldn't expect to find rustc available on a random box.
[8] I don't have a Windows box to test on, but I presume its newish Unix subsystem includes hexdump, or it's easily available as a package. I also assume (but don't know) that the objcopy routes aren't available on Windows.

OS X does, I believe, include hexdump but OS X is not an ELF platform (it uses the Mach-O format). I assume that the standard developer packages include llvm-objcopy (but I would not expect to find GNU binutils installed on most boxes).
