Outlining the problem
Let's assume that I want to embed an arbitrary null-terminated string into an ELF executable and that I'm OK with that string having the fixed symbol name string_blob. My C program may then look as simple as:

    #include <stdio.h>
    extern char string_blob[];
    int main() {
        printf("%s\n", string_blob);
        return 0;
    }

Let's compile my C program ex1.c into an object file (i.e. a '.o' file):

    $ cc -c -o ex1.o ex1.c

Here's my actual blob of data:

    $ echo -n "blobby blobby blobby\0" > string_blob.txt

What do I do now? Well, I need to produce a second object file that contains my blob of data. On both my OpenBSD and Linux amd64 machines I can use objcopy to convert a blob into an object file:

    $ objcopy -I binary -O elf64-x86-64 -B i386:x86-64 \
        string_blob.txt string_blob.o

Then I can link the two files together and run them:

    $ cc -o ex1 ex1.o string_blob.o
    ld: error: undefined symbol: string_blob
    >>> referenced by ex1.c
    >>>               ex1.o:(main)
    cc: error: linker command failed with exit code 1 (use -v to see invocation)

Perhaps unsurprisingly this has failed, as objcopy hasn't created a symbol called string_blob. Let's see what symbols string_blob.o actually defines:

    $ readelf -Ws string_blob.o

    Symbol table '.symtab' contains 5 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
         1: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
         2: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    1 _binary_string_blob_txt_start
         3: 0000000000000015     0 NOTYPE  GLOBAL DEFAULT    1 _binary_string_blob_txt_end
         4: 0000000000000015     0 NOTYPE  GLOBAL DEFAULT  ABS _binary_string_blob_txt_size

It turns out that objcopy creates three symbol names, with the one I care about being _binary_<escaped file name>_start [2]. Let's rewrite my C program to use the correct symbol name:

    #include <stdio.h>
    extern char _binary_string_blob_txt_start[];
    int main() {
        printf("%s\n", _binary_string_blob_txt_start);
        return 0;
    }

I'll call that version ex2.c and try again:

    $ cc -c -o ex2.o ex2.c
    $ cc -o ex2 ex2.o string_blob.o
    $ ./ex2
    blobby blobby blobby

Success!
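To make the naming scheme concrete, here is a minimal sketch of how the symbol name could be predicted from the input file name. It assumes the escaping rule that llvm-objcopy documents (every non-alphanumeric character becomes '_'); GNU objcopy appears to behave the same way for this example, but I'm not aware that it promises to. The function name blob_symbol_name is hypothetical and not part of any toolchain.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the symbol name objcopy is expected to generate for `path`, e.g.
 * "string_blob.txt" with suffix "start" -> "_binary_string_blob_txt_start".
 * Assumption: every non-alphanumeric character in the path becomes '_'.
 * Returns 0 on success, -1 if `out` is too small.
 */
int blob_symbol_name(char *out, size_t out_sz, const char *path,
                     const char *suffix)
{
    int n = snprintf(out, out_sz, "_binary_%s_%s", path, suffix);
    if (n < 0 || (size_t)n >= out_sz)
        return -1;

    /* Escape only the file-name part, which starts after "_binary_". */
    size_t start = strlen("_binary_");
    size_t end = start + strlen(path);
    for (size_t i = start; i < end; i++) {
        if (!isalnum((unsigned char)out[i]))
            out[i] = '_';
    }
    return 0;
}
```

The path is whatever was passed to objcopy on the command line, so running objcopy from a different directory (and hence with a different relative path) would change the symbol names too.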
Trying to solve objcopy's ugliness
There are at least two pieces of ugliness in my solution above. Let's look again at the objcopy command-line:

    $ objcopy -I binary -O elf64-x86-64 -B i386:x86-64 \
        string_blob.txt string_blob.o

Where did the values elf64-x86-64 [3] and i386:x86-64 come from? Well, for me, they came from the tooth fairy, also known in the land of programming as StackOverflow. As happy as I am to lean heavily on search engines to peer into StackOverflow, that's not going to help me work out the right values for platforms I don't know about: what happens if someone tries to run my example on, say, an Arm box?
Is there a portable way to automatically determine the right values? On Linux, with GNU's ld, I can easily get the right value for -O with:

    $ ld --print-output-format
    elf64-x86-64

but getting the -B value is a bit fiddlier [4]:

    $ ld --verbose | grep OUTPUT_ARCH \
        | sed -E "s/OUTPUT_ARCH.(.*)./\\1/g"
    i386:x86-64

Unfortunately neither command works on OpenBSD, which uses LLVM's lld linker:

    $ ld --print-output-format
    ld: error: unknown argument '--print-output-format'
    $ ld --verbose
    ld: error: no input files

GNU ld and lld aren't the only linkers I tend to encounter. gold (another GNU linker, different than the "classic" BFD-based ld) and mold (a new performance-focussed linker, broadly in the spirit of lld) are a mixed bag [5]:

    $ gold --print-output-format
    elf64-x86-64
    $ gold --verbose
    gold: fatal error: no input files
    $ mold --print-output-format
    mold: fatal: unknown command line option: --print-output-format
    $ mold --verbose
    mold: fatal: option -m: argument missing

In short, there doesn't seem to be a portable way of discovering the right values to pass to objcopy. But it's OK, because newer versions of GNU objcopy will create object files in the way I want simply with:

    $ objcopy --version | head -n 1
    GNU objcopy (GNU Binutils for Debian) 2.35.2
    $ objcopy -I binary -O default string_blob.txt string_blob.o
    $ cc -o ex2 ex2.o string_blob.o
    $ ./ex2
    blobby blobby blobby

Unfortunately OpenBSD's much older version of objcopy creates object files which don't seem usable:

    $ objcopy --version | head -n 1
    GNU objcopy 2.17
    $ objcopy -I binary -O default string_blob.txt \
        string_blob.o
    $ cc -o ex2 ex2.o string_blob.o
    ld: error: string_blob.o is incompatible with /usr/lib/crt0.o
    cc: error: linker command failed with exit code 1 (use -v to see invocation)

The key difference seems to be that the object file produced by newer GNU objcopy has a sensible Machine value:

    $ objcopy -I binary -O default string_blob.txt \
        string_blob.o
    $ readelf -h string_blob.o|grep Machine
      Machine:                           Advanced Micro Devices X86-64

whereas OpenBSD's older GNU objcopy produces an object file with no Machine at all:

    $ objcopy -I binary -O default string_blob.txt \
        string_blob.o
    $ readelf -h string_blob.o|grep Machine
      Machine:                           None

On OpenBSD I have to specify -B to fix this:

    $ objcopy -I binary -O default -B i386:x86-64 string_blob.txt string_blob.o
    $ readelf -h string_blob.o|grep Machine
      Machine:                           Advanced Micro Devices X86-64

which is unfortunate as it's not obvious to me, at least, how to interrogate the compiler toolchain to find out what the right value to pass to -B might be.
But there is an alternative! LLVM has a completely different, but mostly compatible, objcopy called llvm-objcopy and most boxes I have access to have a copy. It certainly works fine if I give it complete values for -O and -B:

    $ llvm-objcopy -I binary -O elf64-x86-64 -B i386:x86-64 \
        string_blob.txt string_blob.o
    $ cc -o ex2 ex2.o string_blob.o
    $ ./ex2
    blobby blobby blobby

It's a promising start, but -O doesn't support default as a value:

    $ llvm-objcopy -I binary -O default -B i386:x86-64 string_blob.txt string_blob.o
    llvm-objcopy: error: invalid output format: 'default'

However, and unlike GNU objcopy, I can leave -B out:

    $ llvm-objcopy -I binary -O elf64-x86-64 \
        string_blob.txt string_blob.o
    $ cc -o ex2 ex2.o string_blob.o
    $ ./ex2
    blobby blobby blobby

So, in conclusion it seems that the situation with the various versions of objcopy is:
- I have to assume some operating systems have quite an old version of GNU objcopy.
- For old versions of GNU objcopy I have to specify -B.
- For llvm-objcopy I have to specify -O.
- There is no portable way of automatically determining the right values for -O or -B.
Put another way: objcopy doesn't seem to satisfy my initial constraints when it comes to portability.
Using ld
On Debian, which uses the GNU linker, I can use the linker to perform the same task as objcopy without specifying any tricky arguments:

    $ ld --version | head -n 1
    GNU ld (GNU Binutils for Debian) 2.35.2
    $ ld -r -o string_blob.o -b binary string_blob.txt
    $ cc -o ex2 ex2.o string_blob.o
    $ ./ex2
    blobby blobby blobby

However lld isn't as forgiving:

    $ ld --version
    LLD 13.0.0 (compatible with GNU linkers)
    $ ld -r -o string_blob.o -b binary string_blob.txt
    ld: error: target emulation unknown: -m or at least one .o file required

For lld to work I have to pass -m a value similar to the one I gave objcopy's -O parameter (but note that, for reasons unknown to me, hyphens have now become underscores):

    $ ld --version
    $ ld -r -m elf_x86_64 -o string_blob.o \
        -b binary string_blob.txt
    $ cc -o ex2 ex2.o string_blob.o
    $ ./ex2
    blobby blobby blobby

OpenBSD does include GNU's ld (though called ld.bfd because it uses GNU's BFD library) which is of a similar vintage to its version of objcopy but, surprisingly, it's less pernickety than its version of objcopy:

    $ ld.bfd --version | head -n 1
    GNU ld version 2.17
    $ ld.bfd -r -o string_blob.o -b binary \
        string_blob.txt
    $ cc -o ex2 ex2.o string_blob.o
    $ ./ex2
    blobby blobby blobby

gold works as well as GNU ld:

    $ gold -r -o string_blob.o -b binary \
        string_blob.txt
    $ cc -o ex2 ex2.o string_blob.o
    $ ./ex2
    blobby blobby blobby

but mold points me back to objcopy which, as we know from above, isn't viable:

    $ mold -r -o out.o -b binary LICENSE
    mold: fatal: mold does not support `-b binary`. If you want to convert a binary file into an object file, use `objcopy -I binary -O default` instead.

At least for Linux and OpenBSD the situation with linkers is thus [6]:
- GNU ld and gold work fine.
- lld only works with similar limitations to llvm-objcopy.
- mold doesn't work at all.
At least at the moment, it seems that I can reasonably expect to find a copy of GNU ld on many systems (which is good) but it might not be the linker the user wants me to use (which is bad). I'm also reasonably sure that some systems (e.g. OS X?) only have lld. Furthermore, because it's so much faster than any other linker I've tried, it seems possible that some systems will make mold their default linker in the future. In summary, I don't really think I can rely on using a linker for my task.
Assembler tricks
Many assemblers support the GNU .incbin directive, which allows us to embed an arbitrary binary blob. Given the following assembly file (which I'll call string_blob.S):

    .global string_blob
    string_blob:
        .incbin "string_blob.txt"

and a C file ex3.c using the (under my control!) symbol name string_blob:

    #include <stdio.h>
    extern char string_blob[];
    int main() {
        printf("%s\n", string_blob);
        return 0;
    }

everything works nicely on Linux and OpenBSD:

    $ cc -c -o string_blob.o string_blob.S
    $ cc -c -o ex3.o ex3.c
    $ cc -o ex3 ex3.o string_blob.o
    $ ./ex3
    blobby blobby blobby

It looks like I have a winner! However, .incbin isn't supported by all assemblers; some call it incbin (without the leading '.'); some don't seem to have it at all. The incbin C library does an excellent job of hiding away most of these portability horrors (at least until one comes to MSVC, at which point there's more work involved). Unfortunately it doesn't seem to be available as a package on (at least) Debian or OpenBSD and, since there is no widely agreed upon package manager for C, that means slurping its source code into your repository, which you may or may not be keen on doing.
Preprocessing
The "obvious" way of including binary blobs is to convert them into C source code and compile that into an object file. One possibility is to use xxd, which generates exactly the sort of C source code I want:

    $ xxd -i string_blob.txt
    unsigned char string_blob_txt[] = {
      0x62, 0x6c, 0x6f, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6c, 0x6f, 0x62, 0x62,
      0x79, 0x20, 0x62, 0x6c, 0x6f, 0x62, 0x62, 0x79, 0x00
    };
    unsigned int string_blob_txt_len = 21;

However, perhaps surprisingly, xxd is part of Vim (but not Neovim?), which is a rather heavyweight (and odd) dependency to require just to include a binary blob. I've also seen references to various other programs which supposedly do the same job, but they don't seem widely available as OS-level packages.

Fortunately, we can make use of the venerable and widely available hexdump tool. Although little used these days, its -e parameter allows us to format its output in a manner of our choosing. It's not difficult to get it to produce output that looks very similar to the core of what xxd produces:

    $ hexdump -v -e '"0x" 1/1 "%02X" ", "' string_blob.txt
    0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x00

It's then trivial to use echo to make this into a valid C source file:

    $ echo "unsigned char string_blob[] = {" \
        > string_blob.c
    $ hexdump -v -e '"0x" 1/1 "%02X" ", "' \
        string_blob.txt >> string_blob.c
    $ echo "\n};" >> string_blob.c

which produces this:

    unsigned char string_blob[] = {
    0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x20, 0x62, 0x6C, 0x6F, 0x62, 0x62, 0x79, 0x00,
    };

which I can then compile:

    $ cc -c -o string_blob.o string_blob.c
    $ cc -o ex3 ex3.o string_blob.o
    $ ./ex3
    blobby blobby blobby
However, this route is not fast, and for large binary blobs, especially if they frequently change, it would be a definite bottleneck. As a quick test, I took an 82MiB input file on my desktop machine: hexdump took about 15 seconds to produce the C output; and clang took about 90 seconds, and used just under 10GiB of RAM at its peak, to produce an object file. That's more than 2 orders of magnitude longer than the 0.2 seconds it took objcopy and the 0.8 seconds it took using incbin in assembler!
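If hexdump itself isn't to hand, the conversion step is small enough to write directly in C. The following is a minimal sketch, not a tuned tool: the function name format_c_array is hypothetical, it works on an in-memory buffer to keep the example short, and it mirrors the layout that the echo and hexdump commands above produce.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format `len` bytes of `data` as a C array definition called `name`,
 * using the same layout as the echo/hexdump pipeline. Returns the number
 * of characters written (excluding the trailing NUL), or -1 if `out` is
 * too small.
 */
int format_c_array(char *out, size_t out_sz, const char *name,
                   const unsigned char *data, size_t len)
{
    int n = snprintf(out, out_sz, "unsigned char %s[] = {\n", name);
    if (n < 0 || (size_t)n >= out_sz)
        return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < len; i++) {
        /* One "0xNN, " entry per byte, like hexdump's "%02X" format. */
        n = snprintf(out + used, out_sz - used, "0x%02X, ",
                     (unsigned)data[i]);
        if (n < 0 || (size_t)n >= out_sz - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, out_sz - used, "\n};\n");
    if (n < 0 || (size_t)n >= out_sz - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

This only replaces the hexdump step: the generated source still has to go through the C compiler, so I would not expect it to help with the compile time and memory use measured above.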
A variant of this approach is to use hexdump to produce output suitable for an assembler:

    $ echo ".global string_blob\nstring_blob:" \
        > string_blob.S
    $ hexdump -v -e '".byte 0x" 1/1 "%02X" "\n"' \
        string_blob.txt >> string_blob.S
    $ as -o string_blob.o string_blob.S
    $ cc -o ex3 ex3.o string_blob.o
    $ ./ex3
    blobby blobby blobby

The good news is that while hexdump still takes about 15 seconds to convert my huge file, GNU as takes only 17 seconds and uses a peak of just under 100MiB RAM. That's a lot better than when using clang! However, I'm not sure whether there are any modern assemblers that support .byte that don't support .incbin. In other words, if I've become desperate enough to use hexdump, I suspect it's because I've found myself in a situation where the assembler is expecting a syntax I don't know about.
Summary
There are almost certainly other ways of achieving what I want [7] and in the future it looks like we might finally have compilers which can include blobs directly with an #embed directive. However, realistically it will take many years before I can rely on every compiler I come across supporting this. In the interim, I feel like the approaches I've outlined above give us a reasonable spread of options.
What would I actually use in practice? Well, if I'm only dealing with small binary blobs, I'd probably use the hexdump route because it works easily on every platform I have access to [8].
If performance was an issue, I would be forced to interrogate the system to see if one of the faster routes worked, gradually falling back on slower routes otherwise. For example, in a configure script I would, in order:
- test whether objcopy -I binary -O default text.txt out.o (where text.txt is a small file whose contents are irrelevant) produces an object file which can be linked to produce an executable.
- test whether the assembler works with .incbin or incbin.
- otherwise use the hexdump-into-C route. hexdump would either be available or easily installed by the user.
As a pleasant bonus the hexdump route will work equally well on non-ELF platforms (though I couldn't be entirely sure that all compilers would cope with huge binary blobs).
Update (2022-07-25): David Chisnall points out that (at least) clang can process blobs-in-C-source-code faster if they're embedded as a string (but watch out for the null byte at the end!).
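As a rough illustration of the string variant, here is a sketch of a generator that emits the blob as a string literal. The function name format_c_string is hypothetical, and I haven't measured how much faster the result compiles. It uses three-digit octal escapes so that an escape can never swallow a following digit, and it gives the array an explicit size so that (in C, though not C++) the literal's implicit terminating null byte isn't appended to the blob.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format `len` bytes of `data` as `unsigned char NAME[LEN] = "...";`.
 * Every byte is written as a three-digit octal escape. Returns the number
 * of characters written (excluding the trailing NUL), or -1 if `out` is
 * too small.
 */
int format_c_string(char *out, size_t out_sz, const char *name,
                    const unsigned char *data, size_t len)
{
    int n = snprintf(out, out_sz, "unsigned char %s[%zu] = \"", name, len);
    if (n < 0 || (size_t)n >= out_sz)
        return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < len; i++) {
        /* "\ooo" is always four characters, so escapes cannot run together. */
        n = snprintf(out + used, out_sz - used, "\\%03o", (unsigned)data[i]);
        if (n < 0 || (size_t)n >= out_sz - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, out_sz - used, "\";\n");
    if (n < 0 || (size_t)n >= out_sz - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

If the blob already ends in a null byte, as string_blob.txt does, that byte is emitted as \000 like any other and counted in the array size.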
Acknowledgements: thanks to Edd Barrett, Stephen Kell, and Davin McCall for comments.
Footnotes
[1] After I'd put most of the post together, I discovered that C23 will include an #embed directive. In the long term that will probably end up being the easiest way of achieving what I want, but it will take quite a while before I can rely on compilers on random boxes supporting it.
[2] As far as I know, GNU objcopy doesn't specify what the file name escaping rules are, though llvm-objcopy says that "non-alphanumeric characters [are] converted to _".
One can also use objcopy to rename symbols using objcopy --redefine-sym "_binary_string_blob_txt_start=string_blob" string_blob.o. Although objcopy has the -w switch to allow wildcards to be specified, none of the 3 versions of objcopy I'm using in this post supports that syntax with --redefine-sym, so you have to work out the "full" name yourself.
[3] In the GNU toolchain this is the BFDName. You can see a list of those supported by GNU objcopy with --info, though without any indication of what the "native" BFDName is. llvm-objcopy does not support --info.
[4] Amusingly if I take the same approach to get the value of -O, I find that ld --verbose likes elf64-x86-64 so much that it specifies it thrice:

    OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64",
                  "elf64-x86-64")

[5] Stephen Kell points out that, since linker scripts are optional in gold and mold, there is no concept of a default script.
[6] After I wrote this I stumbled across this mind-boggling example of how hard it is to deal with GNU ld, as well as OS X's and mingw's approach (though, if you want to try it out, I think the command-line given for ld is missing -b binary immediately after the -r).
[7] For example, I have only lightly looked into linker scripts because they don't seem to promise greater portability than other routes. Based on a pointer from some brave souls, the best I managed was:

    TARGET(binary)
    OUTPUT_FORMAT("elf32-i386")
    OUTPUT(string_blob.o)
    INPUT (string_blob.txt)

with GNU ld, which works, but isn't an improvement over what I could manage via the command line.
In other situations I have used Rust's include_bytes macro, but as rapidly as Rust is growing, I still wouldn't expect to find rustc available on a random box.
[8] I don't have a Windows box to test on, but I presume its newish Unix subsystem includes hexdump, or it's easily available as a package. I also assume (but don't know) that the objcopy routes aren't available on Windows.
OS X does, I believe, include hexdump but OS X is not an ELF platform (it uses the Mach-O format). I assume that the standard developer packages include llvm-objcopy (but I would not expect to find GNU binutils installed on most boxes).