SHELF Loading is a new type of ELF binary reflective loading that my colleague @Anonymous_ and I first documented on April 21st 2021. This new ELF reflective loading methodology enables the capability to generate compiler-based artifacts with properties that resemble those of shellcode. These compiler-based artifacts are ultimately a Hybrid ELF file between conventional static and PIE binaries. Had the pleasure to publish this research at Tmp0ut, a Linux VX zine. This was Volume #1, and its first release and their official site can be found at tmpout.sh.

Make sure to check other of the great articles published on this first volume if ELF mangling/golfing/fuckery is for you :)

Introduction

Over the last several years there have been several enhancements in Linux offensive tooling in terms of sophistication and complexity. Linux malware has become increasingly more popular, given the higher number of public reports documenting Linux threats. These include government-backed Linux implants such as APT28’s VPNFilter, Drovorub or Winnti wide range of Linux Malware.

However, this increase in popularity does not seem to have had much of an impact in the totality of the sophistication of the current Linux threat landscape just yet. It’s a fairly young ecosystem, where cybercriminals have not been able to identify reliable targets for monetization apart from Cryptocurrency Mining, DDoS, and more recently, Ransomware operations.

In today’s Linux threat landscape, even the smallest refinement or introduction of complexity often results in AV evasion, and therefore Linux malware authors do not tend to invest unnecessary resources to sophisticate their implants. There are various reasons why this phenomenon occurs, and it is subject to ambiguity. The Linux ecosystem, in contrast to other popular spheres such as Windows and MacOS, is more dynamic and diverse, stemming from the many flavors of ELF files for different architectures, the fact that ELF binaries can be valid in many different forms, and that the visibility of Linux threats is often quite poor.

Due to these issues, AV vendors face a completely different set of challenges detecting these threats. Often times this disproportionate detection failure of simple/unsophisticated threats leaves an implicit impression that Linux malware is by nature not complex. This statement couldn’t be further from the truth, and those familiar with the ELF file format know that there is quite a lot of room for innovation with ELF files that other file formats are not able to provide due to their lack of flexibility, even if we have not seen it abused as much over the years.

In this article we are going to discuss a technique that achieves an uncommon functionality of file formats, which generically converts full executables to shellcode in a way that demonstrates, yet again, another example that ELF binaries can be manipulated to achieve offensive innovation that is hard or impossible to replicate in other file formats.

A Primer On ELF Reflective Loading

In order to understand the technique, we must first give a contextual background on previously known ELF techniques upon which this one is based, with a comparison of the benefits and tradeoffs.

Most ELF packers, or any application implementing any form of ELF binary loading, are primarily based on what’s known as User-Land-Exec.

User-Land-Exec is a method first documented by @thegrugq, in which an ELF binary can be loaded without using any of the execve family of system calls, and hence its name.

For the sake of simplicity, the steps to implement an ordinary User-Land-Exec with support of ET_EXEC and ET_DYN ELF binaries is illustrated in the following diagram, showcasing an implementation of the UPX packer for ELF binaries:

As we can observe, this technique has the following requirements (by @thegrugq):

  1. Clean out the address space
  2. If the binary is dynamically linked, load the dynamic linker.
  3. Load the binary.
  4. Initialize the stack.
  5. Determine the entry point (i.e. the dynamic linker or the main executable).
  6. Transfer execution to the entry point.

On a more technical level, we come up with the following requirements:

  1. Setup the stack of the embedded executable with its correspondent Auxiliary Vector.
  2. Parse PHDR’s and identify if there is a PT_INTERP segment, denoting that the file is a dynamically linked executable.
  3. LOAD interpreter if PT_INTERP is present.
  4. LOAD target embedded executable.
  5. Pivot to mapped e_entry of target executable or interpreter accordingly, depending if the target executable is a dynamically linked binary.

For a more in-depth explanation, we suggest reading @thegrugq’s comprehensive paper on the matter [9].

One of the capabilities of conventional User-Land-Exec are the evasion of an execve footprint as previously mentioned, in contrast with other techniques such as memfd_create/execveat, which are also widely used to load end execute a target ELF file. Since the loader maps and loads the target executable, the embedded executable has the flexibility of having a non-conventional structure. This has the side benefit of being useful for evasion and anti-forensics purposes.

On the other hand, since there are a lot of critical artifacts involved in the loading process, it can be easy to recognize by reverse-engineers, as well as being somewhat fragile due to the fact that the technique is heavily dependent on these components. For this reason, writing User-Land-Exec based loaders have been somewhat tedious. As more features get added to the ELF file format, this technique has been inclined to mature over time and thereby increasing its complexity.

The new technique that we will be covering in this paper relies on implementing a generic User-Land-Exec loader with a reduced set of constraints supporting a hybrid PIE and statically linked ELF binaries that to our knowledge have yet to be reported.

We believe this technique represents a drastic improvement of previous versions of User-Land-Exec loaders, since based on the lack of technical implementation constraints and the nature of this new hybrid static/PIE ELF flavor, the extent of capabilities it can provide is wider and more evasive than with previous User-Land-Exec variants.

Internals Of Static PIE Executable Generation

Background

In July of 2017 H. J. Lu patched a bug entry in GCC bugzilla named ‘Support creating static PIE’. This patch mentioned the implementation of a statically based PIE in his branch at glibc hjl/pie/static, in which Lu documented that by supplying –static and –pie flags to the linker along with PIE versions of crt*.o as input, static PIE ELF executables could be generated. It is important to note, that at the time of this patch, generation of fully statically linked PIE binaries was not possible.[1]

In August, Lu submitted a second patch[2] to the GCC driver, for adding the –static flag to support static PIE files that he was able to demonstrate in his previous patch. The patch was accepted in trunk[3], and this feature was released in GCC v8.

Moreover, in December of 2017 a commit was made in glibc[4] adding the option –enable-static-pie. This patch made it possible to embed the needed parts of ld.so to produce standalone static PIE executables.

The major change in glibc to allow static PIE was the addition of the function _dl_relocate_static_pie which gets called by __libc_start_main. This function is used to locate the run-time load address, read the dynamic segment, and perform dynamic relocations before initialization, then transfer control flow of execution to the subject application.

In order to know which flags and compilation/linking stages were needed in order to generate static PIE executables, we passed the flag –static-pie –v to GCC. However, we soon realized by doing this that the linker generated a plethora of flags and calls to internal wrappers. As an example, the linking phase is handled by the tool /usr/lib/gcc/x86_64-linux-gnu/9/collect2 and GCC itself is wrapped by /usr/lib/gcc/x86_64-linux-gnu/9/cc1. Nevertheless, we managed to remove the irrelevant flags and we ended up with the following steps:

These steps are in fact the same provided by Lu, supplying the linker with input files compiled with –fpie, and –static, -pie, -z text, --no-dynamic-linker. In particular, the most relevant artifacts for static PIE creation are rcrt1.o, libc.a, and our own supplied input file, test.o. The rcrt1.o object contains the _start code which has the code required to correctly load the application before executing its entry point by calling the correspondent libc startup code contained in __libc_start_main:

As previously mentioned, __libc_start_main will call the new added function _dl_relocate_static_pie (defined at elf/dl-reloc-static-pie.c file of glibc source). The primary steps performed by this function are commented in the source:

With the help of these features, GCC is capable of generating static executables which can be loaded at any arbitrary address.

We can observe that _dl_relocate_static_pie will handle the needed dynamic relocations. One noticeable difference of rcrt1.o from conventional crt1.o is that all contained code is position independent. Inspecting what the generated binaries look like we see the following:

At first glance they seem to be common dynamically linked PIE executables, based on the ET_DYN executable type retrieved from the ELF header. However, upon closer examination of the segments, we will observe the nonexistent PT_INTERP segment usually denoting the path to the interpreter in dynamically linked executables and the existence of a PT_TLS segment, usually contained only in statically linked executables.

If we check what the dynamic linker identifies the subject executable as, we will see it identifies the file type correctly:

In order to load this file, all we would need to do is map all the PT_LOAD segments to memory, set up the process stack with the correspondent Auxiliary Vector entries, and then pivot to the mapped executable’s entry point. We do not need to be concerned about mapping the RTLD since we don’t have any external dependencies or link time address restrictions.

As we can observe, we have four loadable segments commonly seen in SCOP ELF binaries. However, for the sake of easier deployment, it will be crucial if we could merge all those segments into one as is usually done with ELF disk injection into a foreign executable. We can do just this by using the –N linker flag to merge data and text within a single segment.

Non-compatibility of GCC’s -N and static-pie flags

If we pass –static-pie and –N flags together to GCC we see that it generates the following executable:

The first thing we noticed about the type of generated ELF when using –static-pie alone was that it had a type of ET_DYN, and now together with –N it results in an ET_EXEC.

In addition, if we take a closer look at the segment’s virtual addresses, we see that the generated binary is not a Position Independent Executable. This is due to the fact that the virtual addresses appear to be absolute addresses and not relative ones. To understand why our program is not being linked as expected, we inspected the linker script that was being used.

As we are using the ld linker from binutils, we took a look on how ld selected the linker script; this is done in the ld/ldmain.c code at line 345:

The ldfile_open_default_command_file is in fact an indirect call to an architecture independent function generated at compile time that contains a set of internal linker scripts to be selected depending upon the flags passed to ld. Because we are using the x86_64 architecture, the generated source will be ld/elf_x86_64.c, and the function which is called to select the script is gldelf_x86_64_get_script, which is simply a set of if-else-if statements to select the internal linker script. The –N option sets the config.text_read_only variable to false, which forces the selection function to use an internal script which does not produce PIC as can be seen below:

This way of selecting the default script makes the –static-pie and –N flags non-compatible, because the forced test of selecting the script based on –N is parsed before –static-pie.

Circumvention via custom Linker Script

The incompatibility between –N, -static, and –pie flags led us to a dead end, and we were forced to think of different ways to overcome this barrier. What we attempted was to provide a custom script to drive the linker. As we essentially needed to merge the behavior of two separate linker scripts, our approach was to choose one of the scripts and adapt it to generate the desired outcome with features of the remaining script.

We chose the default script of –static-pie over the one used with –N because in our case it was easier to modify as opposed to changing the –N default script to support PIE generation.

To accomplish this goal, we would need to change the definition of the segments, which are controlled by the PHDRS [5] field in the linker script. If the command is not used the linker will provide program headers generated by default – However, if we neglect this in the linker script, the linker will not create any additional program headers and will strictly follow the guidelines defined in the subject linker script.

Taking into account the details discussed above, we added a PHDRS command to the default linker script, starting with all the original segments which are created by default when using –static-pie:

After this we need to know how each section maps to each segment – and for this we can use readelf as shown below:

With knowledge of the mappings, we just needed to change the section output definition in the linker script which adds the appropriate segment name at the end of each function definition, as shown in the following example:

Here, the .tdata and .tbss sections are being assigned to the segments that get mapped in the same order that we saw in the output of the readelf –l command. Eventually, we ended up having a working script precisely changing all mapped sections which were mapped in data to the text segment:

If we compile our subject test file with this linker script, we see the following generated executable:

We now have a static-pie with just one loadable segment. The same approach can be repeated to remove other irrelevant segments, keeping only critical segments necessary for the execution of the binary. As an example, the following is a static-pie executable instance with minimal program headers needed to run:

The following is the final output of our desired ELF structure – having only one PT_LOAD segment generated by a linker script with the PHDRS command configured as in the screenshot below:

SHELF Loading

This generated ELF flavor gives us some interesting capabilities that other ELF types are not able to provide. For the sake of simplicity, we have labelled this type of ELF binary as SHELF, and we will be referencing it throughout the rest of this paper. The following is an updated diagram of the loading stages needed for SHELF loading:

As we can see in the diagram above, the process of loading SHELF files is highly reduced in complexity compared to conventional ELF loading schemes.

To illustrate the reduced set of constraints to load these types of files, a snippet of a minimalistic SHELF User-Land-Exec approach is as follows:

By using this approach, a subject SHELF file would look as follows in memory and on disk:

As we can observe, the ELF header and Program Headers are missing from the process image. This is a feature that this flavor of ELF enables us to implement and is discussed in the following section.

Anti-Forensic Capabilities

This new approach to User-Land-Exec has also two optional stages useful for anti-forensic purposes. Since the dl_relocate_static_pie function will obtain all of the required fields for relocation from the Auxiliary Vector, this leaves us room to play with how the subject SHELF file structure may look in memory and on disk.

Removing the ELF header will directly impact reconstruction capabilities, because most Linux-based scanners will scan process memory for existing ELF images by first identifying ELF headers. The ELF header will be parsed and will contain further information on where to locate the Program Header Table and consequently the rest of the mapped artifacts of the file.

Removal of the ELF header is trivial since this artifact is not really needed by the loader – all required information in the subject file will be retrieved from the Auxiliary Vector as previously mentioned.

An additional artifact that can be hidden is the Program Header Table. This is a slightly different case when compared with the ELF Header. The Auxiliary Vector needs to locate the Program Header Table in order for the RTLD to successfully load the file by applying the needed runtime relocations. Regardless, there are many approaches to obfuscating the PHT. The simplest approach is to remove the original Program Header Table location, and relocate it somewhere in the file that is only known by the Auxiliary Vector.

We can precompute the location of each of the Auxiliary Vector entries and define each entry as a macro in an include file, tailoring our loader to every subject SHELF file at compile-time. The following is an example of how these macros can be generated:

As we can observe, we have parsed the subject SHELF file for its e_entry and e_phnum fields, creating corresponding macros to hold those values. We also have to choose a random base image to load the file. Finally, we locate the PHT and convert it to an array, then remove it from its original location. Applying these modifications allows us to completely remove the ELF header and change the default location of the subject SHELF file PHT both on disk and in memory(!)

Without successful retrieval of the Program Header Table, reconstruction capabilities may be strictly limited and further heuristics will have to be applied for successful process image reconstruction.

An additional approach to make the reconstruction of the Program Header Table much harder is by instrumenting the way glibc implements the resolution of the Auxiliary Vector fields.

Obscuring SHELF features by PT_TLS patching

Even after modifying the default location of the Program Header Table by choosing a new arbitrary location when crafting the Auxiliary Vector, the Program Header Table would still reside in memory and could be found with some effort. To obscure ourselves even further we can cover how the startup code reads the Auxiliary Vector fields.

The code that does this is in elf/dl_support.c in the function _dl_aux_init. In abstract, the code iterates over all the auxv_t entries, and each of these entries initialize internal variables from glibc:

The only reason the Auxiliary Vector is required is to initialize internal _dl_* variables. Knowing this, we can bypass the creation of the Auxiliary Vector entirely and do the same job that _dl_aux_init would do before passing control of execution to the subject SHELF file.

The only entries which are critical are AT_PHDR, AT_PHNUM, and AT_RANDOM. Therefore, we only need to patch the respective _dl_* variables that depend on these fields. As an example of how to retrieve these values, we can use the following one-liner to generate an include file with precomputed macros holding the offset to every dl_* variable:

With the offset to these variables located, we only need to patch them in the same way the original startup code would do so using the Auxiliary Vector. As a way to illustrate this technique, the following code will initialize the addresses of the Program Headers to new_address, and the number of program headers to the correct number:

At this point we have a working program without supplying the Auxiliary Vector. Because the subject binary is statically linked, and the code that will load the SHELF file is our loader, we can neglect every other segment in the Auxiliary Vector’s AT_PHDR and AT_PHNUM or dl_phdr and dl_phnum respectively. There is an exception, which is the PT_TLS segment which is the interface in which Thread Local Storage is implemented in the ELF file format.

The following code which resides in csu/libc-tls.c on function __libc_setup_tls show the type of information that gets retrieved from the PT_TLS segment:

In the code snippet above, we can see that TLS initialization relies on the presence of the PT_TLS segment. We have several approaches that can obfuscate this artifact, such as patching the __libc_setup_tls function to just return and then initialize the TLS with our own code. Here, we’ll choose to implement a quick patch to glibc instead as a PoC.

To avoid the need of the PT_TLS Program Header we have added a global variable to hold the values from PT_TLS and set the values inside __libc_setup_tls, reading from our global variable instead of the subject SHELF file Program Header Table. With this small change we finally strip all the program headers:

Using the following script to generate _phdr.h:

We can apply our patches in the following way after including _phdr.h:

Applying the methodology shown above, we gain a high level of evasiveness by loading and executing our SHELF file without an ELF header, Program Header Table, and Auxiliary Vector – just as shellcode gets loaded. The following diagram illustrates how straightforward the loading process of SHELF files is:

Conclusion

We have covered the internals of Reflective Loading of ELF files, explaining previous implementations of User-Land-Exec, along with its benefits and drawbacks. We then explained the latest patches in the GCC code base that implemented support for static-pie binaries, discussing our desired outcome, and the approaches we followed to achieve the generation of static-pie ELF files with one single PT_LOAD segment. Finally, we discussed the anti-forensic features that SHELF loading can provide, which we think to be a considerable enhancement when compared with previous versions of ELF Reflective Loading.

We think this could be the next generation of ELF Reflective Loading, and it may benefit readers to understand the extent of offensive capabilities that the ELF file format can provide. If you would like access to the source code, contact @sblip or @ulexec.

References