Linux has a tool (perf stat) that can record the instruction counts of a program (among many other stats). Let’s try it on one of the simplest programs possible – a program that returns an errorcode:

int main(int argc, char* argv[])
{
    return 0;
}

This has the corresponding assembly code 1:

main:
.LFB0:
	.cfi_startproc
	endbr64
	movl	$0, %eax
	ret

Running perf stat on even this simple program results in the execution of 646,399 instructions, even though there are only three effective instructions 2.

 Performance counter stats for './a.out' (100 runs):

              0.26 msec task-clock                #    0.480 CPUs utilized            ( +-  3.53% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                44      page-faults               #  150.895 K/sec                    ( +-  0.22% )
           908,925      cycles                    #    3.117 GHz                      ( +-  2.89% )
           646,399      instructions              #    0.66  insn per cycle           ( +-  0.18% )
           128,260      branches                  #  439.859 M/sec                    ( +-  0.18% )
             3,388      branch-misses             #    2.58% of all branches          ( +-  1.60% )

         0.0005323 +- 0.0000176 seconds time elapsed  ( +-  3.31% )

Clearly, it’s doing something other than the 3 instructions in the assembly. What’s going on here? Are there boilerplate instructions included in this program that aren’t visible to the disassembler? Does this include the instructions required to load the program? Why do the number of instructions vary from run to run.

I believe the linker is including the C runtime. Besides that, this program dynamically links in other libraries, including the libc dynamically linked library, although I don’t understand how it’s being used 3.

$ ldd ./a.out
	linux-vdso.so.1 (0x00007fff86bd3000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb7c95b0000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb7c97c2000)

Let’s try counting only the user instructions, instead of both user and kernel as was done in the previous run.

$ perf stat -r 100 -e instructions:u -o results.txt ./a.out
 Performance counter stats for './a.out' (100 runs):

           119,569      instructions:u                                                ( +-  0.00% )

         0.0005738 +- 0.0000207 seconds time elapsed  ( +-  3.62% )

While the time varies by 3.6% over the 100 runs, the number of instructions doesn’t. That answers one of the questions: the source of the variation in instruction count is from counting kernel instructions. The kernel code must take different code paths from run to run. It’s still executing 119,569 instructions, way more than expected.

Rewriting in x86 Assembly

Drawing inspiration from elfbin, where the author created a tiny “hello world” ELF binary, I rewrote the program in NASM to remove the need for a C runtime:

BITS 64

section .text
global _start
_start:
    mov eax,1
    mov ebx,0
    int 80h

and assembled using

$ nasm -f elf64 simple.asm
$ ld simple.o -o ./a.out
$ strip -s ./a.out

I recorded only the user instructions instead of the default which is to record both user and kernel

$ perf stat -r 100 -e instructions:u -o results.txt ./a.out

This time it only recorded 4 instructions. It’s more than the expected 3, but maybe the int instruction counts as 2 instructions. I like the fact that the instructions do not vary from run to run unlike the execution time.

 Performance counter stats for './a.out' (100 runs):

                 4      instructions:u                                              

        0.00020050 +- 0.00000418 seconds time elapsed  ( +-  2.08% )

Conclusion

perf stat gave the expected instruction count when only user instructions were counted. It matched the expected instruction count almost exactly when a very simple binary was created.

For some unknown reason the kernel instruction counts vary from run to run. That must mean there are branches. What’s interesting is that the code itself only has one system call, exit. Is the variation in kernel count due to the loader, exit, or both? What exactly is the kernel doing? This is a question for another day.

Appendix: Binary Before and After Strip

The ELF binary was run through strip -s to remove all symbols to keep the binary simple to understand. Below is the result of readelf -a --wide before and after running this command.

In case you aren’t clear on the difference between segments and sections (I wasn’t), the segment contains information needed during execution, while the sections contain information needed for linking. The diagram below illustrates the difference . The primary difference is that the loader must know where to load the segments in memory and therefore needs the program header table, while the linker does not need this information. As an interesting side note, relocatable ELF do not have program headers either (binaryexploitation).

Diagram of Segments and Sections credit: nairobi-embedded.org

strip removes the symbol table (.symtab) and its string table.strtab but leaves the program itself (.text) and the string table for the ELF header.shstrtab. The symbols aren’t needed because nothing is linking against this binary. stackoverflow has an explanation of the different string tables.

Before Strip

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x401000
  Start of program headers:          64 (bytes into file)
  Start of section headers:          4352 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         2
  Size of section headers:           64 (bytes)
  Number of section headers:         5
  Section header string table index: 4

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        0000000000401000 001000 00000c 00  AX  0   0 16
  [ 2] .symtab           SYMTAB          0000000000000000 001010 0000a8 18      3   3  8
  [ 3] .strtab           STRTAB          0000000000000000 0010b8 000024 00      0   0  1
  [ 4] .shstrtab         STRTAB          0000000000000000 0010dc 000021 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

There are no section groups in this file.

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x0000b0 0x0000b0 R   0x1000
  LOAD           0x001000 0x0000000000401000 0x0000000000401000 0x00000c 0x00000c R E 0x1000

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .text 

There is no dynamic section in this file.

There are no relocations in this file.

The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.

Symbol table '.symtab' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000401000     0 SECTION LOCAL  DEFAULT    1 
     2: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS simple.asm
     3: 0000000000401000     0 NOTYPE  GLOBAL DEFAULT    1 _start
     4: 0000000000402000     0 NOTYPE  GLOBAL DEFAULT    1 __bss_start
     5: 0000000000402000     0 NOTYPE  GLOBAL DEFAULT    1 _edata
     6: 0000000000402000     0 NOTYPE  GLOBAL DEFAULT    1 _end

No version information found in this file.

After Strip

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x401000
  Start of program headers:          64 (bytes into file)
  Start of section headers:          4128 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         2
  Size of section headers:           64 (bytes)
  Number of section headers:         3
  Section header string table index: 2

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        0000000000401000 001000 00000c 00  AX  0   0 16
  [ 2] .shstrtab         STRTAB          0000000000000000 00100c 000011 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

There are no section groups in this file.

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x0000b0 0x0000b0 R   0x1000
  LOAD           0x001000 0x0000000000401000 0x0000000000401000 0x00000c 0x00000c R E 0x1000

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .text 

There is no dynamic section in this file.

There are no relocations in this file.

The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.

No version information found in this file.

Other Notes

perf list shows the available events possible to record on your system. The result depends on the counters present in your CPU.

To start gdb and stop at the first instruction, invoke the tool ($ gdb ./a.out) and enter > start as the first command.

  1. generated with gcc -S 

  2. see this on endbr64 if you’re curious what it does 

  3. binaries are dynamically linked to libc by default. Here is how to statically link libc