Use DWARF to debug WASM in the browser (part 2)

Read line number data from the DWARF symbol

In the previous article, we extracted the debug sections of the WASM binary and started interpreting the DWARF data in them. We extracted the compilation unit (CU) and its associated Debugging Information Entries (DIEs).

In this article, we will see how to interpret the data in the .debug_line section. Keep in mind that using as little space as possible is one of the main design goals of DWARF, which explains why the association between source locations and compiled code is encoded the way it is. For your debugger to know which source location the program counter currently points to, a correspondence between instruction addresses and source locations must be encoded in .debug_line. That data is not stored as a table in a fixed format. It is encoded as a program. We will create a virtual machine made of several registers and read byte code representing instructions, each instruction acting on the state of those registers. At specific points, the VM appends a row to a matrix that maps instruction addresses of the compiled program to the corresponding source locations.

The line program virtual machine

The virtual machine is composed of the following "registers":

  • address: The program counter value for a particular instruction in the original program.
  • op_index: This is related to VLIW architecture, not relevant in the WASM context.
  • file: An index in a filename array.
  • line: The line in the source file.
  • column: The column in the source file.
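  • is_stmt: A boolean indicating that the current instruction is a recommended breakpoint location, such as the beginning of a statement.
  • basic_block: A boolean indicating that the current instruction is the beginning of a basic block. A basic block is a straight-line run of instructions that is only entered at its first instruction and only left at its last one. It is not the same thing as a function.
  • end_sequence: A sequence is a contiguous set of instructions. This boolean indicates that the current address is the first byte after the end of a sequence.
  • prologue_end: A boolean indicating that the current address is where execution should stop for a breakpoint at function entry. The function prologue is done and the actual code of the function is about to run.
  • epilogue_begin: The opposite. A boolean indicating that the current address is where execution should stop for a breakpoint at function exit, after the body has finished and right before the function returns.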
  • isa: Not important in the WASM context.
  • discriminator: An integer identifying the block the instruction belongs to. It tells apart several blocks that share the same file, line and column.

They are described in detail in section 6.2.2 of the DWARF Debugging Information Format version 4. The whole purpose of the program is to change the virtual machine state and generate the correspondence matrix from those registers. Every once in a while, the current register values are appended to the matrix as a new row, which records one more link between the compiled instructions and the source program.
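To make the register list concrete, here is a minimal sketch of the VM state in Python. The names LineState and emit_row are ours, not the standard's. The default values are the initial register values given in section 6.2.2 of DWARF version 4, and the flags cleared after each row are the ones the standard resets when a row is appended by a copy or a special opcode.

```python
from dataclasses import dataclass, replace


@dataclass
class LineState:
    """Registers of the line number VM with their DWARF 4 initial values."""
    address: int = 0
    op_index: int = 0
    file: int = 1
    line: int = 1
    column: int = 0
    is_stmt: bool = False  # to be overwritten with default_is_stmt from the header
    basic_block: bool = False
    end_sequence: bool = False
    prologue_end: bool = False
    epilogue_begin: bool = False
    isa: int = 0
    discriminator: int = 0


def emit_row(state: LineState, matrix: list) -> None:
    """Append a snapshot of the registers, then clear the per-row flags."""
    matrix.append(replace(state))  # a copy, so later changes do not alter the row
    state.discriminator = 0
    state.basic_block = False
    state.prologue_end = False
    state.epilogue_begin = False
```

Storing a copy of the state matters: the VM keeps mutating the same registers after each row is appended.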

The line program instructions

This virtual machine reads its instructions from the .debug_line section. The instruction set is pretty simple. I will not list all the instructions here as you can find them in section 6.2.5 of the aforementioned document. The instructions are of 3 different types:

  • 1. The standard opcodes: They act on the registers in various ways. Their number depends on the version of the DWARF standard and is pretty limited: 12 in version 4. They can have operands.
  • 2. The extended opcodes: Not much different from the standard opcodes. There are four of them in version 4.
  • 3. The special opcodes: They advance both the address and the line registers of the VM by amounts derived from the opcode value through a fairly involved formula described in section 6.2.5.1, then append a row to the matrix. Another space-saving measure.

The special opcodes

They are represented only by their opcode, an unsigned byte, and take no operands. Each time a special opcode is read, the VM updates the address and line registers, creates an entry in the correspondence matrix, and then resets a few flags (see section 6.2.5.1).
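The formula of section 6.2.5.1 can be sketched as follows. This is a simplified version that assumes maximum_operations_per_instruction is 1 (no VLIW), which lets us ignore op_index. The function name decode_special_opcode is ours, and the header values used in the example (opcode_base 13, line_base -5, line_range 14) are illustrative, not read from the binary of this article.

```python
def decode_special_opcode(opcode, opcode_base, line_base, line_range,
                          minimum_instruction_length=1):
    """Return (address_advance, line_advance) for a special opcode.

    Assumes maximum_operations_per_instruction == 1, so op_index stays at 0.
    """
    adjusted = opcode - opcode_base
    address_advance = (adjusted // line_range) * minimum_instruction_length
    line_advance = line_base + (adjusted % line_range)
    return address_advance, line_advance


# Opcode 0x4b with the illustrative header values: address +4, line +1.
print(decode_special_opcode(0x4B, opcode_base=13, line_base=-5, line_range=14))
```

A single byte therefore encodes an address advance, a line advance and the creation of a row, which is where most of the space saving comes from.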

The standard opcodes

As we have seen, there are twelve of them in version 4. Their opcode is an unsigned byte. The opcode can be followed by zero, one or several operands encoded in LEB128, with the exception of DW_LNS_fixed_advance_pc, whose single operand is a plain 2-byte unsigned integer. One example is DW_LNS_advance_line, whose opcode is 0x03 and which takes one signed LEB128 operand that must be added to the line register of the VM.
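Since the operands are LEB128 values, a reader needs both an unsigned and a signed decoder. Here is a minimal sketch of both. The function names are ours, and there is no bounds checking: a truncated input simply raises IndexError.

```python
def read_uleb128(data: bytes, pos: int):
    """Decode an unsigned LEB128 at data[pos:]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        shift += 7
        if (byte & 0x80) == 0:            # high bit clear: last byte
            return result, pos


def read_sleb128(data: bytes, pos: int):
    """Decode a signed LEB128 at data[pos:]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if (byte & 0x80) == 0:
            if byte & 0x40:               # sign bit of the last group is set
                result -= 1 << shift
            return result, pos
```

For instance, the single operand byte 0x7d reads as 125 when unsigned but as -3 when signed, which is how a line advance can go backwards.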

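The extended opcodes

The extended opcodes have their first byte set to 0. It is followed by an unsigned LEB128 giving the size in bytes of the rest of the instruction, then by the opcode itself as an unsigned byte, then by the operands. Four are described in version 4 of the standard, but thanks to the length prefix a reader can skip an opcode it does not know, so there can be many future extensions without breaking backward compatibility.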
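As an illustration, this sketch parses one extended instruction following the layout of section 6.2.5.3 of DWARF version 4: a 0x00 byte, an unsigned LEB128 length, a one-byte opcode, then the operands. The function name read_extended is ours. The 4-byte address in the example assumes a wasm32 target.

```python
DW_LNE_end_sequence = 0x01
DW_LNE_set_address = 0x02
DW_LNE_define_file = 0x03
DW_LNE_set_discriminator = 0x04


def read_extended(data: bytes, pos: int):
    """Parse the extended instruction whose leading 0x00 byte is at `pos`.

    Return (opcode, operand_bytes, next_pos).
    """
    if data[pos] != 0:
        raise ValueError("not an extended instruction")
    pos += 1
    length = shift = 0
    while True:  # unsigned LEB128: size of the opcode byte plus the operands
        byte = data[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    opcode = data[pos]
    operands = data[pos + 1:pos + length]
    return opcode, operands, pos + length


# DW_LNE_set_address with a 4-byte little-endian address of 0x03.
opcode, operands, next_pos = read_extended(bytes([0, 5, 2, 3, 0, 0, 0]), 0)
```

Because next_pos only depends on the length, an unknown opcode can be skipped without knowing anything about its operands.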

The .debug_line section

Like the custom sections we covered in the previous article, the .debug_line section starts with 0x00, followed by the length of the section, the length of the section name and finally the name. The next byte corresponds to offset 0x00 of the first line program header. This is important because each compilation unit has an attribute called DW_AT_stmt_list, which is one of the attributes we read when we decoded the CU.

0x0000000b: DW_TAG_compile_unit
            DW_AT_producer    ("zig 0.10.0")
            DW_AT_language    (DW_LANG_C99)
            DW_AT_name        ("main")
            DW_AT_stmt_list   (0x00000000)  <----- here
            DW_AT_comp_dir    (".")
            DW_AT_GNU_pubnames        (true)
            DW_AT_low_pc      (0x00000000)
            DW_AT_ranges      (0x00000000
               [0x00000003, 0x000000d5)
               [0x000000d6, 0x000000de))
  
This attribute holds an offset into the .debug_line section that indicates the start of the line program corresponding to the CU. The offset is counted from the byte right after the section name.

In our example here, the statement list starts at 0x00. The first thing we will read is the line program header. The header determines some parameters and the initial state of the VM. The header is composed as follows:

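  • unit_length (4 bytes): The size in bytes of the line number information for this CU, not counting the unit_length field itself.
  • version (2 bytes): The version of the line number information. It is specific to this section and independent of the DWARF version number. The layout described here is the version 4 one.
  • header_length (4 bytes): The number of bytes between the end of this field and the first instruction of the line program.
  • minimum_instruction_length (1 byte): The size in bytes of the smallest target instruction. Opcodes that advance the address register multiply their operand by this value.
  • maximum_operations_per_instruction (1 byte): The maximum number of operations in one instruction, used together with op_index on VLIW architectures. It is 1 on all other architectures.
  • default_is_stmt (1 byte): The initial value of the VM's is_stmt register.
  • line_base (1 byte signed!)
  • line_range (1 byte): Both line_base and line_range are used by the special opcodes to compute the address and line advance from the opcode value.
  • opcode_base (1 byte): The number assigned to the first special opcode.
  • standard_opcode_lengths: An array of unsigned bytes, one per standard opcode, giving the number of LEB128 operands of that opcode. The first byte is for opcode 1, the second for opcode 2, and so on up to opcode_base - 1.
  • include_directories: A sequence of zero-terminated strings. The sequence itself ends with one more zero byte, so a non-empty list ends with two zeros. It is used to reconstruct full file paths.
  • file_names: A sequence of records, each made of a zero-terminated file name followed by three unsigned LEB128 values: the index of its directory in include_directories, the last modification time and the file size. The sequence also ends with a single zero byte. The position of a record in the sequence is what the VM register "file" refers to. Note that the sequence starts at index 1.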
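The header layout above can be read with a few lines of Python. This is a sketch that only handles the 32-bit, version 4 layout and does no bounds checking. The function name parse_line_header and the keys of the returned dictionary are ours. The offset argument is the value of DW_AT_stmt_list, counted from the start of the section payload.

```python
import struct

FIXED_FIELDS = "<IHIBBBbBB"  # little-endian, from unit_length to opcode_base


def _uleb(data, pos):
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos


def _cstring(data, pos):
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_line_header(data: bytes, offset: int = 0) -> dict:
    """Parse a 32-bit DWARF version 4 line program header at `offset`."""
    (unit_length, version, header_length, min_inst_length, max_ops,
     default_is_stmt, line_base, line_range, opcode_base) = struct.unpack_from(
        FIXED_FIELDS, data, offset)
    if version != 4:
        raise ValueError(f"this sketch only handles version 4, got {version}")
    pos = offset + struct.calcsize(FIXED_FIELDS)
    # One unsigned byte per standard opcode, for opcodes 1 .. opcode_base - 1.
    opcode_lengths = list(data[pos:pos + opcode_base - 1])
    pos += opcode_base - 1
    include_directories = []
    while data[pos] != 0:
        name, pos = _cstring(data, pos)
        include_directories.append(name)
    pos += 1  # zero byte closing the directory list
    file_names = []
    while data[pos] != 0:
        name, pos = _cstring(data, pos)
        dir_index, pos = _uleb(data, pos)
        mtime, pos = _uleb(data, pos)
        size, pos = _uleb(data, pos)
        file_names.append({"name": name, "dir": dir_index,
                           "mtime": mtime, "size": size})
    return {
        "version": version,
        "minimum_instruction_length": min_inst_length,
        "maximum_operations_per_instruction": max_ops,
        "default_is_stmt": bool(default_is_stmt),
        "line_base": line_base,
        "line_range": line_range,
        "opcode_base": opcode_base,
        "standard_opcode_lengths": opcode_lengths,
        "include_directories": include_directories,
        "file_names": file_names,  # file_names[0] is file number 1
        # header_length counts from the byte right after that field,
        # which sits 10 bytes into the unit.
        "program_start": offset + 10 + header_length,
        # unit_length counts from the byte right after the unit_length field.
        "program_end": offset + 4 + unit_length,
    }
```

Using header_length to find the first instruction, rather than the position reached after file_names, keeps the reader robust if a producer pads the header.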

Once the line header is read, the instructions start. They are read opcode by opcode. If the opcode is zero, it is an extended opcode: the length of the instruction and the actual opcode follow. If the value is non-zero and below opcode_base, it is a standard opcode, described in section 6.2.5.2. Otherwise, it is a special opcode.
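Putting the three cases together, the main loop of the VM could look like the sketch below. It is deliberately incomplete: it tracks only a subset of the registers, assumes minimum_instruction_length and maximum_operations_per_instruction are both 1, and takes the header parameters as arguments with illustrative default values. The name run_line_program and the sample program are ours. The program is assembled by hand to roughly mimic the start of a sequence and is not the byte code of the binary used in this article.

```python
def run_line_program(code, opcode_base=13, line_base=-5, line_range=14,
                     default_is_stmt=True,
                     opcode_lengths=(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1)):
    """Run a line program; return the emitted rows as a list of dicts."""

    def leb128(pos, signed=False):
        result = shift = 0
        while True:
            byte = code[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                if signed and byte & 0x40:
                    result -= 1 << shift
                return result, pos

    def initial_registers():
        return {"address": 0, "file": 1, "line": 1, "column": 0,
                "is_stmt": default_is_stmt, "prologue_end": False,
                "end_sequence": False}

    rows = []
    reg = initial_registers()
    pos = 0
    while pos < len(code):
        opcode = code[pos]
        pos += 1
        if opcode == 0:  # extended: ULEB128 length, opcode byte, operands
            length, pos = leb128(pos)
            sub_opcode = code[pos]
            operands = code[pos + 1:pos + length]
            pos += length
            if sub_opcode == 0x01:  # DW_LNE_end_sequence
                reg["end_sequence"] = True
                rows.append(dict(reg))
                reg = initial_registers()
            elif sub_opcode == 0x02:  # DW_LNE_set_address
                reg["address"] = int.from_bytes(operands, "little")
            # any other extended opcode is skipped thanks to its length
        elif opcode >= opcode_base:  # special opcode
            adjusted = opcode - opcode_base
            reg["address"] += adjusted // line_range
            reg["line"] += line_base + adjusted % line_range
            rows.append(dict(reg))
            reg["prologue_end"] = False
        elif opcode == 0x01:  # DW_LNS_copy
            rows.append(dict(reg))
            reg["prologue_end"] = False
        elif opcode == 0x02:  # DW_LNS_advance_pc
            delta, pos = leb128(pos)
            reg["address"] += delta
        elif opcode == 0x03:  # DW_LNS_advance_line
            delta, pos = leb128(pos, signed=True)
            reg["line"] += delta
        elif opcode == 0x04:  # DW_LNS_set_file
            reg["file"], pos = leb128(pos)
        elif opcode == 0x05:  # DW_LNS_set_column
            reg["column"], pos = leb128(pos)
        elif opcode == 0x06:  # DW_LNS_negate_stmt
            reg["is_stmt"] = not reg["is_stmt"]
        elif opcode == 0x08:  # DW_LNS_const_add_pc
            reg["address"] += (255 - opcode_base) // line_range
        elif opcode == 0x09:  # DW_LNS_fixed_advance_pc: 2-byte operand
            reg["address"] += int.from_bytes(code[pos:pos + 2], "little")
            pos += 2
        elif opcode == 0x0A:  # DW_LNS_set_prologue_end
            reg["prologue_end"] = True
        else:  # registers this sketch does not track: skip the operands
            for _ in range(opcode_lengths[opcode - 1]):
                _, pos = leb128(pos)
    return rows


# Hand-assembled program, NOT the bytes of the article's binary.
PROGRAM = bytes([
    0x00, 0x05, 0x02, 0x03, 0x00, 0x00, 0x00,  # DW_LNE_set_address 0x03
    0x04, 0x05,                                # DW_LNS_set_file 5
    0x03, 0x02,                                # DW_LNS_advance_line +2
    0x01,                                      # DW_LNS_copy
    0x05, 0x03,                                # DW_LNS_set_column 3
    0x0A,                                      # DW_LNS_set_prologue_end
    0x02, 0x22,                                # DW_LNS_advance_pc 0x22
    0x13,                                      # special: address +0, line +1
    0x02, 0xB0, 0x01,                          # DW_LNS_advance_pc 0xb0
    0x00, 0x01, 0x01,                          # DW_LNE_end_sequence
])
rows = run_line_program(PROGRAM)
```

With these 24 bytes the VM emits three rows: address 0x03 at line 3, address 0x25 at line 4 column 3 with prologue_end set, and the closing end_sequence row at address 0xd5.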

Reading this code until the end of the unit (whose length is given in the header) will ensure the VM state is changed appropriately and the correspondence is generated. You will end up with a matrix that will look like this:

Address            Line   Column File   ISA Discriminator Flags
------------------ ------ ------ ------ --- ------------- -------------
0x0000000000000003      3      0      5   0             0  is_stmt
0x0000000000000025      4      3      5   0             0  is_stmt prologue_end
0x0000000000000098      0      3      5   0             0
0x0000000000000099      4      3      5   0             0
0x00000000000000b6      0      3      5   0             0
0x00000000000000b7      4      3      5   0             0
0x00000000000000d5      4      3      5   0             0  end_sequence
0x00000000000000d6    767      0      1   0             0  is_stmt
0x00000000000000d7    788     17      1   0             0  is_stmt prologue_end
0x00000000000000dc      0     17      1   0             0
0x00000000000000de      0     17      1   0             0  end_sequence
  

You can see here that line 4, column 3 of file 5 (which happens to be main.zig) corresponds to the instruction at address 0x25. Pretty straightforward.
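Going the other way, from an address to a source location, amounts to finding the last row whose address is not greater than the one we are looking for. Here is a small sketch using the rows of the matrix above. The function name find_location and the tuple layout are ours, and the sketch assumes the rows are sorted by address, which is the case here.

```python
from bisect import bisect_right

# (address, line, column, file, end_sequence), copied from the matrix above.
MATRIX = [
    (0x03, 3, 0, 5, False),
    (0x25, 4, 3, 5, False),
    (0x98, 0, 3, 5, False),
    (0x99, 4, 3, 5, False),
    (0xB6, 0, 3, 5, False),
    (0xB7, 4, 3, 5, False),
    (0xD5, 4, 3, 5, True),
    (0xD6, 767, 0, 1, False),
    (0xD7, 788, 17, 1, False),
    (0xDC, 0, 17, 1, False),
    (0xDE, 0, 17, 1, True),
]


def find_location(matrix, address):
    """Return (file, line, column) for `address`, or None if it is not covered."""
    addresses = [row[0] for row in matrix]
    i = bisect_right(addresses, address) - 1  # last row with row.address <= address
    if i < 0 or matrix[i][4]:                 # before the first row, or past a sequence end
        return None
    _, line, column, file, _ = matrix[i]
    return file, line, column


print(find_location(MATRIX, 0x30))  # falls in the row starting at 0x25
```

An end_sequence row marks the first address after a sequence, so an address that lands on such a row is not covered by any source location.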

Conclusion

This article did not go into all the gory details of the line number program virtual machine instruction set, but the DWARF standard does a pretty good job of that. In the next article, we will finally see how we can use this information to generate the inline sourcemap to be added to the WASM buffer and enable the browser's debugger.