Developer Documentation - Symbolic Memory
When programs under investigation execute allocation instructions, including alloca
which allocates stack memory and malloc
and others which allocate heap memory, symbolic memory locations are created and henceforth propagated with control flow and operated on by load
, store
and other memory operations. Symbolic memory is effectively blanked when instructions may have any effect, such as a call to an external unannotated function, or a store through an unknown pointer.
Data structures
The symbolic store as a whole is represented by OrdinaryLocalStore
, which consists of a list of stack frames and a global heap (which also contains locations representing global variables, and other special locations). The store as a whole is a copy-on-write object, so sharing it during loops or other control flow divergences is cheap if they don’t actually operate on memory. Stack frames are created and destroyed as calls are entered and exited; they may also be extended due to variable-length arrays on the stack, user alloca
calls and similar. The heap is appended to whenever malloc
or a user-annotated allocator is encountered.
Each memory location is represented by an ImprovedValSet
. This can be an ImprovedValSetSingle
(capable of representing a set of values such as { i32 0, i32 5 }
) or ImprovedValSetMulti
(a list of ImprovedValSetSingle
instances, so it can represent e.g. struct types). To make sparse alterations to large allocations cheap, ImprovedValSetMulti
is itself copy-on-write, and can have a base object that it shadows – for example, if a simple struct like struct X { int a; int b; int c; int d; }
was initially zeroed, resulting in representation [ i32 0, i32 0, i32 0, i32 0 ]
, then only the third member was overwritten with a 1, it can be represented as [ 8-12 i32 1 ] -> [ i32 0, i32 0, i32 0, i32 0 ]
, where 8-12
is the top object’s offset and ->
indicates the base object pointer.
The ImprovedValSet
objects themselves consist of ImprovedVal
instances; these finally are a ShadowValue
plus an optional offset field, used to indicate pointer offsets. ShadowValue
can represent many different types of data as enumerated by ShadowValType
:
SHADOWVAL_CI8
,_CI16
et al: raw constant integers. These are not stored asConstantInt
because that class is both considerably larger than a plain integer, and cannot be deleted, so using it to represent intermediate values would result in prohibitive memory usage. This did require some copy-and-pasting of the LLVM constant folder, however.SHADOWVAL_OTHER
: any otherConstant
, such as a floating-point constant. In theory these have the same problem asConstantInt
, but in LLPE’s practical use so far the problem has not come up.SHADOWVAL_PTRIDX
: a (frame index, object index) pair identifying a stack frame and object. If the stack frame is -1, object index refers to the heap.SHADOWVAL_FDIDX
andSHADOWVAL_FXIDX64
: 32-bit and 64-bit symbolic file descriptors, respectively. They refer to the block-localFDStore
, which is maintained similarly to the symbolic store.
ShadowValue
can also refer to some other types of value which cannot be stored in memory: SHADOWVAL_ARG
, _INST
and _GV
are used to note instruction def-use chains, and are not relevant to symbolic memory.
Thus to sum up the memory objects:
ShadowValue
can describe “constant integer 1” or “the second allocation in my parent function”ImprovedVal
adds an offset field, irrelevant to constants but able to describe “offset 16 in the second allocation in my parent function”ImprovedValSetSingle
is a set of those, so it can describe{ global X offset 0, parent-allocation-2 offset 16 }
ImprovedValSetMulti
is a list of Singles, so it can describe a struct with many pointer fields, for example.
Vague allocations
In general, LLPE looks to allocate a symbolic memory location per allocation that would occur at runtime. However a complication arises when an allocation instruction is found while analysing the general case of a loop body. Depending on context, this couid refer to the allocation that was made in the first iteraton (as in a loop that allocates a particular object the first time it is needed and re-uses it thereafter), or to the allocation made this iteration (as in say an array-to-linked-list function that repeatedly allocates and populates nodes), or to an unknowable combination of the two. LLPE handles this conservatively; such allocation sites (internally called vague allocations) can only be used wtihin the loop body in question, after which they are overwritten with a no-information overdefined value.
Implicit casts
LLPE’s symbolic memory supports all manner of reinterpret and other casts, explicit or implied via aliasing. Struct members and similar are described by byte offsets, not field numbers, and operations such as casting pointers to and from uint64
are entirely invisible, as are reinterpreting for example a byte array as a single large byte array or floating-point number. Some casts will be impossible to evaluate however, such as taking the top 32 bits of a pointer and combining it with a constant (we can’t know ahead of time what the constant result would be, and it is meaningless as a pointer). Certain comparisons are also impossible, such as whether a pointer is numerically greater than an independently-allocated object. A special function integrator_same_object
is provided to allow user-supplied models to replace the common use case x >= base && x < base + offset
used to determine whether x
belongs to a particular object base
.
Useful files to read when investigating / improving the symbolic memory implementation:
include/llvm/Analysis/SharedTree.h
: implementation of the copy-on-write store and frame-lists that compose the symbolic store.main/LoadForward.cpp
: functions that read and write the symbolic store, called on executing load, store, allocation and other instructions.main/MainLoop.cpp
: logic that passes the symbolic store between basic blocks, into and out of called functions and loops, forks / merges it when necessary, and so on.