Saturday, March 3, 2018

Understanding Overflow with Integer Arithmetic

Introduction

In programming we use a lot of arithmetic.  Much of it is simple addition, a fair amount of multiplication, some subtraction, etc...  Use of arithmetic can also be more exposed or more hidden depending on the languages we use, say, C vs. assembly.

Arithmetic falls into two categories: that for which overflow simply cannot occur, and that for which it can.  Generally speaking it is the programmer's responsibility to know which is which!

Understanding when Integer Arithmetic is Safe from Overflow

Much of the integer arithmetic calculation that we do is in access to our (in-memory) data structures.  Such calculations require additions, subtractions, multiplications (and shift) and are known as — especially in the C and assembly language — "pointer arithmetic".

The rule of thumb is that if we accesses memory using pointer arithmetic, we do not need to worry about overflow in our pointer arithmetic. 

However, to be sure, there are a number of caveats, as not all pointer arithmetic is overflow-free:
  • One catch is that we must be sure the memory we're trying to reach exists.

    How can we be sure that the memory indeed exist?

    1. Global variables, including arrays and other data structures always exist or else your program would not be running.  If there was insufficient memory for the operating system to load the program (code and global data) it would not be running.

    2. Always check dynamic memory allocation to see if it succeeds or fails!
      • if the allocation fails, then obviously, the memory does not exist, and our programs must not attempt to use it!
      • if the allocation succeeds, we can avoid overflow checking (subject to the following)

    3. Do not access memory prior to, or past an object, where an object is either:
      • a simple variable or struct or data structure whose address was taken to obtain a pointer or obtained by dynamic memory allocation, or,
      • an array,
      (It ok for a pointer to be computed as a terminal value (a value used for comparison only), as one byte prior to or one byte past an object as long as we don't access memory through that pointer.)

    4. For arrays, understand the provenance of indexing variables or indexing expressions. 

      • The values of indexing variables & expressions can come from algorithms.  When indexing in a simple loop that knows the actual allocated size of an array, we can be sure that overflow will not occur. When using complex indexing expressions that involve algorithmic computation, where we trust the algorithm.  For example, a heap data structure is an array whose elements have children at computable positions in the array.  A heap sort or priority queue algorithm (both use this kind of heap data structure) will operate on the elements and their children using complex indexing computations, that (given a valid algorithm) always access existing elements in the array.

      • However, pointer arithmetic involving indexing values from external and/or unverified provenance, can result in overflow.

    5. For objects, understanding the type of the object.

      This is particularly important when we have objects that share common initial fields and otherwise variable lengths — which can be done with classes & subclasses, tagged unions, or untagged/undiscriminated unions.  The latter — untagged and undiscriminated unions, if allowed by the programming language — can be problematic, since they are not necessarily automatically detected by the runtime of the programming language (and hence the programmer's concern).

  • Another catch is that we must use integers that are (at least) the same size as pointers (i.e. addresses) for the given programming environment. 
    • For 16-bit environments (16-bit pointers & 16-bit integers), the integers almost certainly need to be unsigned as well. 
    • For 32-bit environments, we can generally use 32-bit signed integers, as long as the operating system will not allow us to allocate larger than half the address space (i.e. 2gb), or allocated on the 2gb boundary.  This is generally the case.  If not however, then integers need to be unsigned.
    • For 64-bit environments, we can assume that the operting system will not allow us to allocate objects larger than 2^63, and, will not locate objects on the 2^63 boundary, so we can use signed 64-bit integers.

When should we have doubts about Overflow?

  • When doing non-pointer arithmetic.
  • Whether pointer arithmetic or not, if input is used from external sources is used e.g. values comes from the user or a file or the network.

What should we do when we have doubts?

  • Establish and agree upon bounds and limitations in API contracts, and, validate input values against agreed upon limitations.
  • Use overflow checking arithmetic, and, document conditions upon which overflow will cause failures, such as implementation choices for the size of fixed-sized numbers.
  • Use alternative data types, such as larger integers, floating point types, decimal types, and variable length types like BigInteger.  (Overflow with these other data types is beyond the scope of this post.)

Final Notes

There are many issues with using pointers safely, such as accessing, by pointer, an object that has been deleted, which will almost certainly cause problems for the program.  However, in accessing memory for an object has been deleted (and/or reclaimed for another purpose) overflow in pointer arithmetic is not the immediate concern (because the memory itself still exists), rather the concern is the mismatch between the expected object and the real object at that memory location.

This post first appear on Erik Eidt's "Discussions on Programming Models" at http://erikeidt.blogspot.com/2018/03/understanding-overflow-with-integer.html