Getting your sums right

This article is a response to several enquiries received by Technical Support and should be of interest to most programmers. Although only scratching at the surface it shows the importance of understanding what even a standards conforming language actually promises.

Floating point accuracy

A Fortran programmer who wrote to us (although it could have been any language) had the code fragment

 REAL A,B,X
 X = A * B
 IF (X.EQ.A*B) THEN

and was surprised when the IF statement did not always evaluate true. This is not a bug in the compiler, as might be imagined, but a result of how the compiler works. Floating-point values from intermediate calculations are stored in an internal format that has greater accuracy than a 4-byte real. This is how the result of the calculation A*B is stored. The result is then assigned to X, and so is stored in the normal 4-byte real format, and thus may lose some accuracy. In the comparison, this 4-byte real X is expanded into the internal format and is compared with the internally stored result of A*B. It is quite possible that these are not exactly the same value, and the equality will fail.

The most obvious lesson to be learnt is: do not compare floating point values for equality. Rather than comparing two values to see if they are the same, check to see if they are similar in value.

Narrow integers

C programmers in particular will be aware of the hidden dangers involved in integer arithmetic, especially where the type int is a 16-bit value.

The expression i=(j*7)/4, where i,j have type int (or short), and j=10000 will give i the value 1116, rather than 17500 as might be expected. A C compiler that has type int as a 32-bit value will give the correct answer.

Pascal and Fortran users do not have this problem, as all arithmetic is done as if it were at full integer width, with all Prospero Pascal and Fortran compilers having a 32-bit integer type as standard. Prospero compilers are getting clever. In Pascal,

 VAR a,b:INTEGER2;
 a := (a + b) * 2;

will in fact all be performed at INTEGER2 width: the crucial element being that the result can be guaranteed to be the same as if the arithmetic is done at INTEGER width. Even better, with

 VAR a,b:-;
 a : = (a + b) DIV 2;

the subrange type of a and b guarantees that a and b can be held at INTEGER2 width, and also that the whole expression will at no stage overflow the INTEGER2 width. On processors such as 80x86, INTEGER2 calculations are much faster than INTEGER ones. Subrange types are more than just very correct programming: they may allow the compiler to produce better code.

What are `IMOD(-5,3)`, `-5 % 3` and `-5 MOD 3`?

These three questions are identical, presented in Fortran, C and Pascal language forms. The answers are -2 or 1 for C, -2 for Fortran and 1 for Pascal. So which is right? The ANSI C standard accepts either -2 or 1. The Fortran standard defines IMOD(i,j) as i-INT(i/j)*j. The Pascal standard defines i MOD j as i-j*k, where k is any integer such that 0<=i-j*k<j, thus uniquely defining k and i MOD j. A typical use of MOD is to bring a value into range. For instance having calculated the number of days since some fixed point in time, reducing this number modulo 7 would give the day of the week. The number of days may be negative, but even so we would want a result in the range 0 to 6. This is only guaranteed to work in Pascal, with both C and Fortran requiring extra work to allow for negative results.

Product Support Top of page

Getting your sums right

Floating point accuracy

Narrow integers

What are IMOD(-5,3), -5 % 3 and -5 MOD 3?

What are `IMOD(-5,3)`, `-5 % 3` and `-5 MOD 3`?