skip to main content
article
Free access

What every computer scientist should know about floating-point arithmetic

Published: 01 March 1991 Publication History

Abstract

Floating-point arithmetic is considered as esoteric subject by many people. This is rather surprising, because floating-point is ubiquitous in computer systems: Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on the aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating point standard, and concludes with examples of how computer system builders can better support floating point.

References

[1]
AHO, A. V., SETHI, R., AND ULLMAN, J. D. 1986. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, Mass.
[2]
ANSI 1978. American National Standard Programming Language FORTRAN, ANSI Standard X3.9-1978. American National Standards Institute, New York.
[3]
BARNETT, D. 1987. A portable floating-point environment. Unpublished manuscript.
[4]
BROWN, W. S. 1981. A simple but realistic model of floating-point computation. ACM Trans. Math. Softw. 7, 4, 445-480.
[5]
CARDELLI, L., DONAHUE, J., GLASSMAN, L., JORDAN, M., KASLOW, B., AND NELSON, G. 1989. Modula-3 Report (revised). Digital Systems Research Center Report *~52, Palo Alto, Calif.
[6]
CODY, W. J. et al. 1984. A proposed radix- and word-length-independent standard for floatingpoint arithmetic. IEEE Micro 4, 4, 86-100.
[7]
CODY, W. J. 1988. Floating-point standards--Theory and practice. In Reliability in Computing: The Role of lnterval Methods on Scientific Computing, Ramon E. Moore, Ed. Academic Press, Boston, Mass., pp. 99-107.
[8]
COONEN, J. 1984. Contributions to a proposed standard for binary floating-point arithmetic. PhD dissertation, Univ. of California, Berkeley.
[9]
DEKKER, T. J. 1971. A floating-point technique for extending the available precision. Numer. Math. 18, 3, 224-242.
[10]
DEMMEL, J. 1984. Underflow and the reliability of numerical software. SIAM J. Sci. Stat. Cornput. 5, 4, 887-919.
[11]
FARNUM, C. 1988. Compiler support for floatingpoint computation. Softw. Pract. Experi. 18, 7, 701-709.
[12]
FORSYTHE, G. E., AND MOLER, C. B. 1967. Computer Solutmn of Linear Algebraic Systems. Prentice-Hall, Englewood Cliffs, N.J.
[13]
GOLDBERG, I. B. 1967. 27 Bits are not enough for 8-digit accuracy. Commum. ACM 10, 2, 105-106.
[14]
GOLDBERC, D. 1990. Computer arithmetic. In Computer Architecture: A Quantitative Approach, David Patterson and John L. Hennessy, Eds. Morgan Kaufmann, Los Altos, Calif., Appendix A.
[15]
GOLUB, G. H., AND VAN LOAN, C. F. 1989. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD.
[16]
HEWLETT PACKARD 1982. HP-15C Advanced Functions Handbook.
[17]
IEEE 1987. IEEE Standard 754-1985 for Binary Floating-Point Arithmetic, IEEE. Reprinted in SIGPLAN 22, 2, 9-25.
[18]
KASAN, W. 1972. A Survey of Error Analysis. In Information Processing 71, (Ljubljana, Yugoslavia), North Holland, Amsterdam, vol. 2, pp. 1214-1239.
[19]
KAHAN, W. 1986. Calculating Area and Angle of a Needle-like Triangle. Unpublished manuscript.
[20]
KAHAN, W. 1987. Branch cuts for complex elementary functions. In The State of the Art in Numerical Analyszs, M. J. D. Powell and A Iserles, Eds., Oxford University Press, N.Y., Chap. 7.
[21]
KAnAN, W. 1988. Unpublished lectures given at Sun Microsystems, Mountain View, Calif.
[22]
KAHAN, W., AND COONEN, J. T. 1982. The near orthogonality of syntax, semantics, and diagnostics in numerical programming environments. In The Relationship between Numerical Computation and Programmi,g Languages, J. K. Reid~ Ed. North-Holland~ Amsterdam~ pp 103 115.
[23]
KAI~AN, W., AND LEBLANC, E. 1985. Anomalies in the IBM acrith package. In Proceedings of the 7th IEEE Symposium on Computer Arithmetic (Urbana, Ill.), pp. 322-331.
[24]
KERXIGHAN, B. W., AND RITCHm, D. M. 1978. The C Programming Language. Prentice~Hall, Englewood Cliffs, N.J.
[25]
KIRCHNER, R., AND KULISCH, U 1987 Arithmetic for vector processors. In Proceedings of the 8th IEEE Symposium on Computer Arithmetic (Como, Italy), pp. 256-269.
[26]
KNUT~, D. E. 1981. The Art of Computer Programming Addison-Wesley, Reading, Mass., vol. II, 2nd ed.
[27]
KULISH, U. W., ArCD MmANKER W. L. 1986. The Arithmetic of the Digital Computer: A new approach. SIAM Rev 28, 1, 1-36.
[28]
MATULA, D. W., AND KORNERUP, P. 1985. Finite Precision Rational Arithmetic: Slash Number Systems. IEEE Trans. Comput. C-34, 1, 3-18.
[29]
REISER, J. F., A2CD KNUTH, D E. 1975. Evading the drift in floating-point addition. Inf. Process. Lett 3, 3, 84-87
[30]
STERBETZ, P. H. 1974. Floating-Point Computation. Prentice-Hall, Englewood Cliffs, N.J.
[31]
SWARTZLANDER, E. E, AND ALEXOPOULOS, G. 1975. The sign/logarithm number system. IEEE Trans. Comput. C-24, 12, 1238-1242
[32]
WALTHER, J. S. 1971. A unified algorithm for elementary functions. Proceedings of the AFIP Spr~ng Joint Computer Conference, pp. 379- 385.

Cited By

View all
  • (2024)Cholla-MHD: An Exascale-capable Magnetohydrodynamic Extension to the Cholla Astrophysical Simulation CodeThe Astrophysical Journal10.3847/1538-4357/ad464a970:1(44)Online publication date: 15-Jul-2024
  • (2024)Floating-Point Embedding: Enhancing the Mathematical Comprehension of Large Language ModelsSymmetry10.3390/sym1604047816:4(478)Online publication date: 15-Apr-2024
  • (2024)Developing GA-FuL: A Generic Wide-Purpose Library for Computing with Geometric AlgebraMathematics10.3390/math1214227212:14(2272)Online publication date: 20-Jul-2024
  • Show More Cited By

Recommendations

Reviews

Louis W. Ehrlich

The title of this paper is appropriate. In fact, the words “user of computers for computation” could replace the words “computer scientist.” In Section 1, “Rounding Errors,” the details of rounding are described. Some of the topics discussed are floating-point formats, relative error and ulps (units in the last place), guard digits, and cancellation. Goldberg discusses the relations between ulps, relative error, and machine epsilon, with examples illustrating “wobble” in ulps. (Recent correspondence over NA-NET was appropriately concerned with new measures of precision in floating-point arithmetic, which fits in here.) He points out the significance of guard digits. Catastrophic cancellation (between computed numbers) and benign cancellation (between exact numbers) are described with some examples showing how some of this may be avoided by rewriting formulas. Section 2, “IEEE Standard,” describes the two different standards, 754 (for binary) and 854 (for binary or decimal). Subtopics include formats and operations (including base and precision), special quantities (such as NaNs and infinity), and exceptions, flags, and trap handlers. Not only are the standards described, but the reasons they were chosen are discussed and examples are given to illustrate the need or desire for them. This section should be kept in mind when one is considering purchasing a new computer. Section 3, “Systems Aspects,” contains discussion on such topics as instruction sets, languages and compilers (ambiguity and optimizers), and exception handling. Section 4, “Details,” presents proofs of some of the statements made earlier (such as “a single guard digit is enough to guarantee that addition and subtraction will always be accurate”). Finally, in Section 5, “Summary,” the author again states that rigorous reasoning can be applied to floating-point algorithms, consistent with underlying hardware and efficient algorithms. The paper contains an appendix in which several theorems are proven, including a highly accurate summation formula due to W. Kahan. Everyone who uses a computer to compute should read or at least peruse this paper. Many of the examples given are things we do without thinking of possible results. We should at least be aware of what can happen.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Information & Contributors

Information

Published In

cover image ACM Computing Surveys
ACM Computing Surveys  Volume 23, Issue 1
March 1991
123 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/103162
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 1991
Published in CSUR Volume 23, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. NaN
  2. denormalized number
  3. exception
  4. floating-point
  5. floating-point standard
  6. gradual underflow
  7. guard digit
  8. overflow
  9. relative error
  10. rounding error
  11. rounding mode
  12. ulp
  13. underflow

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5,947
  • Downloads (Last 6 weeks)829
Reflects downloads up to 22 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Cholla-MHD: An Exascale-capable Magnetohydrodynamic Extension to the Cholla Astrophysical Simulation CodeThe Astrophysical Journal10.3847/1538-4357/ad464a970:1(44)Online publication date: 15-Jul-2024
  • (2024)Floating-Point Embedding: Enhancing the Mathematical Comprehension of Large Language ModelsSymmetry10.3390/sym1604047816:4(478)Online publication date: 15-Apr-2024
  • (2024)Developing GA-FuL: A Generic Wide-Purpose Library for Computing with Geometric AlgebraMathematics10.3390/math1214227212:14(2272)Online publication date: 20-Jul-2024
  • (2024)Enabling Bitwise Reproducibility for the Unstructured Computational MotifApplied Sciences10.3390/app1402063914:2(639)Online publication date: 11-Jan-2024
  • (2024)Rationalized diffraction calculations for high accuracy and high speed with few bitsJournal of the Optical Society of America A10.1364/JOSAA.51088441:2(303)Online publication date: 23-Jan-2024
  • (2024)Drivers of infection with Toxoplasma gondii genotype type II in Eurasian red squirrels (Sciurus vulgaris)Parasites & Vectors10.1186/s13071-023-06068-617:1Online publication date: 23-Jan-2024
  • (2024)Numerical Fuzz: A Type System for Rounding Error AnalysisProceedings of the ACM on Programming Languages10.1145/36564568:PLDI(1954-1978)Online publication date: 20-Jun-2024
  • (2024)Arfa: An Agile Regime-Based Floating-Point Optimization Approach for Rounding ErrorsProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3680378(1516-1528)Online publication date: 11-Sep-2024
  • (2024)Rounding Error Using Low Precision Approximate Random VariablesSIAM Journal on Scientific Computing10.1137/23M155281446:4(B502-B526)Online publication date: 16-Jul-2024
  • (2024)Iterative Multimetric Model Extraction for Digital Predistortion of RF Power Amplifiers Using Enhanced Quadratic SPSAIEEE Transactions on Microwave Theory and Techniques10.1109/TMTT.2023.330518172:3(1503-1514)Online publication date: Mar-2024
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media