What every computer scientist should know about floating-point arithmetic

Author:

David Goldberg

ACM Computing Surveys (CSUR), Volume 23, Issue 1

Pages 5 - 48

https://doi.org/10.1145/103162.103163

Published: 01 March 1991

Abstract

Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising, because floating-point is ubiquitous in computer systems: Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on the aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with examples of how computer system builders can better support floating-point.

Reviews

Reviewer: Louis W. Ehrlich

The title of this paper is appropriate. In fact, the words “user of computers for computation” could replace the words “computer scientist.” In Section 1, “Rounding Errors,” the details of rounding are described. Some of the topics discussed are floating-point formats, relative error and ulps (units in the last place), guard digits, and cancellation. Goldberg discusses the relations between ulps, relative error, and machine epsilon, with examples illustrating “wobble” in ulps. (Recent correspondence over NA-NET was appropriately concerned with new measures of precision in floating-point arithmetic, which fits in here.) He points out the significance of guard digits. Catastrophic cancellation (between computed numbers) and benign cancellation (between exact numbers) are described with some examples showing how some of this may be avoided by rewriting formulas. Section 2, “IEEE Standard,” describes the two different standards, 754 (for binary) and 854 (for binary or decimal). Subtopics include formats and operations (including base and precision), special quantities (such as NaNs and infinity), and exceptions, flags, and trap handlers. Not only are the standards described, but the reasons they were chosen are discussed and examples are given to illustrate the need or desire for them. This section should be kept in mind when one is considering purchasing a new computer. Section 3, “Systems Aspects,” contains discussion on such topics as instruction sets, languages and compilers (ambiguity and optimizers), and exception handling. Section 4, “Details,” presents proofs of some of the statements made earlier (such as “a single guard digit is enough to guarantee that addition and subtraction will always be accurate”). Finally, in Section 5, “Summary,” the author again states that rigorous reasoning can be applied to floating-point algorithms, consistent with underlying hardware and efficient algorithms. 
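The difference between catastrophic cancellation and a rewritten formula can be seen in a few lines of Python. This is a minimal sketch, not code from the paper or the review: the quadratic-root example, the function names `roots_naive` and `roots_rewritten`, and the test coefficients are all assumptions chosen for illustration, and the code assumes real roots with `a != 0` and `b != 0`.

```python
import math


def roots_naive(a, b, c):
    """Textbook quadratic formula.

    When b*b is much larger than 4*a*c, sqrt(b*b - 4*a*c) is close to |b|,
    so one of the two numerators subtracts nearly equal computed values.
    """
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)


def roots_rewritten(a, b, c):
    """Algebraically equivalent form that avoids the risky subtraction.

    b and copysign(d, b) share a sign, so their sum never cancels. The
    second root follows from the product of the roots, which equals c / a.
    Assumes real roots, a != 0 and b != 0 (illustrative restrictions).
    """
    d = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(d, b))
    return c / q, q / a


if __name__ == "__main__":
    # Hypothetical coefficients: the exact roots are close to -1e-8 and -1e8.
    a, b, c = 1.0, 1e8, 1.0
    print("naive:    ", roots_naive(a, b, c))
    print("rewritten:", roots_rewritten(a, b, c))
```

With these coefficients the small-magnitude root from the naive formula is off by a large relative error, while the rewritten form stays close to -1e-8. Both functions perform the same number of square roots; only the order of operations changes.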
The paper contains an appendix in which several theorems are proven, including a highly accurate summation formula due to W. Kahan. Everyone who uses a computer to compute should read or at least peruse this paper. Many of the examples given are things we do without thinking of possible results. We should at least be aware of what can happen.
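The summation formula attributed to W. Kahan is usually presented as compensated summation. The sketch below is one common formulation, written here from general knowledge rather than copied from the paper's appendix; the names `kahan_sum` and `naive_sum` and the test data are assumptions for illustration. Python's `math.fsum` is used only as a reference value.

```python
import math


def naive_sum(values):
    """Plain left-to-right summation; rounding errors accumulate."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Compensated summation.

    `compensation` carries the low-order bits lost when a small term is
    added to a large running total, and feeds them back into the next step.
    """
    total = 0.0
    compensation = 0.0
    for x in values:
        y = x - compensation            # apply the correction from the last step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost (negated)
        total = t
    return total


if __name__ == "__main__":
    data = [0.1] * 1_000_000            # hypothetical test data
    reference = math.fsum(data)         # correctly rounded sum of the stored values
    print("naive error:", abs(naive_sum(data) - reference))
    print("kahan error:", abs(kahan_sum(data) - reference))
```

The compensated loop costs a few extra additions per element. Note that an optimizer that simplifies `(t - total) - y` algebraically would defeat the technique; that is not a concern in CPython but is one reason compiler treatment of floating-point matters.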

Information

Published In

ACM Computing Surveys Volume 23, Issue 1

March 1991

123 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/103162

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 1991

Published in CSUR Volume 23, Issue 1

Bibliometrics

Article Metrics

Total Citations: 1,123
Total Downloads: 29,113
Downloads (Last 12 months): 5,947
Downloads (Last 6 weeks): 829

Reflects downloads up to 22 Sep 2024

Citations

Cited By

Caddy R, Schneider E (2024). Cholla-MHD: An Exascale-capable Magnetohydrodynamic Extension to the Cholla Astrophysical Simulation Code. The Astrophysical Journal 970:1 (44). Online publication date: 15-Jul-2024.
https://doi.org/10.3847/1538-4357/ad464a

Jin X, Mao C, Yue D, Leng T (2024). Floating-Point Embedding: Enhancing the Mathematical Comprehension of Large Language Models. Symmetry 16:4 (478). Online publication date: 15-Apr-2024.
https://doi.org/10.3390/sym16040478

Eid A, Montoya F (2024). Developing GA-FuL: A Generic Wide-Purpose Library for Computing with Geometric Algebra. Mathematics 12:14 (2272). Online publication date: 20-Jul-2024.
https://doi.org/10.3390/math12142272

Siklósi B, Mudalige G, Reguly I (2024). Enabling Bitwise Reproducibility for the Unstructured Computational Motif. Applied Sciences 14:2 (639). Online publication date: 11-Jan-2024.
https://doi.org/10.3390/app14020639

Fujimori S, Ito T, Shimobaba T (2024). Rationalized diffraction calculations for high accuracy and high speed with few bits. Journal of the Optical Society of America A 41:2 (303). Online publication date: 23-Jan-2024.
https://doi.org/10.1364/JOSAA.510884

Wijburg S, Montizaan M, Kik M, Joeres M, Cardron G, Luttermann C, Maas M, Maksimov P, Opsteegh M, Schares G (2024). Drivers of infection with Toxoplasma gondii genotype type II in Eurasian red squirrels (Sciurus vulgaris). Parasites & Vectors 17:1. Online publication date: 23-Jan-2024.
https://doi.org/10.1186/s13071-023-06068-6

Kellison A, Hsu J (2024). Numerical Fuzz: A Type System for Rounding Error Analysis. Proceedings of the ACM on Programming Languages 8:PLDI (1954-1978). Online publication date: 20-Jun-2024.
https://dl.acm.org/doi/10.1145/3656456

Xu J, Cui M, Li F, Zhang Z, Yang H, Zhou B, Zhao J, Christakis M, Pradel M (2024). Arfa: An Agile Regime-Based Floating-Point Optimization Approach for Rounding Errors. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (1516-1528). Online publication date: 11-Sep-2024.
https://dl.acm.org/doi/10.1145/3650212.3680378

Giles M, Sheridan-Methven O (2024). Rounding Error Using Low Precision Approximate Random Variables. SIAM Journal on Scientific Computing 46:4 (B502-B526). Online publication date: 16-Jul-2024.
https://doi.org/10.1137/23M1552814

Yin H, Zhu A (2024). Iterative Multimetric Model Extraction for Digital Predistortion of RF Power Amplifiers Using Enhanced Quadratic SPSA. IEEE Transactions on Microwave Theory and Techniques 72:3 (1503-1514). Online publication date: Mar-2024.
https://doi.org/10.1109/TMTT.2023.3305181
