When it comes to the size of numeric values used in computations (as opposed to sizes of data that you might store or transmit), there are really only two binary units that you absolutely must know: the bit and the byte. A bit, short for "binary digit," is a value that holds one of two states, which we typically refer to as 0 and 1. It's important to recognize that 0 and 1 are just conventions that we use for referring to these states; if you open up a memory chip in your computer, you won't see a bunch of 0s and 1s printed everywhere. Instead, RAM (or at least the kind of RAM called dynamic RAM, or DRAM) represents these 0s and 1s as the presence or absence of electric charge, which, by the way, is a continuous quantity even though 0 and 1 are discrete (0 means that the charge is below some threshold, and 1 means that the charge is above some threshold). How bits are physically stored depends on the particular hardware / physical medium (e.g. whereas DRAM stores bits as electric charge, spinning hard drives store bits as magnetic polarity).
Bytes are groups of 8 bits. We care about bytes because, quite arbitrarily, the byte is the minimum unit that can be referenced in memory. That is, when retrieving data from memory, memory address x+1 refers to the byte that follows the byte at memory address x. If the pioneers of computing had selected a different number of bits to shuffle to/from memory at a time, or to use as the size of the smallest integer on which basic numerical operations such as addition and subtraction are performed, then this could have been a different value (although 8 is particularly nice, being a power of two). Fortunately, this value has been standardized, and any sane computing system that hasn't been designed specifically for the purpose of driving software engineers insane uses 8 bits as this size (though it should be noted that the C++ programming language has a CHAR_BIT constant for the number of bits in a byte; if you ever find yourself having to use it rather than simply typing 8, other than out of an abundance of caution, you should consider switching jobs ;)).
In addition to bytes, there are both smaller units (such as "nybbles" or "nibbles") and larger units that consist of multiples of bytes. However, bits and bytes are the most important to know. This table lists some more:
| Name | Number of Bits | Numeric Range (unsigned) | Comment / Significance |
|---|---|---|---|
| bit | 1 | [0, 1] | Basic unit that can be manipulated |
| nybble | 4 | [0, 15] | Corresponds to a single hexadecimal digit |
| byte | 8 | [0, 255] | Smallest memory-addressable unit |
| word* | 16 | [0, 65535] | Can represent any letter in the "basic multilingual plane" (BMP) of Unicode or any letter in UCS-2 (but not every Unicode code point, such as those outside the BMP) |
| dword* | 32 | [0, 4294967295] | Basic integer size on 32-bit machines. Can represent any Unicode code point. Can represent any IPv4 address (but not IPv6 addresses!) |
| qword* | 64 | [0, 2⁶⁴ − 1] | Basic integer size on 64-bit machines |
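These ranges all follow from the fact that N bits can encode 2ᴺ distinct values, so the largest unsigned value is 2ᴺ − 1. A quick Python sketch (the unit names here are just labels for the illustration):

```python
# The unsigned range of an N-bit unit is [0, 2**N - 1].
units = {"bit": 1, "nybble": 4, "byte": 8, "word": 16, "dword": 32, "qword": 64}

for name, bits in units.items():
    print(f"{name:6} [0, {2**bits - 1}]")
```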
We've discussed what binary is but have not yet discussed how to represent numeric values using binary. Before doing that, however, it is useful to review how we represent numeric values in the decimal number system, since both decimal and binary are special cases of "base X", and understanding how base 10 works reveals general principles of base X.
Consider how we write the decimal number one hundred twenty-three. In base 10 notation, we write this out as 123. However, this is in fact shorthand for the polynomial expression 1 · 10² + 2 · 10¹ + 3 · 10⁰. This holds for other numbers as well; for example, 10 is 1 · 10¹ + 0 · 10⁰; 82356 is 8 · 10⁴ + 2 · 10³ + 3 · 10² + 5 · 10¹ + 6 · 10⁰.
In base 10, each digit represents a coefficient of a power of 10 (1, 10, 100, 1000, etc.). Because a coefficient of 10 would signify another power of 10, the digits only go up to 9, stopping before 10. The coefficients are written in descending order such that the most significant digit is written first (when reading left-to-right). A decimal point, if it exists, corresponds to a transition to negative powers of ten (10⁻¹ = 1/10 = 0.1, 10⁻² = 1/100 = 0.01, 10⁻³ = 1/1000 = 0.001, etc.).
The same rules that apply to base 10 generalize to any base. In base X, the digits range in value from 0 to X-1 (for base X ≥ 10, we generally start using letters to represent digits whose value is 10 or higher, and when the alphabet is also exhausted, then there is no standard convention and we must start looking for symbols from elsewhere to supplement the need for digits). In base X, the notation for a number is likewise a shorthand for a polynomial expression in which the digits are coefficients of powers of X. And, as with base 10, the left-most digit is also the most significant digit.
To recap, in base X, the number "abc" means aX² + bX¹ + cX⁰. To prevent confusion when multiple number systems are used, the base is typically denoted using a subscript. For example, 123₁₀ indicates that the digits 123 should be interpreted in base 10, whereas 123₈ denotes that the digits should be interpreted in base 8 (a.k.a. octal). Note that, while this is the proper mathematical notation, most computer languages use a simple prefix such as "0" to denote octal, "0x" to denote hexadecimal, etc., which is more convenient than typing subscripts.
Because 10 in base X means 1 · X¹ + 0 · X⁰, the number 10 in base X is always equal to X. That is, in base ten, the symbol 10 denotes ten. In base 2, however, the symbol 10 denotes two. And now you can get the joke "there are 10 types of people in the world: those who know binary and those who don't." Congratulations! ;)
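This polynomial reading of digits can be checked mechanically. Here is a minimal Python sketch (the helper name from_digits is made up for this illustration; Python's built-in int does the same parse from a string):

```python
def from_digits(digits, base):
    """Evaluate the base-X polynomial: each digit is a coefficient of a power of X."""
    value = 0
    for d in digits:  # most significant digit first
        value = value * base + d
    return value

print(from_digits([1, 2, 3], 10))  # 123: one hundred twenty-three in base 10
print(from_digits([1, 2, 3], 8))   # 83: the same digits read in octal
print(from_digits([1, 0], 2))      # 2: "10" in base 2 is the base itself
print(int("123", 8))               # 83: the built-in equivalent
```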
Everything in computers is ultimately represented in binary, including non-numeric content such as text, images, executable files, etc. Understanding how numbers are represented, however, must come first.
Before discussing how numbers are represented, however, we should clarify that we are discussing specifically how the CPU represents numbers. Because the CPU is a physical piece of hardware, its capacity and size are finite. As a result, the numbers that it operates on are of a fixed size, typically powers of two that are at least a byte in size (8-bit, 16-bit, 32-bit, 64-bit, etc.). It is possible to operate on and represent numbers of a larger size or with arbitrary size/precision (up to whatever the system memory tolerates), but sizes other than the fixed sizes provided by the CPU cannot be operated on directly by the hardware and require slower software emulation of the basic numeric operations.
Whole numbers are the subset of the integers that are non-negative (0, 1, 2, 3, ...). For a fixed-size N-bit number, whole numbers can be represented simply using the rules of base 2 notation; each bit in the number represents a coefficient of a power of 2. For example, the 4-bit number 0100 is 0 · 2³ + 1 · 2² + 0 · 2¹ + 0 · 2⁰ = 4₁₀; by a similar token, the 4-bit number 0011 is 0 · 2³ + 0 · 2² + 1 · 2¹ + 1 · 2⁰ = 3₁₀. In other words, the rules operate exactly the same as in normal base X notation, except that -- because the integers are of a fixed size -- unused leading digits are set to 0.
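The same idea in a short Python sketch (the helper name to_bits is illustrative): the bits of a fixed-width whole number are just its base-2 coefficients, with unused leading positions set to 0.

```python
def to_bits(n, width):
    """Base-2 digits of a whole number, most significant bit first, zero-padded."""
    return [(n >> i) & 1 for i in reversed(range(width))]

print(to_bits(4, 4))  # [0, 1, 0, 0]
print(to_bits(3, 4))  # [0, 0, 1, 1]

# Evaluating the polynomial recovers the original value:
value = sum(bit * 2**power for power, bit in enumerate(reversed(to_bits(4, 4))))
print(value)  # 4
```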
Addition works the same as in decimal; digits are aligned and then added, from least significant digit to most significant digit. As with decimal, it is possible to get a carry result. For example:
|   | Binary | Decimal |
|---|---|---|
|   | 0000 0000 0000 0010 | = 2 |
| + | 0000 0000 0000 0011 | = 3 |
| = | 0000 0000 0000 0101 | = 5 |
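The same sum can be checked in Python, where format(n, "016b") prints a number's 16-bit base-2 digits:

```python
a = 0b0000000000000010  # 2
b = 0b0000000000000011  # 3
total = a + b

print(format(total, "016b"))  # 0000000000000101
print(total)                  # 5
```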
There is one area of this that is somewhat complicated and confusing which applies only to numeric values that are larger than a byte, and this is the concept of byte order or "endian-ness". In "big-endian" systems, the most significant byte goes first; whereas, in "little-endian" systems, the least significant byte goes first. When we write examples, we will assume a big-endian system simply because this is consistent with regular mathematical notation, but you should be aware that the computer may be storing the bytes that comprise the number in reverse order in memory.
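Byte order is easy to observe in Python with int.to_bytes, which makes you choose an order explicitly:

```python
value = 0x0102  # a 16-bit value occupying two bytes: 0x01 and 0x02

big = value.to_bytes(2, byteorder="big")        # most significant byte first
little = value.to_bytes(2, byteorder="little")  # least significant byte first

print(big.hex())     # 0102
print(little.hex())  # 0201

# Decoding with the matching byte order recovers the same number either way.
print(int.from_bytes(little, byteorder="little"))  # 258
```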
| Decimal | Binary (16-bit number) | Comment / Hint / Explanation |
|---|---|---|
| 1 | 0000 0000 0000 0001 | |
| 2 | 0000 0000 0000 0010 | |
| 3 | 0000 0000 0000 0011 | 3 = 2 + 1 |
| 10 | 0000 0000 0000 1010 | 10 = 8 + 2 |
| 100 | 0000 0000 0110 0100 | 100 = 64 + 32 + 4 |
As shown above, converting from decimal to binary involves decomposing the number into a sum of powers of two.
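That decomposition can be sketched in Python (powers_of_two is a made-up helper name for this illustration): repeatedly test the lowest bit and shift right.

```python
def powers_of_two(n):
    """Decompose a whole number into the distinct powers of two that sum to it."""
    powers = []
    bit = 0
    while n:
        if n & 1:  # lowest bit set means 2**bit is part of the sum
            powers.append(2**bit)
        n >>= 1
        bit += 1
    return powers[::-1]  # largest first, matching how we'd write it out

print(powers_of_two(10))   # [8, 2]
print(powers_of_two(100))  # [64, 32, 4]
```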
So far, we've made a glaring omission, and that is how to handle negative numbers. In regular, everyday mathematical notation, we use a negative sign (−) to indicate negative values. While we could mimic this in computers by making one of the bits indicate the sign and nothing else, there are a couple of disadvantages to that approach. For one thing, it results in two separate representations of zero, namely positive zero (+0) and negative zero (−0). Since comparison with zero is a pretty common operation, having two ways to represent that value is not great. In addition, think about how you perform addition involving negative values, or how you perform subtraction as compared to addition. Pretty complicated, right? Depending on the signs and magnitudes of the operands, there are several cases to consider, each reducing to an equivalent operation on non-negative values using either addition or subtraction, with addition and subtraction being implemented completely differently.
Instead of using a sign bit that indicates only the sign with no numeric weight, most computers use a representation known as "two's complement". Two's complement brings with it several benefits: a single representation of zero, simpler addition/subtraction operations, and an increased range of representable integers (the bit pattern that would otherwise be a duplicate zero represents a different value instead). In two's complement, the most significant bit still represents a power of 2, but the most significant term is multiplied by −1. For example, in a 4-bit integer, the most significant bit corresponds to −2³ instead of 2³. Because a power of 2 is one more than the sum of all preceding powers (i.e., 2ⁿ − 1 = 2⁰ + ... + 2ⁿ⁻¹), the pattern with all bits set represents the value −1. The most negative value in two's complement is represented by just the most significant bit being set (since it is the only negative term, with no positive terms being added to it).
Here are some interesting properties of two's complement: addition is identical to that for whole numbers and works regardless of whether any of the operands are negative; a number can be negated by flipping all the bits (changing each 1 to a 0 and vice-versa) and then adding 1; the most negative number is bigger by one in magnitude than the most positive.
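These properties can be demonstrated with 4-bit values in Python (negate and to_signed are illustrative helper names; masking with & MASK models the fixed 4-bit width):

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1  # 0b1111: keeps results within 4 bits

def negate(x):
    """Two's-complement negation: flip all the bits, then add 1."""
    return ((x ^ MASK) + 1) & MASK

def to_signed(x):
    """Interpret a 4-bit pattern as two's complement: the top bit weighs -2**3."""
    return x - (1 << WIDTH) if x & (1 << (WIDTH - 1)) else x

print(to_signed(negate(0b0011)))  # -3: flipping 0011 gives 1100, plus 1 is 1101
print(to_signed(0b1111))          # -1: all bits set
print(to_signed(0b1000))          # -8: most negative, bigger in magnitude than +7
```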
Here are some examples of adding positive and negative values, using 4-bit integers (a carry out of the most significant bit is simply discarded):

|   | Binary | Decimal |
|---|---|---|
|   | 0101 | = 5 |
| + | 1110 | = −2 |
| = | 0011 | = 3 |

And another example:

|   | Binary | Decimal |
|---|---|---|
|   | 0011 | = 3 |
| + | 1011 | = −5 |
| = | 1110 | = −2 |

And one involving both operands being negative:

|   | Binary | Decimal |
|---|---|---|
|   | 1110 | = −2 |
| + | 1101 | = −3 |
| = | 1011 | = −5 |
The examples above, which use a very small number of bits, may have clued you in to the possibility of overflow, which occurs when the true mathematical result does not fit in the available number of bits. This can produce surprising outcomes, such as obtaining a positive result from adding two negative numbers, or vice versa. It should be noted that this is not specific to signed integers; it applies to whole numbers as well, and indeed to any situation in which mathematical operations are performed on limited-precision numbers, so it is something you should always be aware of.
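A small Python sketch of 4-bit overflow (add4 and to_signed are illustrative helpers; masking with & MASK discards any carry past the top bit):

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1

def add4(a, b):
    """4-bit addition: any carry out of the most significant bit is discarded."""
    return (a + b) & MASK

def to_signed(x):
    """Interpret a 4-bit pattern as two's complement."""
    return x - (1 << WIDTH) if x & (1 << (WIDTH - 1)) else x

# 5 + 4 = 9, but 9 does not fit in the signed 4-bit range [-8, 7]:
print(to_signed(add4(5, 4)))            # -7: two positives wrap to a negative
# -8 + -1 = -9 similarly wraps around the other way:
print(to_signed(add4(0b1000, 0b1111)))  # 7: two negatives wrap to a positive
```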
Real numbers include rational numbers, which are fractions of integers, as well as irrational numbers such as π, whose expansions go on forever without repeating. Because storage is finite, computers can only represent rational numbers, so irrational values must be rounded. We will use "real" and "rational" interchangeably from now on, because the real values that computers can represent are rational.
There are a number of different ways that computers can represent real numbers. The two most common, however, are floating point and fixed point. In fixed point, the number of bits before and after the decimal point is prespecified ("fixed"), which makes for an easy-to-understand format but is problematic when dealing with very large or very small values. In floating point, the bits do not represent specific powers of two (the radix point "floats"); instead, some bits indicate the exponent to be used for the number as a whole. In other words, floating point uses scientific notation (like 1.05 · 10⁻¹⁰⁰, except using binary to represent the leading digit, fraction digits, and exponent), whereas fixed point has the digits correspond to specific positive and negative powers of two. Floating point is by far the most common in scientific and other computing. However, fixed point using decimal instead of binary is very common for calculations involving currency (but this is typically done using software emulation, not as a CPU-supported primitive).
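Python's standard library happens to illustrate both approaches: the built-in float is binary floating point, while the decimal module provides software-emulated decimal arithmetic of the kind used for currency:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly, which matters for money:
print(0.1 + 0.2)                          # 0.30000000000000004
print(Decimal("0.10") + Decimal("0.20"))  # 0.30, exactly
```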
When it comes to floating point numbers, there are several different ways that such numbers could be represented (e.g. choosing different sets or numbers of bits to represent the fraction digits, the exponent digits, etc.). However, the vast majority of computers use a version of floating point that is standardized in IEEE 754. This representation, unlike two's complement, uses a distinct sign bit and thus has both positive and negative forms of 0. Additionally, IEEE 754 includes a number of special values such as NaN ("not a number") as well as encodings of ±infinity.
The excellent article What Every Computer Scientist Should Know About Floating-Point Arithmetic provides a far better summary and explanation of floating point than I could ever hope to give, so I strongly recommend reading it (and won't attempt to replicate or replace it). Instead, I will only add (or rather, reemphasize) that, due to rounding errors, one should never do direct equality comparisons involving floating point numbers (e.g. one should never test that a floating-point value exactly equals zero; rather, one should test that its magnitude is smaller than some threshold amount).
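To reemphasize that last point in Python terms:

```python
import math

x = 0.1 + 0.2

print(x == 0.3)              # False: rounding error breaks exact equality
print(abs(x - 0.3) < 1e-9)   # True: compare magnitudes against a threshold
print(math.isclose(x, 0.3))  # True: the standard-library helper for this
```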