You will have heard, that audio signals are represented by some kind of wave. If you have ever seen this wave diagrams with a line going up and down — that’s basically what’s inside those files. Take a look at this file picture from http://en.wikipedia.org/wiki/Sampling_rate
You see your audio wave (the gray line). The current value of that wave is repeatedly measured and given as a number. That’s the numbers in those bytes. There are two different things that can be adjusted with this: The number of measurements you take per second (that’s the sampling rate, given in Hz — that’s how many per second you grab). The other adjustment is how exact you measure. In the 2-byte case, you take two bytes for one measurement (that’s values from -32768 to 32767 normally). So with those numbers given there, you can recreate the original wave (up to a limited quality, of course, but that’s always so when storing stuff digitally). And recreating the original wave is what your speaker is trying to do on playback.
There are some more things you need to know. First, since it’s two bytes, you need to know the byte order (big endian, little endian) to recreate the numbers correctly. Second, you need to know how many channels you have, and how they are stored. Typically you would have mono (one channel) or stereo (two), but more is possible. If you have more than one channel, you need to know, how they are stored. Often you would have them interleaved, that means you get one value for each channel for every point in time, and after that all values for the next point in time.
To illustrate: If you have data of 8 bytes for two channels and 16-bit number:
b would make up the first 16bit number that’s the first value for channel 1,
d would be the first number for channel 2.
f are the second value of channel 1,
h the second value for channel 2. You wouldn’t hear much there because that would not come close to a second of data…
If you take together all that information you have, you can calculate the bit rate you have, that’s how many bits of information is generated by the recorder per second. In our example, you generate 2 bytes per channel on every sample. With two channels, that would be 4 bytes. You need about 44000 samples per second to represent the sounds a human beeing can normally hear. So you’ll end up with 176000 bytes per second, which is 1408000 bits per second.
And of course, it is not 2-bit values, but two 2 byte values there, or you would have a really bad quality.