It's a protocol negotiation handshake, or bootstrap. There are dozens of different standards for modem-based communication, covering many different encoding and modulation schemes and multiple data rates. The "dialup sound" is a conversation between the two modems. The first part sets up a low-bit-rate connection over which the modems exchange an initial summary of their capabilities. Even this is a bootstrap procedure that steps up through different encodings/modulations, though it stays at 300 bps initially. Next comes a more detailed exchange of capabilities, along with a broad-spectrum set of signals (the bing/bong tone you often hear in the middle) used to measure the frequency response and signal-to-noise ratio of the phone line. This is followed by a short negotiation of the protocols and data rates that both modems support and that look like they will work over the line, then an initial test connection, then some fine-tuning (using synthetic data) as they settle on the final connection protocol, power level, and data rate in each direction. This often involves stepping the data rate and power level up or down on either end to find the maximum throughput the given setup can support. Once the connection is established and the data rate and other details are settled, the modem speaker turns off.
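To make the "pick a rate that both ends support and that the line can carry" step concrete, here is a toy sketch in Python. It is not how any real modem standard does it: the function names, the rate lists, and the 3100 Hz bandwidth figure are all assumptions for illustration, and it uses the Shannon capacity formula as a crude upper bound where a real modem would rely on its own line-probing results and would land well below that bound.

```python
import math

def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Theoretical upper bound on the bit rate for a noisy channel."""
    snr_linear = 10 ** (snr_db / 10)  # convert dB to a plain power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)

def negotiate_rate(caller_rates, answerer_rates, bandwidth_hz, snr_db):
    """Return the fastest rate both modems list that fits under the bound.

    Hypothetical helper: returns None if no shared rate fits.
    """
    limit = shannon_capacity_bps(bandwidth_hz, snr_db)
    # Only rates advertised by both ends are candidates; try fastest first.
    common = sorted(set(caller_rates) & set(answerer_rates), reverse=True)
    for rate in common:
        if rate <= limit:
            return rate
    return None

# Illustrative capability lists (bits per second), not taken from any spec.
caller = [300, 1200, 2400, 9600, 14400, 28800, 33600]
answerer = [300, 1200, 2400, 9600, 14400, 28800]

clean_line = negotiate_rate(caller, answerer, bandwidth_hz=3100, snr_db=30)
noisy_line = negotiate_rate(caller, answerer, bandwidth_hz=3100, snr_db=10)
```

With these made-up inputs, the clean line settles on a faster shared rate than the noisy one, which is the same "step down until it works" behaviour described above, just collapsed into a single lookup.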
Playing the noise through the speaker was just for debugging purposes. If you picked up a handset on the same line you could hear it as well, but that would also disrupt the connection. Most modems let you disable the onboard speaker during dialing and handshaking if you didn't want to hear it.
Some additional resources: