How to get an accurate duration of any audio file quickly?

There are many audio files in the wild encoded with VBR that don't have an accurate duration tag, and command-line FFmpeg outputs the following:
Estimating duration from bitrate, this may be inaccurate
(and indeed, it is)
Using libav, we can see that this is tested by the has_duration() function in libavformat/utils.c.
I have an audio player that needs accurate duration information, using libav. I've tried completely decoding the file and counting the total number of samples decoded, which is the usual recommendation (and is accurate), but this can take 10+ seconds on a 60-minute audio file, which is not acceptable.
I notice that there are some closed source audio decoders and music playing apps which can get an accurate duration for these files immediately. There must be some cheap way of determining the accurate duration? Perhaps a snippet or high-level description would help me out.

It is possible to "demux" the track very quickly and sum the duration of each packet to determine the total duration of the track; decoding is not necessary.
Simply add up the packet durations in a loop until EOF is reached. Note that packet->duration is expressed in the stream's timebase, so convert it with av_q2d() (here stream is the audio AVStream):
while (av_read_frame(pFormatCtx, packet) >= 0) {
    durationSeconds += packet->duration * av_q2d(stream->time_base);
    av_packet_unref(packet);
}
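For completeness, a fuller self-contained sketch of the demux-only approach (untested, and assuming a reasonably recent libavformat with av_packet_unref() and av_find_best_stream()):

#include <libavformat/avformat.h>

/* Returns the summed packet duration of the first audio stream in
   seconds, or a negative value on error. */
double accurate_duration(const char *path)
{
    AVFormatContext *fmt = NULL;
    AVPacket pkt;
    double seconds = 0.0;
    int audio;

    if (avformat_open_input(&fmt, path, NULL, NULL) < 0)
        return -1.0;
    if (avformat_find_stream_info(fmt, NULL) < 0 ||
        (audio = av_find_best_stream(fmt, AVMEDIA_TYPE_AUDIO,
                                     -1, -1, NULL, 0)) < 0) {
        avformat_close_input(&fmt);
        return -1.0;
    }

    /* Demux only: read every packet, convert its duration from the
       stream timebase to seconds, and accumulate. */
    while (av_read_frame(fmt, &pkt) >= 0) {
        if (pkt.stream_index == audio)
            seconds += pkt.duration * av_q2d(fmt->streams[audio]->time_base);
        av_packet_unref(&pkt);
    }

    avformat_close_input(&fmt);
    return seconds;
}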

Related

How do I write audio data at a certain sample rate?

I am making a synthesizer by piping data into aplay (I know it's not ideal) and the sound is lagging behind the keypresses which alter it. I believe this is because aplay consumes samples at a constant 8000 Hz, but the C program produces them at an unstable rate. How do I get the for loop to run at 8000 Hz in C?
To generate audio samples at 8000 Hz (or any fixed rate), you don't want your loop to "run at" that rate. That would involve a huge amount of overhead (99.99% or more) spinning while doing nothing until it's time to generate the next sample, and (especially if you sleep rather than spin) would be unreliable, in that your process might not wake up / get scheduled in time for some of the samples.
Instead, you just want to produce samples at an overall rate matching what the consumer (aplay/the audio device) expects. You can compute the sample number you should have generated up to by now as something like:
(current_time + buffer_depth - start_time) * sample_rate
then, after generating up to that sample, sleep for some period proportional to the buffer depth, but sufficiently less that you won't be in trouble if your process doesn't get scheduled again right away. The buffer depth you can use depends on what kind of latency you need. If you're making sounds for live/realtime events, you probably want a buffer depth of 1/50 sec (20 ms) or less. If not, you can happily use huge buffers like 5-10 seconds.
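As an illustration, a minimal sketch of that pacing scheme; generate_sample() and emit_sample() are placeholders for your synth code and your write() to aplay, and POSIX clock_gettime()/usleep() are assumed:

#include <time.h>
#include <unistd.h>

#define RATE 8000    /* samples per second */
#define LEAD 0.020   /* buffer depth: 20 ms */

short generate_sample(void);   /* your synth code (placeholder) */
void  emit_sample(short s);    /* your write() to aplay (placeholder) */

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

void run_synth(void)
{
    const double start = now_seconds();
    long produced = 0;
    for (;;) {
        /* How many samples should exist by (now + buffer depth)? */
        long target = (long)((now_seconds() + LEAD - start) * RATE);
        while (produced < target) {
            emit_sample(generate_sample());
            produced++;
        }
        usleep(5000);   /* sleep well under the 20 ms buffer depth */
    }
}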
If you are piping data to aplay, you will not experience any problems with the sample rate (8 kHz, for example), because the kernel will block your program in write() once the pipe buffer is full. This effectively limits your audio generation to 8 kHz with no work on your part.
However, this is far from ideal. Your application is only throttled once the kernel buffer for the pipe is full, and the default size for pipe buffers on Linux is 64 kB. For stereo 16-bit data at 8 kHz, that is two full seconds of audio data (8000 samples/s x 2 channels x 2 bytes is 32 kB/s), so you would expect your audio to lag at least two seconds behind the user input. This is unacceptable for a synthesizer application.
The only real solution is to use the ALSA library directly (or some alternative sound API). Using this API, you can send buffered audio data to your audio output device without accumulating excessive queued data in kernel buffers.
See A Guide Through The Linux Sound API Jungle for some tips.
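For reference, a minimal sketch of direct ALSA playback with a small requested latency (assumes alsa-lib, linked with -lasound; the 20 ms figure is only an example):

#include <alsa/asoundlib.h>

int play(const short *samples, size_t nframes)
{
    snd_pcm_t *pcm;
    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0) < 0)
        return -1;
    /* Mono S16, 8 kHz, and crucially a small requested latency
       (20000 us = 20 ms) instead of the pipe's 64 kB of queued data. */
    if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                           SND_PCM_ACCESS_RW_INTERLEAVED,
                           1, 8000, 1, 20000) < 0)
        return -1;
    snd_pcm_writei(pcm, samples, nframes);
    snd_pcm_drain(pcm);
    snd_pcm_close(pcm);
    return 0;
}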

Playing 15 audio tracks at once with <50ms latency?

To summarise, my question is: is it possible to decode and play 15 lossily-compressed audio tracks on-the-fly at the same time with under 50ms latency and with no stuttering?
Background
I'm writing a sound library in plain C for a game I'm creating. I'm hoping to have up to 15 audio tracks playing at once with less than 50ms latency.
As of now, the library is able to play raw PCM files (48000Hz packed 16-bit samples), and can easily play 15 sounds at once at 45ms latency without stuttering and with minimal CPU usage. This is on my relatively old Intel Q9300 + SSD machine.
Since raw audio files are huge though, I augmented my library to support playing back OPUS files using opusfile (https://mf4.xiph.org/jenkins/view/opus/job/opusfile-unix/ws/doc/html/index.html). I was hoping that I'd still be able to play 15 sounds at once without the audio files taking up 200MB+. How wrong I was - I was only able to play 3 or 4 OPUS tracks at once before I could hear stuttering and other buffer underrun symptoms. CPU usage was also massively increased compared to raw PCM playback.
I also tried including VORBIS support using vorbisfile (http://www.xiph.org/vorbis/doc/vorbisfile/). I thought maybe decoding VORBIS on-the-fly wouldn't be as CPU intensive. VORBIS is a little better than OPUS - I can play 5 or 6 sounds at once before stuttering becomes audible (I guess VORBIS is indeed easier to decode) - but this is still nowhere near as good as playing back raw PCM files.
Before I delve into the low-level libvorbis/libopus APIs and investigate other audio compression formats, is it actually feasible to decode and play 15 lossily-compressed audio tracks on-the-fly at the same time with under 50ms latency and with no stuttering on a medium-to-low end desktop computer?
If it helps, my sound library currently calls a function approximately every 15ms which basically does the following (error-handling and post-processing omitted for clarity):
void onBufferUpdateNeeded(int numSounds, struct Sound *sounds,
                          uint16_t *bufferToUpdate, int numSamplesNeeded,
                          uint16_t *tmpBuffer) {
    int i, j;
    memset(bufferToUpdate, 0, numSamplesNeeded * sizeof(uint16_t));
    for (i = 0; i < numSounds; ++i) {
        /* Seek to the specified sample number in the already-opened
           file handle. The implementation of this depends on the file
           type (vorbis, opus, raw PCM). */
        seekToSample(sounds[i].fileHandle, sounds[i].currentSample);
        /* Read numSamplesNeeded samples from the file handle into
           tmpBuffer. */
        readSamples(tmpBuffer, sounds[i].fileHandle, numSamplesNeeded);
        /* Add the samples into the buffer. */
        for (j = 0; j < numSamplesNeeded; ++j) {
            bufferToUpdate[j] += tmpBuffer[j];
        }
    }
}
Thanks in advance for any help!
It sounds like you already know the answer to your own question: NO. Normally, the only advice I would have to questions like these (especially performance-related queries) is to try it and find out if it's possible. But you have already collected that data.
It's true that perceptual/lossy audio codecs tend to be computationally intensive to decode. It sounds like you want to avoid the storage overhead of raw PCM. In that case, if you can safely assume you'll have enough memory reserved for your application, you can decode the audio streams in advance, or employ some caching mechanism to deal with memory constraints. Perhaps this can be offloaded to a different thread (since the Q9300 CPU mentioned in your question is dual core).
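If memory permits, here is a sketch of the pre-decoding idea using the opusfile API from the question (a minimal sketch, assuming a single-link file that fits in memory; opusfile always decodes at 48 kHz):

#include <opusfile.h>
#include <stdlib.h>

/* Decode an entire Opus file into an interleaved 16-bit PCM buffer.
   Returns the buffer (caller frees) or NULL on error. */
opus_int16 *decode_whole_file(const char *path, ogg_int64_t *out_frames,
                              int *out_channels)
{
    int err;
    OggOpusFile *of = op_open_file(path, &err);
    if (of == NULL)
        return NULL;

    int channels = op_channel_count(of, -1);
    ogg_int64_t total = op_pcm_total(of, -1);   /* frames at 48 kHz */
    opus_int16 *pcm = malloc((size_t)total * channels * sizeof(*pcm));
    if (pcm == NULL) {
        op_free(of);
        return NULL;
    }

    ogg_int64_t filled = 0;
    while (filled < total) {
        /* op_read() returns the number of frames decoded, 0 at EOF. */
        int got = op_read(of, pcm + (size_t)filled * channels,
                          (int)((total - filled) * channels), NULL);
        if (got <= 0)
            break;
        filled += got;
    }
    op_free(of);

    *out_frames = filled;
    *out_channels = channels;
    return pcm;
}

Your mixing callback can then read from these buffers directly, at the same cost as the raw PCM path.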
Otherwise, you will need to seek out a compressor that has lower computational requirements. You might be interested in FLAC, sponsored by the same organization as Vorbis and Opus. It's lossless, so it won't compress quite as well as the lossy algorithms, but it should be much, much faster to decode.
And if that's still not suitable, browse around on this big list of ~150 audio codecs until you find one that meets your standards. Since you control the client software, you have a lot of choices (vs, e.g., streaming to a web browser).

Creating MIDI Files - Explanation of time division in header chunk

Edit: Posted on Audio/Video Production site https://video.stackexchange.com/questions/4148/creating-midi-files-explanation-of-time-division-in-header-chunk
I've been reading about MIDI file structure as I'm interested in writing an application that would read/write files in this format, but I'm a little confused about time division in the header chunk.
My understanding is that this part is essentially 16 bits, where if the sign bit is 1 the remaining bits specify an SMPTE timecode, and if it's 0 then the bits specify the number of ticks/pulses per quarter note (PPQ).
My questions, specifically, are:
What does a higher/lower PPQ do to a MIDI file? Does this change the quality of the sound? My understanding is that it does not affect tempo.
How does the SMPTE timecode affect the MIDI file in playback?
Essentially, I'm trying to understand what these actually mean to the end result.
I'm not registered over on that forum, so I'll paste it here:
I can answer part 1.
PPQ absolutely affects the playback timing of a MIDI file. It doesn't change the quality of the sound; it changes the rate at which events are processed.
Tempo is defined in terms of microseconds per quarter note, and delta times are counted in ticks, so one tick lasts tempo / PPQ microseconds. If you change the number of ticks (pulses) in a quarter note (PPQ) without rewriting the delta times, you effectively change the rate at which the file is played back. A standard value for PPQ is 480. If the only change you make to a file is to double the PPQ, each tick becomes half as long, so you essentially double the playback speed: at the default tempo of 500,000 microseconds per quarter note, PPQ 480 gives ticks of roughly 1042 microseconds, while PPQ 960 gives 521.
I know this is an old question, but it wasn't answered completely, or entirely accurately.
All MIDI files use delta times. There are no absolute timings in a MIDI file, SMPTE or not.
In original MIDI format files, the header timing information specifies the PPQN, or Pulses Per Quarter Note. The SetTempo meta-event specifies the number of microseconds per quarter note (the tempo). The MIDI event delta information specifies the number of pulses between this event and the last event.
In SMPTE-style MIDI files, the header timing information specifies two values - the frames per second, and frame subdivisions. Frames per second is literally FPS (some values need to be adjusted, like 29 being really 29.97). Frame subdivisions can be thought of as the number of pulses per frame. The MIDI event delta information specifies the number of frame subdivisions (or pulses) since the last event.
One important difference is, SMPTE files do not use the SetTempo meta-event. All time scales are fixed by the header timing field.
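To make the two schemes concrete, here is a small sketch of how a delta-time in ticks converts to seconds under each header format (illustrative values only, not from the question):

/* Metrical division: tick length depends on the current SetTempo value. */
double seconds_per_tick_metrical(int ppqn, long tempo_us_per_quarter)
{
    return (double)tempo_us_per_quarter / 1e6 / ppqn;
}

/* SMPTE division: fixed wall-clock rate, SetTempo is ignored.
   A stored value of 29 actually means 29.97 (drop-frame). */
double seconds_per_tick_smpte(int fps, int ticks_per_frame)
{
    double real_fps = (fps == 29) ? 29.97 : (double)fps;
    return 1.0 / (real_fps * ticks_per_frame);
}

/* e.g. seconds_per_tick_metrical(480, 500000) is ~0.00104 s per tick,
   and a delta of N ticks lasts N * seconds_per_tick. */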
#LeffelMania got it right, but I just wanted to add that SMPTE is simply a different way of keeping time in your arrangement. With SMPTE division, each tick corresponds to a fixed wall-clock interval regardless of any tempo, but the event timings themselves are still deltas relative to the previous events.
In my experience, most MIDI files use the conventional metrical timing (i.e., not SMPTE), as this is easier to work with.

Tell libavcodec/ffmpeg to drop frame

I'm building an app in which I create a video.
Problem is, sometimes (well... most of the time) the frame acquisition process isn't quick enough.
What I'm currently doing is skipping the current frame acquisition if I'm late; however, FFmpeg/libavcodec considers every frame I pass to it as the next frame in line, so if I drop 1 out of 2 frames, a 20-second video will only last 10. More problems come in as soon as I add sound, since sound processing is way faster...
What I'd like is to tell FFmpeg: "the last frame should last twice as long as originally intended", or anything else that would let me process in real time.
I tried stacking up the frames at one point, but this ends up eating all my memory (I also tried to 'stack' my frames on the hard drive, which was way too slow, as I expected).
I guess I'll have to work with the PTS manually, but all my attempts have failed, and reading the code of other apps that use ffmpeg, such as VLC, wasn't of great help... so any advice would be much appreciated!
Thanks a lot in advance!
Your output will probably be considered variable frame rate (VFR), but you can simply generate a timestamp using wall-clock time when a frame arrives and apply it to your AVFrame before encoding it. Then the frame will be displayed at the correct time on playback.
For an example of how to do this (at least the specifying-your-own-timestamp part), see doc/examples/muxing.c in the FFmpeg distribution (line 491 in my current git pull):
frame->pts += av_rescale_q(1, video_st->codec->time_base, video_st->time_base);
Here the author is incrementing the frame timestamp by 1 in the video codec's timebase, rescaled to the video stream's timebase. In your case, you can simply rescale the number of seconds since you started capturing frames from an arbitrary timebase to your output video stream's timebase (as in the above example). For example, if your arbitrary timebase is 1/1000 and you receive a frame 0.25 seconds after you started capturing, then do this:
AVRational my_timebase = {1, 1000};
frame->pts = av_rescale_q(250, my_timebase, avstream->time_base);
Then encode the frame as usual.
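Putting the above together, a sketch of stamping each captured frame from wall-clock time (a sketch only; assumes POSIX clock_gettime(), and avstream and capture_start are the stream and start time from your own setup code):

#include <time.h>
#include <libavutil/mathematics.h>   /* av_rescale_q */
#include <libavformat/avformat.h>

static int64_t ms_since(const struct timespec *start)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (int64_t)(now.tv_sec - start->tv_sec) * 1000
         + (now.tv_nsec - start->tv_nsec) / 1000000;
}

static void stamp_frame(AVFrame *frame, const struct timespec *capture_start,
                        const AVStream *avstream)
{
    AVRational ms_timebase = {1, 1000};   /* the arbitrary 1/1000 timebase */
    frame->pts = av_rescale_q(ms_since(capture_start), ms_timebase,
                              avstream->time_base);
}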
Many (most?) video formats don't permit leaving out frames. Instead, try reusing old video frames when you can't get a fresh one in time.
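A rough sketch of that idea, assuming the reference-counted AVFrame API (av_frame_clone()); encode() and slot_pts stand in for your own encode path and the timestamp of the missed slot:

#include <libavutil/frame.h>

void encode(AVFrame *f);        /* your existing encode path (placeholder) */

static AVFrame *last_frame = NULL;

/* Keep a copy of the last good frame; when acquisition is late,
   re-encode it with the pts of the missed slot. */
void on_capture_tick(AVFrame *fresh, int64_t slot_pts)
{
    if (fresh) {
        av_frame_free(&last_frame);
        last_frame = av_frame_clone(fresh);   /* remember the newest frame */
        fresh->pts = slot_pts;
        encode(fresh);
    } else if (last_frame) {
        last_frame->pts = slot_pts;           /* fill the gap */
        encode(last_frame);
    }
}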
Just an idea: when the processing is lagging, have you tried passing it the same frame again (and dropping the current one)? Maybe it can process the duplicated frame quickly.
There's the ffmpeg command-line switch -threads ... for multicore processing, so you should be able to do something similar with the API (though I have no idea how). This might solve your problem.

playing only part of a sound using FMOD

I'm trying to play only part of a sound using FMOD, say frames 50000-100000 of a 200000 frame file.
I have found a couple of ways to seek forward (i.e. to start playback at frame 50000), but I have not found a way to make sure the sound stops playing at frame 100000. Is there any way FMOD can natively do this without having to add libsndfile or the like into the picture?
I should also mention that I am using the streaming option. I have to assume that these sounds are arbitrarily large and cannot be comfortably/quickly loaded into memory.
You can use Channel::setDelay for sample accurate starting and stopping of sounds. Use FMOD_DELAYTYPE_DSPCLOCK_START to set the start time of the sound and FMOD_DELAYTYPE_DSPCLOCK_END to set the end time.
Check out the docs for Channel::setDelay, FMOD_DELAYTYPE, System::getDSPClock.
You should be able to use the streaming callback to stop the stream when you get to the desired point.
Option 1: When you create the stream, set lenbytes to an even divisor of the number of frames you wish to play. In your example, set lenbytes to 5000, then keep a counter in the callback. When the counter reaches 10, stop the stream.
Option 2: use FSOUND_Stream_AddSyncPoint with pcmoffset set to your desired stopping point. Register a callback with FSOUND_Stream_SetSyncCallback. Stop the stream in the callback.
To start playback at sample 50,000 and end at 100,000, you could do the following, assuming the sound file's sample rate and the system output rate are the same. As the DSP clock works in system output samples, you may need to do some maths to express your end sample in terms of the output rate. See Sound::getDefaults for the sound's sample rate and System::getSoftwareFormat for the system rate.
unsigned int sysHi, sysLo;
// ... create sound, play sound paused ...
// Seek the data to the desired start offset
channel->setPosition(50000, FMOD_TIMEUNIT_PCM);
// For accurate sample playback get the current system "tick"
system->getDSPClock(&sysHi, &sysLo);
// Set start offset to a couple of "mixes" in the future, 2048 samples is far enough in the future to avoid issues with mixer timings
FMOD_64BIT_ADD(sysHi, sysLo, 0, 2048);
channel->setDelay(FMOD_DELAYTYPE_DSPCLOCK_START, sysHi, sysLo);
// Set end offset for 50,000 samples from our start time, which means the end sample will be 100,000
FMOD_64BIT_ADD(sysHi, sysLo, 0, 50000);
channel->setDelay(FMOD_DELAYTYPE_DSPCLOCK_END, sysHi, sysLo);
// ... unpause sound ...
