I have a network video stream that I am decoding with the ffmpeg C library.
I'd like to cap the frame rate at some maximum, say 15 fps.
I used the filter fps=fps=15, but even on a 25 fps video stream this caused frame duplication. I presume this was due to network delays.
Is there some way to reduce the maximal frame-rate but avoid frame duplication and just get delays instead?
If not, is there a way to identify if a decoded frame is one of the duplicates?
Related
I am making a synthesizer by piping data into aplay (I know it's not ideal) and the sound is lagging behind the keypresses which alter the sound. I believe this is because aplay is consuming samples at a constant 8000 Hz, but the C program is producing them at an unstable rate. How do I get the for loop to run at 8000 Hz in C?
To generate audio samples at 8000 Hz (or any fixed rate) you don't want your loop to "run at" that rate. That would involve huge amounts of overhead (99.99% or more) spinning doing nothing until time to generate the next sample, and (especially if you sleep rather than spinning) would be unreliable in that your process might not wake-up/get-scheduled in time for some of the samples.
Instead, you just want to be producing samples at an overall rate matching what the consumer (aplay/the audio device) expects. You can compute the overall current sample number you should be generating up to as something like:
(current_time + buffer_depth - start_time) * sample_rate
then, after generating up to that sample, sleep for some period proportional to the buffer depth, but sufficiently less that you won't be in trouble if your process doesn't get scheduled again right away. The buffer depth you can use depends on what kind of latency you need. If you're making sounds for live/realtime events, you probably want a buffer depth of 1/50 sec (20 ms) or less. If not, you can happily use huge buffers like 5-10 seconds.
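A minimal sketch of that bookkeeping, assuming all times are nanoseconds read from one monotonic clock (for example `clock_gettime(CLOCK_MONOTONIC, ...)`). The function names and the choice to sleep for half the buffer depth are illustrative assumptions, not something prescribed above:

```c
#include <assert.h>
#include <stdint.h>

/* Sample index we should have generated up to at time now_ns, counting
 * from start_ns, if we want to stay depth_ns ahead of the consumer.
 * All times are nanoseconds on the same monotonic clock. */
uint64_t target_sample(uint64_t now_ns, uint64_t start_ns,
                       uint64_t depth_ns, uint32_t rate_hz)
{
    assert(now_ns >= start_ns);
    uint64_t ahead_ns = now_ns - start_ns + depth_ns;
    /* Split into whole seconds and remainder so the multiplication
     * cannot overflow 64 bits for realistic run times. */
    uint64_t secs = ahead_ns / 1000000000ull;
    uint64_t rem  = ahead_ns % 1000000000ull;
    return secs * rate_hz + rem * rate_hz / 1000000000ull;
}

/* Assumed policy: sleep for half the buffer depth, which leaves the
 * other half as slack if the process is not scheduled right away. */
uint64_t sleep_interval_ns(uint64_t depth_ns)
{
    return depth_ns / 2;
}
```

The main loop would then generate samples until its running count reaches `target_sample(...)`, write them out, and sleep for `sleep_interval_ns(depth_ns)` before checking the clock again.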
If you are piping data to aplay, you will not experience any problems with the sample rate (8 kHz, for example) because the kernel will block your program in write() once the pipe buffer is full. This will effectively limit your audio generation to 8 kHz with no work on your part.
However, this is far from ideal. Your application will only be throttled once the kernel buffer for the pipe is full, and the default size for pipe buffers on Linux is 64 kB. For stereo 16-bit data at 8 kHz, this is two full seconds of audio data, so you would expect your audio to lag at least two seconds from the user input. This is unacceptable for synthesizer applications.
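The arithmetic behind that figure, as a small sketch (the function name is made up for illustration, and 65536 bytes is assumed as the exact size of a 64 kB pipe buffer):

```c
/* Seconds of PCM audio that fit in a buffer of buffer_bytes bytes. */
double buffered_seconds(unsigned buffer_bytes, unsigned rate_hz,
                        unsigned channels, unsigned bits_per_sample)
{
    unsigned bytes_per_second = rate_hz * channels * (bits_per_sample / 8);
    return (double)buffer_bytes / bytes_per_second;
}
```

With `buffered_seconds(65536, 8000, 2, 16)` the byte rate is 8000 * 2 * 2 = 32000 bytes per second, so a full pipe holds a little over two seconds of audio; for mono data the same pipe holds about twice as much.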
The only real solution is to use the ALSA library directly (or some alternative sound API). Using this API, you can send buffered audio data to your audio output device without accumulating excessive queued data in kernel buffers.
See A Guide Through The Linux Sound API Jungle for some tips.
I am writing a small module in C to handle jitter and drift for a full-duplex audio system. It acts as a very primitive voice chat module, which connects to an external modem that uses a separate clock, independent from my master system clock (ie: it is not slaved off of the system master clock).
The source is based off of an existing example available online here: http://svn.xiph.org/trunk/speex/libspeex/jitter.c
I have 4 audio streams:
Network uplink (my voice, after processing, going to the far side speaker)
Network downlink (far side's voice, before processing, coming to me)
Speaker output (the far side's voice, after processing, to the local speakers)
Mic input (my voice, before processing, coming from the local microphone)
I have two separate threads of execution. One handles the local devices and buffer (ie: playing processed audio to the speakers, and capturing data from the microphone and passing it off to the DSP processing library to remove background noise, echo, etc). The other thread handles pulling the network downlink signal and passing it off to the processing library, and taking the processed data from the library and pushing it via the uplink connection.
The two threads use mutexes and a set of shared circular/ring buffers. I am looking for a way to implement a sure-fire (safe and reliable) jitter and drift correction mechanism. By jitter, I am referring to a clock having variable duty cycle, but the same frequency as an ideal clock.
The other potential issue I would need to correct is drift, which would assume both clocks use an ideal 50% duty cycle, but their base frequency is off by ±5%, for example.
Finally, these two issues can occur simultaneously. What would be the ideal approach to this? My current approach is to use a type of jitter buffer. They are just data buffers which implement a moving average to count their average "fill" level. If a thread tries to read from the buffer, and not-enough data is available and there is a buffer underflow, I just generate data for it on-the-fly by either providing a spare zeroed-out packet, or by duplicating a packet (ie: packet loss concealment). If data is coming in too quickly, I discard an entire packet of data, and keep going. This handles the jitter portion.
The second half of the problem is drift correction. This is where the average fill level metric comes in useful. For all buffers, I can calculate the relative growth/reduction levels in various buffers, and add or subtract a small number of samples every so often so that all buffer levels hover around a common average "fill" level.
Does this approach make sense, and are there any better or "industry standard" approaches to handling this problem?
Thank you.
References
Word Clock – What’s the difference between jitter and frequency drift?, Accessed 2014-09-13, <http://www.apogeedigital.com/knowledgebase/fundamentals-of-digital-audio/word-clock-whats-the-difference-between-jitter-and-frequency-stability/>
Jitter.c, Accessed 2014-09-13, <http://svn.xiph.org/trunk/speex/libspeex/jitter.c>
I faced a similar, although admittedly simpler, problem. I won't be able to fully answer your question, but I hope sharing my solutions to some practical problems I ran into will benefit you anyway.
Last year I was working on a system that had to simultaneously record from and render to multiple audio devices, each potentially ticking off a different clock. The most obvious example is a duplex stream on 2 devices, but it also handled setups with multiple inputs only or multiple outputs only. All in all it was a bit simpler than your situation (single-threaded and no network I/O). In the end I don't believe dealing with more than 2 devices is harder than dealing with 2; any system with multiple clocks is going to have to deal with the same problems.
Some stuff I've learned:
Pick one stream and designate its clock as "the truth" (i.e., sync all other streams to a common master clock). If you don't do this you won't have a well-defined notion of "current sample position", and without it there's nothing to sync to. This also has the benefit that at least one stream in the system will always be clean (no dropping/padding samples).
Your approach of using an additional buffer to handle jitter is correct. Without it you'd be constantly dropping/padding even on streams with the same nominal sample rate.
Consider whether or not you want to introduce such a jitter buffer for the "master" stream as well. Doing so means introducing artificial latency in the master stream; not doing so means the rest of your streams will lag behind.
I'm not sure whether it's a good idea to drop entire packets. Why not try to use as many of the samples as possible? Especially with large packet sizes, dropping only the excess is far less noticeable.
To elaborate on the above, I got badly bitten by the following case: assume s1 (master) produces 48000 frames every second and s2 produces 96000 every 2 seconds. Round 1: read 48000 from s1, 0 from s2. Round 2: read 48000 from s1, 96000 from s2 -> overflow. Discard entire packet. Round 3: read 48000 from s1, 0 from s2. Etc. Obviously this is a contrived example, but I ran into cases where on average I dropped 50% of the secondary stream's data using this scheme. Introducing the jitter buffer did help but didn't completely fix this problem. Note that this is not strictly related to clock jitter/skew; it's just that some drivers like to update their padding values periodically and will not accurately report what is really in the hardware buffer.
Another variation on this problem happens when you really do have clock jitter but the API of your choice doesn't let you control packet size (e.g., it won't let you request fewer frames than are actually available). Assume s1 (master) recording at 1000 Hz and s2 alternating each second between 1000 Hz and 1001 Hz. Round 1: read 1000 frames from both. Round 2: read 1000 frames from s1 and 1001 from s2 -> overflow. Etc.; on average you'll dump around 50% of the frames on s2. Note that this is not so much of a problem if your API lets you say "give me 1000 samples even though I know you've got more". By doing so, though, you'll eventually overflow the hardware input buffer.
To have the most control over when to drop/pad, I found it easiest to always keep input buffers empty and output buffers full. This way all dropping/padding takes place in the jitter buffer and you'll at least know and control what's happening.
If possible try to separate your program logic: the hard part is finding out where to pad/drop samples. Once you've got that in place it's easy to try different variations of pad/drop, sample-and-hold, interpolation etc.
All in all I'd say your solution looks very reasonable, although I'm not sure about the "drop entire packet thing" and I'd definitely pick one stream as the master to sync against. For completeness here's the solution I eventually came up with:
1: Assume a jitter buffer of size J on each stream.
2: Wait for a packet of size M to become available on the master stream (M is typically derived from the stream latency). We're going to deliver M frames of input/output to the app. I didn't implement an additional buffer on the master stream.
3: For all input streams: let H be the number of recorded frames in the hardware buffer, B be the number of recorded frames currently in the jitter buffer, and A be the number of frames available to the application: A equals H + B.
3a: If A < M, we have input underflow. Offer A recorded frames + (M - A) padding frames to the app. Since the device is likely slow, fill 1/2 of the jitter buffer with silence.
3b: If A == M, offer A frames to the app. The jitter buffer is now empty.
3c: If A > M but (A - M) <= J, offer M recorded frames to the app. A - M frames stay in the jitter buffer.
3d: If A > M and (A - M) > J, we have input overflow. Offer M recorded frames to the app and put J/2 of the remaining frames back in the jitter buffer. We use up M + J/2 frames and drop A - (M + J/2) frames as overflow. Don't try to keep the jitter buffer full, because the device is likely fast and we don't want to overflow again on the next round.
4: Sort of the inverse of 3: for outputs, fast devices will underflow, slow devices will overflow.
A, H and B are the same thing, but this time they don't represent available frames but available padding (i.e., how many frames I can offer to the app to write to).
Try to keep hardware buffers full at all costs.
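Step 3 for a single input stream can be sketched as one decision function. This is only my reading of cases 3a to 3d, with made-up names (`input_round`, `plan_input_round`); it only computes frame counts and leaves the actual copying between the hardware buffer, jitter buffer and app to the caller:

```c
#include <assert.h>

/* Outcome of one round for a secondary input stream (counts in frames). */
struct input_round {
    unsigned recorded;  /* recorded frames offered to the app */
    unsigned padding;   /* silence frames added to reach M */
    unsigned dropped;   /* frames discarded as overflow */
    unsigned buffered;  /* frames in the jitter buffer afterwards */
};

/* h: recorded frames in the hardware buffer, b: recorded frames in the
 * jitter buffer, m: frames the master delivers this round,
 * j: jitter buffer size. */
struct input_round plan_input_round(unsigned h, unsigned b,
                                    unsigned m, unsigned j)
{
    assert(b <= j);
    unsigned a = h + b;                  /* A = H + B */
    struct input_round r = {0, 0, 0, 0};

    if (a < m) {                         /* 3a: underflow */
        r.recorded = a;
        r.padding  = m - a;
        r.buffered = j / 2;              /* refilled with silence */
    } else if (a - m <= j) {             /* 3b and 3c: excess fits */
        r.recorded = m;
        r.buffered = a - m;              /* 0 when A == M */
    } else {                             /* 3d: overflow */
        r.recorded = m;
        r.buffered = j / 2;
        r.dropped  = a - m - j / 2;
    }
    return r;
}
```

For example, with M = 480 and J = 64, a round that finds 600 frames in the hardware buffer and an empty jitter buffer offers 480 frames, keeps 32 and drops 88. Step 4 for outputs would mirror this with A, H and B counting free space instead of recorded frames.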
This scheme worked out quite well for me, although there are a few things to consider:
It involves a lot of bookkeeping. Make sure that for input buffers, data always flows from hardware->jitter buffer->application and for outputs always from app->jitter buffer->hardware. It's very easy to make the mistake of thinking you can "skip" frames in the jitter buffer if there are enough samples available to pass from the hardware directly to the app. This will essentially mess up the chronological order of frames in an audio stream.
This scheme introduces variable latency on secondary streams because I try to postpone the moment of padding/dropping as long as possible. This may or may not be a problem. I found that in practice postponing these operations gives audibly better results, probably because many "minor" glitches of only a few samples are more annoying than the occasional larger hiccup.
Also, PortAudio (an open source audio project) has implemented a similar scheme, see http://www.portaudio.com/docs/proposals/001-UnderflowOverflowHandling.html. It may be worthwhile to browse through the mailing list and see what problems/solutions came up there.
Note that everything I've said so far is only about interaction with the audio hardware. I have no idea whether this will work equally well with the network streams, but I don't see any obvious reason why not. Just pick 1 audio stream as the master and sync the other one to it, and do the same for the network streams. This way you'll end up with two more-or-less independent systems connected only by the ring buffer, each with an internally consistent clock, each running on its own thread. If you're aiming for low audio latency, you'll also want to drop the mutexes and opt for a lock-free FIFO of some sort.
I am curious to see if this is possible. I'll throw in my two bits though.
I am a novice programmer, but studied audio engineering/interactive audio.
My first assumption is that this is not possible. At least not on a sample-to-sample basis. Especially not for complex audio data and waveforms such as human speech. The program could have no expectation of what the waveform "should" look like.
This is why there are high-end audio interfaces with temperature controlled internal clocks.
On the other hand, maybe there is a library that can detect the symptoms of jitter, somehow...
In which case I would be very curious to hear about it.
As far as drift correction, maybe I don't understand something on the programming front, but shouldn't you be pulling audio at a specific sample rate? I believe sample rate/drift is handled at the hardware level.
I really hope this helps. You might have to steer me closer to home.
I have a raw video of 1000 frames. I am doing Inverse Perspective Mapping of these frames and storing the results on the hard disk, but this process takes around 10 minutes.
Is there any other way in which speed can be improved? I am using the cvWarpPerspective and cvGetPerspectiveTransform functions. I have to do it in real time with a maximum delay of 500 ms.
You could use OpenGL for hardware acceleration, but your biggest bottleneck is likely to be writing the images back to disk.
Assuming the images aren't small, half a second to load, warp and save 1000 raw frames is very demanding. What is the reason for this specification?
I'm building an app in which I create a video.
Problem is, sometimes (well... most of the time) the frame acquisition process isn't quick enough.
What I'm currently doing is to skip the current frame acquisition if I'm late. However, FFMPEG/libavcodec considers every frame I pass to it as the next frame in line, so if I drop 1 out of 2 frames, a 20-second video will only last 10 seconds. More problems come in as soon as I add sound, since sound processing is way faster...
What I'd like is to tell FFMPEG: "the last frame should last twice as long as originally intended", or anything that would allow me to process in real time.
I tried to stack the frames at one point, but this ends up eating all my memory (I also tried to 'stack' my frames on the hard drive, which was way too slow, as I expected).
I guess I'll have to work with the pts manually, but all my attempts have failed, and reading some other apps code which use ffmpeg, such as VLC, wasn't of a great help... so any advice would be much appreciated!
Thanks a lot in advance!
Your output will probably be considered variable frame rate (VFR), but you can simply generate a timestamp using wallclock time when a frame arrives and apply it to your AVFrame before encoding it. Then the frame will be displayed at the correct time on playback.
For an example of how to do this (at least the part about specifying your own timestamp), see doc/examples/muxing.c in the ffmpeg distribution (line 491 in my current git pull):
frame->pts += av_rescale_q(1, video_st->codec->time_base, video_st->time_base);
Here the author is incrementing the frame timestamp by 1 in the video codec's timebase, rescaled to the video stream's timebase. In your case you can simply rescale the time elapsed since you started capturing frames from an arbitrary timebase to your output video stream's timebase (as in the above example). For example, if your arbitrary timebase is 1/1000 and you receive a frame 0.25 seconds after you started capturing, then do this:
AVRational my_timebase = {1, 1000};
frame->pts = av_rescale_q(250, my_timebase, avstream->time_base);
Then encode the frame as usual.
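To make the arithmetic concrete, here is a standalone sketch of what the rescale step computes. It is not ffmpeg code: `struct rational` and `rescale_ts` are hypothetical stand-ins for `AVRational` and `av_rescale_q`, written only for non-negative timestamps and with rounding to nearest, and the 1/90000 stream timebase in the example is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a timebase: one tick lasts num/den seconds. */
struct rational { int num; int den; };

/* Convert timestamp t from timebase `from` to timebase `to`,
 * rounding to the nearest tick. Assumes t >= 0, positive timebases,
 * and values small enough that t * num * den fits in 64 bits. */
int64_t rescale_ts(int64_t t, struct rational from, struct rational to)
{
    assert(t >= 0 && from.num > 0 && from.den > 0);
    assert(to.num > 0 && to.den > 0);
    int64_t mul = (int64_t)from.num * to.den;
    int64_t div = (int64_t)from.den * to.num;
    return (t * mul + div / 2) / div;
}
```

A frame captured 250 ms after the start, expressed in a 1/1000 timebase and rescaled to a 1/90000 stream timebase, gets pts 250 * 90000 / 1000 = 22500. In real code you would keep calling `av_rescale_q` and feed it the elapsed time measured from a monotonic clock.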
Many (most?) video formats don't permit leaving out frames. Instead, try reusing old video frames when you can't get a fresh one in time.
Just an idea.. when it's lagging with the processing have you tried to pass to it the same frame again (and drop the current one)? Maybe it can process the duplicated frame quickly.
There's this ffmpeg command line switch -threads ... for multicore processing, so you should be able to do something similar with the API (though I have no idea how). This might solve your problem.
Can anyone say how sampling rate and frame size are related?
I decoded a spx file to wav, with sampling rate of 10 kHz and at 16 bit. The frame size applied during the decoding process was 640.
The decoded file is playable in vlc. But I want to play that file in Flex.
Flex supports rates of 44.1 kHz, 22.05 kHz and 11.025 kHz only. I want to increase the sampling rate during the decoding process. I know how to do that in the code, but I guess the frame size should also be increased. I don't know the dependency between these two. Can anyone help?
Frame size and sampling rate are generally orthogonal concepts. They don't need to affect each other unless a particular format demands it.
For PCM .wav, the frame size will always be bits per sample * channels. In your case, 16 bits for mono, or 32 bits for stereo.
Also, there is no need to change the decoding frame size only because you later apply resampling.
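As a small sketch of both points (function names are made up, and the 320-sample block and 10 kHz to 44.1 kHz conversion are only example numbers): the PCM frame size depends on sample width and channel count alone, while resampling changes how many samples come out of a decoded block without touching the decoder's frame size.

```c
/* Size of one PCM frame in bits: one sample for each channel. */
unsigned pcm_frame_bits(unsigned bits_per_sample, unsigned channels)
{
    return bits_per_sample * channels;
}

/* Number of output samples (rounded down) a resampler produces from
 * a block of in_samples when converting from_hz to to_hz. */
unsigned resampled_count(unsigned in_samples, unsigned from_hz,
                         unsigned to_hz)
{
    return (unsigned)((unsigned long long)in_samples * to_hz / from_hz);
}
```

For instance, a decoded block of 320 samples at 10 kHz would come out of a 44.1 kHz resampler as roughly 1411 samples, while `pcm_frame_bits(16, 1)` stays at 16 either way.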
You are mixing two independent tasks: Speex decoding and resampling. The frame size you mention should be considered only as a buffer that contains PCM samples. You should pass these PCM samples to a resampler (for example SSRC: http://shibatch.sourceforge.net/).
Frame size depends on the codec used to compress the original data. It will contain an integral number of samples (320 in this case).
If I'm correct, raw audio has a frame size equal to the sample size. However, some codecs perform compression over a range of samples. Usually, the larger the frame size, the more memory is needed to compress the data, but the better the compression you can potentially achieve.
You can't increase the sampling rate during decoding; however, you could resample the decoded audio. Presumably you're actually re-encoding the data to send it to Flex? You'll need to have a look at the codec you're using to re-encode. Which codec are you using?
Irrespective of the number of channels used, the frame rate and the sampling rate are the same,
because that is the purpose of TDM.
New channels are introduced in the gap left between two consecutive samples.
As the number of channels increases, the time allotted to each channel decreases, and with it the time taken by each bit.
But the time gap between consecutive samples of any one channel remains constant and is equal to the total frame time.
I.e., time gap between samples = frame time, hence the frame rate is equal to the sample rate.