0xC00D4A44 MF_E_SINK_NO_SAMPLES_PROCESSED with MPEG-4 sink

I am running out of ideas on why I am getting this HRESULT.
I have a pipeline in Media Foundation. A file is loaded through the source resolver. I am using the media session.
Here is my general pipeline:
Source Reader -> Decoder -> Color Converter (to RGB24) -> Custom MFT -> Color Converter (to YUY2) -> H.264 Encoder -> MPEG-4 Sink
In my custom MFT I do some editing to the frames. One of the tasks of the MFT is to filter samples and drop the undesired ones.
This pipeline is used to trim video and output an MP4 file.
For example, if the user wants to trim 3 seconds from the 10-second marker, my MFT reads the uncompressed sample time and, if the sample is out of range, discards it by asking for more samples. If a sample is in range, it is passed to the next color converter. My MFT handles frames in RGB24, hence the initial color converter; the second color converter transforms the color space for the H.264 encoder. I am using the High Profile, Level 4.1 encoder.
The pipeline gets set up properly. All of the frames get passed to the sink, and I have a wrapper around the MPEG-4 sink, so I can see that BeginFinalize and EndFinalize get called.
However, on some of my trim operations, EndFinalize spits out MF_E_SINK_NO_SAMPLES_PROCESSED. It seems random, but it usually happens when a range not close to the beginning is selected.
It might be due to sample times. I am rebasing the sample times and durations.
For example, if the adjusted frame duration is 50 ms (selected by the user), I grab the first acceptable sample (say at 1500 ms) and rebase it to 0. The next one arrives at 1550 ms in my MFT and is set to 50 ms, and so on, so frame times are set in 50 ms increments.
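Roughly, the drop/rebase step in my MFT looks like this (a simplified sketch; m_rangeStart, m_rangeEnd and m_nextOutputTime are just illustrative member names, and the times are in the 100-ns units Media Foundation uses, not milliseconds):

HRESULT MyTrimMFT::RebaseOrDrop(IMFSample *pSample, bool *pDropped)
{
    LONGLONG time = 0, duration = 0;
    HRESULT hr = pSample->GetSampleTime(&time);
    if (SUCCEEDED(hr)) hr = pSample->GetSampleDuration(&duration);
    if (FAILED(hr)) return hr;

    if (time < m_rangeStart || time >= m_rangeEnd)
    {
        // Out of range: produce no output for this sample; ProcessOutput then
        // returns MF_E_TRANSFORM_NEED_MORE_INPUT so the pipeline feeds us more.
        *pDropped = true;
        return S_OK;
    }

    // In range: rebase so the first kept sample starts at 0 and the rest
    // follow in frame-duration increments.
    hr = pSample->SetSampleTime(m_nextOutputTime);
    if (SUCCEEDED(hr)) hr = pSample->SetSampleDuration(duration);
    m_nextOutputTime += duration;
    *pDropped = false;
    return hr;
}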
Is this approach correct? Could it be that the sink is not receiving enough samples to write the headers and finalize the file?
As mentioned, it works in some cases but fails in most. I am running my code on Windows 10.

I tried to implement the same task using IMFMediaSession/IMFTopology, but ran into the same problems you are facing. I think IMFMediaSession either modifies the timestamps outside your MFT, or expects your MFT not to modify them.
So, to make this work, I took the IMFSourceReader -> IMFSinkWriter approach.
This way I could modify the timestamps of the samples read from the reader and pass to the writer only those that fall into the given range.
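In rough outline, the loop looks like this (a simplified sketch: MFStartup, the media-type negotiation on the writer, audio handling and most error checking are omitted; the trim range and the single stream index are placeholders):

#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>

// Copy only the samples inside [trimStart, trimEnd) and rebase their times to 0.
// Assumes MFStartup(MF_VERSION) has been called and that the writer has one
// stream (index 0) whose input type matches the reader's output type.
HRESULT TrimFile(PCWSTR srcUrl, PCWSTR dstUrl, LONGLONG trimStart, LONGLONG trimEnd)
{
    IMFSourceReader *pReader = nullptr;
    IMFSinkWriter   *pWriter = nullptr;

    HRESULT hr = MFCreateSourceReaderFromURL(srcUrl, nullptr, &pReader);
    if (SUCCEEDED(hr)) hr = MFCreateSinkWriterFromURL(dstUrl, nullptr, nullptr, &pWriter);
    // ... AddStream / SetInputMediaType for the writer's video stream goes here ...
    if (SUCCEEDED(hr)) hr = pWriter->BeginWriting();

    while (SUCCEEDED(hr))
    {
        DWORD streamIndex = 0, flags = 0;
        LONGLONG timestamp = 0;
        IMFSample *pSample = nullptr;

        hr = pReader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM,
                                 0, &streamIndex, &flags, &timestamp, &pSample);
        if (FAILED(hr) || (flags & MF_SOURCE_READERF_ENDOFSTREAM))
            break;

        if (pSample && timestamp >= trimStart && timestamp < trimEnd)
        {
            // Rebase so the output starts at 0, then hand the sample to the writer.
            pSample->SetSampleTime(timestamp - trimStart);
            hr = pWriter->WriteSample(0, pSample);
        }
        if (pSample) pSample->Release();
    }

    if (SUCCEEDED(hr)) hr = pWriter->Finalize();
    if (pWriter) pWriter->Release();
    if (pReader) pReader->Release();
    return hr;
}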
Furthermore, you can take a look at the old MFCopy example. It does exactly the file trimming as you described it. You can download it from here: https://sourceforge.net/projects/mfnode/

Related

How to locate positions and get intermediate results in the VTM source code?

I want to implement deep learning-based video compression, but it is difficult to get the intermediate results, so I want to ask whether there are convenient ways to do this.
Could you clarify what you mean by 'intermediate results'?
If you mean the reconstructed frames of VTM, you can get a buffer from the Picture class.
EncGOP encodes every frame in the GOP and runs the in-loop filters, so you can get intermediate frames from EncGOP while encoding.
On the decoder side, you can get the same buffer in DecLib.
I hope this answer helps you.

How can I get current microphone input level with C WinAPI?

Using the Windows API, I want to implement something like the microphone input level meter shown in Windows Settings, i.e. get the current microphone input level.
I am not allowed to use external audio libraries, but I can use Windows libraries. So I tried using waveIn functions, but I do not know how to process audio input data in real time.
This is the method I am currently using:
Record for 100 milliseconds
Select highest value from the recorded data buffer
Repeat forever
But I think this is way too hacky, and not a recommended way. How can I do this properly?
Having built a tuning wizard for a very dated, but well-known, A/V conferencing application, I can say that what you describe is nearly identical to what I did.
A few considerations:
Enqueue 5 to 10 of those 100 ms buffers into the audio device via waveInAddBuffer; IIRC, weird things happen when the waveIn queue goes empty. Then, as each waveInProc callback occurs, search the completed buffer for the sample with the highest absolute value, as you describe, plot that onto your visualization, and requeue the completed buffer.
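Here is a rough sketch of that buffer loop (polling WHDR_DONE instead of using a waveInProc callback, just to keep the sketch short; the format, buffer count and console output are placeholders for your own setup):

#include <windows.h>
#include <mmsystem.h>
#include <cstdlib>
#include <cstdio>
#pragma comment(lib, "winmm.lib")

int main()
{
    WAVEFORMATEX fmt = {};
    fmt.wFormatTag      = WAVE_FORMAT_PCM;
    fmt.nChannels       = 1;
    fmt.nSamplesPerSec  = 44100;
    fmt.wBitsPerSample  = 16;
    fmt.nBlockAlign     = fmt.nChannels * fmt.wBitsPerSample / 8;
    fmt.nAvgBytesPerSec = fmt.nSamplesPerSec * fmt.nBlockAlign;

    HWAVEIN hwi = nullptr;
    if (waveInOpen(&hwi, WAVE_MAPPER, &fmt, 0, 0, CALLBACK_NULL) != MMSYSERR_NOERROR)
        return 1;

    const int BUFFERS = 8;
    const int SAMPLES = 4410;                 // ~100 ms at 44.1 kHz
    short     data[BUFFERS][SAMPLES] = {};
    WAVEHDR   hdr[BUFFERS] = {};

    for (int i = 0; i < BUFFERS; ++i)         // prepare and queue all buffers
    {
        hdr[i].lpData         = reinterpret_cast<LPSTR>(data[i]);
        hdr[i].dwBufferLength = SAMPLES * sizeof(short);
        waveInPrepareHeader(hwi, &hdr[i], sizeof(WAVEHDR));
        waveInAddBuffer(hwi, &hdr[i], sizeof(WAVEHDR));
    }
    waveInStart(hwi);

    for (;;)                                  // poll completed buffers forever
    {
        for (int i = 0; i < BUFFERS; ++i)
        {
            if (!(hdr[i].dwFlags & WHDR_DONE))
                continue;

            int peak = 0;                     // highest absolute sample value
            for (int s = 0; s < SAMPLES; ++s)
            {
                int v = abs(static_cast<int>(data[i][s]));
                if (v > peak) peak = v;
            }
            printf("level: %d / 32768\n", peak);  // replace with your drawing code

            hdr[i].dwFlags &= ~WHDR_DONE;     // requeue the completed buffer
            waveInAddBuffer(hwi, &hdr[i], sizeof(WAVEHDR));
        }
        Sleep(10);
    }
}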
It might seem obvious to map the sample value linearly onto your visualization. For example, to plot a 16-bit sample:
// convert sample magnitude from 0..32768 to 0..N
length = (sample * N) / 32768;
DrawLine(length);
But then, when you speak into the microphone, the visualization won't seem very "active" or "vibrant".
A better approach is to give more weight to the lower-energy samples. An easy way to do this is to replot along the μ-law (or a logarithmic) curve, or use a table lookup:
length = (sample * N) / 32768;              // linear 0..N
length = N * log(1 + length) / log(1 + N);  // boost lower-energy samples
length = min(length, N);                    // clamp to the meter width
DrawLine(length);
You can tweak the above approach to whatever looks good.
Instead of computing the values yourself, you can rely on values from Windows. These are actually the values displayed in your screenshot from Windows Settings.
See the following sample for the IAudioMeterInformation interface:
https://learn.microsoft.com/en-us/windows/win32/coreaudio/peak-meters.
It is written for playback, but you can use it for capture as well.
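A minimal sketch of that approach (default capture endpoint; COM error handling and cleanup omitted). As noted below, the reported peak stays at 0 unless some stream has the microphone open:

#include <windows.h>
#include <mmdeviceapi.h>
#include <endpointvolume.h>
#include <cstdio>

int main()
{
    CoInitialize(nullptr);

    IMMDeviceEnumerator *pEnum = nullptr;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), nullptr, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void**)&pEnum);

    IMMDevice *pMic = nullptr;
    pEnum->GetDefaultAudioEndpoint(eCapture, eConsole, &pMic);

    IAudioMeterInformation *pMeter = nullptr;
    pMic->Activate(__uuidof(IAudioMeterInformation), CLSCTX_ALL, nullptr,
                   (void**)&pMeter);

    for (;;)
    {
        float peak = 0.0f;                 // 0.0 .. 1.0
        pMeter->GetPeakValue(&peak);
        printf("peak: %.3f\n", peak);      // 0 unless a capture stream is open
        Sleep(100);
    }
}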
Some remarks: if you open IAudioMeterInformation for a microphone but no application has opened a stream on that microphone, the level will be 0.
This means that if you want to display your microphone's peak meter, you will need to open a microphone stream, as you already did.
Also, read the documentation about IAudioMeterInformation; it may not be what you need, since it gives the peak value. It depends on what you want to do with it.

FFmpeg filter for selecting video/audio streams

I am trying to create a node (a collection of nodes is fine too) that takes in many streams and an index, and outputs the one stream specified by the index. Basically, I want to create a mux node, something like:
Node : Stream ... Number -> Stream
FFmpeg's filter graph API seems to have two filters for doing that: streamselect (for video) and astreamselect (for audio). And for the most part, they seem to do what I want:
[in0][in1][in2]streamselect=inputs=3:map=1[out]
This filter will take in three video streams and output the second one, in1.
You can use a similar filter for audio streams:
[in0][in1]astreamselect=inputs=2:map=0[out]
which will take in two streams and output the first one, in0.
The question is, can I create a filter that takes in a list of both audio and video streams and outputs the stream based only on the stream index? So something like:
[v0][v1][a0][a1][a2]avstreamselect=inputs=5:map=3[out]
Which maps a1 to out?
If it helps I am using the libavfilter C API rather than the command line interface.
While it may not be possible with one filter [1], it is certainly possible to do this by combining multiple filters: one select for either audio or video (depending on which one you are selecting), and a bunch of nullsink or anullsink filters for the rest of them.
For example, the hypothetical filter:
[v0][v1][a0][a1]avstreamselect=inputs=4:map=2[out]
which takes in two video streams and two audio streams, and returns the third stream (the first audio stream), can be written as:
[a0][a1]astreamselect=inputs=2:map=0[out];
[v0]nullsink;[v1]nullsink
Here we run the select over the audio streams to pick the one we want, and all of the remaining streams are mapped to null sinks. This idea could be generalized to use only nullsink, anullsink, copy, and acopy; for example, we could also have written it with 4 nodes:
[a0]acopy[out];
[a1]anullsink;
[v0]nullsink;
[v1]nullsink
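If you are generating the graph from code anyway (e.g. before handing a string to avfilter_graph_parse_ptr), a small helper along these lines can emit that select-plus-sinks form for an arbitrary global stream index. The [v0]/[a0] labels are the same placeholder names used above; wiring them to real buffersrc/buffersink filters is still up to you:

#include <string>
#include <vector>

// Build a filtergraph string that keeps one stream (by global index) and routes
// every other stream into a nullsink/anullsink. isAudio[i] says whether input i
// is audio; labels are assumed to be v0,v1,... and a0,a1,... as above.
std::string BuildSelectGraph(const std::vector<bool>& isAudio, size_t keepIndex)
{
    std::string graph;
    size_t v = 0, a = 0;
    for (size_t i = 0; i < isAudio.size(); ++i)
    {
        std::string label = isAudio[i] ? "a" + std::to_string(a++)
                                       : "v" + std::to_string(v++);
        if (!graph.empty()) graph += ";";
        if (i == keepIndex)
            graph += "[" + label + "]" + (isAudio[i] ? "acopy" : "copy") + "[out]";
        else
            graph += "[" + label + "]" + (isAudio[i] ? "anullsink" : "nullsink");
    }
    return graph;
}

// BuildSelectGraph({false, false, true, true}, 2) yields:
// "[v0]nullsink;[v1]nullsink;[a0]acopy[out];[a1]anullsink"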
[1] I still don't know whether it is or not. Feel free to remove this footnote if it actually is possible.

Images and Filters in OpenCL

Let's say I have an image called Test.jpg.
I just figured out how to bring an image into the project with the following line:
FILE *infile = fopen("Stonehenge.jpg", "rb");
Now that I have the file, do I need to convert this file into a bmp image in order to apply a filter to it?
I have never worked with images before, let alone OpenCL, so there is a lot that is going over my head.
I need further clarification on this part for my own understanding:
Does this bmp image also need to be stored in an array in order to have a filter applied to it? I have seen a sliding window technique used a couple of times in other examples. Is the bmp image pretty much split up into RGB values (0-255)? If someone can provide a link on this topic, that would help me understand it a lot better.
I know this may seem like a basic question to most but I do not have a mentor on this subject in my workplace.
Now that I have the file, do I need to convert this file into a bmp image in order to apply a filter to it?
Not exactly. BMP is a very specific image serialization format, and actually quite a complicated one (implementing a BMP file parser that deals with all the corner cases correctly is rather difficult).
However, what you have there so far is not even the file's content. What you have is a C stdio FILE handle, and that's it; you have not even checked whether the file could be opened. That is not very useful yet.
JPEG is a lossy compressed image format. What you need in order to "work" with it is a pixel value array: either an array of component tuples, or a number of arrays, one per component (depending on your application, either layout may perform better).
Implementing image format decoders yourself is tedious. It's not exactly difficult, but it's also not something you can write down in a single evening. The devil is in the details, and writing an implementation that is high quality, covers all corner cases and is fast is a major effort. That's why for every image (and video and audio) format out there you can usually find only a small number of encoder and decoder implementations. The de facto standard codec libraries for JPEG are libjpeg and libjpeg-turbo. If your aim is to read just JPEG files, then these libraries would be the go-to implementations. However, you may also want to support PNG files, and then maybe EXR, and so on, and then things become tedious again. So there are meta-libraries which wrap all those format-specific libraries and offer them through a universal API.
In the OpenGL wiki there's a dedicated page on the current state of image loader libraries: https://www.opengl.org/wiki/Image_Libraries
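For example, with a single-header loader like stb_image (just one of the many options such a page lists; the file name is a placeholder), getting a JPEG into a plain interleaved RGB byte array looks roughly like this:

#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"   // https://github.com/nothings/stb
#include <cstdio>

int main()
{
    int width = 0, height = 0, channels = 0;

    // Decode the JPEG straight into an interleaved 8-bit RGB buffer
    // (3 bytes per pixel, row after row).
    unsigned char *pixels = stbi_load("Stonehenge.jpg", &width, &height, &channels, 3);
    if (!pixels)
    {
        fprintf(stderr, "could not load image: %s\n", stbi_failure_reason());
        return 1;
    }

    printf("%d x %d pixels, %d channels in file\n", width, height, channels);

    // ... hand 'pixels' (width * height * 3 bytes) to your OpenCL buffer/image ...

    stbi_image_free(pixels);
    return 0;
}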
Does this bmp image also need to be stored in an array in order to have a filter applied to it?
That actually depends on the kind of filter you want to apply. A simple threshold filter, for example, does not take a pixel's surroundings into account. If you were doing scanline signal processing (e.g. when processing old analogue television signals), you might need only a single row of pixels at a time.
The universal solution, of course, is to keep the whole image in memory, but some pictures are so huge that no average computer's RAM can hold them. There are image-processing libraries like VIPS that implement processing graphs which operate on small subregions of an image at a time and can be executed independently.
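To make the "no surroundings needed" point concrete, a per-pixel threshold over an interleaved 8-bit RGB buffer (like the one produced by the loader sketch above) is just a pass over the array; this is a trivial CPU sketch, and the same per-element idea ports directly to an OpenCL kernel:

#include <cstddef>

// Per-component threshold on an interleaved 8-bit RGB buffer: every byte is
// looked at in isolation, so no neighbouring pixels are ever needed.
void Threshold(unsigned char *pixels, size_t width, size_t height,
               unsigned char cutoff)
{
    for (size_t i = 0; i < width * height * 3; ++i)
        pixels[i] = (pixels[i] >= cutoff) ? 255 : 0;
}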
Is the bmp image pretty much split up into RGB values (0-255)? If someone can provide a link on this item that should help me understand this a lot better.
If you mean "pixel array" instead of BMP (remember, BMP is a specific data structure), then no: pixel component values may be of any scalar type and value range. In fact, there are colour spaces with value regions that are mathematically necessary but do not denote actually sensible colours.
When it comes down to pixel data, an image is just an n-dimensional array of scalar component tuples where each component's value lies in a given range. It doesn't get more specific than that. Only when you introduce colour spaces (RGB, CMYK, YUV, CIE-Lab, CIE-XYZ, etc.) do you give those values a specific colour meaning. And the choice of data type is more or less arbitrary: you can use 8 bits per component (0..255), 10 bits (0..1023) or floating point (0.0 .. 1.0); the choice is yours.

Which video encoding algorithm should I use for a video with just one static image and sound?

I'm doing video processing tasks and one of the problems I need to solve is choosing the appropriate encoding algorithm for a video that has just one static image throughout the entire video.
So far I have tried several codecs, such as DivX and XviD, but they produce a 3 MB file for a 1-minute video. The audio is 64 kbit/s MP3, so it takes just 480 KB, which means the video alone is 2.5 MB!
Since the image in the video never changes, it could be compressed very efficiently, as there is no motion. The image itself (a JPG) is just 50 KB.
So ideally I'd expect this video to be about 550KB - 600KB and not 3MB.
Any ideas about how I could optimize the video so it's not that huge?
I hope this is the right Stack Exchange site to ask this question.
Set the frame rate to be very low, lower than 1 fps if you can. Your goal is to get as close as possible to having just two keyframes (one at the start and one at the end).
Whether you can do this depends on the scheme/codec you are using, and also the encoder.
Many codecs will have keyframe-related options. For example, here are some open-source encoders:
lavc (libavcodec):
keyint=<0-300> - maximum interval between keyframes, in frames (default: 250, or one keyframe every ten seconds in a 25 fps movie; this is the recommended default for MPEG-4).
Most codecs require regular keyframes in order to limit the accumulation of mismatch error. Keyframes are also needed for seeking, as seeking is only possible to a keyframe - but keyframes need more space than other frames, so larger numbers here mean slightly smaller files but less precise seeking. 0 is equivalent to 1, which makes every frame a keyframe. Values >300 are not recommended, as the quality might be bad depending upon decoder, encoder and luck. It is common for MPEG-1/2 to use values <=30.
xvidenc:
max_key_interval= - maximum interval between keyframes (default: 10*fps)
Interestingly, this solution may reduce the ability to seek in the file, so you will want to test that.
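If you are not tied to DivX/XviD, the same idea expressed with ffmpeg and x264 might look roughly like this (file names are placeholders; -g raises the keyframe interval, -tune stillimage targets exactly this kind of content, and -shortest stops when the audio ends; depending on the container you may need to re-encode the audio instead of copying it):

ffmpeg -loop 1 -framerate 1 -i image.jpg -i audio.mp3 -c:v libx264 -tune stillimage -g 600 -c:a copy -shortest output.mp4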
I think this problem is related to the implementation of the video encoder, not the video encoding standard itself.
Most video encoder implementations are not designed for videos of a static image, so they will not produce the perfect bitstream we might imagine when such a video is fed in; they are designed for processing "natural" video.
If you really need a better encoding result for a static-image video, you could hack an open-source video encoder: from the 2nd frame on, mark all MBs (macroblocks) as "skip"...
