Why does RTP/RTSP meddle with my H.264 NALs? - c

I looked in the RFC and nothing there could explain why the following happens (though the decoder can still produce the original movie).
I transmitted the H.264/AVC NALs using the VSS H.264 encoder; the byte stream looked something like this: E5 46 0E 4F FF A0 23...
When I read the movie data on the receiver side, after the RTP broadcaster/RTSP receiver, I get extra unknown data, but always in the same places: 8 bytes are added before the start code prefix (0x00000001),
and 2 bytes are added after the start code prefix, so it looks something like this:
XX XX XX XX XX XX XX XX 00 00 00 01 XX XX. Then I looked in Wireshark and could see that RTP adds the bytes to the data payload.
Why does this happen? And why does the decoder seem to cope well with those extra bytes?

That's a messed-up stream... And you can mess it up even more and it will still work, because the decoder parses it for the 0x000001 start code, skipping the bytes that are added at the beginning. Those two new bytes after the start code must be H.264 fragmentation bytes, or something else H.264-related, since they work.
So basically, this is due to a defective packetizer/RTSP source filter. My guess is that if you ASCII-decode those 8 bytes you will get the vendor name of the RTSP source filter... xD

As I mentioned in another post, Changing NALU h.264/avc, for RTP encapsulation, H.264 is transmitted over RTP as defined in RFC 3984. This in particular defines how exactly large NAL units are broken into smaller parts that fit smaller message sizes, such as the UDP datagram size. That is, fragmentation.
The receiver depacketizes the data and restores the NAL units, using this extra information to do the job.
So what you essentially need is to compare the raw data you have against the RFC 3984 format. Wireshark already does part of this for you by dissecting the traffic into readable items.

Related

FFmpeg: what does av_parser_parse2 do?

When sending h264 data for frame decoding, it seems like a common method is to first call av_parser_parse2 from the libav library on the raw data.
I looked for documentation but couldn't find anything other than some example code. Does it group up packets of data so that the resulting data starts out with NAL headers and can be perceived as a frame?
The following is a link to a sample code that uses av_parser_parse2:
https://github.com/DJI-Mobile-SDK-Tutorials/Android-VideoStreamDecodingSample/blob/master/android-videostreamdecodingsample/jni/dji_video_jni.c
I would appreciate if anyone could explain those library details to me or link me resources for better understanding.
Thank you.
It is like you guessed: av_parser_parse2() for H.264 consumes input data, looks for the NAL start codes (0x000001), checks the NAL unit type looking for frame starts, and outputs the input data with a different framing.
That is, it consumes the input data, ignores its framing by putting all consecutive data into a big buffer, and then restores the framing from the H.264 byte stream alone, which is possible because of the start codes and the NAL unit types. It does not increase or decrease the amount of data given to it: if you get 30k out, you put 30k in. But maybe you fed it in little pieces of around 1500 bytes, the payload size of the network packets you received.
By the way, when a function declaration is not documented well, it is a good idea to look at the implementation.
Just recovering the framing would not be involved enough to call it parsing, but the H.264 parser in FFmpeg also gathers more information from the H.264 stream, e.g. whether it is interlaced, so it really deserves its name.
It does not, however, decode the image data of the H.264 stream.
DJI's video transmission does not guarantee the data in each packet belongs to a single video frame. Mostly a packet contains only part of the data needed for a single frame. It also does not guarantee that a packet contains data from one frame and not two consecutive frames.
Android's MediaCodec needs to be queued with buffers, each holding the full data for a single frame.
This is where av_parser_parse2() comes in. It gathers packets until it can find enough data for a full frame. This frame is then sent to MediaCodec for decoding.

Why is data sent across the network converted to network byte order?

I'm not sure how to use hton(). The theory is that any data sent over the network should be in network byte order (i.e. big-endian) format. Suppose client A is big-endian and B is little-endian. I'm sending multibyte data from A to B. Before it goes on the network, we convert the data to network byte order using htonl() and htons(). Since client A is already big-endian, htonl() and htons() return their input unchanged. But B is little-endian, so those functions reverse the byte order there. Given that, how can we say that adhering to a common format (i.e. big-endian) solves the problem when big- and little-endian machines need to communicate?
I'll try it the other way, showing the whole flow:
Sending 0x44332211 over the wire always happens as 44 33 22 11. The sender's htonl() ensures that, either by reverting the order of the bytes (on LE machines) or by just leaving them the way they are (on BE machines). The receiver turns the 44 33 22 11 into 0x44332211 with ntohl() - again, either by reverting them or leaving them.
The mentioned functions {hton,ntoh}{l,s}() help you program in a portable way: no matter whether the program runs on an LE or BE machine, they always work the way they should. Thus, even on BE machines the functions should be called, even if they are no-ops there.
Example:
A (BE) wants to send 0x44332211 to B (LE).
A has the number 0x44332211 in memory as 44 33 22 11.
A calls htonl() as the program has been written to be portable.
The number is still represented as 44 33 22 11 and sent over the wire.
B receives 44 33 22 11 and puts it through ntohl().
B gets a value laid out in memory as 11 22 33 44 from ntohl() and puts it into the respective variable - on B's little-endian architecture that layout means 0x44332211, as wanted.
Again, always calling these functions saves you from thinking about which kind of machine you are programming for - just program for all kinds of machines and call each of these functions where it is needed.
The same example can be expressed without knowing if A or B are BE or LE:
A has the number 0x44332211 in memory.
A calls htonl() so that the number is sent as 44 33 22 11 over the wire.
Whether this is done by reversing the bytes or leaving them alone is determined by the endianness of host A.
B receives 44 33 22 11 and puts it through ntohl(). This one reverses it or not, depending on the endianness of host B.
B gets the value 0x44332211 as wanted.
I think you're assuming that client B seeing the bytes in "reversed order" means they're wrong. The bytes will be in reverse order compared to client A, but that's because client A interprets integers backwards from client B; both will still interpret them as the same number in the end. For example, one machine would represent the number 4 as 00 00 00 04, the other as 04 00 00 00, but both would still see it as a 4 - if you add 1 to it you get 00 00 00 05 and 05 00 00 00, respectively. The hton/ntoh functions exist because there's no way to look at a number and know whether it's big- or little-endian, so without an agreed order the receiver can't be sure how to interpret the bytes.

Identifying Frame boundaries in RTP stream

I have few doubts regarding frame boundaries in RTP packets.
First, if the marker bit is set, does it mean that a new frame has begun (this is what I understand from RFC 3551)?
Second, from what I have read, a sequence starts with an I-frame followed by P and B frames. Which field indicates this? And does the I-frame have the marker bit set?
Third, if I need to find the start and end of a frame, would checking the marker bit suffice?
Thanks!
The RTP entry on the Wireshark Wiki provides a lot of information, including (edit) sample captures. You could explore it, and it might answer some of your questions. If you're going to write code to work with RTP, Wireshark is useful for monitoring/debugging.
Edit For your first question about Marker bit, this FAQ might help. Also, finding the frames (I, P, B) depends on the payload. There's another question here that has an answer showing how I, P, B are found for MPEG. The h263-over-rtp.pcap has examples with I and P frames for H.263.
This is an old question, but I think it is a good one.
As you mention I, P and B frames, in 2012 you are likely referring to H.264 over RTP.
According to RFC 6184, the marker bit is set on the last packet of a frame, so the marker bit can indeed be used as an indicator of the end of one frame; the next packet in sequence will then be the start of the next frame.
According to the same RFC, all packets of a frame also carry the same RTP timestamp, so a change in timestamp is another indicator that the previous frame has ended and a new one has started.
Things get trickier when you lose packets. For example, say you lose packets 5 and 6, and these were the last packet of frame 1 and the first packet of frame 2. You know to discard frame 1 because you never got a packet with a marker bit for that frame, but how can you know whether frame 2 is whole? Maybe both lost packets were part of frame 1, or maybe the second one was part of frame 2.
RFC 6184 defines a start bit that is present in the first packet of a fragmented NAL unit. If the NAL unit is not fragmented then, by definition, we got the whole NAL unit if we got the packet. This means we can know whether we received a full NAL unit. Unfortunately, this does not guarantee we have the full frame, since a frame can contain multiple NAL units (e.g. multiple slices) and we may have lost the first one. I don't have a solution for this problem, but maybe someone will provide one sometime in the next 10 years.

What is the difference between byte stream and packetized stream in gstreamer rtsp h264 depayloader

In the gstreamer rtp h264 depayloader, there is a check to see if the incoming stream is a byte stream or packetized stream.
Can anybody tell me what is the difference between these two formats?
Also, for the bytestream, the codec_data does not get written to the caps. Any idea why this would be?
H.264 (NAL) Byte Stream
It is used mainly to be sent directly to a decoder on a single PC, not to be transmitted over a network. It has simple format rules:
Each NAL unit starts with the same 3-byte start code, 0x000001.
The byte stream must start with a Sequence Parameter Set NAL unit, followed by a Picture Parameter Set NAL unit; then the other frames (I, P, B) can follow.
All frames in it are whole frames - if an IDR frame is 10 MB in size, it will be 10 MB in size from its 0x000001 start code to the next frame's 0x000001 start code.
H.264 Packetized Stream
It is used to transmit H.264 over a network with a limited MTU (Maximum Transmission Unit), the largest amount of data that can be sent in one packet - usually around 1500 bytes. So if you want to send a 10 MB IDR frame over the network, you have to break it apart so the parts fit the MTU. An H.264 stream adapted in this way is called a packetized stream.
In order to decode this stream, you must reconstruct whole frames on the receiving side, and you usually then want to build an H.264 NAL byte stream out of it, so you can send it to a decoder...
Rules of packetization can be found here: http://www.rfc-editor.org/rfc/rfc3984.txt

Type of socket address from recvfrom() with AF_PACKET / PF_PACKET

On two PC, I am opening an AF_PACKET / PF_PACKET socket, raw protocol.
sock = socket(AF_PACKET, SOCK_RAW, htons(PROTO_BULK))
(edit: PROTO_BULK is a dummy protocol number I created myself for this test. I do not expect it to be interfering with this problem, but I may be wrong.)
The first PC sends a packet to the other in the standard send() way, which is received on the other side with:
recvfrom(sock, buffer, 1600, 0, (struct sockaddr*) &from, &fromlen);
My problem is now: what is the type of the data in "from"? Here is an example of what I am receiving:
00 00 00 00 00 00 00 00 f8 cd a1 00 58 1d a1 00 94 60
From the length and the content, it does not look like a "struct sockaddr" nor a "sockaddr_ll". Any idea?
I am actually looking for the interface the packet was received on. I could use the destination MAC from the packet and identify the interface from it, but it would not work with a broadcast packet.
I am running a recent Linux kernel (and I do not want to investigate in the kernel sources!).
Edit: My code was wrong. I did not initialize "fromlen" with the size of the "from" buffer, so I suppose some spurious value was there and "recvfrom()" could not do its job correctly. Thanks to Ben Voigt for pointing out this bug in his comment below! Of course, I did not include this part of the faulty code in my question, as it was obvious there could be no mistake in such simple code...
With the correct parameter, I get a "struct sockaddr_ll" correctly filled, including the interface number I was looking for.
I am actually looking for the interface the packet was received on.
You should be using recvmsg, which gives access to metadata such as the packet's destination address (useful for multihomed systems) and I think also the interface.
The sender address you're looking at now isn't going to tell you the interface. It's good for sending a reply packet with sendto though.
I took a glance at the kernel sources. I do not claim to fully understand them, but...
Is PROTO_BULK 37984 (0x9460)?
If so, this is probably a "struct sockaddr_pkt". (It has the right size, anyway.) Although I am not sure what the other bytes mean. If I am reading the code correctly, they should be the "name" of the interface on which the packet was received. So I am probably reading the code incorrectly.

Resources