C - How to determine the amount of bytes for JSON messages - c

I am working on a Linux-based project consisting of a "core" application, written in C, and a web server, probably written in Python. The core and web server must be able to communicate with each other over TCP/IP. My focus is on the core application, in C.
Because of the different programming languages used for the core and web server, I am looking for a message protocol which is easy to use in both languages. Currently I think JSON is a good candidate. My question, however, is not so much about the message protocol, but about how I would determine the amount of bytes to read from (and maybe send to) the socket, specifically when using a message protocol like JSON, or XML.
As I understand it, whether you use JSON, XML, or some other message protocol, you cannot include the size of the message in the message itself, because in order to parse the message, you would need the entire message and therefore need to know the size of it in advance. Note that by "message" I mean the data formatted according to the used message protocol.
I've been thinking and reading about the solution to this, and have come to the following two possibilities:
Determine the largest possible size of a message, say 500 bytes, and based on that determine the buffer size, say 512 bytes, and add padding to each message so that 512 bytes are sent;
Prepend each message with its size in "plain text". If the size is stored in an Int (4 bytes), then the receiver first reads 4 bytes from the socket and using those 4 bytes, determines how many bytes to read next for the actual message;
Because all of the offered solutions I've read weren't specifically for the use of some message protocol, like JSON, I think it's possible that maybe I am missing out on something.
So, which of the two possibilities I offered is the best, or, am I not aware of some other solution to this problem?
Kind regards.

This is a classic problem encountered with streams, including those of TCP, often called the "message boundary problem." You can search around for more detailed answers than what I can give here.
To determine boundaries, you have some options:
Fixed length with padding like you said. Unless you have very small messages, not adviseable.
Prepend with size like you said. If you want to get fancy and support large messages without wasting too many bytes, you can use a variable length quantity, where you use a bit to determine whether to read more bytes for the size. #alnitak mentioned a drawback in the comments I neglected, which is that you can't start sending until you know the size.
Bound with some byte you don't use anywhere else (JSON and XML are text-only, so '\0' works with ASCII or any UTF). Simple but slower on the receiving end because you have to scan every byte this way.
Edit: JSON, XML, and many other formats can also be parsed on-the-fly to determine boundaries (e.g. each { must be closed with } in JSON), but I don't see any advantage to doing this.
If this isn't just a learning experience, you can instead use an existing protocol to do this all for you. HTTP (inefficient) or gRPC (more efficient), for example.
Edits: I originally said something totally wrong about having to include a checksum to handle packet loss in spite of TCP... TCP won't advance until those packets are properly received, so that's not an issue. IDK what I was thinking.

Related

C convert char* to network byte order before transfer

I'm working on a project where I must send data to server and this client will run on different os's. I knwo the problem with endians on different machines, so I'm converting 'everything' (almost) to network byte order, using htonl and vice-versa on the other end.
Also, I know that for a single byte, I don't need to convert anything. But, what should I do for char*? ex:
send(sock,(char*)text,sizeof(text));
What's the best approach to solve this? Should I create an 'intermediate function' do intercept this 'send', then really send char-by-char of this char array? If so, do I need to convert every char to network byte order? I think no, since every char is only one byte.
Thinking of this, if I create this 'intermediate functions', I don't have to convert nothing more to network byte order, since this function will send char by char, thus don't need conversion of endians.
I any advice on this.
I am presuming from your question that the application layer protocol (more specifically everything above level 4) is under your design control. For single byte-wide (octet-wide in networking parlance) data there is no issue with endian ordering and you need do nothing special to accommodate that. Now if the character data is prepended with a length specifier that is, say 2 octets, then the ordering of the bytes must be treated consistently.
Going with network byte ordering (big-endian) will certainly fill the bill, but so would consistently using little-endian. Consistency of byte ordering on each end of a connection is the crucial issue.
If the protocol is not under your design control, then the protocol specification should offer guidance on the issue of byte ordering for multi-byte integers and you should follow that.

How can I portably send a C struct through a network socket?

Suppose I have a C struct defined as follows :
typedef struct servData {
char max_word[MAX_WORD];
char min_word[MAX_WORD];
int word_count ;
} servSendData ;
where 'MAX_WORD' could be any value.
Now if I have an instance of this structure :
servSendData myData ;
And if I populate this instance and then send it over the network, will there be any portability issues here considering that I want my server as well as the client to be running on either a 64-bit system or a 32-bit system.
I am going to send and receive data as follows :
//server side
strcpy(myData.max_word, "some large word") ;
strcpy(myData.min_word, "small") ;
myData.word_count=100 ;
send(sockFd, (char*)&myData, sizeof(myData);
//client side
recv(sockFd, (char*)&myData, sizeof(myData);
printf("large word is %s\n", myData.max_word) ;
printf("small word is %s\n", myData.min_word) ;
printf("total words is %d\n", myData.word_count) ;
Yes, there definitely will be portability issues.
Alignment of structure members can be different even among different compilers on the same platform, let alone different platforms. And that's all assuming that sizeof(int) is the same across all of them (though granted, it usually is --- but do you really want to rely on "usually" and hope for the best?).
This holds even if MAX_WORD is the same on both computers (I'll assume they are from here on out; if they're not, then you're in trouble here).
What you need to do is send (and receive) each field separately. There is also a problem with sizeof(int) and endianness, so I've added a call to htonl() to convert from system to network byte order (the inverse function is ntohl()). They both return uint32_t which has a fixed, known, size.
send(sockFd, myData.max_word, sizeof(myData.max_word)); // or just MAX_WORD
send(sockFd, myData.min_word, sizeof(myData.min_word));
uint32_t count = htonl(myData.word_count); // convert to network byte order
send(sockFd, &count, sizeof(count));
// error handling!
if((ret = recv(sockFd, myData.max_word, sizeof(myData.max_word))) != sizeof(myData.max_word))
{
// handle error or read more data
}
... // and so on
// remember to convert back from network byte order on recv!
// also keep in mind the third field is now `uint32_t`, and not `int` in the stream
As other relies have stated there are real problems in copying a C structure between different machines with different compilers/word size/and endian structure. One common way to resolve this issue is to transform your data into a machine independent format, transfer it across the network and then transform it back into a structure on the receiver. This is such a common requirement that multiple technologies already exist to do this - the two that spring to my mind initially are gsoap and rpcgen although there are probably many other options.
I've mostly used gsoap and after you get past the initial learning curve you can develop robust solutions that scale well (with multiple threads) and which handles both the networking and data translations for you.
If you don't want to go down this route then the safest approach is to write routines that convert your data to/from a standard string format (if you have issues with Unicode you'll need to take that into account as well) and then send that across the network.
You have to take care about the endians.
May you should use hton() or ntoh() functions, to convert between little and big endian.
You can use structure packing. With most C compilers, you can enforce a specific structure alignment. It is sometimes used for what you need it - to transfer a struct over a network.
Note that this still leaves endianness issues, so this is not a universal solution.
If you are not writing embedded software, sending data between applications without serializing properly is rarely a good idea.
The same goes for using raw sockets, which is not very convenient, and feels a bit like "reinventing the wheel".
Many libraries can help you with both! Of course, you don't have to use them, but reading their documentation, and understanding how they work will help you make better choices. Things you have not yet planned can come out of the box (like, what happens when you want to update your system, and the message format changes?)
For serialization, have a read on those general purpose formats:
Human readable: JSON, XML, YAML, others...
Binary: Protobuf, TPL, Avro, BSON, MessagePack, and many others
For socket abstraction, look up
Boost ASIO
ZeroMQ
nanomsg
Many others

How can I do printf style debugging over a slow CAN bus - with constant strings on the remote tool, not the embedded system

Currently, on my embedded system (coded in C), I have a lot of debug-assistive print statements which are executed when a remote tool is hooked up to the system that can display the messages to a PC. These help to understand the general system status, but as the messages are going over a slow CAN bus, I believe they may be clogging the tubes and causing other problems with trying to get any useful data logged.
The basic gist of it is this:
It's like a printf, but ultimately in a special message format that gets sent from the embedded system to the tool over the CAN bus. For this purpose, I can replace this generic print message with special debugging messages that send it a unique ID followed by only the variable parameters (i.e. the argc/argv). I am wondering if that is the right way to go about it, or if there is a magic bullet I've missed, or something else I haven't thought of.
So, I found this question which starts out well for my purposes:
printf() debugging library using string table "decoder ring"
But I do not have the constraint of making it as easy as a printf. I can maintain a table of strings on the remote tool's side (since it's a Windows executable and therefore not as code size limited). I am the sole person responsible for this code and would prefer to try and lighten up the code size as well as the CAN bus traffic while debugging.
My current thoughts are thus:
printf("[%d] User command: answer call\n", (int)job);
This becomes
debug(dbgUSER_COMMAND_ANSWER_CALL, job);
dbgUSER_COMMAND_ANSWER_CALL being part of an enum of possible debug messages
and the remote side has something like
switch(messagetype)
{
case dbgUSER_COMMAND_ANSWER_CALL:
/* retrieve an integer from the data portion of the message and put it into a variable */
printf("[%d] User command: answer call\n", (int)avariable);
}
That's relatively straightforward and it would be fantastic if all my messages came in that same format. Where it gets tricky, though, is where some of my statements have to print strings which are not constant (the name of the device, for example).
printf("[%d] %02X:%02X:%02X:%02X:%02X:%02X (%s)\n", /* a bunch of parameters here */);
So, should I make it so that the contents of the debug message are 1) the message type, 2) length of the first parameter, 3) the parameter, 4) length of the next parameter, 5) the parameter, so on and so forth
Or have I overlooked something more obvious or easy?
Thanks
I'm assuming you use CAN because that's the connection you already have to your device. You haven't really provided enough information about your diagnostic needs, but I can give an example of what we do where I work. We use a custom build tool to comb through our sources building up a string table. Our code uses something like:
log( LOGH(T0722T), //"Position"
LOG_DOT_HEX_VALUE(i),
LOG_TEXT(T0178T), //"Id"
LOG_DOT_VALUE(uniqueId % 10000),
0 );
This would record some data which could be decoded into:
<timestamp> H Position.00B Id.0235
We allow use of T0000T to have the tool lookup (or generate) a unique number for us. The tool builds up an enum using the TxxxxT numbers for the compiler, and a file containing the ordered list of strings. Every build generates a string table which matches the enum numbering. This system also ties into a database system used for generating internationalized strings, but that's not really relevant to your question.
Each element is a short (16 bits). We allow 12 bits for values and use the high 4 bits for type and flag info. Is the encoded data a string id; Is it a signal (High/Low) or just an event; Is it a value; Is it decimal, hexidecimal, base64url; concatenated (with the preceding item) or separate.
We record data in a large ring buffer for querying if needed. This allows the system to run without interference but the data can be extracted if a problem is noted. It is certainly possible to constantly send the data, and if that's desired I'd suggest limiting any one message to a single CAN payload (that's 8 bytes assuming you only use a single ID). The message I provided above would fit in one CAN message if properly encoded (assuming the receiving side created the timestamps).
We do have the ability to include arbitrary data (ASCII or Hexadecimal), but I never use it. It's usually a waste of precious space in the logs. Logging is always a balance in embedded environments.

send glib hashtable with MPI

i recently came across a problem with my parallel program. Each process has several glib hashtables that need to be exchanged with other processes, these hashtables may be quite large. What is the best approach to achieve that?
create derived datatype
use mpi pack and unpack
send key & value as arrays (problem, since amount of elements is not known at compile time)
I haven't used 1 & 2 before and don't even know if thats possible, that's why i am asking you guys..
Pack/unpack creates a copy of your data: if your maps are large, you'll want to avoid that. This also rules out your 3rd option.
You can indeed define a custom datatype, but it'll be a little tricky. See the end of this answer for an example (replacing "graph" with "map" and "node" with "pair" as you read). I suggest you read up on these topics to get a firm understanding of what you need to do.
That the number of elements is not known at compile time shouldn't be a real issue. You can just send a message containing the payload size before sending the map contents. This will let the receiving process allocate just enough memory for the receive buffer.
You may also want to consider simply printing the contents of your maps to files, and then having the processes read each others' ouput. This is much more straightforward, but also less elegant and much slower than message passing.

Sample read/write handling of packets in C

I'm a bit new to C, but I've done my homework (some tutorials, books, etc.) and I need to program a simple server to handle requests from clients and interact with a db. I've gone through Beej's Guide to Network programming, but I'm a bit unsure how to piece together and handle different parts of the data getting sent back and forth.
For instance, say the client is sending some information that the server will put in multiple fields. How do I piece together that data to be sent and then break it back up on the server side?
Thanks,
Eric
If I understand correctly, you're asking, "how does the server understand the information the client sends it"?
If that's what you're asking, the answer is simple: it's mutually agreed upon ahead of time that the data structures each uses will be compatible. I.e. you decide upon what your communication protocol will be ahead of time.
So, for example, if I have a client-server application where the client connects and can ask for things such as "time", "date" and can say "settime " and "setdate ", I need to write my server in such a way that it will understand those commands.
Obviously, in the above case it's trivial, since it'd just be a text-based protocol. But let's say you're writing an application that will return a struct of information, i.e.
struct Person {
char* name;
int age;
int heightInInches;
// ... other fields ...
};
You might write the entire struct out from the server/client. In this case there are a few things to be aware of:
You need to hton/ntoh properly
You need to make sure that your client and server both can understand the struct in question.
You may or may not have to align on a 4B boundary (because if you don't, different C compilers may do different things, which may burn you between the client and the server, or it may not).
In general, though, when writing a client/server app, the most important thing to get right is the communication protocol.
I'm not sure if this quite answers your question, though. Is this what you were after, or were you asking more about how, exactly, do you use the send/recv functions?
First, you define how the packet will look - what information will be in it. Make sure the definition is in an architecture-neutral format. That means that you specify it in a sequence that does not depend on whether the machine is big-endian or little-endian, for example, nor on whether you are compiling with 32-bit long or 64-bit long values. If the content is of variable length, make sure the definition contains the information needed to tell how long each part is - in particular, each variable length part should be preceded by a suitable count of its length.
When you need to package the data for transmission, you will take the raw (machine-specific) values and write them into a buffer (think 'character array') at the appropriate positions, in the appropriate format.
This buffer will be sent across the wire to the receiver, which will read it into another buffer, and then reverse the process to obtain the information from the buffer into local variables.
There are functions such as ntohs() to convert from a network ('n') to host ('h') format for a 'short' (meaning 16-bit) integer, and htonl() to convert from a host 'long' (32-bit integer) to network format - etc.
One good book for networking is Stevens' "UNIX Network Programming, Vol 1, 3rd Edn". You can find out more about it at its web site, including example code.
As already mentioned above what you need is a previously agreed means of communication. One thing that helps me is to use xmls to communicate.
e.g. You need time to send time to client then include it in a tag called time.
Then parse it on the client side and read the tag value.
The biggest advantage is that once you have a parser in place on client side then even if you have to send some new information them just have to agree on a tag name that will be parsed on the client side.
It helps me , I hope it helps you too.

Resources