I'm a Linux programmer who was recently involved in porting an epoll-based client with two file descriptors, written in C, to Windows.
As you know, in Linux, using epoll or select (I know Windows supports select, but it's not efficient at all) you can block on file descriptors until one is ready,
and you know when it is ready for writing and when for reading.
I have taken a look at Windows IOCP, and it seems fine for overlapped I/O in the Microsoft world.
But in all the samples, it is used for a multi-client server in which each client's socket is independent of the other sockets.
Using completion ports, this can be done by creating a completionKey structure for each client, putting a variable in the struct that is set to read when invoking
WSARecv and to write when invoking WSASend, with another variable holding the socket, and then retrieving them from GetQueuedCompletionStatus to know what to do: if a write is done for a socket, do a read, and vice versa.
But in my case, the file descriptors (fds) really interact: reading from one fd triggers a read from and a write to the other fd, which makes it hard to know what
operation actually happened for each fd from the GetQueuedCompletionStatus result, because there is only one completionKey associated with each fd. To be clear, please consider this:
There are two handles, called fd1 and fd2; completionKey1 holds the handle and status for fd1, completionKey2 does the same for fd2, and the completionKey variable is used to
retrieve the completion from GetQueuedCompletionStatus.
DWORD bytes;
GetQueuedCompletionStatus(port_handle, &bytes, (PULONG_PTR)&completionKey, (LPOVERLAPPED *)&ovl, INFINITE);
completionKey->bufflen = bytes;

switch (completionKey->status)
{
case READ:
    if (completionKey->handle == fd1)
    {
        fd1_read_is_done(completionKey->buffer, completionKey->bufflen);
        completionKey->status = WRITE;
        do_fd1_write(completionKey);
        completionKey2->status = WRITE;
        completionKey2->buffer = "somedata";
        do_fd2_write(completionKey2);
    }
    else if (completionKey->handle == fd2)
    {
        fd2_read_is_done(completionKey->buffer, completionKey->bufflen);
        completionKey->status = WRITE;
        do_fd2_write(completionKey);
        completionKey1->status = WRITE;
        completionKey1->buffer = "somedata";
        do_fd1_write(completionKey1);
    }
    break;
case WRITE:
    if (completionKey->handle == fd1)
    {
        fd1_write_is_done(completionKey->bufflen);
        completionKey->status = READ;
        do_fd1_read(completionKey);
        completionKey2->status = READ;
        do_fd2_read(completionKey2);
    }
    else if (completionKey->handle == fd2)
    {
        fd2_write_is_done(completionKey->bufflen);
        completionKey->status = READ;
        do_fd2_read(completionKey);
        completionKey1->status = READ;
        do_fd1_read(completionKey1);
    }
    break;
}
In the above code, a situation arises where some of the altered completionKeys override pending reads or writes, so the resulting
completionKey->status is wrong (it reports read instead of write, for instance), and worse, the buffer gets overwritten. If I use locking on the
completionKeys, it leads to deadlock situations.
After looking at WSASend and WSARecv, I noticed there is an overlapped parameter that can be set for every send or receive,
but that leads to two major problems. According to the WSAOVERLAPPED structure:
typedef struct _WSAOVERLAPPED {
ULONG_PTR Internal;
ULONG_PTR InternalHigh;
union {
struct {
DWORD Offset;
DWORD OffsetHigh;
};
PVOID Pointer;
};
HANDLE hEvent;
} WSAOVERLAPPED, *LPWSAOVERLAPPED;
First, there is no place to put a status and an appropriate buffer in it, and most of its fields are reserved.
Second, even if I could find a workaround for the first problem, I would need to check whether any overlapped structure is left available; if all of them are used in pending operations,
I'd have to allocate a new one for every read and write. Since the client is going to be very busy, that could happen a lot, and besides, managing those overlapped pools is a headache.
So am I missing something, or has Microsoft screwed this up?
And since I don't need multithreading, is there another way to solve my problem?
Thanks in advance.
Edit:
As I guessed, the first problem I mentioned with using the overlapped struct has an answer: I just need to create another struct holding the buffers, status, etc., and put OVERLAPPED as its first field.
Now solve the others for me ;)
You're really asking two different questions here. I can't answer the first one, as I've never used IO completion ports, but from everything I've read they're best avoided by everyone but experts. (I will point out an obvious solution to the problem I think you're describing: rather than actually writing the data to the other socket while another write is still pending, put the data in a queue and write it later. You still have to deal with two simultaneous operations on a given socket - one read and one write - but that shouldn't be a problem.)
However, it's easy to use OVERLAPPED (or WSAOVERLAPPED) structures to track the status of overlapped requests. All you do is embed the OVERLAPPED structure as the first element in a larger structure:
typedef struct _MyOverlapped
{
    WSAOVERLAPPED overlapped;
    /* ... your data goes here ... */
} MyOverlapped, *LPMyOverlapped;
then cast the LPWSAOVERLAPPED passed to the completion routine to LPMyOverlapped to access your context data.
Alternatively, if you are using a completion routine, the hEvent member of WSAOVERLAPPED is guaranteed to be unused, so you can set this to a pointer to a structure of your choice.
I don't see why you think that managing the pool of overlapped structures is going to be a problem. There's exactly one overlapped structure per active buffer, so every time you allocate a buffer, allocate a corresponding overlapped structure.
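For illustration, here is a minimal sketch of that pattern. The status/handle/buffer fields mirror the completionKey from the question; the field sizes and names are just assumptions:

typedef struct _MyOverlapped
{
    WSAOVERLAPPED overlapped;   /* must be the first member */
    int           status;       /* READ or WRITE, set before each call */
    SOCKET        handle;       /* which socket the operation belongs to */
    char          buffer[4096];
    DWORD         bufflen;
} MyOverlapped, *LPMyOverlapped;

DWORD bytes;
ULONG_PTR key;
LPOVERLAPPED ovl;

if (GetQueuedCompletionStatus(port_handle, &bytes, &key, &ovl, INFINITE))
{
    /* Recover the enclosing structure from the OVERLAPPED pointer. A plain
       cast works because overlapped is the first member; CONTAINING_RECORD
       makes the intent explicit and doesn't rely on it being first. */
    LPMyOverlapped ctx = CONTAINING_RECORD(ovl, MyOverlapped, overlapped);

    ctx->bufflen = bytes;
    if (ctx->status == READ)
    {
        /* bytes of received data are in ctx->buffer */
    }
}

With one such structure per outstanding operation, the per-fd aliasing problem from the question disappears: a pending read and a pending write on the same socket each carry their own context.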
Related
If I am using select() to monitor three file descriptor sets:
if (select(fdmax+1, &read_fds, &write_fds, &except_fds, NULL) == -1) {
perror("select()");
exit(1);
} else {
...
}
Can a particular file descriptor be ready for reading AND writing AND exception handling simultaneously?
Beej's popular networking page shows a select() example in which he tests the members of the read fd_set using a for loop. Since the loop increments by one each iteration, it will necessarily test some integers that don't happen to be existing file descriptors:
for (i = 0; i <= fdmax; i++) {
    if (FD_ISSET(i, &read_fds)) { // we got one!!
        ...
    }
}
I believe he's doing this for the sake of keeping the example code simple. Might/should one only test existing file descriptors?
Expanding a little, with examples, on #user207421's comment:
1. Can a particular file descriptor be ready for reading AND writing AND exception handling simultaneously?
A good example is a socket: it will (almost) always be ready for writing, and it will be ready for reading whenever data is available. Exceptions are not common; they are for exceptional situations, such as the availability of an out-of-band message on a TCP connection, and most applications do not use those features.
Note that 'normal' errors will be indicated in readfds (for example, socket shutdown).
See also: *nix select and exceptfds/errorfds semantics
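To illustrate, a minimal sketch (assuming sockfd is a connected TCP socket) that tests the same descriptor in all three sets at once:

fd_set read_fds, write_fds, except_fds;

FD_ZERO(&read_fds);
FD_ZERO(&write_fds);
FD_ZERO(&except_fds);
FD_SET(sockfd, &read_fds);
FD_SET(sockfd, &write_fds);
FD_SET(sockfd, &except_fds);

if (select(sockfd + 1, &read_fds, &write_fds, &except_fds, NULL) > 0) {
    /* All three can be true at once for the same descriptor. */
    if (FD_ISSET(sockfd, &read_fds))   { /* data (or EOF/error) to read */ }
    if (FD_ISSET(sockfd, &write_fds))  { /* send buffer has room        */ }
    if (FD_ISSET(sockfd, &except_fds)) { /* e.g. TCP out-of-band data   */ }
}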
2. Beej's popular networking page shows a select() example in which he tests the members of the read fd_set using a for loop. Since the loop increments by one each iteration, it will necessarily test some integers that don't happen to be existing file descriptors.
I believe that in this case it is done to simplify the code examples, and it is a reasonable approach for most lightweight implementations. It works well as long as the number of non-listen connections is very small.
Worth mentioning: fd_set is implemented on Linux as a set of bits, but on Windows (Winsock) as an array of fd values. A full scan over all fds is therefore O(n) on Linux but O(n*n) on Windows, which can be a big performance hit for large n in a Windows app.
In large-scale applications, where a server listens on hundreds (or more) of open connections, each requiring different actions, potentially with multiple states, the common practice is to keep a list of active connections and use callbacks to invoke the handling functions. This is usually implemented with an 'event loop'. Examples include X11, RPC servers, etc.
See Also: https://en.wikipedia.org/wiki/Event_loop
Your question: why would you use select() when you only have one socket?
Use select() when you do not want it to block other processing: make use of the timeout parameter.
That way, even with only one file descriptor open, the program will not block forever because that one file descriptor has no data, as it would if you used read() or a similar function.
This is a very good method when, for instance, listening to a serial port, which only has data when some external event occurs.
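A minimal sketch of that approach, assuming fd is the open serial-port (or socket) descriptor:

fd_set readfds;
struct timeval tv;

for (;;) {
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec  = 1;          /* wait at most one second; select() may   */
    tv.tv_usec = 0;          /* modify tv, so reset it every iteration  */

    int rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (rc > 0 && FD_ISSET(fd, &readfds)) {
        /* read() will not block now */
    } else if (rc == 0) {
        /* timeout: do other processing, then poll again */
    }
}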
I've gotten ideas for multiple projects recently that all involve reading IP addresses from a file. Since they are all supposed to be able to handle a large number of hosts, I've attempted to implement multi-threading, or to create a pool of sockets and select() on them, in order to achieve some form of concurrency for better performance. On multiple occasions, reading from the file seems to be the bottleneck to enhancing performance. The way I understand it, reading from a file with fgets or similar is a synchronous, blocking operation. So even if I successfully implemented a client that connects to multiple hosts asynchronously, the operation would still be synchronous, because I can only read one address at a time from the file.
/* partially pseudo code */
/* getaddrinfo() stuff here */
while (fgets(ip, sizeof(ip), file)) {
    FD_ZERO(&readfds);
    /* create n sockets here in a for loop */
    for (i = 0; i < socket_num; i++) {
        if (fd[i] > fdmax) fdmax = fd[i];
        FD_SET(fd[i], &readfds);
    }
    /* here's where I think I should connect n sockets to n addresses from file
     * but I'm only getting one IP at a time from file, so I'm not sure how to connect to
     * n addresses at once with fgets
     */
    for (j = 0; j < socket_num; j++) {
        if (connect(socket, ai->ai_addr, ai->ai_addrlen) == -1) {
            /* error */
        } else {
            freeaddrinfo(ai);
            FD_SET(socket, &master);
            fdmax = socket;
            if (select(socket+1, &master, NULL, NULL, &tv) == -1)
                /* error */;
            if ((recvd = read(socket, banner, RECVD)) <= 0)
                /* error */;
            if (FD_ISSET(socket, &master))
                /* print success */;
        }
    }
    /* clear sets and close sockets and stuff */
}
I've pointed out my issues with comments, but just to clarify: I'm not sure how to perform asynchronous I/O operations on multiple target servers read from a file, since reading entries from a file seems to be strictly synchronous. I've run into similar issues with multithreading, with a marginally better degree of success.
void *function_passed_to_pthread_create(void *opts)
{
    while (fgets(ip_addr, sizeof(ip_addr), opts->file)) {
        /* speak to ip_addr and get response */
    }
}
main()
{
/* necessary stuff */
for (i = 0; i < thread_num; i++) {
pthread_create(&tasks, NULL, above_function, opts);
}
for (j = 0; j < thread_num; j++)
/* join threads */
return 0;
}
This seems to work, but since multiple threads are all processing the same file, the results aren't always accurate. I imagine it's because multiple threads may process the same address from the file at the same time.
I've considered loading all the entries from the file into an array/into memory, but if the file were particularly large I imagine that could cause memory issues. On top of that, I'm not sure it even makes sense to do anyway.
As a final note: if the file I'm reading from happens to be particularly large, with a huge number of IPs, then I do not believe either solution scales well. Anything is possible with C though, so I imagine there is some way to achieve what I'm hoping for.
To sum this post up: I'd like to find a way to improve a client-side application's performance using asynchronous I/O or multi-threading when reading entries from a file.
Several people have hinted at a good solution to this in their comments, but it's probably worth spelling it out in more detail. The full solution has quite a lot of details and is pretty complicated code, so I'm going to use pseudocode to explain what I'd recommend.
What you have here is really a variation on a classic producer/consumer problem: You have a single thing producing data, and many things trying to consume that data. In your case, it must be a "single thing" producing that data, because the lengths of each line of the source file are unknown: You can't just jump forward 'n' bytes and somehow be at the next IP. There can only be one actor at a time moving the read pointer toward the next unknown position of the \n, so you by definition have a single producer.
There are three general ways to attack this:
Solution A involves having each thread pulling a little more out of a shared file buffer, and kicking off an asynchronous (nonblocking) read every time the last read completes. There are a whole host of headaches getting this solution right, as it's very sensitive to timing differences between the filesystem and the work being performed: If the file reads are slow, the workers will all stall waiting for the file. If the workers are slow, the file reader will either stall or fill up memory waiting for them to consume the data. This solution is likely the absolute fastest, but it's also incredibly difficult synchronization code to get right with about a zillion caveats. Unless you're an expert in threading (or extremely clever abuse of epoll_wait()), you probably don't want to go this route.
Solution B has a "master" thread, responsible for reading the file, and populating some kind of thread-safe queue with the data it reads, with one IP address (one string) per queue entry. Each of the worker threads just consumes queue entries as fast as it can, querying the remote server and then requesting another queue entry. This requires a little care to get right, but is generally a lot safer than Solution A, especially if you use somebody else's queue implementation.
Solution C is pretty hacktastic, but you shouldn't dismiss it out-of-hand, depending on what you're doing. This solution just involves using something like the Un*x sed command (see Get a range of lines from a file given the start and end line numbers) to slice your source file into a bunch of "chunky" source files in advance — say, twenty of them. Then you just run twenty copies of a really simple single-thread program in parallel using &, each on a different "slice" of file. Mushed together with a little shell script to automate it, this can be a "good enough" solution for a lot of needs.
Let's take a closer look at Solution B — a master thread with a thread-safe queue. I'm going to cheat and assume you can construct a working queue implementation (if not, there are StackOverflow articles on implementing a thread-safe queue using pthreads: pthread synchronized blocking queue).
In pseudocode, this solution is then something like this:
main()
{
/* Create a queue. */
queue = create_queue();
/* Kick off the master thread to read the file, and give it the queue. */
master_thread = pthread_create(master, queue);
/* Kick off a bunch of workers with access to the queue. */
for (i = 0; i < 20; i++) {
worker_thread[i] = pthread_create(worker, queue);
}
/* Wait for everybody to finish. */
pthread_join(master_thread);
for (i = 0; i < 20; i++) {
pthread_join(worker_thread[i]);
}
}
void master(queue q)
{
FILE *fp = fopen("ips.txt", "r");
char buffer[BIGGER_THAN_ANY_IP];
/* Inhale the file as fast as we can, and push each line we
read onto the queue. */
while (fgets(buffer, sizeof(buffer), fp) != NULL) {
char *next_ip = strdup(buffer);
enqueue(q, next_ip);
}
/* Add some final messages in the queue to let the workers
know that we're out of data. There are *much* better ways
of notifying them that we're "done", but in this case,
pushing a bunch of NULLs equal to the number of threads is
simple and probably good enough. */
for (i = 0; i < 20; i++) {
enqueue(q, NULL);
}
}
void worker(queue q)
{
char *ip;
/* Inhale messages off the queue as fast as we can until
we get a "NULL", which means that it's time to stop.
The call to dequeue() *must* block if there's nothing
in the queue; the call should only return NULL if the
queue actually had NULL pushed into it. */
while ((ip = dequeue(q)) != NULL) {
/* Insert code to actually do the work here. */
connect_and_send_and_receive_to(ip);
}
}
There are plenty of caveats and details in a real implementation (like: how do we implement the queue, ring buffers or a linked list? what if the text isn't all IPs? what if the char buffer isn't big enough? how many threads is enough? how do we deal with file or network errors? will malloc performance become a bottleneck? what if the queue gets too big? can we do better to overlap the network I/O?).
But, caveats and details aside, the pseudocode I presented above is a good enough starting point that you likely can expand it into a working solution.
Read IPs from a file, have worker threads, and keep handing IPs to the worker threads; let all socket communication happen in the worker threads. Also, if the IPv4 addresses are stored in a fixed-width hex format instead of ASCII text, you can probably read several of them in a single shot, and it would be faster.
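As a sketch of that last point, assuming a hypothetical variant in which each IPv4 address is stored as a fixed-size 4-byte binary value (the file name ips.bin and BATCH are made up), a whole batch can be read per call:

#include <stdio.h>
#include <stdint.h>

#define BATCH 1024           /* addresses per read; tune as needed */

uint32_t addrs[BATCH];
size_t n;
FILE *f = fopen("ips.bin", "rb");

while ((n = fread(addrs, sizeof(addrs[0]), BATCH, f)) > 0) {
    for (size_t i = 0; i < n; i++) {
        /* hand addrs[i] to a worker thread */
    }
}
fclose(f);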
If you just want to read asynchronously you can use getch() from ncurses with the delay set to 0. It is part of POSIX, so you don't need any additional dependencies. You also have unlocked_stdio.
On the other hand, I have to wonder why fgets() is a bottleneck. As long as there is data in the file it should not block. And even if the data is huge (like 1 MB or 100k IP addresses), reading it into a list at startup should take less than a second.
And why are you opening socket_num connections to every IP in the list? You end up with socket_num multiplied by the number of IP addresses open at the same time. Since every socket is a file on Linux, you will hit system limits when you try to open more than several thousand files (see ulimit -Sn). Can you confirm that the issue is not in connect() in that case?
If FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is set on a file handle that is bound to an I/O completion port, then an OVERLAPPED structure needs to be deallocated when its I/O completes synchronously.
Otherwise, it needs to stay alive until a worker processes the notification from an I/O completion port.
This all sounds good until you realize that this only works if you manage the file handle yourself.
But if someone else gives you the file handle, how are you supposed to know when you should free the OVERLAPPED structure? Is there any way to discover this after the fact?
Otherwise, does this basically imply you cannot correctly perform overlapped I/O on any file handle that you cannot guarantee the completion notification state of...?
I'm not sure that your scenario makes sense.
Your clarified scenario - successfully performing I/O on an arbitrary file handle, without even knowing whether it is asynchronous or not - is challenging, I think very unusual, and almost certainly not how the API was designed to be used, but perhaps (as you suggest) not entirely implausible.
(Although I don't think you can avoid requiring some cooperation between the caller and your code, because in the IOCP case, the caller has to be able to tell whose I/O a dequeued packet belongs to. You could do this by having the caller allocate the OVERLAPPED structures, as RbMm suggests, but it might be simpler to ask them for a completion key to use.)
I'm not certain offhand how Windows behaves if you provide a redundant event handle, e.g., when the I/O is actually synchronous or using IOCP. But I would guess that it isn't going to be a problem in practice, so provided you're not too worried about future-proofing, you're probably OK.
At any rate, it isn't all that difficult to deal with the particular issue your question asks about. Basically, you just need to prevent the structure from being released twice.
Before making each call, assign a unique completion key and add it to a linked list or other suitable global structure. (The structure must be capable of an atomic find-and-remove operation, or protected by a critical section or similar.)
If the call succeeds immediately, i.e., does not report that the I/O is pending, treat it exactly as if a queued packet were received from the IOCP queue. Typically, you would either use a common function that is called by both your IOCP thread and your I/O thread, or a call to PostQueuedCompletionStatus to manually insert a packet to the IOCP queue.
When a packet is received (or when the call succeeds immediately) first perform a find-and-remove for the completion key against the global structure. If the find fails, you know that you have already been notified of the success of the I/O, and don't need to do anything.
If the find-and-remove succeeds, process the I/O as appropriate and release the OVERLAPPED structure.
There are undoubtedly ways to optimize the same basic approach.
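A rough sketch of those steps, where ctx is an assumed per-request context embedding the OVERLAPPED, and pending_set_insert/pending_set_remove stand for an assumed structure with an atomic find-and-remove (all names are illustrative):

/* Called by the issuing thread on synchronous success *and* by the IOCP
   thread for each dequeued packet. */
void complete_io(ULONG_PTR key)
{
    struct request *ctx = pending_set_remove(key); /* atomic find-and-remove */
    if (ctx == NULL)
        return;                 /* already handled via the other path */
    /* ...process the result... */
    free(ctx);                  /* released exactly once */
}

/* Issuing a request: register the key before starting the I/O. */
ULONG_PTR key = next_unique_key();
pending_set_insert(key, ctx);
if (ReadFile(hFile, ctx->buffer, ctx->length, NULL, &ctx->overlapped)) {
    /* Completed synchronously: treat it exactly like a dequeued packet. */
    complete_io(key);
}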
Addendum: if the caller is processing the IOCP packets, and providing you with a completion key to use, you won't be able to use a unique completion key on each request. In this scenario, you can use the pointer to the OVERLAPPED structure instead.
The reason (in the general case) for not using the pointer is that you might receive a packet containing a completion key from one I/O request along with an OVERLAPPED structure from a different one, because the OVERLAPPED structure might be both released and reassigned before a duplicate notification is processed. That doesn't matter in this case, because all of your requests will use the same completion key anyway.
Addendum^2: if you don't know anything about the handle, you'll also need to provide an event object for each OVERLAPPED structure, and wait on them in case notification of the I/O completion arrives that way. It's getting too late in the day for me to try to figure out the exact consequences of that, but it may mean that under some circumstances you get three notifications for the same I/O operation. You might be able to avoid that, but if not, this approach will still work.
Is there any way to discover this after the fact?
Yes, there is: you need to use ZwQueryInformationFile with FileIoCompletionNotificationInformation.
FILE_IO_COMPLETION_NOTIFICATION_INFORMATION is defined in wdm.h,
so the code we need for the query is:
FILE_IO_COMPLETION_NOTIFICATION_INFORMATION ficni;
ZwQueryInformationFile(hFile, &iosb, &ficni, sizeof(ficni), FileIoCompletionNotificationInformation);
Demo code for setting and querying:
HANDLE hFile;
IO_STATUS_BLOCK iosb;
STATIC_OBJECT_ATTRIBUTES(oa, "\\systemroot\\notepad.exe");
if (0 <= ZwOpenFile(&hFile, FILE_GENERIC_READ, &oa, &iosb, FILE_SHARE_VALID_FLAGS, 0))
{
FILE_IO_COMPLETION_NOTIFICATION_INFORMATION ficni = { FILE_SKIP_COMPLETION_PORT_ON_SUCCESS };
if (0 <= ZwSetInformationFile(hFile, &iosb, &ficni, sizeof(ficni), FileIoCompletionNotificationInformation))
{
ficni.Flags = 0x12345678;
if (
0 > ZwQueryInformationFile(hFile, &iosb, &ficni, sizeof(ficni), FileIoCompletionNotificationInformation)
||
!(ficni.Flags & FILE_SKIP_COMPLETION_PORT_ON_SUCCESS)
)
{
__debugbreak();
}
}
ZwClose(hFile);
}
Also, let me copy and paste from wdm.h (so nobody can say this is "undocumented"):
//
// Don't queue an entry to an associated completion port if returning success
// synchronously.
//
#define FILE_SKIP_COMPLETION_PORT_ON_SUCCESS 0x1
//
// Don't set the file handle event on IO completion.
//
#define FILE_SKIP_SET_EVENT_ON_HANDLE 0x2
//
// Don't set user supplied event on successful fast-path IO completion.
//
#define FILE_SKIP_SET_USER_EVENT_ON_FAST_IO 0x4
typedef struct _FILE_IO_COMPLETION_NOTIFICATION_INFORMATION {
ULONG Flags;
} FILE_IO_COMPLETION_NOTIFICATION_INFORMATION, *PFILE_IO_COMPLETION_NOTIFICATION_INFORMATION;
I have a question, though: for what reason is this declared in wdm.h?
First of all, I've never worked with C before (mostly Java, which is the reason you'll find me writing some naive C code). I am writing a simple command interpreter in C. I have something like this:
//Initialization code
if (select(fdmax+1, &read_fds, NULL, NULL, NULL) == -1) {
perror("Select dead");
exit(EXIT_FAILURE);
}
....
....
//Loop through connections to see who has the data ready
//If the data is ready
if ((nbytes = recv(i, buf, sizeof(buf), 0)) > 0) {
//Do something with the message in the buffer
}
Now if I'm looking at something like a long paragraph of commands, it is obvious that a 256-byte buffer will not be able to hold the entire command. For the time being, I'm using a 2056-byte buffer to get the entire command. But if I want to use the 256-byte buffer, how would I go about doing this? Do I keep track of which client gave me what data and append it to some buffer? I mean, use something like two-dimensional arrays and such?
Yes, the usual approach is to have a buffer of "data I've received but not processed" for each client, large enough to hold the biggest protocol message.
You read into that buffer (always keeping track of how much data is currently in the buffer), and after each read, check to see if you have a complete message (or message(s), since you might get two at once!). If you do, you process the message, remove it from the buffer and shift any remaining data up to the start of the buffer.
Something roughly along the lines of:
for (i = 0; i < nclients; i++)
{
if (!FD_ISSET(client[i].fd, &read_fds))
continue;
nbytes = recv(client[i].fd, client[i].buf + client[i].bytes, sizeof(client[i].buf) - client[i].bytes, 0);
if (nbytes > 0)
{
client[i].bytes += nbytes;
while (check_for_message(client[i]))
{
size_t message_len;
message_len = process_message(client[i]);
client[i].bytes -= message_len;
memmove(client[i].buf, client[i].buf + message_len, client[i].bytes);
}
}
else
{
    /* Handle client close or error */
}
}
By the way, you should check for errno == EINTR if select() returns -1, and just loop around again - that's not a fatal error.
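For example (re-arming the fd_set on every iteration, since select() modifies it):

for (;;) {
    FD_ZERO(&read_fds);
    /* FD_SET each client fd here, recomputing fdmax */
    if (select(fdmax + 1, &read_fds, NULL, NULL, NULL) == -1) {
        if (errno == EINTR)
            continue;          /* interrupted by a signal: just retry */
        perror("select");      /* a real error */
        exit(EXIT_FAILURE);
    }
    /* ...handle ready descriptors as above... */
}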
I would keep a structure around for each client. Each structure contains a pointer to a buffer where the command is read in. Maybe you free the buffers when they're not used, or maybe you keep them around. The structure could also contain the client's fd in it as well. Then you just need one array (or list) of clients which you loop over.
The other reason you'd want to do this, besides the fact that 256 bytes might not be enough, is that recv doesn't always fill the buffer. Some of the data might still in transit over the network.
If you keep around buffers for each client, however, you can run into the "slowloris" attack, where a single client keeps sending little bits of data and takes up all your memory.
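A sketch of such a per-client record (field names are illustrative):

struct client {
    int     fd;         /* the client's socket */
    char   *buf;        /* partially received command, or NULL */
    size_t  len;        /* bytes currently in buf */
    size_t  cap;        /* allocated size of buf; grow on demand, but
                           enforce a hard cap so a slowloris-style client
                           can't consume unbounded memory */
};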
It can be a serious pain when you get tons of data like that over a network. There is a constant trade-off between allocating a huge array and doing multiple reads with data moves. You should consider using a ready-made linked list of buffers, then traversing the list as you read the buffer in each node. That way it scales gracefully and you can quickly delete what you've processed. I think that's the best approach; it's also how Boost.Asio implements buffered reads.
If you're dealing with multiple clients, a common approach is to fork/exec for each connection. Your server would listen for incoming connections, and when one is made it would fork and exec a child version of itself that would then handle the "command interpreter" portion of the problem.
This way you're letting the OS manage the client processes--that is, you don't have to have a data structure in your program to manage them. You will still need to clean up child processes in your server as they terminate.
As for managing the buffer...How much data do you expect before you post a response? You may need to be prepared to dynamically adjust the size of your buffer.
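A sketch of the plain-fork variant of that pattern (the exec step and most error handling are omitted; listenfd and handle_client are assumed):

for (;;) {
    int connfd = accept(listenfd, NULL, NULL);
    if (connfd == -1)
        continue;

    pid_t pid = fork();
    if (pid == 0) {            /* child: runs the command interpreter */
        close(listenfd);       /* the child doesn't accept connections */
        handle_client(connfd);
        close(connfd);
        _exit(0);
    }
    close(connfd);             /* parent keeps listening */

    /* reap finished children, e.g. with a SIGCHLD handler looping on
       waitpid(-1, NULL, WNOHANG) */
}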
I am writing an API which includes IPC functions which send data to another process which may be local or on another host. I'd really like the send function to be as simple as:
int mySendFunc(myDataThing_t* thing, int sd);
without the caller having to know -- in the immediate context of the mySendFunc() call -- whether sd leads to a local or remote process. It seems to me that I could do something like:
switch (socketFamily(sd)) {
case AF_UNIX:
case AF_LOCAL:
// Send without byteswapping
break;
default:
// Use htons() and htonl() on multi-byte values
break;
}
It has been suggested that I might implement socketFamily() as:
unsigned short socketFamily(int sd)
{
    struct sockaddr sa;
    socklen_t len = sizeof(sa);

    getsockname(sd, &sa, &len);
    return sa.sa_family;
}
But I'm a little concerned about the efficiency of getsockname() and wonder if I can afford to do it every time I send.
See getsockname(2). You then inspect the struct sockaddr for the family.
EDIT: As a side note, it's sometimes useful to consult the info documentation as well, in this case info libc sockets.
EDIT:
You really can't know without looking it up every time. It can't be simply cached, as the socket number can be reused by closing and reopening it. I just looked into the glibc code and it seems getsockname is simply a syscall, which could be nasty performance-wise.
But my suggestion is to use some sort of object-oriented concept: make the user pass a pointer to a struct you previously returned to them, i.e., have them register/open sockets with your API. Then you can cache whatever you want about that socket.
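A sketch of that idea (the type and function names are illustrative):

typedef struct {
    int            sd;
    unsigned short family;     /* AF_* value, cached once */
} my_socket_t;

my_socket_t *my_socket_register(int sd)
{
    struct sockaddr_storage ss;
    socklen_t len = sizeof(ss);
    my_socket_t *s = malloc(sizeof(*s));

    s->sd = sd;
    s->family = (getsockname(sd, (struct sockaddr *)&ss, &len) == 0)
                    ? ss.ss_family
                    : AF_UNSPEC;
    return s;
}

/* mySendFunc() would then take a my_socket_t * and test s->family,
   instead of calling getsockname() on every send. */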
Why not always send in network byte order?
If you control the client and server code I have a different suggestion, which I've used successfully in the past.
Have the first four bytes of your message be a known integer value. The receiver can then inspect the first four bytes to see if it matches the known value. If it matches, then no byte swapping is needed.
This saves you from having to do byte swapping when both machines have the same endianness.
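A sketch of that scheme (the magic value is arbitrary):

#define MAGIC 0x01020304u

/* Sender: write MAGIC, in native byte order, as the first four bytes. */
uint32_t header = MAGIC;

/* Receiver: read the first four bytes into header, then: */
if (header == MAGIC) {
    /* peer has the same byte order: no swapping needed */
} else if (header == 0x04030201u) {       /* MAGIC with bytes reversed */
    /* opposite byte order: swap all multi-byte fields */
} else {
    /* unknown peer or corrupted stream */
}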