How to get "valid data length" of a file? - c

There is a function to set the "valid data length" value, SetFileValidData, but I didn't find a way to get the current "valid data length" value.
I want to know, for a given file, whether the EOF differs from the VDL, because writing after the VDL when VDL < EOF causes a performance penalty, as described here.

I found this page, which claims that:
there is no mechanism to query the value of the VDL
So the answer is "you can't".
If you care about performance, you can set the VDL to the EOF, but note that doing so may expose old garbage on the disk: the region between the two pointers, which would have read back as zeros had you accessed the file without moving the VDL up to the EOF.
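A minimal sketch of that approach, assuming the handle was opened with GENERIC_WRITE and the account actually holds the SE_MANAGE_VOLUME_NAME privilege (error handling trimmed to the essentials):

#include <windows.h>

/* Enable SE_MANAGE_VOLUME_NAME for the current process;
 * required before SetFileValidData() may be called. */
static BOOL enable_manage_volume_privilege(void)
{
    HANDLE token;
    TOKEN_PRIVILEGES tp;

    if (!OpenProcessToken(GetCurrentProcess(),
                          TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token))
        return FALSE;
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    if (!LookupPrivilegeValue(NULL, SE_MANAGE_VOLUME_NAME,
                              &tp.Privileges[0].Luid)) {
        CloseHandle(token);
        return FALSE;
    }
    /* AdjustTokenPrivileges "succeeds" even if the privilege was not
     * granted to the account, so check GetLastError() afterwards. */
    AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL);
    CloseHandle(token);
    return GetLastError() == ERROR_SUCCESS;
}

/* Push the VDL up to the current EOF. Everything between the old VDL
 * and EOF then reads back as whatever is on disk, not as zeros. */
static BOOL set_vdl_to_eof(HANDLE file)
{
    LARGE_INTEGER size;

    if (!GetFileSizeEx(file, &size))
        return FALSE;
    return SetFileValidData(file, size.QuadPart);
}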

I looked into this. There is no way to get this information via any API, not even NtQueryInformationFile (FileEndOfFileInformation only works with NtSetInformationFile). So in the end I read it by parsing the NTFS file records manually. If anyone has a better way, please tell! This obviously only works with full system access (and NTFS), and it may be out of sync with the in-memory information Windows uses.
#pragma pack(push)
#pragma pack(1)
struct NTFSFileRecord
{
char magic[4];
unsigned short sequence_offset;
unsigned short sequence_size;
uint64 lsn;
unsigned short sequence_number;
unsigned short hardlink_count;
unsigned short attribute_offset;
unsigned short flags;
unsigned int real_size;
unsigned int allocated_size;
uint64 base_record;
unsigned short next_id;
//char padding[470];
};
struct MFTAttribute
{
unsigned int type;
unsigned int length;
unsigned char nonresident;
unsigned char name_length;
unsigned short name_offset;
unsigned short flags;
unsigned short attribute_id;
unsigned int attribute_length;
unsigned short attribute_offset;
unsigned char indexed_flag;
unsigned char padding1;
//char padding2[488];
};
struct MFTAttributeNonResident
{
unsigned int type;
unsigned int length;
unsigned char nonresident;
unsigned char name_length;
unsigned short name_offset;
unsigned short flags;
unsigned short attribute_id;
uint64 starting_vnc;
uint64 last_vnc;
unsigned short run_offset;
unsigned short compression_size;
unsigned int padding;
uint64 allocated_size;
uint64 real_size;
uint64 initial_size;
};
#pragma pack(pop)
HANDLE GetVolumeData(const std::wstring& volfn, NTFS_VOLUME_DATA_BUFFER& vol_data)
{
HANDLE vol = CreateFileW(volfn.c_str(), GENERIC_WRITE | GENERIC_READ,
FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (vol == INVALID_HANDLE_VALUE)
return vol;
DWORD ret_bytes;
BOOL b = DeviceIoControl(vol, FSCTL_GET_NTFS_VOLUME_DATA,
NULL, 0, &vol_data, sizeof(vol_data), &ret_bytes, NULL);
if (!b)
{
CloseHandle(vol);
return INVALID_HANDLE_VALUE;
}
return vol;
}
int64 GetFileValidData(HANDLE file, HANDLE vol, const NTFS_VOLUME_DATA_BUFFER& vol_data)
{
BY_HANDLE_FILE_INFORMATION hfi;
BOOL b = GetFileInformationByHandle(file, &hfi);
if (!b)
return -1;
NTFS_FILE_RECORD_INPUT_BUFFER record_in;
record_in.FileReferenceNumber.HighPart = hfi.nFileIndexHigh;
record_in.FileReferenceNumber.LowPart = hfi.nFileIndexLow;
std::vector<BYTE> buf;
buf.resize(sizeof(NTFS_FILE_RECORD_OUTPUT_BUFFER) + vol_data.BytesPerFileRecordSegment - 1);
NTFS_FILE_RECORD_OUTPUT_BUFFER* record_out = reinterpret_cast<NTFS_FILE_RECORD_OUTPUT_BUFFER*>(buf.data());
DWORD bout;
b = DeviceIoControl(vol, FSCTL_GET_NTFS_FILE_RECORD, &record_in,
sizeof(record_in), record_out, static_cast<DWORD>(buf.size()), &bout, NULL);
if (!b)
return -1;
NTFSFileRecord* record = reinterpret_cast<NTFSFileRecord*>(record_out->FileRecordBuffer);
unsigned int currpos = record->attribute_offset;
MFTAttribute* attr = nullptr;
while ( (attr==nullptr ||
attr->type != 0xFFFFFFFF )
&& record_out->FileRecordBuffer + currpos +sizeof(MFTAttribute)<buf.data() + bout)
{
attr = reinterpret_cast<MFTAttribute*>(record_out->FileRecordBuffer + currpos);
if (attr->type == 0x80
&& record_out->FileRecordBuffer + currpos + attr->attribute_offset+sizeof(MFTAttributeNonResident)
< buf.data()+ bout)
{
if (attr->nonresident == 0)
return -1;
MFTAttributeNonResident* dataattr = reinterpret_cast<MFTAttributeNonResident*>(record_out->FileRecordBuffer
+ currpos + attr->attribute_offset);
return dataattr->initial_size;
}
currpos += attr->length;
}
return -1;
}
[...]
NTFS_VOLUME_DATA_BUFFER vol_data;
HANDLE vol = GetVolumeData(L"\\??\\D:", vol_data);
if (vol != INVALID_HANDLE_VALUE)
{
int64 vdl = GetFileValidData(alloc_test->getOsHandle(), vol, vol_data);
if(vdl>=0) { [...] }
[...]
}
[...]

SetFileValidData (according to MSDN) can be used, for example, to create a large file without having to write any data to it. For a database, this allocates a (contiguous) storage area.
As a result, the file's size on disk changes without any data having been written to the file.
By implication, any GetValidData (which does not exist) would just return the size of the file, so you can use GetFileSize, which returns the "valid" file size.

I think you are confused as to what "valid data length" actually means. Check this answer.
Basically, while SetEndOfFile lets you increase the length of a file quickly and allocates the disk space, if you then seek to the (new) end-of-file to write there, all the additionally allocated disk space first has to be overwritten with zeroes, which is rather slow.
SetFileValidData lets you skip that zeroing-out. You're telling the system, "I am OK with whatever is in those disk blocks, get on with it". (This is why you need the SE_MANAGE_VOLUME_NAME privilege: skipping the overwrite could reveal privileged data to unprivileged users. Users with this privilege can access the raw drive data anyway.)
In either case, you have set the new effective size of the file (which you can read back). What, exactly, should a separate "read file valid data" report? SetFileValidData told the system that whatever is in those disk blocks is "valid"...
A different way to explain it:
The documentation mentions that the "valid data length" is tracked; the purpose is for the system to know which range (from end-of-valid-data to end-of-file) it still needs to zero out, in the context of SetEndOfFile, when necessary (e.g. when you close the file). You don't need to read this value back, because the only way it could differ from the actual file size is if you, yourself, changed it via the aforementioned functions...
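For illustration, the typical fast pre-allocation sequence combining the two calls might look like the sketch below; it assumes SE_MANAGE_VOLUME_NAME has already been enabled for the process (as in the earlier sketch) and that the handle has GENERIC_WRITE access:

#include <windows.h>

/* Sketch: grow the file to new_size bytes without writing data,
 * then move the VDL so the lazy zero-fill never happens. */
static BOOL preallocate(HANDLE file, LONGLONG new_size)
{
    LARGE_INTEGER pos;

    pos.QuadPart = new_size;
    if (!SetFilePointerEx(file, pos, NULL, FILE_BEGIN))
        return FALSE;
    if (!SetEndOfFile(file))            /* allocates space, EOF = new_size */
        return FALSE;
    /* Optional, and only if exposing old on-disk data is acceptable. */
    return SetFileValidData(file, new_size);
}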

Related

What's the "to little endian" equivalent of htonl? [duplicate]

I need to convert a short value from the host byte order to little endian. If the target was big endian, I could use the htons() function, but alas - it's not.
I guess I could do:
swap(htons(val))
But this could potentially cause the bytes to be swapped twice, rendering the result correct but giving me a performance penalty which is not alright in my case.
Here is an article from IBM about endianness and how to determine it:
Writing endian-independent code in C: Don't let endianness "byte" you
It includes an example of how to determine endianness at run time (which you would only need to do once):
#include <stdio.h>
#include <stdlib.h>

const int i = 1;
#define is_bigendian() ( (*(char*)&i) == 0 )

int main(void) {
    int val;
    char *ptr;
    ptr = (char*) &val;
    val = 0x12345678;
    if (is_bigendian()) {
        printf("%X.%X.%X.%X\n", ptr[0], ptr[1], ptr[2], ptr[3]);
    } else {
        printf("%X.%X.%X.%X\n", ptr[3], ptr[2], ptr[1], ptr[0]);
    }
    exit(0);
}
The page also has a section on methods for reversing byte order:
short reverseShort(short s) {
    unsigned char c1, c2;
    if (is_bigendian()) {
        return s;
    } else {
        c1 = s & 255;
        c2 = (s >> 8) & 255;
        return (c1 << 8) + c2;
    }
}

short reverseShort(char *c) {
    short s;
    char *p = (char *)&s;
    if (is_bigendian()) {
        p[0] = c[0];
        p[1] = c[1];
    } else {
        p[0] = c[1];
        p[1] = c[0];
    }
    return s;
}
Then you should know your endianness and call htons() conditionally. Actually, not even htons, but just swap bytes conditionally. Compile-time, of course.
Something like the following:
unsigned short swaps(unsigned short val)
{
    return ((val & 0xff) << 8) | ((val & 0xff00) >> 8);
}

/* host to little endian */
#define PLATFORM_IS_BIG_ENDIAN 1

#if PLATFORM_IS_LITTLE_ENDIAN
unsigned short htoles(unsigned short val)
{
    /* no-op on a little endian platform */
    return val;
}
#elif PLATFORM_IS_BIG_ENDIAN
unsigned short htoles(unsigned short val)
{
    /* need to swap bytes on a big endian platform */
    return swaps(val);
}
#else
unsigned short htoles(unsigned short val)
{
    /* the platform hasn't been properly configured for the */
    /* preprocessor to know if it's little or big endian;   */
    /* use the potentially less-performant, but always-works option */
    return swaps(htons(val));
}
#endif
If you have a system that's properly configured (such that the preprocessor knows whether the target is little or big endian) you get an 'optimized' version of htoles(). Otherwise you get the potentially non-optimized version that depends on htons(). In any case, you get something that works.
Nothing too tricky and more or less portable.
Of course, you can further improve the optimization possibilities by implementing this with inline or as macros as you see fit.
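For instance, with a reasonably recent GCC or Clang the whole decision can be made at compile time using the predefined __BYTE_ORDER__ macro and the byte-swap builtin; a sketch, not portable beyond those compilers:

#include <stdint.h>

/* Sketch: host-to-little-endian for 16-bit values, resolved at compile time. */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#  define htoles(x) ((uint16_t)(x))                  /* no-op */
#else
#  define htoles(x) __builtin_bswap16((uint16_t)(x)) /* swap on big endian */
#endif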
You might want to look at something like the "Portable Open Source Harness (POSH)" for an actual implementation that defines the endianness for various compilers. Note that getting to the library requires going through a pseudo-authentication page (though you don't need to register or give any personal details): http://hookatooka.com/poshlib/
This trick should work: at startup, use ntohs with a dummy value and then compare the resulting value to the original value. If both values are the same, then the machine uses big endian, otherwise it is little endian.
Then, use a ToLittleEndian method that either does nothing or invokes ntohs, depending on the result of the initial test.
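A minimal sketch of that runtime approach (the names are only illustrative):

#include <arpa/inet.h>   /* ntohs(); use winsock2.h on Windows */
#include <stdint.h>

static int host_is_big_endian;   /* set once at startup */

void endian_init(void)
{
    /* ntohs() is a no-op on a big-endian host. */
    host_is_big_endian = (ntohs(0x1234) == 0x1234);
}

uint16_t to_little_endian16(uint16_t v)
{
    if (!host_is_big_endian)
        return v;                           /* already little endian */
    return (uint16_t)((v << 8) | (v >> 8)); /* swap the two bytes */
}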
(Edited with the information provided in comments)
My rule-of-thumb performance guess is that it depends on whether you are little-endian-ising a big block of data in one go, or just one value:
If just one value, then the function call overhead is probably going to swamp the overhead of unnecessary byte-swaps, and that's even if the compiler doesn't optimise away the unnecessary byte swaps. Then you're maybe going to write the value as the port number of a socket connection, and try to open or bind a socket, which takes an age compared with any sort of bit-manipulation. So just don't worry about it.
If a large block, then you might worry the compiler won't handle it. So do something like this:
if (!is_little_endian()) {
    for (int i = 0; i < size; ++i) {
        vals[i] = swap_short(vals[i]);
    }
}
Or look into SIMD instructions on your architecture which can do it considerably faster.
Write is_little_endian() using whatever trick you like. I think the one Robert S. Barnes provides is sound, but since you usually know for a given target whether it's going to be big- or little-endian, maybe you should have a platform-specific header file that defines it as a macro evaluating to either 1 or 0.
As always, if you really care about performance, then look at the generated assembly to see whether pointless code has been removed or not, and time the various alternatives against each other to see what actually goes fastest.
Unfortunately, there's not really a cross-platform way to determine a system's byte order at compile-time with standard C. I suggest adding a #define to your config.h (or whatever else you or your build system uses for build configuration).
A unit test to check for the correct definition of LITTLE_ENDIAN or BIG_ENDIAN could look like this:
#include <assert.h>
#include <limits.h>
#include <stdint.h>
void check_bits_per_byte(void)
{ assert(CHAR_BIT == 8); }
void check_sizeof_uint32(void)
{ assert(sizeof (uint32_t) == 4); }
void check_byte_order(void)
{
static const union { unsigned char bytes[4]; uint32_t value; } byte_order =
{ { 1, 2, 3, 4 } };
static const uint32_t little_endian = 0x04030201ul;
static const uint32_t big_endian = 0x01020304ul;
#ifdef LITTLE_ENDIAN
assert(byte_order.value == little_endian);
#endif
#ifdef BIG_ENDIAN
assert(byte_order.value == big_endian);
#endif
#if !defined LITTLE_ENDIAN && !defined BIG_ENDIAN
assert(!"byte order unknown or unsupported");
#endif
}
int main(void)
{
check_bits_per_byte();
check_sizeof_uint32();
check_byte_order();
}
On many Linux systems there is an <endian.h> or <sys/endian.h> with conversion functions; see the man page for endian(3).
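With glibc, for example, the conversion then reduces to the following sketch (the exact header and feature-test macro vary by platform):

#define _DEFAULT_SOURCE          /* exposes htole16()/le16toh() in glibc */
#include <endian.h>
#include <stdint.h>

uint16_t to_wire(uint16_t host_val)
{
    return htole16(host_val);    /* host to little endian; no-op on x86 */
}

uint16_t from_wire(uint16_t wire_val)
{
    return le16toh(wire_val);    /* little endian to host */
}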

AT91 ARM EMAC polling issue

I am using Atmel's lwIP example. Interfacing with the PHY is OK: it can link and even auto-negotiate, and the netif comes up. But when I start polling the netif, nothing happens. I've narrowed the problem down to EMAC_Poll:
unsigned char EMAC_Poll(unsigned char *pFrame, unsigned int frameSize, unsigned int *pRcvSize)
{
unsigned short bufferLength;
unsigned int tmpFrameSize=0;
unsigned char *pTmpFrame=0;
unsigned int tmpIdx = rxTd.idx;
volatile EmacRxTDescriptor *pRxTd = rxTd.td + rxTd.idx;
ASSERT(pFrame, "F: EMAC_Poll\n\r");
char isFrame = 0;
// Set the default return value
*pRcvSize = 0;
// Process received RxTd
while ((pRxTd->addr & EMAC_RX_OWNERSHIP_BIT) == EMAC_RX_OWNERSHIP_BIT) {
// Never got there.
...
}
return EMAC_RX_NO_DATA;
}
typedef struct {
volatile EmacRxTDescriptor td[RX_BUFFERS];
EMAC_RxCallback rxCb; /// Callback function to be invoked once a frame has been received
unsigned short idx;
} RxTd;
/// Describes the type and attribute of Receive Transfer descriptor.
typedef struct _EmacRxTDescriptor {
unsigned int addr;
unsigned int status;
} __attribute__((packed, aligned(8))) EmacRxTDescriptor, *PEmacRxTDescriptor;
There is a while loop, but its condition never becomes true.
I have only a vague idea of what RxTd is and what exactly this condition means. However, I cannot see how this RxTd would ever change so that the condition passes. All references to RxTd lead to the same emac.c module; most of them are in that polling function and the rest are in the EMAC_ResetRx function.
static void EMAC_ResetRx(void)
{
unsigned int Index;
unsigned int Address;
// Disable RX
AT91C_BASE_EMAC->EMAC_NCR &= ~AT91C_EMAC_RE;
// Setup the RX descriptors.
rxTd.idx = 0;
for(Index = 0; Index < RX_BUFFERS; Index++) {
Address = (unsigned int)(&(pRxBuffer[Index * EMAC_RX_UNITSIZE]));
// Remove EMAC_RX_OWNERSHIP_BIT and EMAC_RX_WRAP_BIT
rxTd.td[Index].addr = Address & EMAC_ADDRESS_MASK;
rxTd.td[Index].status = 0;
}
rxTd.td[RX_BUFFERS - 1].addr |= EMAC_RX_WRAP_BIT;
// Receive Buffer Queue Pointer Register
AT91C_BASE_EMAC->EMAC_RBQP = (unsigned int) (rxTd.td);
}
I do not really understand the last line, but it looks like rxTd is filled in by the AT91 hardware itself. If so, there may be a packing/alignment problem, but Atmel added __attribute__((packed, aligned(8))) to the descriptor structure definition. Anyway, can someone describe the mechanism of data input, or tell me where the problem might be?
By the way, I am using gcc, if that matters.
UPD:
I've checked the RSR and noticed that it starts at 0, then goes to 2 after a second; 2 means new data was captured.
UPD:
So I've read about the EMAC's operation in the datasheet for my chip. I was right: the RBQP register must point to an array of descriptors, and each descriptor consists of an address field and a status field. The datasheet states that "bit zero of the address field is written to one to show the buffer has been used", after which the EMAC uses the next rx descriptor from that array. I guess by "has been used" they mean that the buffer is filled with frame data and ready to be processed. This must mean the data is just not getting into that buffer. But it must be arriving, because REC goes high. Additionally, I've checked that RE in the NCR is set and that the MI is enabled. I have no idea what is wrong.
I've spent a whole week solving it. The funny thing is that when I dumped memory and looked at all those addresses, the data was there the whole time! So the key was to disable the I and D caches and the MMU itself. Hope this helps someone.
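For reference, the underlying issue is that the EMAC's DMA writes the descriptors and frame data to RAM while the CPU keeps reading stale copies from the data cache. Disabling the caches works, but a gentler fix is to invalidate the cache lines covering a descriptor before checking its ownership bit (and to do the same for the receive buffer before copying the frame out). A sketch reusing the declarations from the code above; dcache_invalidate_range() is a hypothetical helper that would have to be implemented for the AT91's cache, or replaced by placing the descriptors in a non-cacheable region:

/* Hypothetical helper: invalidate the D-cache lines covering [addr, addr+len). */
extern void dcache_invalidate_range(void *addr, unsigned int len);

static int rx_descriptor_ready(void)
{
    volatile EmacRxTDescriptor *pRxTd = rxTd.td + rxTd.idx;

    /* Read what the DMA actually wrote, not a stale cached copy. */
    dcache_invalidate_range((void *)pRxTd, sizeof *pRxTd);

    return (pRxTd->addr & EMAC_RX_OWNERSHIP_BIT) == EMAC_RX_OWNERSHIP_BIT;
}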

What does sparc_do_fork() exactly do?

I tried to find out what this function does, but I couldn't. It is defined in Linux/arch/sparc/kernel/process_32.c. Thanks.
asmlinkage int sparc_do_fork(unsigned long clone_flags,
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size)
{
unsigned long parent_tid_ptr, child_tid_ptr;
unsigned long orig_i1 = regs->u_regs[UREG_I1];
long ret;
parent_tid_ptr = regs->u_regs[UREG_I2];
child_tid_ptr = regs->u_regs[UREG_I4];
ret = do_fork(clone_flags, stack_start, stack_size,
(int __user *) parent_tid_ptr,
(int __user *) child_tid_ptr);
/* If we get an error and potentially restart the system
* call, we're screwed because copy_thread() clobbered
* the parent's %o1. So detect that case and restore it
* here.
*/
if ((unsigned long)ret >= -ERESTART_RESTARTBLOCK)
regs->u_regs[UREG_I1] = orig_i1;
return ret;
}
It appears to be right there in the source code.
It's a wrapper around the regular Linux do_fork() call, one which saves and restores data (specifically, regs->u_regs[UREG_I1], which equates to the SPARC output register 1) that would otherwise be corrupted under certain circumstances:
/* If we get an error and potentially restart the system
* call, we're screwed because copy_thread() clobbered
* the parent's %o1. So detect that case and restore it
* here.
*/
It does this with:
unsigned long orig_i1 = regs->u_regs[UREG_I1]; // Save it.
ret = do_fork(...);
if ((unsigned long)ret >= -ERESTART_RESTARTBLOCK) // It may be corrupt
regs->u_regs[UREG_I1] = orig_i1; // so restore it.

Distributed shared memory [duplicate]

This question already has answers here:
Distributed shared memory library for C++? [closed]
(3 answers)
Closed 8 years ago.
My current system runs on Linux, with the different tasks using shared memory to access the common data (which is defined as a C struct). The size of the shared data is about 100K.
Now, I want to run the main user interface on Windows, while keeping all the other tasks in Linux, and I’m looking for the best replacement for the shared memory. The refresh rate of the UI is about 10 times per second.
My question is whether it is better to do it myself or to use a third-party solution. If I do it on my own, my idea is to open a socket and then use some sort of data serialization between the client (Windows) and the server (Linux).
In the case of third party solutions, I’m a bit overwhelmed by the number of options. After some search, it seems to me that the right solution could be MPI, but I would like to consider other options before start working with it: XDR, OpenMP, JSON, DBus, RDMA, memcached, boost libraries … Has anyone got any experience with any of them? Which are the pros and cons of using such solutions for accessing the shared memory on Linux from Windows?
Maybe MPI or the other third-party solutions are too elaborate for such a simple use case as mine, and I should use a do-it-yourself approach? Any advice if I take that route? Am I missing something? Am I looking in the wrong direction? I wouldn't like to reinvent the wheel.
For the file sharing, I’m considering Samba.
My development is done with C and C++, and, of course, the solution needs to be compiled in Linux and Windows.
AFAIK (but I don't know Windows) you cannot share memory (à la shm_overview(7)) between Linux and Windows at once (you can share data using some network protocol, which is what memcached does).
However, you can make a Linux process answer network messages from a Windows machine (but that is not shared memory!).
Did you consider using a web interface? You could add a web interface (with Ajax techniques) to your Linux software, e.g. using FastCGI or an HTTP server library like libonion or wt. Then your Windows users could use their browser to interact with your program (which would still run on some Linux compute server).
PS. Read the wikipage on distributed shared memory. It means something different than what you are asking! Read also about message passing!
I'd recommend using a plain TCP socket between the user interface and the service, with simple tagged message frames passed between the two in preferably architecture-independent manner (i.e. with specific byte order and integer and floating-point types).
On the server end, I'd use a helper process or thread, that maintains two local copies of the shared state struct per client. First one reflects the state the client knows about, and the second one is used to snapshot the shared state at regular intervals. When the two differ, the helper sends the differing fields to the client. Conversely, the client can send modification requests to the helper.
I would not send the 100k shared data structure as a single chunk; that would be very inefficient. Also, assuming the state contains multiple fields, having the client send back the entire new 100k state would overwrite fields the client was not meant to change.
I'd use one message for each field in the state, or atomically manipulated set of fields. The message structure itself should be very simple. At minimum, each message should start with a fixed-size length and type fields, so that it is trivial to receive the messages. It is always a good idea to leave the possibility of later extending the capabilities/fields, without breaking backwards compatibility, too.
You don't need a lot of code to implement the above. You do need some accessor/manipulator code for each different type of field in the state (char, short, int, long, double, float, and any other type or struct you might use), and that (and the helper that compares the active state with the simulated user state) will probably be the bulk of the code.
Although I don't normally recommend reinventing the wheel, in this case I do recommend writing your own message passing code, directly on top of TCP, simply because all the libraries I know of that could be used for this, are rather unwieldy or would impose some design choices I'd rather leave to the application.
If you want, I can provide some example code that would hopefully illustrate this better. I don't have Windows, but Qt does have a QTcpSocket class you can use that works the same in both Linux and Windows, so there shouldn't be (m)any differences. In fact, if you design the message structure well, you could write the UI in some other language like Python, without any differences to the server itself.
Promised examples follow; apologies for maximum-length post.
Paired counters are useful when there is a single writer for a specific field, possibly many readers, and you wish to add minimal overhead to the write path. Reading never blocks, and a reader will see whether they got a valid copy or if an update was in progress and they should not rely on the value they saw. (This also allows reliable write semantics for a field in an async-signal-safe context, where pthread locking is not guaranteed to work.)
This implementation works best on GCC-4.7 and later, but should work with earlier C compilers, too. Legacy __sync built-ins also work with Intel, Pathscale, and PGI C compilers. counter.h:
#ifndef COUNTER_H
#define COUNTER_H
/*
* Atomic generation counter support.
*
* A generation counter allows a single writer thread to
* locklessly modify a value non-atomically, while letting
* any number of reader threads detect the change. Reader
* threads can also check if the value they observed is
* "atomic", or taken during a modification in progress.
*
* There is no protection against multiple concurrent writers.
*
* If the writer gets canceled or dies during an update,
* the counter will get garbled. Reinitialize before relying
* on such a counter.
*
* There is no guarantee that a reader will observe an
* "atomic" value, if the writer thread modifies the value
* more often than the reader thread can read it.
*
* Two typical use cases:
*
* A) A single writer requires minimum overhead/latencies,
* whereas readers can poll and retry if necessary
*
* B) Non-atomic value or structure modified in
* an interrupt context (async-signal-safe)
*
* Initialization:
*
* Counter counter = COUNTER_INITIALIZER;
*
* or
*
* Counter counter;
* counter_init(&counter);
*
* Write sequence:
*
* counter_acquire(&counter);
* (modify or write value)
* counter_release(&counter);
*
* Read sequence:
*
* unsigned int check;
* check = counter_before(&counter);
* (obtain a copy of the value)
* if (check == counter_after(&counter))
* (a copy of the value is atomic)
*
* Read sequence with spin-waiting,
* will spin forever if counter garbled:
*
* unsigned int check;
* do {
* check = counter_before(&counter);
* (obtain a copy of the value)
* } while (check != counter_after(&counter));
*
* All these are async-signal-safe, and will never block
* (except for the spinning loop just above, obviously).
*/
typedef struct {
volatile unsigned int value[2];
} Counter;
#define COUNTER_INITIALIZER {{0U,0U}}
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
/*
* GCC 4.7 and later provide __atomic built-ins.
* These are very efficient on x86 and x86-64.
*/
static inline void counter_init(Counter *const counter)
{
/* This is silly; assignments should suffice. */
do {
__atomic_store_n(&(counter->value[1]), 0U, __ATOMIC_SEQ_CST);
__atomic_store_n(&(counter->value[0]), 0U, __ATOMIC_SEQ_CST);
} while (__atomic_load_n(&(counter->value[0]), __ATOMIC_SEQ_CST) |
__atomic_load_n(&(counter->value[1]), __ATOMIC_SEQ_CST));
}
static inline unsigned int counter_before(Counter *const counter)
{
return __atomic_load_n(&(counter->value[1]), __ATOMIC_SEQ_CST);
}
static inline unsigned int counter_after(Counter *const counter)
{
return __atomic_load_n(&(counter->value[0]), __ATOMIC_SEQ_CST);
}
static inline unsigned int counter_acquire(Counter *const counter)
{
return __atomic_fetch_add(&(counter->value[0]), 1U, __ATOMIC_SEQ_CST);
}
static inline unsigned int counter_release(Counter *const counter)
{
return __atomic_fetch_add(&(counter->value[1]), 1U, __ATOMIC_SEQ_CST);
}
#else
/*
* Rely on legacy __sync built-ins.
*
* Because there is no __sync_fetch(),
* counter_before() and counter_after() are safe,
* but not optimal (especially on x86 and x86-64).
*/
static inline void counter_init(Counter *const counter)
{
/* This is silly; assignments should suffice. */
do {
counter->value[0] = 0U;
counter->value[1] = 0U;
__sync_synchronize();
} while (__sync_fetch_and_add(&(counter->value[0]), 0U) |
__sync_fetch_and_add(&(counter->value[1]), 0U));
}
static inline unsigned int counter_before(Counter *const counter)
{
return __sync_fetch_and_add(&(counter->value[1]), 0U);
}
static inline unsigned int counter_after(Counter *const counter)
{
return __sync_fetch_and_add(&(counter->value[0]), 0U);
}
static inline unsigned int counter_acquire(Counter *const counter)
{
return __sync_fetch_and_add(&(counter->value[0]), 1U);
}
static inline unsigned int counter_release(Counter *const counter)
{
return __sync_fetch_and_add(&(counter->value[1]), 1U);
}
#endif
#endif /* COUNTER_H */
Each shared state field type needs its own structure defined. For our purposes, I'll just use a Value structure to describe one of them. It is up to you to define all the needed Field and Value types to match your needs. For example, a 3D vector of double-precision floating-point components would be
typedef struct {
    double x;
    double y;
    double z;
} Value;

typedef struct {
    Counter counter;
    Value   value;
} Field;
where the modifying thread uses e.g.
void set_field(Field *const field, const Value *const value)
{
    counter_acquire(&field->counter);
    field->value = *value;
    counter_release(&field->counter);
}
and each reader maintains their own local shapshot using e.g.
typedef enum {
    UPDATED = 0,
    UNCHANGED = 1,
    BUSY = 2
} Updated;

Updated check_field(Field *const local, Field *const shared)
{
    Field        cache;
    unsigned int before, after;

    before = counter_before(&shared->counter);
    /* Local counter check allows forcing an update by
     * simply changing one half of the local counter.
     * If you don't need that, omit the local-only part. */
    if (local->counter.value[0] == local->counter.value[1] &&
        before == local->counter.value[1])
        return UNCHANGED;

    cache.value = shared->value;
    after = counter_after(&shared->counter);
    if (before != after)
        return BUSY;

    cache.counter.value[0] = after;
    cache.counter.value[1] = before;
    *local = cache;
    return UPDATED;
}
Note that the cache above constitutes one local copy of the value, so each reader only needs to maintain one copy of the fields it is interested in. Also, the above accessor function only modifies local with a successful snapshot.
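A reader might then poll a field roughly like this (a sketch built on the types above; render() stands in for whatever the consumer actually does with a fresh value):

extern void render(const Value *value);   /* hypothetical consumer */

/* Sketch: poll one shared field and act only on consistent snapshots. */
void poll_once(Field *const local, Field *const shared)
{
    switch (check_field(local, shared)) {
    case UPDATED:
        render(&local->value);   /* local->value is a consistent snapshot */
        break;
    case UNCHANGED:
        break;                   /* nothing new */
    case BUSY:
        break;                   /* writer was mid-update; retry later */
    }
}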
When using pthread locking, pthread_rwlock_t is a good option, as long as the programmer understands the priority issues with multiple readers and writers. (see man pthread_rwlock_rdlock and man pthread_rwlock_wrlock for details.)
Personally, I would just stick to mutexes with as short holding times as possible, to reduce overall complexity. The Field structure for a mutex-protected change-counted value I'd use would probably be something like
typedef struct {
    pthread_mutex_t       mutex;
    Value                 value;
    volatile unsigned int change;
} Field;
void set_field(Field *const field, const Value *const value)
{
    pthread_mutex_lock(&field->mutex);
    field->value = *value;
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
    __atomic_fetch_add(&field->change, 1U, __ATOMIC_SEQ_CST);
#else
    __sync_fetch_and_add(&field->change, 1U);
#endif
    pthread_mutex_unlock(&field->mutex);
}

void get_field(Field *const local, Field *const field)
{
    pthread_mutex_lock(&field->mutex);
    local->value = field->value;
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
    local->change = __atomic_load_n(&field->change, __ATOMIC_SEQ_CST);
#else
    local->change = __sync_fetch_and_add(&field->change, 0U);
#endif
    pthread_mutex_unlock(&field->mutex);
}

Updated check_field(Field *const local, Field *const shared)
{
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
    if (local->change == __atomic_load_n(&shared->change, __ATOMIC_SEQ_CST))
        return UNCHANGED;
#else
    if (local->change == __sync_fetch_and_add(&shared->change, 0U))
        return UNCHANGED;
#endif
    pthread_mutex_lock(&shared->mutex);
    local->value = shared->value;
#if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
    local->change = __atomic_load_n(&shared->change, __ATOMIC_SEQ_CST);
#else
    local->change = __sync_fetch_and_add(&shared->change, 0U);
#endif
    pthread_mutex_unlock(&shared->mutex);
    return UPDATED;
}
Note that the mutex field in the local copy is unused, and can be omitted from the local copy (defining a suitable Cache etc. structure type without it).
The semantics with respect to the change field are simple: It is always accessed atomically, and can therefore be read without holding the mutex. It is only modified while holding the mutex (but because readers do not hold the mutex, it must be modified atomically even then).
Note that if your server is restricted to x86 or x86-64 architectures, you can use ordinary accesses to change above (e.g. field->change++), as unsigned int access is atomic on these architectures; no need to use built-in functions. (Although increment is non-atomic, readers will only ever see the old or the new unsigned int, so that's not a problem either.)
On the server, you'll need some kind of "agent" process or thread, acting on behalf of the remote clients, sending field changes to each client, and optionally receiving field contents for update from each client.
Efficient implementation of such an agent is a big question. A properly thought-out recommendation would require more detailed information on the system. The simplest possible implementation, I guess, would be a low-priority thread simply scanning the shared state continuously, reporting any field changes not known to each client as it sees them. If change rate is high -- say, usually more than a dozen changes per second over the entire state -- then that may be quite acceptable. If change rate varies, adding a global change counter (modified whenever a single field is modified, i.e. in set_field()) and sleeping (or at least yielding) when no global change is observed, is better.
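A very rough sketch of such an agent loop follows; NFIELDS, shared_state, client_state and send_field_update() are placeholders for whatever the real application defines:

#include <sched.h>   /* sched_yield() */

#define NFIELDS 64
extern Field shared_state[NFIELDS];    /* fields updated by the writers     */
extern Field client_state[NFIELDS];    /* last snapshot sent to this client */
extern void  send_field_update(int client_fd, int field_index,
                               const Value *value);

/* Sketch: continuously scan the shared state and forward changed fields. */
void agent_loop(int client_fd)
{
    for (;;) {
        int sent_something = 0;

        for (int i = 0; i < NFIELDS; i++)
            if (check_field(&client_state[i], &shared_state[i]) == UPDATED) {
                send_field_update(client_fd, i, &client_state[i].value);
                sent_something = 1;
            }

        if (!sent_something)
            sched_yield();   /* or nanosleep() until the next poll interval */
    }
}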
If there are multiple remote clients, I would use a single agent, which collects changes into a queue, with a cached copy of the state as seen at the head of the queue (with previous changes being those that lead up to it). Initially connected clients get the entire state, then each queued change following that. Note that the queue entries do not need to represent fields as such, but the messages sent to each client to update that field, and that if a latter update replaces an already queued value, the already queued value can be removed, reducing the amount of data that needs to be sent to clients that are catching up. This is an interesting topic on its own, and would warrant a question of its own, in my opinion.
Like I mentioned above, I would personally use a plain TCP socket between the user interface and the service.
There are four main issues with respect to the TCP messages:
Byte order and data formats (unless both server and remote clients are guaranteed to use the same hardware architecture)
My preferred solution is to use native byte order (but well-defined standard types) when sending, with recipient doing any byte order conversions. This requires that each end sends an initial message with predetermined "prototype values" for each integer and floating-point type used.
I assume integers use two's complement format with no padding bits, and double and float are double- and single-precision IEEE-754 types, respectively. Most current hardware architectures have these natively (although byte ordering does vary). Weird architectures can and do emulate these in software, typically by supplying suitable command-line switches or using some libraries.
Message framing (as TCP is a stream of bytes, not a stream of packets, from the application perspective)
Most robust option is to prefix each message with a fixed size type field and a fixed size length field. (Usually, the length field indicates how many additional bytes are part of this message, and will include any possible padding bytes the sender might have added.)
The recipient will receive TCP input until it has enough buffered to examine the fixed type and length parts and determine the complete packet length. Then it receives more data until it has a full message; a minimal receive buffer implementation is shown further below (the input_data buffer and the consume() function).
Extensibility
Backwards compatibility is essential for making updates and upgrades easier. This means you'll probably need to include some kind of version number in the initial message from server to client, and have both server and client ignore messages they do not know. (This obviously requires a length field in each message, as recipients cannot infer the length of messages they do not recognize.)
In some cases you might need the ability to mark a message important, in the sense that if the recipient does not recognize it, further communications should be aborted. This is easiest to achieve by using some kind of identifier or bit in the type field. For example, PNG file format is very similar to the message format I've described here, with a four-byte (four ASCII character) type field, and a four-byte (32-bit) length field for each message ("chunk"). If the first type character is uppercase ASCII, it means it is a critical one.
Passing initial state
Passing the initial state as a single chunk would tie the wire format to the entire shared state structure. Instead, I recommend passing each field or record separately, just as if they were updates. Optionally, when all the initial fields have been sent, you can send a message that tells the client it now has the entire state; this lets the client first receive the complete state, then construct and render the user interface, without having to dynamically adjust to a varying number of fields.
Whether each client requires the full state, or just a smaller subset of the state, is another question. When a remote client connects to a server, it probably should include some kind of identifier the server can use to make that determination.
Here is messages.h, a header file with inline implementation of integer and floating-point type handling and byte order detection:
#ifndef MESSAGES_H
#define MESSAGES_H
#include <stdint.h>
#include <string.h>
#include <errno.h>
#if defined(__GNUC__)
static const struct __attribute__((packed)) {
#else
#pragma pack(push,1)
static const struct {
#endif
const uint64_t u64; /* Offset +0 */
const int64_t i64; /* +8 */
const double dbl; /* +16 */
const uint32_t u32; /* +24 */
const int32_t i32; /* +28 */
const float flt; /* +32 */
const uint16_t u16; /* +36 */
const int16_t i16; /* +38 */
} prototypes = {
18364758544493064720ULL, /* u64 */
-1311768465156298103LL, /* i64 */
0.71948481353325643983254167324048466980457305908203125,
3735928559UL, /* u32 */
-195951326L, /* i32 */
1.06622731685638427734375f, /* flt */
51966U, /* u16 */
-7658 /* i16 */
};
#if !defined(__GNUC__)
#pragma pack(pop)
#endif
/* Add prototype section to a buffer.
*/
size_t add_prototypes(unsigned char *const data, const size_t size)
{
if (size < sizeof prototypes) {
errno = ENOSPC;
return 0;
} else {
memcpy(data, &prototypes, sizeof prototypes);
return sizeof prototypes;
}
}
/*
* Byte order manipulation functions.
*/
static void reorder64(void *const dst, const void *const src, const int order)
{
if (order) {
uint64_t value;
memcpy(&value, src, 8);
if (order & 1)
value = ((value >> 8U) & 0x00FF00FF00FF00FFULL)
| ((value & 0x00FF00FF00FF00FFULL) << 8U);
if (order & 2)
value = ((value >> 16U) & 0x0000FFFF0000FFFFULL)
| ((value & 0x0000FFFF0000FFFFULL) << 16U);
if (order & 4)
value = ((value >> 32U) & 0x00000000FFFFFFFFULL)
| ((value & 0x00000000FFFFFFFFULL) << 32U);
memcpy(dst, &value, 8);
} else
if (dst != src)
memmove(dst, src, 8);
}
static void reorder32(void *const dst, const void *const src, const int order)
{
if (order) {
uint32_t value;
memcpy(&value, src, 4);
if (order & 1)
value = ((value >> 8U) & 0x00FF00FFUL)
| ((value & 0x00FF00FFUL) << 8U);
if (order & 2)
value = ((value >> 16U) & 0x0000FFFFUL)
| ((value & 0x0000FFFFUL) << 16U);
memcpy(dst, &value, 4);
} else
if (dst != src)
memmove(dst, src, 4);
}
static void reorder16(void *const dst, const void *const src, const int order)
{
if (order & 1) {
const unsigned char c[2] = { ((const unsigned char *)src)[0],
((const unsigned char *)src)[1] };
((unsigned char *)dst)[0] = c[1];
((unsigned char *)dst)[1] = c[0];
} else
if (dst != src)
memmove(dst, src, 2);
}
/* Detect byte order conversions needed.
*
* If the prototypes cannot be supported, returns -1.
*
* If prototype record uses native types, returns 0.
*
* Otherwise, bits 0..2 are the integer order conversion,
* and bits 3..5 are the floating-point order conversion.
* If 'order' is the return value, use
* reorderXX(local, remote, order)
* for integers, and
* reorderXX(local, remote, order/8)
* for floating-point types.
*
* For adjusting records sent to server, just do the same,
* but with order obtained by calling this function with
* parameters swapped.
*/
int detect_order(const void *const native, const void *const other)
{
const unsigned char *const source = other;
const unsigned char *const target = native;
unsigned char temp[8];
int iorder = 0;
int forder = 0;
/* Verify the size of the types.
* C89/C99/C11 says sizeof (char) == 1, but check that too. */
if (sizeof (double) != 8 ||
sizeof (int64_t) != 8 ||
sizeof (uint64_t) != 8 ||
sizeof (float) != 4 ||
sizeof (int32_t) != 4 ||
sizeof (uint32_t) != 4 ||
sizeof (int16_t) != 2 ||
sizeof (uint16_t) != 2 ||
sizeof (unsigned char) != 1 ||
sizeof prototypes != 40)
return -1;
/* Find byte order for the largest floating-point type. */
while (1) {
reorder64(temp, source + 16, forder);
if (!memcmp(temp, target + 16, 8))
break;
if (++forder >= 8)
return -1;
}
/* Verify forder works for all other floating-point types. */
reorder32(temp, source + 32, forder);
if (memcmp(temp, target + 32, 4))
return -1;
/* Find byte order for the largest integer type. */
while (1) {
reorder64(temp, source + 0, iorder);
if (!memcmp(temp, target + 0, 8))
break;
if (++iorder >= 8)
return -1;
}
/* Verify iorder works for all other integer types. */
reorder64(temp, source + 8, iorder);
if (memcmp(temp, target + 8, 8))
return -1;
reorder32(temp, source + 24, iorder);
if (memcmp(temp, target + 24, 4))
return -1;
reorder32(temp, source + 28, iorder);
if (memcmp(temp, target + 28, 4))
return -1;
reorder16(temp, source + 36, iorder);
if (memcmp(temp, target + 36, 2))
return -1;
reorder16(temp, source + 38, iorder);
if (memcmp(temp, target + 38, 2))
return -1;
/* Everything works. */
return iorder + 8 * forder;
}
/* Verify current architecture is supportable.
* This could be a compile-time check.
*
* (The buffer contains prototypes for network byte order,
* and actually returns the order needed to convert from
* native to network byte order.)
*
* Returns -1 if architecture is not supported,
* a nonnegative (0 or positive) value if successful.
*/
static int verify_architecture(void)
{
static const unsigned char network_endian[40] = {
254U, 220U, 186U, 152U, 118U, 84U, 50U, 16U, /* u64 */
237U, 203U, 169U, 135U, 238U, 204U, 170U, 137U, /* i64 */
63U, 231U, 6U, 5U, 4U, 3U, 2U, 1U, /* dbl */
222U, 173U, 190U, 239U, /* u32 */
244U, 82U, 5U, 34U, /* i32 */
63U, 136U, 122U, 35U, /* flt */
202U, 254U, /* u16 */
226U, 22U, /* i16 */
};
return detect_order(&prototypes, network_endian);
}
#endif /* MESSAGES_H */
Here is an example function that a sender can use to pack a 3-component vector identified by a 32-bit unsigned integer, into a 36-byte message. This uses a message frame starting with four bytes ("Vec3"), followed by the 32-bit frame length, followed by the 32-bit identifier, followed by three doubles:
size_t add_vector(unsigned char *const data, const size_t size,
                  const uint32_t id,
                  const double x, const double y, const double z)
{
    const uint32_t length = 4 + 4 + 4 + 8 + 8 + 8;
    if (size < (size_t)length) {
        errno = ENOSPC;
        return 0;
    }
    /* Message identifier, four bytes */
    data[0] = 'V';
    data[1] = 'e';
    data[2] = 'c';
    data[3] = '3';
    /* Length field, uint32_t */
    memcpy(data + 4, &length, 4);
    /* Identifier, uint32_t */
    memcpy(data + 8, &id, 4);
    /* Vector components */
    memcpy(data + 12, &x, 8);
    memcpy(data + 20, &y, 8);
    memcpy(data + 28, &z, 8);
    return length;
}
Personally, I prefer shorter, 16-bit identifiers and 16-bit length; that also limits maximum message length to 65536 bytes, making 65536 bytes a good read/write buffer size.
The recipient can handle the received TCP data stream for example thus:
static unsigned char *input_data; /* Data buffer */
static size_t input_size; /* Data buffer size */
static unsigned char *input_head; /* Next byte in buffer */
static unsigned char *input_tail; /* After last buffered byte */
static int input_order; /* From detect_order() */
/* Function to handle "Vec3" messages: */
static void handle_vec3(unsigned char *const msg)
{
uint32_t id;
double x, y, z;
reorder32(&id, msg+8, input_order);
reorder64(&x, msg+12, input_order/8);
reorder64(&y, msg+20, input_order/8);
reorder64(&z, msg+28, input_order/8);
/* Do something with vector id, x, y, z. */
}
/* Function that tries to consume thus far buffered
* input data -- typically run once after each
* successful TCP receive. */
static void consume(void)
{
while (input_head + 8 <= input_tail) {
uint32_t length;
/* Get current packet length. */
reorder32(&length, input_head + 4, input_order);
if (input_head + length > input_tail)
break; /* Not fully received, yet. */
/* We have a full packet! */
/* Handle "Vec3" packet: */
if (input_head[0] == 'V' &&
input_head[1] == 'e' &&
input_head[2] == 'c' &&
input_head[3] == '3')
handle_vec3(input_head);
/* Advance to next packet. */
input_head += length;
}
if (input_head < input_tail) {
/* Move partial message to start of buffer. */
if (input_head > input_data) {
const size_t have = input_head - input_data;
memmove(input_data, input_head, have);
input_head = input_data;
input_tail = input_data + have;
}
} else {
/* Buffer is empty. */
input_head = input_data;
input_tail = input_data;
}
}
Questions?

Is it necessary to disable IRQ during copying memory for sg_copy_buffer?

Here is sg_copy_buffer() from Linux kernel 2.6.32. Is it necessary to disable IRQs while copying the memory?
static size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents,
void *buf, size_t buflen, int to_buffer)
{
unsigned int offset = 0;
struct sg_mapping_iter miter;
unsigned long flags;
unsigned int sg_flags = SG_MITER_ATOMIC;
if (to_buffer)
sg_flags |= SG_MITER_FROM_SG;
else
sg_flags |= SG_MITER_TO_SG;
sg_miter_start(&miter, sgl, nents, sg_flags);
local_irq_save(flags);
while (sg_miter_next(&miter) && offset < buflen) {
unsigned int len;
len = min(miter.length, buflen - offset);
if (to_buffer)
memcpy(buf + offset, miter.addr, len);
else
memcpy(miter.addr, buf + offset, len);
offset += len;
}
sg_miter_stop(&miter);
local_irq_restore(flags);
return offset;
}
The sg_miter_start() function that is called in this function calls kmap_atomic(), which can only be used inside atomic (non-interruptible) code paths. kmap_atomic() in turn is used because it is MUCH cheaper than a regular kmap(), as it does not need to do a global TLB flush.
The original implementation of sg_copy_buffer() left disabling interrupts to the caller, but after some callers forgot, causing bugs (e.g. https://bugzilla.kernel.org/show_bug.cgi?id=11529), the decision was made to disable interrupts in the function itself (see http://www.spinics.net/lists/linux-scsi/msg29428.html for the discussion).
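In practice a driver simply calls the sg_copy_to_buffer()/sg_copy_from_buffer() wrappers and lets them take care of the atomic mapping; a minimal sketch (the scatterlist itself is assumed to be set up elsewhere in the driver):

#include <linux/scatterlist.h>

/* Sketch: copy a request's payload out of a scatterlist into a linear
 * buffer. sg_copy_to_buffer() ends up in sg_copy_buffer(), which disables
 * local IRQs around the kmap_atomic()/memcpy() itself, so the caller does
 * not have to. */
static size_t fetch_payload(struct scatterlist *sgl, unsigned int nents,
                            void *dst, size_t len)
{
    return sg_copy_to_buffer(sgl, nents, dst, len);
}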
