How to write binary into file in Crystal

I have an array of UInt32, what is the most efficient way to write it into a binary file in Crystal lang?
So far I am using the IO#write_byte(byte : UInt8) method, but I believe there should be a way to write bigger chunks than 1 byte at a time.

You can write a Slice(UInt8) directly to any IO, which should be faster than iterating over the items and writing each byte one by one.
The trick is to access the Array(UInt32)'s internal buffer as a Pointer(UInt8) then make it a Slice(UInt8), which can be achieved with some unsafe code:
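array = [1_u32, 2_u32, 3_u32, 4_u32]
File.open("out.bin", "w") do |f|
  # Reinterpret the array's internal buffer as bytes (unsafe).
  ptr = array.to_unsafe.as(UInt8*)
  # Write all elements in one call, in native byte order.
  f.write ptr.to_slice(array.size * sizeof(UInt32))
end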
Be sure to never keep a reference to ptr, see Array#to_unsafe for details.

Related

Most idiomatic way to read a range of bytes from a file

I have a file, say myfile. Using Rust, I would like to open myfile, and read bytes N to M into a Vec, say myvec. What is the most idiomatic way to do so? Naively, I thought of using bytes(), then skip, take and collect, but that sounds so inefficient.
The most idiomatic (to my knowledge) and relatively efficient way:
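use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

let start = 10;
let count = 10;
let mut f = File::open("/etc/passwd")?;
// Skip to byte `start`, then read exactly `count` bytes.
f.seek(SeekFrom::Start(start))?;
let mut buf = vec![0; count];
f.read_exact(&mut buf)?;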
You indicated in the comments that you were concerned about the overhead of zeroing the memory before reading into it. Indeed there is a nonzero cost to this, but it's usually negligible compared to the I/O operations needed to read from a file, and the advantage is that your code remains 100% sound. But for educational purposes only, I tried to come up with an approach that avoids the zeroing.
Unfortunately, even with unsafe code, we cannot safely pass an uninitialized buffer to read_exact because of this paragraph in the documentation (emphasis mine):
No guarantees are provided about the contents of buf when this function is called, implementations cannot rely on any property of the contents of buf being true. It is recommended that implementations only write data to buf instead of reading its contents.
So it's technically legal for File::read_exact to read from the provided buffer, which means we cannot legally pass uninitialized data here (using MaybeUninit).
The existing answer works, but it reads the entire block that you're after into a Vec in memory. If the block you're reading out is huge or you have no use for it in memory, you ideally need an io::Read which you can copy straight into another file or pass into another api.
If your source implements Read + Seek then you can seek to the start position and then use Read::take to only read for a specific number of bytes.
use std::{fs::File, io::{self, Read, Seek, SeekFrom}};
let start = 20;
let length = 100;
let mut input = File::open("input.bin")?;
// Seek to the start position
input.seek(SeekFrom::Start(start))?;
// Create a reader with a fixed length
let mut chunk = input.take(length);
let mut output = File::create("output.bin")?;
// Copy the chunk into the output file
io::copy(&mut chunk, &mut output)?;

Saving a string array to CSV

I have a 25194081x2 matrix of strings called s1. Below is a sample of what the data looks like.
I am trying to save this matrix to csv. I tried the code below but for some reason it saves the first column of the vector twice (side by side) instead of the two columns.
What am I doing wrong?
fileID= fopen('data.csv', 'w') ;
fprintf(fileID, '%s,%s\n', [s1(:,1) s1(:,2)]);
fclose(fileID)
Don't merge the columns into a single string array as you do now. Pass them as separate arguments instead, and loop over the rows of s1:
fileID= fopen('data.csv', 'w') ;
for k = 1:size(s1,1)
    fprintf(fileID, '%s,%s\n', s1(k,1), s1(k,2));
end
fclose(fileID)
Or, if you're using R2019a or later, you can use writematrix:
writematrix(s1, 'data.csv');
My version of MATLAB (R2016a) does not have the string type available yet, but your problem is one I was having regularly with cell arrays of character vectors. The trick I was using to avoid using a loop for fprintf should be applicable to you.
Let's start with sample data as close to yours:
s1 = {'2F5E8693E','al1 1aj_25';
'3F5E8693E','al1 1aj_50';
'3F5E8693E','al1 1aj_50';}
Then this code usually executed much faster for me than having to loop on the matrix for writing to file:
% step 1: transpose, to get the matrix in the MATLAB default column major order
s = s1.' ;
% step 2 : Write all in a long character array
so = sprintf('%s, %s\n', s{:} ) ;
% step 3 : write to file in one go (no loop needed)
fid = fopen('data.csv', 'w') ;
fprintf(fid,'%c',so) ;
fclose(fid) ;
The only step that might be slightly different for you is step 2. I don't know if this syntax works as well on a matrix of strings as it does on a cell array of character vectors, but I'm sure there is a way to get the same result: a single long vector of characters. Once you have that, fprintf will write it to the file very quickly.
Note: if the amount of data is too large and memory is limited, you might not be able to generate the long char vector. In my experience, it was still faster to use this method in chunks (each of which fits in memory) than to loop over each line of the matrix.

Using Linux AIO, able to do IOs but writing garbage as well into the file

This might seem silly, but, I am using libaio ( not posix aio), I am able to write something into the file, but I am also writing extra stuff into the file.
I read about the alignment requirement and the data type of the buffer field of iocb.
Here is the code sample ( only relevant sections of use, for representation )
aio_context_t someContext;
struct iocb somecb;
struct io_event someevents[1];
struct iocb *somecbs[1];
somefd = open("/tmp/someFile", O_RDWR | O_CREAT);
char someBuffer[4096];
... // error checks
someContext = 0; // this is necessary
io_setup(32, &someContext ); // no error checks pasted here
strcpy(someBuffer, "hello stack overflow");
memset(&somecb, 0, sizeof(somecb));
somecb.aio_fildes = somefd ;
somecb.aio_lio_opcode = IOCB_CMD_PWRITE;
somecb.aio_buf = (uint64_t)someBuffer;
somecb.aio_offset = 0;
somecb.aio_nbytes = 100; // // //
// I am leaving the memalign and sysconf page-size part out of this sample
somecbs[0] = &somecb; // address of the solid struct, avoiding heap
// avoiding error checks for this sample listing
io_submit(someContext, 1, somecbs);
// not checking for events count or errors
io_getevents(someContext, 1, 1, someevents, NULL);
The Output:
This code does create the file, and does write the intended string
hello stack overflow into the file /tmp/someFile.
The problem:
The file /tmp/someFile also contains after the intended string, in series,
#^#^#^#^#^#^#^#^#^ and some sections from the file itself ( code section), can say garbage.
I am certain to an extent that this is some pointer gone wrong in the data field, but cannot crack this.
How to use aio ( not posix) to write exactly and only 'hello world' into a file?
I am aware that aio calls might be not supported on all file systems as of now. The one I am running against does support.
Edit - If you want the starter pack for this attempt , you can get from here.
http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
Edit 2 : Carelessness, I was setting up more number of bytes to write to within the file, and the code was honoring it. Put simply, to write 'hw' exactly one needed no more than 2 bytes in the bytes field of iocb.
There are a few things going on here. First up, the alignment requirement that you mentioned is either 512 bytes or 4096 bytes, depending on your underlying device. Try 512 bytes to start. It applies to:
The offset that you're writing in the file must be a multiple of 512 bytes. It can be 0, 512, 1024, etc. You can write at offset 0 like you're doing here, but you can't write at offset 100.
The length of data that you're writing to the file must be a multiple of 512 bytes. Again, you can write 512 bytes, 1024 bytes, or 2048 bytes, and so on - any multiple of 512. You can't write 100 bytes like you're trying to do here.
The address of the memory that contains the data you're writing must be a multiple of 512. (I typically use 4096, to be safe.) Here, you'll need to be able to do someBuffer % 512 and get 0. (With the code the way it is, it most likely won't be.)
In my experience, failing to meet any of the above requirements doesn't actually give you an error back! Instead, it'll complete the I/O request using normal, regular old blocking I/O.
Unaligned I/O: If you really, really need to write a smaller amount of data or write at an unaligned offset, then things get tricky even above and beyond the io_submit interface. You'll need to do an aligned read to cover the range of data that you need to write, then modify the data in memory and write the aligned region back to disk.
For example, say you wanted to modify offset 768 through 1023 on the disk. You'd need to read 512 bytes at offset 512 into a buffer. Then, memcpy() the 256 bytes you wanted to write into that buffer at offset 256. Finally, you issue a write of the 512 byte buffer at offset 512.
Uninitialized Data: As others have pointed out, you haven't fully initialized the buffer that you're writing. Use memset() to initialize it to zero to avoid writing junk.
Allocating an Aligned Pointer: To meet the pointer requirements for the data buffer, you'll need to use posix_memalign(). For example, to allocate 4096 bytes with a 512 byte alignment restriction: posix_memalign(&ptr, 512, 4096);
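The read-modify-write sequence and the aligned allocation described above can be sketched together as follows. This is a minimal illustration only: it uses plain pread() and pwrite() rather than io_submit() so that the offset arithmetic is easy to follow, the function name rmw_write and the BLOCK constant are made up for this example, and it assumes a 512 byte alignment. It also rounds the file length up to a block multiple when the write lands near the end of the file.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK 512 /* assumed alignment; some devices need 4096 */

/* Write len bytes at an arbitrary offset using only block-aligned I/O.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int rmw_write(int fd, off_t offset, const void *data, size_t len)
{
    /* Round the start down and the end up to block boundaries. */
    off_t start = offset - (offset % BLOCK);
    off_t end = offset + (off_t)len;
    end = ((end + BLOCK - 1) / BLOCK) * BLOCK;
    size_t span = (size_t)(end - start);

    /* Aligned buffer, zeroed so bytes past end-of-file are not junk. */
    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK, span) != 0)
        return -1;
    memset(buf, 0, span);

    /* 1. Aligned read covering the range (a short read at EOF is fine). */
    if (pread(fd, buf, span, start) < 0) {
        free(buf);
        return -1;
    }

    /* 2. Patch the new bytes into the in-memory copy. */
    memcpy((char *)buf + (offset - start), data, len);

    /* 3. Write the whole aligned region back. */
    ssize_t written = pwrite(fd, buf, span, start);
    free(buf);
    return written == (ssize_t)span ? 0 : -1;
}
```

For the example in the text, rmw_write(fd, 768, data, 256) reads 512 bytes at offset 512, copies the data to position 256 of the buffer, and writes the 512 bytes back at offset 512.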
Lastly, consider whether you need to do this at all. Even in the best of cases, io_submit still "blocks", albeit at the 10 to 100 microsecond level. Normal blocking I/O with pread and pwrite offers a great many benefits to your application. And, if it becomes onerous, you can relegate it to another thread. If you've got a latency-sensitive app, you'll need to do io_submit in another thread anyway!

Flipping a file with a constant amount of use of file positioning functions

Write a function that flips all the bits in a file such that the last bit will be the first, the first bit will be the last, and so on.
Note: only a constant amount of use of file positioning functions is allowed (such as fseek), i.e, the use of file positioning functions isn't dependent on the size of the file.
We have to reverse the bits within each byte and the order of the bytes in the file. The function that flips a byte is pretty trivial, but I can't think of a way of flipping the bytes in the file without using fseek. I could read the whole file into an array, but that doesn't seem right: if the file is huge, it won't work.
Any hints or ideas please?
Suppose you have two buffers of size B bytes, and your file is M*B bytes long. Then, if you can use an auxiliary file, you could use the following algorithm, given in pseudo-code, to solve your problem with a number of fseek calls equal to 0. Let's call F1 the original file and F2 the auxiliary file; at each iteration the roles of the two files are swapped:
for (i = 0; i<M; i++) {
    read a block of F1 in buffer1 and flip its bits
    for (j = 0; j < (M-i-1); j++) {
        read a block from F1 in buffer2
        write buffer2 to F2
    }
    write buffer1 to F2
    for (j = (M-i-1); j < M-1; j++) {
        read a block from F1 in buffer2
        write buffer2 to F2
    }
    exchange the role of F1 and F2
}
The final result is in F1.
The algorithm in practice reverses the bits of each block, and writes it in the "correct" position in the auxiliary file. The number of passes is equal to the number of blocks, so one should have two buffers as large as the memory can afford.
Of course the algorithm is inefficient, since it rewrites the entire file M times, but respects the given assignment.
Update: this answer does not use a constant number of repositioning in the output file.
This is a solution inspired by @Renzo's answer. In fact, it is a simplified version of that answer.
Using the same notations, let's say the input file is F1 and its size is B * M where B is the size of two buffers (b1 and b2) we have in memory and F2 is the output file.
The idea is simple: read a block of B bytes into buffer b1, flip its bits into b2 (you can also flip them in place, if you prefer), and write the data from buffer b2 into F2 as many times as needed until it lands in its final position in F2. Rewind F2 and repeat until all the blocks from F1 are processed. On each iteration, the content of buffer b2 is written to F2 one time fewer than on the previous iteration.
The (pseudo-)code
// Open the files
fopen(F1);
fopen(F2);
// Process the input in blocks of size B
for (i = 0; i < M; i ++) {
    // Read the next block of B bytes from F1 into b1
    fread(b1, B, 1, F1);
    // Flip the bits from b1 into b2
    flip_the_bits(b1, b2);
    // Rewind F2, write b2 in it several times
    rewind(F2);
    // Write b2 in F2 once for each block not written yet
    // This overwrites the blocks written on the previous iteration
    // except for the last one
    for (j = i; j < M; j ++) {
        fwrite(b2, B, 1, F2);
    }
}
// Close the files
fclose(F1);
fclose(F2);
Remarks about the code
If the file size is not a multiple of B then M is rounded up to the nearest integer and the iteration when i == 0 and j == M-1 must take care to write only the needed bytes into the output file (only the tail of the b2 buffer).
There is no positioning in the input file (it is open, read head-to-tail and closed) but there are a lot of rewinds (M) on the output file and also a lot of data is rewritten. There are M*(M+1)/2 writes for M useful blocks of data. Rewinding is a repositioning.
In the extreme case when B == 1 and M == filesize(F1) the code has maximum efficiency on the used memory (two bytes for the two buffers or a single byte if the flip is done in place) but the worst performance on time (positioning and writes). More memory (as much as possible) makes it run as fast as possible.
A little history
The problem is probably 40 or 50 years old. The storage devices at that time were magnetic tapes, and repositioning on magnetic tape takes time. This is why the problem asks for as few repositionings as possible. And this is why the C function that positions the pointer at the beginning of the file is named rewind().
As @Joachim notes in a comment on the question, we need to use fseek() once to go to the end of the file to find its size and once more to go back to the beginning of the file and start processing it.
Maybe this was the way to go back then when the magnetic tapes were the cutting edge storage technology, I don't know. The modern operating systems provide functions to find the size of a file without needing to fseek() to its end and back.
Unless there's some obscure nonstandard function to reverse the direction in which the file pointer advances, I see no way around reading the whole thing into a large buffer and going from there.
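Once the whole file sits in one buffer, the reversal itself is short. The following is a sketch of that in-memory step only, assuming the buffer has already been filled; the names reverse_byte and reverse_bits are made up for this illustration, and reading and writing the file are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the bit order inside one byte (bit 0 <-> bit 7, and so on)
 * by swapping nibbles, then bit pairs, then single bits. */
uint8_t reverse_byte(uint8_t b)
{
    b = (uint8_t)(((b & 0xF0) >> 4) | ((b & 0x0F) << 4));
    b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2));
    b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
    return b;
}

/* Reverse all bits of buf in place: swap bytes end-for-end and
 * reverse the bits of each byte on the way. */
void reverse_bits(uint8_t *buf, size_t n)
{
    size_t i = 0, j = n;
    while (i < j) {
        --j;
        uint8_t front = reverse_byte(buf[i]);
        uint8_t back = reverse_byte(buf[j]);
        buf[i] = back;   /* when i == j the middle byte is reversed once */
        buf[j] = front;
        ++i;
    }
}
```

The same function could be applied to a memory-mapped region instead of a heap buffer, since it only needs a pointer and a length.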
Now in case you're not limited to use stdio and are running on a system with large address space (something with more than, say 48 bits, like any modern 64 bit architecture) you could use a memory mapping instead of reading into a buffer. Actually a better name would be address space mapping (but memory map is what stuck): Instead of reading the file into a large chunk of allocated memory, you can as well tell the operating system: "Okay, here I've got this file, and I want its contents to appear in the process' address space". The OS will do it, and accessing such an address space location will actually read from the file in-situ (no intermediary buffer involved). A file memory map can also be used for writing.
See mmap for POSIX systems and CreateFileMapping+MapViewOfFile for Windows.
I don't know if that would be considered cheating though. Also I'm eager to see the official solution.

How to handle a huge string correctly?

This may be a newbie question, but I want to avoid a buffer overflow. I read a lot of data from the registry, which will be uploaded to an SQL database. I read the data in a loop, and the data is inserted after each iteration. My problem is that this way, if I read 20 keys and the values under them (the number of keys is different on every computer), then I have to connect to the SQL database 20 times.
However, I found out that there is a way to create a stored procedure and pass the whole data to it, so that the SQL server deals with the data and I have to connect to the SQL server only once.
Unfortunately, I don't know how to handle such a big string to avoid any unexpected errors, like a buffer overflow. So my question is: how should I declare this string?
Should I just make a string like char string[ 15000 ]; and concatenate the values? Or is there a simpler way of doing this?
Thanks!
STL strings should do a much better job than the approach you have described.
You'll also need to build in some thresholds. For example, if your string grows beyond a megabyte, it is worth considering making separate SQL connections, since your transaction will be too long.
You may read (key, value) pairs from a registry and store them into a preallocated buffer while there is sufficient space there.
Maintain "write" position within the buffer. You could use it to check whether there is enough space for new key,value pair in the buffer.
When there is no space left for new (key,value) pair - execute stored procedure and reset "write" position within the buffer.
At the end of the "read key, value pairs" loop - check buffer's 'write" position and execute stored procedure if it is greater than 0.
This way you will minimize number of times you execute stored procedure on a server.
const int MAX_BUFFER_SIZE = 15000;
char buffer[MAX_BUFFER_SIZE];
int buffer_pos = 0; // "write" position within the buffer (must hold values up to MAX_BUFFER_SIZE).
...
// Retrieve key, value pairs and push them into the buffer.
while(get_next_key_value(key, value)) {
    post(key, value);
}
// Execute stored procedure if buffer is not empty.
if(buffer_pos > 0) {
    exec_stored_procedure(buffer);
}
...
bool post(const char* key, const char* value)
{
    int len = strlen(key) + strlen(value) + <length of separators>;
    // Execute stored procedure if there is no space for new key/value pair.
    if(len + buffer_pos >= MAX_BUFFER_SIZE) {
        exec_stored_procedure(buffer);
        buffer_pos = 0; // Reset "write" position.
    }
    // Copy key, value pair to the buffer if there is sufficient space.
    if(len + buffer_pos < MAX_BUFFER_SIZE) {
        <copy key, value to the buffer, starting from "write" position>
        buffer_pos += len; // Adjust "write" position.
        return true;
    }
    else {
        return false;
    }
}
bool exec_stored_procedure(const char* buf)
{
    <connect to SQL database and execute stored procedure.>
}
To do this properly in C you need to allocate the memory dynamically, using malloc or one of the operating system equivalents. The idea here is to figure out how much memory you actually need and then allocate the correct amount. The registry functions provide various ways you can determine how much memory you need for each read.
It gets a bit trickier if you're reading multiple values and concatenating them. One approach would be to read each value into a separately allocated memory block, then concatenate them to a new memory block once you've got them all.
However, it may not be necessary to go to this much trouble. If you can say "if the data is more than X bytes the program will fail" then you can just create a static buffer as you suggest. Just make sure that you provide the registry and/or string concatenation functions with the correct size for the remaining part of the buffer, and check for errors, so that if it does fail it fails properly rather than crashing.
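As a rough sketch of what "providing the correct size for the remaining part of the buffer" can look like, the helper below appends one key/value pair with snprintf() and reports failure instead of overflowing. The function name append_pair and the "key=value;" separator format are assumptions made for this illustration; they do not come from the question.

```c
#include <stddef.h>
#include <stdio.h>

/* Append "key=value;" to buf. *used is the number of characters already
 * stored (not counting the terminating NUL) and cap is the total size.
 * Returns 0 on success, -1 if the pair does not fit; on failure the
 * previous contents of buf are preserved. (Hypothetical helper.) */
int append_pair(char *buf, size_t cap, size_t *used,
                const char *key, const char *value)
{
    if (*used >= cap)
        return -1;                 /* no room even for the terminator */

    size_t room = cap - *used;     /* space left, including the NUL */
    int n = snprintf(buf + *used, room, "%s=%s;", key, value);
    if (n < 0 || (size_t)n >= room) {
        buf[*used] = '\0';         /* discard the truncated output */
        return -1;
    }
    *used += (size_t)n;
    return 0;
}
```

A caller would start with an empty buffer and used set to 0, call append_pair for each registry value, and treat a return of -1 as the signal to send what has been collected so far or to report an error.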
One more note: char buf[15000]; is OK provided the declaration is in program scope, but if it appears in a function you should add the static specifier. Implicitly allocated memory in a function is by default taken from the stack, so a large allocation is likely to fail and crash your program. (Fifteen thousand bytes should be OK but it's not a good habit to get into.)
Also, it is preferable to define a macro for the size of your buffer, and use it consistently:
#define BUFFER_SIZE 15000
char buf[BUFFER_SIZE];
so that you can easily increase the size of the buffer later on by modifying a single line.

Resources