I have a file, say myfile. Using Rust, I would like to open myfile, and read bytes N to M into a Vec, say myvec. What is the most idiomatic way to do so? Naively, I thought of using bytes(), then skip, take and collect, but that sounds so inefficient.
The most idiomatic (to my knowledge) and relatively efficient way:
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

let start = 10;
let count = 10;

let mut f = File::open("/etc/passwd")?;
f.seek(SeekFrom::Start(start))?;
let mut buf = vec![0; count];
f.read_exact(&mut buf)?;
You indicated in the comments that you were concerned about the overhead of zeroing the memory before reading into it. Indeed there is a nonzero cost to this, but it's usually negligible compared to the I/O operations needed to read from a file, and the advantage is that your code remains 100% sound. But for educational purposes only, I tried to come up with an approach that avoids the zeroing.
Unfortunately, even with unsafe code, we cannot safely pass an uninitialized buffer to read_exact because of this paragraph in the documentation (emphasis mine):
No guarantees are provided about the contents of buf when this function is called, implementations cannot rely on any property of the contents of buf being true. It is recommended that implementations only write data to buf instead of reading its contents.
So it's technically legal for File::read_exact to read from the provided buffer, which means we cannot legally pass uninitialized data here (using MaybeUninit).
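If you want to avoid the zeroing without any unsafe code, one sound alternative (my sketch, not part of the original answer) is to combine Read::take with Read::read_to_end, which appends into the Vec's spare capacity without requiring it to be pre-zeroed:

use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

let start = 10;
let count = 10;

let mut f = File::open("/etc/passwd")?;
f.seek(SeekFrom::Start(start))?;
// `take` caps the read at `count` bytes; `read_to_end` grows the
// Vec as it reads, so no zero-fill is needed up front.
let mut buf = Vec::with_capacity(count);
f.take(count as u64).read_to_end(&mut buf)?;
// Unlike `read_exact`, a short read (EOF before `count` bytes) is
// not an error here, so check the length if you need exactly `count`.
assert_eq!(buf.len(), count);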
The existing answer works, but it reads the entire block you're after into a Vec in memory. If the block is huge, or you have no use for it in memory, you ideally want an io::Read which you can copy straight into another file or pass to another API.

If your source implements Read + Seek, then you can seek to the start position and then use Read::take to read only a specific number of bytes.
use std::{fs::File, io::{self, Read, Seek, SeekFrom}};
let start = 20;
let length = 100;
let mut input = File::open("input.bin")?;
// Seek to the start position
input.seek(SeekFrom::Start(start))?;
// Create a reader with a fixed length
let mut chunk = input.take(length);
let mut output = File::create("output.bin")?;
// Copy the chunk into the output file
io::copy(&mut chunk, &mut output)?;
I see frequent mention that Swift arrays, due to copy-on-write, are not thread-safe, but I have found that this works, as it updates different, unique elements of an array from different threads simultaneously:
//pixels is [(UInt8, UInt8, UInt8)]
let q = DispatchQueue(label: "processImage", attributes: .concurrent)
q.sync {
    DispatchQueue.concurrentPerform(iterations: n) { i in
        ... do work ...
        pixels[i] = ... store result ...
    }
}
(simplified version of this function)
If threads never write to the same indexes, does copy-on-write still interfere with this? I'm wondering if this is safe since the array itself is not changing length or memory usage. But it does seem that copy-on-write would prevent the array from staying consistent in such a scenario.
If this is not safe, and since doing parallel computations on images (pixel arrays) or other data stores is a common requirement in parallel computation, what is the best idiom for this? Is it better that each thread have its own array and then they are combined after all threads complete? It seems like additional overhead and the memory juggling from creating and destroying all these arrays doesn't feel right.
Updated answer:
Having thought about this some more, I suppose the main thing is that there's no copy-on-write happening here either way.
COW happens because arrays (and dictionaries, etc.) in Swift behave as value types. With value types, if you pass a value to a function, you're actually passing a copy of the value. But with arrays you really don't want to do that, because copying the entire array is a very expensive operation. So Swift will only perform the copy when the new copy is edited.
But in your example, you're not actually passing the array around in the first place, so there's no copy on write happening. The array of pixels exists in some scope, and you set up a DispatchQueue to update the pixel values in place. Copy-on-write doesn't come into play here because you're not copying in the first place.
I see frequent mention that Swift arrays, due to copy-on-write, are not thread-safe
To the best of my knowledge, this is more or less the opposite of the actual situation. Swift arrays are thread-safe because of copy-on-write. If you make an array and pass it to multiple different threads which then edit the array (the local copy of it), it's the thread performing the edits that will make a new copy for its editing; threads only reading data will keep reading from the original memory.
Consider the following contrived example:
import Foundation
/// Replace a random element in the array with a random int
func mutate(array: inout [Int]) {
    let idx = Int.random(in: 0..<array.count)
    let val = Int.random(in: 1000..<10_000)
    array[idx] = val
}
class Foo {
    var numbers: [Int]

    init(_ numbers: [Int]) {
        // No copying here; the local `numbers` property
        // will reference the same underlying memory buffer
        // as the input array of numbers. The reference count
        // of the underlying buffer is increased by one.
        self.numbers = numbers
    }

    func mutateNumbers() {
        // Copy on write can happen when we call this function,
        // because we are not allowed to edit the underlying
        // memory buffer if more than one array references it.
        // If we have unique access (refcount is 1), we can safely
        // edit the buffer directly.
        mutate(array: &self.numbers)
    }
}
var numbers = [0, 1, 2, 3, 4, 5]
var foo_instances: [Foo] = []
for _ in 0..<4 {
    let t = Thread {
        let f = Foo(numbers)
        // Note: appending to `foo_instances` from several threads
        // at once is itself a data race; it's left unguarded here
        // only because this example is about the `numbers` buffer.
        foo_instances.append(f)
        for _ in 0..<5_000_000 {
            f.mutateNumbers()
        }
    }
    t.start()
}
for _ in 0..<5_000_000 {
    // Copy on write can potentially happen here too,
    // because we can get here before the threads have
    // started mutating their arrays. If that happens,
    // the *original* `numbers` array in the global will
    // make a copy of the underlying buffer, point to
    // the new one, and decrement the reference count of the
    // previous buffer, potentially releasing it.
    mutate(array: &numbers)
}
print("Global numbers:", numbers)
for foo in foo_instances {
    print(foo.numbers)
}
Copy-on-write can happen when the threads mutate their numbers, and it can happen when the main thread mutates the original array, but in neither case will it affect any of the data used by the other objects.

Arrays and copy-on-write are both thread-safe. The copying is done by the party responsible for the editing, not by the other instances referencing the memory, so two threads will never step on each other's toes here.
However, what you're doing isn't triggering copy-on-write in the first place, because the different threads are writing to the array in place. You're not passing the value of the array to the queue. Due to how the closure works, it's more akin to using the inout keyword on a function. The reference count of the underlying buffer remains 1, but the reference count of the array goes up, because the threads executing the work are all pointing to the same array. This means that COW doesn't come into play at all.
As for this part:
If this is not safe, and since doing parallel computations on images (pixel arrays) or other data stores is a common requirement in parallel computation, what is the best idiom for this?
It depends. If you're simply doing a parallel map function, executing some function on each pixel that depends solely on the value of that pixel, then just doing a concurrentPerform for each pixel seems like it should be fine. But if you want to do something like apply a multi-pixel filter (like a convolution for example), then this approach does not work. You can either divide the pixels into 'buckets' and give each thread a bucket for itself, or you can have a read-only input pixel buffer and an output buffer.
Old answer below:
As far as I can tell, it does actually work fine. The code below runs correctly, as best as I can tell. The dumbass recursive Fibonacci function means the later values in the input array take a while to compute. It maxes out all the CPUs in my computer, but eventually only the slowest value to compute remains (the last one), and it drops down to just one thread being used.
As long as you're aware of all the risks of multi-threading (don't read the same data you're writing, etc), it does seem to work.
I suppose you could use withUnsafeMutableBufferPointer on the input array to make sure that there's no overhead from COW or reference counting.
import Foundation
func stupidFib(_ n: Int) -> Int {
    guard n > 1 else {
        return 1
    }
    return stupidFib(n-1) + stupidFib(n-2)
}
func parallelMap<T>(over array: inout [T], transform: (T) -> T) {
    DispatchQueue.concurrentPerform(iterations: array.count) { idx in
        array[idx] = transform(array[idx])
    }
}
var data = (0..<50).map{$0} // [0, 1, 2, 3, ..., 49]
parallelMap(over: &data, transform: stupidFib) // uses all CPU cores (sort of)
print(data) // prints the first 50 numbers in the Fibonacci sequence
I have some byte data stored in a &[u8]. I know for a fact that the data contained in the slice is float data. I want to do the equivalent of auto float_ptr = (float*)char_ptr; in C++.
I tried:
let data_slice = &body[buffer_offset..(buffer_offset + buffer_length)] as &[f32];

But this kind of cast is not allowed in Rust.
You will need to use unsafe Rust to interpret one type of data as if it is another.
You can do:
let bytes = &body[buffer_offset..(buffer_offset + buffer_length)];
let len = bytes.len();
let ptr = bytes.as_ptr() as *const f32;
let floats: &[f32] = unsafe { std::slice::from_raw_parts(ptr, len / 4) };
Note that this is Undefined Behaviour if any of these are true:
the size of the original slice is not a multiple of 4 bytes
the start address of the slice is not aligned to 4 bytes
All sequences of 4 bytes are valid f32s but, for types without that property, to avoid UB you also need to make sure that all values are valid.
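As an aside (my addition, not part of the original answer), the standard library also provides slice::align_to, which is still unsafe but does the alignment bookkeeping for you by splitting the slice into an unaligned prefix, a correctly aligned middle, and an unaligned suffix:

let bytes = &body[buffer_offset..(buffer_offset + buffer_length)];
// `floats` is the largest correctly aligned &[f32] view of the bytes;
// `prefix` and `suffix` hold the bytes that could not be reinterpreted.
let (prefix, floats, suffix) = unsafe { bytes.align_to::<f32>() };
assert!(prefix.is_empty() && suffix.is_empty(), "data was not aligned");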
Use the bytemuck library; specifically bytemuck::cast_slice(). Underneath, it is just the unsafe conversion that has already been described in other answers, but it uses compile-time (type) and run-time (length and alignment) checks to ensure that you don't have to worry about correctly checking the safety conditions.
let data_slice: &[f32] = bytemuck::cast_slice(
    &body[buffer_offset..(buffer_offset + buffer_length)]
);
Note that this will panic if the beginning and end of the slice are not aligned to 4 bytes. There is no way to avoid this requirement in Rust — if the data is not aligned, you must copy it to a new location that is aligned. (Since your goal is to produce f32s, the simplest way to do that while ensuring the alignment would be to iterate with f32::from_ne_bytes(), performing the f32 conversion and the copy.)
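A sketch of that copying fallback (my phrasing of the approach just described), using chunks_exact with f32::from_ne_bytes, which works regardless of how the source bytes are aligned:

let bytes = &body[buffer_offset..(buffer_offset + buffer_length)];
// Each 4-byte chunk is copied out and converted, so no alignment
// requirement is placed on `bytes` itself. Any trailing bytes that
// don't form a full chunk are ignored by `chunks_exact`.
let floats: Vec<f32> = bytes
    .chunks_exact(4)
    .map(|chunk| f32::from_ne_bytes(chunk.try_into().unwrap()))
    .collect();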
I noticed that this:
let a = [Float](repeating: 0, count: len)
takes very significantly more time than just
let p = UnsafeMutablePointer<Float>.allocate(capacity: len)
However, the unsafe pointer is not so convenient to use, and one may want to create an Array<Float> to pass on to other code.
let a = Array(UnsafeBufferPointer(start: p, count: len))
But doing this absolutely kills it, and it is faster to just create the Array with zeros filled in.
Any idea how to create an Array faster and at the same time, have an actual Array<Float> handy? In the context of my project, I can probably deal with the unsafe pointer internally and wrap it with Array only when needed outside the module.
Quick test on all the answers in this post:
let len = 10_000_000

benchmark(title: "array.create", num_trials: 10) {
    let a = [Float](repeating: 0, count: len)
}
benchmark(title: "array.create faster", num_trials: 10) {
    let p = UnsafeMutableBufferPointer<Float>.allocate(capacity: len)
}
benchmark(title: "Array.reserveCapacity ?", num_trials: 10) {
    var a = [Float]()
    a.reserveCapacity(len)
}
benchmark(title: "ContiguousArray ?", num_trials: 10) {
    let a = ContiguousArray<Float>(repeating: 0, count: len)
}
benchmark(title: "ContiguousArray.reserveCapacity", num_trials: 10) {
    var a = ContiguousArray<Float>()
    a.reserveCapacity(len)
}
benchmark(title: "UnsafeMutableBufferPointer BaseMath", num_trials: 10) {
    let p = UnsafeMutableBufferPointer<Float>(len) // Jeremy's BaseMath
    print(p.count)
}
Results: (on 10 million floats)
array.create: 9.256 ms
array.create faster: 0.004 ms
Array.reserveCapacity ?: 0.264 ms
ContiguousArray ?: 10.154 ms
ContiguousArray.reserveCapacity: 3.251 ms
UnsafeMutableBufferPointer BaseMath: 0.049 ms
I am doing this ad hoc, running an app on the iPhone simulator in Release mode. I know I should probably do this from the command line as a standalone program, but since I plan to write this as part of an app, this may be alright.

For what I tried to do, UnsafeMutableBufferPointer seemed great, but you have to use BaseMath and all its conformances. If you are after something more general, or a different context, be sure to read everything and decide which approach works for you.
If you need performance and know the size you need, you can use reserveCapacity(_:). This will preallocate the memory needed for the contents of the array. Per the Apple documentation:
If you are adding a known number of elements to an array, use this method to avoid multiple reallocations. This method ensures that the array has unique, mutable, contiguous storage, with space allocated for at least the requested number of elements.
Calling the reserveCapacity(_:) method on an array with bridged storage triggers a copy to contiguous storage even if the existing storage has room to store minimumCapacity elements.
For performance reasons, the size of the newly allocated storage might be greater than the requested capacity. Use the array’s capacity property to determine the size of the new storage.
This is the closest thing to what I want. There's a library called BaseMath (started by Jeremy Howard), which has a new class called AlignedStorage as well as UnsafeMutableBufferPointer support. It is endowed with a lot of math and is pretty-to-very fast too, so it reduces a lot of the pointer management involved in juggling math algorithms.

But this remains to be tested, as the project is very new. I will leave this question open to see if someone can suggest something better.

Note: this is the fastest in the context of what I am doing. If you really need a good struct value-type Array (and variants), see the other answers.
There are several questions already on Stack Overflow about allocating an array (say [i32]) on the heap. The general recommendation is boxing, e.g. Box<[i32]>. But while boxing works fine enough for smaller arrays, the problem is that the array being boxed has to first be allocated on the stack.
So if the array is too large (say 10 million elements), you will - even with boxing - get a stack overflow (one is unlikely to have a stack that large).
The suggestion then is using Vec<T> instead, that is Vec<i32> in our example. And while that does do the job, it does have a performance impact.
Consider the following program:
fn main() {
    const LENGTH: usize = 10_000;
    let mut a: [i32; LENGTH] = [0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}
time tells me that this program takes about 2.9 seconds to run. I use 10,000 in this example so I can allocate it on the stack, but I really want one with 10 million elements.
Now consider the same program but with Vec<T> instead:
fn main() {
    const LENGTH: usize = 10_000;
    let mut a: Vec<i32> = vec![0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}
time tells me that this program takes about 5 seconds to run. Now time isn't super exact, but the difference of about 2 seconds for such a simple program is not an insignificant impact.
Storage is storage: the array program is just as fast when the array is boxed. So it's not the heap slowing the Vec<T> version down, but the Vec<T> structure itself.
I also tried with a HashMap (specifically HashMap<usize, i32> to mimic an array structure), but that's far slower than the Vec<T> solution.
If my LENGTH had been 10 million, the first version wouldn't even have run.

So: is there a way to allocate a plain array directly on the heap? And if that's not possible, is there a structure that behaves like an array (and Vec<T>) on the heap, but can match the speed and performance of an array?
Summary: your benchmark is flawed; just use a Vec (as described here), possibly with into_boxed_slice, as it is incredibly unlikely to be slower than a heap-allocated array.

Unfortunately, your benchmarks are flawed. First of all, you probably didn't compile with optimizations (--release for cargo, -O for rustc). If you had, the Rust compiler would have removed all of your code. See the assembly here. Why? Because you never observe the vector/array, so there is no need to do all that work in the first place.

Also, your benchmark is not testing what you actually want to test. You are comparing a stack-allocated array with a heap-allocated vector. You should compare the Vec to a heap-allocated array.

Don't feel bad, though: writing benchmarks is incredibly hard, for many reasons. Just remember: if you don't know a lot about writing benchmarks, it's better not to trust your own benchmarks without asking others first.
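For example (a sketch of mine, using std::hint::black_box, which postdates the original answer), you can force the optimizer to treat the result as observed, so the writes cannot be removed as dead code:

use std::hint::black_box;

fn main() {
    const LENGTH: usize = 10_000;
    let mut a: Vec<i32> = vec![0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
    // Opaque to the optimizer: the writes above can no longer be
    // proven dead and deleted.
    black_box(&a);
}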
I fixed your benchmark and included all three possibilities: Vec, array on stack and array on heap. You can find the full code here. The results are:
running 3 tests
test array_heap ... bench: 9,600,979 ns/iter (+/- 1,438,433)
test array_stack ... bench: 9,232,623 ns/iter (+/- 720,699)
test vec_heap ... bench: 9,259,912 ns/iter (+/- 691,095)
Surprise: the differences between the versions are far smaller than the variance of the measurement. So we can assume they are all pretty much equally fast.
Note that this benchmark is still pretty bad. The two loops could be replaced by one loop setting all array elements to LENGTH - 1. From a quick look at the assembly (and from the rather long time of 9 ms), I think that LLVM is not smart enough to actually perform this optimization. But things like this are important, and one should be aware of them.
Finally, let's discuss why both solutions should be equally fast and whether there are actually differences in speed.
The data section of a Vec<T> has exactly the same memory layout as a [T]: just many Ts contiguously in memory. Super simple. This also means both exhibit the same caching-behavior (specifically, being very cache-friendly).
The only difference is that a Vec might have more capacity than elements. So Vec itself stores (pointer, length, capacity). That is one word more than a simple (boxed) slice (which stores (pointer, length)). A boxed array doesn't need to store the length, as it's already in the type, so it is just a simple pointer. Whether or not we store one, two or three words is not really important when you will have millions of elements anyway.
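To make that concrete (a quick check of my own, not part of the original answer), these are the handle sizes on a typical 64-bit target:

use std::mem::size_of;

fn main() {
    // Vec<i32>: (pointer, length, capacity) = 3 words.
    assert_eq!(size_of::<Vec<i32>>(), 24);
    // Box<[i32]>: fat pointer (pointer, length) = 2 words.
    assert_eq!(size_of::<Box<[i32]>>(), 16);
    // Box<[i32; 1000]>: the length lives in the type = 1 word.
    assert_eq!(size_of::<Box<[i32; 1000]>>(), 8);
}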
Accessing one element is the same for all three: we do a bounds check first and then calculate the target pointer via base_pointer + size_of::<T>() * index. But it's important to note that the array storing its length in the type means that the bounds check can be removed more easily by the optimizer! This can be a real advantage.
However, bounds checks are usually removed by the smart optimizer anyway. In the benchmark code I posted above, there are no bounds checks in the assembly. So while a boxed array could be a bit faster thanks to removed bounds checks, (a) this will be a minor performance difference, and (b) it's very unlikely that you will run into a lot of situations where the bounds check is removed for the array but not for the Vec/slice.
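When the optimizer does not remove the checks, you can usually sidestep them by writing through an iterator instead of indexing. A general sketch (my example, not specific to this benchmark; fill_demo is a hypothetical helper):

// Hypothetical helper for illustration.
fn fill_demo(v: &mut [i32], value: i32) {
    // Indexed form: each `v[i]` is bounds-checked unless the
    // optimizer can prove `i < v.len()` for every iteration.
    for i in 0..v.len() {
        v[i] = value;
    }
    // Iterator form: there is no index, so there is no check to elide.
    for x in v.iter_mut() {
        *x = value;
    }
    // Or simply:
    v.fill(value);
}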
If you really want a heap-allocated array, i.e. Box<[i32; LENGTH]>, then you can use:

fn main() {
    const LENGTH: usize = 10_000_000;
    let mut a = {
        // Allocate and zero-initialize directly on the heap.
        // (Calling `set_len` on an uninitialized Vec would violate
        // its safety contract, so we let `vec!` initialize the
        // elements to a known value instead.)
        let v: Vec<i32> = vec![0; LENGTH];
        let slice = v.into_boxed_slice();
        let raw_slice = Box::into_raw(slice);
        // Using `from_raw` is safe as long as the pointer was
        // retrieved using `into_raw` and the length matches
        // LENGTH exactly.
        unsafe { Box::from_raw(raw_slice as *mut [i32; LENGTH]) }
    };
    // This is the micro benchmark from the question.
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}
It's not going to be faster than using a vector since Rust does bounds-checking even on arrays, but it has a smaller interface which might make sense in terms of software design.
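If profiling ever shows those bounds checks to matter, they can be bypassed explicitly in unsafe code (a sketch; rarely worth it in practice):

// `get_unchecked` skips the bounds check; the caller must guarantee
// the index is in range, or the behavior is undefined.
let last = unsafe { *a.get_unchecked(LENGTH - 1) };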
I have an array of UInt32, what is the most efficient way to write it into a binary file in Crystal lang?
So far I am using the IO#write_byte(byte : UInt8) method, but I believe there should be a way to write bigger chunks than one byte at a time.
You can directly write a Slice(UInt8) to any IO, which should be faster than iterating each item and writing each byte one by one.
The trick is to access the Array(UInt32)'s internal buffer as a Pointer(UInt8) then make it a Slice(UInt8), which can be achieved with some unsafe code:
array = [1_u32, 2_u32, 3_u32, 4_u32]

File.open("out.bin", "w") do |f|
  # Reinterpret the array's internal buffer as bytes and
  # write it in a single call.
  ptr = array.to_unsafe.as(UInt8*)
  f.write ptr.to_slice(array.size * sizeof(UInt32))
end
Be sure to never keep a reference to ptr around; see Array#to_unsafe for details.