What is the best way to read multiple files into one?

Here's a benchmark comparing two functions that read multiple files into a single one. One uses read and the other uses read_to_end. My original motivation was to have the buffer's capacity equal its len at the end of the process. This did not happen with read_to_end, which was quite unsatisfactory.
With read, however, it works: the assert_eq!(buf.capacity(), buf.len()); in read_files_into_file2 (which uses read) does not panic.
use criterion::{criterion_group, criterion_main, Criterion};
use std::io::Read;
use std::io::Write;
use std::{
    fs,
    io::{self, Seek},
};

fn criterion_benchmark(c: &mut Criterion) {
    let mut files = get_test_files().unwrap();
    let mut file = fs::File::create("output").unwrap();

    c.bench_function("1", |b| {
        b.iter(|| {
            read_files_into_file1(&mut files, &mut file).unwrap();
        })
    });

    c.bench_function("2", |b| {
        b.iter(|| {
            read_files_into_file2(&mut files, &mut file).unwrap();
        });
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
/// Goes back to the start so that the files can be read again from the start.
fn reset(files: &mut Vec<fs::File>, file: &mut fs::File) {
    file.seek(io::SeekFrom::Start(0)).unwrap();
    for file in files {
        file.seek(io::SeekFrom::Start(0)).unwrap();
    }
}
pub fn read_files_into_file1(files: &mut Vec<fs::File>, file: &mut fs::File) -> io::Result<()> {
    reset(files, file);

    let total_len = files
        .iter()
        .map(|file| file.metadata().unwrap().len())
        .sum::<u64>() as usize;
    let mut buf = Vec::<u8>::with_capacity(total_len);

    for file in files {
        file.read_to_end(&mut buf)?;
    }

    file.write_all(&buf)?;
    // assert_eq!(buf.capacity(), buf.len());

    Ok(())
}
fn read_files_into_file2(files: &mut Vec<fs::File>, file: &mut fs::File) -> io::Result<()> {
    reset(files, file);

    let total_len = files
        .iter()
        .map(|file| file.metadata().unwrap().len())
        .sum::<u64>() as usize;
    let mut vec: Vec<u8> = vec![0; total_len];
    let mut buf = &mut vec[..];

    for file in files {
        match file.read(&mut buf) {
            Ok(n) => {
                buf = &mut buf[n..];
            }
            Err(err) if err.kind() == io::ErrorKind::Interrupted => {}
            Err(err) => return Err(err),
        }
    }

    file.write_all(&vec)?;
    // assert_eq!(vec.capacity(), vec.len());

    Ok(())
}
/// Creates 5 files, each containing "hello world" repeated 500 times.
fn get_test_files() -> io::Result<Vec<fs::File>> {
    let mut files = Vec::<fs::File>::new();

    for index in 0..5 {
        let mut file = fs::OpenOptions::new()
            .read(true)
            .write(true)
            .truncate(true)
            .create(true)
            .open(&format!("test{}", index))?;

        file.write_all("hello world".repeat(500).as_bytes())?;
        files.push(file);
    }

    Ok(files)
}
If you uncomment the assert_eq!s, you will see that only read_files_into_file1 (which uses read_to_end) fails, with this panic:
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `55000`,
right: `27500`', benches/bench.rs:53:5
read_files_into_file1 allocates way more memory than needed while read_files_into_file2 allocates the optimal amount.
Despite that, the results say that they perform almost the same (read_files_into_file1 takes 11.439 us and read_files_into_file2 takes 11.098 us):
1 time: [11.417 us 11.439 us 11.463 us]
change: [+3.7987% +3.9997% +4.1984%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
2 time: [11.085 us 11.098 us 11.112 us]
change: [+0.1255% +0.5081% +0.9545%] (p = 0.01 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
I expected read_files_into_file2 to be much faster, but it even turned out to be slower when I increased the file size. Why does read_files_into_file2 not meet my expectations, and what is the most efficient way to read multiple files into one?

read_to_end generally isn't a good idea when dealing with large files, since it tries to read the whole file into memory, which can lead to swapping or out-of-memory errors.
On Linux, and assuming single-threaded execution, using io::copy should be the fastest method, since it contains optimizations for this case.
On other platforms, using io::copy and wrapping the writer side in a BufWriter lets you control the buffer size used for copying, which helps amortize syscall costs.
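For illustration, here is a minimal sketch of that approach. The concat_files helper is made up for the example; the file names mirror the question's benchmark ("test0".."test4" and "output"), and the 128 KiB buffer size is an arbitrary choice:
use std::fs;
use std::io::{self, BufWriter, Write};

fn concat_files(paths: &[&str], dest: &str) -> io::Result<()> {
    // Wrap the destination in a BufWriter so that, on platforms without a
    // specialized fast path, the copy doesn't turn into many small write syscalls.
    let mut out = BufWriter::with_capacity(128 * 1024, fs::File::create(dest)?);
    for path in paths {
        let mut src = fs::File::open(path)?;
        // io::copy streams the data through a reused buffer; on Linux it can
        // use kernel-side copy optimizations when the source is a file.
        io::copy(&mut src, &mut out)?;
    }
    out.flush()?;
    Ok(())
}

fn main() -> io::Result<()> {
    concat_files(&["test0", "test1", "test2", "test3", "test4"], "output")
}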
If you can use multiple threads and know that the file lengths don't change then you can use platform-specific positional read/write methods such as read_at to read multiple files in parallel and write the data into the correct places in the destination file. Whether this actually provides a speedup depends on many factors. It's probably most beneficial when concatenating many small files from a network filesystem.
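As a rough, Unix-only sketch of that idea (all names are invented for the example, and it assumes the source files do not change length while it runs), one could combine FileExt::read_at / write_all_at with scoped threads, copying each source into a precomputed offset of the destination:
use std::fs;
use std::io;
use std::os::unix::fs::FileExt; // read_at / write_all_at are Unix-only

fn concat_parallel(paths: &[&str], dest: &str) -> io::Result<()> {
    let sources: Vec<fs::File> = paths.iter().map(fs::File::open).collect::<io::Result<_>>()?;
    let lens: Vec<u64> = sources
        .iter()
        .map(|f| f.metadata().map(|m| m.len()))
        .collect::<io::Result<_>>()?;
    let out = fs::File::create(dest)?;
    out.set_len(lens.iter().sum())?;

    std::thread::scope(|scope| {
        let mut handles = Vec::new();
        let mut start = 0u64;
        for (src, &len) in sources.iter().zip(&lens) {
            let offset = start;
            start += len;
            let out = &out;
            handles.push(scope.spawn(move || -> io::Result<()> {
                let mut buf = vec![0u8; 64 * 1024];
                let mut pos = 0u64;
                while pos < len {
                    // Cap the read so a concurrently growing file cannot overflow our slot.
                    let want = (len - pos).min(buf.len() as u64) as usize;
                    let n = src.read_at(&mut buf[..want], pos)?;
                    if n == 0 {
                        break; // file shrank; nothing more to copy
                    }
                    out.write_all_at(&buf[..n], offset + pos)?;
                    pos += n as u64;
                }
                Ok(())
            }));
        }
        for handle in handles {
            handle.join().expect("copy thread panicked")?;
        }
        Ok(())
    })
}
Whether this beats a simple sequential io::copy loop depends heavily on the storage; as noted above, it is most likely to help with many small files on a network filesystem.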
Beyond the standard library, there are also crates that expose platform-specific copy routines, which may be faster than a naive userspace copy.

Related

How to return arrays from Rust functions without them being copied?

I have a chain of functions that generate arrays and return them up the call stack. Roughly, the function signatures are:
fn solutions(...) -> [[u64; M]; N] { /* run iterator on lots of problem sets */ }
fn generate_solutions(...) -> impl Iterator<Item = [u64; M]> { /* call find_solution on different problem sets */ }
fn find_solution(...) -> [u64; M] { /* call validate_candidate with different candidates to find solution */ }
fn validate_candidate(...) -> Option<[u64; M]> {
    let mut table = [0; M];
    // do compute intensive work
    if works { Some(table) } else { None }
}
My understanding was that Rust will not actually copy the arrays up the call stack but will optimize the copies away.
But this isn't what I see: when I switch to Vec, I see a 20x speed improvement, with the only change being [u64; M] to Vec<u64>. So it is clearly copying the arrays over and over.
Why arrays and not Vec, everyone always asks: embedded environment, no_std.
How can I encourage Rust to optimize these array copies away?
Unfortunately, guaranteed lack of copies is currently an unsolved problem in Rust. To get the characteristics you want, you will need to explicitly pass in storage it should be written into (the “out parameter” pattern):
fn solutions(..., out: &mut [[u64; M]; N]) { ... }
fn find_solution(..., out: &mut [u64; M]) { ... }
fn validate_candidate(table: &mut [u64; M]) -> bool {
    // write into table
    works
}
Thus you will also have to find some alternative to Iterator for generate_solutions (since using Iterator implies that all the results can exist at once without overwriting each other).
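For illustration only, here is a tiny self-contained sketch of that style; the names, the const M, and the "problem" logic are placeholders invented for the example, not the asker's real code. The caller owns the storage and a callback is invoked once per solution, so no array is ever returned by value:
const M: usize = 4;

// Placeholder for the real compute-intensive check: fills `table` in place
// and reports whether the candidate worked.
fn validate_candidate(candidate: u64, table: &mut [u64; M]) -> bool {
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = candidate.wrapping_mul(i as u64 + 1);
    }
    candidate % 2 == 0
}

// Instead of returning `impl Iterator<Item = [u64; M]>`, take a callback and a
// single reusable buffer, so each solution overwrites the previous one.
fn generate_solutions(problems: &[u64], mut emit: impl FnMut(&[u64; M])) {
    let mut table = [0u64; M];
    for &candidate in problems {
        if validate_candidate(candidate, &mut table) {
            emit(&table);
        }
    }
}

fn main() {
    generate_solutions(&[1, 2, 3, 4], |solution| {
        println!("{:?}", solution);
    });
}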

Why is my Rust program slower than the equivalent Java program?

I was playing around with binary serialization and deserialization in Rust and noticed that binary deserialization is several orders of magnitude slower than in Java. To eliminate the possibility of overhead due to, for example, allocations, I'm simply reading a binary stream in each program. Each program reads from a binary file on disk which contains a 4-byte integer holding the number of input values, followed by a contiguous chunk of 8-byte big-endian IEEE 754-encoded floating point numbers. Here's the Java implementation:
import java.io.*;

public class ReadBinary {
    public static void main(String[] args) throws Exception {
        DataInputStream input = new DataInputStream(new BufferedInputStream(new FileInputStream(args[0])));
        int inputLength = input.readInt();
        System.out.println("input length: " + inputLength);
        try {
            for (int i = 0; i < inputLength; i++) {
                double d = input.readDouble();
                if (i == inputLength - 1) {
                    System.out.println(d);
                }
            }
        } finally {
            input.close();
        }
    }
}
Here's the Rust implementation:
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

fn main() {
    let args = std::env::args_os();
    let fname = args.skip(1).next().unwrap();
    let path = Path::new(&fname);
    let mut file = BufReader::new(File::open(&path).unwrap());
    let input_length: i32 = read_int(&mut file);
    for i in 0..input_length {
        let d = read_double_slow(&mut file);
        if i == input_length - 1 {
            println!("{}", d);
        }
    }
}

fn read_int<R: Read>(input: &mut R) -> i32 {
    let mut bytes = [0; std::mem::size_of::<i32>()];
    input.read_exact(&mut bytes).unwrap();
    i32::from_be_bytes(bytes)
}

fn read_double_slow<R: Read>(input: &mut R) -> f64 {
    let mut bytes = [0; std::mem::size_of::<f64>()];
    input.read_exact(&mut bytes).unwrap();
    f64::from_be_bytes(bytes)
}
I'm outputting the last value to make sure that all of the input is actually being read. On my machine, when the file contains (the same) 30 million randomly-generated doubles, the Java version runs in 0.8 seconds, while the Rust version runs in 40.8 seconds.
Suspicious of inefficiencies in Rust's byte interpretation itself, I retried it with a custom floating point deserialization implementation. The internals are almost exactly the same as what's being done in Rust's Reader, without the IoResult wrappers:
fn read_double<R: Reader>(input: &mut R, buffer: &mut [u8]) -> f64 {
    use std::mem::transmute;
    match input.read_at_least(8, buffer) {
        Ok(n) => if n > 8 { fail!("n > 8") },
        Err(e) => fail!(e)
    };
    let mut val = 0u64;
    let mut i = 8;
    while i > 0 {
        i -= 1;
        val += buffer[7 - i] as u64 << i * 8;
    }
    unsafe {
        transmute::<u64, f64>(val)
    }
}
The only change I made to the earlier Rust code in order to make this work was to create an 8-byte slice to be passed in and (re)used as a buffer in the read_double function. This yielded a significant performance gain, running in about 5.6 seconds on average. Unfortunately, this is still noticeably slower (and more verbose!) than the Java version, making it difficult to scale up to larger input sets. Is there something that can be done to make this run faster in Rust? More importantly, is it possible to make these changes in such a way that they can be merged into the default Reader implementation itself to make binary I/O less painful?
For reference, here's the code I'm using to generate the input file:
import java.io.*;
import java.util.Random;

public class MakeBinary {
    public static void main(String[] args) throws Exception {
        DataOutputStream output = new DataOutputStream(new BufferedOutputStream(System.out));
        int outputLength = Integer.parseInt(args[0]);
        output.writeInt(outputLength);
        Random rand = new Random();
        for (int i = 0; i < outputLength; i++) {
            output.writeDouble(rand.nextDouble() * 10 + 1);
        }
        output.flush();
    }
}
(Note that generating the random numbers and writing them to disk only takes 3.8 seconds on my test machine.)
When you build without optimisations, it will often be slower than it would be in Java. But build it with optimisations (rustc -O or cargo build --release) and it should be very much faster. If the standard version of it still ends up slower, it's something that should be examined carefully to figure out where the slowness is: perhaps something is being inlined that shouldn't be, or not being inlined that should be, or perhaps some optimisation that was expected is not occurring.

What is the correct way to read a binary file in chunks of a fixed size and store all of those chunks into a Vec?

I'm having trouble opening a file. Most examples read files into a String or read the entire file into a Vec. What I need is to read a file in chunks of a fixed size and store those chunks in an array (Vec) of chunks.
For example, I have a file called my_file of exactly 64 KB and I want to read it in chunks of 16 KB, so I would end up with a Vec of size 4 where each element is another Vec of 16 KB (0x4000 bytes).
After reading the docs and checking other Stack Overflow answers, I was able to come up with something like this:
let mut file = std::fs::File::open("my_file")?;
// ...calculate num_of_chunks, 4 in this case
let mut list_of_chunks = Vec::new();
for chunk in 0..num_of_chunks {
    let mut data: [u8; 0x4000] = [0; 0x4000];
    file.read(&mut data[..])?;
    list_of_chunks.push(data.to_vec());
}
Although this seems to work fine, it looks a bit convoluted. As I read it, each iteration will:
Create a new array on the stack
Read the chunk into the array
Copy the contents of the array into a new Vec and then move the Vec into the list_of_chunks Vec.
I'm not sure if it's idiomatic or even possible, but I'd rather have something like this:
Create a Vec with num_of_chunks elements, where each element is another Vec of size 16 KB.
Read each file chunk directly into the correct Vec.
No copying, and we make sure memory is allocated before reading the file.
Is that approach possible? or is there a better conventional/idiomatic/correct way to do this?
I'm wondering if Vec is the correct type for solving this. I mean, I won't need the array to grow after reading the file.
Read::read_to_end reads efficiently directly into a Vec. If you want it in chunks, combine it with Read::take to limit the amount of bytes that read_to_end will read.
Example:
let mut file = std::fs::File::open("your_file")?;
let mut list_of_chunks = Vec::new();
let chunk_size = 0x4000;

loop {
    let mut chunk = Vec::with_capacity(chunk_size);
    let n = file.by_ref().take(chunk_size as u64).read_to_end(&mut chunk)?;
    if n == 0 {
        break;
    }
    list_of_chunks.push(chunk);
    if n < chunk_size {
        break;
    }
}
The last if is not necessary, but it prevents an extra read call: if fewer than the requested number of bytes were read by read_to_end, we can expect the next read to read nothing, since we hit the end of the file.
I think the most idiomatic way would be to use an iterator. The code below (freely inspired by M-ou-se's answer):
Handles many use cases by using generic types
Uses a pre-allocated vector
Hides the side effects
Avoids copying the data twice
use std::io::{self, Read, Seek, SeekFrom};

struct Chunks<R> {
    read: R,
    size: usize,
    hint: (usize, Option<usize>),
}

impl<R> Chunks<R> {
    pub fn new(read: R, size: usize) -> Self {
        Self {
            read,
            size,
            hint: (0, None),
        }
    }

    pub fn from_seek(mut read: R, size: usize) -> io::Result<Self>
    where
        R: Seek,
    {
        let old_pos = read.seek(SeekFrom::Current(0))?;
        let len = read.seek(SeekFrom::End(0))?;

        let rest = (len - old_pos) as usize; // len is always >= old_pos but they are u64
        if rest != 0 {
            read.seek(SeekFrom::Start(old_pos))?;
        }

        let min = rest / size + if rest % size != 0 { 1 } else { 0 };
        Ok(Self {
            read,
            size,
            hint: (min, None), // this could be wrong I'm unsure
        })
    }

    // This could be useful if you want to try to recover from an error
    pub fn into_inner(self) -> R {
        self.read
    }
}
impl<R> Iterator for Chunks<R>
where
    R: Read,
{
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut chunk = Vec::with_capacity(self.size);
        match self
            .read
            .by_ref()
            .take(chunk.capacity() as u64)
            .read_to_end(&mut chunk)
        {
            Ok(n) => {
                if n != 0 {
                    Some(Ok(chunk))
                } else {
                    None
                }
            }
            Err(e) => Some(Err(e)),
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        self.hint
    }
}

trait ReadPlus: Read {
    fn chunks(self, size: usize) -> Chunks<Self>
    where
        Self: Sized,
    {
        Chunks::new(self, size)
    }
}

impl<T: ?Sized> ReadPlus for T where T: Read {}
fn main() -> io::Result<()> {
    let file = std::fs::File::open("src/main.rs")?;
    let iter = Chunks::from_seek(file, 0xFF)?; // replace with anything; 0xFF was just for testing
    println!("{:?}", iter.size_hint());

    // This iterator could return Err forever; be careful to collect it into a Result.
    let chunks = iter.collect::<Result<Vec<_>, _>>()?;
    println!("{:?}, {:?}", chunks.len(), chunks.capacity());

    Ok(())
}

Why does a File need to be mutable to call Read::read_to_string?

Here's a line from the 2nd edition Rust tutorial:
let mut f = File::open(filename).expect("file not found");
I'm under the assumption that the file descriptor is a wrapper around a number that basically doesn't change and is read-only.
The compiler complains that the file cannot be borrowed mutably, and I'm assuming it's because the method read_to_string takes self mutably, but the question is: why? What is ever going to change about the file descriptor? Is it keeping track of the cursor location or something?
error[E0596]: cannot borrow immutable local variable `fdesc` as mutable
  --> main.rs:13:5
   |
11 |     let fdesc = File::open(fname).expect("file not found");
   |         ----- consider changing this to `mut fdesc`
12 |     let mut fdata = String::new();
13 |     fdesc.read_to_string(&mut fdata)
   |     ^^^^^ cannot borrow mutably
The whole source:
use std::env;
use std::fs::File;
use std::io::Read;

fn main() {
    let args: Vec<String> = env::args().collect();
    let query = &args[1];
    let fname = &args[2];

    println!("Searching for '{}' in file '{}'...", query, fname);

    let fdesc = File::open(fname).expect("file not found"); // not mut
    let mut fdata = String::new();

    fdesc.read_to_string(&mut fdata)
        .expect("something went wrong reading the file");

    println!("Found: \n{}", fdata);
}
I'm assuming it's because the method read_to_string takes the instance as the self argument as mutable
Yes, that's correct:
fn read_to_string(&mut self, buf: &mut String) -> Result<usize>
The trait method Read::read_to_string takes the receiver as a mutable reference because in general, that's what is needed to implement "reading" from something. You are going to change a buffer or an offset or something.
Yes, an actual File may simply contain an underlying file descriptor (e.g. on Linux or macOS) or a handle (e.g. Windows). In these cases, the operating system deals with synchronizing the access across threads. That's not even guaranteed though — it depends on the platform. Something like Redox might actually have a mutable reference in its implementation of File.
If the Read trait didn't accept a &mut self, then types like BufReader would have to use things like internal mutability, reducing the usefulness of Rust's references.
See also:
Why is it possible to implement Read on an immutable reference to File?
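As a small illustration of that last point (a minimal sketch; the file name is just a placeholder): the standard library also implements Read for &File, so you can read through an immutable binding by taking a mutable reference to a shared reference.
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    let file = File::open("Cargo.toml")?; // note: not declared `mut`
    let mut contents = String::new();
    // `Read` is also implemented for `&File`, so the `&mut self` receiver here
    // is a `&mut &File`; the OS-level cursor still moves, but no `mut` binding
    // of the `File` itself is required.
    (&file).read_to_string(&mut contents)?;
    println!("read {} bytes", contents.len());
    Ok(())
}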

What is the most efficient way to read a large file in chunks without loading the entire file in memory at once?

What is the most efficient general purpose way of reading "large" files (which may be text or binary), without going into unsafe territory? I was surprised how few relevant results there were when I did a web search for "rust read large file in chunks".
For example, one of my use cases is to calculate an MD5 checksum for a file using rust-crypto (the Md5 module allows you to add &[u8] chunks iteratively).
Here is what I have, which seems to perform slightly better than some other methods like read_to_end:
use std::{
    fs::File,
    io::{self, BufRead, BufReader},
};

fn main() -> io::Result<()> {
    const CAP: usize = 1024 * 128;
    let file = File::open("my.file")?;
    let mut reader = BufReader::with_capacity(CAP, file);

    loop {
        let length = {
            let buffer = reader.fill_buf()?;
            // do stuff with buffer here
            buffer.len()
        };
        if length == 0 {
            break;
        }
        reader.consume(length);
    }

    Ok(())
}
I don't think you can write code more efficient than that. fill_buf on a BufReader over a File is basically just a straight call to read(2).
That said, BufReader isn't really a useful abstraction when you use it like that; it would probably be less awkward to just call file.read(&mut buf) directly.
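As a sketch of what that might look like (same 128 KiB buffer and the same placeholder file name as the example above; the processing step is left as a comment):
use std::fs::File;
use std::io::{self, Read};

fn main() -> io::Result<()> {
    const CAP: usize = 1024 * 128;
    let mut file = File::open("my.file")?;
    let mut buffer = vec![0u8; CAP];

    loop {
        // `read` returns how many bytes it actually read; 0 means end of file.
        let n = file.read(&mut buffer)?;
        if n == 0 {
            break;
        }
        let chunk = &buffer[..n];
        // do stuff with `chunk` here, e.g. feed it to the MD5 hasher
        let _ = chunk;
    }

    Ok(())
}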
I did it this way. I don't know if it is wrong, but it worked perfectly for me; still not sure if it is the correct way, though.
use std::io;
use std::io::prelude::*;
use std::fs::File;

fn main() -> io::Result<()> {
    const FNAME: &str = "LargeFile.txt";
    const CHUNK_SIZE: usize = 1024; // bytes read by every loop iteration
    let mut limit: usize = (1024 * 1024) * 15; // how much should actually be read from the file
    let mut f = File::open(FNAME)?;
    let mut buffer = [0; CHUNK_SIZE]; // buffer to hold the bytes

    // read up to 15 MB, as the limit suggests
    while limit > 0 {
        // `read` may fill only part of the buffer; `n` is the number of bytes
        // actually read, and 0 means the end of the file was reached.
        let n = f.read(&mut buffer[..])?;
        if n == 0 {
            break;
        }
        // Only the first `n` bytes of the buffer hold fresh data; parse or process them here.
        for &byte in &buffer[..n] {
            print!("{}", byte as char);
        }
        limit = limit.saturating_sub(n);
    }
    Ok(())
}

Resources