Extracting an archive with progress bar - mutable borrow error - file

I am trying to extract a .tar.bz file (or .tar.whatever, actually) while reporting progress as a percentage. So far I have this:
pub fn extract_file_with_progress<P: AsRef<Path>>(&self, path: P) -> Result<()> {
let path = path.as_ref();
let size = fs::metadata(path)?.len();
let mut f = File::open(path)?;
let decoder = BzDecoder::new(&f);
let mut archive = Archive::new(decoder);
for entry in archive.entries()? {
entry?.unpack_in(".")?;
let pos = f.seek(SeekFrom::Current(0))?;
}
Ok(())
}
The idea is to use pos/size to get the percentage, but compiling the above function gives me the error "cannot borrow f as mutable because it is also borrowed as immutable".
I understand what the error means, but I don't really use f as mutable; I only use the seek function to get the current position.
Is there a way to work around this, either by forcing the compiler to ignore the mutable borrow or by getting the position in some immutable way?

Files are a bit special. The usual read() and seek() and write() methods (defined on the Read, Seek and Write traits) take self by mutable reference:
fn read(&mut self, buf: &mut [u8]) -> Result<usize>
fn seek(&mut self, pos: SeekFrom) -> Result<u64>
fn write(&mut self, buf: &[u8]) -> Result<usize>
However, all mentioned traits are also implemented for &File, i.e. for immutable references to a file:
impl<'a> Read for &'a File
impl<'a> Seek for &'a File
impl<'a> Write for &'a File
So you can modify a file even if you only have a read-only reference to the file. For these implementations, the Self type is &File, so accepting self by mutable reference in fact means accepting a &mut &File, a mutable reference to a reference to a file.
Your code passes &f to BzDecoder::new(), creating an immutable borrow. Later you call f.seek(SeekFrom::Current(0)), which passes f to seek by mutable reference. However, this is not allowed, since you already have an immutable borrow of the file. The solution is to use the Seek implementation on &File instead:
(&mut &f).seek(SeekFrom::Current(0))
or slightly simpler
(&f).seek(SeekFrom::Current(0))
This only creates a second immutable borrow, which is allowed by Rust's rules for references.
I created a playground example demonstrating that this works. If you replace (&f) with f you get the error you originally got.
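As a rough illustration of the same idea without the tar and bzip2 crates, here is a minimal sketch that reads a file through a shared &File borrow and queries the position through a second shared borrow. The function name read_with_progress and the chunk parameter are made up for this example. With a real decoder in between, the position would reflect how much of the compressed input has been consumed, so the percentage is only an approximation.

```rust
use std::fs::{self, File};
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Reads `path` in chunks of `chunk` bytes through a shared `&File` borrow
/// and returns the progress percentage observed after each chunk.
/// (Hypothetical helper, not part of the original question.)
fn read_with_progress(path: &Path, chunk: usize) -> io::Result<Vec<u64>> {
    let total = fs::metadata(path)?.len();
    let f = File::open(path)?; // note: `f` is not declared `mut`
    let mut reader = &f; // `Read` is implemented for `&File`
    let mut buf = vec![0u8; chunk];
    let mut progress = Vec::new();
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        // A second shared borrow of `f`; uses `impl Seek for &File`.
        let pos = (&f).seek(SeekFrom::Current(0))?;
        progress.push(pos * 100 / total);
    }
    Ok(progress)
}
```

Because both reader and the seek call only hold shared borrows of f, the borrow checker accepts this even though the file position changes underneath.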

Related

How to return arrays from Rust functions without them being copied?

I have a string of functions that generate arrays and return them up a call stack. Roughly the function signatures are:
fn solutions(...) -> [[u64; M]; N] { /* run iterator on lots of problem sets */ }
fn generate_solutions(...) -> impl Iterator<Item=[u64; M]> { /* call find_solution on different problem sets */ }
fn find_solution(...) -> [u64; M] { /* call validate_candidate with different candidates to find solution */ }
fn validate_candidate(...) -> Option<[u64; M]> {
let mut table = [0; M];
// do compute intensive work
if works { Some(table) } else { None }
}
My understanding was that Rust will not actually copy the arrays up the call stack but optimize the copy away.
But this isn't what I see. When I switch to Vec, I see a 20x speed improvement with the only change being [u64; M] to Vec<u64>. So, it is totally copying the arrays over and over.
So why array and not Vec, everyone always asks. Embedded environment. no_std.
How to encourage Rust to optimize these array copies away?
Unfortunately, guaranteed lack of copies is currently an unsolved problem in Rust. To get the characteristics you want, you will need to explicitly pass in the storage that the result should be written into (the "out parameter" pattern):
fn solutions(..., out: &mut [[u64; M]; N]) {...}
fn find_solution(..., out: &mut [u64; M]) {...}
fn validate_candidate(table: &mut [u64; M]) -> bool {
// write into table
works
}
Thus you will also have to find some alternative to Iterator for generate_solutions (since using Iterator implies that all the results can exist at once without overwriting each other).
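A minimal sketch of how the out-parameter pattern could look end to end, using a callback in place of Iterator so that a single scratch array is reused. The acceptance rule, the candidate inputs, and the name for_each_solution are invented for illustration; only the shape of the signatures follows the answer above. Nothing here allocates, so the same structure should carry over to no_std.

```rust
const M: usize = 4;

/// Fills `table` in place and reports whether the candidate works.
/// The rule used here (sum divisible by 7) is purely hypothetical.
fn validate_candidate(candidate: u64, table: &mut [u64; M]) -> bool {
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = candidate.wrapping_mul(i as u64 + 1);
    }
    table.iter().fold(0u64, |acc, &x| acc.wrapping_add(x)) % 7 == 0
}

/// Tries each candidate, writing directly into the caller's storage.
/// Returns true if `out` now holds a valid solution.
fn find_solution(candidates: &[u64], out: &mut [u64; M]) -> bool {
    candidates.iter().any(|&c| validate_candidate(c, out))
}

/// Callback-based replacement for an iterator of arrays: one scratch
/// buffer is reused, and each solution is lent to `visit` by reference.
fn for_each_solution<F: FnMut(&[u64; M])>(problems: &[&[u64]], mut visit: F) {
    let mut scratch = [0u64; M];
    for problem in problems {
        if find_solution(problem, &mut scratch) {
            visit(&scratch);
        }
    }
}
```

The trade-off is that the caller only sees each solution for the duration of the callback; anything that must outlive it has to be copied into caller-owned storage explicitly.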

What is the best way to read multiple files into one?

Here's a benchmark comparing two functions that read multiple files into a single one. One uses read_to_end and the other uses read. My original motivation was to make the buffer's capacity equal to its len at the end of the process. This did not happen with read_to_end, which was quite unsatisfactory.
With read however, this works. The assert_eq!(buf.capacity(), buf.len()); of read_files_into_file2 (which uses read) does not panic.
use criterion::{criterion_group, criterion_main, Criterion};
use std::io::Read;
use std::io::Write;
use std::{
fs,
io::{self, Seek},
};
fn criterion_benchmark(c: &mut Criterion) {
let mut files = get_test_files().unwrap();
let mut file = fs::File::create("output").unwrap();
c.bench_function("1", |b| {
b.iter(|| {
read_files_into_file1(&mut files, &mut file).unwrap();
})
});
c.bench_function("2", |b| {
b.iter(|| {
read_files_into_file2(&mut files, &mut file).unwrap();
});
});
}
criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
/// Goes back to the start so that the files can be read again from the start.
fn reset(files: &mut Vec<fs::File>, file: &mut fs::File) {
file.seek(io::SeekFrom::Start(0)).unwrap();
for file in files {
file.seek(io::SeekFrom::Start(0)).unwrap();
}
}
pub fn read_files_into_file1(files: &mut Vec<fs::File>, file: &mut fs::File) -> io::Result<()> {
reset(files, file);
let total_len = files
.iter()
.map(|file| file.metadata().unwrap().len())
.sum::<u64>() as usize;
let mut buf = Vec::<u8>::with_capacity(total_len);
for file in files {
file.read_to_end(&mut buf)?;
}
file.write_all(&buf)?;
// assert_eq!(buf.capacity(), buf.len());
Ok(())
}
fn read_files_into_file2(files: &mut Vec<fs::File>, file: &mut fs::File) -> io::Result<()> {
reset(files, file);
let total_len = files
.iter()
.map(|file| file.metadata().unwrap().len())
.sum::<u64>() as usize;
let mut vec: Vec<u8> = vec![0; total_len];
let mut buf = &mut vec[..];
for file in files {
match file.read(&mut buf) {
Ok(n) => {
buf = &mut buf[n..];
}
Err(err) if err.kind() == io::ErrorKind::Interrupted => {}
Err(err) => return Err(err),
}
}
file.write_all(&vec)?;
// assert_eq!(vec.capacity(), vec.len());
Ok(())
}
/// Creates 5 files with content "hello world" 500 times.
fn get_test_files() -> io::Result<Vec<fs::File>> {
let mut files = Vec::<fs::File>::new();
for index in 0..5 {
let mut file = fs::OpenOptions::new()
.read(true)
.write(true)
.truncate(true)
.create(true)
.open(&format!("test{}", index))?;
file.write_all("hello world".repeat(500).as_bytes())?;
files.push(file);
}
Ok(files)
}
If you uncomment the assert_eq!s then you will see that only read_files_into_file1 (which uses read_to_end) fails with this panic:
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `55000`,
right: `27500`', benches/bench.rs:53:5
read_files_into_file1 allocates way more memory than needed while read_files_into_file2 allocates the optimal amount.
Despite that, the results say that they perform almost the same (read_files_into_file1 takes 11.439 us and read_files_into_file2 takes 11.098 us):
1 time: [11.417 us 11.439 us 11.463 us]
change: [+3.7987% +3.9997% +4.1984%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
2 time: [11.085 us 11.098 us 11.112 us]
change: [+0.1255% +0.5081% +0.9545%] (p = 0.01 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
I expected read_files_into_file2 to be much faster, but it even turned out to be slower when I increased the file size. Why does read_files_into_file2 not meet my expectations, and what is the best way to read multiple files into one efficiently?
read_to_end generally isn't a good idea when dealing with large files, since it will try to read the whole file into memory, which can lead to swapping or out-of-memory errors.
On Linux, and assuming single-threaded execution, using io::copy should be the fastest method since it contains optimizations for this case.
On other platforms, using io::copy and wrapping the writer side in a BufWriter lets you control the buffer size used for copying, which helps amortize syscall costs.
If you can use multiple threads and know that the file lengths don't change then you can use platform-specific positional read/write methods such as read_at to read multiple files in parallel and write the data into the correct places in the destination file. Whether this actually provides a speedup depends on many factors. It's probably most beneficial when concatenating many small files from a network filesystem.
Beyond the standard library there also are crates that expose platform-specific copy routines which may be faster than a naive userspace copy approach.
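A minimal sketch of the io::copy approach, assuming the inputs are given as paths. The function name concat_files and the 64 KiB buffer size are arbitrary choices for this example, not something the benchmark above used.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Appends the contents of every input file to `output`, in order,
/// and returns the total number of bytes copied.
fn concat_files(inputs: &[&Path], output: &Path) -> io::Result<u64> {
    // The BufWriter capacity controls how much is batched per write call.
    let mut writer = BufWriter::with_capacity(64 * 1024, File::create(output)?);
    let mut total = 0u64;
    for input in inputs {
        let mut reader = File::open(input)?;
        // io::copy streams the data; no whole-file buffer is allocated.
        total += io::copy(&mut reader, &mut writer)?;
    }
    // Flush explicitly so that write errors are reported instead of
    // being silently dropped when the BufWriter goes out of scope.
    writer.flush()?;
    Ok(total)
}
```

Whether the BufWriter helps or hurts on a given platform is something to measure; on Linux it may be worth benchmarking against passing the File directly.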

Why can I not mutably borrow a variable in two different map functions?

I have an iterator in Rust that loops over a Vec<u8> and applies the same function at two different stages. I do this by chaining a couple of map functions together. Here is the relevant code (where example, example_function_1, and example_function_2 are stand-in variables and functions respectively):
NOTE: example.chunks() is a custom function! Not the default one on slices!
let example = vec![0, 1, 2, 3];
let mut hashers = Cycler::new([example_function_1, example_function_2].iter());
let ret: Vec<u8> = example
//...
.chunks(hashers.len())
.map(|buf| hashers.call(buf))
//...
.map(|chunk| hashers.call(chunk))
.collect();
Here is the code for Cycler:
pub struct Cycler<I> {
orig: I,
iter: I,
len: usize,
}
impl<I> Cycler<I>
where
I: Clone + Iterator,
I::Item: Fn(Vec<u8>) -> Vec<u8>,
{
pub fn new(iter: I) -> Self {
Self {
orig: iter.clone(),
len: iter.clone().count(),
iter,
}
}
pub fn len(&self) -> usize {
self.len
}
pub fn reset(&mut self) {
self.iter = self.orig.clone();
}
pub fn call(&mut self, buf: Bytes) -> Bytes {
// It is safe to unwrap because it should continue indefinitely without stopping
self.next().unwrap()(buf)
}
}
impl<I> Iterator for Cycler<I>
where
I: Clone + Iterator,
I::Item: Fn(Vec<u8>) -> Vec<u8>,
{
type Item = I::Item;
fn next(&mut self) -> Option<I::Item> {
match self.iter.next() {
Some(next) => Some(next),
None => {
self.reset();
self.iter.next()
}
}
}
// No size_hint, try_fold, or fold methods
}
What confuses me is that the second time I reference hashers it says this:
error[E0499]: cannot borrow `hashers` as mutable more than once at a time
--> libpressurize/src/password/password.rs:28:14
|
21 | .map(|buf| hashers.call(buf))
| ----- ------- first borrow occurs due to use of `hashers` in closure
| |
| first mutable borrow occurs here
...
28 | .map(|chunk| hashers.call(chunk))
| --- ^^^^^^^ ------- second borrow occurs due to use of `hashers` in closure
| | |
| | second mutable borrow occurs here
| first borrow later used by call
Shouldn't this work because the mutable reference is not used at the same time?
Please let me know if more info/code is needed to answer this.
.map(|buf| hashers.call(buf))
You're probably thinking that in the above line, hashers is mutably borrowed to call it. That's true (since Cycler::call takes &mut self) but it's not what the compiler error is about. In this line, hashers is mutably borrowed to construct the closure |buf| hashers.call(buf), and that borrow lasts as long as the closure does.
Thus, when you write
.map(|buf| hashers.call(buf))
//...
.map(|chunk| hashers.call(chunk))
you are constructing two closures which live at the same time (assuming this is std::iter::Iterator::map) and mutably borrowing hashers for each of them, which is not allowed.
This error is actually protecting you against a side-effect hazard: it's not obvious (in a purely local analysis) what order the side effects of the two call()s will be performed in, because the map()s could do anything they like with the closures. Given the code you wrote, I assume you're doing this on purpose, but the compiler doesn't know that you know what you're doing.
(We can't even predict what the interleaving will be just because they're iterators. Inside of your //... there could be, say, a .filter() step that leads to hashers.call(buf) being called several times between each call to hashers.call(chunk), or something else that produces a different number of outputs than inputs.)
If you know that you want the interleaving of side-effects that is “whenever either map() decides to call it”, then you can gain that freedom with a RefCell or other interior mutability, as dianhenglau's answer demonstrates.
Shouldn't this work because the mutable reference is not used at the same time?
No. The rules of references state that "At any given time, you can have either one mutable reference or any number of immutable references", regardless of whether the references are actually used at the same time. See this answer for the reason behind the rules.
As a workaround, since you're sure that the mutations do not occur simultaneously, you can use std::cell::RefCell as explained in this chapter. Modify the code into:
use std::cell::RefCell;
let example = vec![0, 1, 2, 3];
// Remove the "mut", wrap Cycler in RefCell.
let hashers = RefCell::new(Cycler::new([example_function_1, example_function_2].iter()));
let ret: Vec<u8> = example
//...
.chunks(hashers.borrow().len())
// Borrow hashers as immutable inside the closure, then borrow the Cycler as mutable.
.map(|buf| hashers.borrow_mut().call(buf))
//...
.map(|chunk| hashers.borrow_mut().call(chunk))
.collect();
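Since the snippet above depends on Cycler and a custom chunks(), here is a smaller self-contained sketch of the same RefCell technique. The function tag_twice and the shared log are invented for this example; the log simply records the order in which the two closures run.

```rust
use std::cell::RefCell;

/// Runs two `map` stages that both mutate the same shared state.
/// Returns the mapped values and the order in which the stages ran.
fn tag_twice(input: &[u8]) -> (Vec<u8>, Vec<&'static str>) {
    // Both closures capture `log` by shared reference; the mutable
    // borrow is taken (and released) inside each call at run time.
    let log = RefCell::new(Vec::new());
    let out: Vec<u8> = input
        .iter()
        .map(|&b| {
            log.borrow_mut().push("first");
            b + 1
        })
        .map(|b| {
            log.borrow_mut().push("second");
            b * 2
        })
        .collect();
    // The iterator (and with it both closures) is gone by now,
    // so the RefCell can be consumed.
    (out, log.into_inner())
}
```

For the input [1, 2] the log comes out as first, second, first, second, because iterator adaptors are lazy and each element passes through both stages before the next one is pulled. Each borrow_mut() guard is dropped at the end of its statement, so the two closures never hold a mutable borrow at the same time and no panic occurs.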

Why does a File need to be mutable to call Read::read_to_string?

Here's a line from the 2nd edition Rust tutorial:
let mut f = File::open(filename).expect("file not found");
I'm under the assumption that the file descriptor is a wrapper around a number that basically doesn't change and is read-only.
The compiler complains that the file cannot be borrowed mutably, and I'm assuming it's because the method read_to_string takes the instance as the self argument as mutable, but the question is "why"? What is ever going to change about the file descriptor? Is it keeping track of the cursor location or something?
error[E0596]: cannot borrow immutable local variable `fdesc` as mutable
--> main.rs:13:5
|
11 | let fdesc = File::open(fname).expect("file not found");
| ----- consider changing this to `mut fdesc`
12 | let mut fdata = String::new();
13 | fdesc.read_to_string(&mut fdata)
| ^^^^^ cannot borrow mutably
The whole source:
fn main() {
let args: Vec<String> = env::args().collect();
let query = &args[1];
let fname = &args[2];
println!("Searching for '{}' in file '{}'...", query, fname);
let fdesc = File::open(fname).expect("file not found"); //not mut
let mut fdata = String::new();
fdesc.read_to_string(&mut fdata)
.expect("something went wrong reading the file");
println!("Found: \n{}", fdata);
}
I'm assuming it's because the method read_to_string takes the instance as the self argument as mutable
Yes, that's correct:
fn read_to_string(&mut self, buf: &mut String) -> Result<usize>
The trait method Read::read_to_string takes the receiver as a mutable reference because in general, that's what is needed to implement "reading" from something. You are going to change a buffer or an offset or something.
Yes, an actual File may simply contain an underlying file descriptor (e.g. on Linux or macOS) or a handle (e.g. Windows). In these cases, the operating system deals with synchronizing the access across threads. That's not even guaranteed though — it depends on the platform. Something like Redox might actually have a mutable reference in its implementation of File.
If the Read trait didn't accept a &mut self, then types like BufReader would have to use things like interior mutability, reducing the usefulness of Rust's references.
See also:
Why is it possible to implement Read on an immutable reference to File?
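To make the "you are going to change a buffer or an offset" point concrete, here is a minimal sketch of a reader that keeps its own offset. SliceReader is a made-up type for this example (the standard library already offers Cursor and an impl of Read for &[u8]); it only exists to show why read needs &mut self.

```rust
use std::io::{self, Read};

/// A toy reader over a byte slice that tracks how far it has read.
struct SliceReader<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> Read for SliceReader<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let remaining = &self.data[self.pos..];
        let n = remaining.len().min(buf.len());
        buf[..n].copy_from_slice(&remaining[..n]);
        // This is the mutation that requires `&mut self`.
        self.pos += n;
        Ok(n)
    }
}

/// Reads everything from a fresh `SliceReader` into a `String`.
fn read_all_text(data: &[u8]) -> io::Result<String> {
    let mut reader = SliceReader { data, pos: 0 }; // must be `mut`
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    Ok(text)
}
```

If read took &self instead, pos would have to live in a Cell or similar, which is exactly the interior mutability the answer mentions.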

How to get the current cursor position in file?

Given this code:
let any_offset: u64 = 42;
let mut file = File::open("/home/user/file").unwrap();
file.seek(SeekFrom::Start(any_offset));
// println!("{:?}", file.cursor_position())
How can I obtain the current cursor position?
You should call Seek::seek with a relative offset of 0. This has no side effect and returns the information you are looking for.
Seek is implemented for a number of types, including:
impl Seek for File
impl<'_> Seek for &'_ File
impl<'_, S: Seek + ?Sized> Seek for &'_ mut S
impl<R: Seek> Seek for BufReader<R>
impl<S: Seek + ?Sized> Seek for Box<S>
impl<T> Seek for Cursor<T> where
impl<W: Write + Seek> Seek for BufWriter<W>
Using the Cursor type mentioned by Aaronepower might be more efficient though, since you could avoid having to make an extra system call.
According to the Seek trait API, seek returns the new position. Alternatively, you can read the file's data into a Vec and wrap the Vec in a Cursor, which has a position() method that returns the current position.
Without Cursor
let any_offset: u64 = 42;
let mut file = File::open("/home/user/file").unwrap();
let new_position = file.seek(SeekFrom::Start(any_offset)).unwrap();
println!("{:?}", new_position);
With Cursor
use std::fs::File;
use std::io::{Cursor, Read, Seek, SeekFrom};
let any_offset: u64 = 42;
let mut file = File::open("/home/user/file").unwrap();
// `contents` must be mutable so that read_to_end can fill it.
let mut contents = Vec::new();
file.read_to_end(&mut contents).unwrap();
let mut cursor = Cursor::new(contents);
cursor.seek(SeekFrom::Start(any_offset)).unwrap();
println!("{:?}", cursor.position());
Since Rust 1.51.0 (2021), the Seek trait has a stream_position() method:
use std::io::Seek;
let pos = file.stream_position().unwrap();
However, looking at the source code in the linked documentation this is purely a convenience wrapper that uses the same SeekFrom::Current(0) implementation behind the scenes.
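A small sketch comparing the three ways of asking for the position mentioned in these answers, using an in-memory Cursor so that it runs without touching the filesystem. The helper names current_position and demo_positions are invented for this example; the generic helper should work the same way with a File.

```rust
use std::io::{self, Cursor, Seek, SeekFrom};

/// Returns the current position of any seekable stream by seeking
/// zero bytes relative to where it already is.
fn current_position<S: Seek>(stream: &mut S) -> io::Result<u64> {
    stream.seek(SeekFrom::Current(0))
}

/// Seeks an in-memory cursor to `offset`, then reads the position back
/// via seek, via stream_position, and via Cursor::position.
fn demo_positions(offset: u64) -> io::Result<(u64, u64, u64)> {
    let mut cursor = Cursor::new(vec![0u8; 100]);
    cursor.seek(SeekFrom::Start(offset))?;
    let via_seek = current_position(&mut cursor)?;
    let via_stream_position = cursor.stream_position()?;
    let via_cursor = cursor.position(); // Cursor-only, no Result
    Ok((via_seek, via_stream_position, via_cursor))
}
```

All three values agree; the practical difference is that Cursor::position cannot fail and needs only &self, while the Seek-based calls need &mut access and return an io::Result.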
