How to get the current cursor position in a file?

Given this code:
use std::fs::File;
use std::io::{Seek, SeekFrom};

let any_offset: u64 = 42;
let mut file = File::open("/home/user/file").unwrap();
file.seek(SeekFrom::Start(any_offset)).unwrap();
// println!("{:?}", file.cursor_position())
How can I obtain the current cursor position?

You should call Seek::seek with a relative offset of 0. This has no side effect and returns the information you are looking for.
Seek is implemented for a number of types, including:
impl Seek for File
impl<'_> Seek for &'_ File
impl<'_, S: Seek + ?Sized> Seek for &'_ mut S
impl<R: Seek> Seek for BufReader<R>
impl<S: Seek + ?Sized> Seek for Box<S>
impl<T> Seek for Cursor<T> where T: AsRef<[u8]>
impl<W: Write + Seek> Seek for BufWriter<W>
Using the Cursor type mentioned by Aaronepower might be more efficient though, since it avoids making an extra system call.
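For example, a minimal sketch of the relative-zero seek (using the hypothetical path from the question):

use std::fs::File;
use std::io::{Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let mut file = File::open("/home/user/file")?;
    file.seek(SeekFrom::Start(42))?;
    // Seeking by a relative offset of 0 does not move the cursor;
    // it simply returns the current position.
    let position = file.seek(SeekFrom::Current(0))?;
    println!("{}", position);
    Ok(())
}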

According to the Seek trait API, seek itself returns the new position. Alternatively, you can read the data of the File into a Vec and wrap the Vec in a Cursor, which does have a method that returns the current position.
Without Cursor
let any_offset: u64 = 42;
let mut file = File::open("/home/user/file").unwrap();
let new_position = file.seek(SeekFrom::Start(any_offset)).unwrap();
println!("{:?}", new_position);
With Cursor
use std::fs::File;
use std::io::{Cursor, Read, Seek, SeekFrom};

let any_offset: u64 = 42;
let mut file = File::open("/home/user/file").unwrap();
let mut contents = Vec::new();
file.read_to_end(&mut contents).unwrap();
let mut cursor = Cursor::new(contents);
cursor.seek(SeekFrom::Start(any_offset)).unwrap();
println!("{:?}", cursor.position());

As of Rust 1.51.0 (2021) there is now the method stream_position() on the Seek trait.
use std::io::Seek;
let pos = file.stream_position().unwrap();
However, looking at the source code in its documentation, this is purely a convenience wrapper that performs the same seek(SeekFrom::Current(0)) call behind the scenes.
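A self-contained sketch (again using the hypothetical path from the question):

use std::fs::File;
use std::io::{Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let mut file = File::open("/home/user/file")?;
    file.seek(SeekFrom::Start(42))?;
    // stream_position() is equivalent to seek(SeekFrom::Current(0)).
    let pos = file.stream_position()?;
    assert_eq!(pos, 42);
    Ok(())
}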

Related

What is the best way to read multiple files into one?

Here's a benchmark comparing two functions that read multiple files into a single one. One uses read and the other uses read_to_end. My original motivation was for the buffer's capacity to equal its len at the end of the process. This did not happen with read_to_end, which was quite unsatisfactory.
With read, however, it works: the assert_eq!(buf.capacity(), buf.len()); of read_files_into_file2 (which uses read) does not panic.
use criterion::{criterion_group, criterion_main, Criterion};
use std::io::Read;
use std::io::Write;
use std::{
    fs,
    io::{self, Seek},
};

fn criterion_benchmark(c: &mut Criterion) {
    let mut files = get_test_files().unwrap();
    let mut file = fs::File::create("output").unwrap();
    c.bench_function("1", |b| {
        b.iter(|| {
            read_files_into_file1(&mut files, &mut file).unwrap();
        })
    });
    c.bench_function("2", |b| {
        b.iter(|| {
            read_files_into_file2(&mut files, &mut file).unwrap();
        });
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

/// Seeks back to the start so that the files can be read again from the beginning.
fn reset(files: &mut Vec<fs::File>, file: &mut fs::File) {
    file.seek(io::SeekFrom::Start(0)).unwrap();
    for file in files {
        file.seek(io::SeekFrom::Start(0)).unwrap();
    }
}

/// Reads every input file into one growable buffer via `read_to_end`,
/// then writes the buffer to the output file.
pub fn read_files_into_file1(files: &mut Vec<fs::File>, file: &mut fs::File) -> io::Result<()> {
    reset(files, file);
    let total_len = files
        .iter()
        .map(|file| file.metadata().unwrap().len())
        .sum::<u64>() as usize;
    let mut buf = Vec::<u8>::with_capacity(total_len);
    for file in files {
        file.read_to_end(&mut buf)?;
    }
    file.write_all(&buf)?;
    // assert_eq!(buf.capacity(), buf.len());
    Ok(())
}

/// Reads every input file into an exactly-sized buffer via `read`,
/// then writes the buffer to the output file.
fn read_files_into_file2(files: &mut Vec<fs::File>, file: &mut fs::File) -> io::Result<()> {
    reset(files, file);
    let total_len = files
        .iter()
        .map(|file| file.metadata().unwrap().len())
        .sum::<u64>() as usize;
    let mut vec: Vec<u8> = vec![0; total_len];
    let mut buf = &mut vec[..];
    for file in files {
        match file.read(&mut buf) {
            Ok(n) => {
                // Advance the slice past the bytes that were just read.
                let tmp = buf;
                buf = &mut tmp[n..];
            }
            Err(err) if err.kind() == io::ErrorKind::Interrupted => {}
            Err(err) => return Err(err),
        }
    }
    file.write_all(&vec)?;
    // assert_eq!(vec.capacity(), vec.len());
    Ok(())
}

/// Creates 5 files, each containing "hello world" repeated 500 times.
fn get_test_files() -> io::Result<Vec<fs::File>> {
    let mut files = Vec::<fs::File>::new();
    for index in 0..5 {
        let mut file = fs::OpenOptions::new()
            .read(true)
            .write(true)
            .truncate(true)
            .create(true)
            .open(&format!("test{}", index))?;
        file.write_all("hello world".repeat(500).as_bytes())?;
        files.push(file);
    }
    Ok(files)
}
If you uncomment the assert_eq!s then you will see that only read_files_into_file1 (which uses read_to_end) fails with this panic:
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `55000`,
right: `27500`', benches/bench.rs:53:5
read_files_into_file1 allocates way more memory than needed while read_files_into_file2 allocates the optimal amount.
Despite that, the results say that they perform almost the same (read_files_into_file1 takes 11.439 us and read_files_into_file2 takes 11.098 us):
1 time: [11.417 us 11.439 us 11.463 us]
change: [+3.7987% +3.9997% +4.1984%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
2 time: [11.085 us 11.098 us 11.112 us]
change: [+0.1255% +0.5081% +0.9545%] (p = 0.01 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
I expected read_files_into_file2 to be much faster, but it was actually slower when I increased the file size. Why does read_files_into_file2 not meet my expectations, and what is the most efficient way to read multiple files into one?
read_to_end generally isn't a good idea for large files, since it tries to read the whole file into memory, which can lead to swapping or out-of-memory errors.
On Linux, and assuming single-threaded execution, using io::copy should be the fastest method, since it contains optimizations for this case.
On other platforms, using io::copy and wrapping the writer side in a BufWriter lets you control the buffer size used for copying, which helps amortize syscall costs.
If you can use multiple threads and know that the file lengths don't change then you can use platform-specific positional read/write methods such as read_at to read multiple files in parallel and write the data into the correct places in the destination file. Whether this actually provides a speedup depends on many factors. It's probably most beneficial when concatenating many small files from a network filesystem.
Beyond the standard library there also are crates that expose platform-specific copy routines which may be faster than a naive userspace copy approach.
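As a rough sketch of the io::copy approach (the file names and the 128 KiB buffer size are illustrative assumptions, not taken from the question):

use std::fs::File;
use std::io::{self, BufWriter, Write};

fn concat_files(inputs: &[&str], output: &str) -> io::Result<u64> {
    let out = File::create(output)?;
    // A sizable write buffer helps amortize syscall costs on platforms
    // without a specialized copy path; 128 KiB is an arbitrary choice.
    let mut writer = BufWriter::with_capacity(128 * 1024, out);
    let mut total = 0;
    for path in inputs {
        let mut reader = File::open(path)?;
        // io::copy may use platform-specific optimizations where available.
        total += io::copy(&mut reader, &mut writer)?;
    }
    writer.flush()?;
    Ok(total)
}

fn main() -> io::Result<()> {
    let written = concat_files(&["test0", "test1", "test2"], "output")?;
    println!("wrote {} bytes", written);
    Ok(())
}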

Extracting an archive with progress bar - mutable borrow error

I am trying to extract a .tar.bz file (or .tar.whatever actually) and also be able to have a xx% progress report. So far I have this:
pub fn extract_file_with_progress<P: AsRef<Path>>(&self, path: P) -> Result<()> {
    let path = path.as_ref();
    let size = fs::metadata(path)?.len();
    let mut f = File::open(path)?;

    let decoder = BzDecoder::new(&f);
    let mut archive = Archive::new(decoder);
    for entry in archive.entries()? {
        entry?.unpack_in(".")?;
        let pos = f.seek(SeekFrom::Current(0))?;
    }
    Ok(())
}
The idea is to use pos/size to get the percentage, but compiling the above function gets me the error cannot borrow f as mutable because it is also borrowed as immutable.
I understand what the error means, but I don't really use f as mutable; I only use the seek function to get the current position.
Is there a way to work around this, either by forcing the compiler to ignore the mutable borrow or by getting the position in some immutable way?
Files are a bit special. The usual read() and seek() and write() methods (defined on the Read, Seek and Write traits) take self by mutable reference:
fn read(&mut self, buf: &mut [u8]) -> Result<usize>
fn seek(&mut self, pos: SeekFrom) -> Result<u64>
fn write(&mut self, buf: &[u8]) -> Result<usize>
However, all mentioned traits are also implemented for &File, i.e. for immutable references to a file:
impl<'a> Read for &'a File
impl<'a> Seek for &'a File
impl<'a> Write for &'a File
So you can modify a file even if you only have a read-only reference to the file. For these implementations, the Self type is &File, so accepting self by mutable reference in fact means accepting a &mut &File, a mutable reference to a reference to a file.
Your code passes &f to BzDecoder::new(), creating an immutable borrow. Later you call f.seek(SeekFrom::Current(0)), which passes f to seek by mutable reference. However, this is not allowed, since you already have an immutable borrow of the file. The solution is to use the Seek implementation on &File instead:
(&mut &f).seek(SeekFrom::Current(0))
or slightly simpler
(&f).seek(SeekFrom::Current(0))
This only creates a second immutable borrow, which is allowed by Rust's rules for references.
I created a playground example demonstrating that this works. If you replace (&f) with f you get the error you originally got.
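Here is a minimal std-only sketch of the same trick, with a generic reader-consuming function standing in for BzDecoder (the file name is a placeholder):

use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

// Stand-in for BzDecoder::new(&f): some consumer that takes ownership of a reader.
fn consume<R: Read>(mut reader: R) -> io::Result<usize> {
    let mut buf = [0u8; 4096];
    reader.read(&mut buf)
}

fn main() -> io::Result<()> {
    let f = File::open("my.file")?; // placeholder path
    // `&File` implements Read, so the consumer can own a `&File`...
    let n = consume(&f)?;
    // ...while we still query the position through a second `&File`.
    let pos = (&f).seek(SeekFrom::Current(0))?;
    println!("read {} bytes, cursor is at byte {}", n, pos);
    Ok(())
}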

Why does a File need to be mutable to call Read::read_to_string?

Here's a line from the 2nd edition Rust tutorial:
let mut f = File::open(filename).expect("file not found");
I'm of the assumption that the file descriptor is a wrapper around a number that basically doesn't change and is read-only.
The compiler complains that the file cannot be borrowed mutably, and I'm assuming it's because the method read_to_string takes the instance as the self argument as mutable, but the question is "why"? What is ever going to change about the file descriptor? Is it keeping track of the cursor location or something?
error[E0596]: cannot borrow immutable local variable `fdesc` as mutable
--> main.rs:13:5
|
11 | let fdesc = File::open(fname).expect("file not found");
| ----- consider changing this to `mut fdesc`
12 | let mut fdata = String::new();
13 | fdesc.read_to_string(&mut fdata)
| ^^^^^ cannot borrow mutably
The whole source:
use std::env;
use std::fs::File;
use std::io::Read;

fn main() {
    let args: Vec<String> = env::args().collect();
    let query = &args[1];
    let fname = &args[2];

    println!("Searching for '{}' in file '{}'...", query, fname);

    let fdesc = File::open(fname).expect("file not found"); //not mut
    let mut fdata = String::new();
    fdesc.read_to_string(&mut fdata)
        .expect("something went wrong reading the file");

    println!("Found: \n{}", fdata);
}
I'm assuming it's because the method read_to_string takes the instance as the self argument as mutable
Yes, that's correct:
fn read_to_string(&mut self, buf: &mut String) -> Result<usize>
The trait method Read::read_to_string takes the receiver as a mutable reference because in general, that's what is needed to implement "reading" from something. You are going to change a buffer or an offset or something.
Yes, an actual File may simply contain an underlying file descriptor (e.g. on Linux or macOS) or a handle (e.g. Windows). In these cases, the operating system deals with synchronizing the access across threads. That's not even guaranteed though — it depends on the platform. Something like Redox might actually have a mutable reference in its implementation of File.
If the Read trait didn't take &mut self, then types like BufReader would have to resort to interior mutability, reducing the usefulness of Rust's references.
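For completeness, here is a short sketch (with a placeholder filename) of the two ways to make the snippet compile: make the binding mutable, or read through &File, which also implements Read:

use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    // Option 1: make the binding mutable.
    let mut fdesc = File::open("input.txt")?; // placeholder filename
    let mut fdata = String::new();
    fdesc.read_to_string(&mut fdata)?;

    // Option 2: keep the binding immutable and read through `&File`;
    // the receiver is then effectively `&mut &File`.
    let fdesc2 = File::open("input.txt")?;
    let mut fdata2 = String::new();
    (&fdesc2).read_to_string(&mut fdata2)?;
    Ok(())
}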
See also:
Why is it possible to implement Read on an immutable reference to File?

Reference has shorter lifetime than its value from same scope?

It appears in my code that a value is living longer than a reference to it, even though both are created in the same scope. I'd like to know why, and how I can adjust the lifetime of my reference.
Example 1 is accepted by the compiler...
let mut rxs: Vec<Receiver<String>> = Vec::new();
let mut txs: Vec<SyncSender<String>> = Vec::new();
for _ in 0..N {
    let (tx, rx) = sync_channel(0);
    txs.push(tx);
    rxs.push(rx);
}
But Example 2 isn't...
let sel = Select::new();
let mut handles: Vec<Handle<String>> = Vec::new();
let mut txs: Vec<SyncSender<String>> = Vec::new();
for _ in 0..N {
    let (tx, rx) = sync_channel(0);
    txs.push(tx);
    handles.push(sel.handle(&rx));
}
The compiler tells me that rx is borrowed in the last line of the for loop but is dropped at the end of the loop, and that the borrow needs to live longer, presumably because the reference is placed in a structure with a longer lifetime. Why would the reference have a different lifetime than the value, and if the value can be moved into a structure as in the first example, why not a reference as in the second?
Finally, I'd like to know why I don't encounter the same issue in Example 3, even though a reference is borrowed and passed into a structure that lasts longer than the scope of the borrow...
let (txs, rxs): (Vec<SyncSender<String>>, Vec<Receiver<String>>) =
    (0..N).map(|_| sync_channel(0)).unzip();
let handles: Vec<Handle<String>> =
    rxs.iter().map(|x| sel.handle(&x)).collect();
In the first example you are moving rx into the rxs vec. That's fine because ownership of rx moves along with it, so it won't get dropped at the end of the iteration.
In the second example, you are passing a reference to sel.handle(), which is another way of saying it is being borrowed. rx is dropped at the end of each loop iteration, but handles outlives the entire loop. If the compiler didn't stop this from happening then handles would be full of dangling pointers.
But why would the reference have a different lifetime than the value
A reference can never outlive the value it refers to. This has to be the case: the value must exist and occupy memory before you can take its address, and after a value is dropped, any reference to it would point at freed memory, which could already be in use for something else.
and if the value can be moved into a structure as in the first example, why not a reference like in the second?
In the second example, the reference is moved into handles, but the original value isn't: rx is still dropped at the end of the iteration, and the stored reference would then point at the freed memory that previously held rx.
In the third example, you have created vectors which own all of the Senders and Receivers. As long as txs and rxs stay in scope, these values will not be dropped.
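A distilled version of the same problem, without channels, fails to compile for the same reason (this is an illustrative sketch, not code from the question):

fn main() {
    let mut refs: Vec<&String> = Vec::new();
    for i in 0..3 {
        let s = i.to_string(); // owned by this iteration
        refs.push(&s);         // error[E0597]: `s` does not live long enough
    }                          // `s` is dropped here, but `refs` would still hold `&s`
    println!("{:?}", refs);
}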
In example 2, rx does not have the same lifetime as handles. In fact, it's dropped at the end of the loop, like this:
let sel = Select::new();
let mut handles: Vec<Handle<String>> = Vec::new();
let mut txs: Vec<SyncSender<String>> = Vec::new();
for _ in 0..N {
    let (tx, rx) = sync_channel(0);
    txs.push(tx);                  // tx is moved into txs, so it lives on
    handles.push(sel.handle(&rx));
    drop(rx);                      // rx is (implicitly) dropped here, at the end of each iteration
}
drop(txs);
drop(handles);
drop(sel);
Example 3 is not equivalent to example 2. This is what would be equivalent to example 2, and it fails:
let (txs, rxs): (Vec<SyncSender<String>>, Vec<Receiver<String>>) =
    (0..N).map(|_| sync_channel(0)).unzip();
let handles: Vec<Handle<String>> =
    rxs.into_iter().map(|x| sel.handle(&x)).collect(); // <-- notice: into_iter()
The iter() function returns an iterator of references. That's why this works:
let (txs, rxs): (Vec<SyncSender<String>>, Vec<Receiver<String>>) =
    (0..N).map(|_| sync_channel(0)).unzip();
let handles: Vec<Handle<String>> =
    rxs.iter().map(|x| sel.handle(x)).collect(); // <-- notice: no `&`

What is the most efficient way to read a large file in chunks without loading the entire file in memory at once?

What is the most efficient general purpose way of reading "large" files (which may be text or binary), without going into unsafe territory? I was surprised how few relevant results there were when I did a web search for "rust read large file in chunks".
For example, one of my use cases is to calculate an MD5 checksum for a file using rust-crypto (the Md5 module allows you to add &[u8] chunks iteratively).
Here is what I have, which seems to perform slightly better than some other methods like read_to_end:
use std::{
    fs::File,
    io::{self, BufRead, BufReader},
};

fn main() -> io::Result<()> {
    const CAP: usize = 1024 * 128;
    let file = File::open("my.file")?;
    let mut reader = BufReader::with_capacity(CAP, file);

    loop {
        let length = {
            let buffer = reader.fill_buf()?;
            // do stuff with buffer here
            buffer.len()
        };
        if length == 0 {
            break;
        }
        reader.consume(length);
    }
    Ok(())
}
I don't think you can write code more efficient than that. fill_buf on a BufReader over a File is basically just a straight call to read(2).
That said, BufReader isn't really a useful abstraction when you use it like that; it would probably be less awkward to just call file.read(&mut buf) directly.
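A sketch of that direct approach with a fixed-size chunk buffer; the path and chunk size are placeholders, and process() is a stub standing in for e.g. feeding an MD5 hasher:

use std::fs::File;
use std::io::{self, Read};

fn process(_chunk: &[u8]) {
    // stub: hash, parse, or otherwise consume the chunk here
}

fn main() -> io::Result<()> {
    const CHUNK_SIZE: usize = 1024 * 128; // placeholder chunk size
    let mut file = File::open("my.file")?;
    let mut buffer = vec![0u8; CHUNK_SIZE];

    loop {
        let n = file.read(&mut buffer)?;
        if n == 0 {
            break; // end of file
        }
        // Only the first `n` bytes of the buffer are valid for this chunk.
        process(&buffer[..n]);
    }
    Ok(())
}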
I did it this way; I don't know if it is wrong, but it worked perfectly for me. I'm still not sure it is the correct way, though.
use std::io;
use std::io::prelude::*;
use std::fs::File;

fn main() -> io::Result<()> {
    const FNAME: &str = "LargeFile.txt";
    const CHUNK_SIZE: usize = 1024; // bytes read by every loop iteration
    let mut limit: usize = (1024 * 1024) * 15; // how much should actually be read from the file

    let mut f = File::open(FNAME)?;
    let mut buffer = [0; CHUNK_SIZE]; // buffer to hold the bytes

    // Read up to 15 MiB, as the limit suggests.
    while limit > 0 {
        // Not finished reading; you can parse or process the data.
        let n = f.read(&mut buffer[..])?;
        if n == 0 {
            break; // reached end of file before the limit
        }
        // Only the first `n` bytes were filled by this read.
        for byte in &buffer[..n] {
            print!("{}", *byte as char);
        }
        limit = limit.saturating_sub(n);
    }
    Ok(())
}
