Rust: How to read a file block by block

I am totally new to Rust. I want to read a file block by block / in chunks (each block should contain 16 bytes) and, for this test scenario, write it into another file, f2. So I first tried it with this code:
let mut buf = [0; 16];
let mut count = 0;
for byte in f1.bytes() {
    if count == 16 {
        do_smth(&mut f2, &mut buf);
        count = 0;
        let data = byte?;
        buf[count] = data;
    } else {
        let data = byte?;
        buf[count] = data;
        count += 1;
    }
}
The test bytes in the file f1 were:
0123456789abcdef-hello world, hello world!
The result in file f2 was
0123456789abcdefhello world, hel
Is there a performant way to advance the file cursor on each iteration?
I read about the seek function and experimented with it a little, but didn't arrive at a solution. Maybe this could be solved by incrementing the file cursor on each iteration?

This is the working solution:
let mut buffer = [0; 16];
let mut count = 0;
while let Ok(n) = f1.read(&mut buffer[..]) {
    if n != 16 {
        // Short read: the final partial block or end of file.
        let rest = &buffer[0..n];
        do_smth(&mut f2, &rest);
        break;
    } else {
        do_smth(&mut f2, &mut buffer);
        count += n;
    }
}

There are two things you could consider in order to improve performance:
Wrapping the File in a BufReader, to reduce the number of system calls by effectively batching your file access (a minimal sketch follows below).
Checking out the memmap crate, to use a memory-mapped file.
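For illustration, here is a minimal sketch of the BufReader variant. The do_smth signature (simply writing the block to f2) and the f1/f2 file names are assumptions based on the question, not the asker's actual code:
use std::fs::File;
use std::io::{BufReader, Read, Write};

// Placeholder for the question's do_smth: here it just writes the block to f2.
fn do_smth(f2: &mut File, block: &[u8]) -> std::io::Result<()> {
    f2.write_all(block)
}

fn main() -> std::io::Result<()> {
    // Wrapping the File in a BufReader means the small 16-byte reads below
    // are served from an in-memory buffer instead of one system call each.
    let mut f1 = BufReader::new(File::open("f1")?);
    let mut f2 = File::create("f2")?;
    let mut buffer = [0u8; 16];
    loop {
        // Note: read() may return fewer than 16 bytes even before the end of
        // the file; we simply pass along however many bytes we got.
        let n = f1.read(&mut buffer)?;
        if n == 0 {
            break; // end of file
        }
        do_smth(&mut f2, &buffer[..n])?;
    }
    Ok(())
}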

Related

Rust - open dynamic number of writers

Let's say I have a dynamic number of input strings from a file (barcodes).
I want to split up a huge 111GB text file based upon matches to the input strings, and write those hits to files.
I don't know how many inputs to expect.
I have done all the file input and string matching, but am stuck at the output step.
Ideally, I would open a file for each input in the input vector barcodes, just containing strings. Are there any approaches to open a dynamic number of output files?
A suboptimal approach is searching for a barcode string as an input arg, but this means I have to read the huge file repeatedly.
The barcode input vector just contains strings, eg
"TAGAGTAT",
"TAGAGTAG",
Ideally, output should look like this if the previous two strings are input
file1 -> TAGAGTAT.txt
file2 -> TAGAGTAG.txt
Thanks for your help.
extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};
use std::str;
use std::fs::File;
use std::io::prelude::*;
use std::path::Path;

fn read_barcodes() -> Vec<String> {
    // TODO - can replace this with file reading code (OR move to an arguments-based model, parse and demultiplex only one oligomer at a time..... )
    // The `vec!` macro can be used to initialize a vector of strings
    let barcodes = vec![
        "TCTCAAAG".to_string(),
        "AACTCCGC".into(),
        "TAAACGCG".into()
    ];
    println!("Initial vector: {:?}", barcodes);
    return barcodes
}

fn main() {
    //let filename = "test5m.fastq";
    let filename = "Undetermined_S0_R1.fastq";
    println!("Fastq filename: {} ", filename);
    //println!("Barcodes filename: {} ", barcodes_filename);
    let barcodes_vector: Vec<String> = read_barcodes();
    let mut counts_vector: [i32; 30] = [0; 30];
    let mut n_bases = 0;
    let mut n_valid_kmers = 0;
    let mut reader = parse_fastx_file(&filename).expect("Not a valid path/file");
    while let Some(record) = reader.next() {
        let seqrec = record.expect("invalid record");
        // get sequence
        let sequenceBytes = seqrec.normalize(false);
        let sequenceText = str::from_utf8(&sequenceBytes).unwrap();
        //println!("Seq: {} ", &sequenceText);
        // get first 8 chars (8 chars x 2 bytes)
        let sequenceOligo = &sequenceText[0..8];
        //println!("barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
        if sequenceOligo == barcodes_vector[0] {
            //println!("Hit ! Barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
            counts_vector[0] = counts_vector[0] + 1;
        }
You probably want a HashMap<String, File>. You could build it from your barcode vector like this:
use std::collections::HashMap;
use std::fs::File;
use std::path::Path;

fn build_file_map(barcodes: &[String]) -> HashMap<String, File> {
    let mut files = HashMap::new();
    for barcode in barcodes {
        let filename = Path::new(barcode).with_extension("txt");
        let file = File::create(filename).expect("failed to create output file");
        files.insert(barcode.clone(), file);
    }
    files
}
You would call it like this:
let barcodes = vec!["TCTCAAAG".to_string(), "AACTCCGC".into(), "TAAACGCG".into()];
let file_map = build_file_map(&barcodes);
And you would get a file to write to like this:
let barcode = &barcodes[0];
let file = file_map.get(barcode).expect("barcode not in file map");
// write to file
I just need an example of a) how to properly instantiate a vector of files named after the relevant string, b) how to set up the output file objects properly, and c) how to write to those files.
Here's a commented example:
use std::io::Write;
use std::fs::File;
use std::io;

fn read_barcodes() -> Vec<String> {
    // read barcodes here
    todo!()
}

fn process_barcode(barcode: &str) -> String {
    // process barcodes here
    todo!()
}

fn main() -> io::Result<()> {
    let barcodes = read_barcodes();
    for barcode in barcodes {
        // process barcode to get output
        let output = process_barcode(&barcode);
        // create file for barcode with {barcode}.txt name
        let mut file = File::create(format!("{}.txt", barcode))?;
        // write output to created file
        file.write_all(output.as_bytes())?;
    }
    Ok(())
}
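To tie the two answers together, here is a rough, untested sketch of how the HashMap of writers could be used inside a matching loop. The sequences vector and the prefix comparison are stand-ins for the needletail-based code in the question:
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

fn build_file_map(barcodes: &[String]) -> io::Result<HashMap<String, File>> {
    let mut files = HashMap::new();
    for barcode in barcodes {
        let filename = Path::new(barcode).with_extension("txt");
        files.insert(barcode.clone(), File::create(filename)?);
    }
    Ok(files)
}

fn main() -> io::Result<()> {
    let barcodes = vec!["TCTCAAAG".to_string(), "AACTCCGC".into(), "TAAACGCG".into()];
    let mut file_map = build_file_map(&barcodes)?;
    // Placeholder input: in the real program these would be the sequences
    // coming out of the needletail reader.
    let sequences = vec!["TCTCAAAGACGTACGT".to_string(), "AACTCCGCTTTTTTTT".into()];
    for seq in &sequences {
        // Compare the start of the sequence against every barcode and, on a hit,
        // append the sequence to that barcode's output file.
        for barcode in &barcodes {
            if seq.starts_with(barcode.as_str()) {
                let file = file_map.get_mut(barcode).expect("barcode not in file map");
                writeln!(file, "{}", seq)?;
            }
        }
    }
    Ok(())
}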

Why is reading a binary file in Rust so much slower than in Java? [duplicate]

I was playing around with binary serialization and deserialization in Rust and noticed that binary deserialization is several orders of magnitude slower than in Java. To rule out overhead from, for example, allocations, I'm simply reading a binary stream in each program. Each program reads from a binary file on disk containing a 4-byte integer with the number of input values, followed by a contiguous chunk of 8-byte big-endian IEEE 754-encoded floating point numbers. Here's the Java implementation:
import java.io.*;

public class ReadBinary {
    public static void main(String[] args) throws Exception {
        DataInputStream input = new DataInputStream(new BufferedInputStream(new FileInputStream(args[0])));
        int inputLength = input.readInt();
        System.out.println("input length: " + inputLength);
        try {
            for (int i = 0; i < inputLength; i++) {
                double d = input.readDouble();
                if (i == inputLength - 1) {
                    System.out.println(d);
                }
            }
        } finally {
            input.close();
        }
    }
}
Here's the Rust implementation:
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

fn main() {
    let args = std::env::args_os();
    let fname = args.skip(1).next().unwrap();
    let path = Path::new(&fname);
    let mut file = BufReader::new(File::open(&path).unwrap());
    let input_length: i32 = read_int(&mut file);
    for i in 0..input_length {
        let d = read_double_slow(&mut file);
        if i == input_length - 1 {
            println!("{}", d);
        }
    }
}

fn read_int<R: Read>(input: &mut R) -> i32 {
    let mut bytes = [0; std::mem::size_of::<i32>()];
    input.read_exact(&mut bytes).unwrap();
    i32::from_be_bytes(bytes)
}

fn read_double_slow<R: Read>(input: &mut R) -> f64 {
    let mut bytes = [0; std::mem::size_of::<f64>()];
    input.read_exact(&mut bytes).unwrap();
    f64::from_be_bytes(bytes)
}
I'm outputting the last value to make sure that all of the input is actually being read. On my machine, when the file contains (the same) 30 million randomly-generated doubles, the Java version runs in 0.8 seconds, while the Rust version runs in 40.8 seconds.
Suspicious of inefficiencies in Rust's byte interpretation itself, I retried it with a custom floating point deserialization implementation. The internals are almost exactly the same as what's being done in Rust's Reader, without the IoResult wrappers:
fn read_double<R: Reader>(input: &mut R, buffer: &mut [u8]) -> f64 {
    use std::mem::transmute;
    match input.read_at_least(8, buffer) {
        Ok(n) => if n > 8 { fail!("n > 8") },
        Err(e) => fail!(e)
    };
    let mut val = 0u64;
    let mut i = 8;
    while i > 0 {
        i -= 1;
        val += buffer[7 - i] as u64 << i * 8;
    }
    unsafe {
        transmute::<u64, f64>(val)
    }
}
The only change I made to the earlier Rust code in order to make this work was to create an 8-byte slice that is passed in and (re)used as a buffer in the read_double function. This yielded a significant performance gain, running in about 5.6 seconds on average. Unfortunately, this is still noticeably slower (and more verbose!) than the Java version, making it difficult to scale up to larger input sets. Is there something that can be done to make this run faster in Rust? More importantly, is it possible to make these changes in such a way that they can be merged into the default Reader implementation itself to make binary I/O less painful?
For reference, here's the code I'm using to generate the input file:
import java.io.*;
import java.util.Random;

public class MakeBinary {
    public static void main(String[] args) throws Exception {
        DataOutputStream output = new DataOutputStream(new BufferedOutputStream(System.out));
        int outputLength = Integer.parseInt(args[0]);
        output.writeInt(outputLength);
        Random rand = new Random();
        for (int i = 0; i < outputLength; i++) {
            output.writeDouble(rand.nextDouble() * 10 + 1);
        }
        output.flush();
    }
}
(Note that generating the random numbers and writing them to disk only takes 3.8 seconds on my test machine.)
When you build without optimisations, it will often be slower than it would be in Java. But build it with optimisations (rustc -O or cargo build --release) and it should be very much faster. If the standard-library version still ends up slower, that is something that should be examined carefully to figure out where the slowness is: perhaps something is being inlined that shouldn't be, or something that should be isn't, or perhaps some optimisation that was expected is not occurring.
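As a further, unmeasured sketch of one way to shave the remaining per-value overhead: read the doubles in large blocks and decode them eight bytes at a time, so there is no read_exact call per value. The file name and block size here are arbitrary placeholders:
use std::convert::TryInto;
use std::fs::File;
use std::io::{BufReader, Read};

fn main() -> std::io::Result<()> {
    let mut file = BufReader::new(File::open("doubles.bin")?);

    // Header: 4-byte big-endian count of values, as in the question.
    let mut len_bytes = [0u8; 4];
    file.read_exact(&mut len_bytes)?;
    let input_length = i32::from_be_bytes(len_bytes);

    // Read the payload in 64 KiB blocks and decode 8 bytes at a time,
    // instead of issuing one read_exact call per value.
    let mut remaining = input_length as usize * 8;
    let mut block = vec![0u8; 64 * 1024];
    let mut last = 0.0f64;
    while remaining > 0 {
        let want = remaining.min(block.len());
        file.read_exact(&mut block[..want])?;
        for chunk in block[..want].chunks_exact(8) {
            last = f64::from_be_bytes(chunk.try_into().unwrap());
        }
        remaining -= want;
    }
    println!("{}", last);
    Ok(())
}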

What is the correct way to read a binary file in chunks of a fixed size and store all of those chunks into a Vec?

I'm having trouble with opening a file. Most examples read files into a String or read the entire file into a Vec. What I need is to read a file in chunks of a fixed size and store those chunks in an array (Vec) of chunks.
For example, I have a file called my_file of exactly 64 KB and I want to read it in chunks of 16 KB, so I would end up with a Vec of size 4 where each element is another Vec of 16 KB (0x4000 bytes).
After reading the docs and checking other Stack Overflow answers, I was able to come up with something like this:
let mut file = std::fs::File::open("my_file")?;
// ...calculate num_of_chunks, 4 in this case
let mut list_of_chunks = Vec::new();
for chunk in 0..num_of_chunks {
    let mut data: [u8; 0x4000] = [0; 0x4000];
    file.read(&mut data[..])?;
    list_of_chunks.push(data.to_vec());
}
Although this seems to work fine, it looks a bit convoluted. As I read it, on each iteration it will:
Create a new array on the stack
Read the chunk into the array
Copy the contents of the array into a new Vec and then move that Vec into the list_of_chunks Vec.
I'm not sure if it's idiomatic or even possible, but I'd rather have something like this:
Create a Vec with num_of_chunks elements where each element is another Vec of size 16 KB.
Read each file chunk directly into the correct Vec.
No copying, and memory is guaranteed to be allocated before reading the file.
Is that approach possible, or is there a better conventional/idiomatic/correct way to do this?
I'm wondering if Vec is the correct type for solving this. I mean, I won't need the array to grow after reading the file.
Read::read_to_end reads efficiently directly into a Vec. If you want it in chunks, combine it with Read::take to limit the number of bytes that read_to_end will read.
Example:
use std::io::Read;

let mut file = std::fs::File::open("your_file")?;
let mut list_of_chunks = Vec::new();
let chunk_size = 0x4000;
loop {
    let mut chunk = Vec::with_capacity(chunk_size);
    let n = file.by_ref().take(chunk_size as u64).read_to_end(&mut chunk)?;
    if n == 0 { break; }
    list_of_chunks.push(chunk);
    if n < chunk_size { break; }
}
The last if is not necessary, but it prevents an extra read call: if fewer than the requested number of bytes were read by read_to_end, we can expect the next read to read nothing, since we hit the end of the file.
I think the most idiomatic way would be to use an iterator. The code below (freely inspired by M-ou-se's answer):
Handles many use cases by using generic types
Uses a pre-allocated vector
Hides side effects
Avoids copying data twice
use std::io::{self, Read, Seek, SeekFrom};

struct Chunks<R> {
    read: R,
    size: usize,
    hint: (usize, Option<usize>),
}

impl<R> Chunks<R> {
    pub fn new(read: R, size: usize) -> Self {
        Self {
            read,
            size,
            hint: (0, None),
        }
    }

    pub fn from_seek(mut read: R, size: usize) -> io::Result<Self>
    where
        R: Seek,
    {
        let old_pos = read.seek(SeekFrom::Current(0))?;
        let len = read.seek(SeekFrom::End(0))?;

        let rest = (len - old_pos) as usize; // len is always >= old_pos but they are u64
        if rest != 0 {
            read.seek(SeekFrom::Start(old_pos))?;
        }

        let min = rest / size + if rest % size != 0 { 1 } else { 0 };
        Ok(Self {
            read,
            size,
            hint: (min, None), // this could be wrong, I'm unsure
        })
    }

    // This could be useful if you want to try to recover from an error
    pub fn into_inner(self) -> R {
        self.read
    }
}

impl<R> Iterator for Chunks<R>
where
    R: Read,
{
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut chunk = Vec::with_capacity(self.size);
        match self
            .read
            .by_ref()
            .take(chunk.capacity() as u64)
            .read_to_end(&mut chunk)
        {
            Ok(n) => {
                if n != 0 {
                    Some(Ok(chunk))
                } else {
                    None
                }
            }
            Err(e) => Some(Err(e)),
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        self.hint
    }
}

trait ReadPlus: Read {
    fn chunks(self, size: usize) -> Chunks<Self>
    where
        Self: Sized,
    {
        Chunks::new(self, size)
    }
}

impl<T: ?Sized> ReadPlus for T where T: Read {}

fn main() -> io::Result<()> {
    let file = std::fs::File::open("src/main.rs")?;
    let iter = Chunks::from_seek(file, 0xFF)?; // replace with any chunk size; 0xFF was just for testing
    println!("{:?}", iter.size_hint());

    // This iterator could return Err forever; be careful to collect it into a Result
    let chunks = iter.collect::<Result<Vec<_>, _>>()?;
    println!("{:?}, {:?}", chunks.len(), chunks.capacity());

    Ok(())
}
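As a small usage note (assuming the definitions above are in scope), the blanket ReadPlus impl lets any reader be chunked directly:
fn read_in_chunks() -> io::Result<Vec<Vec<u8>>> {
    // `chunks` comes from the ReadPlus extension trait defined above.
    let file = std::fs::File::open("src/main.rs")?;
    file.chunks(0x4000).collect()
}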

What is the most efficient way to read a large file in chunks without loading the entire file in memory at once?

What is the most efficient general purpose way of reading "large" files (which may be text or binary), without going into unsafe territory? I was surprised how few relevant results there were when I did a web search for "rust read large file in chunks".
For example, one of my use cases is to calculate an MD5 checksum for a file using rust-crypto (the Md5 module allows you to add &[u8] chunks iteratively).
Here is what I have, which seems to perform slightly better than some other methods like read_to_end:
use std::{
    fs::File,
    io::{self, BufRead, BufReader},
};

fn main() -> io::Result<()> {
    const CAP: usize = 1024 * 128;
    let file = File::open("my.file")?;
    let mut reader = BufReader::with_capacity(CAP, file);

    loop {
        let length = {
            let buffer = reader.fill_buf()?;
            // do stuff with buffer here
            buffer.len()
        };
        if length == 0 {
            break;
        }
        reader.consume(length);
    }

    Ok(())
}
I don't think you can write code more efficient than that. fill_buf on a BufReader over a File is basically just a straight call to read(2).
That said, BufReader isn't really a useful abstraction when you use it like that; it would probably be less awkward to just call file.read(&mut buf) directly.
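A minimal sketch of that direct approach, with a placeholder process function standing in for whatever is done with each chunk (e.g. feeding a hasher):
use std::fs::File;
use std::io::{self, Read};

// Placeholder for whatever you do with each chunk (hashing, parsing, ...).
fn process(chunk: &[u8]) {
    let _ = chunk;
}

fn main() -> io::Result<()> {
    const CAP: usize = 1024 * 128;
    let mut file = File::open("my.file")?;
    let mut buf = vec![0u8; CAP];
    loop {
        // read() may return fewer bytes than the buffer holds; 0 means end of file.
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        process(&buf[..n]);
    }
    Ok(())
}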
I did it this way. I don't know if it is wrong, but it worked perfectly for me; I still don't know if it is the correct way, though.
use std::io;
use std::io::prelude::*;
use std::fs::File;

fn main() -> io::Result<()> {
    const FNAME: &str = "LargeFile.txt";
    const CHUNK_SIZE: usize = 1024; // bytes read by every loop iteration
    let mut limit: usize = (1024 * 1024) * 15; // how much should actually be read from the file
    let mut f = File::open(FNAME)?;
    let mut buffer = [0; CHUNK_SIZE]; // buffer to hold the bytes

    // read up to 15 MB, as the limit suggests
    loop {
        if limit > 0 {
            // not finished reading; you can parse or process the data
            let n = f.read(&mut buffer[..])?;
            if n == 0 {
                break; // reached end of file before hitting the limit
            }
            // only the first n bytes of the buffer are valid for this iteration
            for &byte in &buffer[..n] {
                print!("{}", byte as char);
            }
            limit -= n.min(limit);
        } else {
            // finished reading
            break;
        }
    }
    Ok(())
}

Any Efficient way to parse large text files and store parsing information?

My purpose is to parse text files and store information in respective tables.
I have to parse around 100 folders containing more than 8000 files, approximately 20 GB in total.
When I tried to store the whole file contents in a string, an out-of-memory exception was thrown.
That is:
using (StreamReader objStream = new StreamReader(filename))
{
    string fileDetails = objStream.ReadToEnd();
}
Hence I tried logic like this:
using (StreamReader objStream = new StreamReader(filename))
{
    // Getting total number of lines in a file
    int fileLineCount = File.ReadLines(filename).Count();
    if (fileLineCount < 90000)
    {
        fileDetails = objStream.ReadToEnd();
        fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
        string[] fileInfo = fileDetails.ToString().Split('\n');
        //call respective method for parsing and insertion
    }
    else
    {
        while ((firstLine = objStream.ReadLine()) != null)
        {
            lineCount++;
            fileDetails = (fileDetails != string.Empty) ? string.Concat(fileDetails, "\n", firstLine)
                                                        : string.Concat(firstLine);
            if (lineCount == 90000)
            {
                fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
                string[] fileInfo = fileDetails.ToString().Split('\n');
                lineCount = 0;
                //call respective method for parsing and insertion
            }
        }
        //when content is 90057, to parse 57
        if (lineCount < 90000)
        {
            string[] fileInfo = fileDetails.ToString().Split('\n');
            lineCount = 0;
            //call respective method for parsing and insertion
        }
    }
}
Here 90,000 is the batch size that is safe to process without causing an out-of-memory exception in my case.
Still, the process takes more than 2 days to complete. I observed this is because of reading line by line.
Is there any better approach to handle this?
Thanks in Advance :)
You can use a profiler to detect what sucks your performance. In this case it's obvious: disk access and string concatenation.
Do not read a file more than once. Let's take a look at your code. First of all, the line int fileLineCount = File.ReadLines(filename).Count(); means you read the whole file and discard what you've read. That's bad. Throw away your if (fileLineCount < 90000) and keep only else.
It almost doesn't matter if you read line-by-line in consecutive order or the whole file because reading is buffered in any case.
Avoid string concatenation, especially for long strings.
fileDetails = fileDetails.Replace(Environment.NewLine, "\n");
string[] fileInfo = fileDetails.ToString().Split('\n');
It's really bad. You read the file line-by-line, why do you do this replacement/split? File.ReadLines() gives you a collection of all lines. Just pass it to your parsing routine.
If you do this properly, I expect a significant speedup. It can be optimized further by reading files in a separate thread while processing them in the main thread. But that is another story.
