Rust - open dynamic number of writers - file

Let's say I have a dynamic number of input strings read from a file (barcodes).
I want to split up a huge 111GB text file based upon matches to the input strings, and write those hits to files.
I don't know how many inputs to expect.
I have done all the file input and string matching, but am stuck at the output step.
Ideally, I would open a file for each input in the input vector barcodes, just containing strings. Are there any approaches to open a dynamic number of output files?
A suboptimal approach is searching for a barcode string as an input arg, but this means I have to read the huge file repeatedly.
The barcode input vector just contains strings, eg
"TAGAGTAT",
"TAGAGTAG",
Ideally, the output should look like this if the previous two strings are the input:
file1 -> TAGAGTAT.txt
file2 -> TAGAGTAG.txt
Thanks for your help.
extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};
use std::str;
use std::fs::File;
use std::io::prelude::*;
use std::path::Path;

fn read_barcodes() -> Vec<String> {
    // TODO - can replace this with file-reading code (OR move to an arguments-based
    // model, parse and demultiplex only one oligomer at a time..... )
    // The `vec!` macro can be used to initialize a vector of strings
    let barcodes = vec![
        "TCTCAAAG".to_string(),
        "AACTCCGC".into(),
        "TAAACGCG".into(),
    ];
    println!("Initial vector: {:?}", barcodes);
    barcodes
}

fn main() {
    //let filename = "test5m.fastq";
    let filename = "Undetermined_S0_R1.fastq";
    println!("Fastq filename: {} ", filename);
    //println!("Barcodes filename: {} ", barcodes_filename);
    let barcodes_vector: Vec<String> = read_barcodes();
    let mut counts_vector: [i32; 30] = [0; 30];
    let mut n_bases = 0;
    let mut n_valid_kmers = 0;
    let mut reader = parse_fastx_file(&filename).expect("Not a valid path/file");
    while let Some(record) = reader.next() {
        let seqrec = record.expect("invalid record");
        // get sequence
        let sequenceBytes = seqrec.normalize(false);
        let sequenceText = str::from_utf8(&sequenceBytes).unwrap();
        //println!("Seq: {} ", &sequenceText);
        // get the first 8 chars (the barcode)
        let sequenceOligo = &sequenceText[0..8];
        //println!("barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
        if sequenceOligo == barcodes_vector[0] {
            //println!("Hit ! Barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
            counts_vector[0] += 1;
        }
    }
}

You probably want a HashMap<String, File>. You could build it from your barcode vector like this:
use std::collections::HashMap;
use std::fs::File;
use std::path::Path;
fn build_file_map(barcodes: &[String]) -> HashMap<String, File> {
    let mut files = HashMap::new();
    for barcode in barcodes {
        let filename = Path::new(barcode).with_extension("txt");
        let file = File::create(filename).expect("failed to create output file");
        files.insert(barcode.clone(), file);
    }
    files
}
You would call it like this:
let barcodes = vec!["TCTCAAAG".to_string(), "AACTCCGC".into(), "TAAACGCG".into()];
let file_map = build_file_map(&barcodes);
And you would get a file to write to like this:
let barcode = &barcodes[0];
let file = file_map.get(barcode).expect("barcode not in file map");
// write to file
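If each barcode receives many writes, it is usually worth wrapping each File in a BufWriter so that every record does not cost a system call. A minimal sketch of the same idea with buffered writers (the barcode strings and file names here are just illustrative):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufWriter, Write};

fn build_writer_map(barcodes: &[String]) -> HashMap<String, BufWriter<File>> {
    let mut writers = HashMap::new();
    for barcode in barcodes {
        // One {barcode}.txt output file per barcode, wrapped in a BufWriter.
        let file = File::create(format!("{}.txt", barcode)).expect("failed to create output file");
        writers.insert(barcode.clone(), BufWriter::new(file));
    }
    writers
}

fn main() {
    let barcodes = vec!["TCTCAAAG".to_string(), "AACTCCGC".into()];
    let mut writers = build_writer_map(&barcodes);

    // For each matching read, look up the barcode's writer and append a line.
    if let Some(w) = writers.get_mut("TCTCAAAG") {
        writeln!(w, "TCTCAAAGACGTACGT").expect("write failed");
    }

    // BufWriter flushes on drop, but flushing explicitly surfaces any errors.
    for w in writers.values_mut() {
        w.flush().expect("flush failed");
    }
}
```

Note that get_mut is needed rather than get, because writing requires a mutable reference to the writer.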

I just need an example of a) how to properly instantiate a vector of files named after the relevant string b) setup the output file objects properly c) write to those files.
Here's a commented example:
use std::io::Write;
use std::fs::File;
use std::io;

fn read_barcodes() -> Vec<String> {
    // read barcodes here
    todo!()
}

fn process_barcode(barcode: &str) -> String {
    // process barcodes here
    todo!()
}

fn main() -> io::Result<()> {
    let barcodes = read_barcodes();
    for barcode in barcodes {
        // process barcode to get output
        let output = process_barcode(&barcode);
        // create file for barcode with {barcode}.txt name
        let mut file = File::create(format!("{}.txt", barcode))?;
        // write output to created file
        file.write_all(output.as_bytes())?;
    }
    Ok(())
}

Related

Rust Programming - Trying to compare user input to lines from a file

So I am trying to compare user input to the lines from a separate file named fruits.txt. I believe I mostly have it working, but I am running into this error:
error[E0658]: use of unstable library feature 'option_result_contains'
--> src/main.rs:19:20
|
19 | s if s.contains(&ask) => println!("{} is a fruit!", ask),
| ^^^^^^^^
|
= note: see issue #62358 <https://github.com/rust-lang/rust/issues/62358> for more information
For more information about this error, try `rustc --explain E0658`.
error: could not compile `learn_arrays` due to previous error
I have tried several ways to match it in Rust, and this is the closest I've gotten where it doesn't complain that I am trying to match a string against whatever type lines yields. Here is what it looks like:
use std::fs::File;
use std::io::{BufReader, BufRead, Error, stdin};
fn main() -> Result<(), Error> {
    let path = "fruits.txt";
    let input = File::open(path)?;
    let buffered = BufReader::new(input);
    let mut ask = String::new();
    stdin()
        .read_line(&mut ask)
        .expect("Failed to read line");
    let ask: String = ask.trim().parse().expect("Please type a valid string!");
    for line in buffered.lines() {
        match line {
            s if s.contains(&ask) => println!("{} is a fruit!", ask),
            _ => println!("{} is either not in the list or not a fruit", ask),
        }
    }
    Ok(())
}
Is there a way to use the unstable feature, or is there a better method to compare user input to lines from a file?
I was able to fix the issue by changing the part where I attempt to match the input to:
let mut found = false;
println!("Result");
for line in buffered.lines() {
    let s = line.unwrap();
    if s.find(&ask).is_some() {
        println!("{} is a fruit!", ask);
        found = true;
        break;
    }
}
if !found {
    println!("{} is either not in the list or not a fruit", ask)
}
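The loop-plus-flag pattern can also be expressed with iterator adapters; any short-circuits the same way the break does. A sketch of that alternative, using an in-memory reader in place of fruits.txt so the example is self-contained:

```rust
use std::io::BufRead;

// Returns true if any line of the reader contains `needle`.
fn contains_line<R: BufRead>(reader: R, needle: &str) -> bool {
    reader
        .lines()
        .filter_map(Result::ok) // skip unreadable lines instead of unwrapping
        .any(|line| line.contains(needle)) // stops at the first hit
}

fn main() {
    // Stand-in for the contents of fruits.txt.
    let fruits = "apple\nbanana\ncherry\n";
    let ask = "banana";
    if contains_line(fruits.as_bytes(), ask) {
        println!("{} is a fruit!", ask);
    } else {
        println!("{} is either not in the list or not a fruit", ask);
    }
}
```

With a real file, pass a BufReader::new(File::open(path)?) as the reader instead of the byte slice.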

collect bytes from string

This is the decoded string from the bytes; it is always different. I am not using it in the code, it's just to show what this is all about.
"Random String; Tags:Value1:1,Value:2,Value3:value4"
This is the array of bytes from the above string, which I get as input.
[&u8...&u8]
What I need is to get the values from those bytes. Every byte in the array changes, but some bytes are always the same. I was wondering if there is a way to extract the values without using any Strings. Thanks for any ideas.
So the output would look like this:
let v1 = [&u8, &u8, &u8, &u8, &u8];
let v2 = [&u8, &u8];
let v3 = [&u8];
let v4 = [&u8];
let v5 = [&u8];
You can do all of this without allocating any extra space by using Rust's iterators, with split and related functions.
At the top level, your data is of the form (key:value;)*.
This suggests first splitting on ;.
Then split each of those pieces into key and value using :.
In your case, all the information is in the entry whose key is "tags".
Within the tags section, you again have (mostly) key-value pairs, this time of the form (key-value,)*, so we split on , and then break each piece into a key-value pair using -.
An example that does this but only prints all the tag key-value pairs is:
fn split_kv(v: &[u8], c: u8) -> Option<(&[u8], &[u8])> {
    let n = v.iter().position(|&b| b == c)?;
    let w = v.split_at(n);
    Some((w.0, &(w.1)[1..]))
}

fn main() {
    let s: &str = "Background:Sunfire Topo;Base:Eagle;Accessory3:None;Patch:Oreo;Jacket:Pink Bonez;Eyes:BloodShot;Beak:Drool;Accessory2:Nose Ring;Accessory1:None;Item:Knife;tags:Dope Eagles,ELEMENT-HYDRO,ATTACK-10,DEFENSE-5,HIGHNESS-4,SWAG-1;metadata:QmU7JcFDoGcUvNkDgsPz9cy13md4xHdNyD6itwmgVLuo7x/860.json";
    let ss: &[u8] = s.as_bytes();
    let tags = ss
        .split(|&b| b == b';') // Break up at ';'
        .filter_map(|s| split_kv(s, b':')) // Split each piece into key-value pairs using ':'
        .filter_map(|(k, v)| {
            // Only keep the tags entry.
            if k == "tags".as_bytes() {
                Some(v)
            } else {
                None
            }
        })
        .next() // And just take the first of those.
        .unwrap();
    // Split the tags by ','
    for t in tags.split(|&b| b == b',') {
        // Then try to convert each to a key-value pair using '-' as separator.
        if let Some((k, v)) = split_kv(t, b'-') {
            println!(
                "k={:?} v={:?}",
                std::str::from_utf8(k).unwrap(),
                std::str::from_utf8(v).unwrap()
            );
        } else {
            println!("t={:?}", std::str::from_utf8(t).unwrap());
        }
    }
}

remove extra length from string converted from array in rust

I am trying to learn async in Rust with tokio. I am trying to take input from the terminal using tokio::io::AsyncReadExt::read, which needs an array as a buffer. But when I convert that buffer into a string, I can't compare it with other strings, because I think it has extra length.
Here is minimal code:
use std::process::Command;
use tokio::prelude::*;
use tokio::time;

async fn get_input(prompt: &str) -> String {
    println!("{}", prompt);
    let mut f = io::stdin();
    let mut buffer = [0; 10];
    // read up to 10 bytes
    f.read(&mut buffer).await;
    String::from_utf8((&buffer).to_vec()).unwrap()
}

async fn lol() {
    for i in 1..5 {
        let mut input = get_input("lol asks ").await;
        input.shrink_to_fit();
        print!("lol {} input = '{}' len = {}", i, input, input.len());
        if input.eq("sl\n") {
            let mut konsole = Command::new("/usr/bin/konsole");
            konsole.arg("-e").arg("sl");
            konsole.output().expect("some error happend");
        }
    }
}

#[tokio::main]
async fn main() {
    let h = lol();
    futures::join!(h);
}
If I execute this code, I get this:
lol asks
sl
lol 1 input = 'sl
' len = 10lol asks
which means the string has length 10.
I solved it with help from people on Discord.
I needed to use the number returned by f.read(…).await.unwrap(); otherwise you end up with additional zero bytes at the end that were not part of the read:
let i = f.read(&mut buffer).await.unwrap();
String::from_utf8((&buffer[..i]).to_vec()).unwrap()
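The same principle applies to any Read or AsyncRead implementation: slice the buffer to the returned byte count before converting. A synchronous sketch of the idea using std::io::Read on an in-memory reader (the 10-byte buffer mirrors the question):

```rust
use std::io::Read;

// Read up to 10 bytes and convert only the bytes actually read.
fn read_chunk_to_string<R: Read>(mut r: R) -> String {
    let mut buffer = [0u8; 10];
    let n = r.read(&mut buffer).expect("read failed");
    // Slicing with ..n drops the untouched zero bytes at the end.
    String::from_utf8(buffer[..n].to_vec()).expect("invalid UTF-8")
}

fn main() {
    let s = read_chunk_to_string("sl\n".as_bytes());
    println!("input = {:?} len = {}", s, s.len()); // len is 3, not 10
}
```

With the slice in place, a comparison like input.eq("sl\n") sees exactly what was typed, with no trailing zero bytes.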

Take numbers from a file and put into a Vec<i32> but keep getting error

Code:
use std::io::Read;

fn main() {
    let mut file = std::fs::File::open("numbs").unwrap();
    let mut contents = String::new();
    file.read_to_string(&mut contents).unwrap();
    let mut v: Vec<i32> = Vec::new();
    for s in contents.lines() {
        v.push(s.parse::<i32>().unwrap());
    }
}
Error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: Empty }', src/libcore/result.rs:1165:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Most likely, you have a trailing newline character \n at the end of your file i.e. an empty last line. You might also have empty lines in the middle of your file.
The easiest way to fix this for your use case is to just ignore empty lines:
for s in contents.lines() {
    if !s.is_empty() {
        v.push(s.parse::<i32>().unwrap());
    }
}
However, it is generally not a good idea to just unwrap a Result especially if you cannot guarantee that it will never panic. A more robust solution is to handle each possible outcome of the Result appropriately. Another advantage of this solution is that it will not just ignore empty lines but also strings that cannot be parsed as an i32. Whether this is what you want or if you wish to handle this error explicitly is up to you. In the following example, we will use if-let to only insert values into the vector if they were successfully parsed as an i32:
for s in contents.lines() {
    if let Ok(i) = s.parse::<i32>() {
        v.push(i);
    }
}
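If you would rather fail on the first unparsable line than silently skip it, collect can also gather the parse results straight into a Result. A sketch:

```rust
use std::num::ParseIntError;

// Fails fast: the first unparsable line aborts the whole collection.
fn parse_all(contents: &str) -> Result<Vec<i32>, ParseIntError> {
    contents.lines().map(|s| s.parse::<i32>()).collect()
}

fn main() {
    assert_eq!(parse_all("1\n2\n3"), Ok(vec![1, 2, 3]));
    assert!(parse_all("1\n\n3").is_err()); // the empty line is now a reported error
    println!("{:?}", parse_all("1\n2\n3"));
}
```

This works because Result implements FromIterator: collecting a sequence of Result<i32, E> into Result<Vec<i32>, E> stops at the first Err.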
Side Note: You don't need to read the entire file into a string and then parse it line by line. Refer to Read large files line by line in Rust to see how to achieve this more idiomatically.
Combining the aforementioned point with flatten and flat_map, we can greatly simplify the logic to:
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    let file = File::open("numbs").unwrap();
    let v: Vec<i32> = BufReader::new(file)
        .lines()
        .flatten() // gets rid of Err from lines
        .flat_map(|line| line.parse::<i32>()) // ignores Err variant from Result of str.parse
        .collect();
}

What is the most efficient way to read a large file in chunks without loading the entire file in memory at once?

What is the most efficient general purpose way of reading "large" files (which may be text or binary), without going into unsafe territory? I was surprised how few relevant results there were when I did a web search for "rust read large file in chunks".
For example, one of my use cases is to calculate an MD5 checksum for a file using rust-crypto (the Md5 module allows you to add &[u8] chunks iteratively).
Here is what I have, which seems to perform slightly better than some other methods like read_to_end:
use std::{
    fs::File,
    io::{self, BufRead, BufReader},
};

fn main() -> io::Result<()> {
    const CAP: usize = 1024 * 128;
    let file = File::open("my.file")?;
    let mut reader = BufReader::with_capacity(CAP, file);
    loop {
        let length = {
            let buffer = reader.fill_buf()?;
            // do stuff with buffer here
            buffer.len()
        };
        if length == 0 {
            break;
        }
        reader.consume(length);
    }
    Ok(())
}
I don't think you can write code more efficient than that. fill_buf on a BufReader over a File is basically just a straight call to read(2).
That said, BufReader isn't really a useful abstraction when you use it like that; it would probably be less awkward to just call file.read(&mut buf) directly.
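A direct file.read loop, as suggested, might look like the sketch below; the chunk size and the byte-counting callback are placeholders for whatever per-chunk processing (e.g. feeding an MD5 hasher) you need:

```rust
use std::fs::File;
use std::io::{self, Read};

const CAP: usize = 1024 * 128;

// Feed the file at `path` to `process` in chunks of at most CAP bytes.
fn for_each_chunk(path: &str, mut process: impl FnMut(&[u8])) -> io::Result<()> {
    let mut file = File::open(path)?;
    let mut buffer = vec![0u8; CAP];
    loop {
        let n = file.read(&mut buffer)?;
        if n == 0 {
            break; // EOF
        }
        process(&buffer[..n]); // only the bytes actually read this iteration
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Create a small demo file so the example runs standalone.
    std::fs::write("my.file", vec![0u8; 300_000])?;
    let mut total = 0usize;
    for_each_chunk("my.file", |chunk| total += chunk.len())?;
    println!("read {} bytes", total);
    Ok(())
}
```

Note that read may return fewer than CAP bytes even before EOF; the loop handles that by slicing the buffer to n rather than assuming it was filled.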
I did it this way. I don't know if it is wrong, but it worked perfectly for me; I'm still not sure it is the correct way, though.
use std::io;
use std::io::prelude::*;
use std::fs::File;

fn main() -> io::Result<()> {
    const FNAME: &str = "LargeFile.txt";
    const CHUNK_SIZE: usize = 1024; // bytes read by every loop iteration
    let mut limit: usize = (1024 * 1024) * 15; // how much should actually be read from the file
    let mut f = File::open(FNAME)?;
    let mut buffer = [0; CHUNK_SIZE]; // buffer to contain the bytes
    // read up to 15 MB, as the limit suggests
    while limit > 0 {
        let n = f.read(&mut buffer[..])?;
        if n == 0 {
            break; // finished reading (EOF)
        }
        // not finished reading; parse or process only the bytes actually read
        for &byte in &buffer[..n] {
            print!("{}", byte as char);
        }
        limit = limit.saturating_sub(n);
    }
    Ok(())
}
