Remove extra length from a string converted from an array in Rust

I am trying to learn async in Rust with tokio. I am trying to read input from the terminal using tokio::io::AsyncReadExt::read, which needs an array as a buffer. But when I convert that buffer into a string, I can't compare it with other strings, because I think it has extra length.
Here is minimal code:
use std::process::Command;
use tokio::prelude::*;
use tokio::time;
async fn get_input(prompt: &str) -> String {
println!("{}", prompt);
let mut f = io::stdin();
let mut buffer = [0; 10];
// read up to 10 bytes
f.read(&mut buffer).await;
String::from_utf8((&buffer).to_vec()).unwrap()
}
async fn lol() {
for i in 1..5 {
let mut input = get_input("lol asks ").await;
input.shrink_to_fit();
print!("lol {} input = '{}' len = {}", i, input, input.len());
if input.eq("sl\n") {
let mut konsole = Command::new("/usr/bin/konsole");
konsole.arg("-e").arg("sl");
konsole.output().expect("some error happened");
}
}
}
#[tokio::main]
async fn main() {
let h = lol();
futures::join!(h);
}
If I execute this code, I get this:
lol asks
sl
lol 1 input = 'sl
' len = 10lol asks
which means the string has length 10.

I solved it with help from people on Discord.
I needed to use the number returned by f.read(…).await.unwrap(). Otherwise the string also contains the trailing zero bytes of the buffer, which the read never filled.
let n = f.read(&mut buffer).await.unwrap();
String::from_utf8(buffer[..n].to_vec()).unwrap()
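The same fix can be tried without an async runtime. The sketch below is an assumption-laden stand-in: it uses the blocking std::io::Read trait instead of tokio's AsyncReadExt, and the function name read_input is made up for illustration. The point it shows is only that the count returned by read must be used to slice the buffer.

```rust
use std::io::{self, Read};

/// Reads up to 10 bytes from `reader` and returns only the bytes that were
/// actually read, converted to a `String`.
/// (Hypothetical helper; blocking std I/O stands in for tokio here.)
fn read_input<R: Read>(reader: &mut R) -> io::Result<String> {
    let mut buffer = [0u8; 10];
    // `read` reports how many bytes it wrote into the buffer.
    let n = reader.read(&mut buffer)?;
    // Slice off the unused, still-zeroed tail before converting.
    String::from_utf8(buffer[..n].to_vec())
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

A byte slice implements Read, so read_input(&mut &b"sl\n"[..]) can be used to try it out; calling trim_end() on the result additionally drops the trailing newline before comparing.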

Related

Rust - open dynamic number of writers

Let's say I have a dynamic number of input strings from a file (barcodes).
I want to split up a huge 111GB text file based upon matches to the input strings, and write those hits to files.
I don't know how many inputs to expect.
I have done all the file input and string matching, but am stuck at the output step.
Ideally, I would open a file for each input in the input vector barcodes, just containing strings. Are there any approaches to open a dynamic number of output files?
A suboptimal approach is searching for a barcode string as an input arg, but this means I have to read the huge file repeatedly.
The barcode input vector just contains strings, eg
"TAGAGTAT",
"TAGAGTAG",
Ideally, output should look like this if the previous two strings are input
file1 -> TAGAGTAT.txt
file2 -> TAGAGTAG.txt
Thanks for your help.
extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};
use std::str;
use std::fs::File;
use std::io::prelude::*;
use std::path::Path;
fn read_barcodes () -> Vec<String> {
// TODO - can replace this with file reading code (OR move to an arguments based model, parse and demultiplex only one oligomer at a time..... )
// The `vec!` macro can be used to initialize a vector of strings
let barcodes = vec![
"TCTCAAAG".to_string(),
"AACTCCGC".into(),
"TAAACGCG".into()
];
println!("Initial vector: {:?}", barcodes);
return barcodes
}
fn main() {
//let filename = "test5m.fastq";
let filename = "Undetermined_S0_R1.fastq";
println!("Fastq filename: {} ", filename);
//println!("Barcodes filename: {} ", barcodes_filename);
let barcodes_vector: Vec<String> = read_barcodes();
let mut counts_vector: [i32; 30] = [0; 30];
let mut n_bases = 0;
let mut n_valid_kmers = 0;
let mut reader = parse_fastx_file(&filename).expect("Not a valid path/file");
while let Some(record) = reader.next() {
let seqrec = record.expect("invalid record");
// get sequence
let sequenceBytes = seqrec.normalize(false);
let sequenceText = str::from_utf8(&sequenceBytes).unwrap();
//println!("Seq: {} ", &sequenceText);
// get first 8 chars (8chars x 2 bytes)
let sequenceOligo = &sequenceText[0..8];
//println!("barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
if sequenceOligo == barcodes_vector[0]{
//println!("Hit ! Barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
counts_vector[0] = counts_vector[0] + 1;
}
You probably want a HashMap<String, File>. You could build it from your barcode vector like this:
use std::collections::HashMap;
use std::fs::File;
use std::path::Path;
fn build_file_map(barcodes: &[String]) -> HashMap<String, File> {
let mut files = HashMap::new();
for barcode in barcodes {
let filename = Path::new(barcode).with_extension("txt");
let file = File::create(filename).expect("failed to create output file");
files.insert(barcode.clone(), file);
}
files
}
You would call it like this:
let barcodes = vec!["TCTCAAAG".to_string(), "AACTCCGC".into(), "TAAACGCG".into()];
let file_map = build_file_map(&barcodes);
And you would get a file to write to like this:
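let barcode = &barcodes[0]; // borrow: a String cannot be moved out of a Vec by indexing
let mut file = file_map.get(barcode).expect("barcode not in file map");
// write to file, e.g. file.write_all(...) with std::io::Write in scope (Write is implemented for &File)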
I just need an example of a) how to properly instantiate a vector of files named after the relevant string b) setup the output file objects properly c) write to those files.
Here's a commented example:
use std::io::Write;
use std::fs::File;
use std::io;
fn read_barcodes() -> Vec<String> {
// read barcodes here
todo!()
}
fn process_barcode(barcode: &str) -> String {
// process barcodes here
todo!()
}
fn main() -> io::Result<()> {
let barcodes = read_barcodes();
for barcode in barcodes {
// process barcode to get output
let output = process_barcode(&barcode);
// create file for barcode with {barcode}.txt name
let mut file = File::create(format!("{}.txt", barcode))?;
// write output to created file
file.write_all(output.as_bytes())?;
}
Ok(())
}
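To connect the HashMap suggestion above with the goal of reading the big file only once, here is a minimal sketch. It assumes the barcode is the first key_len bytes of each line, and the name route_lines and its parameters are hypothetical. It is generic over any Write so that it can be tried with in-memory buffers; with real output the map values would be File (or BufWriter<File>) handles.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Appends each line to the writer whose key equals the line's first
/// `key_len` bytes. Lines that are too short or have no matching writer
/// are skipped. Returns the number of lines written.
/// (Hypothetical helper, not part of needletail or std.)
fn route_lines<'a, W, I>(
    lines: I,
    key_len: usize,
    writers: &mut HashMap<String, W>,
) -> io::Result<usize>
where
    W: Write,
    I: IntoIterator<Item = &'a str>,
{
    let mut written = 0;
    for line in lines {
        // `get` returns None for short lines instead of panicking.
        if let Some(key) = line.get(..key_len) {
            if let Some(writer) = writers.get_mut(key) {
                writeln!(writer, "{}", line)?;
                written += 1;
            }
        }
    }
    Ok(written)
}
```

Because every writer stays open in the map, the input is scanned a single time no matter how many barcodes there are. Keep in mind that operating systems limit the number of simultaneously open files, which may matter for very large barcode lists.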

What is the correct way to read a binary file in chunks of a fixed size and store all of those chunks into a Vec?

I'm having trouble with opening a file. Most examples read files into a String or read the entire file into a Vec. What I need is to read a file into chunks of a fixed size and store those chunks into an array (Vec) of chunks.
For example, I have a file called my_file of exactly 64 KB size and I want to read it in chunks of 16KB so I would end up with an Vec of size 4 where each element is another Vec with size 16Kb (0x4000 bytes).
After reading the docs and checking other Stack Overflow answers, I was able to come with something like this:
let mut file = std::fs::File::open("my_file")?;
// ...calculate num_of_chunks 4 in this case
let mut list_of_chunks = Vec::new();
for chunk in 0..num_of_chunks {
let mut data: [u8; 0x4000] = [0; 0x4000];
file.read(&mut data[..])?;
list_of_chunks.push(data.to_vec());
}
Although this seems to work fine, it looks a bit convoluted. As I read it, the code does the following:
For each iteration, create a new array on stack
Read the chunk into the array
Copy the contents of the array into a new Vec and then move the Vec into the list_of_chunks Vec.
I'm not sure if it's idiomatic or even possible, but I'd rather have something like this:
Create a Vec with num_of_chunk elements where each element is another Vec of size 16KB.
Read file chunk directly into the correct Vec
No copying and we make sure memory is allocated before reading the file.
Is that approach possible? or is there a better conventional/idiomatic/correct way to do this?
I'm wondering if Vec is the correct type for solving this. I mean, I won't need the array to grow after reading the file.
Read::read_to_end reads efficiently directly into a Vec. If you want it in chunks, combine it with Read::take to limit the amount of bytes that read_to_end will read.
Example:
let mut file = std::fs::File::open("your_file")?;
let mut list_of_chunks = Vec::new();
let chunk_size = 0x4000;
loop {
let mut chunk = Vec::with_capacity(chunk_size);
let n = file.by_ref().take(chunk_size as u64).read_to_end(&mut chunk)?;
if n == 0 { break; }
list_of_chunks.push(chunk);
if n < chunk_size { break; }
}
The last if is not necessary, but it prevents an extra read call: If less than the requested amount of bytes was read by read_to_end, we can expect the next read to read nothing, since we hit the end of the file.
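For experimenting, the loop above can be wrapped in a function that accepts any reader. This is only a sketch: the name read_chunks is made up, and the whole input is still collected in memory, as in the question.

```rust
use std::io::{self, Read};

/// Reads `reader` to the end and returns its contents as chunks of at most
/// `chunk_size` bytes. Only the last chunk may be shorter.
/// (Hypothetical wrapper around the take + read_to_end loop.)
fn read_chunks<R: Read>(mut reader: R, chunk_size: usize) -> io::Result<Vec<Vec<u8>>> {
    let mut chunks = Vec::new();
    loop {
        // Allocate the chunk up front; read_to_end fills it directly.
        let mut chunk = Vec::with_capacity(chunk_size);
        let n = reader
            .by_ref()
            .take(chunk_size as u64)
            .read_to_end(&mut chunk)?;
        if n == 0 {
            break; // nothing left to read
        }
        chunks.push(chunk);
        if n < chunk_size {
            break; // short chunk: we hit the end of the input
        }
    }
    Ok(chunks)
}
```

With a file this would be called as read_chunks(File::open("my_file")?, 0x4000)?, and a byte slice such as &data[..] works as the reader in tests.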
I think the most idiomatic way would be to use an iterator. The code below (freely inspired by M-ou-se's answer):
Handles many use cases by using generic types
Will use a pre-allocated vector
Hides side effect
Avoid copying data twice
use std::io::{self, Read, Seek, SeekFrom};
struct Chunks<R> {
read: R,
size: usize,
hint: (usize, Option<usize>),
}
impl<R> Chunks<R> {
pub fn new(read: R, size: usize) -> Self {
Self {
read,
size,
hint: (0, None),
}
}
pub fn from_seek(mut read: R, size: usize) -> io::Result<Self>
where
R: Seek,
{
let old_pos = read.seek(SeekFrom::Current(0))?;
let len = read.seek(SeekFrom::End(0))?;
let rest = (len - old_pos) as usize; // len is always >= old_pos but they are u64
if rest != 0 {
read.seek(SeekFrom::Start(old_pos))?;
}
let min = rest / size + if rest % size != 0 { 1 } else { 0 };
Ok(Self {
read,
size,
hint: (min, None), // this could be wrong I'm unsure
})
}
// This could be useful if you want to try to recover from an error
pub fn into_inner(self) -> R {
self.read
}
}
impl<R> Iterator for Chunks<R>
where
R: Read,
{
type Item = io::Result<Vec<u8>>;
fn next(&mut self) -> Option<Self::Item> {
let mut chunk = Vec::with_capacity(self.size);
match self
.read
.by_ref()
.take(chunk.capacity() as u64)
.read_to_end(&mut chunk)
{
Ok(n) => {
if n != 0 {
Some(Ok(chunk))
} else {
None
}
}
Err(e) => Some(Err(e)),
}
}
fn size_hint(&self) -> (usize, Option<usize>) {
self.hint
}
}
trait ReadPlus: Read {
fn chunks(self, size: usize) -> Chunks<Self>
where
Self: Sized,
{
Chunks::new(self, size)
}
}
impl<T: ?Sized> ReadPlus for T where T: Read {}
fn main() -> io::Result<()> {
let file = std::fs::File::open("src/main.rs")?;
let iter = Chunks::from_seek(file, 0xFF)?; // replace with anything 0xFF was to test
println!("{:?}", iter.size_hint());
// On a persistent error this iterator could yield Err forever, so collect it into a Result, which stops at the first Err
let chunks = iter.collect::<Result<Vec<_>, _>>()?;
println!("{:?}, {:?}", chunks.len(), chunks.capacity());
Ok(())
}

How to split a String into an array? String.split(",") results in unexpected data type/struct [duplicate]

From the documentation, it's not clear. In Java you could use the split method like so:
"some string 123 ffd".split("123");
Use split()
let mut split = "some string 123 ffd".split("123");
This gives an iterator, which you can loop over, or collect() into a vector.
for s in split {
println!("{}", s)
}
let vec = split.collect::<Vec<&str>>();
// OR
let vec: Vec<&str> = split.collect();
There are several simple ways:
By separator:
s.split("separator") | s.split('/') | s.split(char::is_numeric)
By whitespace:
s.split_whitespace()
By newlines:
s.lines()
By regex: (using regex crate)
Regex::new(r"\s").unwrap().split("one two three")
The result of each kind is an iterator:
let text = "foo\r\nbar\n\nbaz\n";
let mut lines = text.lines();
assert_eq!(Some("foo"), lines.next());
assert_eq!(Some("bar"), lines.next());
assert_eq!(Some(""), lines.next());
assert_eq!(Some("baz"), lines.next());
assert_eq!(None, lines.next());
split is a method of str, so it is also available on String through deref:
fn split<'a, P>(&'a self, pat: P) -> Split<'a, P> where P: Pattern<'a>
Split by char:
let v: Vec<&str> = "Mary had a little lamb".split(' ').collect();
assert_eq!(v, ["Mary", "had", "a", "little", "lamb"]);
Split by string:
let v: Vec<&str> = "lion::tiger::leopard".split("::").collect();
assert_eq!(v, ["lion", "tiger", "leopard"]);
Split by closure:
let v: Vec<&str> = "abc1def2ghi".split(|c: char| c.is_numeric()).collect();
assert_eq!(v, ["abc", "def", "ghi"]);
split returns an Iterator, which you can convert into a Vec using collect: split_line.collect::<Vec<_>>(). Going through an iterator instead of returning a Vec directly has several advantages:
split is lazy. This means that it won't really split the line until you need it. That way it won't waste time splitting the whole string if you only need the first few values: split_line.take(2).collect::<Vec<_>>(), or even if you need only the first value that can be converted to an integer: split_line.filter_map(|x| x.parse::<i32>().ok()).next(). This last example won't waste time attempting to process the "23.0" but will stop processing immediately once it finds the "1".
split makes no assumption on the way you want to store the result. You can use a Vec, but you can also use anything that implements FromIterator<&str>, for example a LinkedList or a VecDeque, or any custom type that implements FromIterator<&str>.
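The two points above can be sketched in a few lines. The input "a,1,23.0" and the function names are made up for illustration; they are not taken from the original question.

```rust
use std::collections::VecDeque;

/// Returns the first field of `line` (split on `sep`) that parses as an i32.
/// Fields after the first match are never parsed, because the iterator is lazy.
fn first_int(line: &str, sep: char) -> Option<i32> {
    line.split(sep)
        .filter_map(|field| field.trim().parse::<i32>().ok())
        .next()
}

/// Collects at most `n` leading fields into a Vec.
fn first_fields(line: &str, sep: char, n: usize) -> Vec<&str> {
    line.split(sep).take(n).collect()
}

/// Collects all fields into a VecDeque instead of a Vec; any type that
/// implements FromIterator<&str> works the same way.
fn fields_deque(line: &str, sep: char) -> VecDeque<&str> {
    line.split(sep).collect()
}
```

For the hypothetical input "a,1,23.0", first_int stops at "1" and never tries to parse "23.0".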
There's also split_whitespace()
fn main() {
let words: Vec<&str> = " foo bar\t\nbaz ".split_whitespace().collect();
println!("{:?}", words);
// ["foo", "bar", "baz"]
}
The OP's question was how to split on a multi-character string. Here is a way to get the results, part1 and part2, as Strings instead of in a vector.
Here the string is split on the non-ASCII string "☄☃🤔" in place of "123":
let s = "☄☃🤔"; // also works with non-ASCII characters
let mut part1 = "some string ☄☃🤔 ffd".to_string();
let _t;
let part2;
if let Some(idx) = part1.find(s) {
part2 = part1.split_off(idx + s.len());
_t = part1.split_off(idx);
}
else {
part2 = "".to_string();
}
gets: part1 = "some string "
      part2 = " ffd"
If "☄☃🤔" is not found, part1 contains the untouched original String and part2 is empty.
Rosetta Code has a nice example, "Split a character string based on change of character", with a short solution that uses split_off:
fn main() {
let mut part1 = "gHHH5YY++///\\".to_string();
if let Some(mut last) = part1.chars().next() {
let mut pos = 0;
while let Some(c) = part1.chars().find(|&c| {if c != last {true} else {pos += c.len_utf8(); false}}) {
let part2 = part1.split_off(pos);
print!("{}, ", part1);
part1 = part2;
last = c;
pos = 0;
}
}
println!("{}", part1);
}
The task it solves:
Split a (character) string into comma (plus a blank) delimited strings based on a change of character (left to right).
If you are looking for the Python-flavoured split where you tuple-unpack the two ends of the split string, you can do
if let Some((a, b)) = line.split_once(' ') {
// ...
}
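split_once also accepts a multi-character separator, so it can produce the two owned Strings from the original question directly. A minimal sketch follows; split_in_two is a hypothetical name, and split_once needs a reasonably recent Rust toolchain.

```rust
/// Splits `s` at the first occurrence of `sep` into two owned Strings.
/// If `sep` is not found, the first part is all of `s` and the second is empty.
/// (Hypothetical helper built on str::split_once.)
fn split_in_two(s: &str, sep: &str) -> (String, String) {
    match s.split_once(sep) {
        Some((head, tail)) => (head.to_string(), tail.to_string()),
        None => (s.to_string(), String::new()),
    }
}
```

For example, split_in_two("some string 123 ffd", "123") yields "some string " and " ffd", with the separator itself dropped.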

What is the most efficient way to read a large file in chunks without loading the entire file in memory at once?

What is the most efficient general purpose way of reading "large" files (which may be text or binary), without going into unsafe territory? I was surprised how few relevant results there were when I did a web search for "rust read large file in chunks".
For example, one of my use cases is to calculate an MD5 checksum for a file using rust-crypto (the Md5 module allows you to add &[u8] chunks iteratively).
Here is what I have, which seems to perform slightly better than some other methods like read_to_end:
use std::{
fs::File,
io::{self, BufRead, BufReader},
};
fn main() -> io::Result<()> {
const CAP: usize = 1024 * 128;
let file = File::open("my.file")?;
let mut reader = BufReader::with_capacity(CAP, file);
loop {
let length = {
let buffer = reader.fill_buf()?;
// do stuff with buffer here
buffer.len()
};
if length == 0 {
break;
}
reader.consume(length);
}
Ok(())
}
I don't think you can write code more efficient than that. fill_buf on a BufReader over a File is basically just a straight call to read(2).
That said, BufReader isn't really a useful abstraction when you use it like that; it would probably be less awkward to just call file.read(&mut buf) directly.
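A sketch of the "call read directly" variant mentioned above. The name for_each_chunk and the callback design are assumptions made for illustration; a hasher's update call (for example the MD5 use case from the question) would go inside the closure.

```rust
use std::io::{self, ErrorKind, Read};

/// Feeds `reader` to `process` in chunks of at most `buf_size` bytes and
/// returns the total number of bytes read. `buf_size` should be non-zero,
/// otherwise nothing is read.
/// (Hypothetical helper; only one buffer is allocated and reused.)
fn for_each_chunk<R, F>(mut reader: R, buf_size: usize, mut process: F) -> io::Result<u64>
where
    R: Read,
    F: FnMut(&[u8]),
{
    let mut buf = vec![0u8; buf_size];
    let mut total = 0u64;
    loop {
        match reader.read(&mut buf) {
            Ok(0) => return Ok(total), // end of input
            Ok(n) => {
                // Only the first n bytes are valid for this iteration.
                process(&buf[..n]);
                total += n as u64;
            }
            // A read interrupted by a signal can simply be retried.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

With a file this could be called as for_each_chunk(File::open("my.file")?, 128 * 1024, |chunk| { /* feed chunk to the hasher */ })?. Note that read may return fewer bytes than the buffer holds even before the end of the file, which is why the closure receives the slice buf[..n] and not the whole buffer.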
I did it this way. I don't know whether it is the correct way, but it worked perfectly for me.
use std::io;
use std::io::prelude::*;
use std::fs::File;
fn main() -> io::Result<()>
{
const FNAME: &str = "LargeFile.txt";
const CHUNK_SIZE: usize = 1024; // bytes read by every loop iteration.
let mut limit: usize = (1024 * 1024) * 15; // How much should be actually read from the file..
let mut f = File::open(FNAME)?;
let mut buffer = [0; CHUNK_SIZE]; // buffer to contain the bytes.
// read up to 15mb as the limit suggests..
loop {
if limit == 0 {
// Finished reading..
break;
}
// Not finished reading, you can parse or process data.
let n = f.read(&mut buffer[..])?;
if n == 0 {
// reached the end of the file before the limit
break;
}
// only the first n bytes of the buffer hold data from this read
for byte in &buffer[..n] {
print!("{}", *byte as char);
}
limit = limit.saturating_sub(n);
}
Ok(())
}

How can I convert a string of numbers to an array or vector of integers in Rust?

I'm writing on STDIN a string of numbers (e.g 4 10 30 232312) and I want to read that and convert to an array (or a vector) of integers, but I can't find the right way. So far I have:
use std::io;
fn main() {
let mut reader = io::stdin();
let numbers = reader.read_line().unwrap();
}
You can do something like this:
use std::io::{self, BufRead}; // (a)
fn main() {
let reader = io::stdin();
let numbers: Vec<i32> =
reader.lock() // (0)
.lines().next().unwrap().unwrap() // (1)
.split(' ').map(|s| s.trim()) // (2)
.filter(|s| !s.is_empty()) // (3)
.map(|s| s.parse().unwrap()) // (4)
.collect(); // (5)
println!("{:?}", numbers);
}
First, we take a lock on stdin (0). lock() returns a StdinLock, which implements BufRead and therefore provides lines(). The underlying input buffer is shared by all threads in your program, hence access to it has to be synchronized.
Next, we read the next line (1); I'm using the lines() iterator whose next() method returns Option<io::Result<String>>, therefore to obtain just String you need to unwrap() twice.
Then we split it by spaces and trim resulting chunks from extra whitespace (2), remove empty chunks which were left after trimming (3), convert strings to i32s (4) and collect the result to a vector (5).
We also need to import std::io::BufRead trait (a) in order to use the lines() method.
If you know in advance that your input won't contain more than one space between numbers, you can omit step (3) and move the trim() call from (2) to (1):
let numbers: Vec<i32> =
reader.lock()
.lines().next().unwrap().unwrap()
.trim().split(' ')
.map(|s| s.parse().unwrap())
.collect();
Rust also provides a method to split a string into a sequence of whitespace-separated words, called split_whitespace():
let mut line = String::new();
reader.read_line(&mut line).unwrap();
let numbers: Vec<i32> = line
.split_whitespace()
.map(|s| s.parse().unwrap())
.collect();
split_whitespace() is in fact just a combination of split() and filter(), just like in my original example. It uses a split() function argument which checks for different kinds of whitespace, not only space characters.
On Rust 1.5.x, a working solution is:
use std::io;
fn main() {
let mut numbers = String::new();
io::stdin()
.read_line(&mut numbers)
.expect("read error");
let numbers: Vec<i32> = numbers
.split_whitespace()
.map(|s| s.parse().expect("parse error"))
.collect();
for num in numbers {
println!("{}", num);
}
}
Safer version. This one skips failed parses, so a bad token doesn't cause a panic.
It reads all of stdin; use read_line instead to read a single line.
use std::io::Read; // brings read_to_string into scope
let mut buf = String::new();
// reads everything until end of input; use read_line for a single line
std::io::stdin().read_to_string(&mut buf).expect("failed to read stdin");
// tokens that fail to parse are skipped instead of causing a panic
let v: Vec<i32> = buf
.split_whitespace() // split string into words by whitespace
.filter_map(|w| w.parse().ok()) // calling ok() turns Result into Option so that filter_map can discard None values
.collect(); // collect items into a Vec; the element type is determined by the type annotation
You can even read a vector of vectors like this.
use std::io::{self, BufRead}; // BufRead provides lines()
let stdin = io::stdin();
let locked = stdin.lock();
let vv: Vec<Vec<i32>> = locked.lines()
.filter_map(
|l| l.ok().map(
|s| s.split_whitespace()
.filter_map(|word| word.parse().ok())
.collect()))
.collect();
The code above works for inputs like
2 424 -42 124
42 242 23 22 241
24 12 3 232 445
and turns them into
[[2, 424, -42, 124],
[42, 242, 23, 22, 241],
[24, 12, 3, 232, 445]]
filter_map accepts a closure that returns Option<T> and filters out all Nones.
ok() turns Result<R,E> to Option<R> so that errors can be filtered in this case.
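The same filter_map and ok() combination can be tried on an in-memory string instead of stdin. In this sketch the name parse_rows is made up, and unlike the stdin version every line produces a row, even if the row ends up empty.

```rust
/// Parses each line of `text` into a row of integers, silently dropping
/// any token that is not a valid i32.
/// (Hypothetical helper operating on a &str instead of stdin.)
fn parse_rows(text: &str) -> Vec<Vec<i32>> {
    text.lines()
        .map(|line| {
            line.split_whitespace()
                // ok() turns a parse error into None, which filter_map drops.
                .filter_map(|word| word.parse::<i32>().ok())
                .collect::<Vec<i32>>()
        })
        .collect()
}
```

Because bad tokens are dropped silently, an input line such as "1 x 3" becomes [1, 3] with no indication that anything was skipped.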
The safer version from Dulguun Otgon simply skips all errors.
If you don't want to skip errors, consider the following function.
use std::str::FromStr;
fn parse_to_vec<'a, T, It>(it: It) -> Result<Vec<T>, <T as FromStr>::Err>
where
T: FromStr,
It: Iterator<Item = &'a str>,
{
it.map(|v| v.parse::<T>()).fold(Ok(Vec::new()), |vals, v| {
vals.and_then(|mut vals| {
v.and_then(|v| {
vals.push(v);
Ok(vals)
})
})
})
}
When using it, you can follow the usual panicking approach with expect:
let numbers = parse_to_vec::<i32, _>(data_str.trim().split(" "))
.expect("can't parse data");
or, better, convert the error and propagate it as a Result with ?:
let numbers = parse_to_vec::<i32, _>(data_str.trim().split(" "))
.map_err(|e| format!("can't parse data: {:?}", e))?;
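As an alternative to the explicit fold, the standard library can collect an iterator of Result values straight into a Result<Vec<_>, _>, stopping at the first error. A minimal sketch, where parse_all is a hypothetical name and splitting on whitespace is an assumption:

```rust
use std::str::FromStr;

/// Parses every whitespace-separated token of `input` as a `T`.
/// Returns the first parse error instead of skipping bad tokens.
/// (Hypothetical helper relying on FromIterator for Result.)
fn parse_all<T: FromStr>(input: &str) -> Result<Vec<T>, T::Err> {
    input.split_whitespace().map(|s| s.parse::<T>()).collect()
}
```

It is used the same way as parse_to_vec, for example parse_all::<i32>("4 10 30 232312"), followed by expect or by map_err and ? as shown above.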
