Converting CSV file to array consumes massive amounts of memory

I have some medium-sized CSV files (about 140 MB) and I'm trying to turn them into an array of structs.
I don't want to load the whole file into memory, so I'm using a stream reader.
For each line I read the data, turn the line into my struct, and append the struct to the array. Because there are more than 5,000,000 lines in total, I used reserveCapacity to get better memory management.
var dataArray : [inputData] = []
dataArray.reserveCapacity(5_201_014)
Unfortunately, that doesn't help at all. There is no performance difference.
The memory graph in the debug session rises to 1.54 GB and then stays there.
I'm wondering what I'm doing wrong, because I can't imagine that it takes 1.54 GB of RAM to store an array of structs from a file with an original size of 140 MB.
I use the following code to create the array:
var dataArray: [inputData] = []
dataArray.reserveCapacity(5_201_014)
let stream = StreamReader(path: "pathToDocument")
defer { stream!.close() }
while let line = stream!.nextLine() {
    // Skip the header line; every other line holds three comma-separated floats.
    if !line.isHeader() {
        let array = line.components(separatedBy: ",")
        dataArray.append(inputData(a: Float32(array[0])!, b: Float32(array[1])!, c: Float32(array[2])!))
    }
}
I know that there are several CSV reader packages on GitHub, but they are extremely slow.
Thanks for any advice.

Related

Splitting string into array string.components(separatedBy: ",") consumes more time

I have a text file which contains 18,000 lines of city names. Each line has the city name, state, latitude, longitude, etc. Below is the function which reads it. If I don't call string.components(separatedBy: ", "), the loading function is pretty fast, but with it the function takes long enough to freeze my UI. What is the right way of doing this? Is string.components(separatedBy: ", ") that costly?
I profiled the app: the string.components(separatedBy: ", ") line takes 1.45 s out of the 2.09 s spent in the whole function.
func readCitiesFromCountry(country: String) -> [String] {
    var cityArray: [String] = []
    var flag = true
    var returnedCitiesList: [String] = []
    if let path = Bundle.main.path(forResource: country, ofType: "txt") {
        guard let streamReader = StreamReader(path: path) else { fatalError() }
        defer {
            streamReader.close()
        }
        while flag {
            if let nextLine = streamReader.nextLine() {
                cityArray = nextLine.components(separatedBy: ",") // this is the line taking a lot of time, without this function runs pretty fast
                if (country == "USA") {
                    returnedCitiesList.append("\(cityArray[0]) , \(cityArray[1]) , \(cityArray[2])")
                } else {
                    returnedCitiesList.append("\(cityArray[0]) , \(cityArray[1])")
                }
                //returnedCitiesList.append(nextLine)
            } else {
                flag = false
            }
        }
    } else {
        fatalError()
    }
    return returnedCitiesList
}
The StreamReader used in the code comes from "Read a file/URL line-by-line in Swift". It reads a file line by line.
This question is not about how to split a string into an array (see "Split a String into an array in Swift?"), but rather about why splitting takes so much time in the given function.

NSString.components(separatedBy:) returns a [String], which requires the content of every piece to be copied from the original string into newly allocated strings. This slows things down.
You could address the symptom (the UI freezing) by putting this work on a background thread, but that just sweeps the problem under the rug (the inefficient copying is still there) and complicates things (async code is never fun).
Instead, you should consider using String.split(separator:maxSplits:omittingEmptySubsequences:), which returns [Substring]. Each Substring is just a view into the original string's memory, which stores the relevant range so that you only see that portion of the String which is modeled by the Substring. The only memory allocation happening here is for the array.
Hopefully that should be enough to speed your code up to acceptable levels. If not, you should combine both solutions, and use split off-thread.

How do I get a random line from a file?

I'm trying to get a random line from a file:
extern crate rand;
use rand::Rng;
use std::{
    fs::File,
    io::{prelude::*, BufReader},
};
const FILENAME: &str = "/etc/hosts";
fn find_word() -> String {
    let f = File::open(FILENAME).expect(&format!("(;_;) file not found: {}", FILENAME));
    let f = BufReader::new(f);
    let lines: Vec<_> = f.lines().collect();
    let n = rand::thread_rng().gen_range(0, lines.len());
    let line = lines
        .get(n)
        .expect(&format!("(;_;) Couldn't get {}th line", n))
        .unwrap_or(String::from(""));
    line
}
This code doesn't work:
error[E0507]: cannot move out of borrowed content
  --> src/main.rs:18:16
   |
18 |       let line = lines
   |  ________________^
19 | |         .get(n)
20 | |         .expect(&format!("(;_;) Couldn't get {}th line", n))
   | |____________________________________________________________^ cannot move out of borrowed content
I tried adding .clone() before .expect(...) and before .unwrap_or(...) but it gave the same error.
Is there a better way to get a random line from a file that doesn't involve collecting the whole file in a Vec?

Use IteratorRandom::choose to randomly sample from an iterator using reservoir sampling. This will scan through the entire file once, creating a String for each line, but it will not build a giant vector holding every line:
use rand::seq::IteratorRandom; // 0.7.3
use std::{
    fs::File,
    io::{BufRead, BufReader},
};
const FILENAME: &str = "/etc/hosts";
fn find_word() -> String {
    let f = File::open(FILENAME)
        .unwrap_or_else(|e| panic!("(;_;) file not found: {}: {}", FILENAME, e));
    let f = BufReader::new(f);
    let lines = f.lines().map(|l| l.expect("Couldn't read line"));
    lines
        .choose(&mut rand::thread_rng())
        .expect("File had no lines")
}
Your original problem is that:
slice::get returns an optional reference into the vector.
You can either clone this or take ownership of the value:
let line = lines[n].cloned()
let line = lines.swap_remove(n)
Both of these panic if n is out-of-bounds, which is reasonable here as you know that you are in bounds.
BufRead::lines returns io::Result<String>, so you have to handle that error case.
Additionally, don't use format! with expect:
expect(&format!("..."))
This will unconditionally allocate memory. When there's no failure, that allocation is wasted. Use unwrap_or_else as shown.

Is there a better way to get a random line from a file that doesn't involve collecting the whole file in a Vec?
You will always need to read the whole file, if only to know the number of lines. However, you don't need to store everything in memory, you can read lines one by one and discard them as you go so that you only keep one in the end. Here is how it goes:
Read and store the first line;
Read the second line, draw a random choice and either:
keep the first line with a probability of 50%,
or discard the first line and store the second line with a probability of 50%,
Keep reading lines from the file and for line number n, draw a random choice and:
keep the currently stored line with a probability of (n-1)/n,
or replace the currently stored line with the current line with a probability of 1/n.
Note that this is more or less what sample_iter does, except that sample_iter is more generic: it works on any iterator and can pick samples of any size (e.g. it can choose k items randomly).
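The steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code behind sample_iter or IteratorRandom::choose: reservoir_pick and the toy XorShift generator are hypothetical names introduced here so that the example builds with the standard library alone. In real code the rand crate would supply the randomness, and the modulo step used below is very slightly biased.

```rust
/// Pick one item from an iterator in a single pass, keeping only one item
/// in memory at a time. `rand_below(n)` is assumed to return a roughly
/// uniform value in `0..n`.
fn reservoir_pick<T, I, R>(items: I, mut rand_below: R) -> Option<T>
where
    I: Iterator<Item = T>,
    R: FnMut(usize) -> usize,
{
    let mut kept = None;
    for (i, item) in items.enumerate() {
        // Item number n = i + 1 replaces the stored item with probability 1/n;
        // otherwise the stored item survives with probability (n - 1)/n.
        // For n = 1 the draw is always 0, so the first item is always stored.
        if rand_below(i + 1) == 0 {
            kept = Some(item);
        }
    }
    kept
}

/// Toy xorshift64 generator (hypothetical helper, illustrative only).
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn below(&mut self, n: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 % n as u64) as usize
    }
}

fn main() {
    // Stand-in for the lines of a file.
    let text = "alpha\nbeta\ngamma\ndelta";
    let mut rng = XorShift(0x2545_F491_4F6C_DD1D);
    let line = reservoir_pick(text.lines(), |n| rng.below(n));
    println!("{:?}", line);
}
```

With a file, the same function could be fed BufReader::lines() after handling each io::Result, as in the answer above. An empty input yields None, which plays the role of the "File had no lines" case.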

Append function overwrites existing data in slice

I wrote a small application which records data from a sound card and stores the data in an array for later processing.
Whenever new data is available, portaudio executes the callback record. Within the callback I append the data to the array RecData.data.
The golang builtin function append adds as expected another element to the slice, but for whatever reason also overwrites all existing elements within the array with exactly the same data.
I have been trying to isolate the problem for more than two days, without success.
Here is a stripped down version of the code, which works and shows the problem:
package main
import (
	"fmt"
	"time"
	// "reflect"
	"github.com/gordonklaus/portaudio"
)
type RecData struct {
	data [][][]float32
}
func main() {
	var inputChs int = 1
	var outputChs int = 0
	var samplingRate float64 = 48000
	var framesPerBuffer int = 3 //for test purpose that low. Would normally be 1024 or 2048
	rec := RecData{make([][][]float32, 0, 1000)}
	portaudio.Initialize()
	stream, err := portaudio.OpenDefaultStream(inputChs, outputChs, samplingRate, framesPerBuffer, rec.record)
	if err != nil {
		fmt.Println(err)
	}
	defer stream.Close()
	stream.Start()
	for {
		time.Sleep(time.Millisecond * 10)
	}
}
// callback which gets called when new data is in the buffer
func (re *RecData) record(in [][]float32) {
	fmt.Println("Received sound sample: ")
	fmt.Println(in)
	re.data = append(re.data, in)
	fmt.Println("Content of RecData.data after adding received sound sample:")
	fmt.Println(re.data, "\n")
	time.Sleep(time.Millisecond * 500) //limit temporarily the amount of data read
	// iterate over all recorded data and compare them
	/*
		for i, d := range re.data{
			if reflect.DeepEqual(d, in){
				fmt.Printf("Data at index %d is the same as the recorded one, but should not be!\n", i )
			}
		}*/
}
2. Update
This is the application output:
Received sound sample:
[[0.71575254 1.0734825 0.7444282]]
Content of RecData.data after adding received sound sample:
[[[0.71575254 1.0734825 0.7444282]]]
Received sound sample:
[[0.7555193 0.768355 0.6575008]]
Content of RecData.data after adding received sound sample:
[[[0.7555193 0.768355 0.6575008]] [[0.7555193 0.768355 0.6575008]]]
Received sound sample:
[[0.7247052 0.68471473 0.6843796]]
Content of RecData.data after adding received sound sample:
[[[0.7247052 0.68471473 0.6843796]] [[0.7247052 0.68471473 0.6843796]] [[0.7247052 0.68471473 0.6843796]]]
Received sound sample:
[[0.6996536 0.66283375 0.67252487]]
Content of RecData.data after adding received sound sample:
[[[0.6996536 0.66283375 0.67252487]] [[0.6996536 0.66283375 0.67252487]] [[0.6996536 0.66283375 0.67252487]] [[0.6996536 0.66283375 0.67252487]]]
.... etc ....
As one can see, the size of the slice grows over time, but instead of the data just being appended, the existing data in the array also gets overwritten.
This should not happen. In the callback, portaudio provides a [][]float32 with the audio sample recorded from the sound card. As you can see, the samples are always different.
As mentioned, the code above is a stripped-down version of my application. Usually I would record, let's say, 5 seconds and then perform a Fast Fourier Transform (FFT) over the samples to calculate the spectrum. I left this part out since it has no impact on this particular problem.
I would very much appreciate any help. Maybe somebody can point out what I'm doing wrong.
Thanks!

The buffer passed into the callback is reused by the portaudio package, so you are appending the same slice structure to your data slice. Each time the buffer allocated by portaudio overwrites the data, you see the results in every element of your data slice.
You will need to allocate new slices and make a copy of the data:
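func (re *RecData) record(in [][]float32) {
	// Copy every channel into freshly allocated memory, because portaudio
	// reuses the buffer behind in on the next callback.
	buf := make([][]float32, len(in))
	for i, v := range in {
		buf[i] = append([]float32(nil), v...)
	}
	re.data = append(re.data, buf)
}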
Example:
https://play.golang.org/p/cF57lQIZFU

copy NSData to UnsafeMutablePointer<Void>

Hi there stackoverflowers. I'm implementing a wrapper for Secure Transport and I'm stuck on some of the C -> Swift syntax.
func sslReadCallback(connection: SSLConnectionRef,
    data: UnsafeMutablePointer<Void>,
    var dataLength: UnsafeMutablePointer<Int>) -> OSStatus
{
    //let bytesRequested = dataLength.memory
    let transportWrapper:SecureTransportWrapper = UnsafePointer(connection).memory
    let bytesRead:NSData = transportWrapper.readFromConnectionFunc(transportWrapper.connection)
    dataLength = UnsafeMutablePointer<Int>.alloc(1)
    dataLength.initialize(bytesRead.length)
    if (bytesRead.length == 0)
    {
        return OSStatus(errSSLClosedGraceful)
    }
    else
    {
        data.alloc(sizeof(bytesRead.length)) //<----compile error here
        return noErr
    }
}
I've marked the location of the compile error. I don't blame it for erring, I was kind of guessing here :P. I'm trying to copy the NSData to the data: UnsafeMutablePointer<Void>. How do I do that?
Compile error:
/Users/*/SecureTransportWrapper.swift:108:9: Static member 'alloc' cannot be used on instance of type 'UnsafeMutablePointer' (aka 'UnsafeMutablePointer<()>')
Thanks a ton!
================
Update: here is the API documentation for what the sslReadCallback is supposed to do:
connection: A connection reference.
data: On return, your callback should overwrite the memory at this location with the data read from the connection.
dataLength: On input, a pointer to an integer representing the length of the data in bytes. On return, your callback should overwrite that integer with the number of bytes actually transferred.
(Excerpt from the API documentation.)

OK, let's go through your code:
dataLength = UnsafeMutablePointer<Int>.alloc(1)
dataLength.initialize(bytesRead.length)
dataLength is a pointer that is passed in to you. It is where the caller of the function both gives you the size of the buffer and expects you to put the number of bytes you read. You don't need to alloc it; it is already allocated.
(Irrelevant for this example but: Also in alloc(N) and initialize(N) the N should be the same (it is the amount of memory being allocated, and then initialized))
I think what you want (Swift 3 uses pointee instead of memory) is this:
dataLength.memory = bytesRead.length
The C API says that, on input, this variable also holds the size of the data buffer, and data is pre-allocated to that size.
So before you overwrite dataLength.memory, make sure the data you read fits (bytesRead.length <= dataLength.memory), then just do a
memcpy(data, bytesRead.bytes, bytesRead.length)
That's all.

Windows phone 7 silverlight string array in Isolated storage

I have an array of strings which I am trying to store in Isolated Storage. However, I need to store each string in the array in a new file of its own.
Any approach is welcomed.
Thanks.

I do something similar in an app with code roughly along these lines. Though I am serializing objects in an array to json. Same rough idea though.
using (IsolatedStorageFile file = IsolatedStorageFile.GetUserStoreForApplication()) {
    for (int i = 0; i < array.Length; i++) {
        // One file per string, named after the element's index.
        string fileName = "file" + i.ToString() + ".dat";
        using (var stream = file.CreateFile(fileName)) {
            using (var writer = new StreamWriter(stream)) {
                writer.Write(array[i]);
            }
        }
    }
}
Note this is just typed straight in, I may have a mistake in there :)

Your question is a little vague, but here I go.
What is stopping you from just serializing each string to a file with the index as the name? For example, store stringarray[0] in a file 0.xml.
Just check whether the file exists before trying to read it.
