Golang dynamic sizing slice when reading a file using bufio.Read

I have a problem where I need to use bufio.Read to read a TSV file line by line, and I need to record how many bytes each line I've read is.
The problem is, it seems like I can't just initialize an empty slice and pass it into bufio.Read and expect the slice to contain the entire line of the file.
file, _ := os.Open("file.tsv")
reader := bufio.NewReader(file)
b := make([]byte, 10)
for {
    bytesRead, err := reader.Read(b)
    fmt.Println(bytesRead, b)
    if err != nil {
        break
    }
}
So, for this example, since I specified the slice to be 10 bytes, the reader will read at most 10 bytes even if the line is bigger than 10 bytes.
However:
file, _ := os.Open("file.tsv")
reader := bufio.NewReader(file)
b := []byte{} // or var b []byte
for {
    bytesRead, err := reader.Read(b)
    fmt.Println(bytesRead, b)
    if err != nil {
        break
    }
}
This will always read 0 bytes, and I assume it's because the buffer has length 0 (and capacity 0).
How do I read a file line by line, save the entire line in a variable or buffer, and return exactly how many bytes I've read?
Thanks!

If you want to read line by line, and you're using a buffered reader, use the buffered reader's ReadBytes method.
line, err := reader.ReadBytes('\n')
This will give you a full line, one line at a time, regardless of byte length.
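To also record how many bytes each line occupies, len() of the returned slice is enough, since ReadBytes includes the delimiter. A minimal sketch, with error handling simplified and "file.tsv" just a placeholder name:
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("file.tsv")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    reader := bufio.NewReader(file)
    for {
        line, err := reader.ReadBytes('\n')
        if len(line) > 0 {
            // len(line) is how many bytes this line occupies, including
            // the trailing '\n' (the last line may not have one).
            fmt.Println(len(line), string(line))
        }
        if err != nil { // io.EOF once the file is exhausted
            break
        }
    }
}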

How to get file length in Go dynamically?

I have the following code snippet:
func main() {
    // Some text we want to compress.
    original := "bird and frog"
    // Open a file for writing.
    f, _ := os.Create("C:\\programs\\file.gz")
    // Create gzip writer.
    w := gzip.NewWriter(f)
    // Write bytes in compressed form to the file.
    for /* looping over database cursor */ {
        w.Write([]byte( /* the row from the database as obtained from cursor */ ))
    }
    // Close the file.
    w.Close()
    fmt.Println("DONE")
}
However, I need a small modification: when the size of the file reaches a certain threshold, I want to close it and open a new file, also in compressed format.
For example:
Assume a database has 10 rows each row is 50 bytes.
Assume the compression factor is 2, i.e. 1 row of 50 bytes is compressed to 25 bytes.
Assume the file size limit is 50 bytes.
Which means after every 2 records I should close the file and open a new file.
How do I keep track of the file size while it's still open and I'm still writing compressed documents to it?
gzip.NewWriter takes an io.Writer. It is easy to implement a custom io.Writer that does what you want.
E.g. Playground
type MultiFileWriter struct {
    maxLimit      int
    currentSize   int
    currentWriter io.Writer
}

func (m *MultiFileWriter) Write(data []byte) (n int, err error) {
    if len(data)+m.currentSize > m.maxLimit {
        // Switch to the next output file and reset the running size.
        m.currentWriter = createNextFile()
        m.currentSize = 0
    }
    m.currentSize += len(data)
    return m.currentWriter.Write(data)
}
Note: You will need to handle a few edge cases, like what if len(data) is greater than maxLimit. And maybe you don't want to split a record across files.
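For illustration only, a createNextFile could look like the sketch below; the numbered file-name scheme is my own assumption and error handling is simplified.
var fileIndex int

// createNextFile is only a sketch: it opens the next numbered output file
// and returns it as an io.Writer. Closing the previous file (and, if each
// chunk should be an independently readable .gz, wrapping each new file in
// its own gzip.Writer) is not handled here.
func createNextFile() io.Writer {
    fileIndex++
    f, err := os.Create(fmt.Sprintf("out_%03d.gz", fileIndex))
    if err != nil {
        panic(err)
    }
    return f
}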
You can use the os.File.Seek method to get your current position in the file, which as you're writing the file will be the current file size in bytes.
For example:
package main

import (
    "compress/gzip"
    "fmt"
    "io"
    "os"
)

func main() {
    // Some text we want to compress.
    lines := []string{
        "this is a test",
        "the quick brown fox",
        "jumped over the lazy dog",
        "the end",
    }
    // Open a file for writing.
    f, err := os.Create("file.gz")
    if err != nil {
        panic(err)
    }
    // Create gzip writer.
    w := gzip.NewWriter(f)
    // Write bytes in compressed form to the file.
    for _, line := range lines {
        w.Write([]byte(line))
        w.Flush()
        pos, err := f.Seek(0, io.SeekCurrent)
        if err != nil {
            panic(err)
        }
        fmt.Printf("pos: %d\n", pos)
    }
    // Close the file.
    w.Close()
    // The call to w.Close() will write out any remaining data
    // and the final checksum.
    pos, err := f.Seek(0, io.SeekCurrent)
    if err != nil {
        panic(err)
    }
    fmt.Printf("pos: %d\n", pos)
    fmt.Println("DONE")
}
Which outputs:
pos: 30
pos: 55
pos: 83
pos: 94
pos: 107
DONE
And we can confirm with wc:
$ wc -c file.gz
107 file.gz
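Putting the two ideas together for the original question (rotate to a new gzip file once a size threshold is crossed), a rough sketch could look like the following. The rows, the file-name pattern, and checking the threshold after every write are my own assumptions, errors are ignored for brevity, and note that frequent Flush calls hurt the compression ratio:
package main

import (
    "compress/gzip"
    "fmt"
    "io"
    "os"
)

func main() {
    rows := []string{"this is a test", "the quick brown fox", "jumped over the lazy dog", "the end"}
    const maxSize = 50 // per-file size threshold, from the question's example

    index := 0
    f, _ := os.Create(fmt.Sprintf("file_%d.gz", index))
    w := gzip.NewWriter(f)

    for _, row := range rows {
        w.Write([]byte(row))
        w.Flush() // push data through so Seek reflects what is on disk

        if pos, _ := f.Seek(0, io.SeekCurrent); pos >= maxSize {
            // Finish this file and start the next one.
            w.Close()
            f.Close()
            index++
            f, _ = os.Create(fmt.Sprintf("file_%d.gz", index))
            w = gzip.NewWriter(f)
        }
    }
    w.Close()
    f.Close()
    fmt.Println("DONE")
}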

Expanding a temporary slice if more bytes are needed

I'm generating random files programmatically in a directory, at least temporaryFilesTotalSize worth of random data (a bit more, who cares).
Here's my code:
var files []string
for size := int64(0); size < temporaryFilesTotalSize; {
    fileName := random.HexString(12)
    filePath := dir + "/" + fileName
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }
    size += rand.Int63n(1 << 32) // random dimension up to 4GB
    raw := make([]byte, size)
    _, err = rand.Read(raw)
    if err != nil {
        panic(err)
    }
    file.Write(raw)
    file.Close()
    files = append(files, filePath)
}
Is there any way I can avoid that raw := make([]byte, size) allocation in the for loop?
Ideally I'd like to keep a slice on the heap and only grow if a bigger size is required. Any way to do this efficiently?
First of all, you should know that generating random data and writing it to disk is at least an order of magnitude slower than allocating contiguous memory for a buffer. This definitely falls under the "premature optimization" category: eliminating the creation of the buffer inside the iteration will not make your code noticeably faster.
Reusing the buffer
But to reuse the buffer, move it outside of the loop, create the biggest needed buffer, and slice it in each iteration to the needed size. It's OK to do this, because we'll overwrite the whole part we need with random data.
Note that I somewhat changed the size generation (there is likely an error in your code: your generated temporary files keep getting bigger, since you use the accumulated size for new ones).
Also note that writing a file with contents prepared in a []byte is most easily done with a single call to os.WriteFile().
Something like this:
bigRaw := make([]byte, 1<<32)

for totalSize := int64(0); ; {
    size := rand.Int63n(1 << 32) // random dimension up to 4GB
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }

    raw := bigRaw[:size]
    rand.Read(raw) // It's documented that rand.Read() always returns nil error

    filePath := filepath.Join(dir, random.HexString(12))
    if err := os.WriteFile(filePath, raw, 0666); err != nil {
        panic(err)
    }
    files = append(files, filePath)
}
Solving the task without an intermediate buffer
Since you are writing big files (GBs), allocating that big buffer is not a good idea: running the app will require GBs of RAM! We could improve it with an inner loop to use smaller buffers until we write the expected size, which solves the big memory issue, but increases complexity. Luckily for us, we can solve the task without any buffers, and even with decreased complexity!
We should somehow "channel" the random data from a rand.Rand directly to the file, similar to what io.Copy() does. Note that rand.Rand implements io.Reader, and os.File implements io.ReaderFrom, which suggests we could simply pass a rand.Rand to file.ReadFrom(), and the file itself would get the data to be written directly from the rand.Rand.
This sounds good, but ReadFrom() reads data from the given reader until EOF or error. Neither will ever happen if we pass rand.Rand. And we do know how many bytes we want read and written: size.
To our "rescue" comes io.LimitReader(): we pass an io.Reader and a size to it, and the returned reader will supply no more than the given number of bytes, and after that will report EOF.
Note that creating our own rand.Rand will also be faster: the source we pass to it is created with rand.NewSource(), which returns an "unsynchronized" source (not safe for concurrent use) that is in turn faster. The source used by the default/global rand.Rand is synchronized (and so safe for concurrent use, but slower).
Perfect! Let's see this in action:
r := rand.New(rand.NewSource(time.Now().Unix()))

for totalSize := int64(0); ; {
    size := r.Int63n(1 << 32)
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }

    filePath := filepath.Join(dir, random.HexString(12))
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }
    if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
        panic(err)
    }
    if err = file.Close(); err != nil {
        panic(err)
    }
    files = append(files, filePath)
}
Note that if os.File did not implement io.ReaderFrom, we could still use io.Copy(), providing the file as the destination and a limited reader (as used above) as the source.
Final note: closing the file (or any resource) is best done using defer, so it gets called no matter what. Using defer in a loop is a bit tricky though, as deferred functions run at the end of the enclosing function, not at the end of the loop's iteration. So you may wrap the body in a function. For details, see `defer` in the loop - what will be better?
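For illustration, the per-iteration work can be wrapped in a closure so the deferred Close runs at the end of each iteration rather than at the end of the enclosing function. This is only a sketch of the pattern, reusing the variables from the snippet above, and note that the deferred Close here silently drops any close error:
for totalSize := int64(0); ; {
    size := r.Int63n(1 << 32)
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }

    // The closure gives defer a function scope that ends with each iteration.
    err := func() error {
        filePath := filepath.Join(dir, random.HexString(12))
        file, err := os.Create(filePath)
        if err != nil {
            return err
        }
        defer file.Close() // runs when this closure returns

        if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
            return err
        }
        files = append(files, filePath)
        return nil
    }()
    if err != nil {
        return nil, err
    }
}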

Recording and seeking to CSV file positions in Golang

I need to read a CSV file and record the locations of lines with certain values into an array, then later go back and retrieve those lines in no particular order and with good performance, so random access.
My program uses csv.NewReader(file), but I see no way to get or set the file offset that it uses. I tried file.Seek(0, io.SeekCurrent) to return the file position, but it doesn't change between calls to reader.Read(). I also tried fmt.Printf("%+v %+v\n", reader, file) to see if anything stores the reader's file position, but I don't see it. I also don't know the best way to use the file position if I do find it.
Here's what I need to do:
file, _ = os.Open("stuff.csv")
reader = csv.NewReader(file)

// read file and record locations
for {
    line, _ = reader.Read()
    if wantToRememberLocation(line) {
        locations = append(locations, getLocation()) // need this function
    }
}

// then revisit certain lines
for {
    reader.GoToLine(locations[random]) // need this function
    line, _ = reader.Read()
    doStuff(line)
}
Is there even a way to do this with the csv library, or will I have to write my own using more primitive file io functions?
Here's a solution using TeeReader. This example just saves all the positions and goes back and rereads some of them.
// set up some vars and readers to record position and length of each line
type Record struct {
    Pos int64
    Len int
}

records := make([]Record, 1)
var buf bytes.Buffer
var pos int64

file, _ := os.Open("stuff.csv")
tr := io.TeeReader(file, &buf)
cr := csv.NewReader(tr)

// read first row and get things started
data, _ := cr.Read()
doStuff(data)

// length of current row determines position of next
lineBytes, _ := buf.ReadBytes('\n')
length := len(lineBytes)
pos += int64(length)
records[0].Len = length
records = append(records, Record{Pos: pos})

for i := 1; ; i++ {
    // read csv data
    data, err := cr.Read()
    if err != nil {
        break
    }
    doStuff(data)

    // record length and position
    lineBytes, _ = buf.ReadBytes('\n')
    length = len(lineBytes)
    pos += int64(length)
    records[i].Len = length
    records = append(records, Record{Pos: pos})
}

// prepare individual line reader
line := make([]byte, 1000)
lineReader := bytes.NewReader(line)

// read random lines from file
for {
    i := someLineNumber()

    // use original file reader to fill byte slice with line
    file.ReadAt(line[:records[i].Len], records[i].Pos)

    // need a new lineParser to start at the beginning every time
    lineReader.Seek(0, io.SeekStart)
    lineParser := csv.NewReader(lineReader)
    data, _ = lineParser.Read()
    doStuff(data)
}
os.Open returns a File, which implements io.Seeker.
So you can do this to rewind the stream to the beginning:
_, err = file.Seek(0, io.SeekStart)
https://golang.org/src/os/file.go
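Building on that, since the offsets recorded by the TeeReader approach above are byte positions in the underlying file, one way to "go to" a saved line is to seek the file and wrap it in a fresh csv.Reader. This is only a sketch (the helper name is mine) that reuses the Record type from the first answer:
// readLineAt re-parses the single CSV record stored at rec.Pos/rec.Len.
func readLineAt(file *os.File, rec Record) ([]string, error) {
    if _, err := file.Seek(rec.Pos, io.SeekStart); err != nil {
        return nil, err
    }
    // Limit the reader to this line so the parser stops at its end.
    r := csv.NewReader(io.LimitReader(file, int64(rec.Len)))
    return r.Read()
}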

How to return hash and bytes in one step in Go?

I'm trying to understand how I can read the content of a file, calculate its hash, and return its bytes in one go. So far, I'm doing this in two steps, e.g.
// calculate file checksum
hasher := sha256.New()
f, err := os.Open(fname)
if err != nil {
    msg := fmt.Sprintf("Unable to open file %s, %v", fname, err)
    panic(msg)
}
defer f.Close()

_, err = io.Copy(hasher, f)
if err != nil {
    panic(err)
}
cksum := hex.EncodeToString(hasher.Sum(nil))

// read again (!!!) to get the data as a byte slice
data, err := ioutil.ReadFile(fname)
Obviously this is not the most efficient way to do it, since the read happens twice: once in io.Copy to feed the hasher, and again in ioutil.ReadFile to get the file's contents as a byte slice. I'm struggling to understand how I can combine these steps, read the data once, calculate any hash, and return it along with the bytes to another layer.
If you want to read a file, without creating a copy of the entire file in memory, and at the same time calculate its hash, you can do so with a TeeReader:
hasher := sha256.New()
f, err := os.Open(fname)
data := io.TeeReader(f, hasher)
// Now read from data as usual, which is still a stream.
What happens here is that any bytes that are read from data (which is a Reader just like the file object f is) will be pushed to hasher as well.
Note, however, that hasher will produce the correct hash only once you have read the entire file through data, and not before. So if you need the hash before you decide whether or not you want to read the file, you are left with two options: do it in two passes (for example as you do now), or always read the file and discard the result if the hash check fails.
If you do read the file in two passes, you could of course buffer the entire file data in a byte buffer in memory. However, the operating system will typically cache the file you just read in RAM anyway (if possible), so the performance benefit of doing a buffered two-pass solution yourself rather than just doing two passes over the file is probably negligible.
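If you do want both the bytes and the hash from a single pass, one way (sketched here as an assumption, not part of the original answer; io.ReadAll needs Go 1.16+, otherwise use ioutil.ReadAll) is to read everything through the TeeReader:
// readAndHash returns the file's contents and its SHA-256 hex digest,
// reading the file only once.
func readAndHash(fname string) ([]byte, string, error) {
    f, err := os.Open(fname)
    if err != nil {
        return nil, "", err
    }
    defer f.Close()

    hasher := sha256.New()
    // Every byte read from tee is also written to hasher.
    tee := io.TeeReader(f, hasher)

    data, err := io.ReadAll(tee)
    if err != nil {
        return nil, "", err
    }
    return data, hex.EncodeToString(hasher.Sum(nil)), nil
}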
You can write bytes directly to the hasher. For example:
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "io/ioutil"
)

func main() {
    hasher := sha256.New()
    data, err := ioutil.ReadFile("foo.txt")
    if err != nil {
        panic(err)
    }
    hasher.Write(data)
    cksum := hex.EncodeToString(hasher.Sum(nil))
    println(cksum)
}
This works because the Hash interface embeds io.Writer, which lets you read the bytes from the file once, write them into the hasher, and then also return them.
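Wrapped in a helper (just a sketch; the function name is mine), that approach looks like:
// hashAndBytes reads the file once, feeds the hasher, and returns both
// the raw bytes and the hex-encoded checksum.
func hashAndBytes(fname string) ([]byte, string, error) {
    data, err := ioutil.ReadFile(fname)
    if err != nil {
        return nil, "", err
    }
    hasher := sha256.New()
    hasher.Write(data)
    return data, hex.EncodeToString(hasher.Sum(nil)), nil
}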
Do data, err := ioutil.ReadFile(fname) first. You'll have your slice of bytes. Then create your hasher, and do hasher.Write(data).
If you plan on hashing files, you shouldn't read the whole file into memory because... there are large files that don't fit into RAM. Yes, in practice you will very rarely run into such out-of-memory issues, but you can easily prevent them. The Hash interface is an io.Writer. Usually, the hash packages have a New function that returns a Hash. This allows you to read the file in blocks and continuously feed them to the Write method of the Hash you have. You may also use helpers like io.Copy to do this:
h := sha256.New()
data := &bytes.Buffer{}
data.Write([]byte("hi there"))
data.Write([]byte("folks"))
io.Copy(h, data)
fmt.Printf("%x", h.Sum(nil))
io.Copy uses a buffer of 32KiB internally, so using it requires around 32KiB of memory at most.

Jump to specific line in file in Go

In Go, is it possible to jump to a particular line number in a file and delete it? Something like linecache in Python.
I'm trying to match some substrings in a file and remove the corresponding lines. The matching part I've taken care of, and I have an array with the line numbers I need to delete, but I'm stuck on how to delete the matching lines in the file.
This is an old question, but if anyone is looking for a solution, I wrote a package (github.com/stoicperlman/fls) that handles going to any line in a file. It can open a file and seek to any line position without reading the whole file into memory and splitting it.
import "github.com/stoicperlman/fls"
// This is just a wrapper around os.OpenFile. Alternatively
// you could open from os.File and use fls.LineFile(file) to get f
f, err := fls.OpenFile("test.log", os.O_CREATE|os.O_WRONLY, 0600)
defer f.Close()
// return begining line 1/begining of file
// equivalent to f.Seek(0, io.SeekStart)
pos, err := f.SeekLine(0, io.SeekStart)
// return begining line 2
pos, err := f.SeekLine(1, io.SeekStart)
// return begining of last line
pos, err := f.SeekLine(0, io.SeekEnd)
// return begining of second to last line
pos, err := f.SeekLine(-1, io.SeekEnd)
Unfortunately I'm not sure how you would delete, this just handles getting you to the correct position in the file. For your case you could use it to go to the line you want to delete and save the position. Then seek to the next line and save that as well. You now have the bookends of the line to delete.
// might want lineToDelete - 1
// this acts like 0 based array
pos1, err := f.SeekLine(lineToDelete, io.SeekStart)
// skip ahead 1 line
pos2, err := f.SeekLine(1, io.SeekCurrent)
// pos2 will be the position of the first character in next line
// might want pos2 - 1 depending on how the function works
DeleteBytesFromFileFunction(f, pos1, pos2)
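DeleteBytesFromFileFunction is a placeholder in the snippet above; a naive way to implement something like it (the function name is hypothetical, and it reads the whole file into memory, which is fine for modest files) could be:
// deleteByteRange rewrites path without the bytes in the range [from, to).
func deleteByteRange(path string, from, to int64) error {
    data, err := os.ReadFile(path)
    if err != nil {
        return err
    }
    out := append([]byte{}, data[:from]...)
    out = append(out, data[to:]...)
    return os.WriteFile(path, out, 0644)
}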
Based on my reading of the linecache module, it takes a file and explodes it into an array based on '\n' line endings. You could replicate the same behavior in Go using the strings or bytes packages. You could also use the bufio package to read a file line by line and only store or save the lines you want (see the sketch after the example below).
package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
)

func main() {
    b, e := ioutil.ReadFile("filename.txt")
    if e != nil {
        panic(e)
    }
    array := bytes.Split(b, []byte("\n"))
    fmt.Printf("%v", array)
}
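And the bufio variant mentioned above, sketched here with a hypothetical skipLines set holding the (0-based) line numbers to drop:
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    // Line numbers (0-based) we want to remove; just an example set.
    skipLines := map[int]bool{1: true, 3: true}

    f, err := os.Open("filename.txt")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var kept []string
    scanner := bufio.NewScanner(f)
    for i := 0; scanner.Scan(); i++ {
        if skipLines[i] {
            continue // drop this line
        }
        kept = append(kept, scanner.Text())
    }
    if err := scanner.Err(); err != nil {
        panic(err)
    }
    fmt.Println(kept) // write these back out to remove the skipped lines
}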
I wrote a small function that allows you to remove a specific line from a file.
package main

import (
    "io/ioutil"
    "os"
    "strings"
)

func main() {
    path := "path/to/file.txt"
    removeLine(path, 2)
}

// removeLine deletes the line at the given zero-based index and
// writes the file back with its original permissions.
func removeLine(path string, lineNumber int) {
    file, err := ioutil.ReadFile(path)
    if err != nil {
        panic(err)
    }
    info, _ := os.Stat(path)
    mode := info.Mode()

    array := strings.Split(string(file), "\n")
    array = append(array[:lineNumber], array[lineNumber+1:]...)

    ioutil.WriteFile(path, []byte(strings.Join(array, "\n")), mode)
}
