Expanding a temporary slice if more bytes are needed - file

I'm generating random files programmatically in a directory, at least temporaryFilesTotalSize worth of random data (a bit more, who cares).
Here's my code:
var files []string
for size := int64(0); size < temporaryFilesTotalSize; {
    fileName := random.HexString(12)
    filePath := dir + "/" + fileName
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }
    size += rand.Int63n(1 << 32) // random size up to 4GB
    raw := make([]byte, size)
    _, err = rand.Read(raw)
    if err != nil {
        panic(err)
    }
    file.Write(raw)
    file.Close()
    files = append(files, filePath)
}
Is there any way I can avoid that raw := make([]byte, size) allocation in the for loop?
Ideally I'd like to keep a slice on the heap and only grow if a bigger size is required. Any way to do this efficiently?

First of all, you should know that generating random data and writing it to disk is at least an order of magnitude slower than allocating contiguous memory for a buffer. This definitely falls under the "premature optimization" category: eliminating the creation of the buffer inside the loop will not make your code noticeably faster.
Reusing the buffer
But to reuse the buffer, move it outside of the loop, create the biggest buffer you will need once, and slice it in each iteration to the needed size. This is OK to do, because we overwrite the whole part we use with random data.
Note that I changed the size generation somewhat (there is likely an error in your code: your generated temporary files keep getting bigger, because you use the accumulated size as the size of each new one).
Also note that writing a file whose contents are already prepared in a []byte is most easily done with a single call to os.WriteFile().
Something like this:
bigRaw := make([]byte, 1 << 32)
for totalSize := int64(0); ; {
    size := rand.Int63n(1 << 32) // random size up to 4GB
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }
    raw := bigRaw[:size]
    rand.Read(raw) // It's documented that rand.Read() always returns a nil error
    filePath := filepath.Join(dir, random.HexString(12))
    if err := os.WriteFile(filePath, raw, 0666); err != nil {
        panic(err)
    }
    files = append(files, filePath)
}
Solving the task without an intermediate buffer
Since you are writing big files (GBs), allocating that big buffer is not a good idea: running the app will require GBs of RAM! We could improve it with an inner loop to use smaller buffers until we write the expected size, which solves the big memory issue, but increases complexity. Luckily for us, we can solve the task without any buffers, and even with decreased complexity!
We should somehow "channel" the random data from a rand.Rand to the file directly, similar to what io.Copy() does. Note that rand.Rand implements io.Reader, and os.File implements io.ReaderFrom, which suggests we could simply pass the rand.Rand to file.ReadFrom(), and the file itself would read the data to be written directly from the rand.Rand.
This sounds good, but ReadFrom() reads data from the given reader until EOF or an error occurs, and neither will ever happen with rand.Rand. We do, however, know exactly how many bytes we want read and written: size.
To our "rescue" comes io.LimitReader(): we pass it an io.Reader and a size, and the returned reader will supply no more than the given number of bytes, after which it reports EOF.
Note that creating our own rand.Rand will also be faster: the source we pass to it is created with rand.NewSource(), which returns an "unsynchronized" source (not safe for concurrent use) that is in turn faster. The source used by the default/global rand.Rand is synchronized (and so safe for concurrent use, but slower).
Perfect! Let's see this in action:
r := rand.New(rand.NewSource(time.Now().Unix()))
for totalSize := int64(0); ; {
    size := r.Int63n(1 << 32)
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }
    filePath := filepath.Join(dir, random.HexString(12))
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }
    if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
        panic(err)
    }
    if err = file.Close(); err != nil {
        panic(err)
    }
    files = append(files, filePath)
}
Note that if os.File did not implement io.ReaderFrom, we could still use io.Copy(), providing the file as the destination and a limited reader (as used above) as the source.
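That variant would look something like this (a sketch reusing the file, r and size variables from the loop above):
if _, err := io.Copy(file, io.LimitReader(r, size)); err != nil {
    panic(err)
}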
Final note: closing the file (or any resource) is best done using defer, so it gets called no matter what. Using defer in a loop is a bit tricky though, as deferred functions run at the end of the enclosing function, not at the end of the loop's iteration, so you may want to wrap the loop body in a function. For details, see `defer` in the loop - what will be better?
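For illustration, a minimal sketch of that pattern applied to the loop above (reusing r, dir, files, random.HexString() and temporaryFilesTotalSize from the earlier snippets):
for totalSize := int64(0); ; {
    size := r.Int63n(1 << 32)
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }
    if err := func() error {
        filePath := filepath.Join(dir, random.HexString(12))
        file, err := os.Create(filePath)
        if err != nil {
            return err
        }
        // Deferred calls run when this anonymous function returns,
        // i.e. at the end of each iteration, not at the end of the outer function.
        defer file.Close()
        if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
            return err
        }
        files = append(files, filePath)
        return nil
    }(); err != nil {
        return nil, err
    }
}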

Related

How to make write operations to a file faster

I am trying to write a large amount of data to a file, but it takes quite some time. I have tried two solutions, but they both take the same amount of time. Here are the solutions I have tried:
Solution A:
f, err := os.Create("file.txt")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

w := bufio.NewWriter(f)
for _, d := range data {
    _, err := w.WriteString(fmt.Sprint(d + "\n"))
    if err != nil {
        fmt.Println(err)
    }
}
err = w.Flush()
if err != nil {
    log.Fatal(err)
}
Solution B:
e, err := os.OpenFile(filePath, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0666)
if err != nil {
    panic(err)
}
defer e.Close()

for _, d := range data {
    _, err = e.WriteString(d)
    if err != nil {
        return err
    }
    err = e.Sync()
    if err != nil {
        return err
    }
}
Any other suggestion on how I can make this write operation faster?
I think bufio is your friend here, as it can help reduce the number of syscalls required to write the data to disk. You are already using it as part of solution A; however, note that the default buffer size is 4K. If you want to try larger buffer sizes, you can use NewWriterSize() to set a larger buffer for the writer.
See https://pkg.go.dev/bufio#NewWriterSize
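For example, a minimal sketch of your solution A with a larger write buffer (the 64K size is just for illustration):
f, err := os.Create("file.txt")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

// 64K buffer instead of the default 4K.
w := bufio.NewWriterSize(f, 64*1024)
for _, d := range data {
    if _, err := w.WriteString(d + "\n"); err != nil {
        log.Fatal(err)
    }
}
// Flush whatever is still buffered before closing the file.
if err := w.Flush(); err != nil {
    log.Fatal(err)
}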
Based on your solution A I have created a benchmark test you can use for experimenting with different buffer sizes. For the test I am using a data set of 100k records of 600 bytes each, written to the file. The results I get on my machine for 10 repeated calls of the function under test (FUT) with various buffer sizes are as follows:
BenchmarkWriteTest/Default_Buffer_Size
BenchmarkWriteTest/Default_Buffer_Size-10 15 73800317 ns/op
BenchmarkWriteTest/Buffer_Size_16K
BenchmarkWriteTest/Buffer_Size_16K-10 21 55606873 ns/op
BenchmarkWriteTest/Buffer_Size_64K
BenchmarkWriteTest/Buffer_Size_64K-10 25 49562057 ns/op
As you can see, the number of iterations in the test interval (the first number) increases significantly with a larger buffer size, and accordingly the time spent per operation drops.
https://gist.github.com/mwittig/f1e6a81c2378906292e2e4961f422870
Combine all your data into a single string, and write that in one operation. This will avoid the overhead of filesystem calls.
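For example, a sketch of that approach (assuming data is a []string as in the snippets above; os.WriteFile requires Go 1.16+):
var sb strings.Builder
for _, d := range data {
    sb.WriteString(d)
    sb.WriteByte('\n')
}
// A single write call for the whole content.
if err := os.WriteFile("file.txt", []byte(sb.String()), 0666); err != nil {
    log.Fatal(err)
}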

How to use NaCl to sign a large file?

Given the sign capability from Go NaCl library (https://github.com/golang/crypto/tree/master/nacl/sign), how to sign a file, especially, a very large file as big as more than 1GB? Most of the internet search results are all about signing a slice or small array of bytes.
I can think of 2 ways:
Loop through the file and stream it in blocks (e.g. 16k at a time), feeding each block into the sign function. The signed outputs are concatenated into a signature certificate. For verification, the process is done in reverse.
Use SHA(X) to generate the shasum of the file and then sign the shasum output.
For signing very large files (multiple gigabytes and up), the problem with using a standard signing function is often runtime and fragility. For very large files (or just slow disks) it could take hours or more just to serially read the full file from start to end.
In such cases, you want a way to process the file in parallel. One common approach that is suitable for cryptographic signatures is a Merkle tree hash. It lets you split the large file into smaller chunks, hash them in parallel (producing "leaf hashes"), and then further hash those hashes in a tree structure to produce a root hash which represents the full file.
Once you have calculated this Merkle tree root hash, you can sign this root hash. It then becomes possible to use the signed Merkle tree root hash to verify all of the file chunks in parallel, as well as verifying their order (based on the positions of the leaf hashes in the tree structure).
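A rough sketch of the idea (sequential for clarity; hashing the chunks is the part you would parallelize; assumes the usual crypto/sha256, io and os imports):
const chunkSize = 1 << 20 // 1 MiB chunks; the size is arbitrary

// merkleRoot hashes the file in fixed-size chunks (leaf hashes) and
// folds the hashes pairwise until a single root hash remains.
func merkleRoot(f *os.File) ([]byte, error) {
    var leaves [][]byte
    buf := make([]byte, chunkSize)
    for {
        n, err := io.ReadFull(f, buf)
        if n > 0 {
            h := sha256.Sum256(buf[:n])
            leaves = append(leaves, h[:])
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            break // last (possibly short) chunk reached
        }
        if err != nil {
            return nil, err
        }
    }
    for len(leaves) > 1 {
        var next [][]byte
        for i := 0; i < len(leaves); i += 2 {
            if i+1 == len(leaves) {
                next = append(next, leaves[i]) // odd leaf is promoted as-is
                continue
            }
            pair := append(append([]byte(nil), leaves[i]...), leaves[i+1]...)
            h := sha256.Sum256(pair)
            next = append(next, h[:])
        }
        leaves = next
    }
    if len(leaves) == 0 { // empty file
        h := sha256.Sum256(nil)
        return h[:], nil
    }
    return leaves[0], nil
}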
The problem with NaCl is that you need to put the whole message into RAM, as per the godoc:
Messages should be small because: 1. The whole message needs to be held in memory to be processed. 2. Using large messages pressures implementations on small machines to process plaintext without verifying the signature. This is very dangerous, and this API discourages it, but a protocol that uses excessive message sizes might present some implementations with no other choice. 3. Performance may be improved by working with messages that fit into data caches. Thus large amounts of data should be chunked so that each message is small.
However, there are various other methods, and most of them boil down to what you described as your first option: you copy the file contents into an io.Writer that consumes them and calculates the hash sum - this is the most efficient approach.
The code below is pretty hacked together, but you should get the picture.
I achieved an average throughput of 315 MB/s with it.
package main

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/sha256"
    "flag"
    "fmt"
    "io"
    "log"
    "math/big"
    "os"
    "time"
)

var filename = flag.String("file", "", "file to sign")

func main() {
    flag.Parse()
    if *filename == "" {
        log.Fatal("file can not be empty")
    }
    f, err := os.Open(*filename)
    if err != nil {
        log.Fatalf("Error opening '%s': %s", *filename, err)
    }
    defer f.Close()

    start := time.Now()
    sum, n, err := hash(f)
    if err != nil {
        log.Fatalf("Error hashing '%s': %s", *filename, err)
    }
    duration := time.Since(start)
    log.Printf("Hashed %s (%d bytes) in %s to %x", *filename, n, duration, sum)
    log.Printf("Average: %.2f MB/s", (float64(n)/1000000)/duration.Seconds())

    r, s, err := sign(sum)
    if err != nil {
        log.Fatalf("Error creating signature: %s", err)
    }
    log.Printf("Signature: (0x%x,0x%x)\n", r, s)
}

// sign creates a throwaway ECDSA key and signs the given hash sum.
func sign(sum []byte) (*big.Int, *big.Int, error) {
    priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        return nil, nil, fmt.Errorf("error creating private key: %s", err)
    }
    return ecdsa.Sign(rand.Reader, priv, sum)
}

// hash calculates the SHA-256 sum of the file's contents and
// returns it together with the number of bytes read.
func hash(f *os.File) ([]byte, int64, error) {
    h := sha256.New()
    // This is where the magic happens.
    // We use the efficient io.Copy to feed the contents
    // of the file into the hash function.
    n, err := io.Copy(h, f)
    if err != nil {
        return nil, n, fmt.Errorf("error creating hash: %s", err)
    }
    return h.Sum(nil), n, nil
}

Golang dynamic sizing slice when reading a file using bufio.Read

I have a problem where I need to use bufio.Read to read a TSV file line by line, and I need to record how many bytes each line I've read is.
The problem is, it seems like I can't just initialize an empty slice and pass it into bufio.Read and expect the slice to contain the entire line of the file.
file, _ := os.Open("file.tsv")
reader := bufio.NewReader(file)
b := make([]byte, 10)
for {
    bytesRead, err := reader.Read(b)
    fmt.Println(bytesRead, b)
    if err != nil {
        break
    }
}
So, for this example, since I specified the slice to be 10 bytes, the reader will read at most 10 bytes even if the line is bigger than 10 bytes.
However:
file, _ := os.Open("file.tsv")
reader := bufio.NewReader(file)
b := []byte{} // or: var b []byte
for {
    bytesRead, err := reader.Read(b)
    fmt.Println(bytesRead, b)
    if err != nil {
        break
    }
}
This will always read 0 bytes, and I assume it's because the buffer has length 0 (or capacity 0).
How do I read a file line by line, save the entire line in a variable or buffer, and get back exactly how many bytes I've read?
Thanks!
If you want to read line by line, and you're using a buffered reader, use the buffered reader's ReadBytes method.
line, err := reader.ReadBytes('\n')
This will give you a full line, one line at a time, regardless of byte length.
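Put together with the question's setup, that could look like this (a sketch; len(line) is the byte count for the line, including the trailing '\n'):
file, err := os.Open("file.tsv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

reader := bufio.NewReader(file)
for {
    line, err := reader.ReadBytes('\n')
    if len(line) > 0 {
        // len(line) is exactly how many bytes this line occupies in the file.
        fmt.Println(len(line), string(line))
    }
    if err != nil {
        break // io.EOF once the file is exhausted
    }
}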

Go. Writing []byte to file results in zero byte file

I'm trying to serialize structured data to a file. I looked through some examples and came up with this construction:
func (order Order) Serialize(folder string) {
    b := bytes.Buffer{}
    e := gob.NewEncoder(&b)
    err := e.Encode(order)
    if err != nil {
        panic(err)
    }

    os.MkdirAll(folder, 0777)
    file, err := os.Create(folder + order.Id)
    if err != nil {
        panic(err)
    }
    defer file.Close()

    writer := bufio.NewWriter(file)
    n, err := writer.Write(b.Bytes())
    fmt.Println(n)
    if err != nil {
        panic(err)
    }
}
Serialize is a method that serializes its object to a file named after its Id property. I looked at it in the debugger - the byte buffer contains data before writing, i.e. the object is fully initialized. Even the n variable, which holds the number of bytes written, is more than a thousand, so the file shouldn't be empty at all. Yet the file is created but is totally empty. What's wrong?
bufio.Writer (as the package name hints) uses a buffer to cache writes. If you ever use it, you must call Writer.Flush() when you're done writing to it to ensure the buffered data gets written to the underlying io.Writer.
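Applied to the question's code, the missing piece is a Flush() after the Write():
writer := bufio.NewWriter(file)
n, err := writer.Write(b.Bytes())
fmt.Println(n)
if err != nil {
    panic(err)
}
// Without this, the buffered bytes may never reach the file.
if err := writer.Flush(); err != nil {
    panic(err)
}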
Also note that you can directly write to an os.File, no need to create a buffered writer "around" it. (*os.File implements io.Writer).
Also note that you can create the gob.Encoder directly on the os.File, so even the bytes.Buffer is unnecessary.
Also os.MkdirAll() may fail, check its return value.
Also it's better to "concatenate" parts of a file path using filepath.Join() which takes care of extra / missing slashes at the end of folder names.
And last, it would be better to signal the failure of Serialize(), e.g. with an error return value, so the caller has a chance to check whether the operation succeeded and act accordingly.
So Order.Serialize() should look like this:
func (order Order) Serialize(folder string) error {
    if err := os.MkdirAll(folder, 0777); err != nil {
        return err
    }
    file, err := os.Create(filepath.Join(folder, order.Id))
    if err != nil {
        return err
    }
    defer file.Close()
    if err := gob.NewEncoder(file).Encode(order); err != nil {
        return err
    }
    return nil
}

How to return hash and bytes in one step in Go?

I'm trying to understand how I can read the content of a file, calculate its hash, and return its bytes in one go. So far, I'm doing this in two steps, e.g.
// calculate file checksum
hasher := sha256.New()
f, err := os.Open(fname)
if err != nil {
    msg := fmt.Sprintf("Unable to open file %s, %v", fname, err)
    panic(msg)
}
defer f.Close()

_, err = io.Copy(hasher, f)
if err != nil {
    panic(err)
}
cksum := hex.EncodeToString(hasher.Sum(nil))

// read again (!!!) to get the data as a byte slice
data, err := ioutil.ReadFile(fname)
Obviously this is not the most efficient way to do it, since the read happens twice: once in io.Copy to feed the hasher, and again in ioutil.ReadFile to read the file and return its bytes. I'm struggling to understand how I can combine these steps and do it in one pass: read the data once, calculate any hash, and return it along with the bytes to another layer.
If you want to read a file, without creating a copy of the entire file in memory, and at the same time calculate its hash, you can do so with a TeeReader:
hasher := sha256.New()
f, err := os.Open(fname)
if err != nil {
    panic(err)
}
defer f.Close()
data := io.TeeReader(f, hasher)
// Now read from data as usual, which is still a stream.
What happens here is that any bytes read from data (which is a Reader, just like the file object f) are pushed to hasher as well.
Note, however, that hasher will produce the correct hash only once you have read the entire file through data, and not until then. So if you need the hash before you decide whether or not you want to read the file, you are left with the options of either doing it in two passes (for example like you are now), or always reading the file but discarding the result if the hash check fails.
If you do read the file in two passes, you could of course buffer the entire file data in a byte buffer in memory. However, the operating system will typically cache the file you just read in RAM anyway (if possible), so the performance benefit of doing a buffered two-pass solution yourself rather than just doing two passes over the file is probably negligible.
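For example, a sketch of a small helper that returns both the contents and the checksum after a single pass (readAndHash is just an illustrative name; io.ReadAll requires Go 1.16+):
// readAndHash reads the whole file once, returning its contents
// and its SHA-256 checksum as a hex string.
func readAndHash(fname string) ([]byte, string, error) {
    f, err := os.Open(fname)
    if err != nil {
        return nil, "", err
    }
    defer f.Close()

    hasher := sha256.New()
    // Every byte read through tee is also written into hasher.
    tee := io.TeeReader(f, hasher)
    data, err := io.ReadAll(tee)
    if err != nil {
        return nil, "", err
    }
    return data, hex.EncodeToString(hasher.Sum(nil)), nil
}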
You can write bytes directly to the hasher. For example:
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "io/ioutil"
)

func main() {
    hasher := sha256.New()
    data, err := ioutil.ReadFile("foo.txt")
    if err != nil {
        panic(err)
    }
    hasher.Write(data)
    cksum := hex.EncodeToString(hasher.Sum(nil))
    println(cksum)
}
This works because the Hash interface embeds io.Writer, which allows you to read the bytes from the file once, write them into the hasher, and then also return them.
Do data, err := ioutil.ReadFile(fname) first. You'll have your slice of bytes. Then create your hasher, and do hasher.Write(data).
If you plan on hashing files, you shouldn't read the whole file into memory, because... there are large files that don't fit into RAM. Yes, in practice you will very rarely run into such out-of-memory issues, but you can easily prevent them. The Hash interface is an io.Writer, and the hash packages usually have a New function that returns a Hash. This allows you to read the file in blocks and continuously feed them to the Write method of the Hash you have. You may also use helpers like io.Copy to do this:
h := sha256.New()
data := &bytes.Buffer{}
data.Write([]byte("hi there"))
data.Write([]byte("folks"))
io.Copy(h, data)
fmt.Printf("%x", h.Sum(nil))
io.Copy uses a buffer of 32KiB internally, so using it requires around 32KiB of memory at most.
