I am trying to write a large amount of data to a file, but it takes quite some time. I have tried two solutions, but they both take the same amount of time. Here are the solutions I have tried:
Solution A:
f, err := os.Create("file.txt")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
w := bufio.NewWriter(f)
for _, d := range data {
    _, err := w.WriteString(fmt.Sprint(d + "\n"))
    if err != nil {
        fmt.Println(err)
    }
}
err = w.Flush()
if err != nil {
    log.Fatal(err)
}
Solution B:
e, err := os.OpenFile(filePath, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0666)
if err != nil {
    panic(err)
}
defer e.Close()
for _, d := range data {
    _, err = e.WriteString(d)
    err = e.Sync()
    if err != nil {
        return err
    }
}
Any other suggestions on how I can make this write operation faster?
I think bufio is your friend, as it can help reduce the number of system calls required to write the data to disk. You are already using it in solution A; however, note that the default buffer size is 4 KB. If you want to try larger buffer sizes, you can use NewWriterSize() to set a larger buffer for the writer.
See https://pkg.go.dev/bufio#NewWriterSize
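For example, a minimal sketch of solution A using a larger buffer (the 64 KB size here is just an illustration, not a recommendation):

f, err := os.Create("file.txt")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
// Buffer 64 KB instead of the default 4 KB before each write syscall.
w := bufio.NewWriterSize(f, 64*1024)
for _, d := range data {
    if _, err := w.WriteString(d + "\n"); err != nil {
        log.Fatal(err)
    }
}
if err := w.Flush(); err != nil {
    log.Fatal(err)
}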
Based on your solution A, I have created a benchmark test you can use to experiment with different buffer sizes. For the test I am using a data set of 100k records of 600 bytes each, written to the file. The results I get on my machine for 10 repeated calls of the function under test with various buffer sizes are as follows:
BenchmarkWriteTest/Default_Buffer_Size
BenchmarkWriteTest/Default_Buffer_Size-10 15 73800317 ns/op
BenchmarkWriteTest/Buffer_Size_16K
BenchmarkWriteTest/Buffer_Size_16K-10 21 55606873 ns/op
BenchmarkWriteTest/Buffer_Size_64K
BenchmarkWriteTest/Buffer_Size_64K-10 25 49562057 ns/op
As you can see, the number of iterations in the test interval (the first number) increases significantly with a larger buffer size, and accordingly the time spent per operation drops.
https://gist.github.com/mwittig/f1e6a81c2378906292e2e4961f422870
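The gist has the full code; a condensed sketch along the same lines, placed in a _test.go file, could look roughly like this (the record generation and temp-file handling here are illustrative, not the gist's exact code):

package main

import (
    "bufio"
    "os"
    "path/filepath"
    "strings"
    "testing"
)

func benchmarkWrite(b *testing.B, bufSize int) {
    // 100k records of 600 bytes each, mirroring the data set described above.
    data := make([]string, 100000)
    for i := range data {
        data[i] = strings.Repeat("x", 600)
    }
    fileName := filepath.Join(b.TempDir(), "bench.txt")
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        f, err := os.Create(fileName)
        if err != nil {
            b.Fatal(err)
        }
        w := bufio.NewWriterSize(f, bufSize)
        for _, d := range data {
            if _, err := w.WriteString(d + "\n"); err != nil {
                b.Fatal(err)
            }
        }
        if err := w.Flush(); err != nil {
            b.Fatal(err)
        }
        f.Close()
    }
}

func BenchmarkWriteTest(b *testing.B) {
    b.Run("Default_Buffer_Size", func(b *testing.B) { benchmarkWrite(b, 4*1024) })
    b.Run("Buffer_Size_16K", func(b *testing.B) { benchmarkWrite(b, 16*1024) })
    b.Run("Buffer_Size_64K", func(b *testing.B) { benchmarkWrite(b, 64*1024) })
}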
Combine all your data into a single string, and write that in one operation. This avoids the overhead of making a filesystem call per record.
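For example, a rough sketch assuming data is a slice of strings (it needs the strings, os, and log imports; the important part is the single write at the end):

var sb strings.Builder
for _, d := range data {
    sb.WriteString(d)
    sb.WriteByte('\n')
}
// One filesystem write for the whole payload.
if err := os.WriteFile("file.txt", []byte(sb.String()), 0666); err != nil {
    log.Fatal(err)
}

Note this keeps the whole payload in memory, so it only makes sense if the data comfortably fits in RAM.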
Related
I'm generating random files programmatically in a directory, at least temporaryFilesTotalSize worth of random data (a bit more, who cares).
Here's my code:
var files []string
for size := int64(0); size < temporaryFilesTotalSize; {
    fileName := random.HexString(12)
    filePath := dir + "/" + fileName
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }
    size += rand.Int63n(1 << 32) // random dimension up to 4GB
    raw := make([]byte, size)
    _, err = rand.Read(raw)
    if err != nil {
        panic(err)
    }
    file.Write(raw)
    file.Close()
    files = append(files, filePath)
}
Is there any way I can avoid that raw := make([]byte, size) allocation in the for loop?
Ideally I'd like to keep a slice on the heap and only grow if a bigger size is required. Any way to do this efficiently?
First of all, you should know that generating random data and writing it to disk is at least an order of magnitude slower than allocating a contiguous chunk of memory for the buffer. This definitely falls under the "premature optimization" category. Eliminating the creation of the buffer inside the iteration will not make your code noticeably faster.
Reusing the buffer
But to reuse the buffer, move it outside of the loop, create the biggest needed buffer, and slice it in each iteration to the needed size. It's OK to do this, because we'll overwrite the whole part we need with random data.
Note that I changed the size generation somewhat (there is likely an error in your code: the generated temporary files keep getting bigger, since you use the accumulated size for new ones).
Also note that writing a file with contents prepared in a []byte is easiest done using a single call to os.WriteFile().
Something like this:
bigRaw := make([]byte, 1 << 32)
for totalSize := int64(0); ; {
    size := rand.Int63n(1 << 32) // random dimension up to 4GB
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }
    raw := bigRaw[:size]
    rand.Read(raw) // It's documented that rand.Read() always returns nil error
    filePath := filepath.Join(dir, random.HexString(12))
    if err := os.WriteFile(filePath, raw, 0666); err != nil {
        panic(err)
    }
    files = append(files, filePath)
}
Solving the task without an intermediate buffer
Since you are writing big files (GBs), allocating that big buffer is not a good idea: running the app will require GBs of RAM! We could improve it with an inner loop to use smaller buffers until we write the expected size, which solves the big memory issue, but increases complexity. Luckily for us, we can solve the task without any buffers, and even with decreased complexity!
We should somehow "channel" the random data from a rand.Rand to the file directly, something similar to what io.Copy() does. Note that rand.Rand implements io.Reader, and os.File implements io.ReaderFrom, which suggests we could simply pass a rand.Rand to file.ReadFrom(), and the file itself would get the data to be written directly from the rand.Rand.
This sounds good, but ReadFrom() reads data from the given reader until EOF or an error occurs. Neither will ever happen if we pass rand.Rand directly. And we do know how many bytes we want read and written: size.
To our "rescue" comes io.LimitReader(): we pass an io.Reader and a size to it, and the returned reader will supply no more than the given number of bytes, and after that will report EOF.
Note that creating our own rand.Rand will also be faster: the source we pass to it is created using rand.NewSource(), which returns an "unsynchronized" source (not safe for concurrent use) that is in turn faster. The source used by the default/global rand.Rand is synchronized (and so safe for concurrent use, but slower).
Perfect! Let's see this in action:
r := rand.New(rand.NewSource(time.Now().Unix()))
for totalSize := int64(0); ; {
    size := r.Int63n(1 << 32)
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }
    filePath := filepath.Join(dir, random.HexString(12))
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }
    if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
        panic(err)
    }
    if err = file.Close(); err != nil {
        panic(err)
    }
    files = append(files, filePath)
}
Note that if os.File did not implement io.ReaderFrom, we could still use io.Copy(), providing the file as the destination and the limited reader (used above) as the source.
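A sketch of that alternative, keeping the rest of the loop above unchanged:

if _, err := io.Copy(file, io.LimitReader(r, size)); err != nil {
    panic(err)
}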
Final note: closing the file (or any resource) is best done using defer, so it'll get called no matter what. Using defer in a loop is a bit tricky though, as deferred functions run at the end of the enclosing function, and not at the end of the loop's iteration. So you may wrap it in a function. For details, see `defer` in the loop - what will be better?
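For example, the per-file work from the loop above could be moved into a small helper so the defer runs once per file (writeRandomFile is a hypothetical name, just for illustration):

writeRandomFile := func(filePath string, size int64) error {
    file, err := os.Create(filePath)
    if err != nil {
        return err
    }
    // Runs when this helper returns, i.e. at the end of each iteration.
    defer file.Close()
    _, err = file.ReadFrom(io.LimitReader(r, size))
    return err
}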
I want to delete the last N bytes from a file in Go.
Actually, this is already implemented in the os.Truncate() function. But this function takes the new size, so to use it you first have to get the current size of the file. For that, you may use os.Stat().
Wrapping it into a function:
func truncateFile(name string, bytesToRemove int64) error {
    fi, err := os.Stat(name)
    if err != nil {
        return err
    }
    return os.Truncate(name, fi.Size()-bytesToRemove)
}
Using it to remove the last 5000 bytes:
if err := truncateFile("C:\\Test.zip", 5000); err != nil {
    fmt.Println("Error:", err)
}
Another alternative is to use the File.Truncate() method for that. If we have an os.File, we may also use File.Stat() to get its size.
This is what it would look like:
func truncateFile(name string, bytesToRemove int64) error {
    f, err := os.OpenFile(name, os.O_RDWR, 0644)
    if err != nil {
        return err
    }
    defer f.Close()
    fi, err := f.Stat()
    if err != nil {
        return err
    }
    return f.Truncate(fi.Size() - bytesToRemove)
}
Using it is the same. This may be preferable if we're already working with the file (we have it open) and we need to truncate it. But in that case you'd want to pass the *os.File instead of its name to truncateFile().
Note: if you try to remove more bytes than the file currently has, truncateFile() will return an error.
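For completeness, the variant that takes an already opened file could look like this (truncateOpenFile is just an illustrative name):

func truncateOpenFile(f *os.File, bytesToRemove int64) error {
    fi, err := f.Stat()
    if err != nil {
        return err
    }
    return f.Truncate(fi.Size() - bytesToRemove)
}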
I'm having trouble overwriting a file's content with zeros. The problem is that the very last byte of the original file remains, even when I exceed its size by 100 bytes. Does anyone have an idea what I'm missing?
func (h PostKey) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    f, err := os.Create("received.dat")
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }
    defer f.Close()
    _, err = io.Copy(f, r.Body)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }
    // Retrieve filesize
    size, _ := f.Seek(0, 1)
    zeroFilled := make([]byte, size+100)
    n, err := f.WriteAt(zeroFilled, 0)
    if err != nil {
        return
    }
    fmt.Printf("Size: %d\n", size)       // prints 13
    fmt.Printf("Bytes written: %d\n", n) // prints 113
}
The problem may occur because the data is written to the same file (a shared resource) inside an HTTP handler, and the handler itself may be executed concurrently. You need to lock access to the file during the data serialization (overwriting) process. A quick solution would be:
import (
    "sync"
    //... other packages
)

var muFile sync.Mutex

func (h PostKey) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    muFile.Lock()
    defer muFile.Unlock()
    f, err := os.Create("received.dat")
    //other statements
    //...
}
If your server load is low, the above solution will be fine. But if your server needs to handle a lot of requests concurrently, you need to use a different approach (although the rule is the same: lock access to any shared resource).
I was writing to the file and trying to overwrite it in the same context, so parts of the first write operation were still in memory and not yet written to disk. By using f.Sync() to flush everything after copying the body's content, I was able to fix the issue.
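Roughly, the fix amounts to this in the handler shown above (only the relevant part is sketched here):

_, err = io.Copy(f, r.Body)
if err != nil {
    w.WriteHeader(http.StatusInternalServerError)
    return
}
// Flush the copied data to the file before computing the size and overwriting it.
if err = f.Sync(); err != nil {
    w.WriteHeader(http.StatusInternalServerError)
    return
}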
I'm trying to serialize structured data to a file. I looked through some examples and came up with this construction:
func (order Order) Serialize(folder string) {
    b := bytes.Buffer{}
    e := gob.NewEncoder(&b)
    err := e.Encode(order)
    if err != nil {
        panic(err)
    }
    os.MkdirAll(folder, 0777)
    file, err := os.Create(folder + order.Id)
    if err != nil {
        panic(err)
    }
    defer file.Close()
    writer := bufio.NewWriter(file)
    n, err := writer.Write(b.Bytes())
    fmt.Println(n)
    if err != nil {
        panic(err)
    }
}
Serialize is a method that serializes its object to a file named by its Id property. I looked through the debugger: the byte buffer contains data before writing, meaning the object is fully initialized. Even the n variable representing the number of bytes written is more than a thousand, so the file shouldn't be empty at all. Yet the file is created but is totally empty. What's wrong?
bufio.Writer (as the package name hints) uses a buffer to cache writes. If you ever use it, you must call Writer.Flush() when you're done writing to it to ensure the buffered data gets written to the underlying io.Writer.
Also note that you can directly write to an os.File, no need to create a buffered writer "around" it. (*os.File implements io.Writer).
Also note that you can create the gob.Encoder directed straight at the os.File, so even the bytes.Buffer is unnecessary.
Also, os.MkdirAll() may fail; check its return value.
Also it's better to "concatenate" parts of a file path using filepath.Join() which takes care of extra / missing slashes at the end of folder names.
And last, it would be better to signal a failure of Serialize(), e.g. with an error return value, so the calling party has a chance to examine whether the operation succeeded and act accordingly.
So Order.Serialize() should look like this:
func (order Order) Serialize(folder string) error {
    if err := os.MkdirAll(folder, 0777); err != nil {
        return err
    }
    file, err := os.Create(filepath.Join(folder, order.Id))
    if err != nil {
        return err
    }
    defer file.Close()
    if err := gob.NewEncoder(file).Encode(order); err != nil {
        return err
    }
    return nil
}
Go's io.Reader documentation states that a Read() may return a non-zero n value and io.EOF at the same time. Unfortunately, the Read() method of a File doesn't do that.
When EOF is reached and some bytes can still be read, the Read method of a file returns a non-zero n and a nil error. It is only when we try to read while already at the end of the file that we get back a zero n and io.EOF as the error.
I couldn't find a simple way to test whether EOF has been reached without trying to read data from the file. If we perform a Read() with a buffer of 0 bytes, we get back a zero n and a nil error even though we are at the end of the file.
To avoid this last read, the only solution I have found is to keep track of the number of bytes remaining to be read in the file myself. Is there a simpler solution?
You could create a new type that keeps track of the number of bytes read so far. Then, at EOF-check time, you could compare the expected number of bytes with the actual number of bytes read. Here is a sample implementation: the eofReader keeps track of the number of bytes read and compares it to the file size, in case the underlying type is a file:
package main

import (
    "io"
    "io/ioutil"
    "log"
    "os"
    "sync/atomic"
)

// eofReader can be checked for EOF, without a Read.
type eofReader struct {
    r     io.Reader
    count uint64
}

// AtEOF returns true, if the number of bytes read equals the file size.
func (r *eofReader) AtEOF() (bool, error) {
    f, ok := r.r.(*os.File)
    if !ok {
        return false, nil
    }
    fi, err := f.Stat()
    if err != nil {
        return false, err
    }
    return r.Count() == uint64(fi.Size()), nil
}

// Read reads and counts.
func (r *eofReader) Read(buf []byte) (int, error) {
    n, err := r.r.Read(buf)
    atomic.AddUint64(&r.count, uint64(n))
    return n, err
}

// Count returns the count.
func (r *eofReader) Count() uint64 {
    return atomic.LoadUint64(&r.count)
}
You could use this type by wrapping any reader in an eofReader:
func main() {
    f, err := os.Open("main.go")
    if err != nil {
        log.Fatal(err)
    }
    r := &eofReader{r: f}
    log.Println(r.AtEOF())
    if _, err = ioutil.ReadAll(r); err != nil {
        log.Fatal(err)
    }
    log.Println(r.AtEOF())
}
// 2016/12/19 03:49:35 false <nil>
// 2016/12/19 03:49:35 true <nil>
Code as gist.