How to use NaCl to sign a large file?

Given the sign capability from the Go NaCl library (https://github.com/golang/crypto/tree/master/nacl/sign), how do you sign a file, especially a very large file of more than 1 GB? Most internet search results are about signing a slice or small array of bytes.
I can think of 2 ways:
Loop through the file and stream it in blocks (e.g. 16 KB at a time), feeding each block into the sign function. The signed outputs are concatenated into a signature certificate. Verification is done in reverse.
Use SHA(X) to generate the shasum of the file and then sign the shasum output.

For signing very large files (multiple gigabytes and up), the problem with a standard signing function is often runtime and fragility: for very large files (or just slow disks), serially reading the full file from start to end can take hours or more.
In such cases, you want a way to process the file in parallel. One of the common ways to do this which is suitable for cryptographic signatures is Merkle tree hashes. They allow you to split the large file into smaller chunks, hash them in parallel (producing "leaf hashes"), and then further hash those hashes in a tree structure to produce a root hash which represents the full file.
Once you have calculated this Merkle tree root hash, you can sign this root hash. It then becomes possible to use the signed Merkle tree root hash to verify all of the file chunks in parallel, as well as verifying their order (based on the positions of the leaf hashes in the tree structure).

The problem with NaCl is that you need to put the whole message into RAM, as per godoc:
Messages should be small because:
1. The whole message needs to be held in memory to be processed.
2. Using large messages pressures implementations on small machines to process plaintext without verifying the signature. This is very dangerous, and this API discourages it, but a protocol that uses excessive message sizes might present some implementations with no other choice.
3. Performance may be improved by working with messages that fit into data caches.
Thus large amounts of data should be chunked so that each message is small.
However, there are various other methods, and most of them do what you described in your first approach: copy the file contents into an io.Writer that computes the hash as the data flows through. This is the most efficient way.
The code below is pretty hacked, but you should get the picture.
I achieved an average throughput of 315MB/s with it.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"flag"
	"fmt"
	"io"
	"log"
	"math/big"
	"os"
	"time"
)

var filename = flag.String("file", "", "file to sign")

func main() {
	flag.Parse()
	if *filename == "" {
		log.Fatal("file can not be empty")
	}
	f, err := os.Open(*filename)
	if err != nil {
		log.Fatalf("Error opening '%s': %s", *filename, err)
	}
	defer f.Close()

	start := time.Now()
	sum, n, err := hash(f)
	if err != nil {
		log.Fatalf("Error hashing '%s': %s", *filename, err)
	}
	duration := time.Since(start)

	log.Printf("Hashed %s (%d bytes) in %s to %x", *filename, n, duration, sum)
	log.Printf("Average: %.2f MB/s", (float64(n)/1000000)/duration.Seconds())

	r, s, err := sign(sum)
	if err != nil {
		log.Fatalf("Error creating signature: %s", err)
	}
	log.Printf("Signature: (0x%x,0x%x)\n", r, s)
}

func sign(sum []byte) (*big.Int, *big.Int, error) {
	priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, fmt.Errorf("creating private key: %s", err)
	}
	return ecdsa.Sign(rand.Reader, priv, sum)
}

func hash(f *os.File) ([]byte, int64, error) {
	h := sha256.New()
	// This is where the magic happens:
	// io.Copy efficiently feeds the file contents into the hash function.
	n, err := io.Copy(h, f)
	if err != nil {
		return nil, n, fmt.Errorf("creating hash: %s", err)
	}
	return h.Sum(nil), n, nil
}

Related

How to make write operations to a file faster

I am trying to write a large amount of data to a file, but it takes quite some time. I have tried two solutions, but they both take the same amount of time. Here are the solutions I have tried:
Solution A:
f, err := os.Create("file.txt")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

w := bufio.NewWriter(f)
for _, d := range data {
	_, err := w.WriteString(fmt.Sprint(d + "\n"))
	if err != nil {
		fmt.Println(err)
	}
}
err = w.Flush()
if err != nil {
	log.Fatal(err)
}
Solution B:
e, err := os.OpenFile(filePath, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0666)
if err != nil {
	panic(err)
}
defer e.Close()

for _, d := range data {
	_, err = e.WriteString(d)
	err = e.Sync()
	if err != nil {
		return err
	}
}
Any other suggestion on how I can make this write operation faster?
I think bufio is your friend here, as it can help reduce the number of syscalls required to write the data to disk. You are already using it as part of solution A; however, note the default buffer size is 4K. If you want to try larger buffer sizes, you can use NewWriterSize() to set a larger buffer for the writer.
See https://pkg.go.dev/bufio#NewWriterSize
Based on your solution A I have created a benchmark test you can use for experimenting with different buffer sizes. For the test I am using a data set of 100k records of 600 bytes written to the file. The results I get on my machine for 10 repeated calls of the function under test with various buffer sizes are as follows:
BenchmarkWriteTest/Default_Buffer_Size
BenchmarkWriteTest/Default_Buffer_Size-10 15 73800317 ns/op
BenchmarkWriteTest/Buffer_Size_16K
BenchmarkWriteTest/Buffer_Size_16K-10 21 55606873 ns/op
BenchmarkWriteTest/Buffer_Size_64K
BenchmarkWriteTest/Buffer_Size_64K-10 25 49562057 ns/op
As you can see the number of iterations in the test interval (first number) increases significantly with larger buffer size. Accordingly the time spent per operation drops.
https://gist.github.com/mwittig/f1e6a81c2378906292e2e4961f422870
Combine all your data into a single string, and write that in one operation. This will avoid the overhead of filesystem calls.

Expanding a temporary slice if more bytes are needed

I'm generating random files programmatically in a directory, at least temporaryFilesTotalSize worth of random data (a bit more, who cares).
Here's my code:
var files []string

for size := int64(0); size < temporaryFilesTotalSize; {
	fileName := random.HexString(12)
	filePath := dir + "/" + fileName
	file, err := os.Create(filePath)
	if err != nil {
		return nil, err
	}

	size += rand.Int63n(1 << 32) // random dimension up to 4GB
	raw := make([]byte, size)
	_, err = rand.Read(raw)
	if err != nil {
		panic(err)
	}

	file.Write(raw)
	file.Close()
	files = append(files, filePath)
}
Is there any way I can avoid that raw := make([]byte, size) allocation in the for loop?
Ideally I'd like to keep a slice on the heap and only grow if a bigger size is required. Any way to do this efficiently?
First of all you should know that generating random data and writing it to disk is at least an order of magnitude slower than allocating contiguous memory for a buffer. This definitely falls under the "premature optimization" category: eliminating the creation of the buffer inside the iteration will not make your code noticeably faster.
Reusing the buffer
But to reuse the buffer, move it outside of the loop, create the biggest needed buffer, and slice it in each iteration to the needed size. It's OK to do this, because we'll overwrite the whole part we need with random data.
Note that I somewhat changed the size generation (there is likely an error in your code: your generated temporary files keep growing, since you use the accumulated size for new ones).
Also note that writing a file with contents prepared in a []byte is easiest done using a single call to os.WriteFile().
Something like this:
bigRaw := make([]byte, 1<<32)

for totalSize := int64(0); ; {
	size := rand.Int63n(1 << 32) // random dimension up to 4GB
	totalSize += size
	if totalSize >= temporaryFilesTotalSize {
		break
	}

	raw := bigRaw[:size]
	rand.Read(raw) // It's documented that rand.Read() always returns nil error

	filePath := filepath.Join(dir, random.HexString(12))
	if err := os.WriteFile(filePath, raw, 0666); err != nil {
		panic(err)
	}
	files = append(files, filePath)
}
Solving the task without an intermediate buffer
Since you are writing big files (GBs), allocating that big buffer is not a good idea: running the app will require GBs of RAM! We could improve it with an inner loop to use smaller buffers until we write the expected size, which solves the big memory issue, but increases complexity. Luckily for us, we can solve the task without any buffers, and even with decreased complexity!
We should somehow "channel" the random data from a rand.Rand to the file directly, something similar to what io.Copy() does. Note that rand.Rand implements io.Reader, and os.File implements io.ReaderFrom, which suggests we could simply pass a rand.Rand to file.ReadFrom(), and the file itself would get the data to be written directly from the rand.Rand.
This sounds good, but the ReadFrom() reads data from the given reader until EOF or error. Neither will ever happen if we pass rand.Rand. And we do know how many bytes we want to be read and written: size.
To our "rescue" comes io.LimitReader(): we pass an io.Reader and a size to it, and the returned reader will supply no more than the given number of bytes, and after that will report EOF.
Note that creating our own rand.Rand will also be faster, as the source we pass to it will be created using rand.NewSource(), which returns an "unsynchronized" source (not safe for concurrent use) that is in turn faster. The source used by the default/global rand.Rand is synchronized (and so safe for concurrent use, but slower).
Perfect! Let's see this in action:
r := rand.New(rand.NewSource(time.Now().Unix()))

for totalSize := int64(0); ; {
	size := r.Int63n(1 << 32)
	totalSize += size
	if totalSize >= temporaryFilesTotalSize {
		break
	}

	filePath := filepath.Join(dir, random.HexString(12))
	file, err := os.Create(filePath)
	if err != nil {
		return nil, err
	}
	if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
		panic(err)
	}
	if err = file.Close(); err != nil {
		panic(err)
	}
	files = append(files, filePath)
}
Note that if os.File would not implement io.ReaderFrom, we could still use io.Copy(), providing the file as the destination, and a limited reader (used above) as the source.
Final note: closing the file (or any resource) is best done using defer, so it'll get called no matter what. Using defer in a loop is a bit tricky though, as deferred functions run at the end of the enclosing function, and not at the end of the loop's iteration. So you may wrap it in a function. For details, see `defer` in the loop - what will be better?

Upload file chunks into MongoDb using mgo/golang

In my case I have logic which should upload large files in chunks. For example, if I have a file whose size is 10 MB, I need to send a PUT request with a 1 MB chunk 10 times, but mgo (mgo.v2) does not allow opening a file for writing:
func UploadFileChunk(rw http.ResponseWriter, rq *http.Request) {
	fileid := mux.Vars(rq)["fileid"]
	rq.ParseMultipartForm(10000)
	formFile := rq.MultipartForm.File["file"]
	content, err := formFile[0].Open()
	defer content.Close()
	if err != nil {
		http.Error(rw, err.Error(), http.StatusInternalServerError)
		return
	}
	file, err := db.GridFS("fs").OpenId(bson.ObjectIdHex(fileid))
	if err != nil {
		http.Error(rw, err.Error(), http.StatusInternalServerError)
		return
	}
	data, err := ioutil.ReadAll(content)
	n, _ := file.Write(data)
	file.Close()
	// Write a log type message
	fmt.Printf("%d bytes written to the Mongodb instance\n", n)
}
So I want to write a new chunk each time, but 1) mgo does not allow opening a file for writing, and 2) I don't know whether this approach is a good one.

How to return hash and bytes in one step in Go?

I'm trying to understand how I can read content of the file, calculate its hash and return its bytes in one Go. So far, I'm doing this in two steps, e.g.
// calculate file checksum
hasher := sha256.New()
f, err := os.Open(fname)
if err != nil {
	msg := fmt.Sprintf("Unable to open file %s, %v", fname, err)
	panic(msg)
}
defer f.Close()

b, err := io.Copy(hasher, f)
if err != nil {
	panic(err)
}
cksum := hex.EncodeToString(hasher.Sum(nil))

// read again (!!!) to get data as bytes array
data, err := ioutil.ReadFile(fname)
Obviously this is not the most efficient way to do it, since the read happens twice: once in io.Copy to feed the hasher, and again in ioutil.ReadFile to return the bytes. I'm struggling to understand how I can combine these steps, read the data once, calculate any hash, and return it along with the bytes to another layer.
If you want to read a file, without creating a copy of the entire file in memory, and at the same time calculate its hash, you can do so with a TeeReader:
hasher := sha256.New()
f, err := os.Open(fname)
data := io.TeeReader(f, hasher)
// Now read from data as usual, which is still a stream.
What happens here is that any bytes that are read from data (which is a Reader just like the file object f is) will be pushed to hasher as well.
Note, however, that hasher will produce the correct hash only once you have read the entire file through data, and not until then. So if you need the hash before you decide whether or not you want to read the file, you are left with the options of either doing it in two passes (for example like you are now), or to always read the file but discard the result if the hash check failed.
If you do read the file in two passes, you could of course buffer the entire file data in a byte buffer in memory. However, the operating system will typically cache the file you just read in RAM anyway (if possible), so the performance benefit of doing a buffered two-pass solution yourself rather than just doing two passes over the file is probably negligible.
You can write bytes directly to the hasher. For example:
package main
import (
"crypto/sha256"
"encoding/hex"
"io/ioutil"
)
func main() {
	hasher := sha256.New()
	data, err := ioutil.ReadFile("foo.txt")
	if err != nil {
		panic(err)
	}
	hasher.Write(data)
	cksum := hex.EncodeToString(hasher.Sum(nil))
	println(cksum)
}
This works because the Hash interface embeds io.Writer, which allows you to read the bytes from the file once, write them into the hasher, and also return them.
Do data, err := ioutil.ReadFile(fname) first. You'll have your slice of bytes. Then create your hasher, and do hasher.Write(data).
If you plan on hashing files you shouldn't read the whole file into memory, because... there are large files that don't fit into RAM. Yes, in practice you will very rarely run into such out-of-memory issues, but you can easily prevent them. The Hash interface is an io.Writer. Usually, the hash packages have a New function that returns a Hash. This allows you to read the file in blocks and continuously feed them to the Write method of the Hash you have. You may also use methods like io.Copy to do this:
h := sha256.New()
data := &bytes.Buffer{}
data.Write([]byte("hi there"))
data.Write([]byte("folks"))
io.Copy(h, data)
fmt.Printf("%x", h.Sum(nil))
io.Copy uses a buffer of 32 KiB internally, so using it requires around 32 KiB of memory at most.

How to write a safe rename in Go? (Or, how to write this Python in Go?)

I've got the following code in Python:
if not os.path.exists(src): sys.exit("Does not exist: %s" % src)
if os.path.exists(dst): sys.exit("Already exists: %s" % dst)
os.rename(src, dst)
From this question, I understand that there is no direct method to test if a file exists or doesn't exist.
What is the proper way to write the above in Go, including printing out the correct error strings?
Here is the closest I've gotten:
package main
import "fmt"
import "os"
func main() {
src := "a"
dst := "b"
e := os.Rename(src, dst)
if e != nil {
fmt.Println(e.(*os.LinkError).Op)
fmt.Println(e.(*os.LinkError).Old)
fmt.Println(e.(*os.LinkError).New)
fmt.Println(e.(*os.LinkError).Err)
}
}
Given how little structured information the error exposes, where it effectively doesn't tell you what the problem is without you parsing an English free-format string, it seems to me that it is not possible to write the equivalent in Go.
The code you provide contains a race condition: between your check that dst does not exist and the rename into dst, a third party could have created the file dst, causing you to overwrite it. Either remove the os.path.exists(dst) check (because it cannot reliably detect whether the target exists at the time of the rename), or employ the following algorithm instead:
Create a hardlink from src to dst. If a file named dst exists, the operation will fail and you can bail out. If src does not exist, the operation will fail, too.
Remove src.
The following code implements the two-step algorithm outlined above in Go.
import "os"

func renameAndCheck(src, dst string) error {
	err := os.Link(src, dst)
	if err != nil {
		return err
	}
	return os.Remove(src)
}
You can check for which reason the call to os.Link() failed:
If the error satisfies os.IsNotExist(), the call failed because src did not exist at the time os.Link() was called
If the error satisfies os.IsExist(), the call failed because dst exists at the time os.Link() is called
If the error satisfies os.IsPermission(), the call failed because you don't have sufficient permissions to create a hard link
As far as I know, other reasons (like the file system not supporting the creation of hard links or src and dst being on different file systems) cannot be tested portably.
The translation of your Python code to Go is:
if _, err := os.Stat(src); err != nil {
	// The source does not exist or some other error accessing the source
	log.Fatal("source:", err)
}
if _, err := os.Stat(dst); !os.IsNotExist(err) {
	// The destination exists or some other error accessing the destination
	log.Fatal("dest:", err)
}
if err := os.Rename(src, dst); err != nil {
	log.Fatal(err)
}
The three function call sequence is not safe (I am referring to both the original Python version and my replication of it here). The source can be removed or the destination can be created after the checks, but before the rename.
The safe way to move a file is OS dependent. On Windows, you can just call os.Rename(): there, the function will fail if the destination exists or the source does not. On POSIX systems, you should link and remove as described in another answer.
