Memory efficient computation of md5sum of a file in vlang

The following code reads a file into a byte array and computes the md5sum of that array. It works, but I would like to find a solution in V that needs less RAM.
Thanks for your comments!
import os
import crypto.md5
b := os.read_bytes("file.txt") or {panic(err)}
s := md5.sum(b).hex()
println(s)
I also tried the following, without success:
import os
import crypto.md5
import io
mut f := os.open_file("file.txt", "r")?
mut h := md5.new()
io.cp(mut f, mut h)?
s := h.sum().hex()
println(s) // does not return the correct md5sum

Alrighty. This is what you're looking for. It produces the same result as md5sum and is only slightly slower. block_size trades memory for speed: decreasing it lowers the memory footprint but makes the checksum take longer to compute; increasing it has the opposite effect. I tested on a 2 GB Manjaro disc image and can confirm the memory usage is very low.
Note: this performs noticeably slower without the -prod flag (compile with v -prod), since the V compiler applies extra optimizations for the production build.
import crypto.md5
import io
import os
fn main() {
    println(hash_file('manjaro.img')?)
}

const block_size = 64 * 65535

fn hash_file(path string) ?string {
    mut file := os.open(path)?
    defer {
        file.close()
    }
    mut buf := []u8{len: block_size}
    mut r := io.new_buffered_reader(reader: file)
    mut digest := md5.new()
    for {
        x := r.read(mut buf) or { break }
        digest.write(buf[..x])?
    }
    return digest.checksum().hex()
}

To conclude what I've learned from the comments:
V is a programming language with typed arguments
md5.sum takes a byte array argument, and not something that is a sequence of bytes, e.g. read from a file as-you-go.
There's no alternative to md5.sum
So, you will have to implement MD5 yourself. Maybe the standard library is open source and you can build upon that! Or, you can just bind any of the existing (e.g. C) implementations of MD5 and feed in bytes as you read them, in chunks of 512 bits = 64 bytes.
EDIT: I don't know V, so it's hard for me to judge, but it looks like Digest.write is a method for consecutively pushing data through the MD5 calculation. Maybe that, together with a loop reading bytes from the file, is the solution?

It's like OpenCL kernel instance ends abruptly

I'm new to OpenCL and I'm working on converting an existing algorithm to OpenCL.
In this process, I am experiencing a phenomenon that I cannot solve on my own, and I would like to ask for some help.
Here are the details.
My kernel is applied to images of different size (to be precise, each layer of the Laplacian pyramid).
I get normal results for images of larger size such as 3072 x 3072, 1536 x 1536.
But I get abnormal results for smaller images such as 12 x 12, 6 x 6, 3 x 3, 2 x 2.
At first, I suspected that clEnqueueNDRangeKernel had a lower limit on dimensions, causing this problem. So, I added printf to the beginning of the kernel as follows. It confirmed that all necessary kernel instances are executed.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
}
So after wandering for a while, I added the same printf to the end of the kernel. When I did this, it was confirmed that printf works only for some pixel positions. For pixel positions not output by printf, the calculated values in the resulting image are incorrect, and as a result, I concluded that some kernel instances terminate abnormally before completing the calculations.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
    printf("(%d, %d, %f)\n", xB, yB, result_for_this_position);
}
It seems that there is no problem with the kernel's calculation itself. If I compile the kernel with optimization turned off via the -cl-opt-disable option, I get perfectly correct results for all images regardless of their size. In addition, with an NVIDIA P4000 it works correctly. Of course, in these cases, I confirmed that the printf added at the bottom of the kernel works for all pixels.
Below I put additional information and attach a part of the code I wrote.
Any advice is welcomed and appreciated.
Thank you.
SDK: Intel® SDK For OpenCL™ Applications 2020.3.494
Platform: Intel(R) OpenCL HD Graphics
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2,
                                 NULL, globalSize, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
        return -1;
    // I tried with this but it didn't make any difference
    //std::this_thread::sleep_for(std::chrono::seconds(1));
    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;
    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
                              0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
                                  vtMatB_GPU_LLP[nLayerIndex].rows,
                              vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
I also tried with an event, but it behaves the same way.
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
    cl_event event;
    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
    if (CL_SUCCESS != err)
        return -1;
    err = clWaitForEvents(1, &event);
    if (CL_SUCCESS != err)
        return -1;
    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;
    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
                              0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
                                  vtMatB_GPU_LLP[nLayerIndex].rows,
                              vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
/////// Added contents ////////////////////////////////////////////
Would you please take a look at this issue from the perspective of clFinish or clWaitForEvents? Am I missing something in this regard?
Sometimes I get fewer correct values and sometimes more.
To be more specific, let's say I'm applying the kernel to a 12 x 12 image, so there are 144 pixel values.
Sometimes I get correct values for 56 pixels.
Sometimes I get correct values for 89 pixels.
Some other time I get correct values for n (fewer than 144) pixels.
If I turn off the OpenCL optimization when compiling the kernel by specifying the -cl-opt-disable option, I get correct values for all 144 pixels.
The other thing that makes me think the calculation code is correct is that the same OpenCL code, with no modification other than the device-selection code, runs perfectly correctly on an NVIDIA P4000.
At first, I was really suspicious of the calculation code, but the more I inspect it, the more confident I am that there is nothing wrong with it.
I know there's still a chance that there is an error in the calculation code that causes exceptions somewhere during the calculations.
I have plain C++ code for the same task, and I'm comparing the results from the two.
/////// Another added contents ////////////////////////////////////////////
I made a minimal code sample (apart from the project template) to reproduce the phenomenon.
What's even odder is that if I install "Intel® Distribution for GDB Target", I get correct results.
https://github.com/heysweetethan/GPUOpenCLProjectforWindows
OpenCL kernels run threads in parallel on a specified global range, which in your case is the image size, with one thread per pixel.
The threads are grouped into workgroups. The workgroup size should be a multiple of 32, ideally 64 (8x8 in 2D), to make full use of the hardware. These workgroups cannot be split, so the global range must be a multiple of the workgroup size.
What happens if the global range is not evenly divisible by the workgroup size, or is smaller than the workgroup size, like 3x3 pixels? Then the last workgroup is still executed with all 8x8 threads. The first 3x3 threads work on valid data in memory, but all the other threads read/write unallocated memory. This can cause undefined behavior or even crashes.
If you cannot have global size as a multiple of workgroup size, there is still a solution: a guard clause in the very beginning of the kernel:
if(xB>=xImage||yB>=yImage) return;
This ensures that no threads access unallocated memory.
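In context, a padded kernel could look roughly like this (a sketch only: src, dst, xImage and yImage are assumed names, not the question's actual parameters, and the host would round the global range up to a multiple of the workgroup size):
__kernel void GetValueOfB(__global const float* src, __global float* dst,
                          const uint xImage, const uint yImage)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    if (xB >= xImage || yB >= yImage)
        return; // padded threads beyond the image do no work
    // calculation for pixel (xB, yB); a placeholder copy is shown here
    dst[yB * xImage + xB] = src[yB * xImage + xB];
}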
As you don't supply a complete reproducible code sample, here's a loose collection of comments/suggestions/advice:
1. printf in kernel code
Don't rely on large amounts of printf output from kernels. It's necessarily buffered, and some implementations don't guarantee delivery of messages - often there's a fixed size buffer and when that's full, messages are dropped.
Note that your post-calculation printf increases the total amount of output, for example.
The reliable way to check or print kernel output is to write it to a global buffer and print it in host code. For example, if you want to verify each work-item reaches a specific point in the code, consider creating a zero-initialised global buffer where you can set a flag in each work-item.
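For instance, a sketch of that idea (the reached buffer, its indexing and the width parameter are assumptions, not part of the original kernel):
__kernel void GetValueOfB(__global uint* reached, const uint width /* plus the existing parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    // ... calculation ...
    reached[yB * width + xB] = 1; // set only if this work-item finished the calculation
}
// On the host: create 'reached' zero-initialised with clCreateBuffer, read it back
// with clEnqueueReadBuffer after clFinish, and count the entries that are still 0.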
2. Events
As you asked about events and flushing: your clFinish call certainly should suffice to ensure everything has executed - if anything, it's overkill, but especially while you're debugging other issues it's a good way to rule out queuing issues.
The clWaitForEvents() call preceding it is not a great idea, as you haven't called clFlush() after queueing the kernel whose event you're waiting for. It's fairly minor, but could be a problem on some implementations.
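A sketch of what that would look like (queue stands in for _pOpenCLManager->GetCommandQueue()):
cl_event event;
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
if (CL_SUCCESS != err)
    return -1;
err = clFlush(queue); // make sure the command is actually submitted to the device
if (CL_SUCCESS != err)
    return -1;
err = clWaitForEvents(1, &event); // now the wait cannot stall on an unsubmitted command
if (CL_SUCCESS != err)
    return -1;
clReleaseEvent(event);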
3. Small image sizes
You've not actually posted any of the code that deals with the images themselves, so I can only guess at potential issues there. It looks like you're not using workgroups, so you shouldn't be running into the usual multiple-of-group-size pitfall.
However, are you sure you're loading the source data correctly, and you're correctly indexing into it? There could be all sorts of pitfalls here, from alignment of pixel rows in the source data, enqueueing the kernel before filling the source buffers has completed, creating source buffers with the wrong flags, etc.
So in summary, I'd suggest:
Don't believe in-kernel-printf if something strange is going on. Switch to something more reliable for observing the behaviour of your kernel code.
At minimum, post all your OpenCL API calling host code. Buffer creation, setting arguments, etc. Any fragments of kernel code accessing the buffers are probably not a bad idea either.
Thanks to a person from the Intel community, I was able to understand the phenomenon.
Briefly, if a single kernel instance takes too much time, 'Timeout Detection and Recovery' (TDR) stops it.
For more information about this, you can refer to the following:
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/
https://community.intel.com/t5/GPU-Compute-Software/It-s-like-OpenCL-kernel-instance-ends-abruptly/m-p/1386883#M478
I appreciate all the people who gave me advice.

Dynamic range (bit depth) in PIL's fromarray() function?

I did some image-processing on multi-frame TIFF images from a 12-bit camera and would like to save the output. However, the PIL documentation does not list a 12-bit mode for fromarray(). How does PIL handle bit depth and how can I ensure that the saved TIFF images will have the same dynamic range as the original ones?
Example code:
import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Read image file names
pathname = '/home/user/images/'
filenameList = [filename for filename in os.listdir(pathname)
                if filename.endswith(('.tif', '.TIF', '.tiff', '.TIFF'))]

# Open image files, average over all frames, save averaged image files
for filename in filenameList:
    img = Image.open(pathname + filename)
    X, Y = img.size
    NFrames = img.n_frames
    imgArray = np.zeros((Y, X))
    for i in range(NFrames):
        img.seek(i)
        imgArray += np.array(img)
        i += 1
    imgArrayAverage = imgArray/NFrames
    imgAverage = Image.fromarray(imgArrayAverage)  # <=== THIS!!!
    imgAverage.save(pathname + filename.rsplit('.')[0] + '.tif')
    img.close()
In my experience, 12-bit images get opened as 16-bit images with the four most significant bits all zero. My solution has been to convert the images to numpy arrays using
arr = np.array(img).astype(np.uint16)
The astype() directive is probably not strictly necessary, but it seems like a good idea. Then, to convert to 16-bit, shift your binary digits four to the left:
arr = np.multiply(arr,2**4)
If you want to work with 8-bit instead,
arr = np.floor(np.divide(arr,2**4)).astype(np.uint8)
where here the astype() is necessary to force conversion to 8-bit integers. I think that the 8-bit truncation implicitly performs the floor() function but I left it in just in case.
Finally, convert back to PIL Image object and you're good to go:
img = Image.fromarray(arr)
For your specific use-case, this would have the same effect:
imgAverage = Image.fromarray(imgArrayAverage.astype(np.uint16) * 2**4)
The type conversion again may not be necessary, but it will probably save you time, since dividing imgArray by NFrames implicitly results in an array of floats. If you're worried about precision, it can be omitted.
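Putting that together with the original loop, a sketch of the fix (the file name is hypothetical; the scaling assumes genuine 12-bit input, i.e. values below 4096):
import numpy as np
from PIL import Image

img = Image.open('example_12bit.tif')  # hypothetical multi-frame TIFF from the camera
frames = []
for i in range(img.n_frames):
    img.seek(i)
    frames.append(np.array(img).astype(np.uint16))
img.close()

average = np.mean(frames, axis=0)            # float64 average over all frames
scaled = (average * 2**4).astype(np.uint16)  # shift 12-bit data into the 16-bit range
Image.fromarray(scaled).save('example_averaged.tif')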

Dumping an INT32 array into a .bin file

I have the array defined as below
#include <stdio.h>
/* INT32 is assumed to come from a platform header (e.g. <windows.h>);
   otherwise: #include <stdint.h> and typedef int32_t INT32; */

INT32 LUT_OffsetValues[6][12] = {
    0,180,360,540,720,900,1080,1260,1440,1620,1800,1980,
    2160,2340,2520,2700,2880,3060,3240,3420,3600,3780,3960,4140,
    4320,4500,4680,4860,5040,5220,5400,5580,5760,5940,6120,6300,
    6480,6660,6840,7020,7200,7380,7560,7740,7920,8100,8280,8460,
    8640,8820,9000,9180,9360,9540,9720,9900,10080,10260,10440,10620,
    10800,10980,11160,11340,11520,11700,11880,12060,12240,12420,12600,12780
};

int main(int argc, char *argv[])
{
    int var_row_index = 4;
    int var_column_index = 5;
    int computed_val = 0;
    FILE *fp = NULL;

    fp = fopen("./LUT_Offset.bin", "wb");
    if (NULL != fp)
    {
        fwrite(LUT_OffsetValues, sizeof(INT32), 72, fp);
        fclose(fp);
    }
    printf("Size of Array:%zu\n", sizeof(LUT_OffsetValues));
    //computed_val = LUT_OffsetValues[var_row_index][var_column_index];
    return 0;
}
Above is the code snippet with which I have generated the .bin file. Is that the right way of doing it?
No, it is not the right way if you plan to transfer the file to a different machine and read it there, because you haven't considered endianness. Say the file is:
written on a little-endian machine but read on a big-endian machine, or
written on a big-endian machine but read on a little-endian machine.
It won't work in either of those cases.
Apart from the byte-order issue pointed out by askinoor, this approach is not generic, because the reader has to know the data is an INT32[6][12] when it reads it.
Why the unused variables var_row_index etc. in your program?
As already mentioned, when serializing data in and out of the CPU it is preferable to force network byte order. This can be done easily using functions like htonl(), which should be available on most platforms (and compile down to nothing on big endian machines).
Here's the doc from Linux:
https://linux.die.net/man/3/htonl
Also, it's not good practice to explicitly code sizes and types into your program.
Use sizeof(array[0][0]) to get the size of the element type of array, then iterate over it and use htonl() to write each element to the file.
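A minimal sketch of that suggestion, assuming the array from the question (int32_t stands in for the INT32 typedef; on Windows, include <winsock2.h> instead of <arpa/inet.h>):
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h> /* htonl() */

static const int32_t LUT_OffsetValues[6][12] = {
    { 0, 180, 360 /* ... remaining values from the question ... */ }
};

int main(void)
{
    FILE *fp = fopen("./LUT_Offset.bin", "wb");
    if (fp == NULL)
        return -1;

    const size_t rows = sizeof LUT_OffsetValues / sizeof LUT_OffsetValues[0];
    const size_t cols = sizeof LUT_OffsetValues[0] / sizeof LUT_OffsetValues[0][0];
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            uint32_t be = htonl((uint32_t)LUT_OffsetValues[r][c]); /* force network byte order */
            if (fwrite(&be, sizeof be, 1, fp) != 1) {
                fclose(fp);
                return -1;
            }
        }
    }
    fclose(fp);
    return 0;
}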

Delphi - writing a large dynamic array to disk using stream

In a Delphi program, I have a dynamic array with 4,000,000,001 cardinals. I'm trying to write it to a drive (and later read it back). I used the following:
const Billion = 1000000000;
stream := tFileStream.Create( 'f:\data\BigList.data', fmCreate);
stream.WriteBuffer( Pointer( BigArray)^, (4 * billion + 1) * SizeOf( cardinal));
stream.free;
It bombed out with: ...raised exception class EWriteError with message 'Stream write error'.
The size of the file it wrote is only 3,042,089KB.
Am I doing something wrong? Is there a limit to the size that can be written at once (about 3GB)?
The Count parameter of WriteBuffer is a 32 bit integer so you cannot pass the required value in that parameter. You will need to write the file with multiple separate calls to WriteBuffer, where each call passes a count that does not exceed this limit.
I suggest that you write it something like this.
var
  Count, Index, N: Int64;
....
Count := Length(BigArray);
Index := 0;
while Count > 0 do begin
  N := Min(Count, 8192); // Min is declared in the Math unit
  stream.WriteBuffer(BigArray[Index], N * SizeOf(BigArray[0]));
  inc(Index, N);
  dec(Count, N);
end;
An additional benefit is that you can readily display progress.
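For example, a sketch of the progress idea (assumes a console program with Math and SysUtils in the uses clause):
while Count > 0 do begin
  N := Min(Count, 8192);
  stream.WriteBuffer(BigArray[Index], N * SizeOf(BigArray[0]));
  inc(Index, N);
  dec(Count, N);
  if Index mod (8192 * 1024) = 0 then // roughly every 32 MB written
    Writeln(Format('%d of %d items written', [Index, Length(BigArray)]));
end;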

Why cgo's performance is so slow? is there something wrong with my testing code?

I'm doing a test: comparing the execution times of a cgo function and a pure Go function, each run 100 million times. The cgo function takes much longer than the Go function, and I am confused by this result. My testing code is:
package main

import (
    "fmt"
    "time"
)

/*
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void show() {
}
*/
// #cgo LDFLAGS: -lstdc++
import "C"

//import "fmt"

func show() {
}

func main() {
    now := time.Now()
    for i := 0; i < 100000000; i = i + 1 {
        C.show()
    }
    end_time := time.Now()
    var dur_time time.Duration = end_time.Sub(now)
    var elapsed_min float64 = dur_time.Minutes()
    var elapsed_sec float64 = dur_time.Seconds()
    var elapsed_nano int64 = dur_time.Nanoseconds()
    fmt.Printf("cgo show function elapsed %f minutes or \nelapsed %f seconds or \nelapsed %d nanoseconds\n",
        elapsed_min, elapsed_sec, elapsed_nano)
    now = time.Now()
    for i := 0; i < 100000000; i = i + 1 {
        show()
    }
    end_time = time.Now()
    dur_time = end_time.Sub(now)
    elapsed_min = dur_time.Minutes()
    elapsed_sec = dur_time.Seconds()
    elapsed_nano = dur_time.Nanoseconds()
    fmt.Printf("go show function elapsed %f minutes or \nelapsed %f seconds or \nelapsed %d nanoseconds\n",
        elapsed_min, elapsed_sec, elapsed_nano)
    var input string
    fmt.Scanln(&input)
}
and the result is:
cgo show function elapsed 0.368096 minutes or
elapsed 22.085756 seconds or
elapsed 22085755775 nanoseconds
go show function elapsed 0.000654 minutes or
elapsed 0.039257 seconds or
elapsed 39257120 nanoseconds
The results show that invoking the C function is slower than the Go function. Is there something wrong with my testing code?
My system is: mac OS X 10.9.4 (13E28)
As you've discovered, there is fairly high overhead in calling C/C++ code via CGo. So in general, you are best off trying to minimise the number of CGo calls you make. For the above example, rather than calling a CGo function repeatedly in a loop it might make sense to move the loop down to C.
There are a number of aspects of how the Go runtime sets up its threads that can break the expectations of many pieces of C code:
Goroutines run on a relatively small stack, handling stack growth through segmented stacks (old versions) or by copying (new versions).
Threads created by the Go runtime may not interact properly with libpthread's thread local storage implementation.
The Go runtime's UNIX signal handler may interfere with traditional C or C++ code.
Go reuses OS threads to run multiple Goroutines. If the C code called a blocking system call or otherwise monopolised the thread, it could be detrimental to other goroutines.
For these reasons, CGo picks the safe approach of running the C code in a separate thread set up with a traditional stack.
If you are coming from languages like Python, where it isn't uncommon to rewrite code hotspots in C as a way to speed up a program, you will be disappointed. But at the same time, there is a much smaller performance gap between equivalent C and Go code.
In general I reserve CGo for interfacing with existing libraries, possibly with small C wrapper functions that can reduce the number of calls I need to make from Go.
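For the test above, that could look roughly like this (a sketch; show_n is an assumed helper and the timing code is trimmed):
package main

/*
void show() {
}

// Run the empty function n times on the C side, so Go pays for a single cgo call.
void show_n(long n) {
    long i;
    for (i = 0; i < n; i++) {
        show();
    }
}
*/
import "C"

import (
    "fmt"
    "time"
)

func main() {
    start := time.Now()
    C.show_n(100000000) // one cgo crossing instead of 100,000,000
    fmt.Println("loop moved into C:", time.Since(start))
}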
Update to James's answer: it seems that there is no thread switch in the current implementation.
See this thread on golang-nuts:
There's always going to be some overhead.
It's more expensive than a simple function call but
significantly less expensive than a context switch
(agl is remembering an earlier implementation;
we cut out the thread switch before the public release).
Right now the expense is basically just having to
do a full register set switch (no kernel involvement).
I'd guess it's comparable to ten function calls.
See also this answer, which links to the "cgo is not Go" blog post.
C doesn’t know anything about Go’s calling convention or growable stacks, so a call down to C code must record all the details of the goroutine stack, switch to the C stack, and run C code which has no knowledge of how it was invoked, or the larger Go runtime in charge of the program.
Thus, cgo has overhead because it performs a stack switch, not a thread switch.
It saves and restores all registers when a C function is called, which is not required when a Go function or an assembly function is called.
Besides that, cgo's calling conventions forbid passing Go pointers directly to C code; the common workaround is to use C.malloc, which introduces additional allocations. See this question for details.
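As an illustration of that workaround (a sketch; C.strlen merely stands in for any C function that needs to see the data):
package main

/*
#include <stdlib.h>
#include <string.h>
*/
import "C"

import (
    "fmt"
    "unsafe"
)

func main() {
    msg := "hello from Go"
    cs := C.CString(msg)             // copies the bytes into C-allocated memory (malloc under the hood)
    defer C.free(unsafe.Pointer(cs)) // this extra allocate/copy/free is part of cgo's cost
    fmt.Println(C.strlen(cs))        // C code only ever sees C-allocated memory
}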
I support gavv's answer. On Windows:
package main

/*
#include "stdio.h"
#include <Windows.h>
unsigned long CTid(void) {
    return GetCurrentThreadId();
}
*/
import "C"

import (
    "fmt"
    "time"

    "golang.org/x/sys/windows"
)

func main() {
    fmt.Println(uint32(C.CTid()))
    fmt.Println(windows.GetCurrentThreadId())
    time.Sleep(time.Second * 5)
}
Go and cgo report the same thread ID.
There is a little overhead in calling C functions from Go. This cannot be changed.
