I have a problem when I write a large amount of data (>2 GB) to a file. The first ~1.4 GB of data is written fast (100 MB/s), then the code becomes really slow (0-2 MB/s).
My code (simplified) is:
//FileOptions FILE_FLAG_NO_BUFFERING = (FileOptions)0x20000000;
FileOptions fileOptions = FileOptions.SequentialScan;
int fileBufferSize = 1024 * 1024;
byte[] Buffer = new byte[32768];
Random random = new Random();
long fileSize = 2588490188;
long totalByteWritten = 0;
using (FileStream fs = File.Create(@"c:\test\test.bin", fileBufferSize, fileOptions))
{
    while (totalByteWritten < fileSize)
    {
        random.NextBytes(Buffer);
        fs.Write(Buffer, 0, Buffer.Length);
        totalByteWritten += Buffer.Length;

        //Thread.Sleep(10);
    }
}
I think the issue is related to caching: during the "fast write" phase RAM usage increases as well, and when RAM usage stops increasing there is a drop in write performance.
What I have tried:
change to async write
-> no significant change
change the array buffer size
-> no significant change
change fileBufferSize
-> no significant change, but with a large buffer (~100 MB) write performance is fast, and when RAM usage stops increasing write performance drops to 0 and then, after a while, goes back to 100 MB/s; it seems the cache buffer gets "flushed"
change fileOptions to WriteThrough
-> performance is always slow
add fs.Flush(true) after every xx loops
-> no significant change
uncomment Thread.Sleep(10)
-> write speed is always good... this is strange
Is it somehow trying to write before it's finished writing the previous chunk and getting in a mess? (seems unlikely, but it's very odd that the Thread.Sleep should speed it up and this might explain it). What happens if you modify the code inside the using statement to lock the filestream, like this?
using (FileStream fs = File.Create(@"c:\testing\test.bin", fileBufferSize, fileOptions))
{
    while (fs.Position < fileSize)
    {
        lock (fs) // this is the bit I have added to try to speed it up
        {
            random.NextBytes(Buffer);
            fs.Write(Buffer, 0, Buffer.Length);
        }
    }
}
EDIT: I have tweaked your example code to include the while loop required to make it write a file of the correct size.
Incidentally, when I run the sample code it is very quick with or without the lock statement and adding the sleep slows it down significantly.
I'm running a compute shader. It takes about 0.22 ms to complete (according to a D3D11 query), but real-world time is around 0.6 ms on average, so the Map call definitely takes up some time! The shader computes a dirty RECT that contains all the pixels that have changed since the last frame and returns it to the CPU. This all works great!
My issue is that the Map call is very slow. Normal performance is fine, at around 0.6 ms, but if I run this in the background alongside any GPU-intensive application (Blender, an AAA game, etc.) it really starts to slow down: I see times jump from the normal 0.6 ms all the way to 9 ms! I need to figure out why the Map is taking so long!
The shader writes its result into this buffer:
buffer_desc.ByteWidth = sizeof(RECT);
buffer_desc.StructureByteStride = sizeof(RECT);
buffer_desc.Usage = D3D11_USAGE_DEFAULT;
buffer_desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
buffer_desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
buffer_desc.CPUAccessFlags = 0;
hr = ID3D11Device5_CreateBuffer(ctx->device, &buffer_desc, NULL, &ctx->reduction_final);
With a UAV of
uav_desc.Format = DXGI_FORMAT_UNKNOWN;
uav_desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uav_desc.Buffer.FirstElement = 0;
uav_desc.Buffer.NumElements = 1;
uav_desc.Buffer.Flags = 0;
hr = ID3D11Device5_CreateUnorderedAccessView(ctx->device, (ID3D11Resource*) ctx->reduction_final, &uav_desc, &ctx->reduction_final_UAV);
Copying it into this staging buffer here
buffer_desc.ByteWidth = sizeof(RECT);
buffer_desc.StructureByteStride = sizeof(RECT);
buffer_desc.Usage = D3D11_USAGE_STAGING;
buffer_desc.BindFlags = 0;
buffer_desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
buffer_desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
hr = ID3D11Device5_CreateBuffer(ctx->device, &buffer_desc, NULL, &ctx->cpu_staging);
and mapping it like so
hr = ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0, D3D11_MAP_READ, 0, &mapped_res);
I added a fence to my system so I could time out if it takes too long (longer than 1 ms).
Creation of fence and event
hr = ID3D11Device5_CreateFence(ctx->device, 1, D3D11_FENCE_FLAG_NONE, &IID_ID3D11Fence, (void**) &ctx->fence);
ctx->fence_handle = CreateEventA(NULL, FALSE, FALSE, NULL);
Then after my dispatch calls I copy into the staging, then signal the fence
ID3D11DeviceContext4_CopyResource(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, (ID3D11Resource*) ctx->reduction_final);
ID3D11DeviceContext4_Signal(ctx->device_ctx, ctx->fence, ++ctx->fence_counter);
hr = ID3D11Fence_SetEventOnCompletion(ctx->fence, ctx->fence_counter, ctx->fence_handle);
This all seems to be working. The fence signals and allows WaitForSingleObject to pass, and when I wait on it I can time out, move on to other things while this finishes, and then come back (this is not OK, obviously; I want the data from the shader ASAP).
DWORD waitResult = WaitForSingleObject(ctx->fence_handle, 1);
if (waitResult != WAIT_OBJECT_0)
{
    LOG("Timeout Count: %u", ++timeout_count);
    return false;
}
else
{
    LOG("Timeout Count: %u", timeout_count);
    ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0, D3D11_MAP_READ, 0, &mapped_res);
    memcpy(rect, mapped_res.pData, sizeof(RECT));
    ID3D11DeviceContext4_Unmap(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0);
    return true;
}
The problem is, the Map can still take a long time (ranging from 0.6 to 9 ms). The fence signals once everything is done, right? And that takes less than 1 ms. But the Map call itself can still take a long time. What gives? What am I missing about how mapping works? Is it possible to reduce the time to map? Is the fence signaling before the copy happens?
So, just to recap: my shader runs fast under normal operation, around 0.6 ms total time, with 0.22 ms for the shader itself (and I'm assuming the rest is the mapping). Once I start running my shader while rendering with Blender, it starts to slow down seriously. The fence tells me the work is still being completed in record time, but the Map call is still taking a very long time (upwards of 9 ms).
Other random things I've tried:
Setting SetGPUThreadPriority to 7.
Setting AvSetMmThreadCharacteristicsW to "DisplayPostProcessing", "Games", and "Capture"
Setting AvSetMmThreadPriority to AVRT_PRIORITY_CRITICAL
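One idea I have not tried yet: polling the staging buffer with D3D11_MAP_FLAG_DO_NOT_WAIT instead of letting Map block. With that flag, Map returns DXGI_ERROR_WAS_STILL_DRAWING while the copy is still in flight, so in theory the thread could do other work and retry. A rough sketch only (untested, same ctx fields and mapped_res as above):
HRESULT map_hr = ID3D11DeviceContext4_Map(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0,
                                          D3D11_MAP_READ, D3D11_MAP_FLAG_DO_NOT_WAIT, &mapped_res);
if (map_hr == DXGI_ERROR_WAS_STILL_DRAWING)
{
    // Copy has not finished yet; do other work and poll again later.
    return false;
}
if (SUCCEEDED(map_hr))
{
    memcpy(rect, mapped_res.pData, sizeof(RECT));
    ID3D11DeviceContext4_Unmap(ctx->device_ctx, (ID3D11Resource*) ctx->cpu_staging, 0);
    return true;
}
return false;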
I need to sort a large binary file of size M using t threads. Records in the file are all of equal size. The task explicitly says that the amount of memory I can allocate is m, which is much smaller than M. The hard drive is also guaranteed to have at least 2 * M of free space. This calls for merge sort, of course, but it turned out not to be so obvious. I see three different approaches here:
A. Memory-map the files input, temp1 and temp2. Perform merge sort input -> temp1 -> temp2 -> temp1 ... until one of the temps is sorted. Threads only contend when selecting the next portion of work; there is no contention on reads/writes.
B. fopen the 3 files t times each, so each thread gets 3 FILE pointers, one per file. Again, threads contend only for the next portion of work; reads and writes should proceed in parallel.
C. fopen the 3 files once each and keep them under mutexes; all threads work in parallel, but to grab more work, or to read or write, they lock the respective mutex.
Notes:
In real life I would choose A for sure. But doesn't it defeat the whole purpose of having a limited buffer? (In other words, isn't it cheating?) With such an approach I could even radix sort the whole file in place without an extra buffer. Also, this solution is Linux-specific; I think Linux is implied by the conversation, but it's not stated explicitly in the task description.
Regarding B, I think it works on Linux but isn't portable; see the Linux note above.
Regarding C, it's portable, but I am not sure how to optimize it (e.g. 8 threads with a small enough m will just wait their turn in the queue, then read/write a tiny portion of data, then instantly sort it and bump into each other again; IMO it's unlikely to work faster than 1 thread).
Questions:
Which solution is a better match for the task?
Which solution is a better design in real life (assuming Linux)?
Does B work? In other words, is opening a file multiple times and writing to different parts of it in parallel legal? (See the sketch after these questions for what I mean.)
Any alternative approaches?
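To make B concrete, here is a minimal sketch of the access pattern I have in mind, written with POSIX pread()/pwrite() on per-thread descriptors rather than FILE*. RECORD_SIZE and the copy/sort body are placeholders for illustration:
/* Sketch only: each thread opens its own descriptors and touches a
 * disjoint byte range, so the I/O itself needs no locking. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define RECORD_SIZE 64   /* assumed fixed record size */

/* Read `count` records starting at record index `first` from in_path,
 * (sort them,) and write them to the same position in out_path. */
static int copy_range(const char *in_path, const char *out_path,
                      long first, long count)
{
    int in = open(in_path, O_RDONLY);
    int out = open(out_path, O_WRONLY);
    char *buf;
    off_t offset = (off_t) first * RECORD_SIZE;
    ssize_t n = -1;

    if (in >= 0 && out >= 0) {
        buf = malloc((size_t) count * RECORD_SIZE);
        /* pread/pwrite take an explicit offset, so threads working on
         * disjoint ranges of the same file never step on each other. */
        n = pread(in, buf, (size_t) count * RECORD_SIZE, offset);
        /* ... sort the records in buf here ... */
        if (n > 0)
            pwrite(out, buf, (size_t) n, offset);
        free(buf);
    } else {
        perror("open");
    }
    if (in >= 0) close(in);
    if (out >= 0) close(out);
    return n < 0 ? -1 : 0;
}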
Your question has many facets, so I will try to break it down a bit, while trying to answer almost all of your questions:
You are given a large file on a storage device that probably operates on blocks, i.e. you can load and store many entries at the same time. If you access a single entry from storage, you have to deal with a rather large access latency, which you can only try to hide by loading many elements at the same time, thus amortizing the latency over all element load times.
Your main memory is quite fast compared to the storage (especially for random access), so you want to keep as much data in main memory as possible and only read and write sequential blocks on the storage. This is also the reason why A is not really cheating, since if you tried to use your storage for random access, you would be waaay slower than using main memory.
Combining these results, you can arrive at the following approach, which is basically A but with some engineering details that are usually used in external algorithms.
Use only a single dedicated thread for reading and writing on the storage.
This way, you need only one file descriptor for every file and could in theory even collect and reorder read and write requests from all threads within a small timeframe to get nearly sequential access patterns. Additionally, your threads can just queue a write request and continue with the next block without waiting for the IO to finish.
Load t blocks (from input) into main memory of a maximum size such that you can run mergesort in parallel on each of these blocks. After the blocks are sorted, write them onto the storage as temp1.
Repeat this until all blocks in the file have been sorted.
Now do a so-called multiway merge on the sorted blocks:
Every thread loads a certain number k of consecutive blocks from temp1 into memory and merges them using a priority queue or tournament tree to find the next minimum to be inserted into the resulting block (a small sketch of such a priority-queue merge is given at the end of this answer). As soon as your block is full, you write it onto your storage at temp2 to free up memory for the next block. After this step, conceptually swap temp1 and temp2.
You still need to do several merge steps, but this number is down by a factor of log k compared to regular two-way merges you probably meant in A. After the first few merge steps, your blocks will probably be too large to fit into main memory, so you split them into smaller blocks and, starting from the first small block, fetch the next block only when all of the previous elements have already been merged. Here, you might even be able to do some prefetching since the order of block accesses is predetermined by the block minima, but this is probably outside the scope of this question.
Note that the value for k is usually only limited by available memory.
Finally, you arrive at t huge blocks which need to be merged together. I don't really know if there is a nice parallel approach to this; it might be necessary to just merge them sequentially, so again you can work with a t-way merge as above to end up with a single sorted file.
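To make the priority-queue step concrete, here is a minimal sketch of a k-way merge in plain C. It merges runs that are already in memory and uses int records for brevity; in the real algorithm, each exhausted run head would instead be refilled from that run's next block on temp1:
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int value;   /* current head element of the run */
    int run;     /* which run it came from */
} HeapItem;

static void heap_sift_down(HeapItem *h, int n, int i) {
    for (;;) {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l].value < h[smallest].value) smallest = l;
        if (r < n && h[r].value < h[smallest].value) smallest = r;
        if (smallest == i) break;
        HeapItem tmp = h[i]; h[i] = h[smallest]; h[smallest] = tmp;
        i = smallest;
    }
}

/* Merge k sorted runs (runs[i] has lens[i] elements) into out. */
static void kway_merge(int **runs, const int *lens, int k, int *out) {
    HeapItem *heap = malloc(sizeof(HeapItem) * k);
    int *pos = calloc(k, sizeof(int));
    int heap_size = 0, n_out = 0, i;

    for (i = 0; i < k; i++)                      /* seed heap with each run's head */
        if (lens[i] > 0)
            heap[heap_size++] = (HeapItem){ runs[i][0], i };
    for (i = heap_size / 2 - 1; i >= 0; i--)     /* heapify */
        heap_sift_down(heap, heap_size, i);

    while (heap_size > 0) {
        HeapItem min = heap[0];
        int r = min.run;
        out[n_out++] = min.value;
        if (++pos[r] < lens[r])                  /* refill from the same run */
            heap[0] = (HeapItem){ runs[r][pos[r]], r };
        else                                     /* run exhausted: shrink heap */
            heap[0] = heap[--heap_size];
        heap_sift_down(heap, heap_size, 0);
    }
    free(heap);
    free(pos);
}

int main(void) {
    int r0[] = {1, 4, 9}, r1[] = {2, 3, 8}, r2[] = {5, 6, 7};
    int *runs[] = {r0, r1, r2}, lens[] = {3, 3, 3}, out[9], i;
    kway_merge(runs, lens, 3, out);
    for (i = 0; i < 9; i++) printf("%d ", out[i]);  /* prints 1..9 in order */
    printf("\n");
    return 0;
}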
GNU sort is a multi-threaded merge sort for text files, but its basic approach can be used here. Define a "chunk" as the number of records that can be sorted in a memory of size m.
Sort phase: read a "chunk" of records, use a multi-threaded sort on it, then write it to a temp file; you end up with ceiling(M / m) temp files. GNU sort sorts an array of pointers to records, partly because its records are variable length. For fixed-size records, in my testing, it's faster (due to cache effects) to sort the records directly rather than an array of pointers to records (which results in cache-unfriendly random access to the records), unless the record size is larger than somewhere between 128 and 256 bytes. A small sketch of the direct-record sort is given at the end of this answer.
Merge phase: perform single-threaded k-way merges (e.g. using a priority queue) on the temp files until a single file is produced. Multi-threading doesn't help here, since it's assumed that the k-way merge phase is I/O bound and not CPU bound. For GNU sort the default for k is 16 (it does 16-way merges on the temp files).
To keep from exceeding 2 x M space, files will need to be deleted once they have been read.
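To illustrate sorting fixed-size records directly, the in-memory sort of one chunk can be as simple as the sketch below. REC_SIZE and the whole-record memcmp comparison are placeholders; a real comparator would compare the actual key field of the record:
#include <stdlib.h>
#include <string.h>

#define REC_SIZE 100   /* assumed fixed record size in bytes */

static int cmp_record(const void *a, const void *b) {
    return memcmp(a, b, REC_SIZE);   /* compare whole records directly, no pointer array */
}

/* Sort nrecords fixed-size records stored back to back in chunk. */
void sort_chunk(unsigned char *chunk, size_t nrecords) {
    qsort(chunk, nrecords, REC_SIZE, cmp_record);
}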
If your file is much bigger than your RAM, then this is the solution: https://stackoverflow.com/a/49839773/1647320
If your file size is 70-80% of your RAM size, then the following is the solution. It's an in-memory parallel merge sort.
Change these lines according to your machine: fPath is your one big input file, shared is where the execution log is stored, and fdir is where the intermediate files will be stored and merged.
public static final String fdir = "/tmp/";
public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
public static final String fPath = "/input/data-20GB.in";
public static final String opLog = shared+"Mysort20GB.log";
Then run the following program. Your final sorted file will be created with the name op2GB in the fdir path. The last line, Runtime.getRuntime().exec("valsort " + opfile + " > " + opLog);, checks whether the output is sorted. Remove this line if you don't have valsort installed on your machine or if the input file was not generated using gensort (http://www.ordinal.com/gensort.html).
Also, don't forget to change int totalLines = 20000000; to the total number of lines in your file, and the thread count (int threadCount = 8) should always be a power of 2.
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.stream.Stream;

class SplitJob extends Thread {
    LinkedList<String> chunkName;
    int startLine, endLine;

    SplitJob(LinkedList<String> chunkName, int startLine, int endLine) {
        this.chunkName = chunkName;
        this.startLine = startLine;
        this.endLine = endLine;
    }

    public void run() {
        try {
            int totalLines = endLine + 1 - startLine;
            Stream<String> chunks =
                Files.lines(Paths.get(Mysort2GB.fPath))
                     .skip(startLine - 1)
                     .limit(totalLines)
                     .sorted(Comparator.naturalOrder());
            chunks.forEach(line -> {
                chunkName.add(line);
            });
            System.out.println(" Done Writing " + Thread.currentThread().getName());
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

class MergeJob extends Thread {
    int list1, list2, oplist;

    MergeJob(int list1, int list2, int oplist) {
        this.list1 = list1;
        this.list2 = list2;
        this.oplist = oplist;
    }

    public void run() {
        try {
            System.out.println(list1 + " Started Merging " + list2);
            LinkedList<String> merged = new LinkedList<>();
            LinkedList<String> ilist1 = Mysort2GB.sortedChunks.get(list1);
            LinkedList<String> ilist2 = Mysort2GB.sortedChunks.get(list2);
            // Merge 2 lists based on which string is greater.
            while (ilist1.size() != 0 || ilist2.size() != 0) {
                if (ilist1.size() == 0 ||
                    (ilist2.size() != 0 && ilist1.get(0).compareTo(ilist2.get(0)) > 0)) {
                    merged.add(ilist2.remove(0));
                } else {
                    merged.add(ilist1.remove(0));
                }
            }
            System.out.println(list1 + " Done Merging " + list2);
            Mysort2GB.sortedChunks.remove(list1);
            Mysort2GB.sortedChunks.remove(list2);
            Mysort2GB.sortedChunks.put(oplist, merged);
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

public class Mysort2GB {
    //public static final String fdir = "/Users/diesel/Desktop/";
    public static final String fdir = "/tmp/";
    public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
    public static final String fPath = "/input/data-2GB.in";
    public static HashMap<Integer, LinkedList<String>> sortedChunks = new HashMap();
    public static final String opfile = fdir + "op2GB";
    public static final String opLog = shared + "mysort2GB.log";

    public static void main(String[] args) throws Exception {
        long startTime = System.nanoTime();
        int threadCount = 8; // Number of threads
        int totalLines = 20000000;
        int linesPerFile = totalLines / threadCount;
        LinkedList<Thread> activeThreads = new LinkedList<Thread>();

        for (int i = 1; i <= threadCount; i++) {
            int startLine = i == 1 ? i : (i - 1) * linesPerFile + 1;
            int endLine = i * linesPerFile;
            LinkedList<String> thisChunk = new LinkedList<>();
            SplitJob mapThreads = new SplitJob(thisChunk, startLine, endLine);
            sortedChunks.put(i, thisChunk);
            activeThreads.add(mapThreads);
            mapThreads.start();
        }
        activeThreads.stream().forEach(t -> {
            try {
                t.join();
            } catch (Exception e) {
            }
        });

        int treeHeight = (int) (Math.log(threadCount) / Math.log(2));
        for (int i = 0; i < treeHeight; i++) {
            LinkedList<Thread> actvThreads = new LinkedList<Thread>();
            for (int j = 1, itr = 1; j <= threadCount / (i + 1); j += 2, itr++) {
                int offset = i * 100;
                int list1 = j + offset;
                int list2 = (j + 1) + offset;
                int opList = itr + ((i + 1) * 100);
                MergeJob reduceThreads = new MergeJob(list1, list2, opList);
                actvThreads.add(reduceThreads);
                reduceThreads.start();
            }
            actvThreads.stream().forEach(t -> {
                try {
                    t.join();
                } catch (Exception e) {
                }
            });
        }

        BufferedWriter writer = Files.newBufferedWriter(Paths.get(opfile));
        sortedChunks.get(treeHeight * 100 + 1).forEach(line -> {
            try {
                writer.write(line + "\r\n");
            } catch (Exception e) {
            }
        });
        writer.close();

        long endTime = System.nanoTime();
        double timeTaken = (endTime - startTime) / 1e9;
        System.out.println(timeTaken);

        BufferedWriter logFile = new BufferedWriter(new FileWriter(opLog, true));
        logFile.write("Time Taken in seconds:" + timeTaken);
        Runtime.getRuntime().exec("valsort " + opfile + " > " + opLog);
        logFile.close();
    }
}
Scenario: We are trying to download 2500 PDFs from our website, and we need to find the response time of this scenario when run together with the other business flows of the application. The custom code I had written for selecting and downloading PDFs dynamically worked fine for 200-300 PDFs, both in VuGen and on the Controller. But when we ran the same script with 2500 PDFs loaded into the DB, the script worked fine in VuGen but failed with an out-of-memory error on the Controller. I tried running this script alone on the Controller with 20 concurrent users, and even then it failed with the same out-of-memory error. I started getting this error as soon as the concurrent users started running on the server. I tried the following things; here are my observations:
1. I checked the load generator we are using; it had no high CPU or memory usage at the time I got this memory error.
2. I tried turning off logging completely and also turned off "Generate snapshot on error".
3. I increased the network buffer size from the default 12 KB to a higher value, around 2 MB, since that was the size of the PDFs the server was responding with.
4. I also increased the JavaScript runtime memory to a higher value, but I know the issue is something to do with the code.
5. I have set web_set_max_html_param_len("100000");
Here is my code:
int download_size, i, m;

m = atoi(lr_eval_string("{DownloadableRecords_FundingNotices_count}"));

for (i = 1; i <= m; i++)
    lr_param_sprintf("r_buf", "%sselectedNotice=%s&",
                     lr_eval_string("{r_buf}"),
                     lr_paramarr_idx("DownloadableRecords_FundingNotices", i));

lr_save_string(lr_eval_string("{r_buf}"), "dpAllRecords");
I am not able to find what the issue with my code is, as it runs fine in VuGen. One thing I noticed: it creates a huge mdrv.log file to accommodate all 2500 members in the format shown above ("%sselectedNotice=%s&").
I need help with this.
Okay, since that did not work and I could not find the root cause, I tried modifying the code to use a string buffer to hold the value instead of the parameter. This time my code did not work properly, and I could not get a properly formatted value, resulting in my web_custom_request failing.
So, here is the code with sprintf:
char *r_buf = (char *) malloc(55000);
int download_size, i, m;

m = atoi(lr_eval_string("{DownloadableRecords_FundingNotices_count}"));

for (i = 1; i <= m; i++)
    sprintf(r_buf, "%sselectedNotice=%s&", r_buf,
            lr_paramarr_idx("DownloadableRecords_FundingNotices", i));

lr_save_string(r_buf, "dpAllRecords");
I also tried using this:
lr_save_string(lr_eval_string("{r_buf}"), "dpAllRecords");
though that is meant for embedded parameters, to no avail.
You could try something like the code below. It frees the allocated memory, something you do not do in your examples.
I changed:
The way r_buf is allocated
How r_buf is populated (doing a sprintf() into a buffer that is also one of its own source arguments is undefined behaviour and might not work as expected)
It now uses lr_paramarr_len()
It FREES THE ALLOCATED BUFFER!
It checks in the loop that the allocated buffer is big enough
Action() Code:
char *r_buf;
char buf[2048];
int download_size, i, m;

// Allocate memory (calloc zero-fills, so r_buf starts as an empty string)
if ((r_buf = (char *) calloc(65535, sizeof(char))) == NULL)
{
    lr_error_message("Insufficient memory available");
    return -1;
}

memset(buf, 0, sizeof(buf));

m = lr_paramarr_len("DownloadableRecords_FundingNotices");
for (i = 1; i <= m; i++) {
    sprintf(buf, "selectedNotice=%s&", lr_paramarr_idx("DownloadableRecords_FundingNotices", i));

    // Check the buffer is big enough to hold the new data plus the terminating NUL
    if (strlen(r_buf) + strlen(buf) >= 65535) {
        lr_error_message("Buffer exceeded");
        lr_abort();
    }

    // Concatenate to final buffer
    strcat(r_buf, buf); // Bugfix: This was "strcat( r_buf, "%s", buf );"
}

// Save buffer to variable
lr_save_string(r_buf, "dpAllRecords");

// Free memory
free(r_buf);
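If concatenating 2500 entries with strcat() ever turns out to be slow (strcat() rescans r_buf on every call), a possible variant is to keep a write cursor instead. This is only a sketch, using the same LoadRunner calls and buffer sizes as above:
char *r_buf;
char buf[2048];
char *p;
int i, m, n;

if ((r_buf = (char *) calloc(65535, sizeof(char))) == NULL) {
    lr_error_message("Insufficient memory available");
    return -1;
}

p = r_buf;                               // write cursor into r_buf
m = lr_paramarr_len("DownloadableRecords_FundingNotices");
for (i = 1; i <= m; i++) {
    n = sprintf(buf, "selectedNotice=%s&",
                lr_paramarr_idx("DownloadableRecords_FundingNotices", i));
    if ((p - r_buf) + n >= 65535) {      // leave room for the terminating NUL
        lr_error_message("Buffer exceeded");
        lr_abort();
    }
    memcpy(p, buf, n + 1);               // copy the new piece including its NUL
    p += n;                              // advance instead of rescanning r_buf
}

lr_save_string(r_buf, "dpAllRecords");
free(r_buf);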
I use the JPEG library v8d from the Independent JPEG Group, and I want to change the way JPEG decompression reads and processes data.
In the djpeg main(), only one scanline/row at a time is read and processed in each jpeg_read_scanlines() call. So, to read the entire image, this function is called until all lines are read and processed:
while (cinfo.output_scanline < cinfo.output_height) {
    num_scanlines = jpeg_read_scanlines(&cinfo, dest_mgr->buffer,
                                        dest_mgr->buffer_height); // read and process
    (*dest_mgr->put_pixel_rows) (&cinfo, dest_mgr, num_scanlines); // write to file
}
But I would like to read the entire image once and store it in the memory and then process the entire image from memory. By reading libjpeg.txt, I found out this is possible: "You can process an entire image in one call if you have it all in memory, but usually it's simplest to process one scanline at a time."
Even though I made some progress, I couldn't make it work completely. I can now read a couple of rows at once by increasing the pub.buffer_height value and the pub.buffer size, but no matter how large pub.buffer_height and pub.buffer are, only a couple of lines are read in each jpeg_read_scanlines() call. Any thoughts on this?
only a couple of lines are read in each jpeg_read_scanlines()
Yes, so you call it in a loop. Here's a loop that grabs one scanline at a time:
unsigned char *rowp[1], *pixdata = ...;
unsigned rowbytes = ..., height = ...;

while (cinfo.output_scanline < height) {
    rowp[0] = pixdata + cinfo.output_scanline * rowbytes;
    jpeg_read_scanlines(&cinfo, rowp, 1);
}
Once the loop exits, you have the entire image.
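Note that jpeg_read_scanlines() is allowed to return fewer rows than you ask for, which is why growing buffer_height alone doesn't seem to change much; you still need the loop, you just hand it more row pointers per call. A sketch that decodes the whole image into one contiguous buffer (assuming jpeg_read_header() and jpeg_start_decompress() have already been called on cinfo, and skipping error handling):
unsigned rowbytes = cinfo.output_width * cinfo.output_components;
unsigned char *pixdata = malloc((size_t) rowbytes * cinfo.output_height);
JSAMPROW rows[16];                 /* hand the library up to 16 row pointers per call */

while (cinfo.output_scanline < cinfo.output_height) {
    JDIMENSION i, want = cinfo.output_height - cinfo.output_scanline;
    if (want > 16)
        want = 16;
    for (i = 0; i < want; i++)     /* point each row at its place in pixdata */
        rows[i] = pixdata + (size_t) (cinfo.output_scanline + i) * rowbytes;
    /* may read fewer than `want` rows; output_scanline advances accordingly */
    jpeg_read_scanlines(&cinfo, rows, want);
}
/* pixdata now holds the full decompressed image, row after row */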
I'm seeing a memory leak with the following code:
while (true) {
    console.log("Testing.");
}
I have tried defining the string once up front and just using that variable, but it still leaks memory:
var test = "Testing.";
while (true) {
console.log(test);
}
The same leak happens if I use a file instead of the standard log:
var test = "Testing.";
var fh = fs.createWriteStream("test.out", {flags: "a"});
while (true) {
fh.write(test);
}
I thought maybe it was because I wasn't properly closing the file, but I tried this and still saw the leak:
var test = "Testing";
while (true) {
var fh = fs.createWriteStream("test.out", {flags: "a"});
fh.end(test);
fh.destroy();
fh = null;
}
Does anyone have any hints as to how I'm supposed to write things without leaking memory?
This happens because you never give node a chance to handle "write successful" events, so they queue up endlessly. To give node a chance to handle them, you have to let the event loop do one iteration from time to time. This won't leak:
function newLine() {
    console.log("Testing.");
    process.nextTick(newLine);
}
newLine();
In real use cases this is not an issue, because you hardly ever have to write out such huge amounts of data at once that it matters. And if you do, cycle the event loop from time to time.
However, there's also a second issue with this that also occurs with the nextTick trick: Writing is async, and if the console/file/whatever is slower than node, node buffers data endlessly until the output is free again. To avoid this, you'll have to listen for the drain event after writing some stuff - it tells you when the pipe is free again. See here: http://nodejs.org/docs/latest/api/streams.html#event_drain_