Read entire file bytes at once using InterSystems Caché?

I have a file of bytes, 1.5 GB in size (filebyte). I want to read the entire file in one operation, similar to Delphi's
bytedata:=filebyte.readallbytes(filename);
The result is that in a single call you get a byte array whose number of elements is high(bytedata)-low(bytedata)+1. Is there equivalent code in Caché? Can a file 1.5 GB in size be held in memory in Caché?
I do not want to read the file in blocks, because the operation that analyses the data requires the whole file to be in memory at one time.
Thanks

You can read as much data from the stream as you need. The problem is how much you can store in a local variable.
set fs=##class(%Stream.FileCharacter).%New()
set fs.Filename="c:\test.txt"
set length=fs.Size
set data=fs.Read(length) // only if the size is no more than ~3.6 MB
Local variable size is limited to 3,641,144 bytes, or 32,767 bytes if long strings are disabled. Up to version 2012.1, memory per process was limited to 48 MB. In 2012.2 this changed, and it is now possible to allocate up to 2 terabytes per process, adjustable at run time for the current process through the special variable $zstorage.

Related

Read a file after write and closing it in C

My code does the following:
1. do 100 times: open a new file; write 10M of data; close it
2. open the 100 files together, read and merge their data into a larger file
3. do steps 1 and 2 many times in a loop
I was wondering if I can keep the 100 files open without opening and closing them so many times. What I can do is fopen them with "w+". After writing, I set the position to the beginning to read; after reading, I set the position to the beginning to write, and so on.
The questions are:
If I read after writing without closing, do I always read all the written data?
Would this save some overhead? File open and close must have some overhead, but is that overhead large enough to be worth saving?
Based on the comments and discussion, I will explain why I need to do this in my work. It is also related to my other post:
how to convert large row-based tables into column-based tables efficently
I have a calculation that generates a stream of results. So far the results are saved in a row-storage table. This table has 1M columns, and each column could be 10M long. Each column is one attribute the calculation produces. As the calculation runs, I dump and append the intermediate results to the table. The intermediate results could be 2 or 3 double values per column. I want to dump them soon because they already consume >16M of memory, and the calculation needs more memory. This ends up as a table like the following:
aabbcc...zzaabbcc..zz.........aabb...zz
A row of data is stored together. The problem appears when I want to analyze the data column by column: I have to read 16 bytes, seek to the next row, read another 16 bytes, and keep going. There are too many seeks; it is much slower than if each column were stored contiguously so I could read it sequentially.
I can make the calculation dump less frequently. But to make the later reads more efficient, I may want to have 4K of data stored together, since I assume each fread fetches 4K by default even if I read only 16 bytes. But this means I would need to buffer 1M*4K = 4G in memory...
So I was wondering if I can merge the fragmented data into larger chunks, as that post suggests:
how to convert large row-based tables into column-based tables efficently
So I wanted to use files as offline buffers. I may need 256 files to get 4K of contiguous data after the merge, if each file contains 1M of 2 doubles. This work can be done asynchronously with respect to the main calculation, but I wanted to make sure the merge overhead is small enough that, running in parallel, it finishes before the main calculation is done. So I came up with this question.
I guess this is closely related to how column-based databases are constructed. When people build them, do they face similar issues? Is there any description of how the construction works?
You can use w+ as long as the maximum number of open files on your system allows it; this is usually 255 or 1024, and can be set (e.g. on Unix by ulimit).
But I'm not too sure this will be worth the effort.
On the other hand, 100 files of 10M each is one gigabyte; you might want to experiment with a RAM disk. Or with a large file system cache.
I suspect that far larger savings might be reaped by analyzing your specific problem structure. Why is it 100 files? Why 10 M? What kind of "merge" are you doing? Are those 100 files always accessed in the same order and with the same frequency? Could some data be kept in RAM and never be written at all?
Update
So, you have several large buffers like,
ABCDEFG...
ABCDEFG...
ABCDEFG...
and you want to pivot them so they read
AAA...
BBB...
CCC...
If you already know the total size (i.e., you know that you are going to write 10 GB of data), you can do this with two files, pre-allocating the output file and using fseek() to write to it. With memory-mapped files, this should be quite efficient. In practice, row Y, column X (of 1,000,000) has been dumped at offset 16*X in file Y.dat; you need to write it to offset 16*(Y*1,000,000 + X) in largeoutput.dat.
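As a rough sketch of that seek-and-scatter approach (not the poster's code: the file name, record size, and row count below are placeholders, and the destination offset is written as (x*num_rows + y)*16 so that each column ends up contiguous in the output), you could pre-allocate the output file and write each 16-byte record straight to its final position:
#include <stdio.h>

#define REC_SIZE 16            /* bytes per (row, column) entry          */
#define NUM_COLS 1000000L      /* 1,000,000 columns, as in the question  */
#define NUM_ROWS 100L          /* number of rows/dumps; placeholder      */

int main(void)
{
    char rec[REC_SIZE] = {0};
    FILE *out = fopen("largeoutput.dat", "wb");
    if (!out) { perror("fopen"); return 1; }

    /* Extend the output file to its final size up front.
       (Use fseeko()/64-bit offsets if the file exceeds 2 GB.) */
    fseek(out, NUM_ROWS * NUM_COLS * REC_SIZE - 1, SEEK_SET);
    fputc(0, out);

    for (long y = 0; y < NUM_ROWS; y++) {          /* as each row is produced...  */
        for (long x = 0; x < NUM_COLS; x++) {
            /* ... fill rec[] with the value for (row y, column x) ... */
            /* Seek to the record's final column-contiguous position and write it. */
            fseek(out, (x * NUM_ROWS + y) * REC_SIZE, SEEK_SET);
            fwrite(rec, 1, REC_SIZE, out);
        }
    }

    fclose(out);
    return 0;
}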
Actually, you could write the data even during the first calculation. Or you could have two processes communicating via a pipe, one calculating, one writing to both the row-column and column-row files, so that you can monitor the performance of each.
Frankly, I think that adding more RAM and/or a fast I/O layer (an SSD, maybe?) could get you more bang for the same buck. Your time has a cost too, and the memory will remain available after this one job is completed.
Yes. You can keep the 100 files open without doing the opening-closing-opening cycle. Most systems do have a limit on the number of open files though.
if I read after write w/o closing, do we always read all the written data
It depends on you. You can fseek to wherever you want in the file and read the data from there. It is entirely up to you and your logic.
would this save some overhead? File open and close must have some overhead, but is this overhead large enough to save?
This would definitely save some overhead, such as additional unnecessary I/O operations. Also, on some systems the content you write to a file is not immediately flushed to the physical file; it may be buffered and flushed periodically, or at the time of fclose.
So such overhead is saved, but the real question is: what do you achieve by saving it? How does it fit into the overall picture of your application? That is the call you must make before deciding on the logic.
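For what it's worth, here is a minimal sketch of the open-once, write-then-read cycle with "w+" (the file name and data are placeholders). Note that the C standard requires an fflush() or a file-positioning call (fseek, fsetpos, rewind) when switching from writing to reading on a stream opened for update, and a positioning call (or reading until EOF) when switching back:
#include <stdio.h>

int main(void)
{
    char out[] = "intermediate results";      /* placeholder data */
    char in[sizeof out];

    FILE *fp = fopen("buffer00.tmp", "w+");   /* placeholder name */
    if (!fp) { perror("fopen"); return 1; }

    /* Write pass. */
    fwrite(out, 1, sizeof out, fp);

    /* Switching to reading: a flush or seek is required here. */
    fseek(fp, 0L, SEEK_SET);
    fread(in, 1, sizeof in, fp);

    /* Switching back to writing: a seek is required
       (or the previous read must have hit end-of-file). */
    fseek(fp, 0L, SEEK_SET);
    fwrite(out, 1, sizeof out, fp);

    fclose(fp);
    return 0;
}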

How does one write files to disk, sequentially, in C?

I want to write a program that writes data as one contiguous block of data to disk, so that when I read that data back from the disk, I can just read one long series of bytes without stopping. Are there any references I can be directed to regarding this issue?
I am essentially asking whether it is possible to write the data of multiple files contiguously and then read past an EOF (or several) to retrieve the data written.
I am aware of fwrite and fopen, I just want to be sure that the data being written to disk is contiguous.
This is filesystem-dependent. You'll want to look at extents, which are contiguous areas of storage reserved for a file.
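On POSIX systems, one way to encourage the filesystem to reserve one large contiguous extent is to pre-allocate the file before writing. A minimal sketch, assuming posix_fallocate() is available (the file name and size are placeholders):
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const off_t size = 100L * 1024 * 1024;    /* reserve 100 MB up front (placeholder) */
    int fd = open("bigfile.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Ask the filesystem to reserve the space now; extent-based
       filesystems will usually try to satisfy this with as few
       contiguous extents as possible. */
    int err = posix_fallocate(fd, 0, size);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        return 1;
    }

    /* ... then write the data sequentially with write()/fwrite() ... */

    close(fd);
    return 0;
}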
On Windows you can open an unformatted volume with CreateFile and then WriteFile a contiguous block of data. It won't be a file, but you will be able to read it back as you stated.
According to this NTFS tries to allocate contiguous space if possible, your chances are lower when appending though.

What happens internally when a file is modified and saved?

What happens internally when a file is modified and saved? Will the OS allocate a new block of memory and copy the whole data, or are only the bytes after the modified part shifted?
Files are manipulated in blocks. A block on disk is like a byte in memory. You can only read and write in units of blocks. 512 bytes used to be the normal block size but 4096 is more common now.
The OS will read the entire block into memory; change whatever bytes; then write the entire block to the disk.
Clusters are units of file allocation. They are multiples of blocks. The disk hardware is generally unaware of clusters. Larger cluster sizes reduce the amount of system allocation overhead but are inefficient for large numbers of small files. You can read and write individual blocks within a cluster.
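To make the read-modify-write cycle concrete, here is a user-space sketch of the same thing the OS does internally: read the whole 4096-byte block containing the byte to change, modify it in memory, and write the whole block back. It assumes a POSIX system; the file name and offset are placeholders, and the file must already contain that block:
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(void)
{
    unsigned char block[BLOCK_SIZE];
    off_t target = 10000;                               /* byte to change (placeholder) */
    off_t block_start = (target / BLOCK_SIZE) * BLOCK_SIZE;

    int fd = open("data.bin", O_RDWR);                  /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    /* Read the whole block containing the target byte. */
    if (pread(fd, block, BLOCK_SIZE, block_start) != BLOCK_SIZE) { perror("pread"); return 1; }

    /* Change one byte in memory. */
    block[target - block_start] = 0x42;

    /* Write the whole block back. */
    if (pwrite(fd, block, BLOCK_SIZE, block_start) != BLOCK_SIZE) { perror("pwrite"); return 1; }

    close(fd);
    return 0;
}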
Different filesystems use different methods. For example, in NTFS, when you write a file that uses, say, six clusters, it will look like this in your filesystem:
123456
If you then add a new file using 1 cluster, it will look like this:
1234561
Now you remove that first file:
1
and you write a new file using 3 clusters:
123 1
and now you want to write a file with 7 clusters:
12312314567
For example, if you copy a file to another folder, it will be rewritten into new clusters in your filesystem, but if you cut (move) it, only the index is modified; that is why cutting files is so much faster than copying them.
So if you modify a file, partially or completely, in most cases it will be loaded into a buffer, and when you save your changes that buffer is written to the hard disk, replacing the affected clusters and writing any new ones. But that depends, because different software uses different methods.

Best logic to erase fewer bytes than sector size(minimum erasable size) in flash

I am using a Spansion flash memory of 16MB. The sector size is 256KB. I am using the flash to read/write/delete 30-byte blocks (structures). I have found in the datasheet of the IC that the minimum erasable size is 256KB. One way of deleting a particular block is to:
Read the sector containing the block to delete into a temporary array.
Erase that sector.
Delete the required block in the temporary array.
Write the temporary array back into flash.
I want to ask: is there any better alternative to this logic?
There is no way to erase less than the minimum erasable sector size in flash.
However, there is a typical way to handle invalidating small structures on a large flash sector. Simply add a header to indicate the state of the data in that structure location.
Simple example:
0xffff Structure is erased and available for use.
0xa5a5 Structure contains data that is valid.
0x0000 Structure contains data that is not valid.
The header will be 0xffff after erasing. When writing new data to a structure, set the header to 0xa5a5. When that data is no longer needed, set the header to 0x0000.
The data won't actually be erased, but it can be detected as invalid. This allows you to wait until the sector is full and then clean up the invalid records and perhaps compact the valid ones.
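A minimal sketch of that header scheme in C (the record layout, header values, and flash_write() driver routine are hypothetical, not from any particular Spansion API). Both state changes only clear bits from 1 to 0, so they can be written in place without erasing the sector:
#include <stdint.h>

#define HDR_ERASED  0xFFFFu   /* flash erased state: slot available       */
#define HDR_VALID   0xA5A5u   /* record written and valid                 */
#define HDR_DELETED 0x0000u   /* record logically deleted (all bits 0)    */

/* Hypothetical 32-byte record: 2-byte header + 30 bytes of payload. */
typedef struct {
    uint16_t header;
    uint8_t  payload[30];
} record_t;

/* flash_write() stands in for your flash driver's program routine;
   programming can only clear bits (1 -> 0), which is enough for both
   transitions 0xFFFF -> 0xA5A5 and 0xA5A5 -> 0x0000. */
extern int flash_write(uint32_t addr, const void *data, uint32_t len);

int record_mark_valid(uint32_t record_addr)
{
    uint16_t hdr = HDR_VALID;
    return flash_write(record_addr, &hdr, sizeof hdr);
}

int record_mark_deleted(uint32_t record_addr)
{
    uint16_t hdr = HDR_DELETED;
    return flash_write(record_addr, &hdr, sizeof hdr);
}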
Firstly, check the device datasheet again. Spansion devices will generally let you use a 64kB page size instead of 256kB. This may or may not help you, but increased granularity generally will.
Secondly, you cannot avoid the "erase before write" cycle where you want to change bits from 0 to 1. However, you can always change bits from 1 to 0 on a byte-by-byte basis.
You can either rethink your current 30-byte structure to see if this is of any use to you, or move to a 32-byte structure (which is a power of 2 and so slightly more sane IMO). Then, to delete, you simply set the first byte to 0x00 from the 0xFF that a normally erased byte is set to. That means you'll end up with empty slots.
Like a garbage collector, you can then re-organise by moving blocks off any pages that contain deleted blocks, so that you create empty pages (full of deleted blocks). Make sure you move the good blocks to a blank page before deleting them from their original page! You can then erase the empty page that was full of deleted or re-organised blocks.
When you're working with flash memory, you have to think out your read/erase/write strategy to fit the flash you have available. Definitely work it out before you start coding or locking down memory structures, because generally you'll need to reserve at least one byte as a validity byte, and usually you have to take advantage of the fact that you can always change bits from 1 to 0 in any byte at any time without an erase cycle.
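As a rough illustration of that re-organisation step, the sketch below copies the still-live 32-byte slots from a page containing deleted slots onto an already-erased page and then erases the source page. flash_read(), flash_write(), and flash_erase_sector() are hypothetical driver routines, the 64 kB page size is the one suggested above, and slot[0] is assumed to be the validity byte (0xFF = never written, 0x00 = deleted):
#include <stdint.h>

#define SLOT_SIZE        32u
#define SECTOR_SIZE      (64u * 1024u)              /* assuming 64 kB erase pages */
#define SLOTS_PER_SECTOR (SECTOR_SIZE / SLOT_SIZE)

extern int flash_read(uint32_t addr, void *buf, uint32_t len);
extern int flash_write(uint32_t addr, const void *buf, uint32_t len);
extern int flash_erase_sector(uint32_t sector_addr);

/* Move every live slot from src_sector to dst_sector (which must
   already be erased), then erase src_sector. */
void compact_sector(uint32_t src_sector, uint32_t dst_sector)
{
    uint8_t slot[SLOT_SIZE];
    uint32_t dst = dst_sector;

    for (uint32_t i = 0; i < SLOTS_PER_SECTOR; i++) {
        flash_read(src_sector + i * SLOT_SIZE, slot, SLOT_SIZE);

        if (slot[0] == 0x00)   /* validity byte says deleted: skip       */
            continue;
        if (slot[0] == 0xFF)   /* validity byte says never written: skip */
            continue;

        flash_write(dst, slot, SLOT_SIZE);   /* copy the good block first... */
        dst += SLOT_SIZE;
    }

    flash_erase_sector(src_sector);          /* ...then erase the source page */
}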

Reading file using fread in C

I lack formal knowledge in Operating systems and C. My questions are as follows.
When I try to read the first single byte of a file using fread in C, is the entire disk block containing that byte brought into memory, or just that byte?
If the entire block is brought into memory, what happens when I read the second byte, since the block containing that byte is already in memory?
Is there any significance to reading the file in units of the disk block size?
Where is the block that was read kept in memory?
Here are my answers:
More than one block; the default buffering is 64K. setvbuf can change that.
On the second read, there's no I/O. The data is read from the disk cache.
No. A file is usually smaller than its allocated disk space. You'll get an error reading past the end of the file even if you're still within the allocated disk space.
It's part of the FILE structure. This is implementation (compiler) specific so don't touch it.
The above caching is done by the C runtime library, not the OS. The OS may or may not have its own disk caching; that is a separate mechanism.
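For reference, a minimal sketch of controlling that runtime-library buffer with setvbuf (the file name is a placeholder, and the 64 KB size simply mirrors the figure quoted above; the actual default is implementation-defined):
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("data.bin", "rb");       /* placeholder file name */
    if (!fp) { perror("fopen"); return 1; }

    /* Give this stream a 64 KB fully buffered cache; setvbuf must be
       called before the first read or write on the stream. */
    static char buf[64 * 1024];
    if (setvbuf(fp, buf, _IOFBF, sizeof buf) != 0)
        fprintf(stderr, "setvbuf failed\n");

    unsigned char byte;
    /* Reading one byte fills the buffer; a subsequent fread of the
       second byte is served from that buffer with no further I/O. */
    if (fread(&byte, 1, 1, fp) == 1)
        printf("first byte: 0x%02x\n", byte);

    fclose(fp);
    return 0;
}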
