How to read a binary file entirely and quickly in Ada?

I would like to read the content of a binary file of several MB and store it into a buffer. Here's my function prototype (I can change it if needed):
procedure GET_BIN_CONTENT_FROM_PATH (PATH    : in UNBOUNDED_STRING;
                                     CONTENT : out UNBOUNDED_STRING);
Until now I've tried two methods, both using the Direct_IO package. In the first method, I was reading the file character by character; it worked, but it was awfully slow. In order to speed up the process, I tried to read the file MB by MB:
procedure GET_BIN_CONTENT_FROM_PATH (PATH    : in UNBOUNDED_STRING;
                                     CONTENT : out UNBOUNDED_STRING) is
   BIN_SIZE_LIMIT : constant NATURAL := 1000000;
   subtype FILE_STRING is STRING (1 .. BIN_SIZE_LIMIT);
   package FILE_STRING_IO is new ADA.DIRECT_IO (FILE_STRING);
   FILE   : FILE_STRING_IO.FILE_TYPE;
   BUFFER : FILE_STRING;
begin
   FILE_STRING_IO.OPEN (FILE, MODE => FILE_STRING_IO.IN_FILE,
                        NAME => TO_STRING (C_BASE_DIR & PATH));
   while not FILE_STRING_IO.END_OF_FILE (FILE) loop
      FILE_STRING_IO.READ (FILE, ITEM => BUFFER);
      APPEND (CONTENT, BUFFER);
   end loop;
   FILE_STRING_IO.CLOSE (FILE);
end GET_BIN_CONTENT_FROM_PATH;
Unfortunately, it seems that the READ operation fails when there is less than 1 MB remaining in the file. As a result, big files (>1 MB) get truncated, and small ones are not read at all. It's especially visible when working with images.
So, my question is: What's the correct method to read a binary file both quickly and entirely?
Thanks in advance.

Make the Bin_Size equal to Ada.Directories.Size (my_file), and read it in one go.
If it's too big for stack allocation (you'll get Storage_Error), allocate it with new and use the rename trick
my_image : bin_array renames my_image_ptr.all;
so that nothing else need know...
But if it's only a few MB, that won't be necessary.
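A minimal sketch of this approach, with illustrative names (nothing here is from the question):
with Ada.Directories;
with Ada.Direct_IO;

procedure Read_Whole_File (Name : String) is
   Bin_Size : constant Natural := Natural (Ada.Directories.Size (Name));
   subtype Bin_Array is String (1 .. Bin_Size);
   package Bin_IO is new Ada.Direct_IO (Bin_Array);
   File   : Bin_IO.File_Type;
   Buffer : Bin_Array;  --  stack-allocated; for huge files, use
                        --  "new Bin_Array" plus the rename trick above
begin
   Bin_IO.Open (File, Bin_IO.In_File, Name);
   Bin_IO.Read (File, Item => Buffer);  --  one READ grabs the whole file
   Bin_IO.Close (File);
   --  ... work with Buffer as an ordinary String ...
end Read_Whole_File;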

Ada.Streams.Stream_IO.Read reads into a Stream_Element_Array and tells you the last element read; if the array isn't filled (because you've reached the end of file), Last will be less than Item'Last.
A purist will note that Ada.Streams.Stream_Element'Size may not be the same as Character'Size, but for any normal processor chip it will be, so you can do unchecked conversion between the used part of the Stream_Element_Array and a String of the same size before appending to your Content.
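A hedged sketch of that approach (the procedure name and buffer size are my own choices, not from the answer):
with Ada.Streams;
with Ada.Streams.Stream_IO;
with Ada.Strings.Unbounded;
with Ada.Unchecked_Conversion;

procedure Get_Content_Via_Stream_IO
  (Name : String; Content : out Ada.Strings.Unbounded.Unbounded_String)
is
   use Ada.Streams;
   use Ada.Streams.Stream_IO;
   use Ada.Strings.Unbounded;

   File   : File_Type;
   Buffer : Stream_Element_Array (1 .. 1_000_000);
   Last   : Stream_Element_Offset;
begin
   Content := Null_Unbounded_String;
   Open (File, In_File, Name);
   while not End_Of_File (File) loop
      Read (File, Buffer, Last);  --  Last < Buffer'Last near end of file
      declare
         --  Unchecked conversion between the used part of the buffer
         --  and a String of the same size, as described above.
         subtype Used_Part is Stream_Element_Array (1 .. Last);
         subtype Used_Text is String (1 .. Natural (Last));
         function To_Text is
            new Ada.Unchecked_Conversion (Used_Part, Used_Text);
      begin
         Append (Content, To_Text (Buffer (1 .. Last)));
      end;
   end loop;
   Close (File);
end Get_Content_Via_Stream_IO;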

There are a number of "correct" ways, but here's one you might like. Especially for large files, an efficient way to read an entire file is to map it into memory with mmap.
Depending on your licensing needs, you may be open to a third-party, GPL'd solution. AdaCore's GNATColl library provides a nice interface to mmap. You can map the entire file and copy the contents:
declare
   File : Mapped_File;
   Str  : Str_Access;
begin
   File := Open_Read ("/tmp/file_on_disk");
   Read (File);  -- read the whole file
   Str := Data (File);
   for S in 1 .. Last (File) loop
      Put (Str (S));
   end loop;
   Close (File);
end;
If your system doesn't support the mmap call, the library falls back to a read(2) implementation.

As others have mentioned, Ada.Streams.Stream_IO.Read is the way to go. Here's an example I put together on my system. Assuming you have sufficient memory available for dynamic allocation, it can read files larger than the stack size.
I haven't dug into the internals of the Stream_IO.Read code, but I suspect the Stream_IO package uses a 4k block of memory (allocated from the heap) to buffer read operations.
with Ada.Directories; use Ada.Directories;
with Ada.Direct_IO;
with Ada.Unchecked_Deallocation;
with Ada.Streams.Stream_IO;

procedure Read_Input_File is

   type Byte is mod 2 ** 8;
   type Byte_Array is array (File_Size range <>) of Byte;
   type Byte_Array_Access is access Byte_Array;

   procedure Delete is new Ada.Unchecked_Deallocation
     (Byte_Array, Byte_Array_Access);

   function Read_Binary_File (Filename : String)
     return Byte_Array_Access
   is
      package SIO renames Ada.Streams.Stream_IO;

      Binary_File_Size : File_Size := Ada.Directories.Size (Filename);
      Binary_File_Data : Byte_Array_Access;
      S                : SIO.Stream_Access;
      File             : SIO.File_Type;
   begin
      --  Allocate memory from the heap
      Binary_File_Data := new Byte_Array (1 .. Binary_File_Size);

      SIO.Open (File, SIO.In_File, Filename);
      S := SIO.Stream (File);

      --  Read entire file into the buffer
      Byte_Array'Read (S, Binary_File_Data.all);
      SIO.Close (File);

      return Binary_File_Data;
   end Read_Binary_File;

   File_Data : Byte_Array_Access;

begin
   File_Data := Read_Binary_File ("File_Name.bin");

   --  Do something with data

   Delete (File_Data);
end Read_Input_File;

Related

How can I replace two characters in a 40GB file in Unix?

I have two huge json files (20GB each) and I need to join them. The files have the following content:
file_1.json = [{"key": "value"}, {...}]
file_2.json = [{"key": "value"}, {...}]
The main problem, however, is that I need all the dicts to be in the same list. I tried to do this in Python, but unfortunately I don't have enough memory for this operation.
So, I thought maybe I could tackle this with Unix commands, by replacing, in the first file, the final ] with , (note that there is a space after the comma) and erasing the [ from the second file. Then, I would join the two files with the cat Unix command.
Is there a way for me to edit only the last 10 characters in Unix?
I tried to use echo and tr but I might be doing something wrong with the syntax.
You can very easily append to a file in place, i.e. add characters at the end without rewriting the data that's already there. With the right tools (truncate if your system has it), you can truncate a file in place, i.e. remove characters at the end without rewriting the data that's staying. With the right tools (dd, if you're feeling adventurous), you can replace a part of a file by a string of the same length, without rewriting the unchanged parts. On the other hand, you can't remove characters from the beginning or middle of a file without rewriting the file (with a few exceptions that aren't relevant here).
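For example, an in-place two-byte overwrite with dd might look like this (the offset here is made up for illustration; conv=notrunc tells dd not to truncate the rest of the file):
printf ', ' | dd of=file_1.json bs=1 seek=1234 conv=notrunc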
But rewriting both files in place wouldn't actually help you that much here. You will at least need to rewrite the content of the second file to append it to the first.
If you don't need to keep the split files around, you can append the second file to the first file in place, after taking care of the middle punctuation. Remove the last ] character from the first file, as well as any following spaces and line breaks. Assuming that the first file ends in ] and a newline and you have GNU core utilities (e.g. non-embedded Linux):
truncate -s -2 file_1.json
Now you can add a comma and optionally a line break to the first file, and append the data from the second file without its first character.
echo , >>file_1.json
tail -c +2 file_2.json >>file_1.json
If you want to keep the original files unmodified, you can make a copy of the first file and truncate it. Or you can directly make a truncated copy of the first file (still assuming GNU coreutils):
head -c -2 file_1.json >concatenated.json
echo , >>concatenated.json
tail -c +2 file_2.json >>concatenated.json
If you're more comfortable with Python, you can do all of this in Python. Just don't read a whole file in one go, i.e. don't call read() or readlines(), which read everything at once. Instead, read and process a single line at a time (if the lines are short) or a single block of data. Untested code:
import io
import re

with open('concatenated.json', 'wb') as out:
    with open('file_1.json', 'rb') as inp:
        # Locate the final ']' by inspecting the last block of the file.
        buf = bytearray(1024)           # readinto needs a writable buffer
        tail_pos = inp.seek(-len(buf), io.SEEK_END)
        n = inp.readinto(buf)
        m = re.search(rb']\s*\Z', buf[:n])
        stop_at = tail_pos + m.start()  # absolute offset of the ']'
        # Copy everything up to (but not including) the ']'.
        inp.seek(0, io.SEEK_SET)
        remaining = stop_at
        while remaining > 0:
            n = inp.readinto(buf)
            out.write(buf[:min(n, remaining)])
            remaining -= n
    out.write(b',')
    with open('file_2.json', 'rb') as inp:
        # Copy the second file, replacing its leading '[' with a newline.
        buf = bytearray(1024)
        n = inp.readinto(buf)
        assert buf[:1] == b'['
        buf[0:1] = b'\n'
        while n > 0:
            out.write(buf[:n])
            n = inp.readinto(buf)

Best way to append text to file in dm-script

What is the best way to append line(s) to a file?
Currently I am using the following script:
/**
 * Append the `line` to the file given at the `path`.
 *
 * @param path
 *      The absolute or relative path to the file with
 *      extension
 * @param line
 *      The line to append
 * @param [max_lines=10000]
 *      The maximum number of lines to allow for a file
 *      to prevent an infinite loop
 */
void append(string path, string line, number max_lines){
    number f = OpenFileForReadingAndWriting(path);

    // go through file until the end is reached to set the
    // internal pointer to this position
    number line_counter = 0;
    string file_content = "";
    string file_line;
    while(ReadFileLine(f, file_line) && line_counter < max_lines){
        line_counter++;
        // file_content += file_line;
    }
    // result("file content: \n" + file_content + "{EOF}");

    // append the line
    WriteFile(f, line + "\n");
    CloseFile(f);
}

void append(string path, string line){
    append(path, line, 10000);
}

string path = "path/to/file.txt";
append(path, "Appended line");
To me it seems a little odd to read the whole file content just to append one line. If the file is very big, this is probably very slow.¹ So I guess there is a better solution than this. Does anyone know it?
Some background
My application is written in Python but executed in Digital Micrograph. My Python application logs its steps. Sometimes I execute dm-script from Python, and there I have no way to see what is going on. Since there is a bug, I need something to find out what is happening, so I want to add logging to dm-script too.
This also explains why I want to open and close the file every single time. It takes more time, but I don't care about execution speed while debugging. The logs will either be removed or switched off in the normal version, as usual. On the other hand, I am executing dm-script and Python alternately, so I have to prevent Python from blocking the file for dm-script and the other way around.
¹As written in the background, I am not really interested in speed, so the current script is enough for me. Still, I am interested in how to do this better, just for learning's and curiosity's sake.
The best way to deal with any files in DM-script (binary or text) is to use the streaming object. The following example should answer your question:
void writeText()
{
    string path
    if ( !SaveAsDialog( "Save text as" , path , path ) ) return
    number fileID = CreateFileForWriting( path )
    object fStream = NewStreamFromFileReference( fileID , 1 ) // 1 for auto-close file when out of scope

    // Write some text
    number encoding = 0 // 0 = system default
    fStream.StreamWriteAsText( encoding , "The quick brown dog jumps over the lazy fox" )

    // Replace last 'fox' by 'dog'
    fStream.StreamSetPos( 1 , -3 ) // 3 bytes before current position
    fStream.StreamWriteAsText( encoding, "dog" )

    // Replace first 'dog' by 'fox'
    fStream.StreamSetPos( 0 , 16 ) // 16 bytes after start
    fStream.StreamWriteAsText( encoding, "fox" )

    // Append at end
    fStream.StreamSetPos( 2 , 0 ) // end position (0 bytes from end)
    fStream.StreamWriteAsText( encoding, "." )
}
writeText()
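Applied to the question's append use case, this could look something like the following sketch (untested; it combines the question's OpenFileForReadingAndWriting with the stream calls above):
void appendLine(string path, string line)
{
    number fileID = OpenFileForReadingAndWriting(path)
    object fStream = NewStreamFromFileReference(fileID, 1) // 1 = auto-close when out of scope
    number encoding = 0                              // 0 = system default
    fStream.StreamSetPos(2, 0)                       // jump to end (0 bytes from end)
    fStream.StreamWriteAsText(encoding, line + "\n")
}
appendLine("path/to/file.txt", "Appended line")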

How to replace data in a file with other data from another file?

I'm trying to open this file (final.txt) and read the contents:
c0001
f260
L
D11
H30
R0000
C0040
1X1100000100010B300300003003
181100202900027Part No
181100202900097[PRTNUM]
1e5504002400030B
1X1100002300010L300003
191100202000030Quantity
191100202000080[QUANTY]
1e5504001500040B
1X1100001400010L300003
1X1100001400150L003090
191100202000170P.O.No
191100202000220[PONUMB]
1e5504001500180B
191100201200030Supplier
1e3304000700030B
1X1100000600010L300003
181100200300030Serial
181100200300090[SERIAL]
171100300900190Rev
171100300300190[REV]
171100300900240Units
171100300300240[UNITS]
1X1100000100180L003130
Q0001
E
from which I am reading only [PRTNUM], [QUANTY], [PONUMB], [SERIAL], [UNITS].
I've written the following C program:
char* cStart = strchr(cString, '[');
if (cStart)
{
    // open bracket found
    *cStart++ = '\0'; // split the string at [
    char* cEnd = strchr(cStart, ']');
    // you could check here for the close bracket being found
    // and throw an exception if not
    *cEnd = '\0'; // terminate the keyword
    printf("Key: %s, Value: %s\n", cString, cStart);
}
// continue the loop
but now I want to replace these placeholders with data from the 2nd file:
132424235
004342
L1000
DZ12
234235
234235
I want to replace [PRTNUM] (from the 1st file) with 132424235 and so on... In the end my file should be updated with all this data. Can you tell me what function I should use in the above program?
If you don't mind having an alternate approach, here's an algorithm that does the work in an elegant way:
1. Create one (large enough) temporary buffer. Also, create (open) one output file, which will be the modified version.
2. Read a line from the input file into the buffer using fgets().
3. Search for the particular "keyword" using strstr().
4. If a match is found:
4.1. Open the other input file.
4.2. Read the corresponding data (line), using fgets().
4.3. Replace the actual data in the temporary buffer with the newly read value.
4.4. Write the modified data to the output file.
5. If a match is not found, write the original data to the output file. Then, go to step 2.
6. Continue until fgets() returns NULL (which indicates the file content has been exhausted).
Finally, the output file will have the data from the first file with those particular "placeholders" substituted with the values read from the second file.
Obviously, you need to polish the algorithm a little bit to make it work with multiple "placeholder" strings; a minimal sketch is shown below.
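A minimal C sketch of that algorithm, assuming (as the question's data suggests) that the replacement values appear in the second file in the same order as the placeholders, and using the bracket delimiters from the question instead of strstr() per keyword; all file names are illustrative:
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *in  = fopen("final.txt", "r");   /* template with [PLACEHOLDER]s */
    FILE *val = fopen("data.txt", "r");    /* replacement values, in order */
    FILE *out = fopen("output.txt", "w");  /* the modified version         */
    char line[256], value[64];

    if (!in || !val || !out)
        return 1;

    while (fgets(line, sizeof line, in)) {
        char *start = strchr(line, '[');
        char *end   = start ? strchr(start, ']') : NULL;
        if (start && end && fgets(value, sizeof value, val)) {
            value[strcspn(value, "\r\n")] = '\0';         /* trim newline  */
            fwrite(line, 1, (size_t)(start - line), out); /* text before [ */
            fputs(value, out);                            /* the new value */
            fputs(end + 1, out);                          /* text after ]  */
        } else {
            fputs(line, out);  /* no placeholder: copy the line unchanged */
        }
    }

    fclose(in); fclose(val); fclose(out);
    return 0;
}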
Keep an extra string (name it copy) large enough to hold file 1 plus some extra, to accommodate replacing [PRTNUM] with 132424235.
Start reading the first string, which holds file 1, and keep copying into the second string (copy); as soon as you encounter [PRTNUM], instead of copying it, append 132424235, and so on for all the others.
Finally, replace file1.txt with this second (copy) string.

libjpeg: process all scanlines once

I use the jpeg library v8d from the Independent JPEG Group and I want to change the way jpeg decompression reads and processes data.
In the djpeg main(), only one scanline/row at a time is read and processed in each jpeg_read_scanlines() call. So, to read the entire image, this function is called until all lines are read and processed:
while (cinfo.output_scanline < cinfo.output_height) {
    num_scanlines = jpeg_read_scanlines(&cinfo, dest_mgr->buffer,
                                        dest_mgr->buffer_height); // read and process
    (*dest_mgr->put_pixel_rows) (&cinfo, dest_mgr, num_scanlines); // write to file
}
But I would like to read the entire image once and store it in the memory and then process the entire image from memory. By reading libjpeg.txt, I found out this is possible: "You can process an entire image in one call if you have it all in memory, but usually it's simplest to process one scanline at a time."
Even though I made some progress, I couldn't make it work completely. I can now read a couple of rows at once by increasing the pub.buffer_height value and the pub.buffer size, but no matter how large pub.buffer_height and pub.buffer are, only a couple of lines are read in each jpeg_read_scanlines() call. Any thoughts on this?
only a couple of lines are read in each jpeg_read_scanlines()
Yes, so you call it in a loop. Here's a loop that grabs one scanline at a time:
unsigned char *rowp[1], *pixdata = ...;
unsigned rowbytes = ..., height = ...;
while (cinfo.output_scanline < height) {
rowp[0] = pixdata + cinfo.output_scanline * rowbytes;
jpeg_read_scanlines(&cinfo, rowp, 1);
}
Once the loop exits, you have the entire image.
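The ... initializers are deliberately left open in the answer. One plausible way to fill them in (my assumption, based on the fields libjpeg exposes once decompression has started) is:
/* valid only after jpeg_start_decompress(&cinfo); needs <stdlib.h> for malloc */
unsigned rowbytes = cinfo.output_width * cinfo.output_components;
unsigned height   = cinfo.output_height;
unsigned char *pixdata = malloc((size_t)rowbytes * height);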

How to get the length of a file in MATLAB?

Is there any way to figure out the length of a .dat file (in terms of rows) without loading the file into the workspace?
Row Counter -- only loads one character per row:
Nrows = numel(textread('mydata.txt','%1c%*[^\n]'))
or file length in bytes (MATLAB):
datfileh = fopen(fullfile(path, filename));
fseek(datfileh, 0,'eof');
filelength = ftell(datfileh);
fclose(datfileh);
I'm assuming you are working with text files, since you mentioned finding the number of rows.
Here's one solution:
fid = fopen('your_file.dat','rt');
nLines = 0;
while (fgets(fid) ~= -1),
    nLines = nLines+1;
end
fclose(fid);
This uses FGETS to read each line, counting the number of lines it reads. Note that the data from the file is never saved to the workspace; it is simply used in the conditional check for the while loop.
It's also worth bearing in mind that you can use your operating system's built-in commands, so on Linux you could use the command
[s,w] = system('wc -l your_file.dat');
and then get the number of lines from the returned text (which is stored in w). (I don't think there's an equivalent command under Windows.)
