Writing float as bytearray produces newline in Python

Short version first, long version follows.
Short:
I have a 2D matrix of float32 values. I want to write it to a .txt file as bytearrays while keeping the structure, which means adding a newline character at the end of each row. But some numbers, like 683.61, include \n once converted to a bytearray, which produces an undesired newline and messes up reading the file back as lines. How can I avoid this?
Long:
I am writing a program to work with huge arrays of data (2D matrices). For that purpose, I need the array stored on disk rather than in RAM, as the data might be too big for the computer's memory. I created my own file type which is read by the program: a header holding important parameters as bytes, followed by the matrix as bytearrays.
As I write the data to the file one float32 at a time, I add a newline (\n) character at the end of each matrix row to keep the structure.
Writing goes well, but reading causes issues, as some numbers include \n once converted to bytes.
As an example:
struct.pack('f',683.61)
will yield
b'\n\xe7*D'
This cuts my matrix rows short, and sometimes even splits in the middle of a bytearray, making the bytearray size wrong.
From this question:
Python handling newline and tab characters when writing to file
I found out that a str can be encoded with 'unicode_escape' to double the backslashes and avoid confusion when reading:
Some_string.encode('unicode_escape')
However, this method only works on strings, not bytes or bytearrays (I tried it). This means I can't use it when I directly convert a float32 to bytes and write them to a file.
I have also tried converting the float to bytes, decoding them as a str, and re-encoding, like so:
struct.pack('f',683.61).decode('utf-8').encode('unicode_escape')
but the decode step fails, because these bytes are not valid UTF-8.
I have also tried converting the bytes to a string directly and then encoding, like so:
str(struct.pack('f',683.61)).encode('unicode_escape')
This yields a mess, from which it is still possible to recover the right bytes with:
bytes("b'\\n\\xe7*D'"[2:-1],'utf-8')
And finally, when I actually read the bytes back, I obtain two different results depending on whether unicode_escape has been used or not:
numpy.frombuffer(b'\n\xe7*D', dtype=numpy.float32)
yields: array([683.61], dtype=float32)
numpy.frombuffer(b'\\n\\xe7*D', dtype=numpy.float32)
yields: array([1.7883495e+34, 6.8086554e+02], dtype=float32)
I am expecting the top result, not the bottom one. So I am back to square one.
--> How can I encode my matrix of floats as bytearrays, on multiple lines, without being affected by newline characters inside the bytes?
F.Y.I. I decode the bytearrays with numpy as this is the working method I found, but it might not be the best way. I am just starting to play around with bytes.
Thank you for your help. If there is any issue with my question, please let me know and I will gladly rewrite it.

You either write your data as binary data, or you use newlines to keep it readable - mixing the two does not make sense.
When you record raw bytes to a file, with each float32 value as a raw 4-byte sequence, each of those bytes can of course have any value from 0-255 - and some of those values are control characters.
One alternative is to serialize to a format that encodes your byte values as characters in the printable ASCII range, like base64, JSON, or even pickle using protocol 0.
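For instance, a minimal sketch of the base64 route (the file name and row contents here are illustrative): each packed row becomes printable ASCII, so a '\n' between rows can no longer collide with the data.

import base64, struct
import numpy as np

rows = [[683.61, 1.0], [2.5, 3.5]]

with open('matrix.txt', 'wb') as f:
    for row in rows:
        packed = struct.pack(f'{len(row)}f', *row)   # raw float32 bytes
        f.write(base64.b64encode(packed) + b'\n')    # printable ASCII, newline is now safe

with open('matrix.txt', 'rb') as f:
    for line in f:
        print(np.frombuffer(base64.b64decode(line), dtype=np.float32))
        # [683.61   1.  ]  then  [2.5 3.5]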
Perhaps what will be most comfortable for you is simply to write your raw bytes to a binary file, and change the programs you use to interact with it - using a hex editor like "hexedit" or Midnight Commander. Both allow you to browse the bytes by their hexadecimal representation in a comfortable way, and will display any ASCII text sequences inside the file.
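And if you do keep the file binary, you do not need newline markers at all: with a known row width, every row occupies exactly 4 * ncols bytes, so rows can be located by plain offset arithmetic. A minimal sketch of that idea (file name and shape are again illustrative):

import numpy as np

matrix = np.array([[683.61, 1.0], [2.5, 3.5]], dtype=np.float32)
ncols = matrix.shape[1]

with open('matrix.bin', 'wb') as f:
    f.write(matrix.tobytes())        # raw bytes, no separators at all

row_size = 4 * ncols                 # a float32 is 4 bytes
with open('matrix.bin', 'rb') as f:
    f.seek(1 * row_size)             # jump straight to row 1
    row = np.frombuffer(f.read(row_size), dtype=np.float32)
print(row)                           # [2.5 3.5]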

For anyone with the same question I had - trying to keep a readline-style function working with bytes - the previous answer from jsbueno got me thinking about alternate ways to proceed rather than modifying the bytes.
Here is an alternative if, like me, you are making your own file with data as bytes: write your own readline() function based on the classic read() function, but with a customized "newline" marker. Here is what I worked out:
def readline(file, newline=b'Some_byte', size=None):
    # Read bytes until the custom newline marker is found, EOF is hit,
    # or (if size is given) size bytes have been collected.
    buffer = bytearray()
    while size is None or len(buffer) < size:
        byte = file.read(1)
        if not byte:                 # EOF: stop instead of looping forever
            break
        buffer += byte
        if buffer.endswith(newline):
            break
    return buffer
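Usage might look like this (the sentinel value is purely illustrative; note that since any 4-byte pattern can legally occur inside packed floats, a longer marker only makes an accidental match unlikely, not impossible):

import io, struct
import numpy as np

SENTINEL = b'\xffROW\xff'            # illustrative multi-byte row marker

buf = io.BytesIO()
for row in ([683.61, 1.0], [2.5, 3.5]):
    buf.write(struct.pack('2f', *row) + SENTINEL)

buf.seek(0)
line = readline(buf, newline=SENTINEL)
print(np.frombuffer(bytes(line[:-len(SENTINEL)]), dtype=np.float32))
# [683.61   1.  ]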

Related

Reading a file using pread

The aim of the problem is to use only pread to read a file containing integers.
I am trying to devise a generic solution where I can read integers of any length, but I think there must be a better approach than my current algorithm.
For the sake of explanation and to guide the algorithm, here is a sample input file. I have explicitly added \r\n to show that they exist in the file.
Input file:
23456\r\n
134\r\n
1\r\n
345678\r\n
Algorithm
1. Read a byte from the file
2. Check if it is a number, i.e. '0' <= byte <= '9'
3.1 if yes, increment the offset and read the next byte
3.2 if not, is it \r?
3.2.1 if yes, read the next byte; it should be \n.
Here the line is finished and we can use strtol to convert the string to an int.
3.2.2 // Error condition
I'm required to use this algorithm because I found out that pread reads the file as a plain byte stream and just puts the requested number of bytes into the provided buffer.
Question:
Is there a better way of reading integers from the file using pread() than parsing each byte to find the end of each line and then converting the string to an integer?
Is there a better way of reading integers from the file using pread() than parsing each byte to find the end of each line and then converting the string to an integer?
Yes: read big chunks of the file into memory and then do the parsing on the memory. Use a big buffer, sized depending on system memory. On a modern system where gigabytes of memory are available, you can go for a buffer in the megabyte range. I would probably start out with a 1 or 2 megabyte buffer and see how it performs.
This will be much more efficient than byte-by-byte reads.
Note: your code needs to handle the situation where a chunk of the file stops in the middle of an integer. That adds a little complexity, but it's not difficult to handle - see the sketch below.
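A sketch of that chunked approach (written in Python for consistency with the rest of this page - os.pread exposes the same POSIX call; names and chunk size are illustrative). The carry variable handles a number split across a chunk boundary:

import os

def read_integers(path, chunk_size=1 << 20):          # 1 MiB chunks
    fd = os.open(path, os.O_RDONLY)
    try:
        numbers, offset, carry = [], 0, b''
        while True:
            chunk = os.pread(fd, chunk_size, offset)
            if not chunk:                             # EOF
                break
            offset += len(chunk)
            lines = (carry + chunk).split(b'\r\n')
            carry = lines.pop()                       # possibly incomplete last number
            numbers.extend(int(line) for line in lines if line)
        if carry:                                     # file may lack a trailing \r\n
            numbers.append(int(carry))
        return numbers
    finally:
        os.close(fd)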
where I can read integers of any length
Well, if you actually mean integers greater than the largest integer type of your system, it's much more complicated. Standard functions like strtol can't be used. Further, you'll need to define your own way of storing these values. Alternatively, you can use a public library that can handle such values.

What to use in python3.x to fetch non-ascii values from console

I am trying to form a pipe where a program written in C gives its continuous output with write(). This stream of data shall be accepted by a Python script and processed further.
By using input() (Python 3.x) I was able to catch the data when the C source program was emitting it with printf(); to speed things up, I changed to write() to the console.
And this is where the problem starts: from then on, I am unable to fetch the data with the Python script, because input() doesn't accept non-ASCII values. I know write() gives binary data instead of strings.
Is there any other way for Python to accept data from outside, preferably from the console, where the data contains non-ASCII values (preferably binary input)?
I have tried modifying the C program to form an ASCII string: an array of chars plus a terminating zero (also: a zero-terminated string with CR, and with LF+CR). Still, Python seems to read past the terminating zero and then raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 130: invalid continuation byte, even though the sent string was 4 characters long (the C array [99, 0, 10, 13], i.e. 'c', a terminating zero, LF and CR).
So, trying to find another solution, question 2: is there a way to limit the length of input() before Python grabs an unknown portion of data from the console and throws errors?
For reading binary data, you should use the .read() method on a file object, and the object you want is sys.stdin.buffer.
Check the docs for the io module and for sys.stdin.
For your first question, the documentation for sys.stdin is helpful here. In particular, it contains this note:
Note: To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').
For your second question: the default input() function reads whole lines at a time.
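Putting both together, a minimal sketch: sys.stdin.buffer.read(n) bypasses the UTF-8 text layer entirely and also bounds how many bytes are consumed, which answers the length-limiting question as well.

import sys

# Read up to 4 raw bytes from stdin, with no UTF-8 decoding involved;
# e.g. the C side's [99, 0, 10, 13] arrives intact as b'c\x00\n\r'.
data = sys.stdin.buffer.read(4)
print(data)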

fread in c reads more than instructed

I'm trying to write a program that will add effects to a .wav file.
The program should:
Read a .wav file
Parse the header
Read the data
Manipulate the data
Create a new .wav file -
Write the header
Write the new data
I'm stuck on some weird behaviour of the fread() function:
when I try to read 4 bytes into the char array I've defined (of size 4 bytes), I get the word plus garbage.
If I read 2 or 3 bytes in the same manner, everything works fine.
I tried printing the content of the array in both cases (reading 2/3 bytes vs. reading 4 bytes) with a while loop until '\n' instead of printf("%s") - I got the same result (the right string in the first case, string plus garbage in the second case).
Also, when I write the header back and just COPY the data, the file that is created is NOT the same song!
It does open - so the header is fine - but the data is garbage.
I'd be very glad to hear some ideas about possible reasons for this. I'm really stuck on it, please help me guys!
The problem - screenshot of the output
fread is not intended to read strings; it reads binary data. This means that the data will not be null-terminated, nor have any other terminator.
fread returns the number of bytes read. Beyond that count, the buffer contents are not initialized and must be ignored.
If you want to treat the data as a string, you must null-terminate it yourself with arr[count] = 0. Make sure that arr has at least count+1 capacity in order to avoid a buffer overflow.
Perhaps reserve 5 bytes for your fmt_chunk_marker. That will let you represent a 4-character string as a null-terminated C string. The byte after the last character read should be set to the null character ('\0').

Writing structure into a file in C

I am reading and writing a structure to a text file, and the result is not readable. I need to write readable data into the file from the structure object.
Here is a little more detail of my code:
The code reads and writes a list of item names and codes to a file (file.txt). It uses a linked list to hold the data.
The data is stored in a structure object and then written to the file using fwrite.
The code works fine, but I need to write readable data into the text file.
Now file.txt looks like this:
㵅㡸䍏䥔䥆㘸䘠㵅㩃䠀\䵏㵈䑜㵅㡸䍏䥔䥆㘸䘠\㵅㩃䠀䵏㵈䑜㵅㡸䍏䥔䥆㘸䘠㵅㩃䠀䵏㵈\䑜㵅㡸䍏䥔䥆㘸䘠㵅㩃䠀䵏㵈䑜㵅㡸䍏䥔\䥆㘸䘠㵅㩃䠀䵏㵈
I am expecting the file to look like this:
pencil aaaa
Table bbbb
pen cccc
notebook nnnn
Here is the snippet:
struct Item
{
    char itemname[255];
    char dspidc[255];
    struct Item *ptrnext;
};

// Writing into the file
printf("\nEnter Itemname: ");
gets(ptrthis->itemname);
printf("\nEnter Code: ");
gets(ptrthis->dspidc);
fwrite(ptrthis, sizeof(*ptrthis), 1, fp);

// Reading from the file
while (fread(ptrthis, sizeof(*ptrthis), 1, fp) == 1)
{
    printf("\n%s %s", ptrthis->itemname, ptrthis->dspidc);
    ptrthis = ptrthis->ptrnext;
}
Writing the size of an array that is 255 bytes will write all 255 bytes to the file (regardless of what you have stuffed into that array). If you want only the 'textual' portion of the array, you need a facility that respects null terminators (i.e. printf, fprintf, ...).
Reading is then more complicated, as you need the idea of a sentinel value that marks the end of a string.
This says nothing of the fact that you are also writing the value of a pointer (initialized or not) that will have no context or validity on the next read. Pointers (i.e. memory locations) are only valid within the currently executing process; trying to use one process's memory address in another is definitely a bad idea.
The code works fine
not really:
a) you are dumping the raw contents of the struct to a file, including the pointer to another instance of Item. You cannot read a pointer back from disk and use it, as you do with ptrthis = ptrthis->ptrnext (it appears to work in the given snippet only because the snippet does nothing meaningful with it).
b) you are writing 2 * 255 bytes of potential garbage to the file. The strange-looking "blocks" in your file come from writing all 255 bytes of itemname and all 255 bytes of dspidc to disk, including the terminating \0 (which shows up as blocks, depending on your editor): the real "string" is something meaningful at the beginning of itemname or dspidc, followed by a \0, followed by whatever was in memory before.
The term you need to look up and read about is serialization; there are libraries out there that solve the task of dumping data structures to disk (or network, or anything else) and reading them back in, e.g. tpl.
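The core of the fix - serialize the fields, not the raw struct or its pointers - can be sketched like this (in Python for brevity, since the main question on this page is Python; in C, the same one-record-per-line layout is what fprintf("%s %s\n", ...) would produce):

items = [('pencil', 'aaaa'), ('Table', 'bbbb'), ('pen', 'cccc'), ('notebook', 'nnnn')]

# Write only the meaningful fields, one readable record per line.
with open('file.txt', 'w') as fp:
    for itemname, dspidc in items:
        fp.write(f'{itemname} {dspidc}\n')

# Read the records back by splitting each line.
with open('file.txt') as fp:
    for line in fp:
        itemname, dspidc = line.split()
        print(itemname, dspidc)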
First of all, I would serialize only the data, not the pointers.
Then, in my opinion, you have 2 choices:
write a parser for your syntax (with yacc, for instance)
use a data-dumping format such as RMI's serialization mechanism.
Sorry, I can't find online docs, but I know I have the grammar on paper.
Both of those solutions will be platform independent, big endian or little endian.

A C style string file format conundrum

I'm very confused by this wee little problem. I have a non-indexed file format header (more specifically, the ID3 header). This header stores a string, or rather three bytes, to confirm that the data is actually an ID3 tag (the string is TAG, btw). The point is that this TAG in the file format is not null-terminated. So there are two things that can be done:
Load the entire file with fread and, for the non-terminated string comparison, use strncmp. But:
This sounds hacky
What if someone opens it up and tries to manipulate the string without prior knowledge of this?
The other option is to load the file, but have the C struct not map exactly to the file format; instead, it includes proper null terminators, and each member is loaded with a separate call. But this too feels hacky and is tedious.
Help, especially from people who have practical experience dealing with such things, is appreciated.
The first thing to consider when parsing anything is: are the lengths of the fields either fixed in size or prefixed by counts that are themselves fixed in size (for example, nearly every graphics file has a fixed-size, fixed-structure header followed by a variable-sized sequence of pixels)? Or does the format have completely variable-length fields that are delimited somehow (for example, MPEG-4 frames are delimited by the bytes 0x00, 0x00, 0x01)? The answer to this question usually goes a long way toward telling you how to parse the format.
If the file format specification says a certain three bytes have the values corresponding to 'T', 'A', 'G' (84, 65, 71), then you should compare just those three bytes.
For this example, strncmp() is OK. In general, memcmp() is better because it doesn't have to worry about string termination: even if the byte stream (tag) you are comparing contains ASCII NUL '\0' characters, memcmp() will work.
You also need to recognize whether the file format you are working with is primarily printable data or primarily binary data. The techniques you use for printable data can differ from those used for binary data; the techniques used for binary data sometimes (but not always) translate to printable data. One big difference is that the lengths of values in binary data are known in advance, either because the length is embedded in the file or because the structure of the file is known. With printable data, you are often dealing with variable-length encodings with implicit boundaries between fields - and no length information ahead of them.
For example, the Unix password file format is a text encoding with variable-length fields; it uses a ':' to separate fields. You can't tell how long a field is until you come across the next ':' or the end of the line. This requires different handling from a binary format encoded using ASN.1 [1], where fields can have a type indicator value (usually a byte) and a length (1, 2 or 4 bytes, depending on type) before the actual data of the field.
[1] ASN.1 is (justifiably) regarded as very complex; I've given a very simple example of roughly how it is used that can be criticized on many levels. Nevertheless, the basic idea is valid - length (and with ASN.1, usually type too) precedes the (binary) data. This is also known as TLV - type, length, value - encoding.
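To make the TLV idea concrete, here is a toy parser (in Python, matching the rest of this page; the 1-byte type / 2-byte big-endian length layout is illustrative, not real ASN.1):

import struct

def read_tlv(buf):
    # Walk the buffer: 1-byte type, 2-byte big-endian length, then the value.
    records, pos = [], 0
    while pos < len(buf):
        t, length = struct.unpack_from('>BH', buf, pos)
        pos += 3
        records.append((t, buf[pos:pos + length]))
        pos += length
    return records

print(read_tlv(b'\x01\x00\x03abc\x02\x00\x01\xff'))
# [(1, b'abc'), (2, b'\xff')]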
If you are just learning, you can find the ID3v1 tag in an MP3 file by reading the last 128 bytes of the file and checking whether the first 3 characters of that block are TAG.
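In Python terms, that check is just a few lines (a sketch; the helper name is made up, and files shorter than 128 bytes are not handled):

def has_id3v1(path):
    # An ID3v1 tag, if present, is exactly the last 128 bytes of the file
    # and starts with the three bytes 'TAG'.
    with open(path, 'rb') as f:
        f.seek(-128, 2)              # 2 == os.SEEK_END
        block = f.read(128)
    return block[:3] == b'TAG'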
For a real application, use TagLib.
Keep three bytes and compare each byte with the characters 'T', 'A' and 'G'. This may not be very smart, but it gets the job done well and, more importantly, correctly.
And don't forget the genre byte, which has two different meanings in ID3v1 and ID3v1.1.