I'm trying to write a program that will add effects to a .wav file.
The program should:
Read a .wav file
Parse the header
Read the data
Manipulate the data
Create a new .wav file -
Write the header
Write the new data
I'm stuck on some weird behavior of the fread() function:
when I try to read 4 bytes into the char array I've defined (of size 4 bytes), I get the word plus garbage.
If I try to read 2 or 3 bytes in the same manner, everything works fine.
I tried printing the contents of the array in both cases (reading 2/3 bytes vs. reading 4 bytes) with a while loop until '\n' instead of printf("%s"), and got the same result (the right string in the first case, the string plus garbage in the second).
Also, when I write the header back and just COPY the data, the file that is created is NOT the same song!
It does open, so the header is fine, but the data is garbage.
I'd be very glad to hear some ideas about possible reasons for this; I'm really stuck on it.
The problem: screenshot of the output.
fread is not intended to read strings; it reads binary data. This means the data will not be null terminated, nor have any other termination.
fread returns the number of bytes read. Beyond that point, the data is not initialized and must be ignored.
If you want to treat the data as a string, you must null terminate it yourself with arr[count] = 0. Make sure that arr has at least count + 1 capacity in order to avoid a buffer overflow.
Perhaps reserve 5 bytes for your fmt_chunk_marker. That will let you represent a 4-character string as a null-terminated C string. The byte after the last character read should be set to the null character ('\0').
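For example, a minimal sketch of that fix (fmt_chunk_marker is the field name from the question; fp stands for the already-opened FILE pointer, and error handling is left out):

char   fmt_chunk_marker[5];                    /* 4 data bytes + room for the '\0' */
size_t count = fread(fmt_chunk_marker, 1, 4, fp);
fmt_chunk_marker[count] = '\0';                /* terminate before printing it as a string */
printf("%s\n", fmt_chunk_marker);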
I have a file of size 1 GB. I want to find out how many times the word "sosowhat" occurs in the file. I wrote code using fgetc(), which reads one character at a time from the file, and that is far too slow for a 1 GB file. So I made a buffer of size 1000 (using malloc) to hold 1000 characters of the file at a time, and I used the strstr() function to count the occurrences of the word "sosowhat". The logic is fine. But there is a problem: if the "so" part of "sosowhat" is located at the end of one buffer and the "sowhat" part at the start of the next one, the word will not be counted.
So I used two buffers, old_buffer and current_buffer. At the beginning of each new buffer I want to check against the last few characters of the old buffer. Is this possible? How can I go back to the old buffer? Is it possible without memmove()? As a beginner, I will be more than happy for your help.
Yes, it can be done. There are several possible approaches.
The first one, which is the cleanest, is to keep a second buffer, as suggested, of the length of the searched word, where you keep the last chunk of the old buffer. (It needs to be exactly the length of the searched word because you store wordLen - 1 characters plus a NULL terminator.) The quickest way is then to append the first wordLen - 1 characters of the new buffer to this stored chunk and search for your word there; after that, continue with your search normally. Of course, you can also create a single buffer that can hold both chunks (the last bytes from the old buffer and the first bytes from the new one). A sketch of this approach follows below.
Another approach (which I don't recommend, but which can turn out to be a bit easier in terms of code) would be to fseek wordLen - 1 bytes backwards in the file being read. This effectively "moves" the chunk stored in the previous approach into the next buffer. It is a bit dirtier, as you will read some of the contents of the file twice. Although that is not noticeable in terms of performance, I again recommend something like the first approach.
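For illustration, here is a rough sketch of the first approach in C; the function name, the 64 KB chunk size, and the seam buffer are all made up for the example, and error handling is minimal:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count occurrences of `word` in the stream, including matches that straddle
 * a chunk boundary.  The seam buffer holds the last wordLen-1 bytes of the
 * previous chunk followed by the first wordLen-1 bytes of the current one. */
static size_t count_in_text_file(FILE *fp, const char *word)
{
    const size_t wordLen = strlen(word);
    const size_t chunk   = 64 * 1024;          /* arbitrary read size */
    char  *buf   = malloc(chunk + 1);          /* +1 for the '\0' strstr needs */
    char  *seam  = malloc(2 * wordLen);        /* (wordLen-1) + (wordLen-1) + '\0' */
    size_t tail  = 0;                          /* bytes carried over from the last chunk */
    size_t count = 0;
    size_t got;

    if (wordLen == 0 || buf == NULL || seam == NULL) {
        free(buf);
        free(seam);
        return 0;
    }

    while ((got = fread(buf, 1, chunk, fp)) > 0) {
        buf[got] = '\0';

        /* 1. Stitch the old tail onto the start of the new chunk and search the
              seam.  A hit here always straddles the boundary (it cannot fit in
              wordLen-1 bytes alone), so it is never counted twice. */
        size_t head = got < wordLen - 1 ? got : wordLen - 1;
        memcpy(seam + tail, buf, head);
        seam[tail + head] = '\0';
        for (const char *p = seam; (p = strstr(p, word)) != NULL; p++)
            count++;

        /* 2. Search the chunk itself. */
        for (const char *p = buf; (p = strstr(p, word)) != NULL; p++)
            count++;

        /* 3. Remember the last wordLen-1 bytes for the next round. */
        tail = got < wordLen - 1 ? got : wordLen - 1;
        memcpy(seam, buf + got - tail, tail);
    }

    free(buf);
    free(seam);
    return count;
}

Note that this variant needs no memmove: the seam is rebuilt with two small memcpy calls per round, and strstr works because both buffers are kept null terminated.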
Use the same algorithm as with fgetc, only read from the buffers you created. It will be just as efficient, since strstr iterates through the string character by character as well.
Short version first, long version follows:
Short:
I have a 2D matrix of float32. I want to write it to a .txt file as bytearrays. I also want to keep the structure, which means adding a newline character at the end of each row. The problem is that some numbers, like 683.61, when converted to a bytearray contain \n, which produces an undesired newline character and messes up the reading of the file as lines. How can I do this?
Long:
I am writing a program to work with huge arrays of data (2D matrices). For that purpose, I need the array stored on disk rather than in RAM, as the data might be too big for the computer's RAM. I created my own type of file which is going to be read by the program. It has a header with important parameters as bytes, followed by the matrix as bytearrays.
As I write the data to the file one float32 at a time, I add a newline (\n) character at the end of each row of the matrix, so I keep the structure.
Writing goes well, but reading causes issues, as some numbers, once converted to a bytearray, contain \n.
As an example :
struct.pack('f',683.61)
will yield
b'\n\xe7*D'
This cuts my matrix rows short, and sometimes the cut falls in the middle of a bytearray, making the bytearray sizes wrong.
From this question :
Python handling newline and tab characters when writing to file
I found out that a str can be encoded with 'unicode_escape' to double the backslash and avoid confusion when reading.
Some_string.encode('unicode_escape')
However, this method only works on strings, not on bytes or bytearrays (I tried it). This means I can't use it when I directly convert a float32 to a bytearray and write it to a file.
I have also tried converting the float to a bytearray, decoding the bytearray as a str, and re-encoding it like so:
struct.pack('f',683.61).decode('utf-8').encode('unicode_escape')
but decode fails here: these bytes are not valid UTF-8, so it raises a UnicodeDecodeError.
I have also tried converting the bytearray to a string directly and then encoding it, like so:
str(struct.pack('f',683.61)).encode('unicode_escape')
This yields a mess from which it is possible to get the right bytes with this :
bytes("b'\\n\\xe7*D'"[2:-1],'utf-8')
And finally, when I actually read the byte array, I obtain two different results depending on whether unicode_escape has been used or not:
numpy.frombuffer(b'\n\xe7*D', dtype=numpy.float32)
yields: array([683.61], dtype=float32)
numpy.frombuffer(b'\\n\\xe7*D', dtype=numpy.float32)
yields: array([1.7883495e+34, 6.8086554e+02], dtype=float32)
I am expecting the top result, not the bottom one. So I am back to square one.
--> How can I encode my matrix of floats as bytearrays, on multiple lines, without being affected by newline characters inside the bytearrays?
F.Y.I. I decode the bytearray with numpy, as this is the working method I found, but it might not be the best way. I am just starting to play around with bytes.
Thank you for your help. If there is any issue with my question, please let me know; I will gladly rewrite it if something is wrong.
You either write your data as binary data, or you use newlines to keep it human-readable; mixing the two does not really make sense.
When you record "bytes" to a file and store float32 values raw as a 4-byte sequence, each of those bytes can, of course, have any value from 0 to 255, and some of those values will be control characters.
The alternative is to serialize to a format that encodes your byte values as characters in the printable ASCII range, like Base64, JSON, or even pickle using protocol 0.
Perhaps what will be most comfortable for you is to just write your raw bytes to a binary file and change the tools you use to interact with it, using a hex editor like "hexedit" or Midnight Commander. Both will let you browse the bytes by their hexadecimal representation in a comfortable way, and will display any ASCII text sequences inside the file.
For anyone with the same question as I had, trying to keep a readline-style function working with bytes, the previous answer from @jsbueno got me thinking of alternative ways to proceed rather than modifying the bytes.
Here is an alternative if, like me, you are making your own file format with data stored as bytes: write your own readline() function based on the classic read() function, but with a customized "newline" byte sequence. Here is what I worked out:
def readline(file, newline=b'Some_byte', size=None):
    """Read bytes from `file` until `newline` is found (or `size` bytes are read)."""
    buffer = bytearray()
    if size is None:
        while True:
            chunk = file.read(1)
            if not chunk:                  # end of file reached
                break
            buffer += chunk
            if buffer.endswith(newline):
                break
    else:
        while len(buffer) < size:
            chunk = file.read(1)
            if not chunk:                  # end of file reached
                break
            buffer += chunk
            if buffer.endswith(newline):
                break
    return buffer
How would I best output the CFString in the following code?
#include <CoreFoundation/CoreFoundation.h> // Needed for CFSTR

int main(int argc, char *argv[])
{
    char *c_string = "Hello I am a C String. :-).";
    CFStringRef cf_string = CFStringCreateWithCString(0, c_string, kCFStringEncodingUTF8);

    // output cf_string
}
There's no API to write a CFString directly to any file (including stdout or stderr), because you can only write bytes to a file. Characters are a (somewhat) more ideal concept; they're too high-level to be written to a file. It's like saying “I want to write these pixels”; you must first decide what format to write them in (say, PNG), and then encode them in that format, and then write that data.
So, too, with characters. You must encode them as bytes in some format, then write those bytes.
Encoding the characters as bytes/data
First, you must pick an encoding. For display on a Terminal, you probably want UTF-8, which is kCFStringEncodingUTF8. For writing to a file… you usually want UTF-8. In fact, unless you specifically need something else, you almost always want UTF-8.
Next, you must encode the characters as bytes. Creating a C string is one way; another is to create a CFData object; still another is to extract bytes (not null-terminated) directly.
To create a C string, use the CFStringGetCString function.
To extract bytes, use the CFStringGetBytes function.
You said you want to stick to CF, so we'll skip the C string option (which is less efficient anyway, since whatever calls write is going to have to call strlen)—it's easier, but slower, particularly when you use it on large strings and/or frequently. Instead, we'll create CFData.
Fortunately, CFString provides an API to create a CFData object from the CFString's contents. Unfortunately, this only works for creating an external representation. You probably do not want to write this to stdout; it's only appropriate for writing out as the entire contents of a regular file.
So, we need to drop down a level and get bytes ourselves. This function takes a buffer (region of memory) and the size of that buffer in bytes.
Do not use CFStringGetLength for the size of the buffer. That counts characters, not bytes, and the relationship between number of characters and number of bytes is not always linear. (For example, some characters can be encoded in UTF-8 in a single byte… but not all. Not nearly all. And for the others, the number of bytes required varies.)
The correct way is to call CFStringGetBytes twice: once with no buffer (NULL), whereupon it will simply tell you how many bytes it'll give you (without trying to write into the buffer you haven't given it); then, you create a buffer of that size, and then call it again with the buffer.
You could create a buffer using malloc, but you want to stick to CF stuff, so we'll do it this way instead: create a CFMutableData object whose capacity is the number of bytes you got from your first CFStringGetBytes call, increase its length to that same number of bytes, then get the data's mutable byte pointer. That pointer is the pointer to the buffer you need to write into; it's the pointer you pass to the second call to CFStringGetBytes.
To recap the steps so far (a rough code sketch follows the list):
Call CFStringGetBytes with no buffer to find out how big the buffer needs to be.
Create a CFMutableData object of that capacity and increase its length up to that size.
Get the CFMutableData object's mutable byte pointer, which is your buffer, and call CFStringGetBytes again, this time with the buffer, to encode the characters into bytes in the data object.
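Loosely, those three steps might look like this in code; cf_string is the string from the question, and error handling is omitted, so treat it as a sketch rather than a finished implementation:

CFRange whole    = CFRangeMake(0, CFStringGetLength(cf_string));
CFIndex numBytes = 0;

// 1. First pass: no buffer, just ask how many bytes the UTF-8 form needs.
CFStringGetBytes(cf_string, whole, kCFStringEncodingUTF8, 0, false,
                 NULL, 0, &numBytes);

// 2. Create a data object with that capacity and stretch its length to match.
CFMutableDataRef data = CFDataCreateMutable(kCFAllocatorDefault, numBytes);
CFDataSetLength(data, numBytes);

// 3. Second pass: encode straight into the data object's byte buffer.
CFStringGetBytes(cf_string, whole, kCFStringEncodingUTF8, 0, false,
                 CFDataGetMutableBytePtr(data), numBytes, &numBytes);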
Writing it out
To write bytes/data to a file in pure CF, you must use CFWriteStream.
Sadly, there's no CF equivalent to nice Cocoa APIs like [NSFileHandle fileHandleWithStandardOutput]. The only way to create a write stream to stdout is to create it using the path to stdout, wrapped in a URL.
You can create a URL easily enough from a path; the path to the standard output device is /dev/stdout, so creating the URL looks like this:
CFURLRef stdoutURL = CFURLCreateWithFileSystemPath(kCFAllocatorDefault, CFSTR("/dev/stdout"), kCFURLPOSIXPathStyle, /*isDirectory*/ false);
(Of course, like everything you Create, you need to Release that.)
Having a URL, you can then create a write stream for the file so referenced. Then, you must open the stream, whereupon you can write the data to it (you will need to get the data's byte pointer and its length), and finally close the stream.
Note that you may have missing/un-displayed text if what you're writing out doesn't end with a newline. NSLog adds a newline for you when it writes to stderr on your behalf; when you write to stderr yourself, you have to do it (or live with the consequences).
So, in outline (a code sketch follows the list):
Create a URL that refers to the file you want to write to.
Create a stream that can write to that file.
Open the stream.
Write bytes to the stream. (You can do this as many times as you want, or do it asynchronously.)
When you're all done, close the stream.
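Put together, the writing part might look roughly like this, assuming data is the CFMutableData built in the encoding sketch above (again, without error handling):

CFURLRef stdoutURL = CFURLCreateWithFileSystemPath(kCFAllocatorDefault,
    CFSTR("/dev/stdout"), kCFURLPOSIXPathStyle, /*isDirectory*/ false);
CFWriteStreamRef stream = CFWriteStreamCreateWithFile(kCFAllocatorDefault, stdoutURL);

if (CFWriteStreamOpen(stream)) {
    // A robust version would loop on the return value, since CFWriteStreamWrite
    // may write fewer bytes than requested.
    CFWriteStreamWrite(stream, CFDataGetBytePtr(data), CFDataGetLength(data));
    CFWriteStreamClose(stream);
}

CFRelease(stream);
CFRelease(stdoutURL);
CFRelease(data);

If you want the trailing newline mentioned above, append it to the data (or write it separately) before closing the stream.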
I need to use the Windows API calls from windows.h to read a file whose name I get from the command line. I can read the whole file into a buffer using ReadFile() and then cut the buffer at the first \0, but how can I read only one line? I also need to read the last line of the file. Is this possible without reading the whole file into a buffer? The file may be 4 GB or more, so I won't be able to read it all. Does anyone know how to read it line by line?
If you have an idea of how long the lines are, then you are in business: make a buffer that is a bit larger than the longest line.
ReadFile reads a number of bytes; cut the buffer at the first end-of-line character (\n).
For the last line, use SetFilePointerEx to position near the end of the file, then move back a buffer's worth of bytes, look for the last end of line, and read the rest of the line from there.
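A rough sketch of that idea; the function name, the 4 KB buffer, and the missing error checks are placeholders, and it assumes the last line fits in one buffer:

#include <windows.h>
#include <string.h>

/* Read the tail of the file and pull out the last line. */
static DWORD ReadLastLine(HANDLE hFile, char *line, DWORD lineSize)
{
    char          buf[4096];
    LARGE_INTEGER size, pos;
    DWORD         got = 0;

    GetFileSizeEx(hFile, &size);
    pos.QuadPart = size.QuadPart > (LONGLONG)sizeof buf
                 ? size.QuadPart - (LONGLONG)sizeof buf : 0;
    SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);   /* seek near the end */
    ReadFile(hFile, buf, sizeof buf, &got, NULL);

    DWORD end = got;
    while (end > 0 && (buf[end - 1] == '\n' || buf[end - 1] == '\r'))
        end--;                                        /* skip a trailing CR/LF */
    DWORD start = end;
    while (start > 0 && buf[start - 1] != '\n')
        start--;                                      /* back up to the previous line break */

    DWORD len = end - start;
    if (len >= lineSize)                              /* don't overflow the caller's buffer */
        len = lineSize - 1;
    memcpy(line, buf + start, len);
    line[len] = '\0';
    return len;
}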
Don't "cut the buffer at the first \0", ReadFile doesn't return a zero-terminated string. It reads raw bytes. You have to pay attention to the value returned through the lpNumberOfBytesRead argument. It will be equal to the nNumberOfBytesToRead value you pass unless you've reached the end of the file.
Now you know how many valid bytes are in the buffer. Search them for the first '\r' or '\n' byte to find the line terminator. Copy that range of bytes to a string buffer supplied by the caller and return. The next time you read a line, start where you left off previously, just past the line terminator. When you don't find a line terminator, you have to keep the bytes already in the buffer and call ReadFile() again to read more. That makes the code a bit tricky to get right, but it is an excellent exercise.
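As a starting point, here is a simplified sketch that reads just the first line; a real line reader would keep the bytes left after the terminator and hand them out on the next call instead of re-reading (names and the buffer size are made up):

#include <windows.h>
#include <string.h>

static BOOL ReadFirstLine(HANDLE hFile, char *line, DWORD lineSize)
{
    char  buf[4096];
    DWORD got = 0;

    if (!ReadFile(hFile, buf, sizeof buf, &got, NULL) || got == 0)
        return FALSE;                       /* error or empty file */

    DWORD len = 0;
    while (len < got && buf[len] != '\n')   /* find the line terminator */
        len++;
    if (len > 0 && buf[len - 1] == '\r')    /* drop a trailing '\r' */
        len--;
    /* if no '\n' was found, the line continues: call ReadFile again for more */

    if (len >= lineSize)                    /* don't overflow the caller's buffer */
        len = lineSize - 1;
    memcpy(line, buf, len);
    line[len] = '\0';
    return TRUE;
}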
ReadFile is a particularly poor choice for what you want to do. Are you allowed to use fgets? That would be much easier to use in your case.
Hi
My program reads a CSV file.
So I used fgets to read one line at a time.
But now the interface specification says that NUL characters may appear in a few of the columns.
So I need to replace fgets with another function to read from the file.
Any suggestions?
If your text stream can contain a NUL (ASCII 0) character, you will need to treat your file as a binary file and use fread to read it. There are a few approaches to this.
Read the entire file into memory (see the sketch after this list). The length of the file can be obtained by calling fseek(fp, 0, SEEK_END) and then ftell. You can then allocate enough memory for the whole file. Once it is in memory, parsing the file should be relatively easy. This approach is only really suitable for smallish files (probably less than 50 MB). For bonus marks, look at the mmap function.
Read the file byte by byte and add the characters to a buffer until a newline is found.
Read and parse bit by bit. Create a buffer that is bigger than your largest line and fill it with content from your file. Then parse and extract as many lines as you can, move the remainder to the beginning of the buffer, and read the next bit. Using a bigger buffer helps minimize the copying.
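A minimal sketch of the first approach; the function name and the error handling convention are illustrative only:

#include <stdio.h>
#include <stdlib.h>

/* Slurp the whole file into one malloc'd buffer (binary mode, so embedded NUL
 * bytes are preserved).  Returns the buffer and stores its size in *len, or
 * returns NULL on failure; the caller frees the buffer. */
static char *read_whole_file(const char *path, long *len)
{
    FILE *fp = fopen(path, "rb");
    char *buf = NULL;

    if (fp == NULL)
        return NULL;
    if (fseek(fp, 0, SEEK_END) == 0) {
        *len = ftell(fp);                  /* file length in bytes */
        rewind(fp);
        buf = malloc(*len);
        if (buf != NULL && fread(buf, 1, *len, fp) != (size_t)*len) {
            free(buf);                     /* short read: treat it as failure */
            buf = NULL;
        }
    }
    fclose(fp);
    return buf;
}

When parsing the buffer, use memchr to locate the line and field separators rather than the str* functions, since embedded NUL bytes would cut the str* family short.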
fgets works perfectly well with embedded NUL bytes. Pre-fill your buffer with \n (using memset) and then use memchr(buf, '\n', sizeof buf). If memchr returns NULL, your buffer was too small and you need to enlarge it to read the rest of the line. Otherwise, you can tell whether the newline you found is the real end of the line or just the padding you pre-filled the buffer with by inspecting the next byte. If the newline you found is at the end of the buffer, or has another newline just after it, it came from the padding, and the previous byte is the null terminator inserted by fgets (not a NUL from the file). Otherwise, the newline you found has a null byte after it (the terminator inserted by fgets), and it is the end-of-line newline.
Other approaches will be slow (repeated fgetc) or waste (and risk running out of) resources (loading the whole file into memory).
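If it helps, here is one way the described trick could be coded; the function name and the return-value convention are made up, and it assumes the buffer is at least a few bytes long:

#include <stdio.h>
#include <string.h>

/* Read one line that may contain embedded NUL bytes.  Returns the number of
 * data bytes placed in buf, -1 at end of file, or -2 if buf was too small to
 * hold the whole line. */
static long read_line_with_nuls(FILE *fp, char *buf, size_t bufSize)
{
    memset(buf, '\n', bufSize);                 /* pre-fill with padding newlines */
    if (fgets(buf, (int)bufSize, fp) == NULL)
        return -1;                              /* end of file (or read error) */

    char *nl = memchr(buf, '\n', bufSize);
    if (nl == NULL)
        return -2;                              /* no newline at all: enlarge buf */

    size_t i = (size_t)(nl - buf);
    if (i + 1 == bufSize || buf[i + 1] == '\n') /* padding: fgets's '\0' sits at i-1 */
        return (long)(i - 1);                   /* data is buf[0 .. i-2] */
    return (long)(i + 1);                       /* real newline: data is buf[0 .. i], incl. '\n' */
}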
Use fread and then scan the block for the separator.
Check the function int T_fread(FILE *input) at http://www.mrx.net/c/source.html