I have a PNG file that is to be stored in a database. However, even when I pass a length to sqlite3_bind_blob(), it stops filling in the value at the first NUL character.
Here's the code in question:
fseek(file,0xC,SEEK_CUR); // Skip to 12 (0xC) and read everything (It's a raw png file)
char content[size-0xC];
fread(content,1,size-0xC,file);
sqlite3_bind_int(inserticonstmt,1,id);
sqlite3_bind_blob(inserticonstmt,2,content,size-0xC,SQLITE_STATIC);
sqlite3_step(inserticonstmt);
sqlite3_clear_bindings(inserticonstmt);
sqlite3_reset(inserticonstmt);
Any ideas?
Edit: It looks like the database is in fact storing the whole blob; it is just not returned in full by the command-line interface.
The sqlite CLI has a bug where it parses blobs as strings and stops printing them early, even when it's told to send output to a file.
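The truncation described here is what happens whenever binary data is handled as a C string. The following is a minimal sketch of that effect, not code from the sqlite3 shell; visible_as_string and blob_to_hex are hypothetical helper names made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: how many bytes a "%s"-style print would emit.
   Like any C string function, it stops at the first NUL byte. */
static size_t visible_as_string(const unsigned char *blob, size_t len)
{
    const unsigned char *nul = memchr(blob, 0, len);
    return nul ? (size_t)(nul - blob) : len;
}

/* Hypothetical helper: render the blob as hex using its explicit length,
   so embedded NUL bytes are kept. out must hold 2 * len + 1 bytes. */
static char *blob_to_hex(const unsigned char *blob, size_t len, char *out)
{
    static const char digits[] = "0123456789abcdef";
    assert(blob != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[blob[i] >> 4];
        out[2 * i + 1] = digits[blob[i] & 0x0f];
    }
    out[2 * len] = '\0';
    return out;
}
```

In the sqlite3 shell itself, selecting hex(column) or length(column) instead of the raw column is one way to check how many bytes were really stored.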
Related
I am using Node-RED to read a serial port. The incoming data uses all 8 bits of each byte. Reading works fine. Now I want to write that buffer to a file.
When I use buffer.toString() it generates a string, but with formatting. Even the 'binary' option escapes the data when the 8th bit is set.
How do I write the raw data to a file without it being changed?
Mission:
I want to generate a bitmap file by appending a fixed-length image to the header.
Comparing the new file with an existing bitmap in a hex editor shows the escaped bytes in the new file. Obviously that is wrong.
I need some help.
I'm writing a program that opens 2 source files in UTF-8 encoding without a BOM. The first contains English text and some other information, including an ID. The second contains only the string ID and the translation. The program changes every string from the first file by replacing the English text with the Russian translation from the second one, and writes these strings to an output file. Everything seems to be OK, except that a BOM appears in the destination file. I want to create the file without a BOM, like the source.
I open the files with the fopen function in text mode with ccs=UTF-8,
read each string with the fgetws function into a wchar_t buffer,
and write it with the fputws function to the output file.
Don't use text mode, don't use the MS ccs= extension to fopen, and don't use fputws. Instead use fopen in binary mode and write the correct UTF-8 yourself.
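"Write the correct UTF-8 yourself" could look roughly like the sketch below. It assumes the text is already available as Unicode code points; utf8_encode and put_utf8 are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stdio.h>

/* Encode one Unicode code point as UTF-8 into out (at least 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0; /* surrogates are not valid code points */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Write one code point to a stream opened with "wb". Only the encoded
   bytes are written, so no BOM can appear. Returns 1 on success. */
static int put_utf8(unsigned long cp, FILE *fp)
{
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    return n > 0 && fwrite(buf, 1, n, fp) == n;
}
```

If the input lines are already valid UTF-8 and only whole substrings are swapped, an even simpler route is to read and write the bytes unchanged with fread and fwrite and skip the wide-character conversion entirely.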
I want to write a program in C (only C, not C++ or Java) that will read doc, docx and pdf files, and I want to make it available on GitHub for everyone who needs that code. I started with the .doc format. I found that if I open a .doc file in plain Notepad, it shows all the text, just with some extra content that can easily be trimmed. So I wrote a simple C program to read a .doc file in both 'r' and 'rb' mode, but both times it gives me only 5-9 characters of the file, and those are not readable either. I don't know why this is happening. Any comment or discussion will be very helpful to me.
Here is the link for github Source code. Please help me to complete all three formats.
To answer your specific question, the reason your little application stops reading is because it mistakenly thinks there is an EOF character in your file.
Look at your code:
char ch;
int nol=0, not=0, nob=0, noc=0;
FILE *fp;
fp = fopen("file.doc","rb");
while(1)
{
ch = fgetc(fp);
if(ch==EOF)
{
break;
}
You store the result of fgetc(fp) in a variable of type char, which is a single-byte variable. However, the result of fgetc is very purposefully "int", not "char".
fgetc always returns a non-negative result in the range 0 to 255, except when you reach the end of the file, in which case it returns EOF, which is a negative value and is often implemented as -1.
If you read a byte of value 255 and store it in an int, everything is OK: it's stored as the value 255 and your loop can continue. If you store the result in a char and char is signed on your platform, 255 turns into -1, which compares equal to EOF. And your loop stops.
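A minimal sketch of the corrected loop, with the fgetc result held in an int. The function name count_bytes is made up for this illustration and only counts the bytes it reads.

```c
#include <assert.h>
#include <stdio.h>

/* Count the bytes in a stream. ch is an int, so a data byte of 0xFF (255)
   and EOF (a negative value) stay distinct. */
static long count_bytes(FILE *fp)
{
    long n = 0;
    int ch;
    assert(fp != NULL);
    while ((ch = fgetc(fp)) != EOF)
        n++;
    return n;
}
```

Called on a file opened with "rb", this keeps going past 0xFF and NUL bytes and only stops at the real end of the file.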
Don't expect to get anywhere with this idea. .doc is a huge binary file format that is inhumanly complicated to parse. With that said, Cubia mentioned the offset where the text section of the document starts. I'm not familiar with the details of the format, but if the raw text is contained in one location, use fseek to get at it and stop when you reach the end. This won't be the case for the other formats because they are very different.
.docx and .pdf should be easier to parse because they are more modern formats. If you want to read anything from a docx you need to read from a zip file with a ton of xml in it and use a parser to figure out which text you want.
.pdf should be the easiest of the three because you might be able to find a library out there that can almost do what you want.
As for why you are getting strange output from your program, remember that .doc is a binary format and the vast majority of the data is garbage from your perspective. Dumping it to the terminal will yield readable text but also a bunch of control characters that should screw with your terminal.
As a last note - don't try to read docx files directly using fread - they are compressed, so you won't recover the text unaltered. Take a look at libarchive. Also - expect to have to read the format specifications. docx is Microsoft's Office Open XML format (a zip archive of XML files), which is distinct from OpenOffice's OpenDocument format. See this and some PDF specification documents (there are multiple versions).
Look at the .doc file type as a txt file but with extra non-printable characters before, in the middle of, and after your content. These non-printable characters are used for defining special formatting, metadata and other information.
With this said, all .doc files follow a certain structure.
If you open two different .doc files in a hex editor, you will notice that the text content of both files starts at an offset of 0xA00 (2560 bytes) from the beginning of the file. This means that when you open your file, you can skip the first 2560 bytes (take a look at the fseek() function).
From this point on, you can read the contents of your file until you reach '\0'.
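A rough sketch of that approach follows. The 0xA00 offset is taken from this answer and is an assumption rather than a rule of the .doc format; read_text_at is a hypothetical helper, and it treats the text as single-byte characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Offset quoted in the answer; an assumption, not guaranteed by the format. */
#define DOC_TEXT_OFFSET 0xA00L

/* Hypothetical helper: copy bytes starting at offset into out until a NUL
   byte, end of file, or cap - 1 bytes, whichever comes first. Returns the
   number of bytes copied, or -1 if the seek fails. On success out is
   always NUL-terminated. */
static long read_text_at(FILE *fp, long offset, char *out, size_t cap)
{
    size_t n = 0;
    int ch; /* int, so EOF is never confused with a data byte */
    assert(fp != NULL && out != NULL && cap > 0);
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    while (n + 1 < cap && (ch = fgetc(fp)) != EOF && ch != '\0')
        out[n++] = (char)ch;
    out[n] = '\0';
    return (long)n;
}
```

The file should be opened with "rb", and the result is only as reliable as the offset assumption; documents that store their text differently will not come out readable.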
I have not seen the implementation of a .pdf or a .docx file, but you can open both files in a hex editor and figure out what pattern you can use to isolate the important contents of the files.
Hope this helps.
EDIT : You can always find documentation on the different file formats that you want to manipulate. Here are the specifications of the PDF file type :
http://www.adobe.com/devnet/pdf/pdf_reference.html
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Using the C language, I am trying to manipulate some files generated by openssl and containing a lot of (very) special characters. But the end of file seems to be prematurely detected.
For example, here is an extract of my program, which is supposed to copy one file to another:
(for simplicity reasons I do not show the test of the opening of the file but I do that in my program)
char msgcrypt[FFILE];
FILE* fMsg = fopen(f4Path,"r");
while(fgets(tmp,FFILE,fMsg) != NULL) strcat(msgcrypt,tmp);
fclose(fMsg);
FILE* fMsg2 = fopen(f5Path,"w");
fprintf(fMsg2,"%s",msgcrypt);
fclose(fMsg2);
here is the content of the file located at f4Path :
Salted__X¢~xÁïÈú™xe^„fl¯�˜<åD
now the content of the file located at f5Path :
Salted__X¢~xÁïÈú™xe^„fl¯
Notice that 4 characters are missing.
Does anyone have an idea?
But the end of file seems to be prematurely detected
Sounds familiar.
Use fopen(f4Path, "rb") when opening the file. This has real significance on Windows.
Don't use string functions (fprintf, strcat, fgets, etc.); they will choke on NUL characters. Use fread and fwrite instead.
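A minimal sketch of a copy built on fread and fwrite, assuming both streams were opened in binary mode. The function name copy_stream and the 4096-byte buffer size are arbitrary choices for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Copy every byte from in to out. Both streams should be opened in binary
   mode ("rb" / "wb"). Returns the number of bytes copied, or -1 on error. */
static long copy_stream(FILE *in, FILE *out)
{
    unsigned char buf[4096];
    long total = 0;
    size_t n;
    assert(in != NULL && out != NULL);
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        /* fwrite is given the exact count, so NUL bytes are copied too. */
        if (fwrite(buf, 1, n, out) != n)
            return -1;
        total += (long)n;
    }
    return ferror(in) ? -1 : total;
}
```

Because the function returns the byte count, the result can be compared with the size of the source file to confirm that nothing was dropped.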
strcat tries to copy a NUL-terminated char *. This means that if it encounters a 0 byte, which it probably has done here, it will stop copying.
You'd better use open, read, memcpy and write.
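That character it stops on I copied into a hex editor, and it ends up being EF BF BD. That is the UTF-8 encoding of U+FFFD, the replacement character (a BOM would be EF BB BF), so the original byte was probably not valid text and was substituted when the content was pasted. Either way, reading the file as a text file fails. I don't see any NUL characters (unless copying and pasting got rid of them).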
The answer (as has already been discussed) is to not treat it as a text file, and avoiding the str functions won't do any harm either.
The first thing I'd do, though, is add a check for how many characters are read; that way you'll know where the data is being truncated. Right now it could be in any of: read, strcat, write.
I'm currently trying to solve the problem when I need to load rows from the file and then sort them in the right order.
If I manually assign letters to the array of wint_t and then sort them, everything works just fine with any encoding http://pastebin.com/85eycH15.
But if I read the very same letters from a file and then try to sort them, it works with just one encoding (cs_CZ.utf8); with the rest of them it doesn't read the letters properly or just skips them http://pastebin.com/3C8r9W5T.
I highly appreciate any help.
I assume that the only encoding with which you get your expected result is the one used for your data file. Re-encode the data file in another encoding, you'll get your expected result for the new encoding and not others.