Writing 8-bit data to a file, without escape characters

I am using Node-RED to read a serial port. The incoming data is 8 bits per byte, and reading works fine. Now I want to write that buffer to a file.
When I use buffer.toString() it generates a string, but with formatting. Even the 'binary' option escapes the data when the 8th bit is set.
How do I write the raw data to a file without it being changed?
Mission:
I want to generate a bitmap file, adding a fixed-length image to the header.
Comparing the new file against an existing bitmap in a hex editor shows the escaped bytes in the new file. Obviously that is wrong.

Related

Remove trailing "0d 0a" bytes from a file, using PowerShell

I am trying to encrypt and decrypt files using PowerShell. In this case, I am working with .docx files. After encrypting a file, I passed it to the decrypt function, and the decrypted file is corrupted and will not open.
However, after using a hex editor to compare both original and decrypted .docx files, the only difference is that the decrypted .docx file has 2 trailing bytes of "0d 0a".
I think this is the result of PowerShell's "Set-Content" command.
(The Out-File command produces a far worse result.)
However, I am not able to simply replace all carriage return and line feed bytes, as I want to preserve the line feeds and carriage returns inside the Word document itself.
Is there a way I am able to remove only the trailing bytes of "0d 0a" of the decrypted and already-written .docx file?
Without seeing your code it's impossible to tell where the extra CRLF is coming from. I would recommend examining your code to determine where it is introduced and then using an alternative route such as the System.IO.File class. If you are just looking for a quick fix, you can read the file back in, strip the trailing two bytes (the 0d 0a pair), and write the byte array back over the same file. This is a band-aid, but it should work.
#read in all contents
$bytes = [System.IO.File]::ReadAllBytes("somefile.docx")
#write out all bytes except the trailing CRLF pair
#0-based, so the last byte is at position Length-1; drop the final 2 bytes
[System.IO.File]::WriteAllBytes("somefile.docx", $bytes[0..($bytes.Length - 3)])

Is it possible to prevent adding BOM to output UTF-8 file? (Visual Studio 2005)

I need some help.
I'm writing a program that opens 2 source files in UTF-8 encoding without a BOM. The first contains English text and some other information, including IDs. The second contains only string IDs and translations. The program replaces the English text in each string from the first file with the Russian translation from the second and writes these strings to the output file. Everything seems to be OK, but a BOM appears in the destination file. I want to create the file without a BOM, like the source.
I open the files with the fopen function in text mode with ccs=UTF-8,
read strings with the fgetws function into a wchar_t buffer,
and write them with the fputws function to the output file.
Don't use text mode, don't use the MS ccs= extension to fopen, and don't use fputws. Instead use fopen in binary mode and write the correct UTF-8 yourself.

C program for reading doc, docx, pdf

I want to write a program in C (only C, not C++ or Java) that will read .doc, .docx and .pdf files, and I want to make it available on GitHub for anyone who needs that code. So I started with the .doc format. I noticed that if I open a .doc file in plain Notepad it shows all the text, just with some extra content that you can easily trim. So I wrote a simple C program to read a .doc file in both 'r' and 'rb' mode, but both times it gives me only 5-9 characters from the file, and those are not readable either. I don't know why this is happening. Any comment or discussion will be very helpful for me.
Here is the link to the GitHub source code. Please help me to complete all three formats.
To answer your specific question, the reason your little application stops reading is because it mistakenly thinks there is an EOF character in your file.
Look at your code:
char ch;
int nol=0, not=0, nob=0, noc=0;
FILE *fp;
fp = fopen("file.doc","rb");
while(1)
{
    ch = fgetc(fp);
    if(ch==EOF)
    {
        break;
    }
You store the result of fgetc(fp) in a variable of type char, which is a single-byte variable. However, the return type of fgetc is very deliberately int, not char.
fgetc always returns a non-negative result in the range 0 to 255, except when you reach the end of the file, in which case it returns EOF, which is commonly implemented as the value -1.
If you read a byte of value 255 and store it in an int, everything is OK: it is stored as 255 and your loop continues. If you store it in a char (signed on most platforms), it becomes -1, which compares equal to EOF, and your loop stops early.
Don't expect to get anywhere with this idea. .doc is a huge binary file format that is inhumanly complicated to parse. That said, Cubia mentioned the offset where the text section of the document starts. I'm not familiar with the details of the format, but if the raw text is contained in one location, use fseek to get to it and stop when you reach the end. This won't work for the other formats because they are very different.
.docx and .pdf should be easier to parse because they are more modern formats. If you want to read anything from a docx you need to read from a zip file with a ton of xml in it and use a parser to figure out which text you want.
.pdf should be the easiest of the three because you might be able to find a library out there that can almost do what you want.
As for why you are getting strange output from your program, remember that .doc is a binary format and the vast majority of the data is garbage from your perspective. Dumping it to the terminal will yield readable text, but also a bunch of control characters that can mess up your terminal.
As a last note - don't try to read .docx files directly using fread - they are zip-compressed, so you won't recover the text unaltered. Take a look at libarchive. Also, expect to have to read the document specifications: .docx is Microsoft's Office Open XML format. See this and some PDF specification documents (there are multiple versions).
Look at the .doc file type as a txt file with extra non-printable characters before, in the middle, and after your content. These non-printable characters define special formatting, metadata and other info.
With this said, all .doc files follow a certain structure.
If you open two different .doc files in a hex editor, you will notice that the text content of both files start at an offset of 0xA00 (2560 bytes) from the beginning of the file. This means that when you open your file initially, you can ignore the first 2560 bytes of the file (Take a look at the fseek() function).
From this point on, you can read the contents of your file until you reach '\0'.
I have not seen the layout of a .pdf or a .docx file, but you can open both file types in a hex editor and figure out what pattern you can use to isolate the important contents of the files.
Hope this helps.
EDIT : You can always find documentation on the different file formats that you want to manipulate. Here are the specifications of the PDF file type :
http://www.adobe.com/devnet/pdf/pdf_reference.html
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

SQLite binding blob in C is terminated early

I have a png file that is to be stored in a database; however, even when passing a length to sqlite3_bind_blob(), it stops filling in the value at the first NUL character.
Here's the code in question:
fseek(file,0xC,SEEK_CUR); // Skip to 12 (0xC) and read everything (It's a raw png file)
char content[size-0xC];
fread(content,1,size-0xC,file);
sqlite3_bind_int(inserticonstmt,1,id);
sqlite3_bind_blob(inserticonstmt,2,content,size-0xC,SQLITE_STATIC);
sqlite3_step(inserticonstmt);
sqlite3_clear_bindings(inserticonstmt);
sqlite3_reset(inserticonstmt);
Any ideas?
Edit: It looks like, while the database is in fact storing the whole blob, it is not being returned by the CLI interface.
The sqlite CLI has a bug where it treats blobs as strings and stops printing them at the first NUL byte, including when it is told to send output to a file.

Widechar sorting and loading from file

I'm currently trying to solve a problem where I need to load rows from a file and then sort them into the right order.
If I manually assign letters to an array of wint_t and then sort them, everything works just fine with any encoding: http://pastebin.com/85eycH15.
But if I read the very same letters from a file and then try to sort them, it works with only one encoding (cs_CZ.utf8); with the rest it either doesn't read the letters properly or just skips them: http://pastebin.com/3C8r9W5T.
Any help is highly appreciated.
I assume that the only encoding with which you get your expected result is the one your data file is actually encoded in. Re-encode the data file into another encoding, and you'll get the expected result for that new encoding and not the others.
