Remove trailing "0d 0a" bytes from a file, using PowerShell - file

I am trying to encrypt and decrypt files using PowerShell. In this case, I am working with .docx files. After encrypting the file, I passed it onto the decrypt function, and after decrypting it, the file is corrupted when trying to open.
However, after using a hex editor to compare both original and decrypted .docx files, the only difference is that the decrypted .docx file has 2 trailing bytes of "0d 0a".
I think this is the result of PowerShell's "Set-Content" command.
(The Out-File command produces a far worse result.)
However, I am not able to just replace all the carriage return and line feed bytes as I would like to preserve the line feeds and carriage returns of the word document.
Is there a way I am able to remove only the trailing bytes of "0d 0a" of the decrypted and already-written .docx file?

Without seeing your code, it's impossible to tell where the extra CRLF is coming from. I would recommend examining your code to find the source and then switching to an alternative such as the System.IO.File class. If you are just looking for a quick fix, you could read the file in, strip the last two bytes (the trailing "0d 0a"), and write the byte array back to the same file, overwriting it. This is a band-aid, but it should work.
#read in all contents
$bytes = [System.IO.File]::ReadAllBytes("somefile.docx")
#write out all bytes except the last 2 (the trailing 0d 0a)
#0-based, so the last byte is at index Length-1 and the last byte to keep is at index Length-3
[System.IO.File]::WriteAllBytes("somefile.docx", $bytes[0..($bytes.Length-3)])

Related

writing 8 bit data to file, without escape character

I am using Node-RED to read a serial port. The incoming data uses all 8 bits of each byte. Reading works fine. Now I want to write that buffer to a file.
When I use buffer.toString(), it generates a string, but with formatting. Even the 'binary' option escapes the data when the 8th bit is set.
How do I write the raw data to a file without it being changed?
Mission:
I want to generate a bitmap file by adding a fixed-length image to the header.
Comparing the new file with an existing bitmap in a hex editor shows the escaped bytes in the new file. Obviously that is wrong.

How to read, manipulate and write .docx file in c

I am reading .docx file in a buffer and writing it to a new file successfully. (Using fread and fwrite in C) However now I want to enhance the scope of this project for the purpose of encryption. For which I want to be able to manipulate the buffer, then write it in new file.
Now one question might be, what manipulation do I need?
It could be anything really, like I'd write character 's' in buffer's location 15. Like below, and then write this new buffer (having character 's' at location 15, but the rest of the buffer remains unchanged) in a new .docx file.
buffer[15] = 's';
When I did this, the file that was created was corrupt. Since I am not fully aware of the structure of .docx file, this byte number 15 could be some potential identifier, or header, or any important information of .docx file needed for creating a non-corrupt file.
However, here is what I know about the internal structure of a .docx file:
It consists of XML files, zipped together.
The content written in a .docx file is stored in those XML files. For example, if I have a file named test.docx containing "Hello, how are you?", then the text "Hello, how are you?" is stored in the XML files.
Among the files that are zipped together, there is a file with a .rels extension (not confirmed) that tells MS Word where the content is stored, i.e. where to look for it.
Apart from these 3 points, I don't know much about the structure of a .docx file. Considering all this, I want to be able to extract the contents of a .docx file from the zipped XML files, read them into a buffer (in C), change the buffer as needed, and create a new file with the new content from the buffer.
Can someone guide me through this?
Also kindly mention, if I need to provide code, or any other essential details. Thanks in advance.
EDIT
PURPOSE OF ALL THIS:
I want to do all this for encryption. Encrypting a file (using AES) makes the whole file unreadable and corrupt, and everything inside is moved from its place. When I decrypt that file, it cannot be opened. My guess is that the AES decryption algorithm does not know how to arrange the recovered contents into a new .docx file, so it is unable to put the contents/structure back in place.
I have tried it. The original docx file was 14 KB, and so were the encrypted and decrypted docx files. But when I try to open the decrypted file, it says the file is corrupt. I also checked it in a hex editor: the decrypted file contains only 00 bytes after exactly 30 bytes.
DOCX files are based on OPC and OOXML. OPC is based on Zip. OOXML is based on XML. Therefore, you can use Zip and XML tools to operate on DOCX files. Beyond this, you'll have to be more specific about what you wish to do in order to receive better guidance.
Poking characters into random index locations in an XML file is operating at the wrong level of abstraction.
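To illustrate why overwriting byte 15 can corrupt the file: because a DOCX is a Zip archive, it normally begins with a Zip local file header, and offsets 14 to 17 of that header hold the CRC-32 of the first entry. The following is a minimal sketch, not a DOCX parser. The names zip_local_header and parse_zip_local_header are made up for this example, and the function only decodes the fixed 30-byte header from a buffer that has already been read.

```c
#include <stddef.h>
#include <stdint.h>

/* Selected fields of the fixed 30-byte Zip local file header (little-endian). */
struct zip_local_header {
    uint16_t flags;             /* offset 6  */
    uint16_t method;            /* offset 8: 0 = stored, 8 = deflate */
    uint32_t crc32;             /* offset 14 */
    uint32_t compressed_size;   /* offset 18 */
    uint32_t uncompressed_size; /* offset 22 */
    uint16_t name_length;       /* offset 26 */
    uint16_t extra_length;      /* offset 28 */
};

static uint16_t read_u16le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_u32le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and fills *out if buf starts with a local file header, else 0. */
int parse_zip_local_header(const unsigned char *buf, size_t len,
                           struct zip_local_header *out)
{
    if (len < 30)
        return 0;
    /* Signature "PK\x03\x04" marks a local file header. */
    if (buf[0] != 'P' || buf[1] != 'K' || buf[2] != 3 || buf[3] != 4)
        return 0;
    out->flags = read_u16le(buf + 6);
    out->method = read_u16le(buf + 8);
    out->crc32 = read_u32le(buf + 14);
    out->compressed_size = read_u32le(buf + 18);
    out->uncompressed_size = read_u32le(buf + 22);
    out->name_length = read_u16le(buf + 26);
    out->extra_length = read_u16le(buf + 28);
    return 1;
}
```

With the first bytes of a real file in buf, changing buf[15] alters the stored CRC-32, so an unzip tool or Word may reject the archive. Some Zip writers set bit 3 of flags and store the CRC in a trailing data descriptor instead, in which case this field is zero in the local header.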

C program for reading doc, docx, pdf

I want to write a program in C (only C, not C++ or Java) that reads doc, docx and pdf files, and make it available on GitHub for anyone who needs that code. I started with .doc. I found that if I open a .doc file in plain Notepad, it shows all the text, just with some extra content that can easily be trimmed. So I wrote a simple C program to read the .doc file in both 'r' and 'rb' mode, but both times it gives me only 5-9 characters of the file, and those are not readable either. I don't know why this is happening. Any comment or discussion would be very helpful.
Here is the link to the GitHub source code. Please help me complete all three formats.
To answer your specific question, the reason your little application stops reading is because it mistakenly thinks there is an EOF character in your file.
Look at your code:
char ch;
int nol=0, not=0, nob=0, noc=0;
FILE *fp;
fp = fopen("file.doc","rb");
while(1)
{
ch = fgetc(fp);
if(ch==EOF)
{
break;
}
You store the result of fgetc(fp) in a variable of type char, which is a single-byte variable. However, the return type of fgetc is very purposefully int, not char.
fgetc always returns a non-negative value in the range 0 to 255, except when it reaches the end of the file, in which case it returns EOF, which is a negative value, usually -1.
If you read a byte of value 255 and store it in an int, everything is OK: it's stored as the value 255 and your loop can continue. If you store the result in a char and char is signed on your platform, 255 becomes -1, which compares equal to EOF, and your loop stops.
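A minimal sketch of the corrected loop, assuming all you want is to walk the file byte by byte. The function name count_bytes is made up for this example; the only change that matters is that ch is an int.

```c
#include <stdio.h>

/* Counts the bytes remaining in fp, reading one byte at a time. */
long count_bytes(FILE *fp)
{
    long count = 0;
    int ch; /* int, not char: it must hold 0..255 and also EOF */

    while ((ch = fgetc(fp)) != EOF) {
        count++; /* a byte of value 0xFF no longer ends the loop */
    }
    return count;
}
```

Open the file with fopen("file.doc", "rb") before calling it, so that no newline translation happens on Windows.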
Don't expect to get anywhere with this idea. .doc is a huge binary file format that is inhumanly complicated to parse. With that said, Cubia mentioned the offset where the text section of the document starts. I'm not familiar with the details of the format, but if the raw text is contained in one location, use fseek to get at it and stop when you reach the end. This won't be the case for the other formats because they are very different.
.docx and .pdf should be easier to parse because they are more modern formats. If you want to read anything from a docx you need to read from a zip file with a ton of xml in it and use a parser to figure out which text you want.
.pdf should be the easiest of the three because you might be able to find a library out there that can almost do what you want.
As for why you are getting strange output from your program, remember that .doc is a binary format and the vast majority of the data is garbage from your perspective. Dumping it to the terminal will yield some readable text but also a bunch of control characters that can confuse your terminal.
As a last note - don't try to read docx files directly using fread - they are compressed, so you likely won't recover the text unaltered. Take a look at libarchive. Also - expect to have to read the document specifications. docx is Microsoft's Office Open XML format (standardized as ECMA-376), not an extension of the OpenOffice format. See this and some PDF specification documents (there are multiple versions).
Think of a .doc file as a text file with extra non-printable characters before, in the middle of, and after your content. These non-printable characters define special formatting, metadata and other information.
With this said, all .doc files follow a certain structure.
If you open two different .doc files in a hex editor, you will notice that the text content of both files starts at an offset of 0xA00 (2560 bytes) from the beginning of the file. This means that when you first open your file, you can skip the first 2560 bytes (take a look at the fseek() function).
From this point on, you can read the contents of your file until you reach '\0'.
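A rough sketch of that approach, assuming the 0xA00 offset observed above holds for your file. That offset is an observation from this answer, not something the format guarantees; real files may place the text elsewhere or store it as 16-bit characters, in which case stopping at the first zero byte ends too early. The function name read_text_at is made up for this example.

```c
#include <stdio.h>

/* Seeks to offset, then copies bytes into out until '\0', end of file,
   or cap - 1 bytes have been stored. Always NUL-terminates out.
   Returns the number of bytes copied, or -1 if cap is 0 or the seek fails. */
long read_text_at(FILE *fp, long offset, char *out, size_t cap)
{
    size_t n = 0;
    int ch; /* int so that EOF can be told apart from byte 0xFF */

    if (cap == 0 || fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    while (n + 1 < cap && (ch = fgetc(fp)) != EOF && ch != '\0') {
        out[n++] = (char)ch;
    }
    out[n] = '\0';
    return (long)n;
}
```

For example, read_text_at(fp, 0xA00, buffer, sizeof buffer) on a file opened in "rb" mode fills buffer with whatever bytes follow offset 2560 up to the first zero byte.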
I have not looked at the internals of .pdf or .docx files, but you can open both in a hex editor and figure out what pattern you can use to isolate the important contents of the files.
Hope this helps.
EDIT : You can always find documentation on the different file formats that you want to manipulate. Here are the specifications of the PDF file type :
http://www.adobe.com/devnet/pdf/pdf_reference.html
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Changing backslashes to forward slashes changes file size

I have two small to medium sized files (2k) that are for all intents and purposes identical. The second file is the result of the first file being duplicated and replacing backslashes with forward slashes. The new file is bigger by 80 bytes (or one byte per line).
I did this with a simple batch script, and at first I thought the script might have unintentionally added some spaces or other artifacts. Or maybe the fact that their extensions are different has something to do with it (one has a tmp extension and the other has a lst extension).
From an editor, I replaced all forward slashes in the new file with backslashes and saved it without changing the extension.
And, hey guess what? The files were the same size again.
Now, before this is written off as a random fluke, I also see the same behavior exhibited in three other pairs of files (in other words six files) created in the same manner as the first. They are all one byte bigger per line in the file. The largest is about 12k bytes, and the smallest is about 2k.
I wouldn't think it has anything to do with escaping because I am on a Windows box using the Windows 7 cmd.exe shell.
Also one other thing. I tried the following:
echo \\\\\ >> a.txt
echo ///// >> b.txt
The files matched in size (7 bytes)
Does anyone have an explanation for this behavior?
I would suggest opening the files with an editor like Notepad++ that shows the type of linefeed (Windows/Mac/Unix). This is most likely your problem if the file size differs 1 byte per line.
Notepad++ can show line endings as small CR/LF symbols (View -> Show Symbol -> Show End of Line) and convert between the Windows/Mac/Unix line endings (Edit -> EOL Conversion).
Unix systems and modern macOS usually store files with a one-byte line ending (LF), and classic Mac OS used a single CR; Windows uses two bytes (CR LF).
Depending on the programs your batch scripts use, this might occur even though your system is a pure Windows box. The reason you don't get a difference when using an editor is that editors usually keep the file's original line endings.
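If you would rather check this programmatically than in an editor, here is a minimal sketch that counts each kind of line ending in a buffer you have already read in binary mode. The names eol_counts and count_line_endings are made up for this example.

```c
#include <stddef.h>

struct eol_counts {
    size_t crlf; /* "\r\n" pairs (Windows) */
    size_t lf;   /* '\n' not preceded by '\r' (Unix) */
    size_t cr;   /* '\r' not followed by '\n' (classic Mac OS) */
};

/* Scans len bytes of buf and tallies the three line-ending styles. */
struct eol_counts count_line_endings(const unsigned char *buf, size_t len)
{
    struct eol_counts c = {0, 0, 0};

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r') {
            if (i + 1 < len && buf[i + 1] == '\n') {
                c.crlf++;
                i++; /* skip the '\n' that belongs to this pair */
            } else {
                c.cr++;
            }
        } else if (buf[i] == '\n') {
            c.lf++;
        }
    }
    return c;
}
```

If one file reports only crlf and the other only lf with the same number of lines, the one-byte-per-line difference comes from the line endings. If both report the same counts, look for something else, such as trailing white space.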
Okay. I just solved it. @schnaader pointed me in the right direction. It actually has nothing to do with the forward slashes or backslashes.
What happened is that my script added one character of trailing white space to each line. The file went back to the same size after I reverted the slashes because the editor I used for the find and replace (Komodo Edit) is set up to automatically trim trailing white space on save.
Funny.

SQLite binding blob in C is terminated early

I have a png file that is to be stored in a database, however, even when passing a length to sqlite3_bind_blob() it stops filling in the value at the first nul character.
Here's the code in question:
fseek(file,0xC,SEEK_CUR); // Skip to 12 (0xC) and read everything (It's a raw png file)
char content[size-0xC];
fread(content,1,size-0xC,file);
sqlite3_bind_int(inserticonstmt,1,id);
sqlite3_bind_blob(inserticonstmt,2,content,size-0xC,SQLITE_STATIC);
sqlite3_step(inserticonstmt);
sqlite3_clear_bindings(inserticonstmt);
sqlite3_reset(inserticonstmt);
Any ideas?
Edit: It looks like, while the database is in fact storing the whole blob, the CLI is not returning all of it.
The sqlite3 command-line shell has a bug where it parses blobs as strings and stops printing them at the first NUL byte, even when it's told to send output to a file. To check that the whole blob is stored, compare length(column) with the expected size or inspect hex(column).
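A standalone sketch of why string-style printing truncates binary data, independent of SQLite. The helper name visible_as_string is made up for this example; it reports how many bytes a printer that stops at the first NUL would show. Assuming the stored data begins with a normal PNG header, the 8-byte signature is followed by a 4-byte chunk length that starts with zero bytes, so such a printer stops after 8 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Returns how many bytes a NUL-terminated-string printer would show:
   the index of the first zero byte, or len if there is none. */
size_t visible_as_string(const void *blob, size_t len)
{
    const unsigned char *start = blob;
    const unsigned char *nul = memchr(blob, 0, len);

    return nul ? (size_t)(nul - start) : len;
}
```

Comparing this value with the length passed to sqlite3_bind_blob() shows how much of the blob is hidden by string-based output, even though all of it is stored.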
