I'm currently trying to solve the problem when I need to load rows from the file and then sort them in the right order.
If I manually assign lettes to the array of wint_t and then sort them, everything from just fine with any encoding http://pastebin.com/85eycH15.
But if I read the very same letters from file and then try to sort them it works just with one encoding (cs_CZ.utf8) and with the rest of them it doesn't read the letters properly or or just skip them http://pastebin.com/3C8r9W5T.
I highly appreciate any help.
I assume that the only encoding with which you get your expected result is the one used for your data file. Re-encode the data file in another encoding, you'll get your expected result for the new encoding and not others.
Related
I am using NODERED to read a serial port. The data coming in is 8 bit per byte. reading is working fine. Now, I want to write that buffer to a file.
When I use buffer.toString() it generates a string but with formatting. Even the 'binary' option escapes the data when 8th bit is set.
Howto write the raw data to a file without it getting changed?
Mission:
I want to generate a bitmap file, adding a fixed length image to the header.
Comparing the file with an hex editor with an existing bitmap shows the escaped bytes in the new file. Obviously that is wrong.
I have a file that holds manufacturing orders for a machine.
I would like to read the content of this file and edit it, but when I open it in a text editor i.e. Notepad++, I get a bunch of wierd charecters:
xÚ¥—_HSQÀo«a)’êaAXŽâê×pD8R‰¬©s“i+ƒ´#¡$
-þl-ó/ÓíºIúPôàƒHˆP–%a&RÎÈn÷ü¹·;Ú;ç<ìòÝÃý}¿ó}‡{϶«rWg>˜›ãR‡)Çn0³Ûf³yÎW[5–šw½ÇRW{ñ’rO6¹ŽŸp¦ÙœcÏ.9yÀnýg
)Ë—e90ejÕø£rC. f¦}3ËŒ˜hü”å1g[…ø±ú ÜJøz®‹˜YfÈ,4`ŽKÉ—ù“ÔË¿d„þlG3#=˜Ž´+hF¬¦£€«šm¿áØ
ïÖµv‡ËpíÍ~™‡Aù
šëÈÚ]ÿç™DŒÉFØ ïƒæsij ¦y=-74Æ/t=ÕŠr\˜š»Âä‰Ý¨žã΢
dz·à‡'fœ½yâ½4qåPjácòÄŒeÊhñ“ý™ÙÎÕ÷5ôlñ=˜Õ{ú;ø=Û;4OêYä>Ìpxbæâ'è"oëB×1gQ9“'¹]Ô³’Ô³ø!ÌózÞyŸõžÓIŽù*&OÌXPÕ"ŽWžpíOÌè‚Þ3Òr0{Ž†R=_?…/¼žÞ0,ê=/?£ûÓËîy“2Z<ij³[ËÁì™÷–ôžÎ’Ããa÷<Maêéí…¼ž}©žYýZ-˜=”á¤}π>3°¢÷œ$ïè‰3ìž«ƒÄs¿—xnŒÀ*¯gi$ÕómDËÁìùIeоû‡À¬?3°x¾"~ª§c˜öÝÇî颌°›x¾Fßb>Ï}QXÓ{öFi-êÙßóR”œe^Ñ÷ü‘¿g[Lë ŽwJZϘë¹3”³L©gH‚,^Ïe 2ôžWGøëÙ2‚Î
øœL¾ÅqÈäõ,ýç\œË3¾þeྗ&`Ϻ<KÒf“’»ðù]í‰ãžU^wèþåÔÖy”H}ò•6ø6
It looks like the file is encoded.
Any idea how to find the encoding and make the file readable and editable?
It's binary and probably encoded so without knowledge of data structure you can't do much - just reverse engineering based on trying and checking what changed, operating with hex editor.
It isn't impossible, tho. If you can change the data the way you know (eg. change number of orders from 1 to 2) and export to file, you can compare binary values and find which byte holds that number. Of course if it is encrypted and you don't know the key... It's easier to find another way.
For further read, check this out - https://en.wikibooks.org/wiki/Reverse_Engineering/File_Formats
If you've got access to a Linux box why not use
hexdump -C <filename>
You will be able to get a much better insight into how the file is structured, than by using a text editor.
There are also many "hexdump" equivalent commands on Windows
I want to write a program in C(only c not c++ or java) that will read doc, docx, pdf and want to make it available on github to use for all who needs that code. So I started with .doc file I explored that if I open .doc file with simple notepad it will show you all text but just with some extra content which you can easily trim. So I did write a simple c program to read .doc wile in both 'r' and 'rb' mode but both time it gives me only 5-9 character in the file and those also not readable. I don't know why it's happening. Any comment or disccussion will be very helpful for me.
Here is the link for github Source code. Please help me to complete all three format.
To answer your specific question, the reason your little application stops reading is because it mistakenly thinks there is an EOF character in your file.
Look at your code:
char ch;
int nol=0, not=0, nob=0, noc=0;
FILE *fp;
fp = fopen("file.doc","rb");
while(1)
{
ch = fgetc(fp);
if(ch==EOF)
{
break;
}
You store the result of fgetc(fp) in a variable of type char, which is a single-byte variable. However, the result of fgetc is very purposefully "int", not "char".
fgetc always returns a positive result in the range 0 to 255, except for when you reach the end of the file in which case it returns EOF, which is often implemented as a -1 value.
If you read a byte of value 255 and store it in an int, everything is OK, it's stored as the value 255 and your loop can continue. If you store the result in a char, it's going to be interpreted equal to EOF. And your loop stops.
Don't expect to get anywhere with this idea. .doc is a huge binary file format that is inhumanly complicated to parse. With that said, Cubia mentioned the offset where the text section of the document starts. I'm not familiar with the details of the format, but if the raw text is contained in one location, use fseek to get at it and stop when you reach the end. This won't be the case for the other formats because they are very different.
.docx and .pdf should be easier to parse because they are more modern formats. If you want to read anything from a docx you need to read from a zip file with a ton of xml in it and use a parser to figure out which text you want.
.pdf should be the easiest of the three because you might be able to find a library out there that can almost do what you want.
As for why you are getting strange output from your program, remember that .doc is a binary format and the vast majority of the data is garbage from your perspective. Dumping it to the terminal will yield readable text but also a bunch of control characters that should screw with your terminal.
As a last note - don't try to read docx files directly using fread - they are compressed so you likely won't recover the text unaltered. Take a look at libarchive. Also - expect to have to read the document specifications. docx seems to be a microsoft extension to the openoffice format. See this and some PDF specification documents (there are multiple versions).
Look at the .doc file type as a txt file but with extra non-printable characters before, in the middle, and after your content. These non-printable characters are used for defining special formatting, metadata and other infos.
With this said, all .doc files follow a certain structure.
If you open two different .doc files in a hex editor, you will notice that the text content of both files start at an offset of 0xA00 (2560 bytes) from the beginning of the file. This means that when you open your file initially, you can ignore the first 2560 bytes of the file (Take a look at the fseek() function).
From this point on, you can read the contents of your file until you reach '\0'.
I have not seen the implementation of a .pdf or a .docx file, but you can take open up both files with a hex editor and figure out what pattern you can use the isolate the important contents of the files.
Hope this helps.
EDIT : You can always find documentation on the different file formats that you want to manipulate. Here are the specifications of the PDF file type :
http://www.adobe.com/devnet/pdf/pdf_reference.html
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
SITUATION:
My instructor for my micro-controller class refuses to save sample code to a text file and instead saves it to a word document file instead. When I open up the doc file and copy/paste the code into my IDE "CodeWarrior" it causes errors upon compile time.
I am having to rewrite all the code into a text editor and then copy/paste it into my IDE.
MY UNDERSTANDING:
I was told to always save code as a text file because when you save code as a word document file it will bring in unwanted characters when your copy/pasting the code into your IDE for compiling.
MY QUESTIONS TO YOU:
1.)
Can someone explain this dilemma to me so I can understand it better? I would like to present a better case next time when I receive errors and to also know more about what is happening.
2.)
Is it possible to write a script that will show me all the characters that are being copied and pasted into a file when the code is coming from a word document vs. a text file? In otherwords is there a program that will allow me to see what is going on between copying/pasting code from a word doc file versus a txt file?
Saving source code as a Word document is just silly. If your instructor is insisting on this, chances are no matter how well-reasoned and thorough your argument, they're not going to listen. They're beyond help.
However, to answer your questions: 1) It depends on what you're pasting the thing into. Programs that copy onto the clipboard usually make the data available in several different formats, ranging from their own internal format to plain ASCII text, to maximize compatibility so that the data can be pasted into pretty much any target program. Most text editors will only accept the plan-text version, in which case no extra characters should be transferred. However if your text editor supports RTF or HTML, this may not be true. I'm not sure what CodeWarrior supports but it is certainly possible.
A workaround if this is the case: First paste into a PURE text editor like Notepad. Then copy from Notepad into CodeWarrior. This should eliminate any hidden formatting. As shoover said above, make sure double-quotes " are really double-quotes and not the fancy left- and right-specific quotes that Word sometimes uses.
Use a hex editor like XVI32 to see the raw contents of the file, including nonprinting characters. Or use a text editor with support for showing nonprinting characters (vi/vim, etc.).
I'm studying C and I've just had the same problem. When coping a piece of code from a PDF file and trying to compiling it, gcc would return a serie of errors. Reading the answer above I had an idea: "What if I converted the utf8 into ascii?". Well, I found a website that does just that (https://onlineutf8tools.com/convert-utf8-to-ascii). But instead of also converting the utf8 characters into ascii, it showed them as hexadecimals (Copying from the website to the text editor you can see it better). From there i realised that the problem were mostly the quote marks "".
I then copied the ascii "translation" into my code editor (I must add that it worked fine with Sublime, while VScode read the same utf8 code as it was in the original file, even after cp from the website) and replaced all the hex with the actual ascii characters that were needed to compile the code properly. I used the function find and replace from my editor to do it. I must say that it wasn't very fast doing it. But I believe that in some cases, if the code you're trying to copying is too long, doing it the way I've just described could be faster than rewriting the entire code.
I have a png file that is to be stored in a database, however, even when passing a length to sqlite3_bind_blob() it stops filling in the value at the first nul character.
Here's the code in question:
fseek(file,0xC,SEEK_CUR); // Skip to 12 (0xC) and read everything (It's a raw png file)
char content[size-0xC];
fread(content,1,size-0xC,file);
sqlite3_bind_int(inserticonstmt,1,id);
sqlite3_bind_blob(inserticonstmt,2,content,size-0xC,SQLITE_STATIC);
sqlite3_step(inserticonstmt);
sqlite3_clear_bindings(inserticonstmt);
sqlite3_reset(inserticonstmt);
Any ideas?
Edit: It looks like, while the database is in fact storing the whole blob, it's not returning it from the CLI interface
The sqlite cli interface has a bug where it parses blobs as strings and stops printing them early, including if it's told to send output to a file.