I'm currently interested in encrypting files, but I don't know much about what issues may occur when I convert the data byte by byte. I read about the end-of-file character, but as Wikipedia says, EOF is system dependent.
What I want to know is: which byte (or group of bytes) should I avoid when writing a method to encrypt files, on Windows or Linux? Thanks!
Related
With the C standard library stdio.h, I read that to output ASCII/text data, one should use mode "w" and to output binary data, one should use "wb". But why the difference?
In either case, I'm just outputting a byte (char) array, right? And if I output a non-ASCII byte in ASCII mode, the program still outputs the correct byte.
Some operating systems (mostly named "Windows") don't guarantee that they will read and write text to files exactly the way you pass it in. On Windows they actually map "\r\n" to '\n' when reading and '\n' to "\r\n" when writing. This is fine and transparent when reading and writing text, but it would trash a stream of binary data. Basically, just always give Windows the 'b' flag if you want it to faithfully read and write data to files exactly the way you passed it in.
There are certain transformations that can take place when outputting in ASCII/text mode (e.g. outputting carriage-return + new-line when the output character is new-line), depending on your platform. Such transformations will not take place when using binary mode.
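Here is a minimal sketch (the file names are just examples) that makes the difference visible: the same bytes are written once in text mode ("w") and once in binary mode ("wb"), and the resulting file sizes are compared.

    /* Write the same bytes in text mode and binary mode and compare sizes.
       On Windows the text-mode file grows by one byte per '\n' because the
       runtime expands it to "\r\n"; on Linux both files come out identical. */
    #include <stdio.h>

    int main(void)
    {
        const char data[] = "line1\nline2\n";          /* contains two '\n' bytes */

        FILE *ft = fopen("text_mode.txt", "w");        /* translated on Windows   */
        FILE *fb = fopen("binary_mode.bin", "wb");     /* written byte-for-byte   */
        if (!ft || !fb)
            return 1;

        fwrite(data, 1, sizeof data - 1, ft);
        fwrite(data, 1, sizeof data - 1, fb);
        fclose(ft);
        fclose(fb);

        /* On Windows: text_mode.txt is 14 bytes, binary_mode.bin is 12 bytes.
           On Linux both files are 12 bytes, because no translation happens.   */
        return 0;
    }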
I want to know which ASCII characters a newline is represented by in my environment.
How can I check it?
When I read it with getchar or scanf and check the ASCII number that was read, I get 10.
How can I check the sequence that newline is represented by in the environment itself?
Those "text-aware" I/O functions will abstract this and do conversions so that '\n' works.
One way is to create a text file containing a single (empty) line of text, then re-open it in binary mode and inspect the contents. Binary mode will turn off any such translations of course, and expose the raw bytes.
Not sure how you'd do that without touching the file system, but I'm sure it's doable. Most of the time this kind of thing is static: it's always going to be the same for a particular target platform, so it's of course possible to, e.g., add the knowledge at compile time instead.
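A sketch of that suggestion, assuming a throwaway file name of my own choosing: write a single empty line in text mode, then reopen the file in binary mode and dump the raw bytes.

    #include <stdio.h>

    int main(void)
    {
        /* Write one logical newline in text mode; the runtime may translate it. */
        FILE *f = fopen("newline_probe.tmp", "w");
        if (!f)
            return 1;
        fputc('\n', f);
        fclose(f);

        /* Reopen in binary mode: no translation, so we see the raw bytes. */
        f = fopen("newline_probe.tmp", "rb");
        if (!f)
            return 1;

        int c;
        while ((c = fgetc(f)) != EOF)
            printf("0x%02X ", (unsigned char)c);   /* 0x0D 0x0A on Windows, 0x0A on Linux */
        putchar('\n');

        fclose(f);
        return 0;
    }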
For example, there is a text file called "Hello.txt"
Hello World!
Then how does the operating system (I'm using MS-DOS) recognize the end of this text file? Is some kind of character or symbol hidden after '!' which indicates the end of file?
If you use MS-DOS then there are decent odds that there is indeed a special character at the end of the file. MS-DOS was derived from Tim Paterson's QDOS, which he wrote to be as compatible as possible with the then-dominant CP/M. An OS for 8-bit machines, CP/M kept track of a file's size only by counting the number of disk sectors used by the file, which made the file size always a multiple of 128 bytes.
That required a hack to indicate the real end of a text file, since it could be located in the middle of a sector: the Ctrl+Z control character (character code 0x1A) was used as the marker. That in turn required language runtime implementations to remove it again and declare end-of-file when they encounter the character. Ctrl+Z is not quite forgotten; it still works when you type it in a Windows console to terminate input. Compare it to Ctrl+D in a Unix terminal.
Whether it is actually present in the file depends on what program created the file, which would have to be an MS-DOS program as well to get the Ctrl+Z appended; it is certainly not required. Paterson improved on CP/M to remove some of its restrictions, greatly aided by having a lot more address space available (1 MB vs 64 KB): MS-DOS keeps track of the actual number of bytes in a file, so it can always reliably indicate the true end of a file. That is probably the most accurate answer to your question.
Ancient history btw, invest your time wisely.
I'm trying to write a console program that reads characters from a file.
I want it to be able to read from a Unicode file as well as an ANSI one.
How should I address this issue? Do I need to programmatically distinguish the type of file and read accordingly? Or can I somehow use the Windows API data types like TCHAR and so on?
Is the only difference between reading the files that for Unicode I have to read 2 bytes per character and for ANSI it's 1 byte?
I'm a little lost with this Windows API.
Would appreciate any help.
Thanks.
You can try to read the first chunk of the file in binary mode into a buffer (1 KB should be enough) and use the IsTextUnicode function to determine whether it's likely to be some variety of Unicode. Notice that unless this function finds some "strong" proof that it's Unicode text (e.g. a BOM), it fundamentally performs a statistical analysis on the buffer to determine what "it looks like", so it can give wrong results; one case in which it fails is the (in)famous "Bush hid the facts" bug.
Still, you can set how much guesswork is done using its flags.
Notice that, if your application does not really manipulate/display the text, you may not even need to determine whether it's ANSI or Unicode; you can just let it be encoding agnostic.
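A rough sketch of that approach (the file name is just an example, and IsTextUnicode remains a statistical guess, so treat the result as a hint rather than a fact):

    /* Read the first kilobyte in binary mode and ask Windows whether it looks
       like UTF-16 text. Link against Advapi32.lib. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("input.txt", "rb");
        if (!f)
            return 1;

        char buffer[1024];
        size_t n = fread(buffer, 1, sizeof buffer, f);
        fclose(f);

        INT tests = IS_TEXT_UNICODE_UNICODE_MASK;        /* which checks to run */
        BOOL looksUnicode = IsTextUnicode(buffer, (int)n, &tests);

        printf("IsTextUnicode says: %s\n",
               looksUnicode ? "probably Unicode (UTF-16)" : "probably ANSI/other");
        return 0;
    }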
I'm not sure if the Windows API has some utility methods for all kinds of text files.
In general, you need to read the BOM of the file (http://en.wikipedia.org/wiki/Byte_Order_Mark), which will tell you which Unicode encoding is actually used, so that you can then read the characters correctly.
You read bytes from a file, and then parse these bytes as the expected format dictates.
Typically you check if a file contains UTF text by reading the initial BOM, and then proceed to read the remainder of the file bytes, parsing these the way you think they're encoded.
Unicode text is typically encoded as UTF-8 (1-4 bytes per character), UTF-16 (2 or 4 bytes per character), or UTF-32 (4 bytes per character).
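A minimal sketch of such a BOM check, assuming a file name of my own choosing; keep in mind that many files (UTF-8 in particular) are written without any BOM at all:

    /* Inspect the first four bytes of a file for a known byte-order mark.
       The UTF-32 signatures are tested before the UTF-16 ones because the
       UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM. */
    #include <stdio.h>

    static const char *detect_bom(const unsigned char *b, size_t n)
    {
        if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return "UTF-32 LE";
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return "UTF-32 BE";
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return "UTF-8";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16 LE";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16 BE";
        return "no BOM (ANSI or BOM-less UTF-8, perhaps)";
    }

    int main(void)
    {
        FILE *f = fopen("input.txt", "rb");
        if (!f)
            return 1;

        unsigned char head[4] = {0};
        size_t n = fread(head, 1, sizeof head, f);
        fclose(f);

        printf("Detected: %s\n", detect_bom(head, n));
        return 0;
    }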
Don't worry, there is no difference between reading ANSI and Unicode files at the byte level. The difference only comes into play during processing.
I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong, given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc.). Oh, and what if the first page happens to look like ASCII but the next page is a B-tree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like a BOM [byte-order mark])
user direction (just edit the file, goshdarnit)
Windows used to provide an API IsTextUnicode that would do a probabilistic examination, but there were well-known false-positives.
My take is that trying to be smarter than the user has some issues...
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, instead checking with the user for a go-ahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long - that is, assuming you're not performing Y2K conversion (a la sed 's/y/k/g').
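A rough sketch of that whitelist idea; the extension list and the file name are purely illustrative, and the fallback to the user prompt is left as a print statement:

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    /* Case-insensitive comparison so ".TXT" matches ".txt". */
    static int ieq(const char *a, const char *b)
    {
        while (*a && *b) {
            if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
                return 0;
            ++a;
            ++b;
        }
        return *a == *b;
    }

    static int has_whitelisted_extension(const char *path)
    {
        /* Illustrative list only; extend it to the formats you care about. */
        static const char *whitelist[] = { ".txt", ".csv", ".log", ".ini", ".xml" };
        const char *dot = strrchr(path, '.');
        if (!dot)
            return 0;
        for (size_t i = 0; i < sizeof whitelist / sizeof whitelist[0]; ++i)
            if (ieq(dot, whitelist[i]))
                return 1;
        return 0;
    }

    int main(void)
    {
        const char *path = "notes.TXT";    /* example input */
        if (has_whitelisted_extension(path))
            printf("%s looks like a known text format.\n", path);
        else
            printf("%s is not whitelisted; ask the user before editing.\n", path);
        return 0;
    }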
It's pretty costly to determine if a file is text-based or not (i.e. a binary file). You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check whether it contains only alphanumeric characters.
So basically the first thing you have to do is check the file encoding. If it's pure ASCII you have an easy task: just read the whole file into a char array (I'm assuming you are doing it in C/C++ or similar) and check every char in that array with functions like isalpha and isdigit... Of course you have to take care of special exceptions like the tab '\t', the space ' ', and the newline ('\n' on Linux, "\r\n" on Windows).
In the case of a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric... Also note that in the case of UTF-16 or greater a simple char array is simply too small... but if you are doing it in, for example, C#, you don't have to worry about the size :)
You can write a function that will try to determine whether a file is text-based. While this will not be 100% accurate, it may be just enough for you. Such a function does not need to go through the whole file; about a kilobyte should be enough (or even less). One thing to do is count how many whitespace characters and newlines there are. Another would be to consider individual bytes and check whether they are alphanumeric or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach and text encodings might complicate things.
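A basic sketch along those lines; the 95% threshold, the 1 KB sample size and the file name are arbitrary choices of mine, and the check assumes single-byte text (it will misjudge UTF-16 files, which are full of NUL bytes):

    #include <stdio.h>

    /* Read at most the first kilobyte and treat the file as text if it
       contains no NUL bytes and is mostly printable or common whitespace. */
    static int is_probably_text(const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f)
            return 0;

        unsigned char buf[1024];
        size_t n = fread(buf, 1, sizeof buf, f);
        fclose(f);
        if (n == 0)
            return 1;                            /* empty file: call it text */

        size_t textlike = 0;
        for (size_t i = 0; i < n; ++i) {
            unsigned char c = buf[i];
            if (c == 0)
                return 0;                        /* NUL almost never appears in text */
            if (c == '\t' || c == '\n' || c == '\r' || (c >= 0x20 && c != 0x7F))
                ++textlike;                      /* whitespace or visible character */
        }
        return textlike * 100 >= n * 95;         /* at least ~95% "text-like" bytes */
    }

    int main(void)
    {
        const char *path = "candidate.dat";      /* example file name */
        printf("%s: %s\n", path,
               is_probably_text(path) ? "probably text" : "probably binary");
        return 0;
    }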