Is there any known limit on the length of a value string that I put into an INI file?
I currently have a line with 1800 characters, and I am considering extending it by another 200 characters. Is that reasonable?
I found this thread, but it does not mention a relevant limitation (it only discusses the 32K limit on file size, which does not apply to my case).
Thank you!
I know in C++, you can check the length of the string, but in C, not so much.
Given the size of a text file, is it possible to know how many characters are in the file?
Is it one byte per character or are other headers secretly stored whether or not I set them?
I would like to avoid performing a null check on every character as I iterate through the file for performance reasons.
Thanks.
You can open the file and read all the characters and count them.
Besides that, there's no fully portable method to check how long a file is -- neither on disk, nor in terms of how many characters will be read. This is true for text files and binary files.
The question "How do you determine the size of a file in C?" goes over some of the pitfalls. Perhaps one of the solutions there will suit the subset of systems that you run your code on, or you might prefer to use a POSIX or operating system call.
As mentioned in the comments, if the intent behind the question is to read characters and process them on the fly, then you still need to check for read errors even if you know the file size, because reading can fail.
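To illustrate the "read and count" approach together with the error check, here is a minimal sketch. The name count_bytes is made up for this example. It assumes the stream was opened in binary mode; in text mode on Windows, line-ending translation can make the count differ from the size on disk.

```c
#include <assert.h>
#include <stdio.h>

/* Counts the bytes in an already-open stream by reading it in blocks.
 * Returns the count, or -1 if a read error occurred.
 * count_bytes is a hypothetical helper name, not a standard function. */
long count_bytes(FILE *fp)
{
    char buf[4096];
    long total = 0;
    size_t n;

    assert(fp != NULL);
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)n;

    /* fread returns 0 both at end-of-file and on error;
     * ferror is what tells the two cases apart. */
    return ferror(fp) ? -1 : total;
}
```

Reading in blocks means there is no per-character test in your own loop, which may address the performance worry in the question.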
Characters (of type char) are single byte values, as defined in the C standard (see CHAR_BIT). A NUL character is also a character, and so it, too, takes up a single byte.
Thus, if you are working with an ASCII text file, the file size will be the number of bytes and therefore equivalent to the number of characters.
If you are asking how long individual strings are inside the file, then you will indeed need to look for NUL and other extended character bytes and calculate string lengths on that basis. You might not be able to safely assume that there is only one NUL character and that it is at the end of the file, depending on how that file was made. There can also be newlines and other extended characters you would want to exclude. You have to decide on a character set and do counting from that set.
Further, if you are working with a file containing multibyte characters (for example, Unicode text encoded as UTF-8), then the answer will be different. You would use different functions to read a text file that uses a multibyte encoding.
So the answer will depend on what type of encoding your text file uses, and whether you are calculating characters or string lengths, which are two different measures.
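To make the bytes-versus-characters distinction concrete, here is a rough sketch that counts code points in a buffer, assuming the data is valid UTF-8 (that assumption is mine, and utf8_count is a made-up name). For pure ASCII the result equals the byte count; for anything else it is smaller.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a buffer that is assumed to hold valid UTF-8.
 * Continuation bytes have the bit pattern 10xxxxxx, so every byte that
 * does not match that pattern starts a new code point.
 * utf8_count is a hypothetical helper name. */
size_t utf8_count(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Note that this counts code points, which is still not necessarily what a user would call a "character" (combining marks are counted separately).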
I have been searching for the answer to this, and everyone's answer is always to just do it line by line. The problem is that my file is all just one line of characters, and trying io.open("file.txt", "rb"):read("*a") results in a memory error. I cannot think of how to load it one part at a time because, like I said, it's all one giant line.
You can use io.read(size) to read a buffer of a specified size (as is already discussed in comments). See the example at the end of the I/O section in Programming in Lua.
Since you are doing a search in the chunks you read, the string you are searching for may be split between different chunks, so you need to take that into account. Another example from PiL that talks about reading large files may be of interest.
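The boundary problem is independent of the language, so here is a rough sketch of the carry-over idea in C rather than Lua (in Lua the chunks would come from file:read(size) instead of fread). The function name find_in_stream and the chunk parameter are made up for this example. The trick is to keep the last strlen(needle) - 1 bytes of each chunk and prepend them to the next one, so a match that straddles two chunks is still seen.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Searches a stream for `needle` while holding only about `chunk` bytes
 * in memory. Returns the byte offset of the first match, or -1 if the
 * needle is absent or memory could not be allocated.
 * find_in_stream is a hypothetical helper name. */
long find_in_stream(FILE *fp, const char *needle, size_t chunk)
{
    size_t nlen = strlen(needle);
    assert(fp != NULL && nlen > 0 && chunk > 0);

    size_t keep = nlen - 1;          /* tail carried into the next chunk */
    char *buf = malloc(keep + chunk);
    if (buf == NULL)
        return -1;

    size_t have = 0;                 /* carried bytes at the start of buf */
    long base = 0;                   /* file offset of buf[0] */
    long found = -1;
    size_t got;

    while (found < 0 && (got = fread(buf + have, 1, chunk, fp)) > 0) {
        size_t total = have + got;

        for (size_t i = 0; i + nlen <= total; i++) {
            if (memcmp(buf + i, needle, nlen) == 0) {
                found = base + (long)i;
                break;
            }
        }
        if (found < 0) {
            /* Keep the last nlen-1 bytes: a match may straddle the boundary. */
            size_t tail = total < keep ? total : keep;
            memmove(buf, buf + total - tail, tail);
            base += (long)(total - tail);
            have = tail;
        }
    }
    free(buf);
    return found;
}
```

Because the carried tail is shorter than the needle, it can never contain a complete match on its own, so nothing is reported twice.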
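You could use a table as a buffer:
-- Reads a file line by line and joins the pieces with table.concat,
-- which avoids repeated string concatenation.
-- Note: io.lines still reads each line whole, so this does not help
-- when the file is one giant line.
function readFile(file)
    local t = {}
    for line in io.lines(file) do
        t[#t + 1] = line .. "\n"
    end
    local s = table.concat(t)
    return s
end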
I am learning to code in Unix with C. So far I have written the code to find the index of the first byte of the line that I want to replace. The problem is that sometimes the number of bytes in the replacement is greater than the number of bytes already on the line. In this case, the code starts overwriting the next line. I came up with two standard solutions:
a) Rather than trying to edit the file in place, I could copy the entire file into memory, edit it by shifting all the bytes if necessary, and write it back to the file.
b) Copy only the part from the line I want to change through end-of-file into memory and edit that.
Neither approach scales well, and I don't want to impose any restrictions on the line size (like every line must be 50 bytes or something). Is there any efficient way to do the line replacement? Any help would be appreciated.
Copy the first part of the file to a new file (no need to read it all into memory). Then, write the new version of the line. Finally, copy the final part of the file. Swap files and done.
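A minimal sketch of those steps, assuming you already have the byte offset of the line (as in the question). The name replace_line and the tmp_path parameter are made up for this example. It also assumes POSIX rename(), which replaces an existing destination, and a temporary file on the same filesystem as the original.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* replace_line and its parameters are hypothetical names for this sketch.
 * Rewrites `path` so that the line starting at byte `offset` becomes `text`
 * (given without its newline). Returns 0 on success, -1 on failure. */
int replace_line(const char *path, const char *tmp_path,
                 long offset, const char *text)
{
    assert(path != NULL && tmp_path != NULL && text != NULL && offset >= 0);

    FILE *in = fopen(path, "rb");
    if (in == NULL)
        return -1;
    FILE *out = fopen(tmp_path, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    int c = 0;
    int ok = 1;

    /* 1. Copy everything before the line (stdio buffers the byte I/O). */
    for (long i = 0; ok && i < offset && (c = getc(in)) != EOF; i++)
        ok = putc(c, out) != EOF;

    /* 2. Write the new version of the line. */
    size_t len = strlen(text);
    ok = ok && fwrite(text, 1, len, out) == len && putc('\n', out) != EOF;

    /* 3. Skip the old line, up to and including its newline. */
    while (ok && (c = getc(in)) != EOF && c != '\n')
        ;

    /* 4. Copy the rest of the file. */
    while (ok && (c = getc(in)) != EOF)
        ok = putc(c, out) != EOF;

    ok = ok && !ferror(in);
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;

    /* 5. Swap files: POSIX rename() replaces an existing destination. */
    if (ok && rename(tmp_path, path) != 0)
        ok = 0;
    if (!ok)
        remove(tmp_path);
    return ok ? 0 : -1;
}
```

Memory use stays constant regardless of file or line size, and the original file is left untouched if anything fails before the rename. One simplification to be aware of: the sketch always ends the new line with '\n', even if the old last line had none.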
I'm trying to write a console program that reads characters from a file.
I want it to be able to read from a Unicode file as well as an ANSI one.
How should I address this? Do I need to programmatically distinguish the type of file and read accordingly, or can I somehow use Windows API data types like TCHAR?
Is the only difference between reading the files that in Unicode I have to read 2 bytes per character, while in ANSI it is 1 byte?
I'm a little lost with this Windows API.
I would appreciate any help.
thanks
You can read the first chunk of the file in binary mode into a buffer (1 KB should be enough) and use the IsTextUnicode function to determine whether it is likely to be some variety of Unicode. Note that unless this function finds strong evidence of Unicode text (e.g. a BOM), it essentially performs a statistical analysis of the buffer to decide what the data looks like, so it can give wrong results. A case in which it fails is the (in)famous "Bush hid the facts" bug.
Still, you can set how much guesswork is done using its flags.
Notice that, if your application does not really manipulate/display the text you may not even need to determine if it's ANSI or Unicode, and just let it be encoding agnostic.
I'm not sure if the Windows API has some utility methods for all kinds of text files.
In general, you need to read the BOM of the file (http://en.wikipedia.org/wiki/Byte_Order_Mark), which, when present, tells you which Unicode encoding is actually used so that you can read the characters correctly.
You read bytes from a file, and then parse these bytes as the expected format dictates.
Typically you check if a file contains UTF text by reading the initial BOM, and then proceed to read the remainder of the file bytes, parsing these the way you think they're encoded.
Unicode text is typically encoded as UTF-8 (1-4 bytes per character), UTF-16 (2 or 4 bytes per character), or UTF-32 (4 bytes per character).
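As a rough sketch of the BOM check, the function below inspects the first few bytes you have read from the file. The enum and the name detect_bom are made up for this example; the byte sequences themselves are the standard BOM encodings. A file without a BOM returns BOM_NONE, which tells you nothing either way, so you would still need a fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for this sketch; not part of any library. */
enum bom_kind {
    BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF32_LE, BOM_UTF32_BE
};

/* Inspects the first bytes of a file for a byte order mark. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* UTF-32 LE must be tested before UTF-16 LE: both start with FF FE. */
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return BOM_UTF32_LE;
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return BOM_UTF32_BE;
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return BOM_UTF8;
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return BOM_UTF16_LE;
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

One known ambiguity: a UTF-16 LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32 LE by this test. That is rare in practice, but worth knowing.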
Don't worry, there is no difference between reading ANSI and Unicode files. The difference arises only when you process the bytes you have read.
I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like BOM [byte-order-marker])
user direction (just edit the file, goshdarnit)
Windows used to provide an API IsTextUnicode that would do a probabilistic examination, but there were well-known false-positives.
My take is that trying to be smarter than the user has some issues...
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, and instead ask the user for the go-ahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long, assuming you're not performing a Y2K conversion (a la sed 's/y/k/g').
It's pretty costly to determine if a file is text-based or not (i.e. a binary file). You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check whether it contains only alphanumeric characters.
So basically the first thing you have to do is check the file encoding. If it is pure ASCII, you have an easy task: just read the whole file into a char array (I'm assuming you are doing it in C/C++ or similar) and check every char in that array with functions such as isalpha and isdigit. Of course, you have to allow for exceptions like the tab '\t', the space ' ', and the newline ('\n' on Linux, "\r\n" on Windows).
With a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric. Also note that for UTF-16 or wider encodings a plain char is simply too small to hold one character, but if you are doing it in, for example, C#, you don't have to worry about the size :)
You can write a function that will try to determine if a file is text based. While this will not be 100% accurate, it may be just enough for you. Such a function does not need to go through the whole file, about a kilobyte should be enough (or even less). One thing to do is to count how many whitespaces and newlines are there. Another thing would be to consider individual bytes and check if they are alphanumeric or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach and text encodings might complicate things.
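Here is one possible shape for such a function, as a starting point for those experiments. Everything about it is an assumption of mine rather than an established rule: the name looks_like_text, rejecting any NUL byte, and the 5% threshold for control characters. It treats bytes of 0x80 and above as acceptable so that UTF-8 and ANSI text pass, which also means UTF-16 text (full of NUL bytes) is rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Guesses whether a sample of bytes (e.g. the first kilobyte of a file)
 * looks like text. Returns 1 for "probably text", 0 otherwise.
 * looks_like_text and its thresholds are hypothetical choices. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t suspicious = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == 0)
            return 0;             /* NUL bytes are rare in 8-bit text */
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
            suspicious++;         /* control characters other than whitespace */
    }
    /* Assumed threshold: at most 5% control characters in the sample. */
    return suspicious * 20 <= len;
}
```

You would call it on a buffer filled with a single fread of the start of the file. An empty sample counts as text here, which may or may not be what you want.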