How to dynamically change the string from the i/o stream in c - c

I was looking at a problem in K&R (Exercise 1-18), which asked to remove any trailing blanks or tabs. That pushed me to think about text messengers like Whatsapp. The thing is lets say I am writing a word Parochial, then the moment I had just written paro, it shows parochial as options, I click on that replaces the entire word (even if the spelling is wrong written by me, it replaces when I chose an option).
What I am thinking is the pointer goes back to the starting of the word or say that with start of every new word when I am writing, the pointer gets fixed to the 1st letter & if I choose some option it replaces that entire word in the stream (don't know if I'm thinking in the right direction).
I can use getchar() to point at the next letter but how do I:
1: Go backward from the current position of the pointer pointing the stream?
(By using fseek())?
2: How to fix a pointer a position in an I/o stream, so that I can fix it at the beginning of a new word.
Please tell me my approach is correct or understanding of some different concept is needed. Thanks in advance

Standard streams are mainly for going forward*, minimizing the number of IO system calls, and for avoiding the need to keep large files in memory at once.
A GUI app is likely to want to keep all of its display output in memory, and when you have the whole thing in memory, going back and forth is just a simple mater of incrementing and decrementing pointers or indices.
*(random seeks aren't always optimal and they limit you from doing IO on nonseekable files such as pipes or sockets)

Related

what's more efficient: reading from a file or allocating memory

I have a text file and I should allocate an array with as many entries as the number of lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once, and use "realloc" after each line read? thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input teice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the read lines also and not just allocate an array of that many entries.
Also that you don't want to change the lines and then write them back as in that case you might be better off using mmap.
Reading a file twice is always bad, even if it is cached the 2nd time, too many system calls are needed. Also allocing every line separately if a waste of time if you don't need to dealloc them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Alloc an array
Put the start pointers into the array by finding the same line feeds again.
If you need it as strings, then replace the line feed with \0
This might also be improved upon on modern cpu-architectures, instead of reading the array twice it might be faster simply allocating a "large enough" array for the pointer and scan the array once. This will cause a realloc at the end to have the right size and potentially a couple of times to make the array larger if it wasn't large enough at start.
Why is this faster? because you have a lot of if's that can take a lot of time for each line. So its better to only have to do this once, the cost is the reallocation, but copying large arrays with memcpy can be a bit cheaper.
But you have to measure it, your system settings, buffer sizes etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.

Why can't we strip the begining or end of files so fast?

I was reading the answers to this question and this one. comparing files to C arrays if we want to remove some elements from the middle of a C array we have to shift all other elements so it takes so much time. But if we want to remove the first few items or the last few ones, we can just change the pointers and deallocate the removed elements with takes almost no time, and it's independent of the length of the array. I was wondering why there is no such way for truncating files. I thought maybe there are some meta data about the file in the beginning or end of files that cause the issue but if that is the case I think it must be in either beginning or end of the file so we must be able to remove a few lines from at least one of beginning and end so fast. But it seems like it's not possible. Why is that? What am I missing here?
I need this because I have a 10GB file that I have to remove lines from it's beginning or end one by one until it's empty. I'm on ubuntu 16.04 but I would love to know if there are other solutions in other OS.
This is a very simplistic answer, but files are normally stored on disk by 'pages/blocks' which have a certain amount of bytes. So in theory it would be relatively easy to remove the exact page/block size. Because afaik all blocks are filled completely (except the last one). However, in practice, the chance this is used is very low that exactly the amount of bytes should be removed from the beginning of a file.
For the end of the file it would be easier, but also in this case it does not happen often, and therefore no 'generic' way is implemented for it.

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read text (source) file in memory and be able to perform random access into (say for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffer that will hold the character data of the file. I would also make an array, whose index are line numbers and content are addresses corresponding to the begin of the line. This array would be reallocated on need.
Subsidiary question: does anyone have an idea the average size of a source file ? I was surprised not to find this on google.
To clarify:
The file I'm concerned about are source files, so their size should be manageable, they should not be modified and the lines have variables length (tough hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conlusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only, just get the file size, read it in memory with fread(), then you have to maintain a dynamic array which maps the line numbers (index) to pointer to the first character in the line. Someone below suggested to build this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
solution: If lines are about same size, make all lines equally long by appending needed number of metacharacters to each line. Then you can simply calculate the fseek() position from line number, making your search O(1).
If lines are sorted, then you can perform binary search, making your search O(log(nõLines)).
If neither, you can store the indexes of line begginings. But then, you have a problem if you modify file a lot, because if you insert let's say X characters somewhere, you have to calculate which line it is, and then add this X to the all next lines. Similar with with deletion. Yu essentially get O(nõLines). And code gets ugly.
If you want to store whole file in memory, just create aray of lines *char[]. You then get line by first dereference and character by second dereference.
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
If you build suitable helper functions for malloc and delete, and make it so the structs can either delete themselves automatically or manually. Using the above combinations won't get you O(1) read in (which isn't possible without the files have a static format), but it will get you good time.
If you know the file static length (at least individual line wise), IE no bigger than 256 chars per line, then all you need is the DynamicListOfArries - write directly to the array (preset to 256), create a new one, repeat. Downside is it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file could go from 0 bytes to thousands of bytes, like any text file, it depends on the number of caracters it contains

Taking values through input and diplaying them during run time

I need to write a program where during run time, a set of integers of arbitrary size will taken as input. They will be seperated by white space. At the end, a new line is given, showing the end of input. How do I save them into an array of integers so that i can display them later. I think it is a little difficult because the number of values that will be entered is not known during compilation
Sounds like homework.
Correct me if I am wrong and I will give you more than hints.
You can either declare an array of a really large size that would not possibly be filled by the user input, then use scanf or something like that to grab the integers until you hit '\n', or you can grab each integer at a time, allocating memory as you go, using a combination of malloc and memcpy calls. The first option should never be done in a real world problem, and I am certainly not advocating such practices even though your textbook probably tells you to do it this way.
There is an example just like this in K&R.
This is a typical problem you will have in C. The solution is usually one of two options.
Use a really large array that is large enough to hold the input. Sometimes this is a poor option when the data could be really large. An example of when it would be a bad idea is when you are saving a video frame or a large text file to the array. This also opens you up to a buffer overrun attack in older versions of Windows. However, this is sometimes a good quick hack solution for smaller (homework) programs where you can count on the user (i.e. your professor who is not trying to break your program) to not input 1000's of characters. Usually this is considered bad practice, please consider my 2nd option for the security reason I mentioned before.
Use dynamic arrays (i.e. malloc). This is probably what your professor wants you to do as this sounds like a typical problem to use when a student is first learning pointers and arrays. This is a great approach, just remember to call free on your memory when you are finished. The tricky part here is that you still have to know the size of the array you want ahead of time (not at compile time though of course).

How do you read a file until you hit a certain string in c?

I wanted to know how, in C, you can read a certain file until the reading hits a certain string, or character array. What I want to be able to do is, once the file hits that string, I want the position to be set at that point. I am going to use fseek for that, and that's not a problem. It's just the reading until a certain string is hit that I am not able to do. I've been reading up on some of the functions, but there doesn't seem to be anything that guides with this. Fgets is the closest thing to this, but I don't want to provide a certain number of characters to be read, as I don't know how many. But can you give me some tips on how to do this?
Thanks!
There are many efficient string searching algorithms, each of which can be implemented in C.
http://en.wikipedia.org/wiki/String_searching_algorithm
If you're looking for a string of length N, easiest is to keep a circular buffer of length N and read 1 byte at a time from the file adding it to the circular buffer. At each step you compare your buffer with the string you're searching for. It's highly inefficient but easy to code.
There's no built-in function to do exactly what you want, but there are a few options.
Option one: Read data in chunks. You don't know exactly where your data is, so read in a few kbs of data at a time, and search within these chunks. Make sure you deal with the case where the string you're looking for straddles a chunk boundrary! Once you've located the string, use fseek() to position yourself at the start of it.
Option two: Memory map the file and use memmem() on the entire file (as mapped into memory). This requires unportable calls to set up the memory mapping, so you'll need to know your OS (or use a portability wrapper library like glib). On 32-bit machines, it will also limit the size of files you can search in to a few hundred megabytes. It is, however, a very simple and efficient approach when it's an option.
If you go with option one, the trickiest part will be dealing with the chunk-straddling case. One option is to always keep two chunks in memory, and restart the search so it begins (length of target string) - 1 bytes before the end of the previous block. The actual search could then be done using memmem() or any other string searching algorithm. You could also convert your search into a DFA (since it is a regular language) and keep the current state across blocks.

Resources