I need to write a program where, at run time, a set of integers of arbitrary size will be taken as input. They will be separated by whitespace. At the end, a newline is given, marking the end of input. How do I save them into an array of integers so that I can display them later? I think it is a little difficult because the number of values that will be entered is not known at compile time.
Sounds like homework.
Correct me if I am wrong and I will give you more than hints.
You can either declare an array of a really large size that would not possibly be filled by the user input, then use scanf or something like that to grab the integers until you hit '\n', or you can grab each integer at a time, allocating memory as you go, using a combination of malloc and memcpy calls. The first option should never be done in a real world problem, and I am certainly not advocating such practices even though your textbook probably tells you to do it this way.
There is an example just like this in K&R.
This is a typical problem you will have in C. The solution is usually one of two options.
Use a really large array that is large enough to hold the input. Sometimes this is a poor option when the data could be really large; an example of when it would be a bad idea is when you are saving a video frame or a large text file to the array. It also opens you up to a buffer-overrun attack in older versions of Windows. However, this is sometimes a good quick-hack solution for smaller (homework) programs where you can count on the user (i.e. your professor, who is not trying to break your program) not to input thousands of values. Usually this is considered bad practice, so please consider my second option for the security reason mentioned above.
Use dynamic arrays (i.e. malloc). This is probably what your professor wants you to do as this sounds like a typical problem to use when a student is first learning pointers and arrays. This is a great approach, just remember to call free on your memory when you are finished. The tricky part here is that you still have to know the size of the array you want ahead of time (not at compile time though of course).
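A minimal sketch of that second option, assuming (as in the question) the input is a single line of whitespace-separated integers ended by a newline; the grow-by-doubling pattern and the variable names are just one way to do it:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t cap = 8, count = 0;
    int *values = malloc(cap * sizeof *values);   /* start small, grow later */
    if (!values) return 1;

    int c, num;
    /* Parse integers until the end-of-input newline (or EOF). */
    while ((c = getchar()) != '\n' && c != EOF) {
        if (c == ' ' || c == '\t')
            continue;                 /* skip whitespace between numbers */
        ungetc(c, stdin);             /* push the first digit (or sign) back */
        if (scanf("%d", &num) != 1)
            break;                    /* stop on malformed input */
        if (count == cap) {           /* array full: double its capacity */
            cap *= 2;
            int *tmp = realloc(values, cap * sizeof *values);
            if (!tmp) { free(values); return 1; }
            values = tmp;
        }
        values[count++] = num;
    }

    for (size_t i = 0; i < count; i++)   /* display them later, as asked */
        printf("%d\n", values[i]);
    free(values);
    return 0;
}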
TL;DR:
I am asking you to tell me what would be the most efficient approach to double my strings and print them out?
Full story:
I had trouble with the title, and the actual problem may be a bit different than you expect.
Imagine I have a main buffer.
At some index determined by the program, I want to insert a string.
But every char in that string needs to be doubled.
So "abc", inserted at index 10 of buffer[999], needs to be "aabbcc".
Now, the second part of the problem - this needs to be as efficient as possible. I could make this easily, but I need the fastest option.
I thought I had devised several approaches, but it boils down to:
fill buffer(1000) with single chars and double the chars when printing (pushing to stdout)
fill buffer(2000) with double chars and print like normal
The variations on the second approach come down to when to double the chars: while copying into the buffer, or by generating "aabbcc" up front and copying the full thing.
The first approach would be the most intuitive, but I fear I would need to devise a low-level char-doubling function, because putc and printf and any large number of function calls will have a lot of overhead. (There are allegedly very efficient functions in libc built on bit-shifting and pointer magic, but I couldn't find them. I can only find the very disappointing versions where fgets() is just a wrapper around getc() - which can't be efficient.)
The second approach obviously wastes a lot of memory and requires a lot of copying, but it could probably put everything into stdout more efficiently as a chunk without the overhead of copying single chars.
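For concreteness, here is roughly what the second approach could look like: double the characters into a buffer twice the size, then push the whole thing to stdout in one call. The helper name and the buffer sizes are illustrative, not taken from the question:

#include <stdio.h>
#include <string.h>

/* Insert src into dst starting at pos, doubling every character.
   Returns the index just past the inserted text. */
static size_t insert_doubled(char *dst, size_t pos, const char *src) {
    size_t n = strlen(src);
    for (size_t i = 0; i < n; i++) {
        dst[pos + 2 * i]     = src[i];
        dst[pos + 2 * i + 1] = src[i];
    }
    return pos + 2 * n;
}

int main(void) {
    char buffer[2000];
    memset(buffer, '.', 10);                          /* stand-in for existing content */
    size_t end = insert_doubled(buffer, 10, "abc");   /* "aabbcc" lands at index 10 */
    fwrite(buffer, 1, end, stdout);                   /* one chunked write, no per-char calls */
    putchar('\n');
    return 0;
}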
I am unsure whether underneath everything there is just a system write call, and I also lack the knowledge of how it works. I am just going with my research, which says that fgets is about 12 times faster than fgetc for equal data. And so I assume it is with all the single-char vs. line functions.
So in conclusion, I am asking you to tell me what would be the most efficient approach to double my strings and print them out?
For example, the program grabs the line: Hello World! and assigns the string to a dynamic array.
The length of each line is unknown and I want compatibility for all sizes.
getline() is the obvious answer here like Barmar suggested, but fgets() is also an option (see https://en.wikibooks.org/wiki/C_Programming/stdio.h/fgets).
But from what I understand, you don't know its size, yet you want to put it into a perfectly sized dynamic array right off the bat? That's gonna take some crafty thinking and is difficult in a compiled language. The only way I can think of off the top of my head is quite slow to execute: open the file twice, once to get each line's size, and a second time to read each line in after malloc'ing the correct number of bytes, storing pointers to these dynamic arrays in a list. This is going to take a lot longer to execute, so if you're not limited on CPU power, it may be an option.
Normally, you'd just know what maximum size to expect and have the array defined at that maximum size. In the grand scheme of things, an extra 50 bytes isn't gonna hurt anything... which hurts me as an embedded guy to say that, but computers have large enough memory these days...
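For what it's worth, a minimal sketch of the getline() route mentioned above; getline() (POSIX) allocates and grows the buffer itself, so each line ends up in a heap array of the right size. The variable names are illustrative:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *line = NULL;    /* getline allocates and reallocs this for us */
    size_t cap = 0;       /* current allocation size, updated by getline */
    ssize_t len;

    while ((len = getline(&line, &cap, stdin)) != -1) {
        /* line now holds the whole line (including '\n'), len is its length */
        printf("read %ld bytes: %s", (long)len, line);
    }
    free(line);
    return 0;
}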
I have a text file and I should allocate an array with as many entries as the number of lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once, and use "realloc" after each line read? thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input twice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
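For concreteness, a hedged sketch of the read-once-and-realloc approach this answer argues for: the line-pointer array grows geometrically, so the "few extra malloc calls" stay few. It uses POSIX getline() and strdup(); the names are illustrative:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char **lines = NULL;
    size_t count = 0, cap = 0;
    char *buf = NULL;
    size_t bufcap = 0;
    ssize_t len;

    while ((len = getline(&buf, &bufcap, stdin)) != -1) {
        if (count == cap) {                        /* grow geometrically */
            cap = cap ? cap * 2 : 16;
            char **tmp = realloc(lines, cap * sizeof *lines);
            if (!tmp) return 1;
            lines = tmp;
        }
        lines[count++] = strdup(buf);              /* keep a private copy of the line */
    }
    free(buf);

    printf("read %zu lines\n", count);
    for (size_t i = 0; i < count; i++)
        free(lines[i]);
    free(lines);
    return 0;
}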
I presume you want to store the read lines also and not just allocate an array of that many entries.
Also that you don't want to change the lines and then write them back, as in that case you might be better off using mmap.
Reading a file twice is always bad; even if it is cached the second time, too many system calls are needed. Also, allocating every line separately is a waste of time if you don't need to deallocate them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Alloc an array
Put the start pointers into the array by finding the same line feeds again.
If you need them as strings, replace each line feed with '\0'.
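A minimal sketch of those steps, with an invented file name and only rough error handling:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *f = fopen("input.txt", "rb");      /* hypothetical file name */
    if (!f) return 1;

    /* Read the entire file into one allocated area. */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *data = malloc(size + 1);
    if (!data || fread(data, 1, size, f) != (size_t)size) return 1;
    data[size] = '\0';
    fclose(f);

    /* Find the number of lines by counting line feeds. */
    size_t nlines = 0;
    for (long i = 0; i < size; i++)
        if (data[i] == '\n') nlines++;

    /* Alloc an array, then put the start pointers in by finding the
       same line feeds again, replacing each with '\0'. */
    char **lines = malloc(nlines * sizeof *lines);
    if (!lines) return 1;
    size_t li = 0;
    char *start = data;
    for (long i = 0; i < size && li < nlines; i++) {
        if (data[i] == '\n') {
            data[i] = '\0';
            lines[li++] = start;
            start = &data[i + 1];
        }
    }

    for (size_t i = 0; i < nlines; i++)
        puts(lines[i]);
    free(lines);
    free(data);
    return 0;
}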
This might also be improved on modern CPU architectures: instead of scanning the buffer twice, it might be faster to allocate a "large enough" array for the pointers and scan the buffer only once. That means a realloc at the end to shrink to the right size, and potentially a couple more along the way to enlarge the array if it wasn't big enough to start with.
Why is this faster? Because you have a lot of branches that take time for each line, so it's better to only have to do that work once. The cost is the reallocation, but copying large arrays with memcpy can be comparatively cheap.
But you have to measure it, your system settings, buffer sizes etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.
My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice. First to determine the size of the problem (to allocate arrays) and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
Read the file once to get the array size, allocate, read again.
Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
Always, always, work with files which contain metadata to tell an interested program how much data there is; for example a block header line telling you how many data elements are in the next block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project saves so much wasted time and effort down the line. You don't have to jump on HDF5 or a similar heavyweight file-design method; just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc.), use HDF5 or something similar.
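The question is about Fortran, but the header-with-a-count idea is language-agnostic; here is a minimal sketch in C, with an invented format that puts the element count on the first line of the file:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *f = fopen("data.txt", "r");            /* hypothetical file name */
    if (!f) return 1;

    size_t n;
    if (fscanf(f, "%zu", &n) != 1) return 1;     /* header: how many values follow */

    double *values = malloc(n * sizeof *values); /* allocate once, exactly */
    if (!values) return 1;
    for (size_t i = 0; i < n; i++)
        if (fscanf(f, "%lf", &values[i]) != 1) return 1;
    fclose(f);

    printf("read %zu values\n", n);
    free(values);
    return 0;
}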
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
I would add to the previous answer:
If your data has a regular structure and it's possible to open it as a text file, press Ctrl+End, subtract the header lines from the total row count, and there it is. Although you may waste time opening it if it's very large.
I've done some searching and have not found anything that would boost the file and formatting functions in Visual Studio VS2010 C (not C++).
I've been able to address the raw i/o issues to some extent by using large buffers and a SSD drive, so the more pressing issue is a replacement for the family of printf functions.
Has anyone found something worthwhile?
As I understand it, part of the glacial speed issue with the printf functions is that they have to handle myriad types of arguments. Does anyone have experience with writing a datatype-specific version of printf; eg, one that only prints ints, or only prints doubles, etc?
First off, you should profile the code first before assuming it's printf.
But if you're sure it's printf and similar then you can do a few things to fix the issue.
1) Print less. That is, don't call expensive operations as often if you can avoid it. Do you need all the output, for example?
2) Replace the formatted output with manually built routines that write out each piece without having to parse the format specifier.
EG: printf("--%s--", "really cool");
Can become:
write(1, "--", 2);
write(1, "really cool", 11);
write(1, "--", 2);
That may be faster. But again, you won't know until you profile it. Don't spend energy on a solution till you can confirm it's the solution you need and be able to measure the success of your proposed solution.
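To illustrate the kind of datatype-specific routine the question asks about, here is a hedged sketch of an int-only writer that skips format-string parsing entirely; print_int is an invented name, and only profiling will tell you whether it actually beats your printf:

#include <stdio.h>

/* An int-only "printf": convert manually, then write the digits in one call. */
static void print_int(int value) {
    char buf[12];                        /* enough for -2147483648 */
    char *p = buf + sizeof buf;
    unsigned int u = (value < 0) ? 0u - (unsigned)value : (unsigned)value;

    do {                                 /* emit digits backwards */
        *--p = (char)('0' + u % 10);
        u /= 10;
    } while (u);
    if (value < 0)
        *--p = '-';

    fwrite(p, 1, (size_t)(buf + sizeof buf - p), stdout);
}

int main(void) {
    print_int(-12345);
    putchar('\n');
    return 0;
}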
@Wes is right: never assume you know what you need to fix until you have proof.
Where I differ is on the method of finding out.
I and others use random pausing which works for these reasons, and here's a short slide show demo & C++ code so you can see how it works, if you want.
The thing about printf (or any output) function is it spends A) a certain number of CPU cycles creating a buffer to be output, and then it spends B) a certain amount of time waiting while the system and/or auxiliary hardware actually moves the data out.
That's maybe a bit over-simplified, but if you randomly pause and examine the state, that's what you see.
What you've done by using large buffers and an SSD drive is reduce B, and that's good.
That means of the time remaining, A is a larger fraction.
You know that.
Now of the samples you find in A, you might get a hint of what's happening if you see what subordinate routines inside printf are showing up.
Usually printf calls something like vprintf to get rid of the variable argument list, which then cycles over the format string to figure out what to do, including things like parsing precision specifiers.
If it looks like that's what it's doing, then you know about how much time goes into parsing the format.
On the other hand, if you see it inside a routine that is copying a string, or formatting an integer (along with dealing with leading/trailing characters, etc.) then you know to concentrate on that.
On yet another hand, if you see it inside a routine that looks like it's formatting a floating point number (which is actually quite complicated), you know to concentrate on that.
Given all that, you want to know what I do?
First, I ask who is going to read this anyway?
If nobody really needs to read all this text, why not pump it out in binary? Or failing that, in hex?
If you simply write binary, A shrinks to nothing, and when you read it back in with another program, guess what?
No Lost Bits!
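A minimal sketch of that binary-dump idea, with an invented file name: one fwrite replaces all the per-value formatting, and a matching fread recovers the exact bits:

#include <stdio.h>

int main(void) {
    double data[3] = {1.5, 2.25, 3.125};

    /* Writing: one fwrite of the raw bytes, no formatting work at all. */
    FILE *out = fopen("dump.bin", "wb");     /* hypothetical file name */
    if (!out) return 1;
    fwrite(data, sizeof data[0], 3, out);
    fclose(out);

    /* Reading it back in another program is a single fread into the same type. */
    double back[3];
    FILE *in = fopen("dump.bin", "rb");
    if (!in || fread(back, sizeof back[0], 3, in) != 3) return 1;
    fclose(in);
    printf("%g %g %g\n", back[0], back[1], back[2]);   /* same bits, nothing lost */
    return 0;
}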