I've seen many answers online saying that checking the size of a list is constant time, but I don't understand why.
My understanding was that a list isn't stored in contiguous memory chunks (like an array), meaning there is no way of getting the size of a list (last element index + 1), without first traversing through every element.
Thoughts?
I've seen many answers online saying that checking the size of a list is constant time.
This fallacy may originate in a fundamental flaw in the Python language: arrays are called lists in Python. As Python gains popularity, the word list has become ambiguous.
Computing the length of a linked list is an O(n) operation, unless the length has been stored separately and maintained properly.
Retrieving the size of an array is performed in constant time if the size is stored along with the array, as is the case in Python, so a=[1,2,3]; len(a) is indeed very fast.
Computing the length of an array may be an O(n) operation if the array must be scanned for a terminating value, such as a null pointer or a null byte. Thus strlen() in C, which computes the number of bytes in a C string (a null terminated array of char) operates in linear time.
The only way to find the length of a linked list without storing it beforehand is to iterate through it. The nodes are not stored contiguously, so you must follow the links until you reach the last element; the time complexity is therefore O(n). Of course, if you keep a variable holding the length and increment it every time an element is added (and decrement it when one is removed), you don't need to iterate at all, which makes the operation constant time: you just read the stored value. One common choice is to keep the length in the list's root/head structure, making it unnecessary to loop through the list at all. In short, you are correct: the only reason one might call it constant time is that the length is stored beforehand.
Related
I have found the usual answer for determining the size of an array - 'variable = sizeof buffer / sizeof *buffer;' or similar - which gives the declared size.
Other than passing a counter variable around as an array is 'filled', is there a way (a command or function - something similar to the above one-liner) to find the used number of elements in a given array?
And I don't mean searching for an end-of-line character or implementing a circular buffer. I only expect between 5 and 25 bytes in any given transaction.
Edit:- for omissions in original post (as requested)
Language C (via Atmel Studio)
8bit Bytes (non characters)
I don't know the meaning of 'strong-typed' and 'associative', but eventually it will be semi random bytes from a device and not fixed numbers like in a string
Fixed size declared array (ultimately volatile)
Thanks
The C standard generally does not provide any way for tracking or determining which or how many elements of an array are used.
I need to use array of graph data, i.e. struct with x and y integers. This array will be passed through many functions, and I need to decide the API choice.
typedef struct {
int x;
int y;
} GraphData_t;
How should I choose whether to use NULL-termination for the array, or supply count variable?
I have three approaches for my API:
1: loadGraph(GraphData_t *data, int count); //use count variable
2: loadGraph(GraphData_t *data); // use null-termination (or any other termination value)
typedef struct {
GraphData_t *data;
int count;
} GraphArray_t;
3: loadGraph(GraphArray_t *data); //use a struct which has integrated count variable
So far these seem equal to me. Which one would be the preferable method, and why?
As a rather old dinosaur, I will use history here.
Anyway, the size + pointer idiom is the multi-purpose and bulletproof way. If in doubt, use it.
The delimited way is just more natural for human beings, especially when you want to initialize an array: there is no need to manually count the items (with the risk of an off-by-one mistake, especially if you later add or remove elements from the initialization list); you just add the delimiter as the last element. BTW, it is the way we use lines in text files... But anyway, the sizeof(array)/sizeof(array[0]) idiom allows you to get the size easily and automatically...
The NULL-terminated idiom comes from the beginning of microprocessors, when code was kept close to the hardware for performance reasons: comparison to 0 was the fastest test, and memory was expensive. Programmers began to end their constant strings with a NULL character for that reason: only one byte of overhead, even if the string was longer than 256 characters. You find references to this ASCIIZ idiom in MS/DOS 2 manuals, but it had been made popular by the pair of Unix and the K&R C language since the '70s.
It is still convenient, and still used in C strings, but many higher-level tools like C++ std::string now prefer the counted idiom, which does not require reserving a forbidden value.
For daily programming, the (null) terminated idiom should only be used when an array can only be browsed forward, and when you have no special need for the size. But beware, if you simply want to copy a null terminated array, you have to scan it twice: once for its size and once for its data.
Null termination is a convention that can only be used if the null value is excluded from the set of legal values for the entries. For example the string array argv in main has a null pointer at the end because a null pointer cannot be a legal string.
In your case, the array elements are structures with 2 int coordinates. You would need to decide what values to consider invalid for these coordinates. If all values are OK, then you must pass the number of elements explicitly. Passing the array length explicitly is preferred in all cases, as it avoids unnecessary scans. Indeed main also gets the length of the argv array as a separate int argument argc.
Whether to encapsulate the array pointer and the length in a structure is a matter of style and convenience. For complex structures, it is preferable to group all characteristics in a structure, but for a simple array, it may be more convenient to pass a pointer and a size explicitly as it allows you to apply the function to a subset of the array with loadGraph(data + i, j).
While all approaches can of course get the job done, there are some differences which may or may not be relevant for your use-case.
Null-termination is very convenient if the user needs or wants to use hard-coded arrays, because then they can just add or delete entries, without needing to worry about possibly breaking the application (unless they remove the terminator of course).
Since the size is unknown, almost every function working with a null-terminated array needs to iterate over the whole thing. This might be a problem if the array is large, and many functions usually wouldn't actually need to access all entries (or not in the order they are stored).
The terminator itself obviously needs to be a value that can never occur in your actual data. So, depending on your data, there might not be an obvious candidate to use as terminator (or even none at all).
There are probably more subtle differences which might influence your decision, but these are the first ones that came to my mind.
I'm working in ANSI C with lots of fixed-length arrays. Rather than setting an array-length variable for every array, it seems easier just to add a "NULL" terminator at the end of the array, similar to character strings. For my current app I'm using "999999", which would never occur in the actual arrays. I can execute loops and determine array lengths just by looking for the terminator. Is this a common approach? What are the issues with it? Thanks.
This approach is technically used for your main arguments, where the last value of argv is a terminating NULL, but it's also accompanied by an argc that tells you the size.
Using just terminals sounds like it's more prone to mistakes in the future. What's wrong with storing the size along with an array?
Something like:
struct fixed_array {
unsigned long len;
int arr[];
};
This will also be more efficient and less error-prone.
The main problem I can think of is that keeping track of the length is genuinely useful: there are built-in functions in C that take a length as a parameter, and you need to know the length to know where to add the next element.
In reality it depends on the size of your array. If it is a huge array, then you should keep track of the length; otherwise, looping through it to determine the length every time you want to add an element to the end would be very expensive: O(n) instead of the O(1) you normally get with arrays.
The main problem with this approach is that you can't know the length in advance without looping to the end of the array - and that can affect the performance quite negatively if you only want to determine the length.
Why don't you just
Initialize it with a const int that you can use later in the code to check the size, or
Use int len = sizeof(my_array) / sizeof(the_type).
Since you're using 2-dimensional arrays to hold a ragged array, you could just use a ragged array: type *my_array[];. Or you could put the length in element 0 of each row and treat the rows as 1-indexed arrays. With some evil trickery you could even put the lengths at element -1 of each row![1]
Left as exercise ;)
I am almost sure there is no reverse strpbrk() in C99. But:
Is there a reason for that? I mean, why does strchr() have strrchr() but strpbrk() doesn't have strrpbrk()?
How do you get the last occurrence in a string of any of the characters in another string?
In my opinion, because no one thinks out of the box, stpcpy isn't part of C99 either :(
Look at glibc's strpbrk implementation to get some inspiration; it's not that hard:
/* Find the first occurrence in S of any character in ACCEPT. */
char *
strpbrk (const char *s, const char *accept)
{
  while (*s != '\0')
    {
      const char *a = accept;
      while (*a != '\0')
        if (*a++ == *s)
          return (char *) s;
      ++s;
    }
  return NULL;
}
Note that strpbrk() is NOT optimized if the second string is long and contains duplicates.
One obvious thing to do would be to first scan the second string up to a limited length (at most 256 bytes including the null terminator, since a string without duplicate bytes cannot be longer): if the string is found to be longer, it necessarily contains duplicate bytes.
During this scan, it can build a bitmap of the bytes seen. In packed form the bitmap needs only 32 bytes and can easily be allocated as an automatic array on the stack, but accessing individual bits takes longer; if you are not optimizing for stack space, you may prefer an array of 256 booleans stored one byte each, which uses 256 bytes of stack. The boolean at position 0 is set when the scan reaches the second string's null terminator, i.e. when that string is shorter than 256 bytes; its negation serves as a "long second string" indicator that can be kept in a register, together with the last position not yet scanned in the second string (at most s2+256). This initial step is typically short, and its worst case is bounded by design.
Now you can scan the first string, using each scanned byte as an index into the array of booleans: if the entry is set, you have found a match. Otherwise, check the indicator: if the second string has not been fully scanned, scan it further, feeding the array of booleans until you either find the current character of the first string or reach the null terminator, and update the last-scanned position kept in the local register.
In most cases this is a big optimization because you avoid the two nested loops (loops are costly because of their conditional jumps, which need branch prediction); each string is scanned, at least partly, at most once. The only price is the indirection into the array of booleans on the stack, but that array is small enough to fit in the CPU data cache, so the indirection is virtually free.
This works because strpbrk() operates on arrays of char, so there are only 256 possible values.
The initial scan of the second string will most often run to completion and fully populate the array (its terminating null byte will be detected and will set the boolean at position 0 to true).
A secondary optimization, when that first boolean is true (the second string is short), is to use a reduced second loop that only performs the indexing, without retesting whether the second string needs further scanning.
A profiler may reveal that you don't actually need to scan up to 256 bytes of the second string in the first loop; most code typically uses a second string at most 16 bytes long. You may even special-case an empty or single-character second string, where no array or extra register is needed at all: either return NULL directly (the second string is just a null byte) or scan the first string for a single character value or the null byte. The length to use for the initial scan of the second string could be determined by profiling your apps that use this function.
I bet that 16 would be good enough; you may find that a shorter value of 8 bytes is a bit better most of the time, but at the price of taking the more complex branch (with the embedded loop that conditionally continues scanning the second string and updates more of the array of booleans) more often. Profiling may also help you decide between a packed array of booleans (single bits instead of full bytes) and an extended array (booleans stored as plain words, which may be architecture-dependent), or whether you can use registers instead of an array (on architectures with many registers).
And as I said, the only cost is stack use. It can be limited if you pack the array; architectures with very small stacks could optionally use the heap instead, but the heap has a price: it generally involves complex code, more memory overhead, and extra costs for the internal function calls and calls to the system's API.
Some extreme optimizations may also use vector instructions. So glibc could still be largely optimized for the implementation of this function.
The same initialization could then be used to implement the proposed "reverse" strrpbrk(): it must scan the first string up to its terminating null byte and keep a local pointer register storing the last position found, instead of stopping at the first occurrence. Alternatively, you could implement it by calling strpbrk() repeatedly in a loop until it returns NULL.
For that implementation, don't use strrev(): reversing the first string has a cost. It requires at least two passes (one just to find the effective length), plus either extra storage (unbounded for the first string, so it cannot be allocated on the stack and must use the costly heap) or an in-place transform before scanning, followed by a second reversal to undo it. The in-place variant does not work at all if the first string is shared with other threads, and can cause security issues.
Because int lastMatchIdx = strlen(haystack) - 1 - (strpbrk(strrev(haystack), needles) - haystack); is too easy to write? It has the same complexity (though somewhat lower performance in practice).
for(char* h=haystack;(h=strpbrk(h,needles))!=NULL;rightMostMatch=h++); is similarly simple
As for how to make a strrpbrk, I suppose the best approach is to repeat strrchr() for each character in the second parameter and keep a record of the highest pointer.
Regarding the strpbrk function definition, I wish the standard's developers had been more precise. There should be two variants of this function: one that searches for one character at a time from param 2 and returns the first match found (the glibc variant), and one that returns the very first of the possible characters in the string (which MSVC seems to do).
But I guess the world won't ever be perfect...
In other languages like C++, you have to keep track of the array length yourself - how does Delphi know the length of my array? Is there an internal, hidden integer?
Is it better, for performance-critical parts, to not use Length() but a direct integer managed by me?
There are three kinds of arrays, and Length works differently for each:
Dynamic arrays: These are implemented as pointers. The pointer points to the first array element, but "behind" that element (at a negative offset from the start of the array) are two extra integer values that represent the array's length and reference count. Length reads that value. This is the same as for the string type.
Static arrays: The compiler knows the length of the array, so Length is a compile-time constant.
Open arrays: The length of an open array parameter is passed as a separate parameter. The compiler knows where to find that parameter, so it replaces Length with a read of that parameter's value.
Don't forget that the layout of dynamic arrays and the like would change in a 64-bit version of Delphi, so any code that relies on finding the length at a particular offset would break.
I advise just using Length(). If you're working with it in a loop, you might want to cache it, but don't forget that a for loop already caches the terminating bounds of the loop.
Yes, there are in fact two additional fields with dynamic arrays. First is the number of elements in the array at -4 bytes offset to the first element, and at -8 bytes offset there's the reference count. See Rudy's article for a detailed explanation.
For the second question, you'd have to use SetLength for sizing dynamic arrays, so the internal 'length' field would be available anyway. I don't see much use for additional size tracking.
Since Rob Kennedy gave such a good answer to the first part of your question, I'll just address the second one:
Is it better, for performance-critical parts, to not use Length() but a direct integer managed by me?
Absolutely not. First, as Rob mentioned, the compiler accesses the information extremely quickly: by reading a fixed offset before the start of the array in the case of dynamic ones, using a compile-time constant in the case of static ones, and reading a hidden parameter in the case of open arrays. You're not going to gain any improvement in performance.
Secondly, the direct integer managed by you wouldn't be any faster, but would actually use more memory (an additional integer allocated along with the one Delphi already provides for dynamic and open arrays, and an extra integer entirely in the case of static arrays).
Even if you directly read the value Delphi stores already for dynamic arrays, you wouldn't gain any performance over Length(), and would risk your code breaking if the internal representation of that hidden header for arrays changes in the future.
Is there an internal, hidden integer
Yes.
to not use Length() but a direct integer managed by me?
Doesn't matter.
See Dynamic arrays item in Addressing pointers article by Rudy Velthuis.
P.S. You can also hit F1 button.