Time Complexity of finding the length of an array - arrays

I am a little confused about what the time complexity of a len() function would be.
I have read in many different posts that finding the length of an array in Python is O(1) with the len() function, and that it is similar for other languages.
How is this possible? Do you not have to iterate through the whole array to count how many indices it's taking up?

Do you not have to iterate through the whole array to count how many indices it's taking up?
No, you do not.
You can generally always trade space for time when constructing algorithms.
For example, when creating a collection, allocate a separate variable holding the size. Then increment it when adding an item to the collection and decrement it when removing one.
Then, voilà, the size of the collection can be obtained in O(1) time just by reading that variable.
And this appears to be what Python actually does, as per this page (checking the Python source code shows that this is what happens when requesting the size of a great many objects), which states:
Py_SIZE(o) - This macro is used to access the ob_size member of a Python object. It expands to (((PyVarObject*)(o))->ob_size).
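A quick way to convince yourself of the O(1) behaviour is to time len() on lists of very different sizes (a rough check, not a proof; absolute timings will vary by machine). If len() had to iterate, the second measurement would be roughly a million times larger:
import timeit

small = list(range(10))
large = list(range(10_000_000))

# Each call is repeated a million times; both totals come out in the same
# ballpark because len() just reads the stored ob_size field.
print(timeit.timeit(lambda: len(small), number=1_000_000))
print(timeit.timeit(lambda: len(large), number=1_000_000))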
If you compare the two approaches (iterating vs. a length variable), the properties of each can be seen in the following table:
| Measurement | Iterate | Variable |
| --- | --- | --- |
| Space needed | No extra space beyond the collection itself. | A tiny additional variable (4 bytes allows a size of about four billion). |
| Time taken | Iteration over the collection; depends on collection size, so could be significant. | Extraction of the length is very quick. Changes to the list size (addition or deletion) incur the slight extra expense of updating the length, but this is also tiny. |
In this case, the extra cost is minimal but the time saved for getting the length could be considerable, so it's probably worth it.
That's not always the case since, in some (rare) situations, the added space cost may outweigh the reduced time taken (or it may need more space than can be made available).
And, by way of example, this is what I'm talking about. Ignore the fact that it's totally unnecessary in Python; this is for a mythical Python-like language that has an O(n) cost for finding the length of a list:
import random


class FastQueue:
    """ FastQueue: demonstration of length variable usage.
    """
    def __init__(self):
        """ Init: Empty list and set length zero.
        """
        self._content = []
        self._length = 0

    def push(self, item):
        """ Push: Add to end, increase length.
        """
        self._content.append(item)
        self._length += 1

    def pull(self):
        """ Pull: Remove from front, decrease length, taking
            care to handle empty queue.
        """
        item = None
        if self._length > 0:
            item = self._content[0]
            self._content = self._content[1:]
            self._length -= 1
        return item

    def length(self):
        """ Length: Just return stored length. Obviously this
            has no advantage in Python since that's
            how it already does length. This is just
            an illustration of my answer.
        """
        return self._length

    def slow_length(self):
        """ Length: A slower version for comparison.
        """
        list_len = 0
        for _ in self._content:
            list_len += 1
        return list_len


""" Test harness to ensure I haven't released buggy code :-)
"""
queue = FastQueue()
for _ in range(10):
    val = random.randint(1, 50)
    queue.push(val)
    print(f'push {val}, length = {queue.length()}')
for _ in range(11):
    print(f'pull {queue.pull()}, length = {queue.length()}')

Related

Is it possible to alter an Array object's length?

How does one alter self in an Array to be a totally new array? How do I fill in the commented portion below?
class Array
  def change_self
    # make this array be `[5,5,5]`
  end
end
I understand this ("Why can't I change the value of self?") and know I can't just assign self to a new object. When I do:
arr = [1,2,3,4,5]
arr contains a reference to an Array object. I can add a method to the Array class that alters an array, something like:
self[0] = 100
but is it possible to change the length of the array referenced by arr?
How are these values stored in the Array object?
You are asking three very different questions in your title and in your text:
Is it possible to alter an Array object's length using an Array method?
Yes, there are 20 methods which can (potentially) change the length of an Array:
<< increases the length by 1
[]= can alter the length arbitrarily, depending on arguments
clear sets the length to 0
compact! can decrease the length, depending on contents
concat can increase the length, depending on arguments
delete can decrease the length, depending on arguments and contents
delete_at can decrease the length, depending on arguments
delete_if / reject! can decrease the length, depending on arguments and contents
fill can increase the length, depending on arguments
insert increases the length
keep_if / select! can decrease the length, depending on arguments and contents
pop decreases the length
push increases the length
replace can alter the length arbitrarily, depending on arguments and contents (it simply replaces the Array completely with a different Array)
shift decreases the length
slice! decreases the length
uniq! can decrease the length, depending on contents
unshift increases the length
When monkey patching the Array class, how does one alter "self" to be a totally new array? How do I fill in the commented portion below?
class Array
  def change_self
    # make this array be [5,5,5] no matter what
  end
end

class Array
  def change_self
    replace([5, 5, 5])
  end
end
How are these values actually stored in the Array object?
We don't know. The Ruby Language Specification does not prescribe any particular storage mechanism or implementation strategy. Implementors are free to implement Arrays any way they like, as long as they obey the contracts of the Array methods.
As an example, here's the Array implementation in Rubinius, which I find fairly readable (at least more so than YARV):
vm/builtin/array.cpp: certain core methods and data structures
kernel/bootstrap/array.rb: a minimal implementation for bootstrapping the Rubinius kernel
kernel/common/array.rb: the bulk of the implementation
For comparison, here is Topaz's implementation:
lib-topaz/array.rb
And JRuby:
core/src/main/java/org/jruby/RubyArray.java
arr = [1,2,3,4,5]
arr.replace([5,5,5])
I wouldn't monkey-patch a new method into Array, especially since one that does exactly this already exists: Array#replace.
As Arrays are mutable, you can alter their contents:
class Array
  def change_self
    self.clear
    self.concat [5, 5, 5]
  end
end
You modify the array so it becomes empty, and then add all the elements from the target array. They are still two different objects (i.e., myAry.object_id would differ from [5, 5, 5].object_id), but now they are equivalent arrays.
Moreover, the array is still the same one as before; just its content changed:
myAry = [1, 2, 3]
otherRef = myAry
previousId = myAry.object_id
previousHash = myAry.hash
myAry.change_self
puts "myAry is now #{myAry}"
puts "Hash changed from #{previousHash} to #{myAry.hash}"
puts "ID #{previousId} remained as #{myAry.object_id}, as it's still the same instance"
puts "otherRef points to the same instance - it shows the changes, too: #{otherRef}"
Anyway, I really don't know why one would want to do this. Are you solving the right problem, or just playing with the language?

How to sample from a Scala array efficiently

I want to sample from a Scala array; the sample size can be much larger than the length of the array. How can I do this efficiently? With the following code the running time is linear in the sample size, so when the sample size is very large it is slow if we need to do the sampling many times:
def getSample(dataArray: Array[Double], sampleSize: Int, seed: Int): Array[Double] =
{
  val arrLength = dataArray.length
  val r = new scala.util.Random(seed)
  Array.fill(sampleSize)(dataArray(r.nextInt(arrLength)))
}
val myArr= Array(1.0,5.0,9.0,4.0,7.0)
getSample(myArr, 100000, 28)
The probability that any given element of an array of length $n$ appears at least once in a sample of size $k$ (drawn with replacement) is $1-(1-1/n)^k$: each draw misses a fixed element with probability $1-1/n$, so all $k$ independent draws miss it with probability $(1-1/n)^k$. If this value is close to 1, which occurs when $k$ is large compared to $n$, then the following algorithm might be a good choice depending on your needs:
import org.apache.commons.math3.random.MersenneTwister
import org.apache.commons.math3.distribution.BinomialDistribution

def getSampleCounts[T](data: Array[T], k: Int, seed: Long): Array[Int] = {
  val rng = new MersenneTwister(seed)
  val n = data.length
  val counts = new Array[Int](n)
  var remaining = k
  var i = 0
  while (i < n - 1 && remaining > 0) {
    // Conditional on the counts assigned so far, the number of copies of
    // data(i) in the sample is Binomial(remaining, 1 / (n - i)).
    val c = new BinomialDistribution(rng, remaining, 1.0 / (n - i)).sample()
    counts(i) = c
    remaining -= c
    i += 1
  }
  counts(n - 1) = remaining // any leftover draws land on the last element
  counts
}
Note that this algorithm does not return a sample. Instead it returns an Array[Int] whose $i$-th entry is equal to the number of times data(i) appears in the random sample. This may not be suitable for all applications, but for some use cases having the sample in the form of an Iterable over (value, count) pairs (which can be obtained with data.view.zip(getSampleCounts(data, k, seed)), for example) is actually very convenient, since it often lets us do a computation once per group of equal samples. For example, suppose I had an expensive function f: T => Double and I wanted to compute the sample mean of f applied to a random sample of size $k$ drawn from data. Then we could do the following:
data.view.zip(getSampleCounts(data, k, seed)).map({case (x, count) => f(x)*count}).sum/k
This computation of the sample mean evaluates f at most $n$ times instead of $k$ times (recall that we are assuming that $k$ is large compared to $n$).
Note that getSampleCounts will loop at most $n$ times, where $n$ is data.length. Also, sampling from the binomial distribution in each iteration, assuming this is done in a reasonable fashion in the apache.commons.math3 library, should have complexity no worse than $O(\log k)$ (inverse CDF method and binary search). So the complexity of the above algorithm is $O(n \log k)$, where $n$ is data.length and $k$ is the number of samples you want to draw.
There is no way around it. If you need to take N sample elements, the complexity will be O(N) (linear in the sample size) no matter what, even with constant-time element access.
You can defer/amortize the cost by making it lazy. For instance, you can return a Stream or Iterator that evaluates each element as you access it. This will help you save on memory usage if you can fold that stream as you are consuming it. In other words, you can skip the copy part and work directly with the initial array. This is not always possible; it depends on the task.
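For illustration only (in Python rather than Scala, and with made-up names), the lazy idea looks roughly like this; a generator stands in for the Stream/Iterator:
import random

def lazy_sample(data, sample_size, seed):
    # Yield one sampled element at a time instead of materialising the whole sample.
    rng = random.Random(seed)
    for _ in range(sample_size):
        yield data[rng.randrange(len(data))]

# The consumer can fold the values as they are produced; no full copy is ever built.
total = sum(lazy_sample([1.0, 5.0, 9.0, 4.0, 7.0], 100_000, 28))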
To make this sampling program run faster, use the Akka actor framework to run the sampling jobs in parallel.
Create a master actor for distributing the sampling work to worker actors and for concatenating the elements from the different workers. Each worker actor prepares/collects a fixed number of sample elements and gives back the resulting collection as an immutable array to the master. Upon receiving a user-defined 'WorkDone' message from a worker, the master actor concatenates the elements into the final collection.
It is easy with a list. Use the following implicit class:
import scala.util.Random

object ListImplicits {
  implicit class SampledArray[T](in: List[T]) {
    def sample(n: Int, seed: Option[Long] = None): List[T] = {
      seed match {
        case Some(s) => Random.setSeed(s)
        case _ => // nothing
      }
      Random.shuffle(in).take(n)
    }
  }
}
And then import the object and use collection conversions to switch from Array to list (slight overhead):
import ListImplicits.SampledArray
import scala.util.Random
val n = 100000
val list = (0 to n).toList.map(i => Random.nextInt())
val array = list.toArray
val t0 = System.currentTimeMillis()
array.toList.sample(5).toArray
val t1 = System.currentTimeMillis()
list.sample(5)
val t2 = System.currentTimeMillis()
println( "Array (conversion) => delta = " + (t1-t0) + " ms") // 10 ms
println( "List => delta = " + (t2-t1) + " ms") // 8 ms

Will this code add hash-part to Lua array-like table?

Lua code
local var = {}
for i = 1, 10000, 1 do
    table.insert( var, i )
end
var[5] = { some = false, another = 2 }
var[888] = #userdata
var[10000] = {}
var[1] = 1
var[1] = false
1) After this, does var still have only an array part, or does it have a hash part too?
2) Does the code
var[10001] = 1
add a hash part to var, or does it only force a table rehash without adding a hash part?
3) How does this affect the size of the table?
Thanks!
The table will only have an array part after both 1 and 2. The reason is that you have a contiguous set of indices. Specifically, you created entries from 1 to 10,001 and Lua will have allocated space for them.
If for example, you had created 1 to 1000 then added 10001, it would have added the last one in the hash part rather than create nil entries for all the entries between.
It doesn't matter what type of data you put in as the value for the entries; Lua is only interested in the indices when deciding between array and hash. The exception here is setting values to nil. This can get a bit complicated, but if table space is your primary concern, I don't believe Lua ever shrinks the array and hash parts when you nil sets of entries out. I could be mistaken on this.
As to the size, Lua uses a doubling strategy. So, after you hit entry 8192, Lua allocated another 8192 slots, so there was no extra array space that needed to be created between 10000 and 10001.
BTW, Lua doesn't rehash on every table addition. When it adds buckets, it gives itself some headroom; I believe it doubles there too. Note that if your data is sparse, i.e. you aren't going to fill most of the indices between 1 and your max, then this approach to hashing can be very beneficial for space even if your indices are numbers. The main downside is that you can't use ipairs across all your entries.

Determine regular expression's specificity

Given the following regular expressions:
- alice#[a-z]+\.[a-z]+
- [a-z]+#[a-z]+\.[a-z]+
- .*
The string alice#myprovider.com will obviously match all three regular expressions. In the application I am developing, we are only interested in the 'most specific' match. In this case this is obviously the first one.
Unfortunately there seems no way to do this. We are using PCRE and I did not find a way to do this and a search on the Internet was also not fruitful.
A possible way would be to keep the regular expressions sorted by descending specificity and then simply take the first match. Of course, the next question would be how to sort the array of regular expressions. It is not an option to give the responsibility to the end user to ensure that the array is sorted.
So I hope you guys could help me out here...
Thanks !!
Paul
The following is the solution to this problem I developed based on Donald Miner's research paper, implemented in Python, for rules applied to MAC addresses.
Basically, the most specific match is from the pattern that is not a superset of any other matching pattern. For a particular problem domain, you create a series of tests (functions) which compare two REs and return which is the superset, or if they are orthogonal. This lets you build a tree of matches. For a particular input string, you go through the root patterns and find any matches. Then go through their subpatterns. If at any point, orthogonal patterns match, an error is raised.
Setup
import re

class RegexElement:
    def __init__(self, string, index=None):
        self.string = string
        # Stored as sets so the add_* methods below can use .add() and .copy().
        self.supersets = set()
        self.subsets = set()
        self.disjoints = set()
        self.intersects = set()
        self.maybes = set()
        self.precompilation = {}
        self.compiled = re.compile(string, re.IGNORECASE)
        self.index = index
# Plain string constants so that a test's result can drive the
# getattr(self, 'add_' + relationship) dispatch in SubsetGraph below.
SUPERSET = 'SUPERSET'
SUBSET = 'SUBSET'
INTERSECT = 'INTERSECT'
DISJOINT = 'DISJOINT'
EQUAL = 'EQUAL'
The Tests
Each test takes 2 strings (a and b) and tries to determine how they are related. If the test cannot determine the relation, None is returned.
SUPERSET means a is a superset of b. All matches of b will match a.
SUBSET means b is a superset of a.
INTERSECT means some matches of a will match b, but some won't and some matches of b won't match a.
DISJOINT means no matches of a will match b.
EQUAL means all matches of a will match b and all matches of b will match a.
def equal_test(a, b):
    if a == b: return EQUAL
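As a further illustration of the kind of domain-specific test you might write (this one is my own sketch, not part of the original answer): if one pattern is a plain literal, it can only be equal to, disjoint from, or a subset of the other pattern.
METACHARS = set(".^$*+?{}[]|()\\")

def literal_test(a, b):
    # Hypothetical example test: a pattern with no metacharacters matches
    # exactly one string, so it is (at most) a subset of anything that matches it.
    a_literal = not (set(a) & METACHARS)
    b_literal = not (set(b) & METACHARS)
    if a_literal and b_literal:
        return EQUAL if a.lower() == b.lower() else DISJOINT
    if a_literal and re.fullmatch(b, a, re.IGNORECASE):
        return SUBSET
    if b_literal and re.fullmatch(a, b, re.IGNORECASE):
        return SUPERSET
    return None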
The graph
class SubsetGraph(object):
    def __init__(self, tests):
        self.regexps = []
        self.tests = tests
        self._dirty = True
        self._roots = None

    @property
    def roots(self):
        if self._dirty:
            r = self._roots = [i for i in self.regexps if not i.supersets]
            return r
        return self._roots

    def add_regex(self, new_regex):
        roots = self.roots
        new_re = RegexElement(new_regex)
        for element in roots:
            self.process(new_re, element)
        self.regexps.append(new_re)

    def process(self, new_re, element):
        relationship = self.compare(new_re, element)
        if relationship:
            getattr(self, 'add_' + relationship)(new_re, element)

    def add_SUPERSET(self, new_re, element):
        for i in element.subsets:
            i.supersets.add(new_re)
            new_re.subsets.add(i)
        element.supersets.add(new_re)
        new_re.subsets.add(element)

    def add_SUBSET(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)
        element.subsets.add(new_re)
        new_re.supersets.add(element)

    def add_DISJOINT(self, new_re, element):
        for i in element.subsets:
            i.disjoints.add(new_re)
            new_re.disjoints.add(i)
        new_re.disjoints.add(element)
        element.disjoints.add(new_re)

    def add_INTERSECT(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)
        new_re.intersects.add(element)
        element.intersects.add(new_re)

    def add_EQUAL(self, new_re, element):
        new_re.supersets = element.supersets.copy()
        new_re.subsets = element.subsets.copy()
        new_re.disjoints = element.disjoints.copy()
        new_re.intersects = element.intersects.copy()

    def compare(self, a, b):
        for test in self.tests:
            result = test(a.string, b.string)
            if result:
                return result

    def match(self, text, strict=True):
        matches = set()
        self._match(text, self.roots, matches)
        out = []
        for e in matches:
            for s in e.subsets:
                if s in matches:
                    break
            else:
                out.append(e)
        if strict and len(out) > 1:
            for i in out:
                print(i.string)
            raise Exception("Multiple equally specific matches found for " + text)
        return out

    def _match(self, text, elements, matches):
        new_elements = []
        for element in elements:
            m = element.compiled.match(text)
            if m:
                matches.add(element)
                new_elements.extend(element.subsets)
        if new_elements:
            self._match(text, new_elements, matches)
Usage
graph = SubsetGraph([equal_test, test_2, test_3, ...])
graph.add_regex("00:11:22:..:..:..")
graph.add_regex("..(:..){5,5}")
graph.match("00:de:ad:be:ef:00")
A complete usable version is here.
My gut instinct says that not only is this a hard problem, both in terms of computational cost and implementation difficulty, but it may be unsolvable in any realistic fashion. Consider the two following regular expressions to accept the string alice#myprovider.com
alice#[a-z]+\.[a-z]+
[a-z]+#myprovider.com
Which one of these is more specific?
This is a bit of a hack, but it could provide a practical solution to this question asked nearly 10 years ago.
As pointed out by @torak, there are difficulties in defining what it means for one regular expression to be more specific than another.
My suggestion is to look at how stable the regular expression is with respect to a string that matches it. The usual way to investigate stability is to make minor changes to the inputs, and see if you still get the same result.
For example, the string alice#myprovider.com matches the regex /alice#myprovider\.com/, but if you make any change to the string, it will not match. So this regex is very unstable. But the regex /.*/ is very stable, because you can make any change to the string, and it still matches.
So, in looking for the most specific regex, we are looking for the least stable one with respect to a string that matches it.
In order to implement this test for stability, we need to define how we choose a minor change to the string that matches the regex. This is another can of worms. We could, for example, choose to change each character of the string to something random and test that against the regex, or make any number of other possible choices. For simplicity, I suggest deleting one character at a time from the string and testing that.
So, if the string that matches is N characters long, we have N tests to make. Let's look at deleting one character at a time from the string alice#foo.com, which matches all of the regular expressions in the table below. It's 13 characters long, so there are 13 tests. In the table below,
0 means the regex does not match (unstable),
1 means it matches (stable)
| String with one character deleted | /alice#[a-z]+\.[a-z]+/ | /[a-z]+#[a-z]+\.[a-z]+/ | /.*/ |
| --- | --- | --- | --- |
| lice#foo.com | 0 | 1 | 1 |
| aice#foo.com | 0 | 1 | 1 |
| alce#foo.com | 0 | 1 | 1 |
| alie#foo.com | 0 | 1 | 1 |
| alic#foo.com | 0 | 1 | 1 |
| alicefoo.com | 0 | 0 | 1 |
| alice#oo.com | 1 | 1 | 1 |
| alice#fo.com | 1 | 1 | 1 |
| alice#fo.com | 1 | 1 | 1 |
| alice#foocom | 0 | 0 | 1 |
| alice#foo.om | 1 | 1 | 1 |
| alice#foo.cm | 1 | 1 | 1 |
| alice#foo.co | 1 | 1 | 1 |
| total score | 6 | 11 | 13 |
The regex with the lowest score is the most specific. Of course, in general, there may be more than one regex with the same score, which reflects the fact there are regular expressions which by any reasonable way of measuring specificity are as specific as one another. Although it may also yield the same score for regular expressions that one can easily argue are not as specific as each other (if you can think of an example, please comment).
But coming back to the question asked by @torak, which of these is more specific:
alice#[a-z]+\.[a-z]+
[a-z]+#myprovider.com
We could argue that the second is more specific because it constrains more characters, and the above test will agree with that view.
As I said, the way we choose to make minor changes to the string that matches more than one regex is a can of worms, and the answer that the above method yields may depend on that choice. But as I said, this is an easily implementable hack; it is not rigorous.
And, of course, the method breaks down if the string that matches is empty. The usefulness of the test will increase as the length of the string increases. With very short strings, it is more likely to produce equal scores for regular expressions that are clearly different in their specificity.
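For concreteness, here is a minimal Python sketch of the deletion-based scoring described above (the function name is mine, and it assumes a full match is required):
import re

def stability_score(pattern, text):
    # Count how many single-character deletions of text still match the pattern.
    # A lower score means the pattern is less stable, i.e. more specific.
    compiled = re.compile(pattern)
    return sum(
        1 for i in range(len(text))
        if compiled.fullmatch(text[:i] + text[i + 1:])
    )

regexes = [r"alice#[a-z]+\.[a-z]+", r"[a-z]+#[a-z]+\.[a-z]+", r".*"]
most_specific = min(regexes, key=lambda p: stability_score(p, "alice#foo.com"))
print(most_specific)  # alice#[a-z]+\.[a-z]+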
I'm thinking about a similar problem for a PHP project's route parser. After reading the other answers and comments here, and also thinking about the cost involved, I might go in another direction altogether.
A solution, however, would be to simply sort the regular expression list in order of its string length.
It's not perfect, but simply by removing the []-groups it would be much closer. On the first example in the question it would turn this list:
- alice#[a-z]+\.[a-z]+
- [a-z]+#[a-z]+\.[a-z]+
- .*
Into this, after removing the content of any []-group:
- alice#+\.+
- +#+\.+
- .*
The same goes for the second example in another answer. With the []-groups completely removed and sorted by length, this:
alice#[a-z]+\.[a-z]+
[a-z]+#myprovider.com
Would become sorted as:
+#myprovider.com
alice#+\.+
This is a good enough solution, at least for me, if I choose to use it. The downside would be the overhead of removing all []-groups before sorting and then applying that order to the unmodified list of regexes, but hey, you can't have everything.
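A rough Python sketch of that ordering heuristic (my own illustration of the idea, not code from the answer):
import re

def specificity_key(pattern):
    # Drop the contents of every non-nested [...] group, then sort by the
    # remaining length, longest (treated as most specific) first.
    stripped = re.sub(r"\[[^\]]*\]", "", pattern)
    return -len(stripped)

regexes = [r"alice#[a-z]+\.[a-z]+", r"[a-z]+#[a-z]+\.[a-z]+", r".*"]
for pattern in sorted(regexes, key=specificity_key):
    print(pattern)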

Help with a special case of permutations algorithm (not the usual)

I have always been interested in algorithms, sort, crypto, binary trees, data compression, memory operations, etc.
I read Mark Nelson's article about permutations in C++ with the STL function next_permutation(), which was very interesting and useful. After that I wrote a class method to get the next permutation in Delphi, since that is the tool I presently use most. This function works in lexicographic order; I got the idea for the algorithm from an answer to another topic here on Stack Overflow, but now I have a big problem. I'm working with permutations of a vector with repeated elements, and there are a lot of permutations that I don't need. For example, I have this first permutation for 7 elements in lexicographic order:
6667778 (6 = 3 times consecutively, 7 = 3 times consecutively)
For my work I consider valid only those permutations with at most 2 identical elements repeated consecutively, like this:
6676778 (6 = 2 times consecutively, 7 = 2 times consecutively)
In short, I need a function that returns only permutations that have at most N consecutive repetitions, according to the parameter received.
Does anyone know if there is some algorithm that already does this?
Sorry for any mistakes in the text, I still don't speak English very well.
Thank you so much,
Carlos
My approach is a recursive generator that doesn't follow branches that contain illegal sequences.
Here's the Python 3 code:
def perm_maxlen(elements, prefix = "", maxlen = 2):
    if not elements:
        yield prefix + elements
        return
    used = set()
    for i in range(len(elements)):
        element = elements[i]
        if element in used:
            # already searched this path
            continue
        used.add(element)
        suffix = prefix[-maxlen:] + element
        if len(suffix) > maxlen and len(set(suffix)) == 1:
            # would exceed maximum run length
            continue
        sub_elements = elements[:i] + elements[i+1:]
        for perm in perm_maxlen(sub_elements, prefix + element, maxlen):
            yield perm

for perm in perm_maxlen("6667778"):
    print(perm)
The implementation is written for readability, not speed, but the algorithm should be much faster than naively filtering all permutations.
For example, it runs this in milliseconds, where the naive filtering solution would take millennia or something:
print(len(list(perm_maxlen("a"*100 + "b"*100, "", 1))))
So, in the homework-assistance kind of way, I can think of two approaches.
Work out all permutations that contain 3 or more consecutive repetitions (which you can do by treating the three-in-a-row as just one pseudo-digit and feeding it to a normal permutation generation algorithm). Make a lookup table of all of these. Now generate all permutations of your original string, and check them against the lookup table before adding them to the result.
Use a recursive permutation-generating algorithm (select each possibility for the first digit in turn, then recurse to generate permutations of the remaining digits), but in each recursion pass along the last two digits generated so far. Then, in the recursively called function, if the two values passed in are the same, don't allow the first digit to be the same as those.
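A small Python sketch of that second approach (an editor's illustration with made-up names, not the asker's Delphi code), specialised to the "at most two in a row" case:
def perms_no_triples(remaining, last_two=""):
    # Recursive generation: pick each distinct element as the next digit, but
    # refuse it if the last two digits already chosen are both that same digit.
    if not remaining:
        yield ""
        return
    tried = set()
    for i, digit in enumerate(remaining):
        if digit in tried:
            continue
        tried.add(digit)
        if len(last_two) == 2 and last_two[0] == last_two[1] == digit:
            continue  # would create a run of three identical digits
        rest = remaining[:i] + remaining[i + 1:]
        for tail in perms_no_triples(rest, (last_two + digit)[-2:]):
            yield digit + tail

for p in perms_no_triples("6667778"):
    print(p)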
Why not just make a wrapper around the normal permutation function that skips values that have more than N consecutive repetitions? Something like:
(pseudocode)
function custom_perm(int max_rep)
    do
        p := next_perm()
    while count_max_reps(p) > max_rep
    return p
Krusty, I'm already doing that at the end of the function, but it doesn't solve the problem, because it still needs to generate all permutations and check each one.
consecutive := 1;
IsValid := True;
for n := 0 to len - 2 do
begin
  if anyVector[n] = anyVector[n + 1] then
    consecutive := consecutive + 1
  else
    consecutive := 1;
  if consecutive > MaxConsecutiveRepeats then
  begin
    IsValid := False;
    Break;
  end;
end;
Since I start with the first permutation in lexicographic order, this approach ends up generating a lot of unnecessary permutations along the way.
This is easy to make, but rather hard to make efficient.
If you need to build a single piece of code that only considers valid outputs, and thus doesn't bother walking over the entire combination space, then you're going to have some thinking to do.
On the other hand, if you can live with the code internally producing all combinations, valid or not, then it should be simple.
Make a new enumerator, one which you can call that next_perm method on, and have this internally use the other enumerator, the one that produces every combination.
Then simply make the outer enumerator run in a while loop asking the inner one for more permutations until you find one that is valid, then produce that.
Pseudo-code for this:
generator1:
    when called, yield the next combination

generator2:
    internally keep a generator1 object
    when called, keep asking generator1 for a new combination
    check the combination
    if valid, then yield it
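In Python terms, a loose sketch of that two-enumerator idea might look like this (illustrative names; as the answer says, it still enumerates every combination internally):
from itertools import groupby, permutations

def all_perms(elements):
    # "generator1": lazily produce every distinct permutation.
    seen = set()
    for p in permutations(elements):
        if p not in seen:
            seen.add(p)
            yield "".join(p)

def valid_perms(elements, max_rep=2):
    # "generator2": keep asking generator1 for permutations and yield only
    # those whose longest run of identical elements is within max_rep.
    for p in all_perms(elements):
        if max(len(list(group)) for _, group in groupby(p)) <= max_rep:
            yield p

for p in valid_perms("6667778"):
    print(p)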
