Postgres array comparison confusion

When I run
select array[19,21,500] <= array[23,5,0];
I get true.
but when I run
select array[24,21,500] <= array[23,5,0];
I get false. This suggests that the comparison is only on the first element.
I am wondering if there is an operator or possibly function that compares all the entries such that if all the entries in the left array are less than those in the right array (at the same index) it would return true, otherwise return false.
I'm hoping to retrieve all the rows that have an entire array "less than" or "greater than" a given array. I don't know if this is possible.

Arrays have ordinality as a basic property. In other words, '{1,3,2}' <> '{1,2,3}', and this is important to understand when looking at comparisons: they compare successive elements in order, like dictionary (lexicographic) ordering.
Imagine for a moment that PostgreSQL didn't have an inet type. We could use int[] to specify CIDR blocks. For example, we could use '{10,0,0,1,8}' to represent 10.0.0.1/8. We could then compare IP addresses in this way. We could also represent the address as a bigint plus mask: '{167772161,8}'. In this sort of comparison, if you have two IP addresses with different subnets, we can compare them, and the one with the more specific subnet would come after the one with the less specific subnet.
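For instance (using the hypothetical int[] encoding above, not a built-in type):
select '{10,0,0,1,8}'::int[] < '{10,0,0,1,16}'::int[];  -- true: the /16 entry sorts after the /8 entry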
One of the basic principles of database normalization is that each field should hold one and only one value for its domain. One reason arrays don't necessarily violate this principle is that, since they have ordinality (and thus act as a tuple rather than a set or a bag), you can use them to represent singular values. The comparisons make perfect sense in that case.
In the case where you want an operator which does not respect ordinality, you can create your own. Basically, you write a function that returns a bool based on the two arrays, and then wrap it in an operator (see CREATE OPERATOR in the docs for more on how to do this). You are by no means limited to what PostgreSQL offers out of the box.
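As a rough sketch of that approach (array_all_lt and <<< are made-up names, not built-ins; assumes equal-length arrays and PostgreSQL 9.4+ for the two-argument unnest()):
CREATE FUNCTION array_all_lt(int[], int[]) RETURNS boolean
LANGUAGE sql IMMUTABLE AS
$$ SELECT bool_and(a < b) FROM unnest($1, $2) AS t(a, b) $$;

CREATE OPERATOR <<< (
    LEFTARG  = int[],
    RIGHTARG = int[],
    FUNCTION = array_all_lt
);

-- SELECT array[19,21,500] <<< array[23,50,501];  -- true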

To actually conduct the operation you asked for, use unnest() in parallel and aggregate with bool_and():
SELECT bool_and(a < b) -- each element < corresponding element in 2nd array
,bool_and(a <= b)
,bool_and(a >= b)
,bool_and(a > b)
-- etc.
FROM (SELECT unnest('{1,2,3}'::int[]) AS a, unnest('{2,3,4}'::int[]) AS b) t
Both arrays need to have the same number of base elements to be unnested in parallel. Otherwise you get a CROSS JOIN, i.e. a completely different result.

Related

Position of the lowest value greater than x in ordered postgresql array (optimization)

Looking at the postgres function array_position(anyarray, anyelement [, int])
My problem is similar, but I'm looking for the position of the first value in an array that is greater than an element. I'm running this on small arrays, but really large tables.
This works:
CREATE OR REPLACE FUNCTION arr_pos_min(anyarray,real)
RETURNS int LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'select array_position($1,(SELECT min(i) FROM unnest($1) i where i>$2))';
The array_position call takes advantage of the fact that my array is ordered, but the subquery doesn't. And I feel like the subquery could return the position directly without having to re-query.
My arrays are only 100 elements long, but I have to run this millions of times and so looking for a performance pickup.
Suggestions appreciated.
This seems to be a bit faster:
CREATE OR REPLACE FUNCTION arr_pos_min(p_input anyarray, p_to_check real)
RETURNS int
AS
$$
select t.idx
from unnest(p_input) with ordinality as t(i, idx)
where t.i > p_to_check
order by t.idx
limit 1
$$
LANGUAGE sql
IMMUTABLE
PARALLEL SAFE
;
The above makes use of the fact that the values in the array are already sorted; ordering by the array index is therefore quite cheap. I am not sure if unnest() is guaranteed in this context to return the elements in the order they are stored in the array. If that were the case, you could remove the order by and make it even faster.
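A quick sanity check of either version (the array must be sorted ascending):
select arr_pos_min('{1,5,9,12}'::real[], 6.0);  -- 3, the position of 9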
I don't think that there is a more efficient solution than yours, except if you write a dedicated C function for that.
Storing large arrays is often a good recipe for bad performance.

What C construct would allow me to 'reverse reference' an array?

Looking for an elegant way (or a construct with which I am unfamiliar) that allows me to do the equivalent of 'reverse referencing' an array. That is, say I have an integer array
handle[number] = nameNumber
Sometimes I know the number and need the nameNumber, but sometimes I only know the nameNumber and need the matching [number] in the array.
The integer nameNumber values are each unique, that is, no two nameNumbers are the same, so every [number] and nameNumber pair is also unique.
Is there a good way to 'reverse reference' an array value (or some other construct) without having to sweep the entire array looking for the matching value, (or having to update and keep track of two different arrays with reverse value sets)?
If the array is sorted and you know its length, you could binary search for the element. That is an O(log n) search instead of an O(n) sweep through the array. Divide the array in half and check whether the element at the center is greater or less than what you're looking for, take the half your element must be in, and divide in half again. Each decision eliminates half of the remaining elements. Keep this going and you'll eventually land on the element you're looking for.
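A minimal sketch of that search in C (assumes the array is sorted ascending; names are illustrative):
/* Returns the index of target in the sorted array a of length n, or -1. */
static int binary_search(const int a[], int n, int target) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;  /* written this way to avoid overflow */
        if (a[mid] == target)
            return mid;
        else if (a[mid] < target)
            lo = mid + 1;              /* discard the lower half */
        else
            hi = mid - 1;              /* discard the upper half */
    }
    return -1;                         /* not present */
}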
I don't know whether it's acceptable for you to use C++ and boost libraries. If yes you can use boost::bimap<X, Y>.
Boost.Bimap is a bidirectional maps library for C++. With Boost.Bimap you can create associative containers in which both types can be used as a key. A bimap can be thought of as a combination of a std::map<X, Y> and a std::map<Y, X>.

Avoiding database when checking for existing values

Is there a way through hashes or bitwise operators or another algorithm to avoid using database when simply checking for previously appeared string or value?
Assume there is no way to store the whole history of the strings that appeared before; only a little information can be stored.
You may be interested in Bloom filters. They don't let you authoritatively say, "yes, this value is in the set of interest", but they do let you say "yes, this value is probably in the set" vs. "no, this value definitely is not in the set". For many situations, that's enough to be useful.
The way it works is:
You create an array of Boolean values (i.e. of bits). The larger you can afford to make this array, the better.
You create a bunch of different hash functions that each take an input string and map it to one element of the array. You want these hash functions to be independent, so that even if one hash function maps two strings to the same element, a different hash function will most likely map them to different elements.
To record that a string is in the set, you apply each of your hash functions to it in turn — giving you a set of elements in the array — and you set all of the mapped-to elements to TRUE.
To check if a string is (probably) in the set, you do the same thing, except that now you just check the mapped-to elements to see if they are TRUE. If all of them are TRUE, then the string is probably in the set; otherwise, it definitely isn't.
If you're interested in this approach, see https://en.wikipedia.org/wiki/Bloom_filter for detailed analysis that can help you tune the filter appropriately (choosing the right array-size and number of hash functions) to get useful probabilities.
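To make this concrete, here is a small C sketch of the scheme just described (the sizes and the seeded FNV-1a hash are arbitrary choices for illustration, not tuned values):
#include <stdint.h>

#define FILTER_BITS 65536u   /* size of the bit array: bigger = fewer false positives */
#define NUM_HASHES  4        /* number of (approximately) independent hash functions */

static uint8_t filter[FILTER_BITS / 8];

/* FNV-1a seeded per function: a cheap way to derive k different hashes. */
static uint32_t hash_k(const char *s, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h % FILTER_BITS;
}

/* Record a string: set every mapped-to bit to TRUE. */
static void bloom_add(const char *s) {
    for (uint32_t k = 0; k < NUM_HASHES; k++) {
        uint32_t bit = hash_k(s, k);
        filter[bit >> 3] |= (uint8_t)(1u << (bit & 7));
    }
}

/* 0 = definitely not in the set; 1 = probably in the set. */
static int bloom_maybe_contains(const char *s) {
    for (uint32_t k = 0; k < NUM_HASHES; k++) {
        uint32_t bit = hash_k(s, k);
        if (!(filter[bit >> 3] & (1u << (bit & 7))))
            return 0;
    }
    return 1;
}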

How can you guarantee quicksort will sort an array of integers?

Got asked this in a lecture...stumped by it a bit.
how can you guarantee that quicksort will always sort an array of integers?
Thanks.
Gratuitously plagiarising Wikipedia:
The correctness of the partition algorithm is based on the following two arguments:
At each iteration, all the elements processed so far are in the desired position: before the pivot if less than the pivot's value, after the pivot if greater than the pivot's value (loop invariant).
Each iteration leaves one fewer element to be processed (loop variant).
The correctness of the overall algorithm can be proven via induction: for zero or one element, the algorithm leaves the data unchanged; for a larger data set it produces the concatenation of two parts, elements less than the pivot and elements greater than it, themselves sorted by the recursive hypothesis.
Quicksort functions by taking a pivot value and sorting the remaining data into two groups, one higher and one lower. You then do this to each group in turn until you get groups no larger than one. At that point you can guarantee the data is sorted, because every pivot value sits in its correct place: every other element in its group was directly compared against it and moved to the correct side. In the end you are left with groups of size 1 or 0, which cannot be rearranged and thus are already sorted.
Hope this helps, it was what we were taught for A Level Further Mathematics (16-18, UK).
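To make the invariant concrete, a minimal C sketch of that scheme (Lomuto-style partition; illustrative code, not from the lecture):
/* Partition: place the pivot in its final position and return that index.
   Afterwards everything left of it is <= pivot, everything right is > pivot. */
static int partition(int a[], int lo, int hi) {
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] <= pivot) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}

/* Each call recurses on strictly smaller subarrays (the loop variant),
   so the recursion terminates; groups of 0 or 1 are already sorted. */
static void quicksort(int a[], int lo, int hi) {
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    quicksort(a, lo, p - 1);
    quicksort(a, p + 1, hi);
}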
Your professor may be referring to "stability." Have a look here: http://en.wikipedia.org/wiki/Stable_sort#Stability. Stable sorting algorithms maintain the relative order of records with equal keys. If all keys are different then this distinction is not necessary.
Quicksort (in efficient implementations) is not a stable sort, so one way to guarantee stability would be to ensure that there are no duplicate integers in your array.

C - How to implement Set data structure?

Is there any tricky way to implement a set data structure (a collection of unique values) in C? All elements in a set will be of the same type, and there is a huge amount of RAM available.
As far as I know, for integers it can be done really fast'N'easy using value-indexed arrays. But I'd like to have a very general Set data type. And it would be nice if a set could include itself.
There are multiple ways of implementing set (and map) functionality, for example:
tree-based approach (ordered traversal)
hash-based approach (unordered traversal)
Since you mentioned value-indexed arrays, let's try the hash-based approach which builds naturally on top of the value-indexed array technique.
Be aware of the advantages and disadvantages of hash-based vs. tree-based approaches.
You can design a hash-set (a special case of hash-tables) of pointers to hashable PODs, with chaining, internally represented as a fixed-size array of buckets of hashables, where:
all hashables in a bucket have the same hash value
a bucket can be implemented as a dynamic array or linked list of hashables
a hashable's hash value is used to index into the array of buckets (hash-value-indexed array)
one or more of the hashables contained in the hash-set could be (a pointer to) another hash-set, or even to the hash-set itself (i.e. self-inclusion is possible)
With large amounts of memory at your disposal, you can size your array of buckets generously and, in combination with a good hash method, drastically reduce the probability of collision, achieving virtually constant-time performance.
You would have to implement:
the hash function for the type being hashed
an equality function for the type, used to test whether two hashables are equal or not
the hash-set contains/insert/remove functionality.
You can also use open addressing as an alternative to maintaining and managing buckets.
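A minimal sketch of that chained hash-set, specialized to C strings for brevity (a generic version would take the hash and equality functions as pointers, as listed above; all names are illustrative):
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096   /* sized generously, per the "huge RAM" assumption */

typedef struct node {
    char *key;
    struct node *next;  /* chaining: all keys in a bucket share a hash value */
} node;

typedef struct {
    node *buckets[NBUCKETS];
} hashset;

/* djb2 string hash; any reasonable hash function works here. */
static size_t hash(const char *s) {
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

static int contains(const hashset *set, const char *key) {
    for (const node *n = set->buckets[hash(key)]; n; n = n->next)
        if (strcmp(n->key, key) == 0) return 1;
    return 0;
}

static void insert(hashset *set, const char *key) {
    if (contains(set, key)) return;   /* sets hold unique values */
    size_t h = hash(key);
    node *n = malloc(sizeof *n);
    n->key = strdup(key);             /* POSIX; copy manually if unavailable */
    n->next = set->buckets[h];
    set->buckets[h] = n;
}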
Sets are usually implemented as some variety of a binary tree. Red black trees have good worst case performance.
These can also be used to build a map to allow key/value lookups.
This approach requires some sort of ordering on the elements of the set and the key values in a map.
I'm not sure how you would manage a set that could possibly contain itself using binary trees if you limit set membership to well defined types in C ... comparison between such constructs could be problematic. You could do it easily enough in C++, though.
The way to get genericity in C is by void *, so you're going to be using pointers anyway, and pointers to different objects are unique. This means you need a hash map or binary tree containing pointers, and this will work for all data objects.
The downside of this is that you can't enter rvalues independently. You can't have a set containing the value 5; you have to assign 5 to a variable, which means it won't match a random 5. You could enter it as (void *) 5, and for practical purposes this is likely to work with small integers, but if your integers can get large enough to collide with real pointer values, this has a small probability of failing.
Nor does this work with string values. Given char a[] = "Hello, World!"; char b[] = "Hello, World!";, a set of pointers would find a and b to be different. You would probably want to hash the values, but if you're concerned about hash collisions you should save the string in the set and do a strncmp() to compare the stored string with the probing string.
(There are similar problems with floating-point numbers, but trying to represent floating-point numbers in sets is a bad idea in the first place.)
Therefore, you'd probably want a tagged value, one tag for any sort of object, one for integer value, and one for string value, and possibly more for different sorts of values. It's complicated, but doable.
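A sketch of such a tagged value (tags and names invented for illustration):
#include <string.h>

typedef enum { T_PTR, T_INT, T_STR } tag_t;

typedef struct {
    tag_t tag;
    union {
        void *p;         /* arbitrary object, compared by identity */
        long  i;         /* integer rvalue, compared by value */
        const char *s;   /* string rvalue, compared by content */
    } v;
} value;

/* Equality respects the tag, avoiding the pointer-vs-rvalue pitfalls above. */
static int value_eq(const value *a, const value *b) {
    if (a->tag != b->tag) return 0;
    switch (a->tag) {
        case T_PTR: return a->v.p == b->v.p;
        case T_INT: return a->v.i == b->v.i;
        case T_STR: return strcmp(a->v.s, b->v.s) == 0;
    }
    return 0;
}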
If the maximum number of elements in the set (the cardinality of the underlying data type) is small enough, you might want to consider using a plain old array of bits (or whatever you call them in your favourite language).
Then you have a simple set membership check: bit n is 1 if element n is in the set. You could even count 'ordinary' members from 1, and only make bit 0 equal to 1 if the set contains itself.
This approach will probably require some sort of other data structure (or function) to translate from the member data type to the position in the bit array (and back), but it makes basic set operations (union, intersection, membership test, difference, insertion, removal, complement) very easy. But it is only suitable for relatively small sets; you wouldn't want to use it for sets of 32-bit integers, I don't suppose.
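A minimal sketch of that bit-array set (fixed universe of NBITS members; the element-to-bit translation is assumed to exist separately, as noted above):
#include <stdint.h>
#include <stddef.h>

#define NBITS 1024
typedef struct { uint8_t bits[NBITS / 8]; } bitset;

static void set_add(bitset *s, unsigned n)    { s->bits[n >> 3] |= (uint8_t)(1u << (n & 7)); }
static void set_remove(bitset *s, unsigned n) { s->bits[n >> 3] &= (uint8_t)~(1u << (n & 7)); }
static int  set_contains(const bitset *s, unsigned n) { return (s->bits[n >> 3] >> (n & 7)) & 1; }

/* Union, intersection and difference are plain bitwise ops over the bytes. */
static void set_union(bitset *out, const bitset *a, const bitset *b) {
    for (size_t i = 0; i < NBITS / 8; i++)
        out->bits[i] = a->bits[i] | b->bits[i];
}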
