{
A: "numeric",
B: "numeric",
C: "numeric",
}
In general, what is the time complexity of sorting by multiple columns?
I have never found a website that offers an option for multi-column sorting, which raises the question: is this operation simply too costly?
If the number of columns you are sorting by is constant, then it doesn't factor into the big-O runtime complexity of the sorting algorithm.
The different columns are handled in the comparator. Pseudocode for that is:
if column A values different
    compare values from column A
else if column B values different
    compare values from column B
else if column C values different
    compare values from column C
else
    they are equal
The first column you are sorting by is usually the only column consulted; the others are just used as tie-breakers. Whether you sort by one column or several, the sorting algorithm still runs in O(n log n) time.
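For instance, in Python (just an illustrative sketch, with made-up sample data), the whole comparator collapses into sorting by a tuple key:

# Rows with three numeric columns A, B and C (sample data made up for illustration).
rows = [
    {"A": 2, "B": 7, "C": 1},
    {"A": 1, "B": 9, "C": 4},
    {"A": 2, "B": 3, "C": 8},
]

# Sorting by the tuple (A, B, C) applies exactly the tie-breaking logic above:
# B is only consulted when two rows share the same A, and C only when A and B both tie.
rows.sort(key=lambda r: (r["A"], r["B"], r["C"]))

# Still a single O(n log n) sort; each comparison just does a bit more work.
print(rows)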
I have a spreadsheet where a couple of colleagues and I are rank-ordering some books based on our preferences. It looks something like this:
Books,John Ranks,Smith Ranks,Doe Ranks
book1,1,3,2
book2,3,1,1
book3,2,2,3
In this ranking world, the highest rank for a book is 1, and the lowest rank is {number of books}, which would be 3 in this example. I have a column, Totals, that sums the ranks for each book:
Books,John Ranks,Smith Ranks,Doe Ranks,Totals
book1,1,3,2,6
book2,3,1,1,5
book3,2,2,3,7
In this example, book2 would have won out, because it has the lowest total and therefore the highest preference. Now, I want another column, True Ranks or something, that sorts the names of the books based on the values in the Totals column. A Totals cell value will always be in the same row as the book it represents. I want to write a formula for True Ranks that lists the book names in rank order, i.e. sorted by the total values in ascending order. So, it would look something like this:
Books,John Ranks,Smith Ranks,Doe Ranks,Totals,True Ranks
book1,1,3,2,6,book2
book2,3,1,1,5,book1
book3,2,2,3,7,book3
Because {book2: 5, book1: 6, book3: 7} is the ascending order. I'm scratching my head over how to do this in a spreadsheet; I'm not well-versed in all the options available. Any ideas?
Edit: I am not sure I am explaining this well enough (there is always the fear of that): in programmatic terms, I essentially have two lists, an int list Totals and a String list Books. I am asking how I could first copy both lists (I do not want to modify either original), then sort them simultaneously by the int values (though I suppose the result of the int list sort won't be displayed anywhere).
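To illustrate (Python here is only pseudocode for the operation I have in mind, not something the spreadsheet runs):

books = ["book1", "book2", "book3"]   # the String list
totals = [6, 5, 7]                    # the int list

# Copy both lists, sort the copies together by the totals,
# and keep only the resulting book order.
true_ranks = [book for _, book in sorted(zip(list(totals), list(books)))]
print(true_ranks)   # ['book2', 'book1', 'book3']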
...Any ideas?
You can use the SORT function in a formula:
=SORT(A2:A4,E2:E4,1)
F1=SORT(A2:E4,5,TRUE)
SORT is an array function, so it needs clear cells below and to the right of it for the results.
This will return all 5 columns. You can hide the unwanted ones.
You can also be clever (which will bite you later...) and construct a range on the fly:
F1=SORT({A2:A4,E2:E4},2,TRUE)
This eliminates the extra columns of individual results.
Try a direct approach:
=QUERY(SORT({A2:A, B2:B+C2:C+D2:D}, 2, 1),
"select Col1 where Col1 is not null", )
I have a question regarding a big Excel file I am working on right now.
I have a long column of values (Column B) in a larger list of data. In Column C I have a formula referencing other values in the data list that evaluates to either TRUE or FALSE depending on different conditions.
I have managed to make two array formulas that calculate the median and first quartile of the values in Column B for rows where the value in Column C equals TRUE:
{=MEDIAN(IF($C$2:$C$11; $B$2:$B$11))}
{=QUARTILE(IF($C$2:$C$11; $B$2:$B$11);1)}
Now I want to add another condition to the calculation. Apart from requiring the value in Column C to equal TRUE, I only want to include rows where the value in Column A equals "Measure 1", or any other dynamic value. I have tried to nest the AND function as below, but it doesn't work at all.
{=MEDIAN(IF(AND($B$2:$B$11;$A$2:$A$11="Measure 1"); $C$2:$C$11))}
Would anyone be able to help me figure out how I can add values to an array depending on multiple criteria and then calculate the median and quartiles of that array?
You construct an array of the applicable values, and then apply the function to the array.
Since these are array formulas, you must enter/confirm them by holding down Ctrl + Shift while hitting Enter. If you do this correctly, Excel will place braces {...} around the formula, as observed in the formula bar.
So, for Median:
=MEDIAN((IF((Condition_1=TRUE)*(Measure="Measure 1")*Value,(Condition_1=TRUE)*(Measure="Measure 1")*Value)))
and for Quartile 1
=QUARTILE((IF((Condition_1=TRUE)*(Measure="Measure 1")*Value,(Condition_1=TRUE)*(Measure="Measure 1")*Value)),1)
QUARTILE is also one of the functions available through AGGREGATE, which can work on an array, so you could enter the following normally (without Ctrl + Shift + Enter):
=AGGREGATE(17,6,1/((Condition_1=TRUE)*(Measure = "Measure 1"))*Value,1)
And where the QUARTILE.INC quart argument = 2, the result is the same as MEDIAN.
So for Median, you could use:
=AGGREGATE(17,6,1/((Condition_1=TRUE)*(Measure = "Measure 1"))*Value,2)
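For comparison only (Excel doesn't run this), the same filter-then-aggregate idea in Python with NumPy, using made-up sample data in place of the named ranges:

import numpy as np

# Stand-ins for the Measure, Value and Condition_1 ranges.
measure   = np.array(["Measure 1", "Measure 1", "Measure 2", "Measure 1"])
value     = np.array([10.0, 20.0, 30.0, 40.0])
condition = np.array([True, True, True, False])

# Build a boolean mask from both criteria, then aggregate only the kept values.
mask = condition & (measure == "Measure 1")
kept = value[mask]

print(np.median(kept))          # median of the filtered values
print(np.percentile(kept, 25))  # first quartile of the filtered values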
I have an Excel table that lists various projects from which I would like to return the name of the largest project, based on certain criteria. This is the table structure:
Project Title (A); Category (B); Completed Year (C); Dollar Amount(D)
The array formula below gives me the largest item based on those criteria. However, when I try to look up column A, it doesn't work properly with duplicates, such as the many zero-dollar projects:
{=LARGE(IF($B$2:$B$1000="Services",IF(YEAR($C$2:$C$1000)=2015,$D$2:$D$1000,""),""),1)}
Please consider using the following array formula:
{=(LARGE(IF($B$2:$B$1000="Services",1,0)*IF(YEAR($C$2:$C$1000)=2015,1,0)*$D$2:$D$1000,1))}
I'm using the IFs to generate arrays with a 1 for each line that conforms to each criterion. Multiplying them then keeps a 1 only for the lines that meet both. Finally, multiplying that array by the array of your target values carries through only the relevant ones for comparison by LARGE.
Regards,
I have a table with 1.5 million records. Each record has a row number and an array with between 1 and 1,000 elements. I am trying to find all of the arrays that are subsets of another, larger array.
When I use the code below, I get ERROR: statement requires more resources than resource queue allows (possibly because there are over a trillion possible combinations):
select a.array as dup
from table a
left join table b
  on b.array #> a.array
  and a.row_number <> b.row_number
Is there a more efficient way to identify which arrays are subsets of the other arrays and mark them for removal other than using #>?
Your example code suggests that you are only interested in finding arrays that are subsets of any other array in another row of the table.
However, your query with a JOIN returns all combinations, possibly multiplying results.
Try an EXISTS semi-join instead, returning qualifying rows only once:
SELECT a.array AS dup
FROM   table a
WHERE  EXISTS (
   SELECT 1
   FROM   table b
   WHERE  a.array <# b.array
   AND    a.row_number <> b.row_number
   );
With this form, Postgres can stop iterating rows as soon as the first match is found. If this won't go through either, try partitioning your query. Add a clause like
AND table_id BETWEEN 0 AND 10000
and iterate through the table. Should be valid for this case.
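A rough sketch of driving that iteration from a client, assuming psycopg2 and treating the connection string, table name, column names and batch bounds as placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
cur = conn.cursor()

batch = 10_000
dups = []
for lo in range(0, 1_500_000, batch):    # step through the table_id ranges
    cur.execute(
        """
        SELECT a.arr AS dup
        FROM   tbl a
        WHERE  a.table_id BETWEEN %s AND %s
        AND    EXISTS (
                 SELECT 1
                 FROM   tbl b
                 WHERE  a.arr <# b.arr
                 AND    a.row_number <> b.row_number
               )
        """,
        (lo, lo + batch - 1),
    )
    dups.extend(row[0] for row in cur.fetchall())

cur.close()
conn.close()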
Aside: it's a pity that your Postgres derivative (Greenplum) doesn't seem to support GIN indexes, which would make this operation much faster. (The index itself would be big, though.)
Well, I don't see how to do this efficiently in a single declarative SQL statement without appropriate support from an index. I don't know how well this would work with a GIN index, but using a GIN index would certainly avoid the need to compare every possible pair of rows.
The first thing I would do is to carefully investigate the types of indexes you do have at your disposal, and try creating one as necessary.
If that doesn't work, the first thing that comes to my mind, procedurally speaking, would be to sort the elements within each array, then sort the rows into a graded lexicographic order on the arrays. Then start with the shortest arrays and work upwards as follows: e.g. for [1,4,9], check whether the arrays with length <= 3 that start with 1 are subsets of it, then check the arrays with length <= 2 that start with 4, and then the arrays of length <= 1 that start with 9, removing any found subsets from consideration as you go so that you don't keep re-checking the same rows over and over again.
I'm sure you could tune this algorithm a bit, especially depending on the particular nature of the data involved. I wouldn't be surprised if there was a much better algorithm; this is just the first thing that I thought of. You may be able to work from this algorithm backwards to the SQL you want, or you may have to dump the table for client-side processing, or some hybrid thereof.
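If you do end up dumping the table for client-side processing, here is a simplified Python sketch of the basic idea (order rows by array length and use set containment for the subset test); it deliberately omits the graded-lexicographic pruning described above, and the data is made up:

def find_subset_rows(rows):
    # rows: list of (row_number, list_of_elements) pairs.
    # Returns the row_numbers whose array is contained in some other row's array.
    prepared = sorted(
        ((rn, frozenset(arr)) for rn, arr in rows),
        key=lambda item: len(item[1]),
    )
    dups = []
    for i, (rn, members) in enumerate(prepared):
        # A subset can never be longer than its superset, so only
        # rows at least as long as this one need to be checked.
        for _, other in prepared[i + 1:]:
            if members <= other:          # set containment, like <# in SQL
                dups.append(rn)
                break
    return dups

# Made-up data: row 2 is a subset of row 1, row 4 is a subset of rows 1 and 3.
rows = [(1, [1, 4, 9]), (2, [1, 4]), (3, [4, 9, 16]), (4, [9])]
print(find_subset_rows(rows))   # [4, 2]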
Let us say I have some large collection of rows of data, where each element in the row is a (key, value) pair:
1) [(bird, "eagle"), (fish, "cod"), ... , (soda, "coke")]
2) [(bird, "lark"), (fish, "bass"), ..., (soda, "pepsi")]
n) ....
n+1) [(bird, "robin"), (fish, "flounder"), ..., (soda, "fanta")]
I would like the ability to run some computation that determines, for a new row, which existing row is "most similar" to it.
The most direct way I could think of to find the "most similar" row for any particular row is to compare it directly against all other rows. This is obviously computationally very expensive.
I am looking for a solution of the following form.
A function that can take a row and generate some derivative integer for that row. This returned integer would be a sort of "signature" of the row. The important property of this signature is that if two rows are very "similar", they would generate very close integers; if two rows are very "different", they would generate distant integers. Obviously, identical rows would generate the same signature.
I could then take these generated signatures, together with the index of the row each points to, and sort them all by signature. I would keep this data structure so that I can do fast lookups. Call it database B.
When I have a new row and wish to know which existing row in database B is most similar, I would:
Generate a signature for the new row
Binary search through the sorted list of (signature, index) pairs in database B for the closest match
Return the closest matching (could be a perfect match) row in database B.
I know there is a lot of hand-waving in this question. My problem is that I do not actually know what function would generate this signature. I have looked at Levenshtein distances, but those represent a transformation cost, not so much a signature. I have also considered lossy compression: two things might be "bucketable" if they compress to the same thing. I am looking for other ideas on how to do this.
Thank you.
EDIT: This is my original answer, which we will call Case 1, where there is no precedence to the keys.
You cannot do it as a sorted integer because that is one dimensional and your data is multi-dimensional. So "nearness" in that sense cannot be established on a line.
Your example shows bird, fish and soda for all 3 lines. Are the keys fixed and known? If they are not, then your first step is to hash the keys of a row to establish rows that have the same keys.
For the values, consider this as a poor man's Saturday Night similarity trick. Hash the values; any two rows that match on that hash are an exact match and represent the same "spot", with zero distance.
If N is the number of key/value pairs:
The closest non-exact "nearness" would mean matching N-1 out of N values. So you generate N more hashes, each one dropping out one of the values. Any two rows that match on those hashes have N-1 out of N values in common.
The next closest non-exact "nearness" would mean matching N-2 out of N values. So you generate more hashes (N choose 2 of them, to be exact), this time each hash leaving out a combination of two values. Any two rows that match on those hashes have N-2 out of N values in common.
So you can see where this is going. At the logical extreme you end up with 2^N hashes, which is not very savory, but I'm assuming you would not go that far, because you reach a point where too few matching values would be considered too "far" apart to be worth considering.
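A small Python sketch of that drop-one/drop-two hashing, with the function name made up and the expansion capped at two dropped values (since going all the way to 2^N is exactly what you want to avoid):

from itertools import combinations

def signature_hashes(row_values, max_drop=2):
    # row_values: the row's values in a fixed key order.
    # Returns {k: set of hashes}, where two rows sharing a hash at level k
    # agree on at least N-k of their N values (at the same key positions).
    values = tuple(row_values)
    n = len(values)
    levels = {}
    for k in range(max_drop + 1):
        hashes = set()
        for dropped in combinations(range(n), k):
            kept = tuple(v for i, v in enumerate(values) if i not in dropped)
            # Include the dropped positions so matches line up key by key.
            hashes.add(hash((dropped, kept)))
        levels[k] = hashes
    return levels

# Two rows differing only in the soda value share a hash at drop-level 1.
ha = signature_hashes(["eagle", "cod", "coke"])
hb = signature_hashes(["eagle", "cod", "pepsi"])
print(bool(ha[0] & hb[0]))   # False: not an exact match
print(bool(ha[1] & hb[1]))   # True: N-1 of N values in common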
EDIT: To see how we cannot escape dimensionality, consider just two keys, with values 1-10. Plot all possible values on a graph. We see that {1,1} is close to {2,2}, but also that {5,6} is close to {6,7}. So we get a brainstorm and say, Aha! I'll calculate each point's distance from the origin using the Pythagorean theorem! That makes both {1,1} and {2,2} easy to detect. But then the two points {1,10} and {10,1} get the same number, even though they are as far apart as they can be on the graph. So we say, OK, I need to add the angle for each point. Two points at the same distance are distinguished by their angle; two points at the same angle are distinguished by their distance. But of course now we've plotted them in two dimensions.
EDIT: Case 2 would be when there is precedence to the keys, when key 1 is more significant than key 2, which is more significant than key 3, etc. In this case, if the allowed values were A-Z, you would string the values together as if they were digits to get a sortable value. ABC is very close to ABD, but very far from BBD.
If you had a lot of data, and wanted to do this hardcore, I would suggest a statistical method like PLSA or PSVM, which can extract identifying topics from text and identify documents with similar topic probabilities.
A simpler, but less accurate way of doing it is using Soundex, which is available for many languages. You can store the soundex (which will be a short string, not an integer I'm afraid), and look for exact matches to the soundex, which should point to similar rows.
I think it's unrealistic to expect a function to turn a series of strings into an integer such that integers near each other map to similar strings. The closest you might come is doing a checksum on each individual tuple, and comparing the checksums for the new row to the checksums of existing rows, but I'm guessing you're trying to come up with a single number you can index on.