what is convergence in k Means? - artificial-intelligence

I have a small question about unsupervised learning. My teacher has not used this word in any lecture; I came across it while reading tutorials. Does it mean that if the cluster values in the last iteration are the same as the initial values, the algorithm is said to converge? For example:
| c1 | c2 | cluster
| (1,0) | (2,1)|
|-------|------|------------
A(1,0)| .. |.. |get smallest value
B(0,1)|.. |... |
c(2,1)|.. |... |
D(2,1)|.. |.... |
Now, after performing n iterations, if the values of c1 and c2 in the last (n-th) iteration come out the same, that is (1,0) and (2,1), taking the average whenever a cluster holds more than a single point, is that convergence?

Ideally, if the values in the last two consecutive iterations are the same, the algorithm is said to have converged. In practice, people often use a less strict criterion for convergence, for example that the difference between the values of the last two iterations is below a particular threshold.

In the case of k-means clustering, convergence means the algorithm has finished grouping the data points into k clusters. The algorithm is considered to have converged when the centroids (the k cluster means) stay in the same place for two consecutive iterations, so the assignment of points to clusters no longer changes. Note that this means the grouping is stable, not necessarily that it is the best possible one.
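To make the two criteria concrete, here is a minimal sketch in plain Python, using the four points and the two starting centroids from the question. The function name `kmeans` and the parameters `tol` and `max_iter` are my own choices for illustration, not part of any particular library. With `tol=0.0` it stops only when the centroids do not move at all; a small positive `tol` gives the looser threshold criterion.

```python
import math

def kmeans(points, centroids, tol=0.0, max_iter=100):
    """Plain k-means; returns (centroids, iterations used).

    Stops when no centroid moves by more than `tol` between two
    consecutive iterations (tol=0.0 means "exactly unchanged").
    """
    centroids = [tuple(float(v) for v in c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
        # Convergence check: largest distance any centroid moved.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            return centroids, iteration
    return centroids, max_iter

points = [(1, 0), (0, 1), (2, 1), (2, 1)]      # A, B, C, D from the question
final, iterations = kmeans(points, [(1, 0), (2, 1)])
print(final, iterations)
```

In this run c1 moves from (1,0) to (0.5,0.5) in the first iteration and stays there in the second, so convergence is detected by comparing consecutive iterations, not by comparing against the initial values.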

Related

Questions about time and space complexity nuances

Let's say I have an algorithm, spiral(), that takes an integer n and returns an array containing the integers 1 to n^2 in a spiral pattern, i.e.,
spiral(n);
iterates over the integers 1 to n^2 and inserts them into a 2d array which it creates and returns. e.g.
spiral(3);
returns
[[1,2,3],
[8,9,4],
[7,6,5]]
Obviously, the time complexity is O(n^2), but what is the space complexity? I would say also O(n^2), as we're allocating that much space as a result of calling the function, but I see that in most places, like LeetCode, for questions like this they say that the space complexity is O(1). Do necessary return values not count?
Another smaller question I have is about functions like spiral() above. Let's say there's a function like
returnmbyn(m,n);
which returns an array of size m*n, uses no other memory, and simply iterates through, inserting each value into the array. Do we also consider this function to have a time complexity of O(m*n) and a space complexity of O(1)?
Generally, what people call "space complexity" is the auxiliary space. Strictly speaking, space complexity is the total space required for solving a problem, including the input and output memory. But when analyzing an algorithm it is usually not useful to include the input and output, so we talk about the auxiliary space and call it the "space complexity" of that algorithm.
Let me give you an example. Suppose you are given this problem:
Given 0 <= i, j < n, find M[i][j], where M is an n x n matrix that looks like this:
| 1 1 1 1 ... 1   1 |
| 1 2 2 2 ... 2   2 |
| 1 2 3 3 ... 3   3 |
| ⋮ ⋮ ⋮ ⋮  ⋱  ⋮   ⋮ |
| 1 2 3 4 ... n-1 n |
How would you do that? Of course you could simply write such a matrix into memory, which would take O(n^2) time and space, since you need n^2 operations to fill the matrix and n^2 slots to hold it in memory.
The point is: do you really need to store a matrix to know what is inside the matrix? In this case, this matrix can be easily represented as a function:
f(i, j) = min(i, j) + 1
Thus, with O(1) time and space, you can tell what is in M[i][j] without storing the matrix itself.
The trick with this kind of question is to think about the function that gives you what is inside each slot of the matrix without having to actually fill it.
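As a small illustration of the difference, here is a sketch in Python. The names `entry` and `build_matrix` are made up for this example. The first function answers a single lookup with O(1) time and auxiliary space; the second materializes the whole matrix and needs O(n^2) of both.

```python
def entry(i, j):
    """Value of M[i][j] computed directly: O(1) time, O(1) extra space."""
    return min(i, j) + 1

def build_matrix(n):
    """Materialize the full n x n matrix: O(n^2) time and O(n^2) space."""
    return [[min(i, j) + 1 for j in range(n)] for i in range(n)]

n = 4
matrix = build_matrix(n)
for row in matrix:
    print(row)

# Both approaches agree on every cell.
print(all(matrix[i][j] == entry(i, j) for i in range(n) for j in range(n)))
```

For a function like spiral(), whose whole purpose is to return the filled array, the O(n^2) output cannot be avoided; the question is only whether you count it.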

Excel: create an array with n occurrences of a value x

I'm looking for a way to create an Excel array with n occurrences of an x value, n and x being vectors.
Desired behaviour :
|---------------------|------------------|------------------|
| occurrences | value | result |
|---------------------|------------------|------------------|
| 3 | 4 | {4;4;4;1;1} |
|---------------------|------------------|------------------|
| 2 | 1 |
|---------------------|------------------|
This is a question similar to this one, except that I want one more dimension. I'm not interested in a VBA answer, I'm looking for a formula.
I've tried playing around with INDEX and concatenation, like in the answer to the previously linked question, but with no luck so far.
This result will be used in a bigger formula that will sum the m greatest values (I already have that part figured out and working; the m value is irrelevant here). You can think of the occurrences as storage amounts, and I want the sum of the m greatest individual values.
Here's another approach in O365:
=INDEX(B:B,MATCH(SEQUENCE(SUM(A1:A3),1,0),
MMULT(N(ROW(A1:A3)>=TRANSPOSE(ROW(A1:A3))),A1:A3)-A1:A3))
where you're looking up the row number of the output array in the running total of the input counts.
I think it could be modified to work over an arbitrary range but would then be a fairly long formula.
If the inputs aren't in the sheet but coming from an array formula, then still possible but it would be a very long formula.
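To show the lookup idea outside of Excel, here is a rough Python sketch of the same logic; it is only an illustration and not something you can paste into a sheet. The name `repeat_by_counts` is made up. `starts` plays the role of the running total minus the counts, and `bisect_right(...) - 1` plays the role of MATCH with its default "largest value less than or equal to" behaviour.

```python
from bisect import bisect_right
from itertools import accumulate

def repeat_by_counts(counts, values):
    """Repeat values[i] counts[i] times by looking up each output
    position in the running total of the counts."""
    totals = list(accumulate(counts))                  # running total
    starts = [t - c for t, c in zip(totals, counts)]   # first output row of each value
    return [values[bisect_right(starts, k) - 1]        # last start <= k
            for k in range(sum(counts))]

print(repeat_by_counts([3, 2], [4, 1]))
```

For the example in the question the starts are 0 and 3, so output rows 0 to 2 pick the value 4 and rows 3 and 4 pick the value 1.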
=FILTERXML("<t><s>" & TEXTJOIN("</s><s>",TRUE,SEQUENCE(3,,4,0),SEQUENCE(2,,1,0)) & "</s></t>","//s")
will return: {4;4;4;1;1} which can be used as part of a larger formula.

how to draw the number line of the Gregorian calendar? or is this about array indices?

I'm a self-taught programmer currently teaching some kids how to program conversions between the Hebrew and Gregorian calendars. I've got a basic confusion about the lack of a year zero on the historical (Gregorian and Julian) calendars and I seem to be too dense to understand the explanation at https://en.wikipedia.org/wiki/Year_zero.
Here's how I understand a regular number line:
-2 -1 0 1 2 integer number line
----|------|------|------|------|----
<=-2|<===-1|<===-0|0====>|1====>|2==> item labels
3rd | 2nd | 1st | 1st | 2nd | 3rd ordinal numbers
There's one zero, and then positive and negative instances of all the other integers. The item between 0 and +1 is labeled '0' (as an array index or, for instance as an age—a baby isn't 1 year old until its second year.)
But historical calendars don't have a year zero, so a regular number line doesn't work, although the labels and ordinal numbers do line up:
? ? ? ? ?
----|------|------|------|------|----
<=-3|<===-2|<===-1|1====>|2====>|3==>
3rd | 2nd | 1st | 1st | 2nd | 3rd
As per the article (if I'm following it correctly), astronomical calendars add a year zero before 1 AD/CE:
? ? ? ? ?
----|------|------|------|------|----
<=-2|<===-1|<====0|1====>|2====>|3==>
3rd | 2nd | 1st | 1st | 2nd | 3rd
What I'm confused about is how to assign number line symbols or array indices for either historical or astronomical calendars.
On a regular number line, the interval between -1 and 0 has a label of -0 (in the one's place), and it contains numbers like -0.1 and -0.2. But arrays don't usually have negative indices since it's hard to distinguish between +0 and -0.
—Ok, now that I've explained this far, I think I can answer my own question, which I will do below. But if someone else gives a clearer or better answer, I'm happy to select it.
It is actually OK to give arrays negative indices. The number line looks like this:
-2 -1 0 1 2 integers
----|------|------|------|------|----
<=-3|<===-2|<===-1|0====>|1====>|2==> array indices
3rd | 2nd | 1st | 1st | 2nd | 3rd
The range between -1 and 0 is labeled -1 because it starts with -1, so the lack of symmetry around 0 for these is fine, and on the negative side, labels match ordinal numbers even though they don't on the positive side.
The thing that was throwing me off was that the astronomical calendar shifts the labels left one unit.
Since the traditional number line is symmetrical in all three, point labels, interval labels, and ordinal labels, I thought that the astronomical calendar would have to add a year zero on both sides of the origin, even though historical calendars only get rid of a single year zero.
Whether or not this lengthy exercise helps anyone else, it did help me. Thanks, StackOverflow!
In meeting with the kids we also figured out that arrows pointing in two directions away from the origin are appropriate for number lines and are good for symmetry around the origin. But array listings and calendar time are appropriately represented as proceeding from left to right, which makes symmetry impossible even if there is a zero point with negative numbers to the left. So, here are the various labelings with more sensible arrows:
-2 -1 0 1 2 integers
----|------|------|------|------|----
| | | 1st | 2nd | 3rd ordinal labels
<=-2|<===-1|<===-0|0====>|1====>|2==> real number interval labels
-3=>|-2===>|-1===>|0====>|1====>|2==> array indices
-3=>|-2===>|-1===>|1====>|2====>|3==> historical calendar years
-2=>|-1===>|0====>|1====>|2====>|3==> astronomical calendar years
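In code, the last two rows of that diagram come down to a one-unit shift on the BC side only. Here is a minimal Python sketch; the function names are made up for this example, and negative numbers stand for BC/BCE years.

```python
def historical_to_astronomical(year):
    """Historical year (..., -2, -1, 1, 2, ...) to astronomical
    year (..., -1, 0, 1, 2, ...). Historical year 0 does not exist."""
    if year == 0:
        raise ValueError("historical calendars have no year zero")
    return year if year > 0 else year + 1

def astronomical_to_historical(year):
    """Inverse mapping: astronomical year 0 is historical year -1 (1 BC)."""
    return year if year > 0 else year - 1

def years_between(start, end):
    """Whole years from the start of one historical year to the start of
    another; the subtraction is only safe on the astronomical scale."""
    return historical_to_astronomical(end) - historical_to_astronomical(start)

print([historical_to_astronomical(y) for y in (-3, -2, -1, 1, 2, 3)])
print(years_between(-1, 1))
```

Doing the subtraction on the astronomical scale avoids the off-by-one you would get from computing 1 - (-1) = 2 directly on historical year numbers.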

Stata: Observation-pairwise calculation

input X group
21 1
62 1
98 1
12 2
87 2
end
Now I am trying to calculate a measure as follows:
$$ \sum_{g} \left | X_{ig}-X_{jg} \right | $$
where $i$ and $j$ ($i \neq j$) index observations and $g$ corresponds to the group variable (here, 1 and 2).
How can I calculate this number using loops?
Looks like a Gini mean difference, apart from a scaling factor. There are numerous user-written commands already in this territory. There is (unusually) a summary within the Stata manual at [R] inequality.
In addition, this is related to the second L-moment. See the lmoments command from SSC.
You need not calculate this through a double loop over indexes. It collapses to a linear combination of the order statistics.
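To illustrate that last point, here is a sketch in Python rather than Stata, just to show the arithmetic. I am assuming the measure sums |X_i - X_j| over unordered pairs i < j within each group; for ordered pairs, double the result. With the values sorted in ascending order and k counted from 0, the pairwise sum equals the sum of (2k - n + 1) * x_(k). The function names are made up for this example.

```python
from itertools import combinations

def pairwise_abs_sum_bruteforce(xs):
    """Double loop over all unordered pairs: O(n^2)."""
    return sum(abs(a - b) for a, b in combinations(xs, 2))

def pairwise_abs_sum_sorted(xs):
    """Same quantity as a linear combination of the order statistics:
    O(n log n) for the sort, then a single pass."""
    n = len(xs)
    return sum((2 * k - n + 1) * x for k, x in enumerate(sorted(xs)))

groups = {1: [21, 62, 98], 2: [12, 87]}    # the example data, by group
total = sum(pairwise_abs_sum_sorted(xs) for xs in groups.values())
print(total)
```

For group 1 this gives 41 + 77 + 36 = 154 and for group 2 it gives 75, so the total is 229 under the unordered-pairs assumption.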
LATER: See David's 1998 paper which is open-access at
https://doi.org/10.1214/ss/1028905831

Find all possible row-wise sums in a 2D array

Ideally I'm looking for a C# solution, but any help on the algorithm will do.
I have a two-dimensional array (x,y). The number of columns (max x) varies between 2 and 10 but can be determined before the array is actually populated. The number of rows (max y) is fixed at 5, but each column can have a varying number of values, something like:
1 2 3 4 5 6 7...10
A 1 1 7 9 1 1
B 2 2 5 2 2
C 3 3
D 4
E 5
I need to come up with the total of all possible row-wise sums for the purpose of looking for a specific total. That is, a row-wise total could be the cells A1 + B2 + A3 + B5 + D6 + A7 (any combination of one value from each column).
This process will be repeated several hundred times with different cell values each time, so I'm looking for a somewhat elegant solution (better than what I've been able to come up with). Thanks for your help.
The Problem Size
Let's first consider the worst case:
You have 10 columns and 5 (full) rows per column. It should be clear that, with suitable numbers in each cell, you can get up to 5^10 = 9,765,625 ≈ 10^7 different results (the solution space).
For example, the following matrix will give you the worst case for 3 columns:
| 1 10 100 |
| 2 20 200 |
| 3 30 300 |
| 4 40 400 |
| 5 50 500 |
resulting in 5^3 = 125 different results. Each result has the form {a1 a2 a3} with ai ∈ {1, ..., 5}.
It's quite easy to show that such a matrix will always exist for any number n of columns.
Now, to get each numerical result, you need to do n-1 additions, giving a problem size of O(n·5^n). That is the worst case, and I think nothing can be done about it, because to know the possible results you NEED to actually perform the sums.
More benign incarnations:
The problem complexity may be cut down in two ways:
Fewer numbers (i.e. not all columns are full)
Repeated results (i.e. several partial sums give the same result, and you can merge them into one thread). More on this later.
Let's see a simplified example of the latter with three rows:
| 7 6 100 |
| 3 4 200 |
| 1 2 200 |
At first sight you will need to do 2·3^3 sums. But that's not really the case. As you add up the first two columns you don't get the expected 9 different results, but only 6 ({13,11,9,7,5,3}).
So you don't have to carry your nine results up to the third column, but only 6.
Of course, that comes at the expense of deleting the repeated numbers from the list. The "Removal of Repeated Integer Elements" was posted before on SO and I'll not repeat the discussion here, but just note that a mergesort, O(m log m) in the list size (m), will let you remove the duplicates. If you want something easier, a double loop, O(m^2), will do.
Anyway, I'll not try to calculate the size of the (mean) problem in this way, for several reasons. One of them is that the "m" in the merge sort is not the size of the problem, but the size of the vector of results after adding up any two columns, and that operation is repeated (n-1) times ... and I really don't want to do the math :(.
The other reason is that, since I implemented the algorithm, we can use some experimental results and spare ourselves my surely leaky theoretical considerations.
The Algorithm
With what we said before, it is clear that we should optimize for the benign cases, as the worst case is a lost cause.
For doing so, we need to use lists (or variable dim vectors, or whatever can emulate those) for the columns and do a merge after every column add.
The merge may be replaced by several other algorithms (such as an insertion on a BTree) without modifying the results.
So the algorithm (procedural pseudocode) is something like:
Set result_vector to Column 1
For column i in (2 to n)
    Add every element of result_vector to every element of column i,
        giving a new result_vector
    Remove repeated integers in the result_vector
Next column
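For readers who prefer something runnable, the column-by-column idea can be sketched in Python as follows. A set does the duplicate removal here instead of a merge sort, which is my substitution, and `all_row_sums` is a made-up name. The six-column layout used for the question's example is an assumption read off the flattened table.

```python
def all_row_sums(columns):
    """All distinct sums obtained by picking one value from each column.

    `columns` is a list of non-empty lists (columns may differ in length).
    Duplicates are dropped after every column, so the working set never
    grows beyond the number of distinct partial sums.
    """
    sums = {0}
    for column in columns:
        sums = {s + v for s in sums for v in column}
    return sorted(sums)

# First two columns of the small example: 9 combinations, 6 distinct sums.
print(all_row_sums([[7, 3, 1], [6, 4, 2]]))

# The question's example, with an assumed column layout.
columns = [[1, 2, 3, 4, 5], [1, 2, 3], [7, 5], [9, 2], [1, 2], [1]]
result = all_row_sums(columns)
print(result)
print(21 in result)   # checking for a specific total
```

Looking for a specific total is then just a membership test on the result.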
Or as you asked for it, a recursive version may work as follows:
function genResVector(a:list, b:list): returns list
local c:list
{
    Set c = CartesianProduct (a x b)
    Set c = Sum up each element {a[i],b[j]} of c
    Drop repeated elements of c
    Return(c)
}
function RecursiveAdd(a:matrix, i:integer): returns list
{
    Return(genResVector(Column i from a, RecursiveAdd(a, i-1)))
}
function RecursiveAdd(a:matrix, i==0:integer): returns list = {0}
Algorithm Implementation (Recursive)
I chose a functional language (Mathematica); I guess it's no big deal to translate it to any procedural one.
Our program has two functions:
genResVector, which sums two lists giving all possible results with repeated elements removed, and
recursiveAdd, which recurses on the matrix columns adding up all of them.
The code is:
genResVector[x_List, y_List] :=        (* takes two lists as input *)
  Union[                               (* remove duplicates from the resulting list *)
    Apply[Plus,                        (* Plus is the function applied to each pair *)
      Tuples[{x, y}], {1}]];           (* all combinations of one element from each list *)
recursiveAdd[t_, i_] :=                (* recursive add over the columns of t *)
  genResVector[t[[i]], recursiveAdd[t, i - 1]];
recursiveAdd[t_, 0] := {0};            (* base case that stops the recursion *)
Test
If we take your example list
| 1 1 7 9 1 1 |
| 2 2 5 2 2 |
| 3 3 |
| 4 |
| 5 |
And run the program the result is:
{11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27}
The maximum and minimum are very easy to verify since they correspond to taking the Min or Max from each column.
Some interesting results
Let's consider what happens when the numbers in each position of the matrix are bounded. For that we will take a full (10 x 5) matrix and populate it with random integers.
In the extreme case where the integers are only zeros or ones, we may expect two things:
A very small result set
Fast execution, since there will be a lot of duplicate intermediate results
If we increase the Range of our Random Integers we may expect increasing result sets and execution times.
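A rough way to try this yourself is sketched below in Python. It is not the code used for the experiments here; `distinct_sums` and `random_columns` are made-up names, and the seed and ranges are arbitrary. Since every sum of 10 values in 0..max_value lies between 0 and 10*max_value, the result set can never hold more than 10*max_value + 1 entries, which is why small ranges stay cheap.

```python
import random

def distinct_sums(columns):
    """Distinct sums of one value per column, deduplicated after each column."""
    sums = {0}
    for column in columns:
        sums = {s + v for s in sums for v in column}
    return sums

def random_columns(max_value, n_cols=10, n_rows=5, seed=0):
    """A full n_cols x n_rows matrix of random integers in 0..max_value."""
    rng = random.Random(seed)
    return [[rng.randint(0, max_value) for _ in range(n_rows)]
            for _ in range(n_cols)]

for max_value in (1, 10, 100, 1000):
    count = len(distinct_sums(random_columns(max_value)))
    print(max_value, count, "upper bound:", 10 * max_value + 1)
```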
Experiment 1: 5x10 matrix populated with varying range random integers
It's clear enough that for a result set near the maximum result set size (5^10 ≈ 10^7) the calculation time and the number of distinct results have an asymptote. The fact that we see increasing functions just means that we are still far from that point.
Moral: The smaller your elements are, the better your chances of getting it fast. This is because you are likely to have a lot of repetitions!
Note that our maximum calculation time is near 20 secs for the worst case tested.
Experiment 2: Optimizations that aren't
Having a lot of memory available, we can calculate by brute force, not removing the repeated results.
The result is interesting ... 10.6 secs! ... Wait! What happened? Our little "remove repeated integers" trick is eating up a lot of time, and when there are not many results to remove there is no gain, only losses from trying to get rid of the repetitions.
But we may get a lot of benefit from the optimization when the max numbers in the matrix are well under 5·10^5. Remember that I'm doing these tests with the 5x10 matrix fully loaded.
The moral of this experiment is: The repeated integer removal algorithm is critical.
HTH!
PS: I have a few more experiments to post, if I get the time to edit them.
