In DBSCAN clustering, why does the number of iterations fluctuate as the cluster size increases?

In DBSCAN clustering, as I increase the cluster size (for example 1, 2, 3, 4, ...), the number of iterations fluctuates: for cluster size 2 it took 3 iterations, for 4 it took 4, up to 5 it stayed at 4 iterations, for 6 it increased to 5, then 7, and later 5, 4, 4. Why is it so? Is it the nature of the dataset I have chosen, or some other reason?

You need to use large data to get stable measurements. Most likely, your data set is just way too small.
Make sure to disable turbo boost.
Also repeat each experiment a dozen times to estimate your measurement error. As long as the differences are within the standard error of the mean, consider them to be just random.
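For example, here is a minimal Python sketch of that procedure; run_experiment is a hypothetical stand-in for whatever you are measuring (iteration counts or runtimes):

import statistics

def measure(run_experiment, repeats=12):
    # Repeat the experiment and report the mean and the standard error of the mean.
    samples = [run_experiment() for _ in range(repeats)]
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, sem

Two configurations are only meaningfully different if their means differ by more than a few standard errors.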

Related

How can you create an array of integers that roughly averages to a given number?

I'm not trying to find the average of an array, I'm trying to create an array that will roughly average to a desired number.
My use case is that I have 2 stepper motors that each need to perform a smooth movement over roughly the same amount of time. Steppers move in discrete steps with an integer ms delay between these steps. I need to be able to control the speed of the "faster" motor (ie: the one that needs to take more total steps with a smaller, constant delay between steps) and the speed of the "slower" motor should adjust as needed.
Consider the case where Motor A needs to take 100 steps and Motor B needs to take 150. The delay between Motor B's steps must be 1ms so the delay between Motor A's steps would then be 1.5ms. This doesn't work since the step delay must be an integer.
To that end, I believe you can solve this problem by generating an array whose length equals the total number of steps and whose integer elements average out, overall, to that 1.5ms delay. The example for this case would simply be:
motor_a_step_delays = [1, 2, 1, 2, 1, 2 ... 100 elements total ...]
My issue is that I can't seem to find a good way to create this array. The integer elements should be "close" (for lack of a better word). Something like [51, 1, 1, ... 97 more 1's...] would be correct, but not result in smooth, even movement.
This problem feels like it's been solved, but I don't know how to even start searching for it. This seems like it'd have utility in CNC, robotics, or game design applications.
As usual, the act of typing out my issue made me stop and think about what's actually happening.
Fundamentally, the array would only contain the floor and ceil of the desired average delay. If the desired average were 2.25, the final array would be some combination of 2s and 3s, but never 1 or 4. Once I realized that, it became simple: the proportion of ceils to floors corresponds to how far the desired delay is from its floor. In other words, 2.25 needs an array of 75% 2's and 25% 3's. Easy!
Here is what I ended up with (Elixir):
def generate_step_delays(steps, desired_delay) do
  desired_delay_ceil = ceil(desired_delay)
  desired_delay_floor = floor(desired_delay)
  # The ratio of ceils to floors needed
  ratio = desired_delay - desired_delay_floor
  ceil_list = List.duplicate(desired_delay_ceil, round(steps * ratio))
  floor_list = List.duplicate(desired_delay_floor, steps - length(ceil_list))

  ceil_list
  |> Enum.concat(floor_list)
  |> Enum.shuffle()
end
This implementation randomizes the final array since that works best for my case. However, it would be simple to swap out Enum.shuffle and evenly distribute the numbers if needed.
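If you do need the even distribution, one way to get it (shown here as a Python sketch of the idea rather than Elixir) is to emit the difference between successive rounded cumulative targets, in the spirit of Bresenham's line algorithm:

import math

def generate_step_delays(steps, desired_delay):
    # Each delay is the increment of the rounded running total, so the
    # cumulative sum tracks i * desired_delay as closely as possible.
    return [math.floor(i * desired_delay) - math.floor((i - 1) * desired_delay)
            for i in range(1, steps + 1)]

print(generate_step_delays(8, 1.5))   # [1, 2, 1, 2, 1, 2, 1, 2]

This keeps the floors and ceils interleaved instead of clumped, which is what smooth motion needs.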

Really basic stuff! Use a for loop or while loop to design a plan that chooses the minimum number of gifts

There are four gifts, which cost 100, 30, 5, and 2. You have $89 to spend on them. Now you want to spend as much of your money as possible while buying the minimum number of gifts (I know it doesn't sound reasonable at first). For example, in this case, starting from the most expensive one: you cannot afford it, so you can only choose two $30 gifts. Now you have $29 left, so you buy five $5 gifts and then two $2 gifts, totalling 9 gifts with $0 left. That is the plan, and the desired output is 9. I need code that can generate this kind of plan no matter what is input at first. If I change the numbers to 40, 30, 8, 3 and $100, the best plan should still be output.
I got a hint that we can input the numbers from big to small, for example list1=[100,30,5,2,89] (the costs of the gifts first and then the total money you have), and then select the maximum number of the most expensive gift and see if there is any money left for the other gifts.
It is a beginner question, so please don't make it look too hard. Just use for loops and while loops (as if you had just started to learn).
There is no need to generate random numbers; you can use 100, 30, 5, 2 or other numbers you like.
Thanks so much, I really need your help! I'm kind of desperate now.
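For reference, a minimal Python sketch of the greedy plan described in the hint (most expensive gift first, buy as many as you can afford, then move on); it reproduces the 9-gift example above, though a greedy plan is not guaranteed to be optimal for every set of prices:

def plan_gifts(prices, money):
    total_gifts = 0
    for price in sorted(prices, reverse=True):   # most expensive first
        count = money // price                   # how many of this gift we can afford
        total_gifts += count
        money -= count * price
    return total_gifts, money                    # number of gifts and leftover money

print(plan_gifts([100, 30, 5, 2], 89))           # (9, 0)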

How to illustrate the improvement between two prediction models statistically

I have a problem with how to compare two data sets statistically, and I would appreciate your help.
Problem:
I have two data sets from two prediction models, for example M1={1 2 3 4 5} and M2={3 4 2 5 6}, which are the mean square errors between the raw and predicted data. How, or with what tools, can I find and show the improvement from M2 to M1?
What I did:
I tried the standard deviation of each set and compared them. There is some increase, but it is not that obvious. What other tests can I do, and how can I tell whether the improvement is significant or not?
Thanks for your help.
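One common starting point for paired error measurements like these is a paired significance test. Here is a minimal Python sketch, assuming M1 and M2 are the MSEs of the two models on the same test cases and that SciPy is available:

from scipy import stats

m1 = [1, 2, 3, 4, 5]   # MSE per test case, model 1
m2 = [3, 4, 2, 5, 6]   # MSE per test case, model 2

# Paired t-test (assumes the per-case differences are roughly normal)
t_stat, p_t = stats.ttest_rel(m1, m2)

# Wilcoxon signed-rank test (non-parametric alternative for paired samples)
w_stat, p_w = stats.wilcoxon(m1, m2)

print(p_t, p_w)   # small p-values suggest the difference in errors is significant

Note that with only five paired values any test has very little power, so more test cases will make the comparison much more convincing.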

Genetic Algorithm Sudoku - optimizing mutation

I am in the process of writing a genetic algorithm to solve Sudoku puzzles and was hoping for some input. The algorithm solves puzzles occasionally (about 1 out of 10 times on the same puzzle with a maximum of 1,000,000 iterations), and I am trying to get a little input about mutation rates, repopulation, and splicing. Any input is greatly appreciated, as this is brand new to me and I feel like I am not doing things 100% correctly.
A quick overview of the algorithm
Fitness Function
Counts the number of unique values from 1 through 9 in each column, row, and 3*3 sub-box. The unique-value count of each subset is divided by 9, giving a floating-point value between 0 and 1. The sum of these 27 values is then divided by 27, providing a total fitness value between 0 and 1, where 1 indicates a solved puzzle.
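A minimal Python sketch of that fitness function, assuming the board is a 9x9 list of lists of digits (the names here are illustrative, not the poster's actual code):

def fitness(board):
    digits = set(range(1, 10))
    units = list(board)                                           # 9 rows
    units += [[row[c] for row in board] for c in range(9)]        # 9 columns
    units += [[board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
              for br in (0, 3, 6) for bc in (0, 3, 6)]            # 9 sub-boxes
    # Fraction of distinct digits per unit, averaged over all 27 units.
    return sum(len(set(unit) & digits) / 9.0 for unit in units) / 27.0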
Population Size:
100
Selection:
Roulette method. Each node is randomly selected, where nodes containing higher fitness values have a slightly better chance of selection.
Reproduction:
Two randomly selected chromosomes/boards swap a randomly selected subset (row, column, or 3*3 box). The selection of the subset (which row, column, or box) is random. The resulting boards are introduced into the population.
Reproduction Rate: 12% of population per cycle
There are six reproductions per iteration resulting in 12 new chromosomes per cycle of the algorithm.
Mutation: mutation occurs at a rate of 2 percent of the population after 10 iterations with no improvement in the highest fitness.
Listed below are the three mutation methods which have varying weights of selection probability.
1: Swap randomly selected numbers. The method selects two random numbers and swaps them throughout the board. This method seems to have the greatest impact early in the algorithm's growth pattern. 25% chance of selection.
2: Introduce random changes: randomly select two cells and change their values. This method seems to help keep the algorithm from converging. 65% chance of selection.
3: Count the number of occurrences of each value on the board. A solved board contains exactly 9 of each number between 1 and 9. This method takes any number that occurs fewer than 9 times and randomly swaps it with a number that occurs more than 9 times. This seems to have a positive impact on the algorithm, but only when used sparingly. 10% chance of selection.
My main question is at what rate I should apply mutation. It seems that as I increase mutation I get faster initial results. However, as the population approaches a correct result, I think the higher rate of change introduces too many bad chromosomes and genes into the population. With a lower rate of change, on the other hand, the algorithm seems to converge too early.
One last question is whether there is a better approach to mutation.
You can anneal the mutation rate over time to get the sort of convergence behavior you're describing. But I actually think there are probably bigger gains to be had by modifying other parts of your algorithm.
Roulette wheel selection applies a very high degree of selection pressure in general. It tends to cause a pretty rapid loss of diversity fairly early in the process. Binary tournament selection is usually a better place to start experimenting. It's a more gradual form of pressure, and just as importantly, it's much better controlled.
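For comparison, a minimal Python sketch of binary tournament selection (assuming you keep one fitness value per individual, in parallel lists here):

import random

def binary_tournament(population, fitness):
    # Pick two individuals at random and return the fitter one.
    a, b = random.sample(range(len(population)), 2)
    return population[a] if fitness[a] >= fitness[b] else population[b]

def mating_pool(population, fitness, n):
    # Build a pool of n parents using repeated binary tournaments.
    return [binary_tournament(population, fitness) for _ in range(n)]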
With a less aggressive selection mechanism, you can afford to produce more offspring, since you don't have to worry about producing so many near-copies of the best one or two individuals. Rather than 12% of the population producing offspring (possibly fewer because of repetition of parents in the mating pool), I'd go with 100%. You don't necessarily need to literally make sure every parent participates; just generate the same number of offspring as you have parents.
Some form of mild elitism will probably then be helpful so that you don't lose good parents. Maybe keep the best 2-5 individuals from the parent population if they're better than the worst 2-5 offspring.
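One way to implement that kind of mild elitism, as a hedged Python sketch (assuming a fitness function and equal-sized parent and offspring lists): the best few parents survive only if they beat the worst offspring.

def next_generation(parents, offspring, fitness, elite=3):
    # Let the top `elite` parents compete with the offspring for a place
    # in the next generation, keeping the population size at len(offspring).
    pool = sorted(parents, key=fitness, reverse=True)[:elite] + offspring
    return sorted(pool, key=fitness, reverse=True)[:len(offspring)]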
With elitism, you can use a bit higher mutation rate. All three of your operators seem useful. (Note that #3 is actually a form of local search embedded in your genetic algorithm. That's often a huge win in terms of performance. You could in fact extend #3 into a much more sophisticated method that looped until it couldn't figure out how to make any further improvements.)
I don't see an obvious better/worse set of weights for your three mutation operators. I think at that point, you're firmly within the realm of experimental parameter tuning. Another idea is to inject a bit of knowledge into the process and, for example, say that early on in the process, you choose between them randomly. Later, as the algorithm is converging, favor the mutation operators that you think are more likely to help finish "almost-solved" boards.
I once made a fairly competent Sudoku solver, using GA. Blogged about the details (including different representations and mutation) here:
http://fakeguido.blogspot.com/2010/05/solving-sudoku-with-genetic-algorithms.html

Random number generator repeats some numbers too often

I'm writing a lottery draw simulation program as a project. The way the game works is that you need to pick the 6 numbers that are drawn from the 49 to win.
Your chance of winning is 1/13,983,816 because that's how many combinations of 6 in 49 there are. The demo program on Go playground generates six new numbers each time around the loop forever.
Each time a new set of numbers is generated, I test whether it already exists and, if it does, I break out of the loop. With 13,983,816 combinations you would think it would be a long time before the same 6 numbers repeated, but in testing it always fails before 10,000 iterations. Does anyone know why this is happening?
In my opinion you have a couple of problems here.
You use the Go Playground, where your randomness is fixed. The line rand.Seed(time.Now().UnixNano()) always produces the same seed because time.Now() is the same there.
You test something completely different with your simulation. I will write about that at the end.
If you want to do something similar to real gambling, you have to use a cryptographically secure PRNG, and Go has one. If you want, you can read more details here (the answer is to a PHP question, but it explains the difference).
On the probability part:
The probability of winning your lottery is indeed 1/C(49, 6) = 1/13,983,816. But this is the probability that someone selects an already predefined set of numbers. For example, you declare that the winning set is {1, 5, 47, 3, 4, 6}, and now the probability that someone wins is approximately 1 in 14 million. So you would have to randomly select one set of 6 numbers and then, in a loop, compare each new selection against that fixed set.
But what you actually check is the probability of a collision: that among N selections, some of them are the same set (not necessarily the winning set). This is known as the birthday paradox, and as you can see there, the probability of a collision increases dramatically with the number of selections N.
It is exactly the same problem, except that your number of "days in the year" is 13,983,816. You can check that for this many days you need only about 5,000 iterations to get a collision with probability about 0.59, and with 9,000 iterations you will find a collision with probability about 0.94.
I believe you are solving a different problem: the likelihood that two identical draws show up is much higher than the likelihood that one specific draw shows up.
This is known as the birthday problem.
BTW, a rough rule of thumb for the birthday paradox is that if you have N days, you need roundabout sqrt(N) people to get a good chance (around 50%) of a collision.
So, for the original birthday paradox, you have 365 days, so the rule of thumb gives you that with 365^.5 or about 19 people you already have a decent chance of collision (true answer for >50%: 23 people).
Here, with 13,983,816 possible outcomes, the rule of thumb tells you that with 3739 draws you have a pretty good chance of collision (true answer for 50%: 4400 draws).
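For concreteness, here is a minimal Python sketch of the standard approximation behind those numbers, P(collision) ≈ 1 - exp(-n^2 / (2N)):

import math

def collision_probability(draws, outcomes):
    # Approximate probability that at least two of `draws` samples are identical.
    return 1.0 - math.exp(-draws * draws / (2.0 * outcomes))

N = 13_983_816                            # C(49, 6) possible tickets
print(collision_probability(3739, N))     # ~0.39  (sqrt(N) rule of thumb)
print(collision_probability(4400, N))     # ~0.50
print(collision_probability(9000, N))     # ~0.94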
