I am a newbie to C. I have n structs, each holding 4 members: first the unique index of an atom, and then three floats representing spatial coordinates in 3D space. I need to find the k nearest structs according to Euclidean distance.
//struct for input csv data
struct oxygen_coordinates
{
    unsigned int index; // index of an atom
    // x, y and z coordinates of the atom
    float x;
    float y;
    float z;
};
struct oxygen_coordinates atom_data[n];
//I need to write a function something like,
knn(atom_data[i], atom_data, k); // This should return the k closest structs based on Euclidean distance.
//I have already written a function to get distances.
// Distance function for two points stored in structs
float getDistance(struct oxygen_coordinates a, struct oxygen_coordinates b)
{
    float distance;
    distance = sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) + (a.z - b.z) * (a.z - b.z));
    return distance;
}
At this point I am totally lost; any leads on an algorithm would be really helpful. In particular, since my data set contains only 3D coordinates, do I really need to classify the points? Thank you in advance.
Here is some code that might help you. This code is just to give an idea about the approach to the problem, as asked in the question.
// Declare a global array that will hold the 4 nearest atom_data entries...
struct oxygen_coordinates nearestNeighbours[4];

// This function adds the structure passed to it until the array is full;
// after that it overwrites entries starting from the first again...
void addStructure(struct oxygen_coordinates possibleNeighbour) {
    static int counter = 0;
    nearestNeighbours[counter % 4] = possibleNeighbour;
    counter++;
}
Here, given_atom is the atom you want to find the neighbours of, and atom_data is the whole array.
Now we make a new float variable which stores the min distance found so far, and initialize it with a very high value.
After that we loop through the atom_data and if we find a candidate with distance less than the min value we have stored, we update the min value and add the structure to our nearestNeighbours array via the add method we created above.
Once we have looped through the entire array, we will have the 4 nearest atoms inside the nearestNeighbours array.
// given_atom is the query atom, atom_data is the whole array of n atoms.
void knn(struct oxygen_coordinates given_atom, struct oxygen_coordinates atom_data[], int n) {
    float minDistance = 10000.0f; // Some large value...
    for (int i = 0; i < n; i++) {
        float tempDistance = getDistance(given_atom, atom_data[i]);
        if (tempDistance < minDistance) {
            minDistance = tempDistance;
            addStructure(atom_data[i]);
        }
    }
}
The time complexity will depend on the length of the atom_data, i.e. n. If the array is stored in a sorted manner, this time complexity can be reduced significantly.
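As a point of reference, here is a simple brute-force alternative (my own sketch, not the approach from the answer above): compute every distance once, then partially sort by distance so the first k entries are the closest. The function and parameter names are illustrative only; it assumes struct oxygen_coordinates and getDistance() from the question.

// Brute-force k-nearest-neighbour sketch: O(n*k) partial selection sort.
void knn_bruteforce(struct oxygen_coordinates query,
                    struct oxygen_coordinates atoms[], int n,
                    int k, unsigned int out_indices[])
{
    float dist[n]; // VLAs for simplicity; use malloc for large n
    int idx[n];
    for (int i = 0; i < n; i++) {
        dist[i] = getDistance(query, atoms[i]);
        idx[i] = i;
    }
    // Partial selection sort: only the first k positions need to be ordered.
    for (int i = 0; i < k && i < n; i++) {
        int best = i;
        for (int j = i + 1; j < n; j++)
            if (dist[j] < dist[best])
                best = j;
        float td = dist[i]; dist[i] = dist[best]; dist[best] = td;
        int   ti = idx[i];  idx[i]  = idx[best];  idx[best]  = ti;
        out_indices[i] = atoms[idx[i]].index; // unique index of the i-th nearest atom
    }
}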
You may want to use a spatial index, such as the boost R-Tree. There are others, but this is the only one that comes with boost, as far as I am aware.
Other (much simpler) spatial indexes are quadtrees and kD-trees.
I want to find the maximum sum of the given values in a given range; the values chosen have to be contiguous, and the sum must be calculated in O(log n) complexity.
For example, if the input values are 1 2 -100 3 4 and I want the maximum sum from index 1 to 3 (indices start from 1), then the output will be 3, because 1+2=3.
I know this question has been solved in a way using an array; you can look it up here.
Now I have to do it in O(log n) complexity... All I can think of is to build an AVL tree to save the values. But I don't know exactly how to find the max sum, since it also has to be contiguous. Does anyone have any ideas?
Using this I could find a max sum, but not a contiguous one. I tried to fix it to satisfy that condition but I still can't.
By the way, my structure for the AVL tree is:
struct Node{
    int data;
    int height; // used to build the AVL tree
    int index;  // saves the sequence of the input
    struct Node* left;
    struct Node* right;
};
// My function to find the max sum (x and y are my range indices, from x to y)
int findMaxSum(struct Node* root, int x, int y)
{
    int Max, returnMax;
    if (root == NULL){
        return 0;
    }
    int left = findMaxSum(root->left, x, y);
    int right = findMaxSum(root->right, x, y);
    if (root->index >= x && root->index <= y && (preIndex == x || (root->index - 1 == preIndex))){
        // preIndex is a global variable that saves the previous index
        returnMax = max((max(left, right) + root->data), root->data);
        Max = max(returnMax, (left + right + root->data));
        if (Max > maxSum){
            maxSum = Max;
            preIndex = root->index;
        }
    }
    else returnMax = root->data;
    return returnMax;
}
My actual input is read from a txt file: 8 -4990 -9230 -3269 -2047 2875 3955 6022 8768 1 4 (the 8 means there will be 8 numbers in the calculation; 1 4 means to choose the max sum among -4990 -9230 -3269 -2047, since indices start from 1).
My wrong output is 1908, and the answer should be -2047.
I think the problem with my code is that when I recurse, the sum accumulates the sums of every node (even the nodes I don't want to count).
Is there another way of solving this or can it be fixed by adding conditions?
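For what it's worth, the standard way to get O(log n) range queries for this problem is not an AVL tree but a segment tree in which every node summarizes its interval with four values: the total sum, the best prefix sum, the best suffix sum, and the best contiguous sum. The sketch below is my own illustration (not code from the question) and shows only the O(1) merge step; building the tree and splitting the query range follow the usual segment-tree pattern.

struct SegNode {
    int total;   // sum of the whole interval
    int prefix;  // best sum of a prefix of the interval
    int suffix;  // best sum of a suffix of the interval
    int best;    // best contiguous sum anywhere in the interval
};

static int imax(int a, int b) { return a > b ? a : b; }

// Combine the summaries of a left and a right child interval.
struct SegNode combine(struct SegNode L, struct SegNode R)
{
    struct SegNode res;
    res.total  = L.total + R.total;
    res.prefix = imax(L.prefix, L.total + R.prefix);
    res.suffix = imax(R.suffix, R.total + L.suffix);
    res.best   = imax(imax(L.best, R.best), L.suffix + R.prefix);
    return res;
}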
I have this snippet of code with some pointer math that I'm having trouble understanding:
#include <stdlib.h>
#include <complex.h>
#include <fftw3.h>
int main(void)
{
    int i, j, k;
    int N, N2;
    fftwf_complex *box;
    fftwf_plan plan;
    float *smoothed_box;

    // Allocate memory for arrays (Ns are set elsewhere and properly,
    // I've just left it out for clarity)
    box = (fftwf_complex *)fftwf_malloc(N * sizeof(fftwf_complex));
    smoothed_box = (float *)malloc(N2 * sizeof(float));

    // Create complex data and fill box with it. Do FFT. Box has the
    // Hermitian symmetry that complex data has when doing FFTs with
    // real data
    plan = fftwf_plan_dft_c2r_3d(N,N,N,box,(float *)box,
                                 FFTW_ESTIMATE);
    ...
    // end fft

    // Now do the loop I don't understand
    for(i = 0; i < N2; i++)
    {
        for(j = 0; j < N2; j++)
        {
            for(k = 0; k < N2; k++)
            {
                smoothed_box[R_INDEX(i,j,k)] = *((float *)box +
                    R_FFT_INDEX(i*f + 0.5, j*f + 0.5, k*f + 0.5))/V;
            }
        }
    }

    // Do other stuff
    ...
    return 0;
}
Where f and V are just some numbers that are set elsewhere in the code and don't matter for this particular question. Additionally, the functions R_FFT_INDEX and R_INDEX don't really matter, either. What's important is that, for the first loop iteration, when i=j=k=0, R_INDEX = 0 and R_FFT_INDEX = 45. smoothed_box has 8 elements and box has 320.
So, in gdb, when I print smoothed_box[0] after the loop, I get smoothed_box[0] = some number. Now, I understand that, for an array of normal types, say floats, *(array + integer) will give array[integer], assuming that integer is within the bounds of the array.
However, fftwf_complex is defined as typedef float fftwf_complex[2], as you need to hold both the real and imaginary parts of the complex number. It's also being cast to a float * from a fftwf_complex *, and I'm unsure what this does, given the typedef.
All I know is that when I print box[45] in gdb, I get box[45] = some complex number that is not smoothed_box[0] * V. Even when I print *((float *)box + 45)/V, I get a different number than smoothed_box[0].
So, I was just wondering if anyone could explain to me the pointer math that is being done in the above loop? Thank you, and I appreciate your time!
box is allocated as an array of N fftwf_complex. Then a backward 3D c2r fftw transform using N,N,N is performed on box, which requires N*N*(N/2+1) fftwf_complex. See http://www.fftw.org/fftw3_doc/Real_002ddata-DFT-Array-Format.html#Real_002ddata-DFT-Array-Format Therefore, this code might trigger undefined behavior, such as a segmentation fault, before it even reaches the pointer arithmetic...
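For reference, an allocation consistent with the layout described above would look something like this (my own sketch, assuming N is already set; it is not the original code):

// In-place c2r transform: N*N*(N/2+1) complex values are needed, which the
// real view of the same buffer sees as N*N*2*(N/2+1) floats.
box  = (fftwf_complex *)fftwf_malloc(sizeof(fftwf_complex) * N * N * (N / 2 + 1));
plan = fftwf_plan_dft_c2r_3d(N, N, N, box, (float *)box, FFTW_ESTIMATE);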
It is practical to cast box back to an array of float because the DFT is performed in place. Indeed, box is used twice when the fftwf_plan is created: box is both the input array of complex values and the output array of real values:
plan = fftwf_plan_dft_c2r_3d(N,N,N,box,(float *)box,
FFTW_ESTIMATE);
Once fftwf_execute(plan); is called, box is better seen as an array of real values. Nevertheless, this array is of size N*N*2*(N/2+1), and the items located at positions i,j,k with k>N-1 are meaningless. See FFTW's Real-data DFT Array Format:
For an in-place transform, some complications arise since the complex data is slightly larger than the real data. In this case, the final dimension of the real data must be padded with extra values to accommodate the size of the complex data—two extra if the last dimension is even and one if it is odd. That is, the last dimension of the real data must physically contain 2 * (n_{d-1}/2 + 1) double values (exactly enough to hold the complex data). This physical array size does not, however, change the logical array size—only n_{d-1} values are actually stored in the last dimension, and n_{d-1} is the last dimension passed to the planner.
This is the reason why the real array smoothed_box is introduced, though an N*N*N array would be expected. If smoothed_box were an array of size N*N*N, then the following conversion could have been performed:
for(i=0; i<N; i++){
    for(j=0; j<N; j++){
        for(k=0; k<N; k++){
            smoothed_box[(i*N+j)*N+k] = ((float *)box)[(i*N+j)*(2*(N/2+1))+k];
        }
    }
}
Goal: implement the diagram shown below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array and the temp array and then accumulate all those values into one at the end. (That is probably the most time-intensive operation; parallelism would be really helpful here.)
I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).
Description of the picture:
One at a time, the values are passed into an array (the temp array) which is the same size as the coefficient array. Every time a single value is passed into this array, the temp array is multiplied with the coefficient array in parallel and the values at each index are then accumulated into one single element. This continues until the input array reaches its final element.
What happens with my code?
For 60 elements from the input it takes over 8000 ms, and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a far better way to do what I am attempting. Here is my code below.
Here are some things that I know are wrong with the code for sure. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want this line to do is simply multiply the two arrays in parallel.
I tried to figure out why it was taking so long to do the FIFO function, so I started commenting lines out. I first started by commenting out everything except the first for loop of the FIFO function. As a result it took 50 ms. Then when I uncommented the next loop, it jumped to 8000 ms. So the delay has to do with the transfer of data.
Is there a register shift that I could use in OpenCL? Perhaps some logical shifting method for integer arrays? (I know there is a '>>' operator.)
float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];

float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {
    //take an array of 58 elements (or the same size as the number of coefficients)
    //shift all elements to the right by one
    //bring the next element into index 0 from the input
    //multiply the coefficient array with the array that's the same size as the coefficients and accumulate
    //store into one output value of the output array
    //repeat till the input array has reached the end

    int globalId = get_global_id(0);
    float output = 0.0f;

    //Shift everything down from 1 to 57
    //takes about 50ms here
    for(int i=1; i<58; i++){
        tempArrayForShift[i] = temp[i];
    }

    //Input the new value passed from the main kernel. The rest of the values were shifted over, so the element is written at index 0.
    tempArrayForShift[0] = inputValue;

    //Takes about 8000ms with this loop included
    //Write values back into the temp array
    for(int i=0; i<58; i++){
        temp[i] = tempArrayForShift[i];
    }

    //all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
    //I am 100% sure this line is crashing the program.
    //multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

    //Sum the temp array with each other. The temp array consists of coefficients * fifo buffer
    for (int i = 0; i < 58; i++) {
        // output = multipliedResult[i] + output;
    }

    //Return the summed value of the temp array
    return output;
}

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
    //Initialize the temporary array values to 0
    for (int i = 0; i < 58; i++) {
        temp[i] = 0;
        tempArrayForShift[i] = 0;
        multipliedResult[i] = 0;
    }

    //fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
    for (int i = 0; i < 60; i++) {
        Output[i] = fifo(Array[i], coefficients, 58);
    }
}
I have had this problem with OpenCL for a long time. I am not sure how to implement parallel and sequential instructions together.
Another alternative I was thinking about
In the main cpp file, I was thinking of implementing the fifo buffer there and having the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop. Would that be the better solution, or would it just be completely inefficient?
To get good performance out of a GPU, you need to parallelize your work across many threads. In your code you are using just a single thread; a GPU is very slow per thread but can be very fast if many threads are running at the same time. In this case you can use a single thread for each output value. You do not actually need to shift values through an array: for every output value a window of 58 values is considered, so you can just grab those values from memory, multiply them with the coefficients, and write back the result.
A simple implementation would be (launch with as many threads as output values):
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 58; i++)
    {
        float tmp = 0;
        if (globalId + i > 56)
        {
            tmp = Array[i + globalId - 57] * coefficients[57 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}
This is not perfect, as the memory access patterns it generates are not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, as the values are reused several times. The operation you are trying to perform is called a (1D) convolution. NVidia has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version. You can adapt their convolutionRows kernel for a 1D convolution.
Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.
The steps:
1. init local values to 0
2. copy coefficients to local variable
then, looping over the output elements to compute:
3. shift existing elements (work items > 0 only)
4. copy new element (work item 0 only)
5. compute dot product
   5a. multiplication - one per work item
   5b. reduction loop to compute sum
6. copy dot product to output (WI 0 only)
7. final barrier
the code:
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums){

    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    int localSize = get_local_size(0);

    //1 init local values to 0
    localArray[localId] = 0.0f;

    //2 copy coefficients to local
    //don't bother with this if __constant is working for you
    //requires another local to be passed in: localCoeff
    //localCoeff[localId] = coefficients[localId];

    //barrier for both steps 1 and 2
    barrier(CLK_LOCAL_MEM_FENCE);

    float tmp;
    //outputSize is assumed to be known here (e.g. passed in as a kernel argument)
    for(int i = 0; i < outputSize; i++)
    {
        //3 shift elements (+barrier)
        if(localId > 0){
            tmp = localArray[localId - 1];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        localArray[localId] = tmp;

        //4 copy new element (work item 0 only, + barrier)
        if(localId == 0){
            localArray[0] = Array[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        //5 compute dot product
        //5a multiply + barrier
        localSums[localId] = localArray[localId] * coefficients[localId];
        barrier(CLK_LOCAL_MEM_FENCE);

        //5b reduction loop + barrier
        for(int j = 1; j < localSize; j <<= 1) {
            int mask = (j << 1) - 1;
            //the extra bound check guards against reading past the end when localSize is not a power of two
            if ((localId & mask) == 0 && (localId + j) < localSize) {
                localSums[localId] += localSums[localId + j];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        //6 copy dot product (WI 0 only)
        if(localId == 0){
            Output[i] = localSums[0];
        }

        //7 barrier
        //only needed if there is more code after the loop.
        //the barrier in #3 covers this in the case where the loop continues
        //barrier(CLK_LOCAL_MEM_FENCE);
    }
}
What about more work groups?
This is slightly simplified to allow a single 1x65 work group to compute the entire 1.2M Output. To allow multiple work groups, you could divide by get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:
for (i = workAmount * get_group_id(0); i < workAmount * (get_group_id(0) + 1); i++)
Step #1 must be changed as well to initialize to the correct starting state for localArray, rather than all 0s.
//1 init local values
int groupId = get_group_id(0);
if(groupId == 0){
    localArray[localId] = 0.0f;
}else{
    localArray[localSize - localId] = Array[workAmount - localId];
}
These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands, though. Play around with this; sometimes what seems optimal at a high level will be detrimental to the kernel when it's running.
Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b. Read more here about why that is a good thing.
Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way you could implement this with fewer barriers, but I still feel this is optimal for the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally, you should try to use a power of 2, or a multiple of 64. This won't be a huge issue though, because there are a lot of barriers in the kernel, which makes them all wait fairly regularly.
I have several variables inside a struct.
struct my_struct{
    float variable_2_x[2], variable_2_y[2], variable_2_z[2];
    float coef_2_xyz[3];
    float variable_3_x[3], variable_3_y[3], variable_3_z[3];
    float coef_3_xyz[3];
    float variable_4_x[4], variable_4_y[4], variable_4_z[4];
    float coef_4_xyz[3];
};
This struct is going to contain Lagrange polynomial (en.wikipedia.org/wiki/Lagrange_polynomial) coefficients for several polynomial lengths: 2, 3, 4. The values of these coefficients are easy to calculate, but the problem is that I have to repeat the same code to create every single polynomial. For example:
// T_space is a cube with {[-1:1][-1:1][-1:1]} dimensions,
// it's called the transformed space.
// distance is the distance between two points of T_space
// point_1 is the point where the function has value 1
p = 2;
step = distance / p;
polinoms.coef_2_xyz[0] = 1.0;
polinoms.coef_2_xyz[1] = 1.0;
polinoms.coef_2_xyz[2] = 1.0;
for( i = 0; i < p ; ++i)
{
    polinoms.variable_2_x[i] = (T_space.xi[point_1] + step) + (i * step);
    polinoms.variable_2_y[i] = (T_space.eta[point_1] + step) + (i * step);
    polinoms.variable_2_z[i] = (T_space.sigma[point_1] + step) + (i * step);
    polinoms.coef_2_xyz[0] *= (T_space.eta[point_1] - polinoms.variable_2_x[i]);
    polinoms.coef_2_xyz[1] *= (T_space.eta[point_1] - polinoms.variable_2_y[i]);
    polinoms.coef_2_xyz[2] *= (T_space.eta[point_1] - polinoms.variable_2_z[i]);
}
I don't want to repeat the same loop several times in the code. And, what is more important, in the next step of the code I need to integrate the product of the gradient of the polynomial over every point in the cube.
It would be very useful to be able to access every array of the struct independently.
As I know that variables can't be addressed dynamically by name at runtime, I thought of making an array which contains the memory addresses of the struct members, something like this:
// declare a variable to store the memory addresses
float *mem_array[12];
// some code
mem_array[0] = polinoms.variable_2_x;
mem_array[1] = polinoms.variable_2_y;
mem_array[2] = polinoms.variable_2_z;
mem_array[3] = polinoms.coef_2_xyz;
mem_array[4] = polinoms.variable_3_x;
...
mem_array[10] = polinoms.variable_4_z;
mem_array[11] = polinoms.coef_4_xyz;
// work by calling mem_array.
But I don't know if this is possible or if it will work. If you think this is not the proper way to approach the problem, I'm open to advice, because I'm really stuck on this.
I have edited the question to be clearer; I hope it helps.
You'd be better off allocating the memory you need dynamically. You can have a struct that represents a single Lagrange polynomial (of any order), and then have an array of these, one for each order.
You could also store the order of the polynomial as a member of the struct if you wish. You should be able to factor out code that deals with these into functions that take a LagrangePolynomial*, determine the order, and do whatever computation is required.
The key benefit of all of this is that you don't need to have special code for each order, you can use the same code (and the same struct) for any size of polynomial.
Example below:
struct LagrangePolynomial {
    float *x;
    float *y;
    float *z;
};
For p=2:
struct LagrangePolynomial p;
p.x = malloc(sizeof(float)*2);
p.y = malloc(sizeof(float)*2);
p.z = malloc(sizeof(float)*2);
for (size_t i=0; i<2; i++) {
    p.x[i] = ...;
    p.y[i] = ...;
    p.z[i] = ...;
}
When you've finished with the structure you can free all the memory you've allocated.
free(p.x);
free(p.y);
free(p.z);
As mentioned before you can have an array of these.
struct LagrangePolynomial ps[4];
for (size_t i=0; i<4; i++) {
    ps[i].x = malloc(sizeof(float)*2);
    ps[i].y = malloc(sizeof(float)*2);
    ps[i].z = malloc(sizeof(float)*2);
    for (size_t j=0; j<2; j++) {
        ps[i].x[j] = ...;
        ps[i].y[j] = ...;
        ps[i].z[j] = ...;
    }
}
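If you also store the order as a member, as suggested above, helper functions can work on any polynomial without special-casing. A minimal sketch of that idea (my own illustration of a variant of the struct above; the print_points helper is hypothetical):

#include <stdio.h>
#include <stdlib.h>

struct LagrangePolynomialN {
    size_t order;   // number of points in this polynomial (2, 3, 4, ...)
    float *x, *y, *z;
};

// The function only needs the pointer; the order travels with the data.
void print_points(const struct LagrangePolynomialN *p)
{
    for (size_t i = 0; i < p->order; i++)
        printf("(%f, %f, %f)\n", p->x[i], p->y[i], p->z[i]);
}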
I want to be able to move a particle in a straight line within a 3D environment, but I can't work out how to compute the next location based on two points within a 3D space.
I have created a struct which represents a particle, which has a location and a next location. Would this be suitable for working out the next location to move to? I know how to initially set the next location using the following method:
// Set particle's direction to a random direction
void setDirection(struct particle *p)
{
    float xnm = (p->location.x * -1) - p->velocity;
    float xnp = p->location.x + p->velocity;
    float ynm = (p->location.y * -1) - p->velocity;
    float ynp = p->location.y + p->velocity;
    float znm = (p->location.z * -1) - p->velocity;
    float znp = p->location.z + p->velocity;
    struct point3f nextLocation = { randFloat(xnm, xnp), randFloat(ynm, ynp), randFloat(znm, znp) };
    p->nextLocation = nextLocation;
}
The structs I have used are:
// Represents a 3D point
struct point3f
{
    float x;
    float y;
    float z;
};

// Represents a particle
struct particle
{
    enum TYPES type;
    float radius;
    float velocity;
    struct point3f location;
    struct point3f nextLocation;
    struct point3f colour;
};
Am I going about this completely the wrong way?
Here's all my code: http://pastebin.com/m469f73c2
The other answer is a little mathy; it's actually pretty straightforward.
You need a "velocity" at which you are moving. It also has x, y and z components.
In one time period, to move, you just add the x velocity to your x position to get your new x position; repeat for y and z.
On top of that, you can have an "acceleration" (also x, y, z). For instance, your z acceleration could be gravity, a constant.
Every time period your velocity should be recalculated in the same way. Call the velocity's x component "vx"; vx should become vx + ax, and repeat for y and z (again).
It's been a while since math class, but that's how I remember it. It's pretty straightforward unless you need to keep track of units; then it gets a little more interesting (but still not bad).
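A minimal sketch of one such time step (my own illustration; the vec3 struct and the field names here are hypothetical, not the structs from the question):

struct vec3 { float x, y, z; };

// Advance by one time period: move by the current velocity,
// then update the velocity by the acceleration (e.g. gravity on z).
void step(struct vec3 *pos, struct vec3 *vel, const struct vec3 *acc)
{
    pos->x += vel->x;  pos->y += vel->y;  pos->z += vel->z;
    vel->x += acc->x;  vel->y += acc->y;  vel->z += acc->z;
}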
I'd suggest that a particle should only have one location member -- the current location. Also, the velocity should ideally be a vector of 3 components itself. Create a function (call it move, displace, whatever) that takes a particle and a time duration t. This will compute the final position after t units of time have elapsed:
struct point3f move(struct particle *p, float t) {
    p->location.x += p->velocity.x * t;
    /* and so on for the other 2 dimensions */
    return p->location;
}
I would recommend two things:
Read an article or two on basic vector math for animation. For instance, this is a site that explains 2D vectors for Flash.
Start simple: start with a 1D point, i.e. a point only moving along x. Then try adding a second dimension (a 2D point in a 2D space) and a third dimension. This might help you get a better understanding of the underlying mechanics.
Hope this helps.
Think of physics. An object has a position (x, y, z) and a movement vector (a, b, c). Your object should exist at its position; it has a movement vector associated with it that describes its momentum. In the absence of any additional forces on the object, and assuming that your movement vector describes the movement over a time period t, the position of your object after time t will be (x + a*t, y + b*t, z + c*t).
In short; don't store the current position and the next position. Store the current position and the object's momentum. It's easy enough to "tick the clock" and update the location of the object by simply adding the momentum to the position.
Store velocity as a struct point3f, and then you have something like this:
void move(struct particle *p)
{
    p->location.x += p->velocity.x;
    p->location.y += p->velocity.y;
    p->location.z += p->velocity.z;
}
Essentially the velocity is how much you want the position to change each second/tick/whatever.
You want to implement the vector math X_{i+1} = X_i + V*t, with the X's and V being vectors representing position and velocity respectively, and t representing time. I've parameterized the distance along the track by time because I'm a physicist, but it really is the natural thing to do. Normalize the velocity vector if you want t to give track distance (i.e. scale V such that V.x*V.x + V.y*V.y + V.z*V.z = 1).
Using the struct above makes it natural to access the elements, but not so convenient to do the addition: arrays are better for that. Like this:
double X[3];
double V[3];
// initialize
for (int i=0; i<3; ++i){
    X[i] = X[i] + V[i]*t;
}
With a union, you can get the advantages of both:
struct vector_s{
    double x;
    double y;
    double z;
};

typedef
union vector_u {
    struct vector_s s; // s for struct
    double a[3];       // a for array
} vector;
If you want to associate both the position and the velocity with the particle (a very reasonable thing to do), you construct a structure that supports two vectors:
typedef
struct particle_s {
    vector position;
    vector velocity;
    //...
} particle_t;
and run an update routine that looks roughly like:
void update(particle_t *p, double dt){
    for (int i=0; i<3; ++i){
        p->position.a[i] += p->velocity.a[i]*dt;
    }
}
As far as I know, there are mainly two ways to calculate the new position. One is, as the others have explained, to use an explicit velocity. The other possibility is to store the last and the current position and use Verlet integration. Both ways have their advantages and disadvantages. You might also take a look at this interesting page.
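For completeness, a minimal sketch of a position-Verlet step for a single coordinate (my own illustration, assuming you keep the previous and current positions plus an acceleration along that axis):

// Returns the next position from the two most recent positions.
// prev and curr are the positions at the two previous time steps,
// accel is the acceleration along this axis, dt is the time step.
float verlet_step(float prev, float curr, float accel, float dt)
{
    return 2.0f * curr - prev + accel * dt * dt;
}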
If you are trying to move along a straight line between two points, you can use the interpolation formula:
P(t) = P1*(1-t) + P2*t
P(t) is the calculated position of the point, t is a scalar ranging from 0 to 1, P1 and P2 are the endpoints, and the addition in the above is vector addition (so you apply this formula separately to the x, y and z components of your points). When t=0, you get P1; when t=1, you get P2, and for intermediate values, you get a point part way along the line between P1 and P2. So t=.5 gives you the midpoint between P1 and P2, t=.333333 gives you the point 1/3 of the way from P1 to P2, etc. Values of t outside the range [0, 1] extrapolate to points along the line outside the segment from P1 to P2.
Using the interpolation formula can be better than computing a velocity and repeatedly adding it if the velocity is small compared to the distance between the points, because you limit the roundoff error.
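A minimal sketch of this formula using the question's struct point3f (the lerp name is just illustrative):

// Linear interpolation between p1 (t = 0) and p2 (t = 1).
struct point3f lerp(struct point3f p1, struct point3f p2, float t)
{
    struct point3f r;
    r.x = p1.x * (1.0f - t) + p2.x * t;
    r.y = p1.y * (1.0f - t) + p2.y * t;
    r.z = p1.z * (1.0f - t) + p2.z * t;
    return r;
}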