I use liblinear in my program to perform multi-class classification with the L2R_L2LOSS_SVC_DUAL solver. In my current test setup I have 1600 instances from a total of 9 classes, with 1000 features each.
I'm trying to determine the optimal C parameter for training with 5-fold cross-validation, but even with a small C of 1.0, liblinear reaches the maximum number of iterations:
................................................................................
....................
optimization finished, #iter = 1000
WARNING: reaching max number of iterations
Using -s 2 may be faster (also see FAQ)
Objective value = -637.100923
nSV = 783
The FAQ mentions the following possible reasons for this:
Data isn't scaled.
A large C parameter is used.
A lot of instances with a small number of features are used, so that the solver L2R_L2LOSS_SVC may be faster.
None of these applies to my case. Since my feature vector is a kind of histogram, there is a natural maximum that I use to scale the features to [0,1].
I set up the parameters for liblinear as follows:
struct parameter svmParams;
svmParams.solver_type = L2R_L2LOSS_SVC_DUAL;
svmParams.eps = 0.1;
svmParams.nr_weight = 0;
svmParams.weight_label = NULL;
svmParams.weight = NULL;
svmParams.p = 0.1;
svmParams.C = 1.0;
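For reference, the search over C can be sketched like this with liblinear's built-in cross-validation routine (a minimal sketch, assuming prob is the already populated struct problem and a recent liblinear where prob->y and the CV targets are double; the sweep_C helper and the 2^-5 ... 2^5 grid are only illustrative, not my exact code):
#include "linear.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
/* Sweep C over a log-spaced grid using liblinear's 5-fold cross-validation. */
void sweep_C(const struct problem *prob, struct parameter *param)
{
    double *target = malloc(prob->l * sizeof(double));
    for (int p = -5; p <= 5; ++p) {                 /* C = 2^-5 ... 2^5 */
        param->C = pow(2.0, p);
        const char *err = check_parameter(prob, param);
        if (err) { fprintf(stderr, "%s\n", err); break; }
        cross_validation(prob, param, 5, target);
        int correct = 0;
        for (int i = 0; i < prob->l; ++i)
            if (target[i] == prob->y[i]) ++correct;
        printf("C = %g: CV accuracy = %.2f%%\n",
               param->C, 100.0 * correct / prob->l);
    }
    free(target);
}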
My question is: what other reasons, not mentioned in the FAQ, may cause liblinear to run slowly in this scenario, and what can I do about it?
I was looking for a way to generate a logarithmic spaced array in IDL.
From the L3 Harris Geospatial website I came across "arrgen" and was trying to use it for this purpose. However,
arrgen(1,215,/log)
returns the error: Variable is undefined: ARRGEN.
What would be the correct way to do it?
Thanks in advance for your help
Start by defining your lower and upper bounds in whichever log base you prefer. I will use base $e$ for brevity's sake.
lowe = ALOG(low[0])
uppe = ALOG(upp[0])
where low and upp are scalar, numerical values you, the user, define (e.g., 1 and 215 in your example). Then construct an evenly spaced array of n elements, such as:
dinde = DINDGEN(n[0])*(uppe[0] - lowe[0])/(n[0] - 1L) + lowe[0]
where n is a scalar integer. Now convert back to linear space to get:
dind = EXP(dinde)
This will be a logarithmically spaced array. If you want to use base-10 logs, then use ALOG10 instead of ALOG. If you need another base, then you can use the logarithmic change-of-base rule:
$\log_b x = \log_c x / \log_c b$
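For readers outside IDL, the same construction looks roughly like this in C (a sketch; the logspace name is just for illustration):
#include <math.h>
/* Fill out[0..n-1] with n logarithmically spaced values from low to high.
 * Assumes n >= 2. */
void logspace(double low, double high, int n, double *out)
{
    double lowe = log(low);
    double uppe = log(high);
    for (int i = 0; i < n; ++i)
        out[i] = exp(lowe + i * (uppe - lowe) / (n - 1));
}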
Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating euclidean distance between large numbers of points.
I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically fewer than 1000 pairs). Within the grander scheme of the program I am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:
On any given call, the distance metric can be euclidean OR city-block.
The dimensions of the data are weighted.
As far as I can tell, no solution to this particular problem has been posted. The statistics toolbox offers pdist and pdist2, which accept many different distance functions, but not weighting. I have seen extensions of these functions that allow for weighting, but these extensions do not allow users to select different distance functions.
Ideally, I would like to avoid using functions from the statistics toolbox (I am not certain the users of the function will have access to those toolboxes).
I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.
function [D] = pairdist1(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
    r = 1;
elseif strcmp(distancemetric,'euclidean')
    r = 2;
else
    error('Function only accepts "cityblock" and "euclidean" distance')
end
% format weights for multiplication
wts = repmat(wts,[numA,1,numB]);
% get featural differences between A and B pairs
A = repmat(A,[1 1 numB]);
B = repmat(permute(B,[3,2,1]),[numA,1,1]);
differences = abs(A-B).^r;
% weigh difference values before combining them
differences = differences.*wts;
differences = differences.^(1/r);
% combine features to get distance
D = permute(sum(differences,2),[1,3,2]);
end
AND:
function [D] = pairdist2(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
    r = 1;
elseif strcmp(distancemetric,'euclidean')
    r = 2;
else
    error('Function only accepts "cityblock" and "euclidean" distance')
end
% use for-loops to generate differences
D = zeros(numA,numB);
for i = 1:numA
    for j = 1:numB
        differences = abs(A(i,:) - B(j,:)).^r;
        differences = differences.*wts;
        differences = differences.^(1/r);
        D(i,j) = sum(differences,2);
    end
end
end
Here are the performance tests:
A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';
tic
D1 = pairdist1(A,B,wts,distancemetric);
toc
tic
D2 = pairdist2(A,B,wts,distancemetric);
toc
Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.
It's clear that the repmat-and-permute version works much more quickly than the double-for-loop version, at least for smaller datasets. I also know, however, that calls to repmat are often slow. So I am wondering if anyone in the SO community has advice to offer on improving the efficiency of either function!
EDIT
@Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:
As the data become larger, the bsxfun version becomes the clear winner!
EDIT #2
I have finished writing the function and it is available on GitHub [link]. I ended up finding a pretty good vectorized method for computing Euclidean distance [link], so I use that method in the Euclidean case, and I took @Divakar's advice for city-block. It is still not as fast as pdist2, but it's much faster than either of the approaches I laid out earlier in this post, and it easily accepts weightings.
You can replace repmat with bsxfun. Doing so avoids explicit repetition, so it's more memory-efficient and probably faster:
function D = pairdist1(A, B, wts, distancemetric)
if strcmp(distancemetric,'cityblock')
    r = 1;
elseif strcmp(distancemetric,'euclidean')
    r = 2;
else
    error('Function only accepts "cityblock" and "euclidean" distance')
end
differences = abs(bsxfun(@minus, A, permute(B, [3 2 1]))).^r;
differences = bsxfun(@times, differences, wts).^(1/r);
D = permute(sum(differences,2),[1,3,2]);
end
For r = 1 (the "cityblock" case), you can use bsxfun to get elementwise subtractions and then use matrix multiplication, which should speed things up. The implementation would look something like this -
%// Calculate absolute elementwise subtractions
absm = abs(bsxfun(@minus,permute(A,[1 3 2]),permute(B,[3 1 2])));
%// Perform matrix multiplications with the given weights and reshape
D = reshape(reshape(absm,[],size(A,2))*wts(:),size(A,1),[]);
I'm facing the problem of computing values of a clothoid in C in real-time.
First I tried using the Matlab Coder to obtain auto-generated C code for the quadgk integrator for the Fresnel formulas. This essentially works great in my test scenarios. The only issue is that it runs incredibly slowly (in Matlab as well as in the auto-generated code).
Another option was interpolating a data-table of the unit clothoid connecting the sample points via straight lines (linear interpolation). I gave up after I found out that for only small changes in curvature (tiny steps along the clothoid) the results were obviously degrading to lines. What a surprise...
I know that circles may be plotted using a different formula, but small changes in curvature are often encountered in real-world scenarios, and 30k sampling points between the headings 0° and 360° didn't provide enough angular resolution for my problems.
Then I tried a Taylor approximation around the R = inf point, hoping that there would be significant curvatures everywhere I wanted them to be. I soon realized I couldn't use more than 4 terms (up to the power of 15), as the polynomial otherwise quickly becomes unstable (probably due to numerical inaccuracies in double-precision floating-point computation). Thus accuracy obviously degrades quickly for large t values. And by "large t values" I mean every point on the clothoid that represents a curve of more than 90° w.r.t. the zero-curvature point.
For instance, when evaluating a road that goes from R=150m to R=125m while making a 90° turn, I'm way outside the region of valid approximation. Instead I'm in the range of 204.5° - 294.5°, whereas my Taylor limit would be at around 90° of the unit clothoid.
I'm kinda done randomly trying out things now. I mean I could just try to spend time on the dozens of papers one finds on that topic. Or I could try to improve or combine some of the methods described above. Maybe there even exists an integrate function in Matlab that is compatible with the Coder and fast enough.
This problem is so fundamental that it feels like I shouldn't have this much trouble solving it. Any suggestions?
About the 4 terms in the Taylor series: you should be able to use many more. A total theta of 2*pi is certainly doable with doubles.
You're probably calculating each term in isolation, according to the full formula, computing full factorial and power values. That is why you lose precision extremely fast.
Instead, calculate the terms progressively, each one from the previous. Find the formula for the ratio of the next term to the previous one in the series, and use it.
For increased precision, do not calculate in theta but rather in the distance s (so as not to lose precision in the scaling).
Your example is an extremely flat clothoid. If I made no mistake, it goes from (25/22) pi =~ 204.545° to (36/22) pi =~ 294.545° (why not include these details in your question?). Nevertheless it should be OK. Even 2 pi = 360°, the full circle (and twice that), should pose no problem.
Given: r = 150 -> 125, 90-degree turn:

r s = A^2, so  150 s = 125 (s + x)
=>  1 + (x/s) = 150/125 = 1 + 25/125   =>   x/s = 1/5

theta  = s^2 / (2 A^2) = s^2 / (300 s) = s / 300     ; = (pi/2) * (25/11) = 204.545°
theta2 = (s + x)^2 / (300 s) = (6/5)^2 * s / 300     ; = (pi/2) * (36/11) = 294.545°
theta2 - theta = (36/25 - 1) * s / 300 = pi/2
=>  s = 300 * (pi/2) * (25/11) = 1070.99749554,   x = s/5 = 214.1994991

A^2 = 150 s = 150 * 300 * (pi/2) * (25/11)
a = sqrt(2 A^2) = 300 * sqrt((pi/2) * (25/11)) = 566.83264608
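A quick numerical check of these values in C, just evaluating the formulas above:
#include <math.h>
#include <stdio.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
int main(void)
{
    double s = 300.0 * (M_PI / 2.0) * (25.0 / 11.0);      /* arc length to the first point      */
    double x = s / 5.0;                                   /* extra arc length to the second one */
    double a = 300.0 * sqrt((M_PI / 2.0) * (25.0 / 11.0));
    printf("s = %.8f  x = %.7f  a = %.8f\n", s, x, a);
    /* expected: s = 1070.99749554, x = 214.1994991, a = 566.83264608 */
    return 0;
}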
The reference point is at r = Infinity, where theta = 0.
We have x = a * INT[u=0..(s/a)] cos(u^2) du, where a = sqrt(2 r s) and theta = (s/a)^2. Write out the Taylor series for cos, and integrate it term by term, to get your Taylor approximation for x as a function of the distance s along the curve from the zero point. That's all.
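A minimal C sketch of that term-by-term integration, with each term computed from the previous one via its ratio so that no explicit factorials or large powers are formed (the function name and the 60-term cap are just illustrative):
#include <math.h>
/* x(s) along a clothoid via the integrated Taylor series of cos(u^2):
 *   x = a * INT[0..t] cos(u^2) du,  t = s/a,  a = sqrt(2 r s)
 * Integrated series: sum_k (-1)^k t^(4k+1) / ((2k)! (4k+1)).
 * y(s) is analogous with sin(u^2): sum_k (-1)^k t^(4k+3) / ((2k+1)! (4k+3)). */
double clothoid_x(double s, double a)
{
    double t = s / a;
    double term = t;                 /* k = 0 term:  t^1 / (0! * 1) */
    double sum  = term;
    for (int k = 0; k < 60; ++k) {
        /* ratio of term k+1 to term k:  -t^4 (4k+1) / ((2k+1)(2k+2)(4k+5)) */
        term *= -(t * t * t * t) * (4.0 * k + 1.0)
              / ((2.0 * k + 1.0) * (2.0 * k + 2.0) * (4.0 * k + 5.0));
        sum += term;
        if (fabs(term) < 1e-16 * fabs(sum))
            break;                   /* series has converged */
    }
    return a * sum;
}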
Next you have to decide at what density to calculate your points along the clothoid. You can find it from a desired tolerance value above the chord, for your minimal radius of 125. These points will then define the approximation of the curve by line segments drawn between consecutive points.
I am doing my thesis in the same area right now.
My approach is the following.
At each point on your clothoid, calculate (change in heading) / (distance traveled along the clothoid); with this simple formula you get the curvature at each point.
Then plot each curvature value: the x-axis is the distance along the clothoid, the y-axis is the curvature. By plotting this and applying a very easy linear-regression algorithm (search for a Peuker algorithm implementation in your language of choice),
you can easily identify which curve sections have a value of zero (a line has no curvature), a linearly increasing or decreasing value (an Euler spiral, CCW/CW), or a constant value != 0 (an arc has constant curvature at all of its points).
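A minimal sketch of that per-point curvature estimate (assuming you have arrays of heading and cumulative arc length sampled along the path; the names are illustrative):
#include <stddef.h>
/* Discrete curvature = change in heading / distance traveled, given
 * heading[] (radians) and s[] (cumulative arc length) sampled along the
 * path. Writes n-1 curvature values into kappa[].                      */
void estimate_curvature(const double *heading, const double *s,
                        size_t n, double *kappa)
{
    for (size_t i = 0; i + 1 < n; ++i)
        kappa[i] = (heading[i + 1] - heading[i]) / (s[i + 1] - s[i]);
}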
I hope this will help you a little bit.
You can find my code on GitHub; I implemented some algorithms for such problems, like the Peuker algorithm.
I want to create a vector containing dates in matlab. For that I specified the start time and the stop time:
WHM01_start = datenum('01-JAN-2005 00:00')
WHM01_stop = datenum('01-SEP-2014 00:00')
and then I created the vector with
WHM01_timevec = WHM01_start:datenum('01-JAN-2014 00:20') - datenum('01-JAN-2014 00:00'):WHM01_stop;
since I want time steps of 20 minutes each. Unfortunately I get a rounding error after some thousands of values, leading me to
>> datestr(WHM01_timevec(254160))
ans =
31-Aug-2014 23:39:59
and not as expected, 31-Aug-2014 23:40:00
How can I correct these incorrect values?
Edit: I also saw this thread, but unfortunately that approach gives me a vector per date, not a single number as desired.
You can give year, month, day, ... in numeric format to the function datenum. datenum accepts vectors for one or several of its arguments, and if the numbers are too big (for example, 120 minutes), datenum knows what to do with them.
So by supplying the minutes vector in 20-minute increments, you can avoid rounding errors (at least on a 1-second level):
WHM01_start = datenum('01-JAN-2005 00:00');
WHM01_stop = datenum('01-SEP-2014 00:00');
time_diff = WHM01_stop - WHM01_start;
WHM01_timevec = datenum(2005,01,01,00,[00:20:time_diff*24*60],00);
datestr(WHM01_timevec(254160))
To answer your comment:
The reason you saw rounding errors was that you used the difference of two big numbers for your time-increments. The difference of large numbers has a (relatively) large rounding error.
Matlab time is counted in days since the (fictional) date 0.0.0000. Your time-increment is 1/3 hour, or 1/(24*3) days. Modifying your original code so that it reads
WHM01_timevec = WHM01_start:1/(24*3):WHM01_stop;
is an alternative way to reduce the rounding error, but for absurdly large time spans the first solution is a more robust approach.
Related answer: use linspace instead of the colon operator :.
%// given
WHM01_start = datenum('01-JAN-2005 00:00')
WHM01_stop = datenum('01-SEP-2014 00:00')
%// number of elements
n = numel(WHM01_start: datenum('01-JAN-2014 00:20') - ...
datenum('01-JAN-2014 00:00') : WHM01_stop);
%// creating vector using linspace
WHM01_timevec = linspace(WHM01_start, WHM01_stop, n);
%// proof
datestr(WHM01_timevec(254160))
ans =
31-Aug-2014 23:40:00
Drawback of this solution: to determine the number of elements of the output vector I use the original vector created with :, which is probably not the best option.
Important quote from the linked answer:
Using linspace can reduce the probability of occurrence of these issues, but it's not a guarantee.
I am trying to resample a signal (sound sample) from one sampling rate, to a higher sampling rate.
Unfortunately it needs some kind of filter, as some 'aliasing' appears to occur, and I'm not familiar with filters. Here is what I came up with:
int i, j, a, b, z;
a = 44100;
b = 8363;

// upsample by a
for (i = z = 0; i < samplen; i++)
    for (j = 0; j < a; j++)
        cbuf[z++] = sampdata[i];

// some filter goes here???

// downsample by b
for (j = i = 0; i < z; i += b)
    buf[j++] = cbuf[i];
The new sample is very similar to the original, but it has some kind of noise.
Can you please tell me what filter I need to add, and preferably some code related to that filter?
Original sound: http://www.mediafire.com/?9gnga1in52d6t4x
Resampled sound: http://www.mediafire.com/?x34h7ggk8n9k8z1
Don't use linear interpolation unless both sample rates (source and destination) are well above the highest frequency in your data. It's a very poor low-pass filter.
What you want is an interpolating low pass filter with a stop-band starting below half the lower of the two sample rates you are dealing with. Common methods of implementing this are upsampling/downsampling using IIR filters, and using poly-phase FIR filters. A windowed Sinc interpolator also works well for this if you don't need real-time performance, and don't want to upsample/downsample. Here's a Windowed Sinc interpolating low-pass filter in Basic, that should be trivial to convert into C.
If you want to use IIR filtering, here's the canonical Cookbook for biquad IIR filters.
If you want the best explanation of audio resampling theory, here's Stanford CCRMA's Resampling page.
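If it helps, here is a rough C sketch of building such a windowed-sinc low-pass kernel (using a Hamming window; the function name and parameters are illustrative and not taken from the linked code). You would convolve the upsampled signal with this kernel before decimating:
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
/* Build an N-tap windowed-sinc low-pass FIR kernel in h[0..N-1].
 * fc is the cutoff as a fraction of the sample rate (0 < fc < 0.5);
 * N should be odd so the kernel is symmetric about its center tap. */
void make_lowpass(double *h, int N, double fc)
{
    int mid = N / 2;
    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        int k = i - mid;
        double sinc = (k == 0) ? 2.0 * fc
                               : sin(2.0 * M_PI * fc * k) / (M_PI * k);
        double win  = 0.54 - 0.46 * cos(2.0 * M_PI * i / (N - 1));  /* Hamming */
        h[i] = sinc * win;
        sum += h[i];
    }
    for (int i = 0; i < N; ++i)   /* normalize for unity gain at DC */
        h[i] /= sum;
}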
Have you considered using a specialised library for this, such as libsamplerate?
It is quite portable and it is developed by people who know how to do things like this correctly. Even if you do not use it directly, you might find the algorithms it implements quite interesting.
A few comments, although I'm only guessing at your actual intent:
You are up-sampling at a rate 44100 times the original sample rate. For example, if your input was at 10kHz your intermediate cbuf[] would be at 441MHz which is a tad high for most audio analysis. Assuming you want cbuf[] to be at 44100Hz then you only need to create 44100/OrigSampleRate of samples in cbuf[] per sample in sampdata[].
You are incrementing z twice in the up-sampling loop. This results in all odd elements of cbuf[] keeping their original values. I believe this ultimately results in the final buf[] having invalid odd elements, which may be the source of your noise. There is also a potential buffer overflow in cbuf if you didn't create it with at least twice the required number of elements.
As mentioned by Steve, linear interpolation is generally the simplest approach that creates a good result when up-sampling. More complicated up-sampling can be done if desired (polynomials, splines, etc.). Similarly, when down-sampling you may wish to average samples instead of just truncating.
Best resampling code I ever came across: http://shibatch.sourceforge.net/
Take the source and try to learn something from it. It is in a nasty condition, but the results of that resampler are far above everything else.
Use FFMpeg and avcodec directly. Here's a good example showing how to do this:
http://tdistler.com/projects/audio-resampling-with-ffmpeg
Before you resample to a lower sample rate, you MUST low-pass filter the original at less than 1/2 the new sample rate, or you will introduce aliasing artifacts: the spectrum folds back on itself for frequencies above 1/2 the sample rate. So if you want to resample from 44100 to 11025, you must low-pass filter the 44100 Hz signal at 1/2 of 11025, i.e. about 5500 Hz. Since faithfulness of reproduction decreases with lower bandwidths, it is best to do this with the signal near maximum amplitude, say -10 dB; for 16-bit signed samples that is about 10^(-10/20) * 2^(16-1), or roughly +/-10362, as the maximum amplitude. The exact algorithms can be found online, since there should be no intellectual-property rights on these old and basic ideas. Do all calculations in double-precision floating point with no rounding, then round the results to their proper integer values and interpolate on the time scale exactly where one sample grid intersects the other. It requires quite an imagination, memory, and previous experience, which puts you in the realm of the mathematician/physics programmer. :-O :-)
Linear interpolation works quite well here. The issue is with the author's code: it's not linear interpolation, it's just taking the nearest value without any interpolation at all.
Here is an example of linear interpolation with source sample rate = 5 and destination sample rate = 6
src val: 5 10 5 0 5 (this is our audio data, 5 samples)
src idx: 0 1 2 3 4 (source positions for 5 samples)
dst idx: 0 1 2 3 4 5 (destination positions for 6 samples)
dst val: ? ? ? ? ? ?
At first let's calculate scale factor:
scaleCF = srcSampleRate / dstSampleRate = 0.83333334
Let's look at dst[2]
For dst index 2 we need to take part from src[1] and part from src[2]
Let's find the nearest source indices and their contribution coefficients:
idxD = (double)idx * cf; = 0.833333334 * 2 = 1.6666668
a = (int)idxD = (int)(1.6666668) = 1
b = a + 1 = 2
bCF = idxD - a = 1.6666668 - 1 = 0.6666668
aCF = 1.0 - bCF = 1.0 - 0.6666668 = 0.3333332
res = (float)(aCF * Data[a] + bCF * Data[b])
= 0.3333332 * 10 + 0.6666668 * 5 = 6.6666666
So our destination value at position 2 will be 6.6666666
The algorithm can be used for downsampling or upsampling.
It is probably not the most efficient solution and not the most accurate, but it is easy to implement and works pretty well.
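For reference, a C sketch of exactly this calculation (variable names follow the walkthrough above; the clamping of b at the last source sample is an extra guard I added):
#include <stddef.h>
/* Linear-interpolation resampler as described above.
 * src has srcLen samples, dst receives dstLen samples,
 * cf = srcSampleRate / dstSampleRate (the scale factor).   */
void resample_linear(const float *src, size_t srcLen,
                     float *dst, size_t dstLen, double cf)
{
    for (size_t idx = 0; idx < dstLen; ++idx) {
        double idxD = idx * cf;
        size_t a    = (size_t)idxD;
        size_t b    = (a + 1 < srcLen) ? a + 1 : srcLen - 1;  /* clamp */
        double bCF  = idxD - a;
        double aCF  = 1.0 - bCF;
        dst[idx] = (float)(aCF * src[a] + bCF * src[b]);
    }
}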