Not-mathematical description of NL-complexity [closed] - theory

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
NL-Complexity appears to be related to NP-complexity, and i would like a non-mathematic explanation (being that i only have a passing familiarity with the level of mathematics used in that article). Can someone provide a description of how it relates to programming and NP-complexity?

Algorithms that have NL-Complexity can run in a memory space that grows only logarithmically (very slowly) with the size of the problem. Inotherwords, these problems would scale up very well in relation to the required memory usage - double the problem size and you'd hardly need anymore memory to run the algorithm to completion. I don't know if there is a theoretical relationship proven between NL and NP complexity sets. NP complexity relates to the time that it takes to complete a program - whereas NL complexity is characterizing the memory space needed to complete a program.
I did notice in that wiki article that you referred to that it is not known if NL=P. This seems unlikely as it would mean that all algorithms that can complete in polynomial time (w.r.t their size) can also finish in a memory space that scales logrithmically w.r.t. problem size. If only that were true! For now we only know that NL is contained in P.
-Paul

There are only marginal relationships between the three concepts.
In a nutshell, NP problems are those that can be solved with a non-deterministic Turing machine, which doesn't exist, as computers are deterministic, (Quantum computers are a different class, but are not non-deterministic), and whose time to solve grows at most as a function that is polynomial in the length of the input.
Problems can be shown to be NP-complete. These are in NP and every other problem in NP can be converted to one of these in time polynomial in the input. Examples are 3-satisfiability and the Traveling Salesman Problem.
These results would remain entirely theoretical, except that many problems have been shown to be NP-complete, and for none of these has a deterministic (i.e., on a real computer) solution been found that runs in time polynomial in the length of the input. This has lead people to believe that if a problem can be shown to be NP-complete, then all solutions are probably grow exponentially. None of this has been proven either way. In terms of computation, solutions to these kinds of problems involve recursive searches rather than a fixed depth nesting of for loops, which would be polynomial.
NL-complete problems are concerned with memory usage rather than time spent solving a problem. Again, these are solutions that "run" on imaginary non-deterministic machines. They can be thought of as machines that guess the right answer, then check using an amount of memory that grows as a logarithmic function of the length of the input. We don't care how much time it takes. An equivalent deterministic solution would just iterate through all the possible guesses, so would use a bit more memory to store which guess is currently being checked.
An example of NL-complete problem is 2-Satisfiability. Given an input of clauses, make a guess of truth values for the variables and check them while going through the input. The number of variables grows as the log2 of the input, or rather the length of the string of clauses would grow as 2^number of variables.
We know that NL problems are in P, that is they can be solved with a solution that uses a fixed depth of nested for loops. But doesn't imply that these solutions keep memory usage low. The low memory solutions may require longer time to run. In terms of computing, this corresponds to trading space for time.

You can express the difference as follows: while an NP-machine has two-way access to the
guessed nondeterministic bits, an NL-machine is allowed to read them only once.

Related

C/Linux: How to find the perfect number of threads to use, to minimize the time of execution? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
Let's say I have a vector of N numbers. I have to do computations on the vector, so I will assign for each thread a portion of the vector to compute on. For example, I have 100 elements, 10 threads, so the first thread will work on the 0..9 elements, the second on, 10..19 elements, and so on. How do I find the perfect number of threads in order to minimize the time of execution. Of course, we will consider N to be a pretty big number, so we can observe differences. What relation is between the number of threads necessary for the time of execution to be minimum, and the number of cores on my machine?
There is no exact formula or relation as such, but you can judge it depending upon your use case.
With use of multithreading you can decide which Performance Metric you want to optimise:
Latency
Throughput
Your use case has a single task, which needs to be performed in minimum time as possible. Therefore we need to focus on optimising the latency.
For improving latency, each task can be divided into let's say N subtasks, as you did mention in your question.
Multithreading will not always help you minimising the runtime, for eg: if the input size is low(let's say, 100), then, your multithreaded program may end up taking more time than a single threaded program, as the cost of thread management is involved. This may not be true with a large input.
So, for any use case, the best way is to rely on some realtime metrics, discussed as follows:
For an array of size N, you can plot a Latency vs Number Of Threads plot and check what best suites your use case.
For example, have a look at the following plot:
From, the above plot, we can conclude that for a array of constant size N, and a processor with 4 cores,
The Optimal number of threads = 6.
Generally, we prefer Number of threads = Number of cores of the processor.
From the above plot, it is evident that this equation is not always true. The processor had 4 cores, but optimal latency was achieved by using 6 threads.
Now, for each N, you can test the best possible parameters that help you optimise your use case.
PS: You can study the concept of virtual cores and find out the reason why latency starts increasing after we increase the number of threads from 6 to 8.
PS: The entire credits for the image goes to Michael Pogrebinsky and his fantastic course on Udemy - Java Multithreading, Concurrency & Performance Optimization.
Unfortunately there is no easy answer on this one. You should essentially try out different configurations and then pick the best suited.
There is no easy answer to this. Intel have the MKL / IPP libraries for this reason; these libraries contain the necessary information for each CPU Intel make to judge what the best breakdown of threads for an operation is. A pretty good guess would be 1 thread per core.
On the topic of how much data a thread should be allocated to process (i.e. how much of the vector it will chew on), a good guideline is the L1 cache size (normally 32kB). If you judge it so that the amount of vector + the size of the corresponding results is less than 32kB, that gives the cache system the best possible chance of delivering data to the thread and taking the results away again.

What is good measure to compare algorithms?

Well I was reading an article about comparing two algorithms by firstly analyzing them.
My teacher taught me that you can analyze algorithm by directly using number of steps for that algorithm.
for ex:
algo printArray(arr[n]){
for(int i=0;i<n;i++){
write arr[i];
}
}
will have complexity of O(N), where N is size of array. and it repeats the for loop for N times.
while
algo printMatrix(arr,m,n){
for(i=0;i<m;i++){
for(j=0;j<n;j++){
write arr[i][j];
}
}
}
will have complexity of O(MXN) ~ O(N^2) when M=N. statements inside for are executed MXN times.
similarly O(log N). if it divides input into 2 equal parts. and So on.
But according to that article:
The Measures Execution Time, Number of statements aren't good for analyzing the algorithm.
because:
Execution Time will be system Dependent and,
Number of statements will vary with the programming language used.
and It states that
Ideal Solution will be to express running time of algorithm as a function of input size N that is f(n).
That confused me a little, How can you calculate running time if you consider execution time as not good measure?
Can experts here please elaborate this?
Thanks in advance.
When you were saying "complexity of O(N)" that is referred to as "Big-O notation" which is the same as the "Ideal Solution" that you mentioned in your post. It is a way of expressing run time as a function of input size.
I think were you got confused was when it said "express running time" - it didn't mean express it in a numerical value (which is what execution time is), it meant express it in Big-O notation. I think you just got tripped up on the terminology.
Execution time is indeed system-dependent, but it also depends on the number of instructions the algorithm executes.
Also, I do not understand how the number of steps is irrelevant, given that algorithms are analyzed as language-agnostic and without paying any attention to whatever features and syntactic-sugars various languages imply.
The one measure of algorithm analysis I have always encountered since I started analyzing algorithms is the number of executed instructions and I fail to see how this metric may be irrelevant.
At the same time, complexity classes are meant as an "order of magnitude" indication of how fast or slow an algorithm is. They are dependent of the number of executed instructions and independent of the system the algorithm runs on, because by definition an elementary operation (such as addition of two numbers) should take constant time, however large or small this "constant" means in practice, therefore complexity classes do not change. The constants inside the expression for the exact complexity function may indeed vary from system to system, but what is actually relevant for algorithm comparison is the complexity class, as only by comparing those can you find out how an algorithm behaves on increasingly large inputs (asymptotically) compared to another algorithm.
Big-O notation waves away constants (both fixed cost and constant multipliers). So any function that takes kn+c operations to complete is (by definition!) O(n), regardless of k and c. This is why it's often better to take real-world measurements (profiling) of your algorithms in action with real data, to see how fast they effectively are.
But execution time, obviously, varies depending on the data set -- if you're trying to come up with a general measure of performance that's not based on a specific usage scenario, then execution time is less valuable (unless you're comparing all algorithms under the same conditions, and even then it's not necessarily fair unless you model the majority of possible scenarios, and not just one).
Big-O notation becomes more valuable as you move to larger data sets. It gives you a rough idea of the performance of an algorithm, assuming reasonable values for k and c. If you have a million numbers you want to sort, then it's safe to say you want to stay away from any O(n^2) algorithm, and try to find a better O(n lg n) algorithm. If you're sorting three numbers, the theoretical complexity bound doesn't matter anymore, because the constants dominate the resources taken.
Note also that while the number of statements a given algorithm can be expressed in varies wildly between programming languages, the number of constant-time steps that need to be executed (at the machine level for your target architecture, which is typically one where integer arithmetic and memory accesses take a fixed amount of time, or more precisely are bounded by a fixed amount of time). It is this bound on the maximum number of fixed-cost steps required by an algorithm that big-O measures, which has no direct relation to actual running time for a given input, yet still describes roughly how much work must be done for a given data set as the size of the set grows.
In comparing algorithms, execution speed is important as well mentioned by others, but other factors like memory space are crucial too.
Memory space also uses order of complexity notation.
Code could sort an array in place using a bubble sort needing only a handful of extra memory O(1). Other methods, though faster, may need O(ln N) memory.
Other more esoteric measures include code complexity like Cyclomatic complexity and Readability
Traditionally, computer science measures algorithm effectivity (speed) by the number of comparisons or sometimes data accesses, using "Big O notation". This is so, because the number of comparisons (and/or data accesses) is a good mathematical model to describe efficiency of certain algorithms, searching and sorting ones in particular, where O(log n) is considered the fastest possible in theory.
This theoretic model has always had several flaws though. It assumes that comparisons (and/or data accessing) are what takes time, and that the time for performing things like function calls and branching/looping is neglectible. This is of course nonsense in the real world.
In the real world, a recursive binary search algorithm might for example be extremely slow compared to a quick & dirty linear search implemented with a plain for loop, because on the given system, the function call overhead is what takes the most time, not the comparisons.
There are a whole lot of things that affect performance. As CPUs evolve, more such things are invented. Nowadays, you might have to consider things like data alignment, instruction pipe-lining, branch prediction, data cache memory, multiple CPU cores and so on. All these technologies make traditional algorithm theory rather irrelevant.
To write the most effective code possible, you need to have a specific system in mind and you need in-depth knowledge about said system. Fortunately, compilers have evolved a lot too, so a lot of the in-depth system knowledge can be left to the person who implements a compiler port for the specific system.
Generally, I think many programmers today spend far too much time pondering about program speed and coming up with "clever things" to get better performance. Back in the days when CPUs were slow and compilers were terrible, such things were very important. But today, a good, modern programmer focus on making the code bug-free, readable, maintainable, re-useable, secure, portable etc. It doesn't matter how fast your program is, if it is a buggy mess of unreadable crap. So deal with performance when the need arises.

general question about solving problems with parallelisation

i have a general question about programming of parallel algorithms in C. Lets assume that our task is to implement some matrix algorithms with MPI and/or OpenMP. There are some situations, like false sharing in OpenMP or in MPI where the communications arise in dependence of the matrix dimension (columns cyclic distrubuted among processes), which cause some problems . Would it be a good and a common attempt to solve this situations by, for example, transposing the matrix, because this would reduce the necessary communications or even avoiding the false sharing problem? After that you would undo the transposition. Of course, assuming that this would lead to a much better speed up.
I dont think that this would be very cunning and more of a lazy way to do this. But im curious to read some opions about this.
Let's start with the first question first: can it make sense to transpose? The answer is, it depends, and you can estimate whether it will improve things or not.
The transposition/retransposition with impose a one-time memory bandwidth cost of 2*(going through memory the fast way) + 2*(going through memory the slow way) where those memory operations are literally memory operations in the multicore case, or network communications in the distributed memory case. You're going to be reading a matrix in the fast way and putting it into memory the slow way. (You can make this, essentially, 4*(going through memory the fast way) by reading the matrix in one cache-sized block at a time, transposing in cache, and writing out in order).
Whether or not that's a win or not depends on how many times you'll be accessing the array. If you would have been hitting the entire non-transposed array 4 times with memory access in the "wrong" direction, then you will clearly win by doing the two transposes. If you'd only be going through the non-transposed array once in the wrong direction, then you almost certainly won't win by doing the transposition.
As to the larger question, #AlexandreC is absolutely right here -- trying to implement your own linear algebra routines is madness. Take a look at, eg, How To Write Fast Numerical Code, figure 3; there can be factors of 40 in performance between naive and highly-tuned (say) GEMM operations. These things are highly memory-bandwidth limited, and in parallel that means network limited. By far, best is to use existing tools.
For multicore linear algebra, existing libraries include
Atlas
Plasma
Flame
For MPI implementations, there are
BLACS
Scalapack
or complete solver environments like
PETSc
Trilinos
I don't know that you'd throw the transpose away the second that you completed the operation, but yes this is a valid mechanism to increase parallelism.
I am not an expert; I've only read a little bit about this topic, and even that was for SIMD architectures, so please take my opinion lightly... but I think the usual mechanism is to lay your structures out in memory to match the machine (so you'd transpose a large matrix to line up better with your vectors and increase the dependency distance in your loops), and then you also build an indexing structure of pointers around that so that you can quickly access individual elements in the transpose differently. This gets more difficult to do as your input changes more dynamically.
I dont think that this would be very cunning and more of a lazy way to do this.
Lazy solutions are usually better than "cunning" ones, because they tend to be more simple and straightforward. They're therefore easier to implement, document, understand and maintain. Indeed, laziness is arguably one of the greatest virtues a programmer can have. As long as the program produces correct results at acceptable speeds, nobody should care how elegantly you solved the problem (including you).

The limits of parallelism (job-interview question)

Is it possible to solve a problem of O(n!) complexity within a reasonable time given infinite number of processing units and infinite space?
The typical example of O(n!) problem is brute-force search: trying all permutations (ordered combinations).
It sure is. Consider the Traveling Salesman Problem in it's strict NP form: given this list of costs for traveling from each point to each other point, can you put together a tour with cost less than K? With the new infinite-core CPU from Intel, you just assign one core to each possible permutation, and add up the costs (this is fast), and see if any core flags a success.
More generally, a problem in NP is a decision problem such that a potential solution can be verified in polynomial time (i.e., efficiently), and so (since the potential solutions are enumerable) any such problem can be efficiently solved with sufficiently many CPUs.
It sounds like what you're really asking is whether a problem of O(n!) complexity can be reduced to O(n^a) on a non-deterministic machine; in other words, whether Not-P = NP. The answer to that question is no, there are some Not-P problems that are not NP. For example, a limited halting problem (that asks if a program halts in at most n! steps).
The problem would be distributing the work and collecting the results.
If all the CPUs can read the same piece of memory at once, and if each one has a unique CPU-ID that is known to it, then the ID may be used to select a permutation, and the distribution problem is solveable in constant time.
Gathering the results would be tricky, though. Each CPU could compare with its (numerical) neighbor, and then that result compared to the result of the two closest neighbors, etc. This will be a O(log(n!)) process. I don't know for sure, but I suspect that O(log(n!)) is hyperpolynomial, so I don't think that's a solution.
No, N! is even higher than NP. Thinking unlimited parallelism could solve NP problem in polynomial time, which is usually considered as a "reasonable" time complexity, N! problem is still higher than polynomial on such a setup.
You mentioned search as a "typical" problem, but were you actually asked specifically about a search problem? If so, then yes, search is typically parallelizable, but as far as I can tell O(n!) in principle does not imply the degree of concurrency available, does it? You could have a completely serial O(n!) problem, which means infinite computers won't help. I once had an unusual O(n^4) problem that actually was completely serial.
So, available concurrency is the first thing, and IMHO you should get points for bringing up Amdahl's law in an interview. Next potential pitfall is inter-processor communication, and in general the nature of the algorithm. Consider, for example, this list of application classes: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine. FWIW the O(n^4) code I mentioned earlier sort of falls into the FSM category.
Another somewhat related anecdote: I've heard an engineer from a supercomputer vendor claim that if 10% of their CPU time were being spent in MPI libraries, they consider the parallelization a solid success (though that may have just been limited to codes in the computational chemistry domain).
If the problem is one of checking permutations/answers to a problem of complexity O(n!), then of course you can do it efficiently with an infinite number of processors.
The reason is that you can easily distribute atomic pieces of the problem (an atomic piece of the problem might, say, be one of the permutations to check) with logarithmic efficiency.
As a simple example, you could set up the processors as a 'binary tree', so to speak. You could be at the root, and have the processors deliver permutations of the problem (or whatever the smallest pieces of the problem might be) to the leaf processors to solve, and you'd end up solving the problem in log(n!) time.
Remember it's the delivery of the permutations to the processors that takes a long time. Each part of the problem itself will actually be solved instantly.
Edit: Fixed my post according to the comments below.
Sometimes the correct answer is, "How many times does this come up with your code base?" but in this case, there is a real answer.
The correct answer is no, because not all problems can be solved using perfect parallel processing. For example, a travelling salesman-like problem must commit to one path for the second leg of the journey to be considered.
Assuming a fully connected matrix of cities, should you want to display all possible non-cyclic routes for our weary salesman, you're stuck with a O(n!) problem, which can be decomposed to an O(n)*O((n-1)!) problem. The issue is that you need to commit to one path (on the O(n) side of the equation) before you can consider the remaining paths (on the O((n-1)!) side of the equation).
Since some of the computations must be performed prior to other computations, then there is no way to scatter the results perfectly in a single scatter / gather pass. That means the solution will be waiting on the results of calculations which must come before the "next" step can be started. This is the key, as the need for prior partial solutions provide a "bottle neck" in the ability to proceed with the computation.
Since we've proven we can make a number of these infinitely fast, infinitely numerous, CPUs wait (even if they are waiting on themselves), we know that the runtime cannot be O(1), and we only need to pick a very large N to guarantee an "unacceptable" run time.
This is like asking if an infinite number of monkeys typing on a monkey-destruction proof computer with a word-processor can come up with all the works of Shakespeare; given an infinite amount of time. The realist would say not since the conditions are no physically possible. The idealist will say yes; in theory it can happen. Since Software Engineering (Software Engineering, not Computer Science) focuses on real system we can see and touch, then the answer is no. If you doubt me, then go build it and prove me wrong! IMHO.
Disregarding the cost of setup (whatever that might be...assigning a range of values to a processing unit, for instance), then yes. In such a case, any value less than infinity could be solved in one concurrent iteration across an equal number of processing units.
Setup, however, is something significant to disregard.
Each problem could be solved by one CPU, but who would deliver these jobs to all infinite CPU's? In general, this task is centralized, so if we have infinite jobs to deliver to all infinite CPU's, we could take infinite time to do so.

Kernel methods for large scale dataset

Kernel-based classifier usually requires O(n^3) training time because of the inner-product computation between two instances. To speed up the training, inner-product values can be pre-computed and stored in a two-dimensional array. However when the no. of instances is very large, say over 100,000, there will not be sufficient memory to do so.
So any better idea for this?
For modern implementations of support vector machines, the scaling of the training algorithm is dependent on lots of factors, such as the nature of the training data and kernel that you are using. The scaling factor of O(n^3) is an analytical result and isn't particularly useful in predicting how SVM training will scale in real-world situations. For example, empirical estimates of the training algorithm used by SVMLight put the scaling against training set size to be approximately O(n^2).
I would suggest you ask this question in the kernel machines forum. I think you're more likely to get a better answer than on Stack Overflow, which is more of a general-purpose programming site.
The Relevance Vector Machine has a sequential training mode in which you do not need to keep the entire kernel matrix in memory. You can basically calculate a column at a time, determine if it appears relevant, and throw it away otherwise. I have not had much luck with it myself, though, and the RVM has some other issues. There is most likely a better solution in the realm of Gaussian Processes. I haven't really sat down much with those, but I have seen mention of an online algorithm for it.
I am not a numerical analyst, but isn't the QR decomposition which you need to do ordinary least-squares linear regression also O(n^3)?
Anyways, you'll probably want to search the literature (since this is fairly new stuff) for online learning or active learning versions of the algorithm you're using. The general idea is to either discard data far from your decision boundary or to not include them in the first place. The danger is that you might get locked into a bad local maximum and then your online/active algorithm will ignore data that would help you get out.

Resources