SAT solving with more than 2^32 clauses

I'm trying to solve a big CNF formula using a SAT solver. The formula (in DIMACS format) has 4,697,898,048 = 2^32 + 402,930,752 clauses, and all SAT solvers I could find are having trouble with it:
- (P)lingeling reports "too many clauses" (i.e., more clauses than the header line specifies, which is not the case here).
- CryptoMiniSat4 and picosat read the header as saying 402,930,752 clauses, which is exactly 2^32 too few.
- Glucose seems to parse 98,916,961 clauses and then reports the formula as UNSAT via simplification, which is unlikely to be correct (an initial segment of the formula this short is very likely to be SAT).
Is anyone aware of a SAT solver that can handle files this large? Or is there something like a compiler switch that can sidestep this sort of behaviour? I believe all the solvers are compiled for 64-bit Linux. (I'm a bit of a noob when it comes to handling numbers this big, sorry.)
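For what it's worth, the counts reported by CryptoMiniSat4 and picosat are exactly what you would see if the header value were stored in a 32-bit integer. A minimal sketch of that truncation (an illustration, not the solvers' actual parsing code):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // clause count as written on the DIMACS header line
    long long claimed = 4697898048LL;
    // what survives if a parser stores it in a 32-bit unsigned integer
    uint32_t truncated = static_cast<uint32_t>(claimed);
    std::printf("%u\n", truncated);  // prints 402930752 = 4697898048 - 2^32
}
```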

I'm the developer of CryptoMiniSat. In most cases where the CNF is so huge, the issue is not the SAT solver but that the translation of the original problem into CNF wasn't done carefully enough. I assume you didn't write that CNF by hand -- you had a problem which you translated to CNF using an automated tool.
The act of translating a problem into CNF is called encoding, and it has a huge literature in academia. It's a whole topic to itself, or more accurately, several topics. Please see the research papers on constraint programming (CP), pseudo-Boolean (PB) constraints, ANF-to-CNF translation techniques (see the crypto workshops and conferences), and electronic-circuit encoding (search for AIG and for Tseitin encoding and its variants, and look at the references). These are the big topics, but there are many others. Taking a peek at these should reduce your CNF by at least three orders of magnitude, probably more.
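To give a flavor of what encoding means: the Tseitin translation of a single AND gate, y ↔ (a ∧ b), introduces one fresh variable and just three clauses. A minimal sketch emitting the corresponding DIMACS lines (the variable numbers in main are made up for illustration):

```cpp
#include <cstdio>

// Tseitin clauses for y <-> (a AND b), DIMACS-style, each 0-terminated:
//   (!y | a), (!y | b), (!a | !b | y)
void emit_and_gate(int y, int a, int b) {
    std::printf("%d %d 0\n", -y, a);
    std::printf("%d %d 0\n", -y, b);
    std::printf("%d %d %d 0\n", -a, -b, y);
}

int main() { emit_and_gate(3, 1, 2); }  // y = var 3, a = var 1, b = var 2
```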

Related

Is my Sudoku algorithm considered an "expert system"?

I wrote a program which has all the rules of Sudoku written into it (one occurrence of a digit per column, line, and square). The code takes an input (an unfilled sudoku grid) and returns a solution by translating the logical clauses into DIMACS format and using a SAT solver.
Given that the algorithm respects rules, takes in data, and uses that data to form conclusions based on implications (e.g., if there is a 1 in the first cell, there cannot be a 1 in the second cell), is this code considered an "expert system"? Thank you.
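For readers wondering what such a translation looks like: a standard CNF encoding gives each (row, column, digit) triple its own variable and emits "at least one" plus pairwise "at most one" clauses. A minimal sketch for a single cell (my own variable numbering, not the poster's code):

```cpp
#include <cstdio>

// DIMACS variable for "cell (r,c) holds digit d", with r, c, d in 0..8;
// DIMACS variables must be positive, hence the +1
int var(int r, int c, int d) { return 81 * r + 9 * c + d + 1; }

// emit the "exactly one digit" clauses for one cell
void emit_cell(int r, int c) {
    // at least one digit: a single clause with nine literals
    for (int d = 0; d < 9; ++d) std::printf("%d ", var(r, c, d));
    std::printf("0\n");
    // at most one digit: a clause (!d1 | !d2) for every pair of digits
    for (int d1 = 0; d1 < 9; ++d1)
        for (int d2 = d1 + 1; d2 < 9; ++d2)
            std::printf("%d %d 0\n", -var(r, c, d1), -var(r, c, d2));
}

int main() { emit_cell(0, 0); }
```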
Whether a program is an expert system is subjective, but I'd say unless your program is encoding non-trivial knowledge acquired from a domain expert, it's not an expert system. If you can't teach another person to practically do what your program is doing, it's not an expert system.
By that definition, what you've done is probably not an expert system since it would be too time consuming for a person to use the same technique. I've written a sudoku solver using a production system (https://sourceforge.net/p/clipsrules/code/HEAD/tree/branches/63x/examples/sudoku/) that I would consider to be an expert system. The encoded knowledge was acquired from websites with advanced techniques for humans to use for solving sudoku puzzles. All of the encoded techniques can be practically used by humans for solving puzzles (although some of the more complex techniques push that boundary).
Although my sudoku solver can solve much more complicated puzzles than I could, calling it an expert system is not an indication of its sophistication. There are better approaches for solving extremely complex sudoku puzzles than emulating approaches humans might take.
In the '80s, I wrote a clone of the Emycin expert system engine. One important characteristic was the ability for the user to ask WHY the expert system reached some conclusion. The system could reply (in almost natural language) that it applied such and such rules to get to the conclusion.
With this kind of system, the knowledge is modeled and implemented (by a knowledge engineer) as an explicit set of rules. These rules are objects known to the engine. The engine can trigger the rules (forward or backward, or maybe using metarules...) and can log the triggered rules and thus explain its conclusions.
(This is my sense of what an expert system is.)

which encoding to use for genetic algorithm?

I want to code a genetic algorithm in C to optimize a function of 10 variables (x1 to x10). However, I am not able to figure out which encoding I should use. I have mostly seen binary encoding being used in examples, but the variables in my case can take real values. Also, is value encoding a good option for these types of problems?
For real-valued problems I would suggest trying CMA-ES or another ES variant; CMA-ES is certainly the current state of the art for real-valued problems. It is designed to find good solutions to multidimensional problems quickly. There are implementations available on Hansen's page, and there's also a C# implementation in the works for HeuristicLab. Evolution strategies are algorithms that were specifically designed for real-valued optimization problems. They are very similar to genetic algorithms (both were invented around the same time, but in different places). The main distinction is that for ES the main driver is mutation, and it features a clever adaptation of the mutation strength. Without this adaptation the (local) optimum cannot be located in time. CMA-ES is easy to configure: all it needs is the initial standard deviation and, optionally, the population size (otherwise there's a formula that estimates it given the problem size).
Genetic algorithms can of course also be applied, but you have to use specific operators which are able to mutate variables by only a very small degree; for example, there's the Breeder Genetic Algorithm from Mühlenbein. In general, however, genetic algorithms are better suited to problems that need the right combination of things, e.g., which items to include in a knapsack or which functions and terminals to combine into a formula (genetic programming), and less to problems where you need to find the right value for something. There are of course variants of the genetic algorithm for the latter as well; look for the real-coded genetic algorithm (RCGA or RGA).
Another algorithm suited to real-valued problems is Particle Swarm Optimization, but in my opinion it is harder to configure. I'd start with SPSO-2011, the 2011 standard PSO.
If your problem contains integer variables choices become more difficult. Evolution strategies do not perform so well when variables are discrete, because the adaptation schemes for integer variables are different. A genetic algorithm becomes an interesting first-choice algorithm again.
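Since the question mentions value encoding: for real-valued variables the chromosome can simply be the vector of doubles itself. A minimal sketch with Gaussian mutation and arithmetic (blend) crossover (the question asked about C; this is C++ for brevity, and these operator choices are just one common option):

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <random>

// value (real) encoding: one chromosome = 10 doubles, one per variable
using Chromosome = std::array<double, 10>;

std::mt19937 rng{42};  // fixed seed for reproducibility

// Gaussian mutation: perturb each gene with small probability
void mutate(Chromosome &c, double rate, double sigma) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::normal_distribution<double> noise(0.0, sigma);
    for (double &gene : c)
        if (coin(rng) < rate) gene += noise(rng);
}

// arithmetic crossover: child is a random convex blend of the parents
Chromosome crossover(const Chromosome &a, const Chromosome &b) {
    std::uniform_real_distribution<double> w(0.0, 1.0);
    double alpha = w(rng);
    Chromosome child{};
    for (std::size_t i = 0; i < child.size(); ++i)
        child[i] = alpha * a[i] + (1.0 - alpha) * b[i];
    return child;
}

int main() {
    Chromosome a{}, b{};
    a.fill(1.0);
    b.fill(2.0);
    Chromosome child = crossover(a, b);
    mutate(child, 0.1, 0.05);
    std::printf("first gene: %f\n", child[0]);
}
```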
A genetic algorithm is best used when two answers that are pretty close to optimal will make something else pretty close to optimal when combined. The problem with a pure binary encoding is that if you don't check your crossover you end up getting two answers which may not have all that much to do with the original answers.
That said, this is only really an issue if your number of variables is very small and the amount of data in each variable is large. As far as picking an encoding goes, it's more of an art than a science, and it depends on your problem. I would suggest going with an encoding that fits the amount of precision you want. With 10 variables you won't go far wrong however you encode it; an 8-bit encoding per variable would probably work fine.
Hope that helps.

What does number of lines of code tell you about your application?

Recently, we were asked by our managers to find the number of lines of code in our application. I have been pondering since then: what does this metric signify?
Is it meant to measure the average lines of code a developer has written over time?
If no refactoring happens, that might be a possibility.
Does it tell you how good your application is?
Does it help in marketing the product?
I don't know how it helps. Can someone please point me in the right direction or explain what this metric signifies?
Thanks.
Something I found recently: http://folklore.org/StoryView.py?project=Macintosh&story=Negative_2000_Lines_Of_Code.txt&sub=HN0
The number of lines of code is a popular but problematic metric.
Advantages
Number of lines of code shows a moderate (0.4-0.5) correlation with the number of bugs [Rosenberg 1997, Zhang 2009], i.e., larger modules usually have more bugs and, what might be more interesting, more bugs per line [Fenton and Ohlsson 2000, Zhang 2009]. I would like to stress that there are better (but more complex) ways to predict the number of bugs.
Number of lines of code can be used to predict development effort, i.e., there are effort prediction models (e.g., COCOMO) that take the number of source lines of code as one of the input parameters; see the sketch after this list.
Some of the more complex OO-metrics show strong correlation with class size [El Emam et al. 2001].
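As an illustration of the COCOMO point above: the basic "organic mode" COCOMO formula estimates effort as 2.4 · KLOC^1.05 person-months (the constants come from the basic COCOMO model; the 50 KLOC figure is just a made-up example):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // basic COCOMO, organic mode: effort = 2.4 * KLOC^1.05 person-months
    double kloc = 50.0;  // a hypothetical 50,000-line application
    double effort_pm = 2.4 * std::pow(kloc, 1.05);
    std::printf("estimated effort: %.0f person-months\n", effort_pm);  // ~146
}
```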
Disadvantages
Using lines of code as a productivity measure is extremely problematic, since it becomes difficult to compare modules in different languages or written by different developers. Indeed, some languages are more verbose due to, e.g., the presence or absence of "built-in" functionality, or structural verbosity (e.g., .h files in C). Moreover, some developers have been paid per line of code, which inevitably leads to ridiculously complicated code. Finally, code generation should be taken into account.
While "lines of code" is a common metrics, one has to be careful with distinguishing different kinds of "lines of code": with blank lines or without, with comments or without, counting logical statements of physical lines...
What does number of lines of code tell you about your application?
The number of lines of code will tell you roughly how much disk space you need to store the uncompressed source files. Even this is rough, as each line has a different number of characters and different encodings could be used (e.g., non-ASCII characters take two or more bytes in UTF-8 but only one in Latin-1).
Is it to measure the average lines of code a developer has written over time?
No.
Does it tell you how good your application is?
No.
Does it help in marketing the product?
No.
It signifies that your managers are incompetent.
If you were being measured by number of lines of code, as a developer, what would you do to achieve the target...
Google this metric and you'll find it's the dumbest strategy since Adolf decided to win the war in Europe by invading Russia.

how to incorporate C or C++ code into my R code to speed up a MCMC program, using a Metropolis-Hastings algorithm

I am seeking advice on how to incorporate C or C++ code into my R code to speed up a MCMC program, using a Metropolis-Hastings algorithm. I am using an MCMC approach to model the likelihood, given various covariates, that an individual will be assigned a particular rank in a social status hierarchy by a 3rd party (the judge): each judge (approx 80, across 4 villages) was asked to rank a group of individuals (approx 80, across 4 villages) based on their assessment of each individual's social status. Therefore, for each judge I have a vector of ranks corresponding to their judgement of each individual's position in the hierarchy.
To model this I assume that, when assigning ranks, judges are basing their decisions on the relative value of some latent measure of an individual's utility, u. Given this, it can then be assumed that a vector of ranks, r, produced by a given judge is a function of an unobserved vector, u, describing the utility of the individuals being ranked, where the individual with the kth highest value of u will be assigned the kth rank. I model u, using the covariates of interest, as a multivariate normally distributed variable and then determine the likelihood of the observed ranks, given the distribution of u generated by the model.
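In symbols, a minimal version of that model (my notation, not the poster's), where u_j is judge j's latent utility vector over the individuals and r_{jk} is the rank judge j assigns individual k:

```latex
u_j \sim \mathcal{N}(X\beta,\ \Sigma), \qquad
r_{jk} = \#\{\, i : u_{ji} \ge u_{jk} \,\}
```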
In addition to estimating the effects of, at most, 5 covariates, I also estimate hyperparameters describing the variance between judges and items. Therefore, on every iteration of the chain I evaluate a multivariate normal density approximately 8-10 times. As a result, 5,000 iterations can take up to 14 hours. Obviously, I need to run it for much more than 5,000 iterations, and so I need a means of dramatically speeding up the process. Given this, my questions are as follows:
(i) Am I right to assume that the best speed gains will be had by running some, if not all of my chain in C or C++?
(ii) assuming the answer to question (i) is yes, how do I go about this? For example, is there a way for me to retain all my R functions but simply do the looping in C or C++, i.e., can I call my R functions from C and do the looping there?
(iii) I guess what I really want to know is how best to approach the incorporation of C or C++ code into my program.
First make sure your slow R version is correct. Debugging R code might be easier than debugging C code. Done that? Great. You now have correct code you can compare against.
Next, find out what is taking the time. Use Rprof to run your code and see what is taking the time. I did this for some code I inherited once, and discovered it was spending 90% of the time in the t() function. This was because the programmer had a matrix, A, and was doing t(A) in a zillion places. I did one tA=t(A) at the start, and replaced every t(A) with tA. Massive speedup for no effort. Profile your code first.
Now you've found your bottleneck. Is it code you can speed up in R? Is it a loop you can vectorise? Do that. Check your results against your gold-standard correct code. Always. Yes, I know it's hard to compare algorithms that rely on random numbers, so set the seeds the same and try again.
Still not fast enough? Okay, now maybe you need to rewrite parts (the lowest level parts, generally, and those that were taking the most time in the profiling) in C or C++ or Fortran, or if you are really going for it, in GPU code.
Again, really check the code is giving the same answers as the correct R code. Really check it. If at this stage you find any bugs anywhere in the general method, fix them in what you thought was the correct R code and in your latest version, and rerun all your tests. Build lots of automatic tests. Run them often.
Read up about code refactoring. It's called refactoring because if you tell your boss you are rewriting your code, he or she will say 'why didn't you write it correctly first time?'. If you say you are refactoring your code, they'll say "hmmm... good". THIS ACTUALLY HAPPENS.
As others have said, Rcpp is made of win.
A complete example using R, C++ and Rcpp is provided by this blog post, which was inspired by a post on Darren Wilkinson's blog (and he has more follow-ups). The example is also included with recent releases of Rcpp in a directory RcppGibbs and should get you going.
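To show the shape of such code: a minimal random-walk Metropolis-Hastings chain written as an Rcpp-exported C++ function, here targeting a standard normal for simplicity (a toy sketch of my own, not the RcppGibbs example):

```cpp
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// random-walk Metropolis-Hastings for a N(0,1) target;
// compile from R with Rcpp::sourceCpp("mh.cpp")
// [[Rcpp::export]]
NumericVector mh_normal(int n, double sd_prop) {
    NumericVector draws(n);
    double x = 0.0;
    for (int i = 0; i < n; ++i) {
        double cand = x + R::rnorm(0.0, sd_prop);        // propose
        double log_alpha = 0.5 * (x * x - cand * cand);  // log N(0,1) density ratio
        if (std::log(R::runif(0.0, 1.0)) < log_alpha)    // accept/reject
            x = cand;
        draws[i] = x;
    }
    return draws;
}
```

From R, after Rcpp::sourceCpp("mh.cpp"), calling mh_normal(100000, 1.0) returns the whole chain as a numeric vector, with the loop running at C++ speed.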
I have a blog post which discusses exactly this topic which I suggest you take a look at:
http://darrenjw.wordpress.com/2011/07/31/faster-gibbs-sampling-mcmc-from-within-r/
(this post is more relevant than the post of mine that Dirk refers to).
I think the best way currently to integrate C or C++ is Dirk Eddelbuettel's Rcpp package. You can find a lot of information on his website. There is also a talk at Google, available through YouTube, that might be interesting.
Check out this project:
https://github.com/armstrtw/rcppbugs
Also, here is a link to the R/Fin 2012 talk:
https://github.com/downloads/armstrtw/rcppbugs/rcppbugs.pdf
I would suggest benchmarking each step of the MCMC sampler and identifying the bottleneck. If you put each full conditional or M-H step into a function, you can use the R compiler package, which might give you a 5%-10% speed gain. The next step is to use Rcpp.
I think it would be really nice to have a general-purpose Rcpp function which generates just one single draw using the M-H algorithm, given a likelihood function.
However, with Rcpp some things become difficult if you only know the R language: non-standard random distributions (especially truncated ones) and using arrays. You have to think more like a C programmer there.
The multivariate normal is actually a big issue in R. dmvnorm is very inefficient and slow; dmnorm is faster, but in some models it would give me NaNs more quickly than dmvnorm.
Neither takes an array of covariance matrices, so it is impossible to vectorize the code in many instances. As long as you have a common covariance and means, however, you can vectorize, which is the R-ish strategy for speeding things up (and the opposite of what you would do in C).

How do algorithms differ from design patterns?

I am new to C programming, coming from an OOP PHP background.
I find C to be (no wonder) a much more difficult language. In particular, I had lots of problems figuring out a couple of things about arrays at first, like the fact that there is no native associative array.
Now, this part I guess I'm figuring out little by little, but I have a question regarding a conversation I had just yesterday with a C developer. She was explaining the binary search algorithm to me because I asked her whether there were libraries to do array-related stuff in C, since that seemed like a smarter solution than always re-inventing the wheel.
I would really love to learn more about algorithms in C, and in particular: how do they differ from the design patterns I'm used to using in PHP?
Taking things in order: the extent of C's support for anything like an associative array would be qsort to sort an array of structures based on a key, and bsearch to find one based on a key. There are, of course, quite a few alternatives -- various other libraries have hash tables, balanced trees, etc. Exactly which will suit your purposes is hard to guess though.
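A minimal sketch of that qsort/bsearch pattern (the Entry struct and the data are hypothetical):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

struct Entry { const char *key; int value; };

// comparator shared by qsort and bsearch: orders entries by key
int cmp_entry(const void *a, const void *b) {
    return std::strcmp(static_cast<const Entry*>(a)->key,
                       static_cast<const Entry*>(b)->key);
}

int main() {
    Entry table[] = { {"pear", 3}, {"apple", 1}, {"mango", 2} };
    size_t n = sizeof table / sizeof table[0];

    std::qsort(table, n, sizeof(Entry), cmp_entry);  // sort by key once

    Entry probe = { "mango", 0 };                    // look up by key
    Entry *hit = static_cast<Entry*>(
        std::bsearch(&probe, table, n, sizeof(Entry), cmp_entry));
    if (hit) std::printf("%s -> %d\n", hit->key, hit->value);
}
```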
Offhand, I don't know of many good books covering algorithms that use C as their primary vehicle for demonstration. A few obvious recommendations for books on algorithms in general (mostly language independent) would be:
The Art of Computer Programming by Donald Knuth. This is pretty much the classic algorithms book. It's now (finally) up to four volumes. Knuth originally started on it in 1967, planning to write seven volumes; only three were available for a long time, and a fourth was added quite recently. At the rate he's going, it will only reach seven if Knuth lives to be well past 100 years old. Nonetheless, the parts that are there are extremely good, but (warning!) he analyzes the algorithms in considerable detail; if you don't know at least a little calculus, a fair amount will probably be hard to follow.
Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein (Stein joined as the fourth author in the newer editions). This is a large book (dropping it on your toes would be quite painful). It uses a fair amount of mathematical notation throughout, but if you're willing to work a little at looking up the notation, it's really pretty understandable. It covers quite a bit of important ground (e.g., graph algorithms) that is scheduled for later volumes of Knuth but not (at least yet) available there.
Data Structures and Algorithms by Aho, Hopcroft and Ullman. This is (by a pretty fair margin) the smallest and lightest of these, and for most people probably the easiest to follow.
Though it's only available used these days, if you can find a copy of Algorithms + Data Structures = Programs by Niklaus Wirth, that's what I'd really suggest. It uses Pascal (no surprise: Wirth invented Pascal), but that's close enough to C that it doesn't cause a real problem. It doesn't go into as much depth as Knuth on each algorithm, but there's still enough to give a good feel for when one is likely to be a good choice versus another. For somebody in your position (some background in programming, but little in this area), it's my top recommendation.
Though I've said it before, I think it bears repeating: IMO, all of Robert Sedgewick's books on algorithms should be avoided. Algorithms in C++ is probably the worst of them, but the others are only marginally better. The code they include (again, especially the C++ version) is truly execrable, and the descriptions of algorithms are often incomplete and/or misleading. The most recent editions have fixed some of the problems, but (IMO) not nearly enough to qualify as something that should ever be recommended. If there was no alternative, you could probably get by with these, but given the number of alternatives that are dramatically superior, the only reason to read these at all is if somebody gives them to you, and you absolutely can't afford anything else.
As far as algorithms versus design patterns go, the line can get blurry in places, but generally an algorithm is much more tightly defined. An algorithm will normally have a specific, tightly defined input which it processes in a specific way to produce an equally specific result/output. A design pattern tends to be more loosely defined, more generic. An algorithm can be generic as well (e.g., a sorting algorithm might require a type that defines a strict weak ordering), but it still has specific requirements on the type.
A design pattern tends to be somewhat more loosely defined. For example, the visitor pattern involves processing groups of objects -- but we don't want to modify the types of those objects when we decide we need to process them in a new and different way. We do that by defining the processes separately from the objects to be processed, along with how we'll traverse the groups of objects, and allow a process to work with each.
To look at it from a rather different direction, you can usually implement an algorithm with a function or a small group of functions. A design pattern tends to be oriented more toward the style in which you write your code, rather than just "here's a function, use it."
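As a concrete contrast to the "here's a function, use it" style of qsort above, a minimal visitor sketch (the Shape/Circle/Square types are hypothetical): the processing code lives in its own class, so a new process can be added without touching the object types.

```cpp
#include <iostream>
#include <memory>
#include <vector>

struct Circle;
struct Square;

// the "process", defined separately from the objects it processes
struct Visitor {
    virtual void visit(const Circle&) = 0;
    virtual void visit(const Square&) = 0;
    virtual ~Visitor() = default;
};

struct Shape {
    virtual void accept(Visitor &v) const = 0;
    virtual ~Shape() = default;
};

struct Circle : Shape {
    void accept(Visitor &v) const override { v.visit(*this); }
};
struct Square : Shape {
    void accept(Visitor &v) const override { v.visit(*this); }
};

// a new way of processing the shapes, added without modifying their types
struct Printer : Visitor {
    void visit(const Circle&) override { std::cout << "circle\n"; }
    void visit(const Square&) override { std::cout << "square\n"; }
};

int main() {
    std::vector<std::unique_ptr<Shape>> shapes;
    shapes.push_back(std::make_unique<Circle>());
    shapes.push_back(std::make_unique<Square>());
    Printer p;
    for (auto &s : shapes) s->accept(p);  // traverse the group, apply the process
}
```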
"Algorithms in C, Parts 1-5 (Bundle): Fundamentals, Data Structures, Sorting, Searching, and Graph Algorithms (3rd Edition)"
Cannot stress how good that series is.
