Programming novice: How to program my own data compression algorithm?

It is summer, and so I have decided to take it upon myself to write a data-compression program, preferably in C. I have a decent beginner's understanding of how compression works. I just have a few questions:
1) Would C be a suitable programming language to accomplish this task?
2) Should I be working in bytes with the input file? Or at a binary level somehow?
If someone could just give me a nudge in the correct direction, I'd really appreciate it. I would like to code this myself however, and not use a pre-existing compression library or anything like that.

You could start by looking at Huffman encoding. A lot of computer science classes implement that as a project, so it should be manageable. C would be appropriate for Huffman encoding, but it might be easier to do it first in a higher-level language so that you understand the concepts. There are slides, hints, and an example project available in Java for a master's-level project at the University of Pennsylvania (search for "huff" on that page).

To answer your questions:
C is suitable.
It depends on the algorithm, and on how you think about "compression".
My suggestion: first decide whether you want lossless or lossy compression, then pick an algorithm to implement. Here are a few pointers:
For lossless compression, some algorithms are very intuitive, such as run-length encoding,
e.g., if there are 11 a's followed by 5 b's, you just encode them as 11a5b.
Some algorithms use a dictionary; see LZW encoding.
Finally, I do recommend Huffman encoding, since it is straightforward, simple, and a good way to gain experience with algorithms (for your educational purposes).
For lossy compression, the Discrete Cosine Transform (DCT) is used in JPEG compression, and wavelets are used in JPEG 2000. These are useful for understanding multimedia compression.
The Wikipedia page is a good starting point.
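To make the run-length idea concrete, here is a minimal sketch in C that turns a string into count-then-character pairs like the 11a5b example. The function name rle_encode and the text output format are my own assumptions for illustration; a real compressor would work on raw bytes and need an unambiguous format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Run-length encode the NUL-terminated string `in` into `out` as
 * count-then-character pairs, e.g. "aaab" -> "3a1b".
 * Returns the encoded length, or 0 if `out` (capacity `cap`) is too small.
 * An empty input also yields 0 and an empty output string.
 * Caveat: this text format is ambiguous if the input contains digits. */
size_t rle_encode(const char *in, char *out, size_t cap)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);
    if (cap == 0)
        return 0;
    out[0] = '\0';

    while (*in != '\0') {
        char c = *in;
        size_t run = 0;

        while (*in == c) {          /* count the run of identical characters */
            run++;
            in++;
        }

        int n = snprintf(out + used, cap - used, "%zu%c", run, c);
        if (n < 0 || (size_t)n >= cap - used)
            return 0;               /* output buffer too small */
        used += (size_t)n;
    }
    return used;
}
```

Calling it on 11 a's followed by 5 b's with a large enough buffer leaves "11a5b" in the buffer. Note that input without long runs gets bigger, which is why run-length encoding is normally only one stage of a larger scheme.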

Yes, C is well suited for this kind of work.
Whether you work with bytes or bits will depend on the algorithm that you decide to implement. For example, Huffman coding is inherently bit-oriented whereas many other compression algorithms are not.
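Since files are still read and written a byte at a time, a bit-oriented coder such as Huffman usually sits on top of a small helper that packs bits into bytes. The following is only a sketch of that idea; the BitWriter type and the bw_* function names are hypothetical, and the most-significant-bit-first order is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: collects single bits and emits whole bytes. */
typedef struct {
    uint8_t *buf;    /* output buffer */
    size_t   cap;    /* capacity of buf in bytes */
    size_t   len;    /* number of complete bytes written so far */
    uint8_t  cur;    /* partially filled byte */
    int      nbits;  /* number of bits currently held in cur (0..7) */
} BitWriter;

void bw_init(BitWriter *bw, uint8_t *buf, size_t cap)
{
    assert(bw != NULL && buf != NULL);
    bw->buf = buf;
    bw->cap = cap;
    bw->len = 0;
    bw->cur = 0;
    bw->nbits = 0;
}

/* Append one bit, most significant bit first.
 * Returns 0 on success, -1 if the output buffer is full. */
int bw_put_bit(BitWriter *bw, int bit)
{
    if (bw->len >= bw->cap)
        return -1;
    bw->cur = (uint8_t)((bw->cur << 1) | (bit & 1));
    if (++bw->nbits == 8) {         /* a full byte is ready: store it */
        bw->buf[bw->len++] = bw->cur;
        bw->cur = 0;
        bw->nbits = 0;
    }
    return 0;
}

/* Pad the last partial byte with zero bits and store it. */
int bw_flush(BitWriter *bw)
{
    while (bw->nbits != 0)
        if (bw_put_bit(bw, 0) != 0)
            return -1;
    return 0;
}
```

A decoder needs the mirror-image reader, plus some way to know how many padding bits were added at the end (for example, by storing the original length in a header).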

C is a great choice for writing a compression program. You can use plenty of other languages too, though.
Your computer probably can't directly address units of memory smaller than a byte (pretty much by definition), so working with bytes is probably a good choice. Some of how you work with the data will be affected by the compression algorithm you choose.
Good luck!

1) Would c be a suitable programming language to accomplish this task?
Yes.
2) Should I be working in byte's with the input file? Or at a binary level somehow?
They're the same, so the question makes no sense.
not use a pre-existing compression library
Can you use a pre-existing compression algorithm? There are dozens and "compression algorithm" -- when used with Google -- will reveal a great deal of helpful information.

Related

Is my Sudoku algorithm considered an "expert system"?

I wrote a program which has all the rules of Sudoku written into it (one occurrence of a digit per column, row, and square). The program takes an input (an unfilled Sudoku grid) and returns a solution by translating logical clauses into DIMACS format and using a SAT solver.
Given that the algorithm respects rules, takes in data, and uses that data to form conclusions based on implications (e.g., if there is a 1 in the first cell, there cannot be a 1 in the second cell), is this program considered an "expert system"? Thank you.
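For readers unfamiliar with DIMACS, the implication in the example can be sketched as a single clause. This assumes the common encoding with one Boolean variable per (row, column, digit) triple; the numbering scheme and the function names below are illustrative assumptions, not the actual code from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Map (row, col, digit), with row and col in 0..8 and digit in 1..9,
 * to a DIMACS variable number in 1..729 (assumed numbering scheme). */
int sudoku_var(int row, int col, int digit)
{
    assert(row >= 0 && row < 9 && col >= 0 && col < 9);
    assert(digit >= 1 && digit <= 9);
    return 81 * row + 9 * col + digit;
}

/* Write the clause "cells (r1,c1) and (r2,c2) do not both hold digit d"
 * in DIMACS form: two negated literals followed by the terminating 0.
 * Returns the number of characters snprintf would write. */
int write_conflict_clause(char *out, size_t cap,
                          int r1, int c1, int r2, int c2, int d)
{
    return snprintf(out, cap, "-%d -%d 0\n",
                    sudoku_var(r1, c1, d), sudoku_var(r2, c2, d));
}
```

With this numbering, "a 1 in the first cell excludes a 1 in the second cell" becomes the line `-1 -10 0`.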

Whether a program is an expert system is subjective, but I'd say unless your program is encoding non-trivial knowledge acquired from a domain expert, it's not an expert system. If you can't teach another person to practically do what your program is doing, it's not an expert system.
By that definition, what you've done is probably not an expert system since it would be too time consuming for a person to use the same technique. I've written a sudoku solver using a production system (https://sourceforge.net/p/clipsrules/code/HEAD/tree/branches/63x/examples/sudoku/) that I would consider to be an expert system. The encoded knowledge was acquired from websites with advanced techniques for humans to use for solving sudoku puzzles. All of the encoded techniques can be practically used by humans for solving puzzles (although some of the more complex techniques push that boundary).
Although my sudoku solver can solve much more complicated puzzles than I could, calling it an expert system is not an indication of its sophistication. There are better approaches for solving extremely complex sudoku puzzles than emulating approaches humans might take.

In the '80s, I wrote a clone of the Emycin expert system engine. One important characteristic was the ability for the user to ask WHY the expert system reached some conclusion. The system could reply (in almost natural language) that it applied such and such rules to get to the conclusion.
With this kind of system, the knowledge is modeled and implemented (by a knowledge engineer) as an explicit set of rules. These rules are objects known by the engine. The engine can trigger the rules (forward or backward or maybe using metarules...) and can log the triggered rules and thus explain its conclusions.
(This is my understanding of expert systems.)

JPEG source-code and quantization mode change - C language

I've been assigned a project that consists of changing the quantization in the JPEG source code from quantization tables to Lloyd-Max quantization. The problem is not knowing what to do (I know how to change the quantization), but where to find the code I'm supposed to change.
If someone is familiar with the libjpeg-turbo, could you give me some advice on doing so?

I refrained from responding because it has been a long time since I prowled around in the LIBJPEG code, and I understand that it has been rewritten. The code functions well and is efficient, but it is quite tortuous to read and understand.
This is a C++ library that apparently was written for instructive purposes. For understandability it is about as good as you are going to get with JPEG:
http://www.colosseumbuilders.com/sourcecode/imagelib403.zip
However, if I remember correctly, this one, like LIBJPEG, combines some steps of the DCT and quantization.

which encoding to use for genetic algorithm?

I want to code a genetic algorithm in C for optimizing a function of 10 variables (x1 to x10). However I am not able to figure out which encoding I should use. I have mostly seen binary encoding being used in example but the variables in my case can take real values. Also, is value encoding a good option for these types of problems?

For real-valued problems I would suggest trying CMA-ES or another ES variant. CMA-ES certainly is the current state of the art for real-valued problems. It is designed to find good solutions in multidimensional problems quickly. There are implementations available on Hansen's page. There's also a C# implementation in the works for HeuristicLab. Evolution strategies are algorithms that were specifically designed for real-valued optimization problems. They are very similar to genetic algorithms (both were invented around the same time, but in different places). The main distinction is that for ES the main driver is mutation, and it features a clever adaptation of the mutation strength. Without this adaptation the (local) optimum cannot be located in time. CMA-ES is easy to configure: all it needs is the initial standard deviation and optionally the population size (otherwise there's a formula that estimates this given the problem size).
Genetic algorithms can of course also be applied, but you have to use specific operators that are able to mutate variables only to a very small degree. For example, there's the Breeder Genetic Algorithm from Mühlenbein. In general, however, genetic algorithms are more suited to problems that need the right combination of things, e.g. which items to include in a knapsack problem, or which functions and terminals to combine into a formula (genetic programming). They are less suited to problems where you need to find the right value for something, although of course there are variants of the genetic algorithm to solve these; look for Real-Coded Genetic Algorithm (RCGA or RGA).
Another algorithm suited to real-valued problems is Particle Swarm Optimization, but in my opinion it is harder to configure. I'd start with SPSO-2011, the 2011 standard PSO.
If your problem contains integer variables, the choice becomes more difficult. Evolution strategies do not perform so well when variables are discrete, because the adaptation schemes for integer variables are different. A genetic algorithm becomes an interesting first-choice algorithm again.
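To illustrate what a real-valued ("value") encoding looks like in C, here is a minimal sketch of the operators a real-coded GA might use: the individual is simply an array of 10 doubles. The operator choices (arithmetic crossover and a fixed-size uniform mutation step), the names, and the use of rand() are my assumptions for illustration; this is not CMA-ES and has none of its step-size adaptation.

```c
#include <assert.h>
#include <stdlib.h>

#define NVARS 10   /* x1 .. x10 */

/* Uniform random number in [0, 1], based on rand(). */
static double unit_rand(void)
{
    return rand() / (double)RAND_MAX;
}

/* Arithmetic crossover: child = a * p1 + (1 - a) * p2,
 * with one random weight a in [0, 1] shared by all variables. */
void crossover(const double p1[NVARS], const double p2[NVARS],
               double child[NVARS])
{
    double a = unit_rand();
    for (int i = 0; i < NVARS; i++)
        child[i] = a * p1[i] + (1.0 - a) * p2[i];
}

/* Small-step mutation: shift every variable by a random amount in
 * [-step, +step], then clamp it to the allowed range [lo, hi]. */
void mutate(double x[NVARS], double step, double lo, double hi)
{
    assert(step >= 0.0 && lo <= hi);
    for (int i = 0; i < NVARS; i++) {
        x[i] += step * (2.0 * unit_rand() - 1.0);
        if (x[i] < lo) x[i] = lo;
        if (x[i] > hi) x[i] = hi;
    }
}
```

The point is that no bit strings are involved: crossover and mutation operate directly on the real values, and the mutation step stays small so that good solutions are refined and not destroyed.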

A genetic algorithm is best used when two answers that are pretty close to optimal will make something else pretty close to optimal when combined. The problem with a pure binary encoding is that if you don't check your crossover you end up getting two answers which may not have all that much to do with the original answers.
That said, this is only really an issue if your number of variables is very small and the amount of data in your variables is large. As far as picking an encoding goes, it's more of an art than a science, and it depends on your problem. I would suggest going with an encoding that fits the amount of precision you want. With 10 variables you won't go that far wrong however you encode it; an 8-bit ASCII encoder would probably work fine.
Hope that helps.

how to align two meshes

I have a very nice & tricky question for you.
I need to align two meshes using a very fast algorithm. Given mesh1 and mesh2, I want to find how I need to translate and rotate mesh1 so that it is in the same position as mesh2.
At first I did this using the moments of inertia of the two meshes, but the algorithm does not work if the second mesh is similar to the first one but with some missing parts. In other words, take two identical meshes and cut some parts off one of them.
I'd like to write the code in C because I need to run it on multiple platforms (Linux/Windows) and do so very fast: it has to be put into a genetic algorithm (GA).
The two meshes are in STL (stereolithography) format (binary or ASCII), but it may be useful to use another file format.
Do you have any idea how to perform this stuff?
question update:
First of all, I want to thank you guys very much for all your suggestions. I've downloaded and installed PCL on my machine and successfully compiled the ICP (tutorial) algorithm, taken from the PCL web site.
But now I have some questions about that, maybe because this is brand new to me. What is the meaning of the 4x4 matrix output for the fitness? I would expect a rotation matrix and a translation vector.
I hope some of you can help me.
If you need any other info please ask.

Point Cloud Library (PCL) has several resources that you may find useful. As @Throwback1986 says, ICP is one excellent algorithm for aligning geometry. PCL also features other, often faster alignment algorithms, based on identifying and matching features of interest in two pieces of geometry. The library finds a lot of use in the robotics community, whose members, like you, are very performance conscious.
PCL is written in C++. While it is not as portable as straight C, they offer installation instructions for Windows, a few *nix flavors, and Mac OS. I've seen it running on iOS and Android as well. Check out the tutorials.

Iterative Closest Point (ICP) is one way of registering (aligning) 3D point clouds with rigid transformations. (It can also apply to meshes.)
Here is a good introduction: http://www.cs.duke.edu/courses/spring07/cps296.2/scribe_notes/lecture24.pdf
Here is a reasonable summary:
students.asl.ethz.ch/upl_pdf/314-report.pdf
Here is a matlab implementation:
http://www.mathworks.com/matlabcentral/fileexchange/12627-iterative-closest-point-method
Here are some potential optimizations:
http://www.cs.princeton.edu/~smr/papers/fasticp/
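On the 4x4 matrix asked about in the question update: a rigid registration result is usually reported as a single homogeneous transformation matrix, where the upper-left 3x3 block is the rotation, the first three entries of the last column are the translation, and the last row is 0 0 0 1. The sketch below uses plain C arrays in row-major order (my assumption, not PCL's API) to show how such a matrix is applied to a point.

```c
#include <assert.h>
#include <stddef.h>

/* Apply a 4x4 homogeneous rigid transform T (row-major) to the point p:
 * q = R * p + t, where R is the upper-left 3x3 block of T and
 * t is the first three entries of its last column. */
void transform_point(double T[4][4], const double p[3], double q[3])
{
    assert(T != NULL && p != NULL && q != NULL);
    for (int i = 0; i < 3; i++)
        q[i] = T[i][0] * p[0] + T[i][1] * p[1] + T[i][2] * p[2] + T[i][3];
}

/* Split T into a separate rotation matrix R and translation vector t. */
void split_transform(double T[4][4], double R[3][3], double t[3])
{
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++)
            R[i][j] = T[i][j];
        t[i] = T[i][3];
    }
}
```

For example, a 90 degree rotation about the z axis combined with a translation of (1, 2, 3) maps the point (1, 0, 0) to (1, 3, 3). Applying the reported matrix to every vertex of mesh1 should bring it onto mesh2.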

How do algorithms differ from design patterns?

I am new to C programming; coming from an OOP PHP background.
I find C to be (no wonder) a much more difficult language. At first I had a lot of trouble figuring out a couple of things about arrays in particular, such as the lack of a native associative array.
Now, this part I guess I'm figuring out little by little, but now I have a question regarding a conversation I had just yesterday with a C developer. She was explaining the binary search algorithm to me because I asked her whether there were libraries to do array related stuff in C or not because it seemed like a smarter solution than always re-inventing the wheel.
I would really love to learn more about algorithms in C, in particular what differences are there between algorithms and the design patterns I'm used to using in PHP?

Taking things in order: the extent of C's support for anything like an associative array would be qsort to sort an array of structures based on a key, and bsearch to find one based on a key. There are, of course, quite a few alternatives -- various other libraries have hash tables, balanced trees, etc. Exactly which will suit your purposes is hard to guess though.
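As a rough sketch of the qsort/bsearch approach, the following keeps an array of key/value structures sorted by key and looks entries up by key. The Entry type and the table_* function names are made up for this example; only qsort, bsearch and strcmp come from the standard library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical key/value record standing in for an associative array entry. */
typedef struct {
    const char *key;
    int value;
} Entry;

/* Comparison callback shared by qsort and bsearch: order entries by key. */
static int cmp_entry(const void *a, const void *b)
{
    const Entry *ea = a;
    const Entry *eb = b;
    return strcmp(ea->key, eb->key);
}

/* Sort the table by key; must be done before any lookup. */
void table_sort(Entry *table, size_t n)
{
    assert(table != NULL || n == 0);
    qsort(table, n, sizeof *table, cmp_entry);
}

/* Find the entry with the given key, or return NULL if it is absent. */
const Entry *table_find(const Entry *table, size_t n, const char *key)
{
    Entry probe = { key, 0 };   /* only the key field is compared */
    return bsearch(&probe, table, n, sizeof *table, cmp_entry);
}
```

This works well for tables that are built once and then queried many times. If entries are inserted and removed frequently, a hash table or balanced tree from one of the other libraries is likely a better fit.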
Offhand, I don't know of many good books covering algorithms that use C as their primary vehicle for demonstration. A few obvious recommendations for books on algorithms in general (mostly language independent) would be:
The Art of Computer Programming by Donald Knuth. This is pretty much the classic algorithms book. It's now (finally) up to four volumes. Knuth originally started on it in 1967, planning to write 7 volumes. Only three volumes were available for a long time. A fourth was added quite recently. At the rate he's going, it's only going to make it to 7 if Knuth lives to be well past 100 years old. Nonetheless, the parts that are there are extremely good -- but (warning!) he analyzes the algorithms in considerable detail; if you don't know at least a little calculus, a fair amount will probably be hard to follow.
Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein. IIRC, there's now a newer edition than I have, which adds yet another author. This is a large book (dropping it on your toes would be quite painful). It uses a fair amount of mathematical notation and such throughout, but if you're willing to work a little at looking up the notation, it's really pretty understandable. It covers quite a bit of important ground (e.g., graph algorithms) that is scheduled for later volumes of Knuth, but not (at least yet) available there.
Data Structures and Algorithms by Aho, Hopcroft and Ullman. This is (by a pretty fair margin) the smallest, lightest, and at least for most people probably the easiest of these to follow.
Though it's only available used anymore, if you can find a copy of Algorithms + Data Structures = Programs by Niklaus Wirth, that's what I'd really suggest. It uses Pascal (no surprise -- Niklaus Wirth invented Pascal), but that's enough like C that it doesn't cause a real problem. It doesn't go into as much depth as Knuth about each algorithm, but still enough to give a good feel for when one is likely to be a good choice versus another. For somebody in your position (some background in programming, but little in this area) it's my top recommendation.
Though I've said it before, I think it bears repeating: IMO, all of Robert Sedgewick's books on algorithms should be avoided. Algorithms in C++ is probably the worst of them, but the others are only marginally better. The code they include (again, especially the C++ version) is truly execrable, and the descriptions of algorithms are often incomplete and/or misleading. The most recent editions have fixed some of the problems, but (IMO) not nearly enough to qualify as something that should ever be recommended. If there was no alternative, you could probably get by with these, but given the number of alternatives that are dramatically superior, the only reason to read these at all is if somebody gives them to you, and you absolutely can't afford anything else.
As far as algorithms versus design patterns goes, the line can get blurry in places, but generally an algorithm is much more tightly defined. An algorithm will normally have a specific, tightly defined input which it processes in a specific way to produce an equally specific result/output. A design pattern tends to be more loosely defined, more generic. An algorithm can be generic as well (e.g., a sorting algorithm might require a type that defines a strict weak ordering) but still has specific requirements on the type.
A design pattern tends to be somewhat more loosely defined. For example, the visitor pattern involves processing groups of objects -- but we don't want to modify the types of those objects when we decide we need to process them in a new and different way. We do that by defining the processes separately from the objects to be processed, along with how we'll traverse the groups of objects, and allow a process to work with each.
To look at it from a rather different direction, you can usually implement an algorithm with a function or a small group of functions. A design pattern tends to be oriented more toward the style in which you write your code, rather than just "here's a function, use it."
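A small C sketch may make the contrast more tangible. The first function is an algorithm in the sense above: the binary search mentioned in the question, with a precise input (a sorted array) and a precise output (an index or -1). The second part is only a loose C analogue of the visitor idea, using a function pointer so that the operation is defined separately from the traversal; all names here are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Algorithm: binary search over a sorted int array.
 * Returns the index of target, or -1 if it is not present. */
long binary_search(const int *a, size_t n, int target)
{
    size_t lo = 0, hi = n;              /* search the half-open range [lo, hi) */

    assert(a != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < target)
            lo = mid + 1;
        else if (a[mid] > target)
            hi = mid;
        else
            return (long)mid;
    }
    return -1;
}

/* Pattern-like structure: the traversal is fixed, the operation is pluggable. */
typedef void (*Visitor)(int value, void *ctx);

void for_each(const int *a, size_t n, Visitor visit, void *ctx)
{
    for (size_t i = 0; i < n; i++)
        visit(a[i], ctx);
}

/* One possible operation: add each visited value to the long that ctx points to. */
void sum_visitor(int value, void *ctx)
{
    *(long *)ctx += value;
}
```

binary_search does exactly one job, whereas for_each says nothing about what happens to each element; new operations can be added later without touching the traversal code.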

"Algorithms in C, Parts 1-5 (Bundle): Fundamentals, Data Structures, Sorting, Searching, and Graph Algorithms (3rd Edition)"
Cannot stress how good that series is.
