What is the relation between BLAS, LAPACK and ATLAS?

I don't understand how BLAS, LAPACK and ATLAS are related and how I should use them together! I have been looking through all of their manuals, and I have a general idea of BLAS and LAPACK and how to use them from the very few examples I have found, but I can't find any actual examples using ATLAS to see how it relates to these two.
I am trying to do some low-level work on matrices, and my primary language is C. First I wanted to use GSL, but it says that if you want the best performance you should use BLAS and ATLAS. Is there any good webpage giving some nice examples of how to use all of these together (in C)? In other words, I am looking for a tutorial on using these three (or any subset of them!). In short, I am confused!

BLAS is a collection of low-level matrix and vector arithmetic operations (“multiply a vector by a scalar”, “multiply two matrices and add to a third matrix”, etc ...).
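For a concrete feel of the BLAS level, here is a minimal sketch of a matrix-matrix multiply through the CBLAS C interface (the exact header name and link flags depend on which implementation you install; this assumes a standard cblas.h):

    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        /* row-major 2x2 matrices */
        double A[4] = {1, 2, 3, 4};
        double B[4] = {5, 6, 7, 8};
        double C[4] = {0, 0, 0, 0};

        /* C := 1.0*A*B + 0.0*C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,        /* M, N, K */
                    1.0, A, 2,      /* alpha, A, lda */
                    B, 2,           /* B, ldb */
                    0.0, C, 2);     /* beta, C, ldc */

        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  /* 19 22 / 43 50 */
        return 0;  /* link with e.g. -lcblas -latlas, or -lopenblas */
    }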
LAPACK is a collection of higher-level linear algebra operations. Things like matrix factorizations (LU, LLt, QR, SVD, Schur, etc) that are used to do things like “find the eigenvalues of a matrix”, or “find the singular values of a matrix”, or “solve a linear system”. LAPACK is built on top of the BLAS; many users of LAPACK only use the LAPACK interfaces and never need to be aware of the BLAS at all. LAPACK is generally compiled separately from the BLAS, and can use whatever highly-optimized BLAS implementation you have available.
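As a sketch of what the LAPACK level looks like from C, here is a small linear solve through the LAPACKE C interface (assuming a lapacke.h is available; with the plain Fortran LAPACK you would instead declare and call dgesv_ directly):

    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        double A[4] = {2, 1, 1, 3};   /* row-major 2x2 */
        double b[2] = {3, 5};         /* right-hand side, overwritten with x */
        lapack_int ipiv[2];

        /* Solve A*x = b by LU factorization with partial pivoting. */
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                        A, 2, ipiv, b, 1);
        if (info != 0) {
            fprintf(stderr, "dgesv failed: info = %d\n", (int)info);
            return 1;
        }
        printf("x = (%g, %g)\n", b[0], b[1]);  /* x = (0.8, 1.4) */
        return 0;  /* link with e.g. -llapacke -llapack -lblas */
    }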
ATLAS is a portable, reasonably good implementation of the BLAS interfaces that also implements a few of the most commonly used LAPACK operations.
What “you should use” depends somewhat on details of what you’re trying to do and what platform you’re using. You won't go too far wrong with “use ATLAS + LAPACK”, however.

A while ago, when I started doing some linear algebra in C, I was surprised to see how few tutorials there are for BLAS, LAPACK and other fundamental APIs, despite the fact that they are the cornerstones of many other libraries. For that reason I started collecting all the examples/tutorials I could find all over the internet for BLAS, CBLAS, LAPACK, CLAPACK, LAPACKE, ATLAS, OpenBLAS ... in this GitHub repo.
Well, I should warn you that as a mechanical engineer I have little experience managing such a Git repository, or GitHub. It will at first seem like a complete mess. However, if you manage to get over the messy structure, you will find all kinds of examples and instructions which might be of help. I have tried most of them to be sure they compile, and I have noted the ones which do not. I have modified many of them to be compilable with the GNU compilers (gcc, g++ and gfortran). I have made Makefiles which you can read to learn how to call individual Fortran/FORTRAN routines from a C or C++ program. I have also put up installation instructions for macOS and Linux (sorry, Windows guys!), along with some bash .sh files for automatic compilation of some of these libraries.
But on to your other question: BLAS and LAPACK are APIs rather than specific SDKs; they are specifications rather than implementations or libraries. That said, there are the original reference implementations by Netlib in FORTRAN 77, which most people (confusingly!) mean when talking about BLAS and LAPACK. So if you see a lot of strange things when using these APIs, it is because you are actually calling FORTRAN routines from C rather than C libraries and functions. ATLAS and OpenBLAS are some of the best implementations of BLAS and LAPACK as far as I know. They conform to the original API even though, to my knowledge, they are implemented in C/C++ from scratch (not sure!). There are GPGPU implementations of the APIs using OpenCL: CLBlast, clBLAS, clMAGMA, ArrayFire and ViennaCL, to mention some. There are also vendor-specific implementations optimized for particular hardware or platforms, which I strongly discourage anybody from using.
My recommendation to anyone who wants to use BLAS and LAPACK from C is to learn FORTRAN-C mixed programming first. The first chapter of the mentioned repo is dedicated to this matter, and there I have collected many different examples.
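For instance, a minimal sketch of calling the reference Fortran BLAS directly from C, showing the usual mixed-language conventions (trailing underscore on the routine name, everything passed by pointer, column-major storage). Note that some compilers additionally expect hidden string-length arguments for character parameters, so treat this as illustrative:

    #include <stdio.h>

    /* Fortran BLAS routine, as seen from C with gfortran-style name mangling. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    int main(void) {
        /* column-major 2x2 matrices */
        double A[4] = {1, 3, 2, 4};   /* [[1,2],[3,4]] */
        double B[4] = {5, 7, 6, 8};   /* [[5,6],[7,8]] */
        double C[4] = {0, 0, 0, 0};
        int n = 2;
        double one = 1.0, zero = 0.0;

        /* C := 1.0*A*B + 0.0*C */
        dgemm_("N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);

        printf("%g %g\n%g %g\n", C[0], C[2], C[1], C[3]);  /* 19 22 / 43 50 */
        return 0;  /* link with e.g. -lblas -lgfortran */
    }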
P.S. I have been working on the dev branch of the repository from time to time. It seems slightly less messy!

ATLAS is by now quite outdated. It was developed at a time when it was thought that optimizing the BLAS for various platforms was beyond the ability of humans, and that autogeneration and autotuning were therefore the way to go.
In the early 2000s, along came Kazushige Goto, who showed how highly efficient implementations can be coded by hand. You may enjoy an interesting article in the New York Times: https://www.nytimes.com/2005/11/28/technology/writing-the-fastest-code-by-hand-for-fun-a-human-computer-keeps.html.
Kazushige, on the one hand, had better insight into the theory behind high-performance implementations of matrix-matrix multiplication, and on the other hand engineered these implementations better. His approach, which on current CPUs is usually the highest performing, is not in the search space that ATLAS autotunes over. Hence, ATLAS is inherently inferior. Kazushige's implementation of the BLAS became known as the GotoBLAS. It was forked as OpenBLAS when he joined industry.
The ideas behind the GotoBLAS were refactored into a new implementation, the BLAS-like Library Instantiation Software (BLIS) framework (https://github.com/flame/blis), which implements the same algorithms, but structures the code so that less needs to be custom implemented for a new architecture. BLIS is coded in C.
What this discussion shows is that there are many implementations of the BLAS. The BLAS themselves are a de facto standard for the interface. ATLAS was once the state of the art; it no longer is.

As far as I'm aware, and after working through the ATLAS repository, it seems that it includes a re-implementation of BLAS in C. There's a bit more to it than that, but I hope this answers the question.

Related

Arbitrary precision integer on GPU

I'm currently doing a primality test on huge numbers (up to 10M digits).
Right now, I'm using a C program based on the GMP library. I did some parallelization using OpenMP and got a nice speedup (~3.5x with 4 cores). The problem is that I don't have enough CPU cores to make it feasible to run on my whole dataset.
I have an NVIDIA GPU, and I tried to find an alternative to GMP, but for GPUs. It can be either CUDA or OpenCL.
Is there an arbitrary precision library that I can run on my GPU? I'm also open to using another programming language if there is a simple or more elegant way of doing it.
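For context, the CPU-side setup described above might look roughly like the following sketch (assuming the primality test is GMP's Miller-Rabin routine mpz_probab_prime_p; the question does not show the actual code):

    #include <stdio.h>
    #include <gmp.h>

    int main(void) {
        /* illustrative candidates; the real dataset has numbers up to 10M digits */
        const char *candidates[] = { "2305843009213693951", "2305843009213693953" };
        int n = 2;

        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            mpz_t x;
            mpz_init_set_str(x, candidates[i], 10);
            int r = mpz_probab_prime_p(x, 25);  /* 25 Miller-Rabin rounds */
            #pragma omp critical
            printf("%s: %s\n", candidates[i], r ? "probably prime" : "composite");
            mpz_clear(x);
        }
        return 0;  /* compile with e.g. gcc -fopenmp file.c -lgmp */
    }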
It seems the Julia Language is already able to do multiprecision arithmetic and use the GPU (see here for a simple example combining these two), but you might have to learn Julia and rewrite your code.
The CUMP library is meant as a GMP substitute for CUDA; it attempts to make porting GMP code to CUDA easier by offering an interface similar to GMP's, where, for example, you replace mpf_... variables and functions with cumpf_... ones. There is a demo you can build if it suits your CUDA setup. No documentation or support, though; you'll have to go through the code to see if it works.
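Purely to illustrate that naming convention (the cumpf_ names below follow the pattern just described; I have not checked them against the actual CUMP headers, so verify against the demo code):

    #include <gmp.h>

    /* GMP version (host CPU): square an arbitrary-precision float. */
    void host_square(mpf_t r, const mpf_t x) {
        mpf_mul(r, x, x);
    }

    /* Hypothetical CUMP counterpart, per the mpf_ -> cumpf_ convention:
     *
     *   #include <cump.h>
     *   cumpf_t r, x;         // instead of mpf_t
     *   cumpf_mul(r, x, x);   // instead of mpf_mul
     *
     * Consult the library's demo to see the real declarations and how
     * data moves between host and device. */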
The CAMPARY library from people at LAAS-CNRS might be worth a shot as well, but there is no documentation either. It has been applied in more research than CUMP, so there is some chance there. An answer here gives some clarification on how to use it.
GRNS uses the residue number system on CUDA-compatible processors; it has no documentation, but an interesting paper is available. Also, see this one.
XMP comes directly from NVIDIA labs, but seems incomplete and also has no docs. Some info here and a paper here.
XMP 2.0 seems newer, but only supports sizes up to 32k bits for now.
GPUMP looks promising, but seems not available for download, perhaps by contacting the authors.
MPRES-BLAS is a multiple-precision library for linear algebra on CUDA, and does, of course, have code for basic arithmetic. Also, papers here and here.
I haven't tested any of those, so please let us know if any of those work well.

In what languages besides C can I write a C library?

I want to write a library that is dynamically loadable and callable from C code, but I really don't want to write it in C - the code is security critical, so I want a language that makes it easier to have confidence that my code is correct. What are my options?
To be more specific, I want C programmers to be able to #include this, and -l that, and start using my library just as if I had written it in C. I'd like programmers in other languages to be able to use their favourite tools for linking to C libraries to link to it. Ideally I'd like that to be possible on every platform that supports C, but I'll settle for Linux, Windows and MacOS.
Anything that compiles to native code. So you might Google for that - "languages that compile to native code." See, e.g., Programming languages that compile to native code and have the batteries included
C++ is often the choice for this. It compiles to native code and, provided you keep your interfaces simple, it is easy to write an adapter layer.
Objective-C and Fortran are also possible.
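Whatever implementation language you pick, the adapter layer usually boils down to a C header that client code can #include regardless of what the library is written in. A minimal sketch (all names here are made up for illustration):

    #ifndef MYLIB_H
    #define MYLIB_H

    #ifdef __cplusplus
    extern "C" {            /* suppress C++ name mangling for these symbols */
    #endif

    /* Opaque handle: clients never see the implementation language's types. */
    typedef struct mylib_ctx mylib_ctx;

    mylib_ctx *mylib_create(void);
    int        mylib_process(mylib_ctx *ctx, const char *input);
    void       mylib_destroy(mylib_ctx *ctx);

    #ifdef __cplusplus
    }
    #endif

    #endif /* MYLIB_H */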
It sounds like you are looking for a language with ABI compatibility with C, or one whose output can be described as native code. As long as it can be compiled to a valid object file (typically a .obj or .o file) that is accepted by the linker, that is the main criterion. You will also then want to write a header file as a convenience for any client code written in C (or a closely related language/variant thereof).
As mentioned by others, you need a pretty good reason to choose a language other than C, as C is the lingua franca of low-level/systems software. Assembler is an option, although it is harder to port between platforms. D is a more portable, but less widespread, alternative designed to produce secure, efficient native code with a minimum of fuss. There are many others.
Almost every security-critical application I know of is written in C. I don't believe there is any other language with a higher real status for producing secure applications.
C is said to be a poor language for security by people who don't understand it.
If you want C programmers to use your library, use C. Doing anything else is tying one hand behind your back whilst trying to walk on a balance beam (the gymnastics equipment). Sure, there are dozens of other languages that are capable of interfacing with C, but it typically involves a C layer in which you stuff the C data types into language-specific data types (Java objects, Python objects, etc.), and when the call is finished you convert back to a C data type. That makes it harder to work with, and potentially slower if you don't get all the design decisions right. And people won't understand the source code, so they won't like to use it (see more about this below).
If you want security, then write very good code, with your "security aspects" hat firmly on at all times; find a security mailing list or website and post the code there for review; take the review comments on board, understand them, and fix any that are meaningful to fix. Distribute the source code to your users, so people can see what your code does. Those who understand security will know what to look for, and will see whether you have done a good job (or a bad one, whichever is applicable); those who don't will hopefully trust the right people. If it's good, people will use it. If it's "hidden" and not easy to access, you won't get many customers, no matter what language you use.
Don't worry, you won't reveal anything more from releasing source. If there is a flaw in the code, and it is popular (or important) enough, someone will find the flaw, even if you publish only binaries. For those skilled in reverse engineering, not having source code is only a small obstacle.
Security doesn't stem from using a specific language or a specific tool, it stems from good design and good basic understanding of the problems with security.
And remember: security by obscurity (whether that means "hidden source code", an "unusual language", or something else obscure) is false security.
You might be interested in ATS, http://ats-lang.sourceforge.net/. ATS compiles via C, can be as efficient as C, and can be used in a way that is ABI-compatible with C. From the project website:
ATS is a statically typed programming language that unifies implementation with formal specification. It is equipped with a highly expressive type system rooted in the framework Applied Type System, which gives the language its name. In particular, both dependent types and linear types are available in ATS. The current implementation of ATS (ATS/Anairiats) is written in ATS itself. It can be as efficient as C/C++ (see The Computer Language Benchmarks Game for concrete evidence) and supports a variety of programming paradigms
ATS's dependent and linear type system helps produce static guarantees about your code, including various aspects of resource management safety.
Chris Double has been writing a series of articles exploring the power of ATS's type system for systems programming here: http://bluishcoder.co.nz/tags/ats/. Of particular note is this article: http://bluishcoder.co.nz/2012/08/30/safer-handling-of-c-memory-in-ats.html
This document covers aspects of calling back and forth between ATS and C code: https://docs.google.com/document/d/1W6DYQApEqKgyBzMbvpCI87DBfLdNAQ3E60u1hUiMoU0
The main downside is that dependently-typed programming is still a daunting prospect, even for non-systems programming. The syntax of the language is also a bit weird: consider lexical quirks such as the use of abst@ype as a keyword. Finally, ATS is to some degree a research project, and I personally don't know whether it would be sensible to adopt for a commercial endeavour.
Theoretically, it's going to be Fortran: less indirection (as in: my array is [here], not just a pointer to here; and this is true of most, but not all, of your data structures and variables).
However... There are many gotchas and quirks in Fortran: not, perhaps, as many as in C but you probably know your way around C rather better than Fortran. Which is the point behind most of the comments saying 'Know your code' - but do you really know what your compiler is doing?
Knowing you, I'm prepared to take it on trust that you do, for C. Most programmers don't. You do not know and cannot know what a local JVM or JIT compiler does, and that's a black hole in your security model if you're using Java or C# or scripting languages.
Ignore anyone who tells you that the hairy-chested he-men of secure computing write their own assembler: they probably don't even know the security errors they're making in any and all nontrivial projects they release. Know your compiler, indeed.
You could write it in Lua: providing a C API to a Lua library is relatively straightforward. C++ is also an option, though of course you'd have to write C wrappers and make sure no exceptions can escape your functions. But honestly, if it's security-critical, the minor inconveniences of the C language shouldn't be that big a deal. What you really should be doing is proving the correctness of your program where feasible, and testing extensively where it's not.
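As a rough sketch of the Lua route (the Lua-side function here is made up; the C calls are the standard Lua 5.x embedding API), the C-callable entry points just drive an embedded interpreter:

    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>

    static lua_State *L;

    int mylib_init(void) {
        L = luaL_newstate();
        if (!L) return -1;
        luaL_openlibs(L);
        /* The library's actual logic lives in Lua: */
        return luaL_dostring(L, "function check(n) return n % 2 == 0 end");
    }

    /* C-callable wrapper that clients link against like any C function. */
    int mylib_check(int n) {
        lua_getglobal(L, "check");
        lua_pushinteger(L, n);
        lua_call(L, 1, 1);            /* one argument, one result */
        int r = lua_toboolean(L, -1);
        lua_pop(L, 1);
        return r;
    }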
You can write a library in Java. JNI is normally used to call C from Java, but it can be used the other way around.
There is finally a decent answer to this question: Rust.

Symbolic Computation Library in pure C

Does there exist a symbolic computation library written in pure C? Symbolic computation as in manipulating mathematical equations in symbolic form.
I know there are Mathematica and SymPy. But I am interested in creating a high-performance pure C implementation of a symbolic computation library, to bind to a scripting language, specifically Ruby to start.
It would seem that there is a need for a symbolic mathematics library such as this. Over time, ideally, the library could be built out in a similar manner to libgit2, with a central C implementation of the project and various bindings branched off from it for other languages.
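To make "manipulating equations in symbolic form" concrete, the heart of such a library is usually an expression tree built from a tagged union; a minimal sketch in C (all names illustrative, not from any existing library):

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { EX_NUM, EX_SYM, EX_ADD, EX_MUL } ExprKind;

    typedef struct Expr {
        ExprKind kind;
        union {
            double      num;                    /* EX_NUM */
            const char *sym;                    /* EX_SYM */
            struct { struct Expr *l, *r; } op;  /* EX_ADD, EX_MUL */
        } u;
    } Expr;

    static Expr *mk(ExprKind k) {
        Expr *e = malloc(sizeof *e);
        e->kind = k;
        return e;
    }
    static Expr *num(double v)         { Expr *e = mk(EX_NUM); e->u.num = v; return e; }
    static Expr *sym(const char *s)    { Expr *e = mk(EX_SYM); e->u.sym = s; return e; }
    static Expr *add(Expr *l, Expr *r) { Expr *e = mk(EX_ADD); e->u.op.l = l; e->u.op.r = r; return e; }
    static Expr *mul(Expr *l, Expr *r) { Expr *e = mk(EX_MUL); e->u.op.l = l; e->u.op.r = r; return e; }

    static void print(const Expr *e) {
        switch (e->kind) {
        case EX_NUM: printf("%g", e->u.num); break;
        case EX_SYM: fputs(e->u.sym, stdout); break;
        case EX_ADD: putchar('('); print(e->u.op.l); fputs(" + ", stdout);
                     print(e->u.op.r); putchar(')'); break;
        case EX_MUL: putchar('('); print(e->u.op.l); fputs(" * ", stdout);
                     print(e->u.op.r); putchar(')'); break;
        }
    }

    int main(void) {
        print(mul(num(2), add(sym("x"), num(1))));  /* prints (2 * (x + 1)) */
        putchar('\n');
        return 0;
    }

Simplification, differentiation, and the Ruby binding would all be recursive walks over this structure.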
Mathomatic is implemented in C, and may suit your purposes.
Mathomatic™ is a portable, command-line, educational CAS and calculator software, written entirely in the C programming language. It is free and open source software (FOSS), published under the GNU Lesser General Public License (LGPL version 2.1), and has been under continual development since 1986. The software can symbolically solve, simplify, combine, and compare algebraic equations, simultaneously performing generalized standard, complex number, modular, and polynomial arithmetic, as needed. It does some calculus and is very easy to compile/install, learn, and use.
From the developer's manual:
The Mathomatic source code can also be compiled as a symbolic math library that is callable from any C compatible program and is mostly operating system independent.
Unfortunately, the author of this package has passed away, and the software is no longer being maintained. The latest version was archived on GitHub, and the links above have been updated.
Have you taken a look at GAP? From its website:
GAP is a system for computational discrete algebra, with particular emphasis on Computational Group Theory. GAP provides a programming language, a library of thousands of functions implementing algebraic algorithms written in the GAP language, as well as large data libraries of algebraic objects. See also the overview and the description of the mathematical capabilities. GAP is used in research and teaching for studying groups and their representations, rings, vector spaces, algebras, combinatorial structures, and more. The system, including source, is distributed freely. You can study and easily modify or extend it for your special use.
According to its Wikipedia page, GAP is implemented in C, and the source code is freely available.
Please look at Axiom, a general-purpose Computer Algebra system. You can also use Giac: Giac is a free (GPL) C++ library; it is the computation kernel, and it may be used inside other C++ programs.
http://www.axiom-developer.org/
http://www-fourier.ujf-grenoble.fr/~parisse/giac.html
You can start with Maxima and use GCL to translate it from Common Lisp to C.
GCL is the official Common Lisp for the GNU project. Its design makes use of the system's C compiler to compile to native object code
There is surely an option to preserve the intermediate C source files.
GCL currently compiles itself and the primary free software Lisp applications, Maxima , ACL2 and Axiom, on eleven GNU/Linux architectures (x86 powerpc s390 sparc arm alpha ia64 hppa m68k mips mipsel), Windows, Sparc Solaris, and FreeBSD.

High-level system language that compiles to c?

I'm looking for a higher-level system language, if possible, suitable for formal verification, that compiles to standard C, so that it can be run cross-platform with (relatively) low overhead.
The two most promising such languages I've stumbled during the past few days are:
BitC - While the design goals of this language match my needs (it even supports the functional paradigm), it is in a very unstable state, the documentation is out of date, and, generally, it seems like a very long shot for a real-world project.
Lisaac - It supports design-by-contract, which is very cool, and has relatively low performance overhead. However, the website is dead, there hasn't been a new release since '08, and generally it seems the language is dead.
I'd also like to note that it's not meant for a real-time system, so a GC or, generally, non-determinism (in the real-time sense), is not an issue.
The project involves mainly audio processing, though it has to be cross-platform.
I assume someone would point me to the obvious answer - "plain ol' C". While it is truly cross-platform and very effective, the code quantity would probably be greater.
EDIT: I should clarify that I mean cross-platform AND cross-architecture. That is why I consider only languages, compiled to C in the first place, but if you can point me to another example, I'd be grateful :)
I think you may be interested in ATS. It compiles to C (in fact, it expresses and explains many C idioms and patterns from a formal type-theoretic perspective; it has even been proposed to prepare a book of sorts to show this -- if only we had more time...).
The project involves mainly audio processing, though it has to be cross-platform.
I don't know much about audio processing, I've been mainly doing some computer graphics stuff (mostly the basic things, just to try it out).
Also, I am not sure if ATS works on Windows (never tried that).
(Disclaimer: I've been working with ATS for some time. It is a bulky and large language, and sometimes hard to use, but I very much liked the quality of programs I've been able to produce with it; for instance, see the TEST subdirectory in the GLES2 bindings for some realistic programs.)
The following doesn't strictly adhere to the requirements but I'd like to mention it anyway and it is too long for a comment:
PyPy's RPython can be translated to C. Here's a nice talk about it. It's been used to implement Smalltalk, JavaScript, Io, Scheme and Gameboy (with varying degrees of completeness), but you can also write standalone programs in it. It is known mainly for the implementation of the Python language that runs on Intel x86 (IA-32) and x86_64 platforms.
The translation process requires a capable C compiler. The toolchain provides means to infer various things about the code (used by the translation process itself) that you might repurpose for formal verification.
If you know both Python and C, you could use Cython, which translates Python-like syntax to C. It is used to write CPython extensions.

Starting off a simple (the simplest perhaps) C compiler?

I came across this: Writing a compiler using Turbo Pascal
I am curious if there are any tutorials or references explaining how to go about creating a simple C compiler. I mean, it is enough if it gets me to the level of making it understand arithmetic operations. I became really curious after reading this article by Ken Thompson. The idea of writing something that understands itself seems exciting.
Why did I put up this question instead of asking Google? I tried Google, and the Pascal one was the first link. The rest did not seem relevant, and added to that... I am not a CS major (so I still need to learn what all those tools like yacc do), and I want to learn this by doing, hoping that people with more experience make better guides than Google. I want to read some article written in the same spirit as the one I listed above, but one which highlights at least the bootstrapping phases of building a simple C compiler.
Also, I don't know the best way to learn. Do I start off building a C compiler in C or some other language? Do I write a C compiler or some other language? I feel questions like this are better answered once I have some direction to explore. Any suggestions?
Any suggestions?
A compiler consists of three pieces:
A parser
An abstract syntax tree (AST)
An assembly code generator
There are lots of nice parser generators that start with language grammars. Maybe ANTLR would be a good place for you to start. If you want to stick to C roots, try lex/yacc or bison.
There are grammars for C, but I think C in its entirety is complex. You'd do well to start off with a subset of the language and work your way up.
Once you have an AST, you use it to generate the machine code that you'll run.
It's doable, but not trivial.
I'd also check Amazon for books about writing compilers. The Dragon Book is the classic, but there are more modern ones available.
UPDATE: There have been similar questions on Stack Overflow, like this one. Check out those resources as well.
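To make those three pieces concrete at the scale the questioner asked about ("understand arithmetic operations"), here is a minimal sketch of a recursive-descent parser in C for +, *, parentheses and single digits; it evaluates as it parses, where a real compiler would build an AST and emit code instead (no error checking, sketch only):

    #include <stdio.h>

    static const char *p;        /* cursor into the source text */

    static int expr(void);       /* expr   := term {'+' term} */

    static int factor(void) {    /* factor := digit | '(' expr ')' */
        if (*p == '(') { p++; int v = expr(); p++; /* skip ')' */ return v; }
        return *p++ - '0';
    }

    static int term(void) {      /* term := factor {'*' factor} */
        int v = factor();
        while (*p == '*') { p++; v *= factor(); }
        return v;
    }

    static int expr(void) {
        int v = term();
        while (*p == '+') { p++; v += term(); }
        return v;
    }

    int main(void) {
        p = "(1+2)*3+4";
        printf("(1+2)*3+4 = %d\n", expr());  /* prints 13 */
        return 0;
    }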
I recommend this tutorial:
LLVM tutorial
It is a small example of how to implement a "small language" compiler. The source code is very small and is explained step by step.
There is also the C front-end library for LLVM (the Low Level Virtual Machine, which represents the internal structure of a program):
Clang
For what it's worth, the Tiny C Compiler is a pretty full-featured C compiler in a relatively small source package. You might benefit from studying that source, as it's probably significantly easier to understand than trying to comprehend all of GCC's source base, for instance.
This is my opinion (and conjecture): it will be hard to write a compiler without understanding the data structures normally covered in undergraduate (post-secondary) Computer Science classes. This doesn't mean you cannot, but you will need to know essential data structures such as linked lists and trees.
Rather than writing a full or standards-compliant C compiler (at least at the start), I would suggest limiting yourself to a basic subset of the language, such as common operators, integer-only support, and basic functions and pointers. One classic example of this was Ron Cain's Small-C, made popular by a series of articles written in Dr. Dobb's Journal in, I believe, the 1980s. They published a CD with James Hendrix's out-of-print book, A Small-C Compiler.
What I would suggest is following Crenshaw's tutorial, but writing it for a C-like language compiler and whatever CPU target you wish (Crenshaw targets the Motorola 68000). To do this, you will need to know the basic assembly of whichever target you want to run the compiled programs on. This could include an emulator for the 68000 or MIPS, which arguably have nicer assembly instruction sets than the venerable CISC instruction set of the Intel x86 (16/32-bit).
There are many potential books that can be used as starting points for learning compiler/translator theory (and practice). Read the comp.compilers FAQ and the reviews at various online booksellers. Most introductory books are written as textbooks for sophomore- to senior-level undergraduate Computer Science classes, so they can be slow reading without a CS background. One older book that might be more introductory, but easier to read than "The Dragon Book", is Introduction to Compiler Construction by Thomas Parsons. It is older, so you should be able to find a used copy from your choice of online booksellers at a reasonable price.
So I'd say, try starting with Jack Crenshaw's Let's Build a Compiler tutorial, write your own, following his examples as a guide, and build the basics of a simple compiler. Once you have that working, you can better decide where you wish to take it from that point.
Added:
In regards to the bootstrapping process: since there are existing C compilers freely available, you do not need to worry about bootstrapping. Write your compiler with separate, existing tools (GCC, Visual C++ Express, MinGW / DJGPP, tcc), and you can worry about self-compiling your project at a much later stage. I was surprised by this part of the question until I realized you were brought to the idea of writing your own compiler by reading Ken Thompson's ACM Turing Award lecture, Reflections on Trusting Trust, which does go into the compiler bootstrapping process. It's a moderately advanced topic, and also simply a lot of hassle. I found even bootstrapping the GCC C compiler under older Unix systems (Digital OSF/1 on the 64-bit Alpha) that included a C compiler to be a slow, time-consuming, error-prone process.
The other sort-of question was what a compiler tool like yacc actually does. Yacc (Yet Another Compiler Compiler; Bison is the GNU version) is a tool designed to make writing a compiler (or translator) parser easier. Based on the formal grammar for your target language that you feed to yacc, it generates a parser, which is one portion of a compiler's overall design. Next is lex (or flex from GNU), which is used to generate a lexical analyzer, or scanner, often used in combination with the yacc-generated parser to form the skeleton of the front end of a compiler. These tools make writing a front end arguably easier than writing a lexical analyzer and parser yourself. Crenshaw's tutorial does not use them, and you don't need to either; many compiler writers don't. Of course, Crenshaw admits the tutorial's parser is quite basic.
Crenshaw's tutorial also skips generating an AST (abstract syntax tree), which simplifies but also limits the tutorial compiler. It lacks most, if not all, optimization, and is very tied to the specific programming language and to the particular assembly language emitted by the compiler's back end. Normally the AST is a middle piece where some optimization can be performed, and it serves to decouple the compiler front end and back end in the design. For a beginner without a Computer Science background, I'd suggest not worrying about the lack of an AST in your first compiler (or at least its first version). I think keeping it small and simple will help you finish writing a compiler, in its first version, and you can decide from there how you want to proceed.
You might be interested in the book/course The Elements of Computing Systems:Building a Modern Computer from First Principles.
Note that this isn't about building a PC from parts you bought off Newegg. It begins with a description of Boolean logic fundamentals, and builds a virtual computer from the lowest levels of abstraction to progressively higher levels. The course materials are all online, and the book itself is fairly inexpensive from Amazon.
In the course, in addition to "building the hardware", you'll also implement an assembler, virtual machine, compiler, and rudimentary OS, in a step-wise fashion. I think this would give you enough of a background to delve deeper into the subject area with some of the more commonly recommended resources listed in the other answers.
In The Unix Programming Environment, Kernighan and Pike walk through five iterations of building a calculator, working from simple C-based lexical analysis and immediate execution up to yacc/lex parsing and code generation for an abstract machine. Because they write so wonderfully, I can't suggest a smoother introduction. The calculator language is certainly smaller than C, but that is likely to your advantage.
How do I [start writing] a simple C compiler?
There's nothing simple about compiling C. The best simple C compiler is lcc by Chris Fraser and David Hanson. They spent 10 years working on the design to make it as simple as they possibly could, while still generating reasonably good code. If you have access to a university library, you should be able to get their book.
Do I start off building a C compiler in C or some other language?
Some other language. One time I got to ask Hanson what lessons he and Fraser had learned by spending 10 years on the lcc project. The main thing Hanson said was
C is a lousy language to write a compiler in.
You're better off using Haskell or some dialect of ML. Both languages offer functions over algebraic data types, which is a perfect match to the problems faced by the compiler writer. If you still want to pursue C, you could start with George Necula's CIL, which is a big chunk of a C compiler written in ML.
I want to read some article written in the same spirit as the one I listed above but that which highlights at least the bootstrapping phases...
You won't find another article like Ken's. But Andrew Appel has written a nice article called Axiomatic Bootstrapping: A Guide for Compiler Hackers. I couldn't find a free version, but many people have access to the ACM Digital Library.
Any suggestions?
If you want to write a compiler,
Use Haskell or ML as your implementation language.
For your first compiler, pick a very simple language like Oberon or like P0 from Niklaus Wirth's book Algorithms + Data Structures = Programs. Wirth is famous for designing languages that are easy to compile.
You can write a C compiler for your second compiler.
A compiler is a complex subject that covers aspects of:
Input processing, involving lexing and parsing
Building a symbol table of every variable used, and an abstract syntax tree (AST)
Walking the AST to generate a machine-code binary based on the syntax
This is by no means exhaustive; it is an abstract bird's-eye view from the top of a mountain. It boils down to getting the syntax notation correct and ensuring that malformed input does not throw the compiler off; in fact, good input processing should never fall to its knees, no matter how malformed, terrible, or abusive the input thrown at it. You also need to decide what the output is going to be: if it is machine code, that implies you will have to get to know the processor's instructions intimately, including memory addressing for variables and so on...
Here are some links for you to get started:
There was a port of Jack Crenshaw's code to C... (I recall downloading it months ago...)
Here's a link to a similar question here on SO.
Also, here's another small compiler tutorial for Basic to x86 assembler compiler.
Tiny C Compiler
Hendrix's Small C Compiler found here.
It might be worthwhile to learn about functional programming, too. Functional languages are well-suited to writing a compiler both in and for. My school's intro compilers class contained an intro to functional languages and the assignments were all in OCaml.
Funny you should ask this today, since just a couple of days ago I wrote a lambda calculus interpreter. Lambda calculus is the granddaddy of all functional languages. It's just 200 lines long (in C++, incl. error reporting, some pretty-printing, some Unicode) and has a two-phase structure, with an intermediate format that could be used to generate code.
Not only is starting small and building up the most practical approach to compilers, it also encourages good, modular, organizational practice.
A compiler is a very large project, although I suppose it wouldn't hurt to try.
I know of at least one C compiler written in Pascal, so it's not the most insane thing you could do. I personally would pick a more modern language in which to implement my C compiler project, both for simplicity (it's easy to download packages for Python, Ruby, C, C++ or Java) and because it will look better on your resume.
In order to do a compiler as a beginner project, though, you will need to drink all of the Agile kool-aid.
Always have something running, even if it doesn't do much of anything. Add things to your compiler only in small steps. ("Frequent releases".) Pick a viciously tiny subset of the language and implement that first. (Support only i = 0; at first and expand things from there.)
If you want a mind-blowing experience that teaches you how to write compilers that compile themselves, you need to read this paper from 1964.
META II: A Syntax-Oriented Compiler Writing Language, by Val Schorre.
In 10 pages, it tells you how to write compilers, how to write meta compilers, provides a virtual metacompiler instruction set, and a sample compiler built with the metacompiler.
I learned how to write compilers from this paper back in the late 60s, and used the ideas to construct C-like languages for several minicomputers and microprocessors.
If the paper is too much by itself (it isn't!) there's an online tutorial which will walk you through the whole thing.
And if getting the paper from the original link is awkward because you are not an ACM member, you'll find that the tutorial contains all the details anyway. (IMHO, for the price, the paper itself is waaaaay worth it).
10 pages!
I would not recommend starting with C as the language to implement, nor with any of the compiler-generator or parser-generator tools. C is a very tricky language, and it's probably a better idea to just make up a language of your own. It can be a little C-like (e.g. use curly brackets to indicate the function body, use the same type names, so you don't have to remember what you called everything).
The tools for making compilers and parsers are great, but have the problem of really being a shorthand notation. If you don't know how to create a compiler in longhand, the shorthand will seem cryptic and needlessly restrictive. So write your own simple compiler first, then continue on from there. I also recommend you don't start generating actual machine code unless you eat and breathe assembler. Create your own bytecode interpreter with a VM, along the lines of the sketch below.
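Here is a minimal sketch of that idea: a tiny stack machine with four opcodes, which a compiler front end could target instead of native machine code (the opcode names are made up for illustration):

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

    static int run(const int *code) {
        int stack[64], sp = 0;
        for (int pc = 0; ; ) {
            switch (code[pc++]) {
            case OP_PUSH: stack[sp++] = code[pc++];       break;
            case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break;
            case OP_MUL:  sp--; stack[sp-1] *= stack[sp]; break;
            case OP_HALT: return stack[sp-1];
            }
        }
    }

    int main(void) {
        /* bytecode for (1 + 2) * 3 */
        int code[] = { OP_PUSH, 1, OP_PUSH, 2, OP_ADD, OP_PUSH, 3, OP_MUL, OP_HALT };
        printf("%d\n", run(code));  /* prints 9 */
        return 0;
    }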
As to what language you should use to create your first compiler: it doesn't really matter, as long as the language is fairly complete. You will be reading input text, building data structures from it, and writing out binary data. So if a language makes those things easier in any way, that's a point in its favor. Pick a language you know well, so you can focus on creating the compiler, not on learning the language. I usually use an OO language, which makes the syntax tree easier to write; a functional language would probably also work if you are familiar with one.
I've blogged a lot about programming languages, so you might find some useful postings here: http://orangejuiceliberationfront.com/category/language-design/
In particular, http://orangejuiceliberationfront.com/how-to-write-a-compiler/ is a starter on the particulars of parsing common constructs and generating something useful from that, as well as http://orangejuiceliberationfront.com/generating-machine-code-at-runtime/ which talks about actually spitting out Intel instructions that do something.
Oh, regarding bootstrapping a compiler: you probably won't be able to do that right from the start. There is a fair amount of work involved in creating a compiler, so writing a bootstrapping compiler means not only writing the compiler (in some other language), but also writing a second version of the compiler using itself once the first one works. That's twice the work, plus the debugging needed in the existing and the bootstrapped new compiler until it all works. OK, maybe not twice the work, but certainly more work. That said, once you have a working compiler, self-compilation is a good way to test its completeness. I'd go for the easy successes first, then move on from there.
In any event, have fun!
