How to remove all C and C++ comments in vi?
//
/*
*/
You can't. Parsing C and C++ comments is not something that regular expressions can do. It might work for simple cases, but you never know if the result leaves you with a corrupted source file. E.g. what happens to this:
printf ("//\n");
The proper way is to use an external tool able to parse C. For example some compilers may have an option to strip comments.
Writing a comment stripper is also a basic exercise in lex and yacc programming.
See also this question: Remove comments from C/C++ code
With regular expressions, because of the complexity of C/C++ syntax, you will at best achieve a solution that is 90% correct. Better let the right tool (a compiler) do the job.
Fortunately, Vim integrates nicely with external tools. Based on this answer, you can do:
:%! gcc -fpreprocessed -dD -E "%" 2>/dev/null
Caveats:
requires gcc
it also slightly modifies the formatting (shrunk indent etc.)
Esc:%s/\/\///
Esc:%s/\/\*//
Esc:%s/\*\///
You can use a lexical analyzer like Flex directly applied to source codes. In its manual you can find "How can I match C-style comments?".
If you need an in-depth tutorial, you can find it here; under "Lexical Analysis" section you can find a pdf that introduce you to the tool and an archive with some practical examples, including "c99-comment-eater".
Related
I was wondering if it is possible, and if yes how, can I run a C preprocessor, like cpp, on a
C++ source file and only process the conditional directives #if #endif etc. I would like other
directives to stay intact in the output file.
I'm doing some analysis on C# code and there is no C# pre-processor. My idea is to run a C preprocessor on C# file and process only conditionals. This way for example, the #region directive, will stay
in the file, but cpp appears to remove #region.
You might be looking for a tool like coan:
Coan is a software engineering tool for analysing preprocessor-based configurations of C or C++ source code. Its principal use is to simplify a body of source code by eliminating any parts that are redundant with respect to a specified configuration.
It's precisely designed to process #if and #ifdef preprocessor lines, and remove code accordingly, but it has a lot of other possible uses.
The linux unifdef command does what you want:
http://linux.die.net/man/1/unifdef
Even if you're not on linux, there is source available on the web.
BTW, this is a duplicate of another question: Way to omit undefined preprocessor branches by default with unifdef?
Oh, this is the same task as I had in the past. I've tried cpp unifdef and coan tools - all of them stumbled upon special C# preprocessor things like #region. In the end I've decided to make my own one:
https://github.com/gaDZella/undefine.
The tool has a pretty simple set of options compared to the mentioned cpp tools but it is fully compatible with C# preprocessor syntax.
You can use g++ -E option to stop after preprocessing stage
-E -> stop after the preprocessing stage.The output is in the form of preprocessed source code, which is sent to the standard output
I would like to know if there is any kind of regular expression expansion within the compiler(GCC) pre processor. Basically more flexible code generation macros.
If there is not a way, how do you suggest i accomplish the same result
The C preprocessor can't do that.
You might want to use a template processor (for instance Mustache but there are many others) that generates what you need before passing it to the compiler.
Also, if you are planning a bigger project and you know this feature will be beneficial you might want to write your own preprocessor that you can run automatically from some build system. Good example of such solution would be moc which enhances C++ for the purpose of Qt framework. Purist might of course disagree.
There is this https://github.com/graph/qc qc = Quick C it allows you to do this in your source code files that end with qc.h
$replace asdf_(\d+) => asdf_ :) $1 blabla
// and now in your code anything that matches the above regular expression
asdf_123
// will become asdf_ :) 123 blabla
And it will output a .cpp & a .h thats preprocessed. Its made to avoid the need to maintain header files. And some other things not making it backwards compatible with c++, but it outputs c++ code so you can do all the c++ things you want at the end of the day.
Edit: I made it and have a bias towards qc.
You might want to look at re2c.org. It it a separate C preprocessor to generate
C code to match regular expressions. I found that and your question when looking for
something similar.
This question : Is there a way to tell whether code is now being compiled as part of a PCH? lead me to thinking about this.
Is there a way, in perhaps only certain compilers, of getting a C/C++ compiler to dump out the defines that it's currently using?
Edit: I know this is technically a pre-processor issue but let's add that within the term compiler.
Yes. In GCC
g++ -E -dM <file>
I would bet it is possible in nearly all compilers.
Boost Wave (a preprocessor library that happens to include a command line driver) includes a tracing capability to trace macro expansions. It's probably a bit more than you're asking for though -- it doesn't just display the final result, but essentially every step of expanding a macro (even a very complex one).
The clang preprocessor is somewhat similar. It's also basically a library that happens to include a command line driver. The preprocessor defines a macro_iterator type and macro_begin/macro_end of that type, that will let you walk the preprocessor symbol table and do pretty much whatever you want with it (including printing out the symbols, of course).
The #encode directive returns a const char * which is a coded type descriptor of the various elements of the datatype that was passed in. Example follows:
struct test
{ int ti ;
char tc ;
} ;
printf( "%s", #encode(struct test) ) ;
// returns "{test=ic}"
I could see using sizeof() to determine primitive types - and if it was a full object, I could use the class methods to do introspection.
However, How does it determine each element of an opaque struct?
#Lothars answer might be "cynical", but it's pretty close to the mark, unfortunately. In order to implement something like #encode(), you need a full blown parser in order to extract the the type information. Well, at least for anything other than "trivial" #encode() statements (i.e., #encode(char *)). Modern compilers generally have either two or three main components:
The front end.
The intermediate end (for some compilers).
The back end.
The front end must parse all the source code and basically converts the source code text in to an internal, "machine useable" form.
The back end translates the internal, "machine useable" form in to executable code.
Compilers that have an "intermediate end" typically do so because of some need: they support multiple "front ends", possibly made up of completely different languages. Another reason is to simplify optimization: all the optimization passes work on the same intermediate representation. The gcc compiler suite is an example of a "three stage" compiler. llvm could be considered an "intermediate and back end" stage compiler: The "low level virtual machine" is the intermediate representation, and all the optimization takes place in this form. llvm also able to keep it in this intermediate representation right up until the last second- this allows for "link time optimization". The clang compiler is really a "front end" that (effectively) outputs llvm intermediate representation.
So, if you want to add #encode() functionality to an 'existing' compiler, you'd probably have to do it as a "source to source" 'compiler / preprocessor'. This was how the original Objective-C and C++ compilers were written- they parsed the input source text and converted it to "plain C" which was then fed in to the standard C compiler. There's a few ways to do this:
Roll your own
Use yacc and lex to put together a ANSI-C parser. You'll need a grammar- ANSI C grammar (Yacc) is a good start. Actually, to be clear, when I say yacc, I really mean bison and flex. And also, loosely, the other various yacc and lex like C-based tools: lemon, dparser, etc...
Use perl with Yapp or EYapp, which are pseudo-yacc clones in perl. Probably better for quickly prototyping an idea compared to C-based yacc and lex- it's perl after all: Regular expressions, associative arrays, no memory management, etc.
Build your parser with Antlr. I don't have any experience with this tool chain, but it's another "compiler compiler" tool that (seems) to be geared more towards java developers. There appears to be freely available C and Objective-C grammars available.
Hack another tool
Note: I have no personal experience using any of these tools to do anything like adding #encode(), but I suspect they would be a big help.
CIL - No personal experience with this tool, but designed for parsing C source code and then "doing stuff" with it. From what I can glean from the docs, this tool should allow you to extract the type information you'd need.
Sparse - Worth looking at, but not sure.
clang - Haven't used it for this purpose, but allegedly one of the goals was to make it "easily hackable" for just this sort of stuff. Particularly (and again, no personal experience) in doing the "heavy lifting" of all the parsing, letting you concentrate on the "interesting" part, which in this case would be extracting context and syntax sensitive type information, and then convert that in to a plain C string.
gcc Plugins - Plugins are a gcc 4.5 (which is the current alpha/beta version of the compiler) feature and "might" allow you to easily hook in to the compiler to extract the type information you'd need. No idea if the plugin architecture allows for this kind of thing.
Others
Coccinelle - Bookmarked this recently to "look at later". This "might" be able to do what you want, and "might" be able to do it with out much effort.
MetaC - Bookmarked this one recently too. No idea how useful this would be.
mygcc - "Might" do what you want. It's an interesting idea, but it's not directly applicable to what you want. From the web page: "Mygcc allows programmers to add their own checks that take into account syntax, control flow, and data flow information."
Links.
CocoaDev Objective-C Parsing - Worth looking at. Has some links to lexers and grammars.
Edit #1, the bonus links.
#Lothar makes a good point in his comment. I had actually intended to include lcc, but it looks like it got lost along the way.
lcc - The lcc C compiler. This is a C compiler that is particularly small, at least in terms of source code size. It also has a book, which I highly recommend.
tcc - The tcc C compiler. Not quite as pedagogical as lcc, but definitely still worth looking at.
poc - The poc Objective-C compiler. This is a "source to source" Objective-C compiler. It parses the Objective-C source code and emits C source code, which it then passes to gcc (well, usually gcc). Has a number of Objective-C extensions / features that aren't available in gcc. Definitely worth looking at.
You would implement this by implementing the ANSI C compiler first and then add some implementation specific pragmas and functions to it.
Yes i know this is cynical answer and i accept the downvotes.
One way to do it would be to write a preprocessor, which reads the source code for the type definitions and also replaces #encode... with the corresponding string literal.
Another approach, if your program is compiled with -g, would be to write a function that reads the type definition from the program's debug information at run-time, or use gdb or another program to read it for you and then reformat it as desired. The gdb ptype command can be used to print the definition of a particular type (or if that is insufficient there is also maint print type, which is sure to print far more information than you could possibly want).
If you are using a compiler that supports plugins (e.g. GCC 4.5), it may also be possible to write a compiler plugin for this. Your plugin could then take advantage of the type information that the compiler has already parsed. Obviously this approach would be very compiler-specific.
I'm looking for a parser for C. Here is what I need:
Written in C (not C++).
Handwritten (not generated).
BSD or similarly permissive license.
Capable of nontrivially parsing itself (can be a subset of C).
It can be part of a project as long as it's decoupled so that I can pull out the parser.
Is there an existing parser that fulfills these requirements?
If you don't need C99, then lcc is a slam dunk:
It is documented in a very clear, well-written book.
Techniques used for recursive-descent parsing of operators with precedence are well documented in an article and technical report by Dave Hanson.
Clear, handwritten ANSI C code.
One potential downside is that the lcc parser does not build an abstract-syntax treeāit goes straight from parsing to intermediate code.
If you must have C99 then I think tinycc (tcc) is your best bet.
How about Sparse?
You could try TCC. It's licensed under the Lesser GPL.
It seems that nwcc sufficiently agrees with your requirements.
Good c compiler is present at this location. Simple and accessible.
https://github.com/rui314/8cc
GCC has one in gcc/c-parser.c.
Check elsa, it uses the Generalized LR algorithm.
Its main use is for C++, but it also parses C code.
Check on its page, on the section called "How much C can Elsa parse?" which says it can parse most C programs, including the Linux kernel.
It's released under a BSD license.
Here is a recursive descent parser I ported to C:
http://www.gabotronics.com/resources/recursive-descent-parser.htm