Is there any interactive tool on web to understand common code bases? - c

I am modifying the code for glibc 2.5. Now since glibc is large and complex, I need to have a really good tool, to see interaction of different parts of the code. I am using Understand for this purpose, but Understood is only valid for 15 days. Afterwards you have to buy it.
So my question is, on web, are there sites where you can interactively understand common code bases such as glibc, gcc, linux kernel etc. I mean where you could search for some function, and then click on a function call to see its definition and such other useful features. I have used Koders.com, but it will only display the code, and is not interactive.

OpenGrok is good. You have to host it yourself though, it's not on the web.

Related

Porting Autodesk Animator Pro to be cross platform

a previous relevant question from me is here Reverse Engineering old paint programs
I have set up my base of operations here: http://animatorpro.org
wiki coming soon.
Okay, so now I have a 300,000 line legacy MSDOS codebase. It's sort of a "be careful what you wish for" situation. I am not an experienced C programmer. I'm not entirely inexperienced either, but for all intents and purposes I'm a noob to the language and in particular the intricacies of its libraries. I am especially ignorant of the vagaries of the differences between C programs written specifically for MSDOS and programs that are cross platform. However I have been studying this code base for over a year now, and this is what I know about Animator Pro:
Compilers and tools used:
Watcom C compiler
tcmake (make program from Turbo C)
386asm, a specialised assembler for the Phar Lap dos extender
and of course, the Phar Lap dos extender itself.
a selection of obscure dos utilities
Much of the compilation seems to be driven by batch files. Though I have obtained copies of all these tools, I have not yet succeeded at compiling it. (though I have compiled its older brother, autodesk animator original.
It's got a plugin system that replicates DLL before DLL's were available, based on REX. The plugin system handles:
Video Drivers (with a plethora of included VESA drivers)
Input drivers (including wacom tablets, and keyboards)
Drawing Tools
Inks (Like photoshop's filters, or blending modes)
Scripting Addons (essentially compiled scripts)
File formats
It's got its own script interpreter named POCO, based on the C language- The scripting language has enough power to do virtually all the things the plugin system can do- Just slower.
Given this information, this is my development plan. Please criticise this. The source code is available in the link above, so you can easily, if you are so inclined, assess the situation yourself.
Compile with its original tools.
Switch to using DJGPP, and make the necessary changes to get it to compile with that, plus the original assembler.
Include the Allegro.cc "Game" library, and switch over as much functionality to that library as possible- Perhaps by simply writing new video and input drivers that use the Allegro API. I'm thinking allegro rather than SDL because: there is a DOS version of Allegro, and fascinatingly, one of its core functions is the ability to play Animator Pro's native format FLIC.
Hopefully after 3, I will have eliminated most or all of the Assembler in the project. I say hopefully, because it's in an obscure dialect that doesn't assemble in any modern free assembler without significant modification. I have tried them all. Whatever is left gets converted to assemble in NASM, or to C code if I can define the assembler's actual function.
Switch the dos extender from Phar Lap to HX Dos http://www.japheth.de/HX.html, Which promises to replicate as much of the WIN32 api as possible. Then make all the necessary code changes for that to work.
Switch to the win32 version of Allegro.cc, assuming that the win32 version can run on top of HXDos. Make any further necessary changes
Modify the plugin system to use some kind of standard cross platform plugin library. What this would be, I have no idea. Maybe you can offer some suggestions? I talked to the developer who originally wrote the plugin system, and he said some of the things it does aren't possible on modern OS's because of segmentation restrictions. I'm not sure what this means, but I'm guessing it means all the plugins will need to be rewritten almost from scratch.
Magically, I got all the above done, and we can try and make it run in windows, osx, and linux, whilst dealing with other cross platform niggles like long file names, and things I haven't thought of.
Anyone got a problem with any of this? Is allegro a good choice? if not, why? what would you do about this plugin system? What would you do different? Is this whole thing foolish, and should I just rewrite it from scratch, using the original as inpiration? (it would apparently take the original developer "About a month" to do that)
One thing I haven't covered above is the text/font system. Not sure what to do about that, but Animator Pro has its own custom font format, but also is able to use Postscript Type 1 fonts, and some other formats.
My biggest concern with your plan, in a nutshell: Your approach seems to be to attempt to keep the whole enormous thing working at all times, tweaking the environment ever-further away from DOS. During each tweak to the environment, that means you will have approximately a billion subtle assumptions that might have broken at once, none of which you necessarily understand yet. Untangling them all at once will be incredibly painful.
If I were doing the port, my approach would be to disable as much code as possible to get SOMETHING running in a modern environment, and bring the parts back online, one piece at a time. Write a simple test harness program that loads a display driver and draws some stuff, and compile it for DOS to make sure you understand the interface. Then write some C code that implements the same interface, but with Allegro (or SDL or SFML), and make that program work under Windows or Linux. When the output differs, you have a simple test case to work from.
Your entire job on this port is swapping out implementations of various interfaces and functions with completely new ones. This is a job that unit testing excels at. Don't write any new code without a test of some kind that runs on the old code under DOS! Make your potential problems as small and simple as you possibly can. Port assembly code instead of rewriting it only if you're reasonably confident that it will actually make your job easier (ie, algorithmic stuff that compiles fine with few tweaks under NASM). Don't bite off a bigger piece than you can comfortably fit in your brain at once.
I, for one, look forward to seeing your progress! I think what you're attempting to do is great. Thanks for doing it.
Hmmm - I might approach it by writing an OpenGL video "driver" for it. and todays machines are fast enough with tons of ram that you could do all the pixel specific algorithms on main CPU into a back buffer and it would work. As the "generic" VGA driver just mapped the video buffer to a pointer this would be a place to start. There was a zoom mode in the UI so you can look at the pixels on a high res display.
It is often very difficult to take an existing non-trivial code base that wasn't written with portability in mind - you mention a few - and then try to make it portable. There will be a lot of problems on the way. It is probably a better idea to start from scratch and rewrite the code using the existing code as reference only. If you start from scratch you can leverage existing portable UI solution in your new project like Qt.

Erlang source code guide

I am interested in delving into Erlang's C source code and try to understand what is going on under the hood. Where can I find info on the design and structure of the code?
First of all, you might want to have a look to Joe Armstrong's thesis, introducing Erlang at a high level. It will be useful to get an idea of what was the idea behind the language. Then, you could focus on the Erlang Run Time System (erts). The erlang.erl module could be a good start. Then, I would focus on the applications who constitutes the so-called minimal release, kernel and stdlib. Within the stdlib, have a look on how behaviours are implemented. May I suggest the gen_server.erl module as a start?
A Guide To The Erlang Source
http://www.trapexit.org/A_Guide_To_The_Erlang_Source
The short answer is that there is no good guide. And the code is not very well documented.
I recommend finding someone in your neighbourhood that knows the code reasonably well, and buy them dinner in exchange for a little chat.
If you don't have the possibility to do that, then I recommend starting with the loader.
./erts/emulator/beam/beam_load.c
Some useful information can also be found by pretty printing the beam representation. I don't know whether there is any way to do so supplied by OTP, but the HiPE project has some cheats.
hipe:c(MODULE, [pp_beam]).
Should get you started.
(And I also recommend Joe's book.)
Pretty printer of beam can be done by 'erlc -S', which is equivalent with hipe:c(M, [pp_beam]) mentioned by Daniel.
I also use erts_debug:df(Module). to disassemble the loaded beam code, which are instructions actually been interpreted by the VM.
Sometimes I use a debugger. OTP delivers tools supporting gdb very well. See example usage at http://www.erlang.org/pipermail/erlang-questions/2008-September/037793.html
A little late to the party here. If you just download the source from GitHub the internal documentation is really good. You have to generate some of it using make.
Get the documentation built and most of the relevant source is under /erts (Erlang Run Time System)
Edit: BEAM Wisdoms is also a really good guide but it may or may not be what you're after.

How does one go about understanding GNU source code?

I'm really sorry if this sounds kinda dumb. I just finished reading K&R and I worked on some of the exercises. This summer, for my project, I'm thinking of re-implementing a linux utility to expand my understanding of C further so I downloaded the source for GNU tar and sed as they both seem interesting. However, I'm having trouble understanding where it starts, where's the main implementation, where all the weird macros came from, etc.
I have a lot of time so that's not really an issue. Am I supposed to familiarize myself with the GNU toolchain (ie. make, binutils, ..) first in order to understand the programs? Or maybe I should start with something a bit smaller (if there's such a thing) ?
I have little bit of experience with Java, C++ and python if that matters.
Thanks!
The GNU programs big and complicated. The size of GNU Hello World shows that even the simplest GNU project needs a lot of code and configuration around it.
The autotools are hard to understand for a beginner, but you don't need to understand them to read the code. Even if you modify the code, most of the time you can simply run make to compile your changes.
To read code, you need a good editor (VIM, Emacs) or IDE (Eclipse) and some tools to navigate through the source. The tar project contains a src directory, that is a good place to start. A program always start with the main function, so do
grep main *.c
or use your IDE to search for this function. It is in tar.c. Now, skip all the initialization stuff, untill
/* Main command execution. */
There, you see a switch for subcommands. If you pass -x it does this, if you pass -c it does that, etc. This is the branching structure for those commands. If you want to know what these macro's are, run
grep EXTRACT_SUBCOMMAND *.h
there you can see that they are listed in common.h.
Below EXTRACT_SUBCOMMAND you see something funny:
read_and (extract_archive);
The definition of read_and() (again obtained with grep):
read_and (void (*do_something) (void))
The single parameter is a function pointer like a callback, so read_and will supposedly read something and then call the function extract_archive. Again, grep on it and you will see this:
if (prepare_to_extract (current_stat_info.file_name, typeflag, &fun))
{
if (fun && (*fun) (current_stat_info.file_name, typeflag)
&& backup_option)
undo_last_backup ();
}
else
skip_member ();
Note that the real work happens when calling fun. fun is again a function pointer, which is set in prepare_to_extract. fun may point to extract_file, which does the actual writing.
I hope I walked you a great deal through this and shown you how I navigate through source code. Feel free to contact me if you have related questions.
The problem with programs like tar and sed is twofold (this is just my opinion, of course!). First of all, they're both really old. That means they've had multiple people maintain them over the years, with different coding styles and different personalities. For GNU utilities, it's usually pretty good, because they usually enforce a reasonably consistent coding style, but it's still an issue. The other problem is that they're unbelievably portable. Usually "portability" is seen as a good thing, but when taken to extremes, it means your codebase ends up full of little hacks and tricks to work around obscure bugs and corner cases in particular pieces of hardware and systems. And for programs as widely ported as tar and sed, that means there's a lot of corner cases and obscure hardware/compilers/OSes to take into account.
If you want to learn C, then I would say the best place to start is not trying to study code that others have written. Rather, try to write code yourself. If you really want to start with an existing codebase, choose one that's being actively maintained where you can see the changes that other people are making as they make them, follow along in the discussions on the mailing lists and so on.
With well-established programs like tar and sed, you see the result of the discussions that would've happened, but you can't see how software design decisions and changes are being made in real-time. That can only happen with actively-maintained software.
That's just my opinion of course, and you can take it with a grain of salt if you like :)
Why not download the source of the coreutils (http://ftp.gnu.org/gnu/coreutils/) and take a look at tools like yes? Less than 100 lines of C code and a fully functional, useful and really basic piece of GNU software.
GNU Hello is probably the smallest, simplest GNU program and is easy to understand.
I know sometimes it's a mess to navigate through C code, especially if you're not familiarized with it. I suggest you use a tool that will help you browse through the functions, symbols, macros, etc. Then look for the main() function.
You need to familiarize yourself with the tools, of course, but you don't need to become an expert.
Learn how to use grep if you don't know it already and use it to search for the main function and everything else that interests you. You might also want to use code browsing tools like ctags or cscope which can also integrate with vim and emacs or use an IDE if you like that better.
I suggest using ctags or cscope for browsing. You can use them with vim/emacs. They are widely used in the open-source world.
They should be in the repository of every major linux distribution.
Making sense of some code which uses a lot of macros, utility functions, etc, can be hard. To better browse the code of a random C or C++ software, I suggest this approach, which is what I generally use:
Install Qt development tools and Qt Creator
Download the sources you want to inspect, and set them up for compilation (usually just ./configure for GNU stuff).
Run qmake -project in the root of the source directory, to generate Qt .pro file for Qt Creator.
Open the .pro file in Qt Creator (do not use shadow build, when it asks).
Just to be safe, in Qt Creator Projects view, remove the default build steps. The .pro file is just for navigation inside Qt Creator.
Optional: set up custom build and run steps, if you want to build and run/debug under Qt Creator. Not needed for navigation only.
Use Qt Creator to browse the code. Note especially the locator (kb shortcut Ctrl+K) to find stuff by name, and "follow symbol under cursor" (kb shortcut F2), and "find usages" (kb shortcut Ctrl-Shift-U).
I had to take a look at "sed" just to see what the problem was; it shouldn't be that big. I looked and I see what the issue is, and I feel like Charleton Heston catching first sight of a broken statue on the beach. All of what I'm about to describe for "sed" might also apply to "tar". But I haven't looked at it (yet).
A lot of GNU code got seriously grunged up - to the point of unmaintainable morbid legacy - for reasons I don't know. I don't know exactly when it happened, maybe late 1990's or early 2000's, but it was like someone flipped a switch and suddenly, all the nice modular mostly self-contained code widgets got massively grunged with all sorts of extraneous entanglements having little or no connection to what the application itself was trying to do.
In your case, "sed": an entire library got (needlessly) dragged in with the application. This was the case at least as early as version 4.2 (the last version predating your query), probably before that - I'd have to check.
Another thing that got grunged up was the build system (again) to the point of unmaintainability.
So, you're really talking about legacy rescue here.
My advice ... which is generic for any codebase that's been around a long time ... is to dig as deep as you can and go back to its earliest forms first; and to branch out and look at other "sed"'s - like those in the UNIX archive.
https://www.tuhs.org/Archive/
or in the BSD archive:
https://github.com/freebsd
https://github.com/weiss/original-bsd
(the second one goes deeper into early BSD in its earlier commits.)
Many of the "sed"'s on the GNU page - but not all of them - may be found under "Downloads" as a link "mirrors" on the GNU sed page:
https://www.gnu.org/software/sed/
Version 1.18 is still intact. Version 1.17 is implicitly intact, since there is a 1.17 to 1.18 diff present there. Neither version has all the extra stuff piled on top of it. It's more representative of what GNU software looked like, before before becoming knotted up with all the entanglements.
It's actually pretty small - only 8863 lines for the *.c and *.h files, in all. Start with that.
For me the process of analysis of any codebase is destructive of the original and always entails a massive amount of refactoring and re-engineering; and simplification coming from just writing it better and more natively, while yet keeping or increasing its functionality. Almost always, it is written by people who only have a few years' experience (by which I mean: less than 20 years, for instance) and have thus not acquired full-fledged native fluency in the language, nor the breadth of background to be able to program well.
For this, if you do the same, it's strongly advised that you have some of test suite already in place or added. There's one in the version 4.2 software, for instance, though it may be stress-testing new capabilities added between 1.18 and 4.2. Just be aware of that. (So, it might require reducing the test suite to fit 1.18.) Every change you make has to be validated by whatever tests you have in your suite.
You need to have native fluency in the language ... or else the willingness and ability to acquire it by carrying out the exercise and others like it. If you don't have enough years behind you, you're going to hit a soft wall. The deeper you go, the harder it might be to move forward. That's an indication that you're not experienced enough yet, and that you don't have enough breadth. So, this exercise then becomes part of your learning experience, and you'll just have to plod through.
Because of how early the first versions date from, you will have to do some rewriting anyhow, just to bring it up to standard. Later versions can be used as a guide, for this process. At a bare minimum, it should be brought up to C99, as this is virtually mandated as part of POSIX. In other words, you should at least be at least as far up to date as the present century!
Just the challenge of getting it to be functional will be exercise enough. You'll learn a lot of what's in it, just by doing that. The process of getting it to be operational is establishing a "baseline". Once you do that, you have your own version, and you can start with the "analysis".
Once a baseline is established, then you can proceed full throttle forward with refactoring and re-engineering. The test suite helps to provide cover against stumbles and inserted errors. You should keep all the versions that you have (re)made in a local repository so that you can jump back to earlier ones, in case you need to track down the sudden emergence of test failures or other bugs. Some bugs, you may find, were rooted all the way back in the beginning (thus: the discovery of hidden bugs).
After you have the baseline (re)written to your satisfaction, then you can proceed to layer in the subsequent versions. On GNU's archive, 1.18 jumps straight to 2.05. You'll have to make a "diff" between the two to see where all the changes were, and then graft them into your version of 1.18 to get your version of 2.05. This will help you better understand both the issues that the changes made addressed, and what changes were made.
At some point you're going to hit GNU's Grunge Wall. Version 2.05 jumped straight to 3.01 in GNU's historical archive. Some entanglements started slipping in with version 3.01. So, it's a soft wall we have here. But there's also an early test suite with 3.01, which you should use with 1.18, instead of 4.2's test suite.
When you hit the Grunge Wall, you'll see directly what the entanglements were, and you'll have to decide whether to go along for the ride or cast them aside. I can't tell you which direction is the rabbit hole, except that SED has been perfectly fine for a long time, most or all of it is what is listed in and mandated by the POSIX standard (even the current one), and what's there before version 3 serves that end.
I ran diffs. Between 2.05 and 3.01, the diff file is 5000 lines. Ok. That's (mostly) fine and is natural for code that's in development, but some of that may be coming from the soft Grunge Wall. Running a diff on 3.01 versus 4.2 yields a diff file that over 60000 lines. You need only ask yourself: how can a program that's under 10000 lines - that abides by an international standard (POSIX) - be producing 60000 lines of differences? The answer is: that's what we call bloat. So, between 3.01 and 4.2, you're witnessing a problem that is very common to code bases: the rise of bloat.
So, that pretty much tells you which direction ("go along for the ride" versus "cast it aside") is the rabbit hole. I'd probably just stick with 3.01, and do a cursory review of the differences between 3.01 and 4.2 and of the change logs to get an overview of what the changes were, and just leave it at that, except maybe to find a different way to write in what they thought was necessary to change, if the reason for it was valid.
I've done legacy rescue before, before the term "legacy" was even in most people's vocabulary and am quick to recognize the hallmark signs of it. This is the kind of process one might go through.
We've seen it happen with some large codebases already. In effect, the superseding of X11 by Wayland was a massive exercise in legacy rescue. It's also possible that the ongoing superseding of GNU's gcc by clang may be considered instance of that.

What’s the easiest way to grab a web page in C ? (via https)

Almost the same question as this one here:
What's the easiest way to grab a web page in C?
however the conditions have changed and I need to connect via https, this is a bit more tricky, anyone got any snippets?
I am on a qnx platform, building and compiling additional libraries and rolling it out onto our product is very, very hard given the contraints. So things like libcurl are not possible.
Results:
It turns out I had to install libcurl on QNX after all. This involved installing perl and openSSL to build libcurl, but once that was built it was good to go. This was the least desirable option but it ended up being worth it.
libcurl should be able to handle anything you need to do.
If you're not able to use a library, then I guess you're either forced to cheat, as in "call out to a shell or some other environment that already has this capability". I'm not very familiar with QNX or the environments where it's typically run, not enough to dicount this possibility on my own anyway.
By the way, before skipping this: libcurl is known to build on QNX, so try that before even reading further.
Failing that, taking the question literally, I guess you need to implement the relevant parts of the HTTP protocol yourself. Since you now need secure access too, you're in a world of hurt. You just don't want to implement that type of code on your own, it is a lot of work, many many wheels to re-invent.
At the very least, I'd recommend taking a hard look around to see if any of the things you need to do this are already implemented. This page implies that OpenSSH is available for the QNX platform, which is encouraging.
I was away when you posted this followup question.
I've now posted an SSL-capable example program at http://pastebin.com/f1cd08b33
This needs to be linked against OpenSSL (-lssl) but doesn't need libcurl at all.

Tool to determine symbol origin in C

I'm looking for a tool that, given a bit of C, will tell you what symbols (types, precompiler definitions, functions, etc) are used from a given header file. I'm doing a port of a large driver from Solaris to Windows and figuring out where things are coming from is getting to be difficult, so this would be a huge help. Any ideas?
Edit: Not an absolute requirement, but tools that work on Windows would be a plus.
Edit #2: To clarify what I'm trying to do, I have a codebase I'm trying to port, which brings in a large number of headers. What I'd like is a tool that, given foo.c, will tell me which symbols it uses from bar.h.
I like KScope, which copes with very large projects.
KScope http://img110.imageshack.us/img110/4605/99101zd3.png
I use on both Linux and Windows :
gvim + ctags + cscope.
Same environment will work on solaris as well, but this is of course force you to use vim as editor, i pretty sure that emacs can work with both ctags and cscope as well.
You might want give a try to vim, it's a bit hard at first, but soon you can't work another way. The most efficient editor (IMHO).
Comment replay:
Look into the cscope man:
...
Find functions called by this function:
Find functions calling this function:
...
I think it's exactly what are you looking for ... Please clarify if not.
Comment replay 2:
ok, now i understand you. The tools i suggested can help you understand code flow, and find there certain symbol is defined, but not what are you looking for.
Not what you asking for but since we are talking i have some experience with porting and drivers (feel free to ignore)
It seems like compiler is good enough for your task. You just starting with original file and let compiler find what missing part, it will be a lot of empty stubs and you will get you code compiled.
At least for beginning i suggest you to create a lot of stubs and modifying original code as less as possible, later on once you get it working you can optimize.
It's might be more complex depending on the type of driver your are porting (I'm assuming kernel driver), the Windows and Solaris subsystems are not so alike. We do have a driver working on both solaris and windows, but it was designed to be multi platform from the beginning.
emacs and etags.
And I leverage make to run the tag indexing for me---that way I can index a large project with one command. I've been thinking about building a master index and separate module indecies, but haven't gotten around to implementing this yet...
#Ilya: Would pistols at dawn be acceptable?
Try doxygen, it can produce graphs and/or HTML and highly customizable

Resources