Extract information from HTML document with C

In my quest to learn C (Plain C, not C#, nor C++. I have my reasons.), I have come across the need to extract some information from an HTML document, fetched from a URL. Namely, I want all href attributes from the links residing in a certain unordered list on the page, in an array of strings. These URLs point at images I want to download and store in a zip file.
Now, I've asked a few people I know are good at C, and they have either told me off with "C is the wrong tool", or pointed me at libXML, which is apparently famous for its scarce documentation. I've also looked at libsoup and libtidy, but I can't seem to stitch the pieces together.
What approach/library should I pick? Does anyone know of some example code I could look at?
EDIT: Seeing that half the comments are telling me to use something other than C, I'll add that I'm not looking for the "right tool for the job". I'd probably use Ruby if I just wanted to get it done ASAP, simply because I'm comfortable with it. It's part of my quest to learn C, and as such, I'm looking for a pure C solution.

Since you are on a quest to learn C, I would use the standard library, specifically <stdio.h> and <string.h>:
http://www.cplusplus.com/reference/clibrary/cstdio/
http://www.cplusplus.com/reference/clibrary/cstring/
The easiest approach is to use something else to fetch the page (curl or wget, for example), write it to a local file, then pass the file name to your program. Print your output to stdout.
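As a rough sketch of what that could look like with only <stdio.h> and <string.h>: the helper names next_href and print_hrefs below are made up for illustration, and the code assumes every link is written exactly as href="..." with double quotes and no spaces around the equals sign. Real-world HTML is messier than that (it would also match an attribute such as data-href), so treat this as a learning exercise, not a parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies the value of the next href="..." attribute found at or after
 * `html` into `out` (at most out_size - 1 characters, always terminated).
 * Returns a pointer just past the closing quote so the caller can keep
 * scanning, or NULL when no further href is found.
 * Assumption: attributes look exactly like href="..." (double quotes,
 * no whitespace around '='). */
const char *next_href(const char *html, char *out, size_t out_size)
{
    static const char key[] = "href=\"";
    const char *start, *end;
    size_t len;

    assert(html != NULL && out != NULL && out_size > 0);

    start = strstr(html, key);
    if (start == NULL)
        return NULL;
    start += strlen(key);

    end = strchr(start, '"');
    if (end == NULL)
        return NULL;

    len = (size_t)(end - start);
    if (len >= out_size)
        len = out_size - 1;   /* truncate rather than overflow */
    memcpy(out, start, len);
    out[len] = '\0';
    return end + 1;
}

/* Prints every href found in `html`, one per line; returns the count. */
int print_hrefs(const char *html, FILE *dest)
{
    char url[1024];
    int count = 0;
    const char *p = html;

    while ((p = next_href(p, url, sizeof url)) != NULL) {
        fprintf(dest, "%s\n", url);
        count++;
    }
    return count;
}
```

To use it, read the saved page into a NUL-terminated buffer with fopen() and fread(), then call print_hrefs(buffer, stdout). To restrict the search to one list, you could first strstr() for the opening tag of that list and terminate the buffer at the following "</ul>".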

Related

How to store data from a .TIFF file to then modify it with my C program

I know little about this and I'm trying to keep building on it. My goal is to do image stacking with some criteria using the C language, as I came upon some cool ideas I think I should be capable of applying to my photos. My C background should be enough to understand what I may need. That being said...
So far I've learned how to read an existing .TIFF file and save it into a char array. The problem is I don't know how its data is laid out, so I can't analyze individual pixels and modify them, or build another .TIFF file from data I previously read.
I've read some things about (a so-called) libtiff.h which may be useful, but I can't find where to get it or how to install it.
Does anyone know how .TIFF file data is stored so that I can read it and apply changes to it?
Also,
Does anyone have any experience with handling image files and editing in C? Where did you learn it from?
Do you know of any place I could search for information/tutorials?
Any help will be very useful,
Thanks in advance.
You can do an enormous amount of very sophisticated processing on TIFFs, or any one of 190+ other formats, with ImageMagick without any need to understand the TIFF format or write any C. Try searching on Stack Overflow for [imagemagick].
If you want to do processing yourself, consider https://cimg.eu
Another option might be to convert your TIFFs to NetPBM which is much, much simpler to read and write in C. That would be as follows with ImageMagick:
magick INPUT.TIFF -compress none OUTPUT.PPM
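To give an idea of how simple the uncompressed NetPBM output is to read, here is a minimal sketch of a parser for the plain-text "P3" PPM variant, which is what I would expect the command above to produce. The function name parse_p3 is my own, and the code assumes the file has no '#' comment lines and has already been loaded into a NUL-terminated string.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Parses a plain-text (P3) PPM image held in `text`.
 * On success returns a malloc'd array of width*height*3 samples in
 * R,G,B order and fills in *width, *height and *maxval; the caller
 * must free() the result. Returns NULL on malformed input.
 * Assumption: the data contains no '#' comment lines. */
int *parse_p3(const char *text, int *width, int *height, int *maxval)
{
    int consumed = 0;
    int *samples;
    size_t count, i;
    const char *p;

    assert(text != NULL && width != NULL && height != NULL && maxval != NULL);

    /* Header: magic number, then width, height and maximum sample value. */
    if (sscanf(text, "P3 %d %d %d%n", width, height, maxval, &consumed) != 3)
        return NULL;
    if (*width <= 0 || *height <= 0 || *maxval <= 0)
        return NULL;

    count = (size_t)*width * (size_t)*height * 3;
    samples = malloc(count * sizeof *samples);
    if (samples == NULL)
        return NULL;

    /* Body: whitespace-separated decimal samples. */
    p = text + consumed;
    for (i = 0; i < count; i++) {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p || v < 0 || v > *maxval) {
            free(samples);
            return NULL;
        }
        samples[i] = (int)v;
        p = end;
    }
    return samples;
}
```

With the samples in a flat array, the red, green and blue values of pixel (x, y) sit at indices 3 * (y * width + x) and the two that follow, which is all you need to start stacking images. Writing a P3 file back out is just the same header followed by the numbers printed with fprintf().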

xmgrace .agr File Commands

I am attempting to write a program that will generate .agr files that can be loaded and manipulated in xmgrace. I've dissected an example file that has the kind of formatting I'm looking for, but I'm not 100% sure what every line does. A lot of the commands are self-explanatory for the most part, but is there a guide somewhere I can use to reference some of the more obscure lines like #reference date 0, #default sformat "%.8g", #r0 off, etc.?
I've looked around the grace website in both the user and developer sections as well as googling individual lines without much luck. All I'm looking for is basically a man page of xmgrace .agr files. The more low-level details, the better.
Any help would be appreciated!
I'm sure that you have already looked through all of the official documentation for Grace/xmgrace. This documentation doesn't give much information about the internals of the .agr files that xmgrace creates.
I have found in the past that creating your own files and studying them in a text editor is a good way to learn what each line does, but as you said it is not always possible to decipher everything.
A project that is doing something similar to what you describe is pygrace.
Maybe if you look at the pygrace source code it will give you some further clues to fill in the gaps in your existing knowledge.

How to write data in a PostgreSQL C extension?

I am writing an extension function in C for PostgreSQL.
I can find lots of examples online, but nothing that explicitly shows how to actually write data to a table in an extension function.
Where do I need to look to find the right functionality/documentation for writing a record to an existing table as a C extension?
I should've googled a bit longer before posting.
It seems that SPI (the Server Programming Interface) fits my needs exactly:
http://www.postgresql.org/docs/current/static/spi.html

Standard (or convenient) method to read and write tabular data to a text file in c

This might sound rather awkward, but I want to ask if there is a commonly practiced way of storing tabular data in a text file to be read and written in C.
In Python, for example, you can load a full text file into an array with f.readlines(), then go through all the lines and split each one on a specific character or sequence of characters (a delimiter).
How do you approach this problem in C?
Pretty much the same way you would in any other language. Pick a field separator (e.g., a tab character), open the text file for reading, and parse each line.
Of course, in C it will never be as easy as it is in Python, but approaches are similar.
Whoa. I am a bit baffled by the other answers which make me feel like I'm on Mainframes.stackexchange.com instead of stackoverflow.com
Why don't you pick a modern data format like JSON or XML and follow best practices for the data format of your choice?
If you want a good JSON reader/writer for C, I've used Jansson, and it's very easy and fast.
If you want a good XML reader/writer for C, I've used miniXML and it's also easy and fast. It also has SAX *and* DOM support, depending on how you want to read in the XML.
Obviously there are a wealth of other libraries available as well.
Please don't give the next guy to come along and support your program some wacky custom file format to deal with.
I find getline() and strtok() to be quite convenient (getline was a GNU extension, standardized in POSIX.1-2008).
There's a handful of mechanisms, but there's a reason why scripting languages have become so popular over the last twenty years -- some of the tasks that seem simple in scripting languages are ponderous in C.
You could use flex and bison to write a parser for your tables. This really only works if the format is very well defined and "static". They're amazing tools that can do more than you might suspect, but it is very heavy machinery for what could be done simply with a split() in a scripting language.
You could read individual fields using getdelim(3). However, this was only standardized with POSIX.1-2008, so it is far from ubiquitous. (Every Linux machine with glibc should have it.)
You could read lines with fgets(3) and discover the split locations using strchr(3).
You could read lines with fgets(3) and use strtok(3) to tokenize strings.
You can use scanf(3) to perform input and scanning in one go; it seems from the questions here that scanf(3) is difficult to use correctly.
You could use a character-at-a-time parsing approach: read a character using getc(3), inspect it, do something with it, and iterate until there are no more characters.

Trying to understand the MD5 algorithm

I am trying to do something in C with MD5 (and later trying to do something with the SHA1 algorithm). My main problem is that I never really did anything complex in C, just simple stuff (nothing like pointers to pointers or structs).
I got the md5 algorithm here.
I included the files md5.c and md5.h in my C project (using codeblocks) but the only problem is that I don't really understand how to use it. I have read and re-read the code and I don't understand how I use those functions to turn 'example' into a MD5 hash.
I haven't done C programming in a while (mostly php) so I am a bit lost here.
Basically what I am asking is for some examples of usage. They are provided via the md5main.c file but I don't understand them.
Am I aiming high here? Should I stop all this and start reading the C book again or can anyone give me some pointers and see if I can figure this out.
Thanks.
I agree with Bill that you should go back to the C book if you want to really understand what you're doing. But, in an effort to help, I've modified and commented some of the code from md5main.c...
const char* testData = "12345"; // this is the data you want to hash
md5_state_t state; // this is a state object used by the MD5 lib to do "stuff"
// just treat it as a black box
md5_byte_t digest[16]; // this is where the MD5 hash will go
// initialize the state structure
md5_init(&state);
// add data to the hasher
md5_append(&state, (const md5_byte_t *)testData, strlen(testData));
// now compute the hash
md5_finish(&state, digest);
// digest will now contain a MD5 hash of the testData input
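The digest is 16 raw bytes, not the 32-character hex string you are probably used to seeing from PHP's md5(). A small sketch of the conversion follows; digest_to_hex and digest_matches are names I made up, and I'm assuming md5_byte_t is a typedef for unsigned char, so check md5.h to confirm.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the 16-byte digest as 32 lowercase hex characters plus a
 * terminating '\0' into `hex`, which must hold at least 33 bytes.
 * Assumption: the digest bytes are plain unsigned char values. */
void digest_to_hex(const unsigned char digest[16], char hex[33])
{
    size_t i;

    assert(digest != NULL && hex != NULL);

    for (i = 0; i < 16; i++)
        sprintf(hex + 2 * i, "%02x", (unsigned)digest[i]);
}

/* Returns 1 if the digest equals the given lowercase hex string, else 0. */
int digest_matches(const unsigned char digest[16], const char *expected_hex)
{
    char hex[33];

    assert(expected_hex != NULL);

    digest_to_hex(digest, hex);
    return strcmp(hex, expected_hex) == 0;
}
```

After md5_finish(), you would declare char hex[33], call digest_to_hex(digest, hex), and print it with printf("%s\n", hex).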
Hope this helps!
You should stop all this and start reading the C book again.
My experience is that when I am trying to learn a new programming language, it's not practical to try implementing a complex project at the same time. You should do simple exercises in C until you are comfortable with the language, and then tackle something like implementing MD5 or integrating an existing implementation.
By the way, reading code is a different skill from writing code, but both require that you understand the language well.
I think you picked about the worst thing to look at (by no fault of your own). Encryption and hash type algorithms are going to make the strangest use of the language possible to do the type of math they need to do quickly. They are almost guaranteed to be obfuscated and difficult to understand. Plus, you will need to get bogged down in math in order to really understand them.
If you just want a hashing algorithm, get a well-known implementation and use it as a black box. Don't try to implement it yourself; you will almost certainly introduce some cryptographic weakness into the implementation.
Edit: To be fully responsive, if you want great books (or resources) on encryption, look to Bruce Schneier. Applied Cryptography is a classic.
