Parsing HTML files in C - alternatives to libxml2 [closed]

So I want to create a web crawler in C. There are hardly any libraries to support this.
I can use libtidy to convert HTML to XHTML, and fetch the HTML files using libcurl (which has decent documentation).
My problem is parsing the HTML files and getting all the links present in them. I know libxml2
exists, but it's extremely hard to understand because there is no good documentation for its API.
Should I even do this in C, or go with another language like Java?
Or are there any good alternatives to libxml2?

Parsing HTML is basically just string manipulation,
but it's quite hard to do without an HTML or XML (for XHTML) parser.
As for the second part of the question, I wouldn't choose C for such a task, because even basic string operations are much more complex than in languages that support them natively.
I would go for a scripting language such as Python, JavaScript, PHP...
Instead of using libcurl you could invoke curl as a command-line tool.
Btw: the libcurl documentation is very good (in my opinion).
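
To make the libxml2 route concrete: its HTML parser recovers from tag soup, and a single XPath query pulls out the links, so the poorly documented surface area you actually need is small. A minimal sketch, assuming libxml2 is installed (compile with the flags from `xml2-config --cflags --libs`):

```c
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file.html\n", argv[0]);
        return 1;
    }

    /* The HTML parser recovers from malformed markup by default;
     * silence the error spew that real-world pages tend to produce. */
    htmlDocPtr doc = htmlReadFile(argv[1], NULL,
                                  HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL)
        return 1;

    /* Select every href attribute of every anchor element. */
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr res =
        xmlXPathEvalExpression((const xmlChar *)"//a/@href", ctx);

    if (res != NULL) {
        xmlNodeSetPtr links = res->nodesetval;
        for (int i = 0; links != NULL && i < links->nodeNr; i++) {
            xmlChar *href = xmlNodeGetContent(links->nodeTab[i]);
            printf("%s\n", (char *)href);
            xmlFree(href);
        }
        xmlXPathFreeObject(res);
    }

    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}
```
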

Related

Is there any Dictionary Library for C? [closed]

I would like to know if there is any dictionary library for C. A dictionary library in the literal sense (nothing to do with Python dictionaries, i.e. hash maps): all the words of the English language, with tools like...
"I want to print all words that begin with C and end with Y".
I won't just google it, because I really want to know your opinion on whether any is specifically good.
Thank you!
You might want to start by looking at Aspell. While it mostly functions as a spell-checker, Aspell also has support for using multiple dictionaries at once and intelligently handling personal dictionaries when more than one Aspell process is open at once. I don't believe you need an Internet connection to use it, either.
Wiktionary might also be of help. There are many localized variations supporting different languages, and there is probably a way to request support for your language of interest if it is not already there.
There's the amazing Wordnik API, if you don't mind using the Internet for this task. The API is fairly easy and supports regex search. The method you are looking for is /words.{format}/search/{query}
It also has methods to retrieve meanings (/word.{format}/{word}/definitions), synonyms (/word.{format}/{word}/relatedWords), and many other things.
There currently are no C wrappers, although it's very easy to use the API directly with libcurl and any JSON or XML parser.
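
If the queries are as simple as the prefix/suffix example in the question, you may not need a library at all: a newline-delimited word list plus a few lines of C will do. A minimal sketch, assuming the common Unix word list at /usr/share/dict/words (the path varies by system):

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Print every word starting with 'c' and ending with 'y'
 * from a newline-delimited word list, case-insensitively. */
int main(void)
{
    /* Path is an assumption; many Unix systems ship a word list here. */
    FILE *fp = fopen("/usr/share/dict/words", "r");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    char word[128];
    while (fgets(word, sizeof word, fp) != NULL) {
        size_t len = strcspn(word, "\r\n");
        word[len] = '\0';                   /* strip the newline */
        if (len >= 2 &&
            tolower((unsigned char)word[0]) == 'c' &&
            tolower((unsigned char)word[len - 1]) == 'y')
            puts(word);
    }

    fclose(fp);
    return 0;
}
```
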

Use a parser generator or hand-code when the file to parse is written by a program? [closed]

I'm aware that this question is very generic and therefore difficult to answer.
I need to parse a text file with content similar to a configuration file. It contains descriptions of measurable signals, how to convert to physical units, comments, descriptions, etc. If this file were written by a human, I would use a parser generator such as lex/yacc or ANTLR. But since this file is written by another program, it is always correctly formatted.
Should I use a parser generator anyway, or is the fact that the file is written by another program reason to write a different kind of parser by hand?
Don't solve a problem that has already been solved.
If you have tools to do your work for you, then use the tools. So if you can use either ANTLR or lex/yacc, just go ahead and use that instead of hand-coding a lexer and a parser.
It will be much less work.
Source: The Art of Unix Programming - By Eric S. Raymond
Sounds complex enough to me; save yourself the time and trouble and use lex/yacc. It doesn't matter that the file is generated by a program; that just means you don't have to validate it when parsing, and you get that for free anyway if you use lex/yacc. Plus, changes to the format will be easier to deal with if you use a generator.
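
For scale, here is roughly what the hand-coded side of the trade-off looks like. The actual file format isn't shown in the question, so this sketch assumes a simple key = value line format with # comments; it is only meant to show how small a hand-rolled parser can stay when the input is machine-generated and always well-formed:

```c
#include <stdio.h>

/* Hand-rolled parser for a hypothetical "key = value" line format
 * with '#' comment lines. The real format in the question is not
 * shown; this only illustrates the hand-coding alternative. */
int main(void)
{
    char line[256];
    while (fgets(line, sizeof line, stdin) != NULL) {
        if (line[0] == '#' || line[0] == '\n')
            continue;                     /* skip comments and blanks */

        char key[128], value[128];
        /* Leading spaces in the format string skip whitespace;
         * the scansets stop at '=' / end of line respectively. */
        if (sscanf(line, " %127[^= \t] = %127[^\n]", key, value) == 2)
            printf("signal '%s' -> '%s'\n", key, value);
        else
            fprintf(stderr, "unparsed line: %s", line);
    }
    return 0;
}
```
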

Which programming language for single-page web scraping? [closed]

I want to build (hire someone to build) a program for Windows. The program has to save some data from a single web page, like the name of the website, a product name, and the product price, to a local database on a command (via right-click or a keyboard shortcut). Which programming language is the best choice? The number of (affordable) programmers and the possibility of adding extra functionality in the future are also important.
I found, for example, that Python, Java, Ruby, and XPath are used for this job.
Thank you.
Java, Python, and Ruby are all good choices. XPath is not a programming language; it's a query specification that lets you extract the data you want from XML or HTML. No matter which language you choose, you will also need to use XPath (all three have XPath libraries available).
Python seems to be the most popular, but the future of its libraries is also the most uncertain (nobody has bothered to port Mechanize to Python 3 yet; Beautiful Soup has died and then come back).
Java's biggest strength may be that it's already installed on most Windows machines, but it's also the only one of the three that is not a scripting language, so development time will likely be longer.
Ruby is a good choice with excellent scraping libs and plenty of programmers using it.

Is there a good tutorial for figuring out what a website is doing so your program can do the same thing? [closed]

Is there a good guide or tutorial for people who need to programmatically interact with dynamic websites? There's been a rash of Perl questions about that lately, and I haven't found a good resource to point people toward. I'm asking not because I need one but because I don't want to waste my time writing it if it already exists. Although I'm most interested in Perl, the extra tools and techniques are mostly the same.
Typically, I see these problems in people's questions:
Handling, setting, and saving cookies (see the libcurl sketch after this list)
Finding and interacting with forms
Handling JavaScript inside your user-agent
especially things like onLoad, onSubmit, and Ajax
Using HTTP sniffer tools
Using web-developer plugins in interactive browsers
Interacting with the DOM, screen scraping, etc.
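
For the first two items on that list, a library like libcurl handles most of the plumbing regardless of language. A minimal C sketch, with the URL and form fields made up purely for illustration, that persists cookies in a jar file and submits a form by POST:

```c
#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (curl == NULL)
        return 1;

    /* Hypothetical login endpoint, for illustration only. */
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/login");

    /* Persist cookies across requests and across runs:
     * read them from, and write them back to, a jar file. */
    curl_easy_setopt(curl, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_easy_setopt(curl, CURLOPT_COOKIEJAR,  "cookies.txt");

    /* Submit a form by POSTing url-encoded fields (made-up names). */
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "user=alice&pass=secret");

    /* Follow the redirect that a successful login typically issues. */
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "curl: %s\n", curl_easy_strerror(rc));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```

What libcurl cannot do is execute the page's JavaScript; for the onLoad/onSubmit/Ajax item you need either a scriptable browser or hand-replication of the requests the scripts make, which is where the sniffer tools and developer plugins come in.
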
If there's no good tutorial, I'll add it to my list of things to do (unless someone else wants to do it). Along the way, if you don't have a suggestion for an existing tutorial, please suggest the things that you think should be in a new one, including links, your favorite tools, and your own user-agent development experiences. I don't care about the particular language you use.
The best I've seen is a Defcon presentation video.
Look at Perl's library of libraries, CPAN. Some of its HTML parsing libraries are made for talking to dynamic websites.
Like:
http://metacpan.org/pod/HTML::DOM
But do you want a web browser enhanced by Perl, or a standalone Perl app?

Best XML parser for C [closed]

We have to add a new interface to our existing C application. The new interface system sends requests to our C application, and the responses to the interface will be XML files. We need to find a way to read and write XML files. There seem to be many mapping tools available for Java and C++, but I did not find one for C.
Please let me know if there is one suitable for C. We would be okay with a commercial API as well.
Thanks
One of the most widely used is libxml2; you can take a look at www.xmlsoft.org.
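
Since the question asks about writing XML as well as reading it: a minimal sketch of building and saving a document with libxml2 (the element names are made up for illustration):

```c
#include <libxml/tree.h>

int main(void)
{
    /* Build <response><status>ok</status></response> in memory. */
    xmlDocPtr doc   = xmlNewDoc(BAD_CAST "1.0");
    xmlNodePtr root = xmlNewNode(NULL, BAD_CAST "response");
    xmlDocSetRootElement(doc, root);
    xmlNewChild(root, NULL, BAD_CAST "status", BAD_CAST "ok");

    /* Write it out, UTF-8 encoded and indented. */
    xmlSaveFormatFileEnc("response.xml", doc, "UTF-8", 1);

    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}
```
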
It's been a while since I did anything in anger with XML in C but at the time the best offering was the Gnome XML library - libxml from www.xmlsoft.org.
Should be worth a look.
Cheers,
Dan
I've used Expat for some time now, and it's great if you need a very fast streaming parser for C. I believe there are DOM and SAX extensions if you need them.
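
For reference, a minimal sketch of Expat's streaming, callback-driven style; the XML snippet is made up for illustration (compile with -lexpat):

```c
#include <stdio.h>
#include <expat.h>

/* Called for each opening tag; userdata carries the nesting depth. */
static void XMLCALL on_start(void *userdata, const XML_Char *name,
                             const XML_Char **attrs)
{
    int *depth = userdata;
    printf("%*sstart: %s\n", *depth * 2, "", name);
    (*depth)++;
    (void)attrs; /* attribute name/value pairs, unused here */
}

static void XMLCALL on_end(void *userdata, const XML_Char *name)
{
    int *depth = userdata;
    (*depth)--;
    (void)name;
}

int main(void)
{
    const char xml[] = "<response><status>ok</status></response>";
    int depth = 0;

    XML_Parser p = XML_ParserCreate(NULL);
    XML_SetUserData(p, &depth);
    XML_SetElementHandler(p, on_start, on_end);

    /* Feed the whole buffer at once; with a real stream you would
     * call XML_Parse per chunk, passing 1 for isFinal on the last. */
    if (XML_Parse(p, xml, (int)(sizeof xml - 1), 1) == XML_STATUS_ERROR)
        fprintf(stderr, "parse error: %s\n",
                XML_ErrorString(XML_GetErrorCode(p)));

    XML_ParserFree(p);
    return 0;
}
```
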
