What are the recommended XML parsers for parsing a TMX file (an XML-based tilemap) in C?
What are the pros and cons of each? I would like an efficient one, because it will be running on an embedded system.
We used libxml on an embedded product a while back. It may be right for you.
At a high level, I think you should be looking at an event-based parser rather than a DOM-based one. DOM-based parsers take up a lot of memory building the XML tree.
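For a rough idea of what the event-based approach looks like with libxml2, here is a minimal SAX skeleton. This is only a sketch: the "layer" element name is a placeholder for whatever the TMX schema actually defines, and error handling is omitted.

    /* Minimal libxml2 SAX sketch -- build with `xml2-config --cflags --libs`.
     * "layer" is a placeholder element name; adapt it to the TMX schema. */
    #include <stdio.h>
    #include <string.h>
    #include <libxml/parser.h>

    static void on_start(void *ctx, const xmlChar *name, const xmlChar **attrs)
    {
        (void)ctx;
        if (strcmp((const char *)name, "layer") == 0 && attrs) {
            for (int i = 0; attrs[i] && attrs[i + 1]; i += 2)
                printf("layer attr %s=%s\n",
                       (const char *)attrs[i], (const char *)attrs[i + 1]);
        }
    }

    static void on_end(void *ctx, const xmlChar *name)
    {
        (void)ctx; (void)name;  /* pop parser state here if you keep a stack */
    }

    int main(int argc, char **argv)
    {
        xmlSAXHandler sax;
        memset(&sax, 0, sizeof(sax));
        sax.startElement = on_start;
        sax.endElement = on_end;

        if (argc < 2) return 1;
        /* Streams the document; no DOM tree is built, so memory stays flat. */
        int rc = xmlSAXUserParseFile(&sax, NULL, argv[1]);
        xmlCleanupParser();
        return rc;
    }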
Here's an answer from a similar question. The top suggestion in that case looks to be one of the earliest XML parsers: Expat.
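For reference, a bare-bones Expat skeleton looks roughly like this; the handler bodies are placeholders and only the Expat calls themselves are real API.

    /* Minimal Expat sketch -- link with -lexpat. Reads XML from stdin. */
    #include <stdio.h>
    #include <expat.h>

    static void XMLCALL on_start(void *data, const XML_Char *name, const XML_Char **attrs)
    {
        (void)data; (void)attrs;
        printf("<%s>\n", name);
    }

    static void XMLCALL on_end(void *data, const XML_Char *name)
    {
        (void)data;
        printf("</%s>\n", name);
    }

    int main(void)
    {
        char buf[4096];
        XML_Parser p = XML_ParserCreate(NULL);
        XML_SetElementHandler(p, on_start, on_end);

        int done = 0;
        while (!done) {
            size_t n = fread(buf, 1, sizeof(buf), stdin);
            done = n < sizeof(buf);              /* short read => end of input */
            if (XML_Parse(p, buf, (int)n, done) == XML_STATUS_ERROR) {
                fprintf(stderr, "parse error at line %lu\n",
                        (unsigned long)XML_GetCurrentLineNumber(p));
                break;
            }
        }
        XML_ParserFree(p);
        return 0;
    }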
I have written sxmlc as a lightweight, potentially embeddable XML parser in C. See my answer on a related question (also linked by @PPC-Coder).
You can use it as DOM or SAX (to save memory), to parse either files or memory buffers.
I would like to store a computer program in a database instead of a number of text files. It should contain the structure and all objects, methods, dependencies etc. of the program. I do not want to store a specific language in the database but some kind of "meta" programming language. In a second step I would like to transform/export this structure in the database into either source code of a classic language (C#, Java, etc.) or compile it directly for CLR/JVM.
I think I am not the first person with this idea. I searched the internet and I think what I am looking for is called "source code in a database (SCID)" - unfortunately I could not find an implementation of this idea.
So my question is:
Is there any program that stores "meta" program code inside a database and lets you generate traditional text source code from it that can be compiled/executed?
Short remarks:
- It can also be a noSQL database
- I currently don't care how the program is imported/entered into the database
It sounds like you're looking for some kind of common markup language that adequately describes the common semantics of each target language - e.g. objects, functions, inputs, return values, etc.
This is less about storing in a database, and more about having a single (I imagine, XML-like) structure that can subsequently be parsed and eval'd by the target language to produce native source/bytecode. If there was such a thing, storing it in a database would be trivial -- that's not the hard part. Even a key/value database could handle that.
The hard part will be finding something that can abstract away the nuances of multiple languages and attempt to describe them in a common format.
Similar questions have already been asked, without satisfying solutions.
It may be that you don't need the full source, but instead just a description of the runtime data-- formats like XML and JSON are intended exactly for this purpose and provide a simplified description of Objects that can be parsed and mapped to native equivalents, with your source code built around the dynamic parsing of that data.
It may be possible to go further in certain languages. For example, if your language of choice converts to bytecode first, you might technically be able to store the binary bytecode in a BLOB and then run it directly. Languages that offer reflection and dynamic evaluation can probably handle this -- then your DB is simply a wrapper for storing that data on compilation, and retrieving it prior to running it. That'd depend on your target language and how compilation is handled.
Of course, if you're only working with interpreted languages, you can simply store the full source and eval it (in whatever manner is preferred by the target language).
If you give more info on your intended use case, I'm sure you'll get some decent suggestions on how to handle it without having to invent a sourcecode Babelfish.
I am currently using the SAX interface of the libxml library to parse a large number (around 60,000) of XML documents, each less than 1 MB in size. I chose SAX because I thought it would be the most efficient. Would there be much of a difference in performance in this use case compared with, say, a DOM parser?
Also, in my current approach I have an enum with a large number of states which I use in a switch statement in my startElement/endElement handlers. The number of states is growing quite large and becoming unmanageable. Is there a better way to handle this problem in libxml? For example, I've noticed some Java libraries allow you to create multiple instances of parsers so when you enter a certain element you can delegate to another parser for that particular element.
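For illustration, the sort of delegation I have in mind would look roughly like this. The struct and function names are hypothetical; only the SAX callback signatures come from libxml.

    /* Hypothetical sketch: instead of one enum/switch, keep a stack of
     * "sub-parsers" in the SAX user data and forward each event to the top. */
    #include <libxml/parser.h>

    typedef struct sub_parser {
        void (*start)(struct sub_parser *self, const xmlChar *name, const xmlChar **attrs);
        void (*end)(struct sub_parser *self, const xmlChar *name);
        struct sub_parser *parent;   /* pop back to this when the element closes */
    } sub_parser;

    typedef struct {
        sub_parser *top;             /* current delegate */
    } parse_ctx;                     /* passed as user data to xmlSAXUserParseFile */

    static void on_start(void *ctx, const xmlChar *name, const xmlChar **attrs)
    {
        parse_ctx *pc = ctx;
        /* A delegate pushes a new sub_parser onto pc->top when it sees an
         * element it wants handled by a specialised parser. */
        if (pc->top && pc->top->start)
            pc->top->start(pc->top, name, attrs);
    }

    static void on_end(void *ctx, const xmlChar *name)
    {
        parse_ctx *pc = ctx;
        if (pc->top && pc->top->end)
            pc->top->end(pc->top, name);
    }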
When you say "efficient", I guess you are talking about machine efficiency? But programmer efficiency is much more important, and as you've discovered, writing SAX applications to process complex XML requires a lot of complex code that is hard to develop and hard to debug.
You haven't said what the output of your processing should be. By default, I would start by writing it in the most programmer-efficient language available, typically XQuery or XSLT, and only resort to a lower-level language if you can't achieve the performance requirements that way.
We have millions of plain text documents containing various data structures that we extracted from PDFs. The text is printed line by line, so all formatting is lost (the tools we tried for preserving the layout just mangled it). We need to extract the fields and their values from these documents, but there is some variation in the structure of the files (a stray new line here and there, and noise on some sheets, so spellings are incorrect).
I was thinking we would create some sort of template structure holding the coordinates (line number, word position) of keywords and values, and use that information to locate and collect the keyword values, with various algorithms to make up for the inconsistent formatting.
Is there any standard way of doing this, or any links that might help? Any other ideas?
The noise can be corrected or ignored by using fuzzy text matching tools like agrep: http://www.tgries.de/agrep/
However, the problem with extra new-lines will remain.
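If you end up rolling your own instead of shelling out to agrep, a plain edit-distance check against each expected keyword gets you most of the way. A rough sketch in C (the threshold of two edits is a guess you would tune against real data):

    /* Rough sketch: classic Levenshtein distance, used to decide whether a
     * noisy token is close enough to an expected keyword. */
    #include <string.h>

    static int min3(int a, int b, int c)
    {
        return a < b ? (a < c ? a : c) : (b < c ? b : c);
    }

    static int edit_distance(const char *a, const char *b)
    {
        int la = (int)strlen(a), lb = (int)strlen(b);
        int row[256], prev, tmp;
        if (lb >= 256) return lb;        /* keep the sketch simple */

        for (int j = 0; j <= lb; j++) row[j] = j;
        for (int i = 1; i <= la; i++) {
            prev = row[0];
            row[0] = i;
            for (int j = 1; j <= lb; j++) {
                tmp = row[j];
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                row[j] = min3(row[j] + 1, row[j - 1] + 1, prev + cost);
                prev = tmp;
            }
        }
        return row[lb];
    }

    /* A token matches a keyword if it is within two edits of it. */
    static int fuzzy_match(const char *token, const char *keyword)
    {
        return edit_distance(token, keyword) <= 2;
    }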
One technique that I would suggest is to limit error propagation in a similar way to how compilers do it. For example, you try to match your template or a pattern and you can't. Later on in the text there is a sure match, but it might be part of the currently unmatched pattern.
In that case, the sure match should be accepted and the chunk of text that was left unmatched should be set aside for later processing. This will enable you to skip errors that are too hard to parse.
Larry Wall's Perl is your friend here. This is precisely the sort of problem domain at which it excels.
Sed is OK, but for this sort of thing, Perl is the bee's knees.
While I second the recommendations for the Unix command-line and for Perl, a higher-level tool that may help is Google Refine. It is meant to handle messy real-world data.
I would recommend using graph regular expressions here, with very weak rules and a final acceptance predicate. That way you can write fuzzy matching on the token level, then on the line level, and so on.
I suggest the Talend data integration tool. It is open source (i.e. FREE!). It is built on Java and you can customize your data integration project any way you like by modifying the underlying Java code.
I used it and found it very helpful on low-budget, highly complex data integration projects. Here's the link to their website: Talend
Good luck.
I have been assigned a project to create a new binary file format for storing coordinate information coming out of 3D CAD/CAM software. I would highly appreciate it if you could point out some articles or books on designing a new binary file format. Thanks for your time :)
I would start by taking a look at other similar file formats on wotsit.org. The site catalogues many different file formats and contains links to their specifications.
By looking at other file formats you'll get ideas about how best to format and present information about your specification.
There's a universal binary (and compact) notation called ASN.1. It's used widely and there are books about it available. ASN.1 can be compared to XML, but on some lower (more primitive yet more flexible) level than XML. And XML, especially binary XML mentioned above, would be of great help to you too.
Also, if you have more than just one sequence of data to hold in your file, take a look at Solid File System as a container for several data streams in one file.
If I had the same assignment, I would inspect something that already exists, like .OBJ, and then try to implement something similar, probably with minor changes.
Short answer: Don't. Use XML or a text format instead, for readability, extensibility and portability.
Longer answer: CAD/CAM has loads of 'legacy' formats around. I'd look to use one of those (possibly extending it if necessary). And if there's nothing suitable, and XML is considered too bloated and slow, look at binary XML formats instead.
I think that what you really need is to figure out what data you have to save. Then you load it into memory and serialize that memory.
Here is a tutorial on serialization in C++. This page also addresses many issues with saving data.
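To make that concrete, here is a sketch of what "serialize that memory" could look like for coordinate data: a small header (magic number, version, record count) followed by fixed-size records. The layout is invented for illustration and is not an existing CAD format; a real design would also pin down byte order and not rely on struct padding being the same on every platform.

    /* Example layout only -- not an existing CAD format. */
    #include <stdio.h>
    #include <stdint.h>

    #define COORD_MAGIC   0x44524F43u   /* "CORD" */
    #define COORD_VERSION 1u

    typedef struct {
        uint32_t magic;
        uint32_t version;
        uint64_t point_count;
    } coord_header;

    typedef struct {
        double x, y, z;
    } point3d;

    static int write_points(const char *path, const point3d *pts, uint64_t n)
    {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;

        coord_header hdr = { COORD_MAGIC, COORD_VERSION, n };
        int ok = fwrite(&hdr, sizeof hdr, 1, f) == 1 &&
                 fwrite(pts, sizeof *pts, (size_t)n, f) == (size_t)n;
        return (fclose(f) == 0 && ok) ? 0 : -1;
    }

A reader then checks the magic number and version first, and refuses (or adapts) if the version is newer than it understands.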
We are evaluating technologies that we'll use to store the data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
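As a rough illustration (the dataset and member names are invented): if the records are stored as an HDF5 compound type, a reader can declare a memory type containing only the members it knows about, and H5Dread matches compound members by name during conversion, so fields added later are simply skipped.

    /* Sketch with made-up names: a reader that only knows the "id" and "x"
     * members of a compound dataset; newer members are ignored. */
    #include <hdf5.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int id; double x; } known_fields;

    int main(void)
    {
        hid_t file = H5Fopen("analysis.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/symbols", H5P_DEFAULT);

        /* Memory type describes only what this reader understands. */
        hid_t mtype = H5Tcreate(H5T_COMPOUND, sizeof(known_fields));
        H5Tinsert(mtype, "id", HOFFSET(known_fields, id), H5T_NATIVE_INT);
        H5Tinsert(mtype, "x",  HOFFSET(known_fields, x),  H5T_NATIVE_DOUBLE);

        hid_t fspace = H5Dget_space(dset);
        hssize_t n = H5Sget_simple_extent_npoints(fspace);
        known_fields *recs = malloc((size_t)n * sizeof *recs);

        H5Dread(dset, mtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, recs);
        for (hssize_t i = 0; i < n; i++)
            printf("id=%d x=%f\n", recs[i].id, recs[i].x);

        free(recs);
        H5Sclose(fspace);
        H5Tclose(mtype);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }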
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical, i.e. a unique path describes each one's location (in filesystems without hard links, at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute. (Or anywhere else as a string; you could have lookup tables galore if you wanted.)
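For instance, a rough sketch of the path-as-string-attribute idea in the C API (the group and attribute names are invented):

    /* Sketch: record a scope's parent by storing the parent's absolute path
     * as a string attribute on the child group. All names are made up. */
    #include <hdf5.h>
    #include <string.h>

    int main(void)
    {
        hid_t file   = H5Fcreate("scopes.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t parent = H5Gcreate2(file, "/globals", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hid_t child  = H5Gcreate2(file, "/globals/foo", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        const char *parent_path = "/globals";
        hid_t strtype = H5Tcopy(H5T_C_S1);
        H5Tset_size(strtype, strlen(parent_path) + 1);

        hid_t space = H5Screate(H5S_SCALAR);
        hid_t attr  = H5Acreate2(child, "parent_scope", strtype, space,
                                 H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, strtype, parent_path);

        /* A reader follows the "pointer" by reading the attribute and then
         * calling H5Gopen2(file, parent_path, H5P_DEFAULT). */
        H5Aclose(attr);
        H5Sclose(space);
        H5Tclose(strtype);
        H5Gclose(child);
        H5Gclose(parent);
        H5Fclose(file);
        return 0;
    }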
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know of a project that used to write to both an Oracle database and HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited to the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.