Parsing XML file in C program for particular element - c

Currently I'm parsing XML in both C and C++: with the pugixml library in C++ and the libxml2 library in C.
Assume I have a root element "configuration" in the XML, and it has 4 child elements: protocolversion, servername, daqlist and device.
I can get the root element (configuration); from that root I want to move straight to a particular child (device) without stepping through the siblings one by one.
In C++ with pugixml, the following line moves directly from configuration to its child device:
doc.child("Configuration").child("device")
In C with libxml2, I currently move one child at a time: if cur is the root node (configuration), I use
cur = cur->children->next->next->next->next->next->next->next->next->next; /* to move from configuration to device */
I don't want to chain ->next->next like this.
I want a simple function to move from the current node to a particular node in C with libxml2.
Could anyone please help solve this issue?

The ->next->next... approach assumes a fixed layout of nodes (including the whitespace text nodes between elements) and is unreliable. I'd suggest that you either:
iterate the children until you find one with
child->type == XML_ELEMENT_NODE and
strcmp((char *)child->name, "device") == 0
or xmlStrcmp(child->name, BAD_CAST "device") == 0,
or use XPath to find nodes matching /configuration/device.

Related

Appending a big number of nodes to an xml tree

I'm using libxml via C, to work with xml file creation and parsing. Until recently everything worked smoothly, but a case emerged where a single tree has a subnode, lets call it S, with approximately 200,000 children. This case works surprisingly slow and I suspect the function :
xmlNewChild(/**/);
which I'm using to build the tree, has to iterate over every child of S to add one more. Since a variant that also accepts a hint (a pointer to the last added node) doesn't seem to exist, is there a better way to build the tree (maybe a batch build method)? In case such numbers are insignificant and I should search for deficiencies elsewhere, please let me know.
Yeah, rather than keeping the entire XML in memory with xmlTree, you may want to use a combination of libxml's xmlReader and xmlWriter APIs. They're both streaming, so it won't have to keep the entire document in memory and won't have any scaling problems based on the number of elements.
Examples of both xmlReader and xmlWriter can be found here:
http://www.xmlsoft.org/examples/index.html

libxml2: missing children when dumping a node with xmlNodeDump()

I'm facing an issue with libxml2 (version 2.7.8.13).
I'm attempting to dump a node while parsing an in-memory document with a xmlTextReaderPtr reader.
So upon parsing the given node, I use xmlNodeDump() to get its whole content, and then switch to the next node.
Here is how I proceed:
[...]
// get the xmlNodePtr from the text reader
node = xmlTextReaderCurrentNode(reader);
// allocate a buffer to dump into
buf = xmlBufferCreate();
// dump the node
xmlNodeDump(buf, node->doc, node, 0 /* level of indentation */, 0 /* disable formatting */);
result = strdup((char*)xmlBufferContent(buf));
This works in most cases, but sometimes the result is missing some children from the parsed node. For instance, the whole in-memory xml document contains
[...]
<aList>
<a>
<b>42</b>
<c>aaa</c>
<d/>
</a>
<a>
<b>43</b>
...
</aList>
and I get something like:
<aList>
<a>
<b>42</b>
</c>
</a>
</aList>
The result is well formed but it lacks some data! A whole bunch of children has "disappeared", even though xmlNodeDump() should recursively dump all children of the given node.
It looks like some kind of size limitation.
I guess I do something wrong, but I can't figure out what.
Thank you for your answers.
I succeeded in implementing this correctly another way, still I do not understand what happened there. Thank you for having read my question.
FYI, instead of trying to tinker with the existing parsing code based on xmlTextReader, I have just rewritten a small parsing module for my case (dump all the 1st-level siblings into separate memory chunks).
I did so by using the parsing and tree modules of libxml2, so:
get the tree from the in-memory xml document with xmlReadMemory()
get the first node with xmlDocGetRootElement()
for each sibling (with xmlNextElementSibling() ), dump its content (all children recursively) with xmlNodeDump()
Et voilà, kinda straightforward actually. Sometimes it's easier to start from scratch...
I guess there was some side effect.

Read & Process in memory XML data in a streaming manner in C

Original question below; first, an update regarding the solution, in case someone has a similar problem:
For a fast regex I found http://re2c.org/ ; for xml parsing http://expat.sourceforge.net/
Is there an xml library I can use to parse xml from memory (and not from file) in a streaming manner in c?
Currently I have:
libxml2 ; XMLReader seems to only be possible to use with a filehandle and not in-memory
rapidxml is c++ and does not seem to expose a c interface
Requirements:
I need to process the individual xml nodes without having the whole xml (400GB uncompressed, and "only" 29GB as original .bz2 file) in memory ( bzip'd file gets read in and decompressed piecewise, and I would pass those uncompressed pieces to be consumed by the xml parser )
It does not need to be very fast, but I would prefer an efficient solution
I (most probably) don't need the path of an extracted node, so it would be fine to just discard nodes as soon as they have been processed by my callback (if, contrary to what I think right now, I do need the path, I could still track it myself)
This is part of me trying to solve my own problem posted here (and no, it's not the same question): How to efficiently parse large bz2 xml file in C
Ideally I'd like to be able to feed the library a certain amount of bytes at a time and have a function called whenever a node is completed.
Thank you very much
Here's some pseudo C code (way shorter than the actual C code) for a better understanding
// extracted data gets put here
strm.next_out = buffer_ptr;
while( bytes_processed_total < filesize ) {
// extracts up to amount of data set in strm.avail_in
BZ2_bzDecompress( strm );
bytes_processed = strm.next_out - buffer_ptr;
bytes_processed_total += bytes_processed;
// here I would like to pass bytes_processed of buffer_ptr to xmlreader
}
About the data I want to parse: http://wiki.openstreetmap.org/wiki/OSM_XML
At the moment I only need certain <node ...> elements from this, which have a subnode <tag k="place" v="country|county|city|town|village"> (the '|' means "at least one of those" in this context; in the file it's of course only "country" etc. without the '|')
xmlReaderForMemory from libxml2 seems a good fit to me (but I haven't used it, so I may be wrong)
The char * buffer needs to point to a valid XML document (which can be a part of your entire XML file). This can be extracted by reading your file in chunks, as long as each piece you hand over is a valid XML fragment.
What's the structure of your XML file ? A root containing subsequent similar nodes or a fully fledged tree ?
If I had an XML like this:
<root>
<node>...</node>
<node>...</node>
<node>...</node>
</root>
I'd read starting from the opening <node> till the closing </node> and then parse it with the xmlReaderForMemory function, do what I need to do, then go on with the next <node> node.
Ofc if your <node> content is too complex/long, you may have to go some levels deeper:
<node>
<subnode>....</subnode>
<subnode>....</subnode>
<subnode>....</subnode>
<subnode>....</subnode>
</node>
And read from the file until you have an entire <subnode> element (while keeping track that you're still inside a <node>).
I know it's ugly, but it is a viable way. Or you can try to use a SAX parser (libxml2 itself provides SAX callbacks, and expat is another C implementation).
SAX parsing fires events on each node start and node end, so you can do nothing until you find your nodes and process just them.
Another viable way can be using some external tools to filter the whole XML (XQuery or XPath processors) in order to extract just your interesting nodes from the whole file, obtain a smaller doc and then work on it.
EDIT: Zorba was a good XQuery framework, with command line preprocessor, may be a good place to look at
EDIT2: well, given these dimensions, an alternative solution can be to manage the file as plain text: read and uncompress it in chunks, then match something like
<yourNode>.*</yourNode>
with a regexp.
If you're on a Linux/Unix you should have POSIX regexp library. Check
this question on S.O. for further insights.
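A minimal POSIX regexp sketch of that text-matching fallback (note that .* is greedy, so with several nodes in one chunk it would span from the first opening tag to the last closing tag; matching XML with regexps is fragile in general):

```c
#include <regex.h>
#include <string.h>

/* Locate the first <yourNode>...</yourNode> span inside `chunk`.
   On success, *start and *len describe the matched slice.
   Returns 1 if found, 0 otherwise. */
static int find_your_node(const char *chunk, const char **start, size_t *len)
{
    regex_t re;
    regmatch_t m;
    int found = 0;

    if (regcomp(&re, "<yourNode>.*</yourNode>", REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, chunk, 1, &m, 0) == 0) {
        *start = chunk + m.rm_so;
        *len = (size_t)(m.rm_eo - m.rm_so);
        found = 1;
    }
    regfree(&re);
    return found;
}
```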

JCR create single file, link from different nodes

I am trying to create a single file node for an image with name (say A.gif). Now, I want to re-use the file across multiple nodes. Is there a way to do this?
As a workaround, I am re-creating file nodes for different paths in my repository, but this results in duplication of files.
If you're using jackrabbit, copying a file node (or rather copying a binary property) is cheap if the DataStore is active.
That component makes sure "large" binary properties (with a configurable size threshold IIRC) are stored once only, based on a digest of their content.
So you can in this case copy the same file node many times without having to worry about disk space.
I'm not sure I understand your problem. However, what I would do is store the file in a single location and then reference it using a path property from multiple locations.
Assume that you have the following node structure:
- content
  - articles
    - article1
    - article2
  - images
    - image1
You can set on each of the articles a property named imagePath which points to the path of the image to display, in this case /content/images/image1.
The nt:linkedFile type was made for just this kind of use.
And just for completeness, don't forget references.
// make the image node referenceable so other nodes can point at it
Node imageNode = rootNode.addNode("imageNode");
imageNode.addMixin(JcrConstants.MIX_REFERENCEABLE);
// two separate nodes referencing the same image
Node node1 = rootNode.addNode("1");
node1.setProperty("image", imageNode);
Node node2 = rootNode.addNode("2");
node2.setProperty("image", imageNode);
session.save();
// walk every property that references imageNode
PropertyIterator references = imageNode.getReferences();
while (references.hasNext()) {
    Property reference = references.nextProperty();
    System.out.println(reference.getPath());
}

Efficient way to get nodes X level deep

I have a 4 levels deep node structure, where the top most level is made of 1 root node.
What I want to do is get all nodes in the 4th level for which a certain property(ies) is true, for example:
get all 4th level nodes where nodePropertyX == true.
Now, I could do this with a for-each loop, and iterate all the items in the levels above, but I have the feeling it would be inefficient.
How can I do it in a better more efficient way ? Is there a way to maybe cache my dataset? (I'm returning results as a datatable) ?
Which is the preferable method: a C# control (.ascx) or a razor script (.cshtml)?
Depending on what you want to do with those nodes, you may use razor, macros and the built-in caching abilities for macros to cache the output of the macro:
Here's how to get all nodes at the 4th level from the root node having a property nodePropertyX which equals "value":
@foreach (var item in Model.AncestorOrSelf().Descendants()
    .Where("Visible")
    .Where("level=4")
    .Where("nodePropertyX == \"value\""))
{
    @item.Name
}
Place this code in a Scripting file (Section Developer, node Scripting Files), create a macro using this scripting file, and insert the macro wherever (on any template) you want to display the list of those nodes.
In order to cache the output of the macro, select the macro and set the appropriate properties (Cache Period, Cache By Page and Cache Personalized).
