libxml xmlNodePtr to raw xml string? - c

Given a valid, arbitrary xmlNodePtr, I would like the string representation of that node, including the tag, attributes, and children in the same form (recursive).
FWIW, my scenario is I am using PerformXPathQuery to get a block of data from within an existing document. I need to get the results of the query, which has nested XML elements in it, as the raw string, so I can insert it into a text field.
This seems like a simple task; however, I cannot find an easy way. Writing an xmlDocPtr to a file must do this, but I cannot see a handy method that does the same thing for an arbitrary node in the tree and returns the result in memory.
I hope I am just going blind from the brown-on-brown documentation color scheme at xmlsoft.org

Is xmlNodeDump (or xmlNodeDumpOutput) what you are looking for?

Here is the code I used to dump a node to a string. It's Objective-C, so just change the output calls as needed.
xmlBufferPtr buffer = xmlBufferCreate();
int size = xmlNodeDump(buffer, myXMLDoc, myXMLNode, 0, 1);
NSLog(@"%d", size);
NSLog(@"%s", buffer->content);
Don't forget to free your buffer again (xmlBufferFree).

One way you could definitely do it is to create a new document, use xmlDocCopyNode to copy the node into it, and then serialize that document.

Related

How to encode JSON buffer in C?

I have need of some advice.
I gather data from sensors on the analogue ports and I maintain data on the readings.
I then format this data into a json style format which I then use to send it to cloud.
The values I format to JSON are held not in a string, of course, but in a character array filled with sprintf (int sprintf ( char * str, const char * format, ... );).
Here is my routine that uses this code:
void StackData() {
char buff[256];
sprintf(buff, "{\"id\":\"stat\",\"minHour\":%1i,\"maxHour\":%2i,\"minDay\":%3i,\"maxDay\":%4i,\"inHour\":%5lu,\"iinDay\":%6lu,\"inWeek\":%7lu}",
minHour, maxHour, minDay, maxDay, AmpsHour, AmpsDay, AmpsWeek);
}
I would like to see how others might do this differently. Or is there another way, using a specific library, to do this?
PS: I have successfully used coreJSON library to parse JSON input
What you have is reasonable, although an alternative might be some sort of result builder:
char buff[256] = { 0 };
jsonObjectOpen(buff);
jsonObjectInteger(buff,"minHour", minHour);
jsonObjectInteger(buff,"maxHour", maxHour);
jsonObjectClose(buff);
Basically each function appends the necessary JSON elements to the buffer. You'd need to implement a function for each data type (string, int, float) and, of course, make sure you call them in the correct order.
I don't think this is more succinct, but if you are doing it more than a few times, especially for more complex structures, you might find it more readable and maintainable.
It's entirely possible there is an existing library that will help with this type of approach, also being mindful of ensuring that the buffer space isn't exceeded during the building process.
In other languages that have type detection, this is a lot easier. I suppose you could always have a single function that takes a void pointer and a 'type' enum, but that could be more error prone for the sake of a marginally simpler API.
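To make the builder idea concrete, here is a minimal sketch that also tracks the buffer size so it cannot overrun. Everything here (json_builder, jb_open, jb_integer, jb_string, jb_close) is invented for illustration and differs slightly from the calls above because it carries a capacity. It assumes keys and string values need no JSON escaping.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical builder state; none of these names come from a real library. */
typedef struct {
    char  *buf;      /* caller-owned output buffer */
    size_t cap;      /* total capacity of buf in bytes */
    size_t len;      /* bytes written so far, excluding the NUL */
    bool   first;    /* true until the first member has been written */
    bool   overflow; /* set once an append did not fit */
} json_builder;

/* Append formatted text, refusing to write past the end of the buffer. */
static void jb_appendf(json_builder *jb, const char *fmt, ...)
{
    if (jb->overflow)
        return;
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(jb->buf + jb->len, jb->cap - jb->len, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= jb->cap - jb->len)
        jb->overflow = true;
    else
        jb->len += (size_t)n;
}

void jb_open(json_builder *jb, char *buf, size_t cap)
{
    jb->buf = buf;
    jb->cap = cap;
    jb->len = 0;
    jb->first = true;
    jb->overflow = (cap == 0);
    jb_appendf(jb, "{");
}

void jb_integer(json_builder *jb, const char *key, long value)
{
    jb_appendf(jb, "%s\"%s\":%ld", jb->first ? "" : ",", key, value);
    jb->first = false;
}

/* Assumes `value` contains nothing that needs JSON escaping. */
void jb_string(json_builder *jb, const char *key, const char *value)
{
    jb_appendf(jb, "%s\"%s\":\"%s\"", jb->first ? "" : ",", key, value);
    jb->first = false;
}

/* Returns true if the complete object fit into the buffer. */
bool jb_close(json_builder *jb)
{
    jb_appendf(jb, "}");
    return !jb->overflow;
}
```

Usage would look like jb_open(&jb, buff, sizeof buff); jb_string(&jb, "id", "stat"); jb_integer(&jb, "minHour", minHour); and so on, then check the return value of jb_close before sending the buffer.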
It might be a good idea to separate JSON object building from the encoding.
One existing JSON C library (the calls below match the Jansson API) does it the following way:
json_t *item = json_object();
json_object_set_new(item, "id", json_string("stat"));
json_object_set_new(item, "minHour", json_integer(minHour));
json_object_set_new(item, "maxHour", json_integer(maxHour));
...
// Dump to console
json_dumpf(item, stdout, JSON_INDENT(4) | JSON_SORT_KEYS);
// Dump to file
json_dumpf(item, file, JSON_COMPACT);
// Free allocated resources
json_decref(item);
The separation gives some benefits.
For example, encode formatting can be selected in one place.
And the same object can be easily encoded several ways (as in the example).

Nest xmlDoc into existing xmlTextWriter

I think I'm missing something trivial but I'm losing a bunch of time on this, so its solution may be useful to others too:
I'm working with libxml2 2.9.8 (pure C, not C++ bindings) under linux.
I have an external (non-libxml) tree structure representing an XML file and I'm trying to write into a string representation using libxml2. All is trivial and working nice traversing it and writing using xmlTextWriter API (it is a struct with simple attributes, like
typedef struct _simplifiedNode {
char *tag;
char *content;
struct _simplifiedNode *parent;
struct _simplifiedNodeList *children;
} simplifiedNode;
), except at a certain point I encounter a string node that may contain the string representation of an xml document. I can parse it using the xmlReadMemory API, but then I need to nest it (and not its escaped string representation) into the on-going writer, including namespaces and attributes.
Is there a trivial way I am missing to do this recursively having the parsed doc/root element, without introspecting every sub-element?
e.g.
I'm producing the following document using xmlTextWriter API
<Title>
TitleValue
</Title>
<Date>
2018-11-26
</Date>
<Content>
The Content node in the non-libxml tree is a leaf node with tag Content containing a string like
char *content = "<SomeXmlComplexDocument ss:someattr=\"attrval\">Somecontent</SomeXmlComplexDocument>"
What I want to achieve is, instead of having something like
<Content><SomeXmlComplexDocument> ... </Content>
after having parsed and validated the content with xmlReadMemory, to re-inject the document, obtaining
<Content>
<SomeXmlComplexDocument ss:someattr="attrval">Somecontent</SomeXmlComplexDocument>
</Content>
namespaces and attributes should be preserved.
To serialize the inner XML fragments unescaped, you can simply use xmlTextWriterWriteRaw. This won't check whether the XML is well-formed, though. If you need validation, you'll have to parse the XML fragments at some point. Depending on the content model, you might have to use xmlParseBalancedChunkMemory instead of xmlReadMemory. It should also be possible to parse the result document in one go after it was written, but you'll lose information like original line numbers.

libxml2: missing children when dumping a node with xmlNodeDump()

I'm facing an issue with libxml2 (version 2.7.8.13).
I'm attempting to dump a node while parsing an in-memory document with a xmlTextReaderPtr reader.
So upon parsing the given node, I use xmlNodeDump() to get its whole content, and then switch to the next node.
Here is how I proceed:
[...]
// get the xmlNodePtr from the text reader
node = xmlTextReaderCurrentNode(reader);
// allocate a buffer to dump into
buf = xmlBufferCreate();
// dump the node
xmlNodeDump(buf, node->doc, node, 0 /* level of indentation */, 0 /* disable formatting */);
result = strdup((char*)xmlBufferContent(buf));
This works in most cases, but sometimes the result is missing some children from the parsed node. For instance, the whole in-memory xml document contains
[...]
<aList>
<a>
<b>42</b>
<c>aaa</c>
<d/>
</a>
<a>
<b>43</b>
...
</aList>
and I get something like:
<aList>
<a>
<b>42</b>
</c>
</a>
</aList>
The result is well formed but it lacks some data! A whole bunch of children have "disappeared". xmlNodeDump() should recursively dump all children of the node.
It looks like some kind of size limitation.
I guess I do something wrong, but I can't figure out what.
Thank you for your answers.
I succeeded in implementing this correctly another way, still I do not understand what happened there. Thank you for having read my question.
FYI, instead of trying to tinker with existing parsing code based on xmlTextReader, I have just rewritten a small parsing module for my case (dump all the 1st level siblings into separate memory chunks).
I did so by using the parsing and tree modules of libxml2, so:
get the tree from the in-memory xml document with xmlReadMemory()
get the first node with xmlDocGetRootElement()
for each sibling (with xmlNextElementSibling() ), dump its content (all children recursively) with xmlNodeDump()
Et voilà, kinda straightforward actually. Sometimes it's easier to start from scratch...
I guess there was some side effect.

Read & Process in memory XML data in a streaming manner in C

Original question below, update regarding solution, if someone has a similar problem:
For a fast regex I found http://re2c.org/ ; for xml parsing http://expat.sourceforge.net/
Is there an xml library I can use to parse xml from memory (and not from file) in a streaming manner in c?
Currently I have:
libxml2 ; XMLReader seems to only be possible to use with a filehandle and not in-memory
rapidxml is c++ and does not seem to expose a c interface
Requirements:
I need to process the individual xml nodes without having the whole xml (400GB uncompressed, and "only" 29GB as original .bz2 file) in memory ( bzip'd file gets read in and decompressed piecewise, and I would pass those uncompressed pieces to be consumed by the xml parser )
It does not need to be very fast, but I would prefer an efficient solution
I (most probably) don't need the path of an extracted node, so it would be fine to just discard them as soon as they have been processed by my callback (if I would need the path contrary to what I think right now, I could then still track it myself)
This is part of me trying to solve my own problem posted here (and no, it's not the same question): How to efficiently parse large bz2 xml file in C
Ideally I'd like to be able to feed the library a certain amount of bytes at a time and have a function called whenever a node is completed.
Thank you very much
Here's some pseudo c code (way shorter than actual c code) for a better understanding
// extracted data gets put here
strm.next_out = buffer_ptr;
while( bytes_processed_total < filesize ) {
// extracts up to amount of data set in strm.avail_in
BZ2_bzDecompress( strm );
bytes_processed = strm.next_out - buffer_ptr;
bytes_processed_total += bytes_processed;
// here I would like to pass bytes_processed of buffer_ptr to xmlreader
}
About the data I want to parse: http://wiki.openstreetmap.org/wiki/OSM_XML
At the moment I only need certain <node ...> nodes from this, which have subnode <tag k="place" v="country|county|city|town|village"> (the '|' means at least one of those in this context, in the file it's of course only "country" etc without the '|')
xmlReaderForMemory from libxml2 seems like a good fit to me (but I haven't used it, so I may be wrong).
The char * buffer needs to point to a valid XML document (which can be a part of your entire XML file). You can extract this by reading your file in chunks, as long as each piece you hand over is a valid XML fragment.
What's the structure of your XML file ? A root containing subsequent similar nodes or a fully fledged tree ?
If I had an XML like this:
<root>
<node>...</node>
<node>...</node>
<node>...</node>
</root>
I'd read starting from the opening <node> till the closing </node> and then parse it with the xmlReaderForMemory function, do what I need to do, then go on with the next <node> node.
Ofc if your <node> content is too complex/long, you may have to go deep some levels:
<node>
<subnode>....</subnode>
<subnode>....</subnode>
<subnode>....</subnode>
<subnode>....</subnode>
</node>
And read from the file until you have the entire <subnode> node (but keeping track that you're in a <node>).
I know it's ugly, but it is a viable way. Or you can try to use a SAX parser (dunno if some C implementation exists).
SAX parsing fires events on each node start and node end, so you can do nothing until you find your nodes and process just them.
Another viable way can be using some external tools to filter the whole XML (XQuery or XPath processors) in order to extract just your interesting nodes from the whole file, obtain a smaller doc and then work on it.
EDIT: Zorba was a good XQuery framework, with command line preprocessor, may be a good place to look at
EDIT2: well, since you have these dimensions, one alternative solution can be to manage the file as a text file, so read and uncompress in chunks and then match something like:
<yourNode>.*</yourNode>
with regexp.
If you're on Linux/Unix you should have the POSIX regexp library. Check this question on S.O. for further insights.

Serialize Data Structures in C

I'd like a C library that can serialize my data structures to disk, and then load them again later. It should accept arbitrarily nested structures, possibly with circular references.
I presume that this tool would need a configuration file describing my data structures. The library is allowed to use code generation, although I'm fairly sure it's possible to do this without it.
Note I'm not interested in data portability. I'd like to use it as a cache, so I can rely on the environment not changing.
Thanks.
Results
Someone suggested Tpl which is an awesome library, but I believe that it does not do arbitrary object graphs, such as a tree of Nodes that each contain two other Nodes.
Another candidate is Eet, which is a project of the Enlightenment window manager. Looks interesting but, again, seems not to have the ability to serialize nested structures.
Check out tpl. From the overview:
Tpl is a library for serializing C
data. The data is stored in its
natural binary form. The API is small
and tries to stay "out of the way".
Compared to using XML, tpl is faster
and easier to use in C programs. Tpl
can serialize many C data types,
including structures.
I know you're asking for a library. If you can't find one (::boggle::, you'd think this was a solved problem!), here is an outline for a solution:
You should be able to write a code generator[1] to serialize trees/graphs without (run-time) pre-processing fairly simply.
You'll need to parse the node structure (typedef handling?), and write the included data values in a straight ahead fashion, but treat the pointers with some care.
For pointers to other objects (e.g. char *name;) which you know are singly referenced, you can serialize the target data directly.
For objects that might be multiply referenced, and for other nodes of your tree, you'll have to represent the pointer structure. Each object gets assigned a serialization number, which is what is written out in place of the pointer. Maintain a translation structure between current memory position and serialization number. On encountering a pointer, see if it is already assigned a number; if not, give it one and queue that object up for serialization.
Reading back also requires a node-#/memory-location translation step, and might be easier to do in two passes: regenerate the nodes with the node numbers in the pointer slots (bad pointer, be warned) to find out where each node gets put, then walk the structure again fixing the pointers.
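Here is a small hand-written sketch of the numbering scheme, not generated code. The node layout, the fixed MAX_NODES table and all function names are assumptions made for the example. Because the nodes are read back into a single array, every node's final address is known up front, so the pointer fix-up fits in one loop rather than two passes.

```c
#include <stdlib.h>

#define MAX_NODES 64 /* arbitrary limit chosen for this sketch */

/* In-memory node: may be shared or part of a cycle. */
typedef struct node {
    int value;
    struct node *left, *right;
} node;

/* Serialized form: pointers replaced by serialization numbers, -1 = NULL. */
typedef struct {
    int value;
    int left_id, right_id;
} node_record;

/* Translation table from memory address to serialization number. */
typedef struct {
    const node *ptr[MAX_NODES];
    int count;
} id_table;

/* Look up p; on first sight assign the next number, which also queues it.
 * Returns -1 for NULL and -2 if the table is full. */
static int id_for(id_table *t, const node *p)
{
    if (p == NULL)
        return -1;
    for (int i = 0; i < t->count; i++)
        if (t->ptr[i] == p)
            return i;
    if (t->count == MAX_NODES)
        return -2;
    t->ptr[t->count] = p;
    return t->count++;
}

/* Fill out[] (room for MAX_NODES records) and return the record count,
 * or -1 if the graph has more than MAX_NODES nodes. */
int graph_serialize(const node *root, node_record *out)
{
    id_table t = { .count = 0 };
    if (root == NULL)
        return 0;
    id_for(&t, root); /* the root always gets number 0 */
    for (int i = 0; i < t.count; i++) { /* the table doubles as the queue */
        const node *n = t.ptr[i];
        int l = id_for(&t, n->left);
        int r = id_for(&t, n->right);
        if (l == -2 || r == -2)
            return -1;
        out[i] = (node_record){ n->value, l, r };
    }
    return t.count;
}

/* Rebuild the graph in one malloc'd array; element 0 is the root.
 * Release everything with a single free() on the returned pointer. */
node *graph_deserialize(const node_record *in, int count)
{
    if (count <= 0)
        return NULL;
    node *nodes = malloc((size_t)count * sizeof *nodes);
    if (nodes == NULL)
        return NULL;
    for (int i = 0; i < count; i++) {
        nodes[i].value = in[i].value;
        nodes[i].left  = in[i].left_id  < 0 ? NULL : &nodes[in[i].left_id];
        nodes[i].right = in[i].right_id < 0 ? NULL : &nodes[in[i].right_id];
    }
    return nodes;
}
```

The node_record array is what you would actually write to disk. The linear search in id_for is fine for a sketch but would want a hash table for large graphs.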
I don't know anything about tpl, but you might be able to piggy-back on it.
The on-disk/network format should probably be framed with some type information. You'll need a name-mangling scheme.
[1] ROOT uses this mechanism to provide very flexible serialization support in C++.
Late addition: It occurs to me that this is not always as easy as I implied above. Consider the following (contrived and badly designed) declaration:
enum {
mask_none = 0x00,
mask_something = 0x01,
mask_another = 0x02,
/* ... */
mask_all = 0xff
};
typedef struct mask_map {
int mask_val;
char *mask_name;
} mask_map_t;
mask_map_t mask_list[] = {
{mask_something, "mask_something"},
{mask_another, "mask_another"},
/* ... */
};
struct saved_setup {
char* name;
/* various configuration data */
char* mask_name;
/* ... */
};
and assume that we initialize our struct saved_setup items so that mask_name points at mask_list[foo].mask_name.
When we go to serialize the data, what do we do with struct saved_setup.mask_name?
You will need to take care in designing your data structures and/or bring some case-specific intelligence to the serialization process.
This is my solution. It uses my own implementation of malloc, free and mmap, munmap system calls. Follow the given example codes. Ref: http://amscata.blogspot.com/2013/02/serialize-your-memory.html
In my approach I create a char array as my own RAM space. Then there are functions to allocate memory from it and free it. After creating the data structure, I write the char array to a file by using mmap.
Whenever you want to load it back into memory, there is a function that uses munmap to put the data structure back into the char array. Since it has virtual addresses for your pointers, you can reuse your data structure. That means you can create a data structure, save it, load it, edit it again, and save it again.
You can take a look at eet, a library of the Enlightenment project for storing C data types (including nested structures). Although nearly all libs of the Enlightenment project are in a pre-alpha state, eet is already released. I'm not sure, however, whether it can handle circular references. Probably not.
http://s11n.net/c11n/
HTH
You should check out gwlib. The serializer/deserializer is extensive, and there are extensive tests available to look at. http://gwlib.com/
I'm assuming you are talking about storing a graph structure, if not then disregard...
If you're storing a graph, I personally think the best idea would be to implement a function that converts your graph into an adjacency matrix. You can then write a function that converts an adjacency matrix back into your graph data structure.
This has three benefits (that may or may not matter in your application):
An adjacency matrix is a very natural way to create and store a graph
You can create an adjacency matrix and import them into your applications
You can store and read your data in a meaningful way.
I used this method during a CS project, and it is definitely how I would do it again.
You can read more about adjacency matrices here: http://en.wikipedia.org/wiki/Modified_adjacency_matrix
Another option is Avro C, an implementation of Apache Avro in C.
Here is an example using the Binn library (my creation):
binn *obj;
// create a new object
obj = binn_object();
// add values to it
binn_object_set_int32(obj, "id", 123);
binn_object_set_str(obj, "name", "Samsung Galaxy Charger");
binn_object_set_double(obj, "price", 12.50);
binn_object_set_blob(obj, "picture", picptr, piclen);
// send over the network
send(sock, binn_ptr(obj), binn_size(obj), 0);
// release the buffer
binn_free(obj);
If you don't want to use strings as keys you can use a binn_map which uses integers as keys.
There is also support for lists, and all these structures can be nested:
binn *list;
// create a new list
list = binn_list();
// add values to it
binn_list_add_int32(list, 123);
binn_list_add_double(list, 2.50);
// add the list to the object
binn_object_set_list(obj, "items", list);
// or add the object to the list
binn_list_add_object(list, obj);
In theory YAML should do what you want http://code.google.com/p/yaml-cpp/
Please let me know if it works for you.
