How to remove '&'-words encoding from libxml2?

I have an XML file which should be parsed and processed. For that reason I'm using libxml2.
The xml file I have looks something like this:
test.xml
<root>
<tag attr1="VALUE_1 &quot;" attr2="VALUE_2&#xA;VALUE_3" />
</root>
</root>
I want to get the attribute contents, but libxml2 seems to decode the '&'-words (I don't know what they are called).
The code I use is the following one:
LIBXML_TEST_VERSION
xmlDoc *doc;
doc = xmlReadFile("test.xml", NULL, XML_PARSE_IGNORE_ENC);
xmlNode *root;
root = xmlDocGetRootElement(doc);
xmlNode *node;
node = root->children;
while (node != NULL) {
if (node->type == XML_ELEMENT_NODE) {
xmlAttr *attr;
attr = node->properties;
while (attr != NULL) {
xmlNode *child;
child = attr->children;
while (child != NULL) {
if (child->type == XML_TEXT_NODE ||
child->type == XML_CDATA_SECTION_NODE)
printf("%s\n", child->content);
child = child->next;
}
attr = attr->next;
}
}
node = node->next;
}
So basically I want to print the attribute values, but they are being parsed with some formatting applied (I guess). When I run this code, I see the following output:
VALUE_1 "
VALUE_2
VALUE_3
As you can see, it translated the '&'-words. How can I tell libxml2 not to do that and give me the literal text values instead?

You simply can't. libxml2 will always decode numeric character references like &#xA; and predefined entities like &quot;. But &#x41; and A, for example, are semantically equivalent. If you really need to tell them apart, you're probably doing something wrong elsewhere in your XML pipeline. If you want a literal &#xA; in an attribute value, you have to encode it as &amp;#xA;.
Note that the expansion can be controlled for other, user-defined entities via the XML_PARSE_NOENT parser flag, but this won't affect numeric character references.

Related

Appending data in existing XML using libxml2

I am trying to append data to an existing XML file using C and the libxml2 module, but I am running into a lot of problems as I am fairly new to this.
My code is designed to first fetch an element node from the XML file based on the user input, then get the parent of that node and append another child to it.
XML FILE:
<policyList>
<policySecurity>
<policyName>AutoAdd</policyName>
<deviceName>PA-722</deviceName>
<status>ACTIVE</status>
<srcZone>any</srcZone>
<dstZone>any</dstZone>
<srcAddr>5.5.5.5</srcAddr>
<dstAddr>5.5.5.4</dstAddr>
<srcUser>any</srcUser>
<application>any</application>
<service>htds</service>
<urlCategory>any</urlCategory>
<action>deny</action>
</policySecurity>
<policySecurity>
<policyName>Test-1</policyName>
<deviceName>PA-710</deviceName>
<status>ACTIVE</status>
<srcZone>any</srcZone>
<dstZone>any</dstZone>
<srcAddr>192.168.1.23</srcAddr>
<dstAddr>8.8.8.8</dstAddr>
<srcUser>vivek</srcUser>
<application>any</application>
<service>http</service>
<urlCategory>any</urlCategory>
<action>deny</action>
</policySecurity>
</policyList>
C CODE:
int main(){
xmlDocPtr pDoc = xmlReadFile("/var/www/db/db_policy.xml", NULL, XML_PARSE_NOBLANKS | XML_PARSE_NOERROR | XML_PARSE_NOWARNING | XML_PARSE_NONET);
if (pDoc == NULL)
{
fprintf(stderr, "Document not parsed successfully.\n");
return 0;
}
root_element = xmlDocGetRootElement(pDoc);
if (root_element == NULL)
{
fprintf(stderr, "empty document\n");
xmlFreeDoc(pDoc);
return 0;
}
printf("Root Node is %s\n", root_element->name);
xmlChar* srcaddr = "5.5.5.5";
xmlChar *xpath = (xmlChar*) "//srcAddr";
xmlNodeSetPtr nodeset;
xmlXPathObjectPtr result;
int i;
xmlChar *keyword;
xmlXPathContextPtr context;
xmlNodePtr resdev;
xmlChar* resd;
context = xmlXPathNewContext(pDoc);
if (context == NULL) {
printf("Error in xmlXPathNewContext\n");
}
result = xmlXPathEvalExpression(xpath, context);
xmlXPathFreeContext(context);
if (result == NULL) {
printf("Error in xmlXPathEvalExpression\n");
}
if(xmlXPathNodeSetIsEmpty(result->nodesetval)){
xmlXPathFreeObject(result);
printf("No result\n");
};
if (result) {
nodeset = result->nodesetval;
for (i=0; i < nodeset->nodeNr; i++) {
keyword = xmlNodeListGetString(pDoc, nodeset->nodeTab[i]->xmlChildrenNode, 1);
printf("keyword: %s\n", keyword);
if(strcmp(keyword, srcaddr) == 0){
xmlNodePtr pNode = xmlNewNode(0, (xmlChar*)"service");
xmlNodeSetContent(pNode, (xmlChar*)"nonser");
xmlAddSibling(result, pNode);
printf("added");
}
xmlFree(keyword);
}
xmlXPathFreeObject (result);
}
xmlFreeDoc(pDoc);
xmlCleanupParser();
return (1);
}
When I run this code, it compiles and executes (with a few warnings, but nothing that hinders execution), but it does not add anything to my XML file.

I think this topic is old, but I just had a similar problem, so I am sharing this for anyone who runs into the same thing.
> On running this code, it gets compiled and executed (with a few warnings, but nothing that hinders execution), but it does not add anything to my XML file.
First of all: in my opinion, warnings in C can be worse than errors, because they let you run incorrect code. So my first piece of advice is not to ignore warnings.
Second: when I compiled this code, I saw a warning that points directly at the problem:
> warning: passing argument 1 of ‘xmlAddSibling’ from incompatible
> pointer type [-Wincompatible-pointer-types]
>
> note: expected ‘xmlNodePtr {aka struct _xmlNode *}’ but argument is of
> type ‘xmlXPathObjectPtr {aka struct _xmlXPathObject *}’
If you look up xmlAddSibling at http://www.xmlsoft.org/html/libxml-tree.html, you can see its signature:
xmlNodePtr xmlAddSibling (xmlNodePtr cur, xmlNodePtr elem)
This means both arguments must be of type xmlNodePtr. However, "result" has type xmlXPathObjectPtr, which is a completely different pointer type. What you really want is to add a new node next to the node whose text matched in the comparison (if(strcmp(keyword, srcaddr) == 0)), so that it ends up under the same parent.
Your way of finding the matching node is correct, but there are two problems. First, "result" is never updated inside the for loop; it still refers to the whole XPath result, not to the node that matched. Second, even if you did assign "nodeset->nodeTab[i]" to "result", the pointer types would still differ, as discussed above. So you have to call xmlAddSibling with the matched node itself, which has the right type. As you can see below, "nodeTab" is an array of "xmlNodePtr", which is what we were looking for, and "nodeset->nodeTab[i]" is the matched srcAddr node; the new node becomes its sibling under the same policySecurity parent.
Structure xmlNodeSet
struct _xmlNodeSet {
int nodeNr : number of nodes in the set
int nodeMax : size of the array as allocated
xmlNodePtr * nodeTab : array of nodes in no particular order
}
So you should change the:
xmlAddSibling(result, pNode);
to:
xmlAddSibling(nodeset->nodeTab[i], pNode);
Finally: you didn't save the changes. So, save it by adding
xmlSaveFileEnc("note.xml", pDoc, "UTF-8");
before
xmlFreeDoc(pDoc);
With these changes, I was able to run your code with your XML file and with no warnings.

Your commands modify the DOM representation of the XML in memory, but you missed writing it back to the file. So adding the following line should solve your problem:
...
}
// write back to file:
xmlSaveFileEnc("/var/www/db/db_policy.xml", pDoc, "UTF-8");
xmlFreeDoc(pDoc);
xmlCleanupParser();
return (1);

C standard binary trees

I'm pretty much a noob when it comes to C programming.
I've been trying for a few days to create a binary tree from expressions of the form:
A(B,C(D,$))
where each letter is a node.
'(' goes down a level in my tree (to the right).
',' goes to the left-side branch of my tree
'$' inserts a NULL node.
')' means going up a level.
This is what I came up with after 2-3 days of coding:
#define SUCCESS 0
typedef struct BinaryTree
{
char info;
BinaryTree *left,*right,*father;
}BinaryTree;
int create(BinaryTree*nodeBT, const char *expression)
{
nodeBT *aux;
nodeBT *root;
nodeBT *parent;
nodeBT=(BinaryTree*) malloc (sizeof(BinaryTree));
nodeBT->info=*expression;
nodeBT->right=nodeBT->left=NULL;
nodeBT->father = NULL;
++expression;
parent=nodeBT;
root=nodeBT;
while (*expression)
{if (isalpha (*expression))
{aux=(BinaryTree*) malloc (sizeof(BinaryTree));
aux->info=*expression;
aux->dr=nodeBT->st=NULL;
aux->father= parent;
nodeBT=aux;}
if (*expression== '(')
{parent=nodeBT;
nodeBT=nodeBT->dr;}
if (*expression== ',')
{nodeBT=nodeBT->father;
nodeBT=nodeBT->dr;}
if (*expression== ')')
{nodeBT=nodeBT->father;
parent= nodeBT->nodeBT;}
if (*expression== '$')
++expression;
++expression;
}
nodeBT=root;
return SUCCESS;
}
At the end, while trying to access the newly created tree, I keep getting "memory unreadable 0xCCCCCC". And I haven't got the slightest hint where I'm getting it wrong.
Any idea ?

Several problems:
You haven't shown us the definition of type nodeBT, but you've declared aux, root, and parent to be pointers to that type.
You then assign aux to point to a BinaryTree even though it's declared to point to a nodeBT.
You assign to aux->dr, which isn't part of BinaryTree, so I can't just assume you typed nodeBT where you meant BinaryTree. You assign to nodeBT->st, that is not a part of BinaryTree either.
You try to return the parsed tree by assigning nodeBT=root. The problem is that C is a “call-by-value” language. This implies that when your create function assigns to nodeBT, it is only changing its local variable's value. The caller of create doesn't see that change. So the caller doesn't receive the root node. That's probably why you're getting your “memory unreadable” error; the caller is accessing some random memory, not the memory containing the root node.
Your code will actually be much easier to understand if you write your parser using a standard technique called “recursive descent”. Here's how.
Let's write a function that parses one node from the expression string. Naively, it should have a signature like this:
BinaryTree *nodeFromExpression(char const *expression) {
To parse a node, we first need to get the node's info:
char info = expression[0];
Next, we need to see if the node should have children.
BinaryTree *leftChild = NULL;
BinaryTree *rightChild = NULL;
if (expression[1] == '(') {
If it should have children, we need to parse them. This is where we put the “recursive” in “recursive descent”: we just call nodeFromExpression again to parse each child. To parse the left child, we need to skip the first two characters in expression, since those were the info and the ( of the current node:
leftChild = nodeFromExpression(expression + 2);
But how much do we skip to parse the right child? We need to skip all the characters that we used while parsing the left child…
rightChild = nodeFromExpression(expression + ???
We don't know how many characters that was! It turns out we need to make nodeFromExpression return not just the node it parsed, but also some indication of how many characters it consumed. So we need to change the signature of nodeFromExpression to allow that. And what if we run into an error while parsing? Let's define a structure that nodeFromExpression can use to return the node it parsed, the number of characters it consumed, and the error it encountered (if there was one):
typedef struct {
BinaryTree *node;
char const *error;
int offset;
} ParseResult;
We'll say that if error is non-null, then node is null and offset is the offset in the string where we found the error. Otherwise, offset is just past the last character consumed to parse node.
So, starting over, we'll make nodeFromExpression return a ParseResult. It will take the entire expression string as input, and it will take the offset in that string at which to start parsing:
ParseResult nodeFromExpression(char const *expression, int offset) {
Now that we have a way to report errors, let's do some error checking:
if (!expression[offset]) {
return (ParseResult){
.error = "end of string where info expected",
.offset = offset
};
}
char info = expression[offset++];
I didn't mention this the first time through, but we should handle your $ token for NULL here:
if (info == '$') {
return (ParseResult){
.node = NULL,
.offset = offset
};
}
Now we can get back to parsing the children.
BinaryTree *leftChild = NULL;
BinaryTree *rightChild = NULL;
if (expression[offset] == '(') {
++offset; /* consume the '(' so the left child starts at its own info */
So, to parse the left child, we just call ourselves recursively again. If the recursive call gets an error, we return the same result:
ParseResult leftResult = nodeFromExpression(expression, offset);
if (leftResult.error)
return leftResult;
OK, we parsed the left child successfully. Now we need to check for, and consume, the comma between the children:
offset = leftResult.offset;
if (expression[offset] != ',') {
return (ParseResult){
.error = "comma expected",
.offset = offset
};
}
++offset;
Now we can recursively call nodeFromExpression to parse the right child:
ParseResult rightResult = nodeFromExpression(expression, offset);
The error case now is a bit more complex if we don't want to leak memory. We need to free the left child before returning the error:
if (rightResult.error) {
free(leftResult.node);
return rightResult;
}
Note that free does nothing if you pass it NULL, so we don't need to check for that explicitly.
Now we need to check for, and consume, the ) after the children:
offset = rightResult.offset;
if (expression[offset] != ')') {
free(leftResult.node);
free(rightResult.node);
return (ParseResult){
.error = "right parenthesis expected",
.offset = offset
};
}
++offset;
We need to set our local leftChild and rightChild variables while the leftResult and rightResult variables are still in scope:
leftChild = leftResult.node;
rightChild = rightResult.node;
}
We've parsed both children, if we needed to, so now we're ready to construct the node we need to return:
BinaryTree *node = (BinaryTree *)calloc(1, sizeof *node);
node->info = info;
node->left = leftChild;
node->right = rightChild;
We have one last thing to do: we need to set the father pointers of the children:
if (leftChild) {
leftChild->father = node;
}
if (rightChild) {
rightChild->father = node;
}
Finally, we can return a successful ParseResult:
return (ParseResult){
.node = node,
.offset = offset
};
}
I've put all the code in this gist for easy copy'n'paste.
UPDATE
If your compiler doesn't like the (ParseResult){ ... } syntax, you should look for a better compiler. That syntax has been standard since 1999 (§6.5.2.5 Compound Literals). While you're looking for a better compiler, you can work around it like this.
First, add two static functions:
static ParseResult ParseResultMakeWithNode(BinaryTree *node, int offset) {
ParseResult result;
memset(&result, 0, sizeof result);
result.node = node;
result.offset = offset;
return result;
}
static ParseResult ParseResultMakeWithError(char const *error, int offset) {
ParseResult result;
memset(&result, 0, sizeof result);
result.error = error;
result.offset = offset;
return result;
}
Then, replace the problematic syntax with calls to these functions. Examples:
if (!expression[offset]) {
return ParseResultMakeWithError("end of string where info expected",
offset);
}
if (info == '$') {
return ParseResultMakeWithNode(NULL, offset);
}

Compare two XMLNodes in C (libxml library)

I'm parsing some xml files in C using libxml library. I want to compare two xmlnodes to see whether they contain the same data or not. Is there any function available to do so?

The libxml API docs seem reasonable and suggest that xmlBufGetNodeContent and xmlBufContent might do what you want.
xmlNode node1, node2;
......
xmlBuf buf;
xmlChar* content1 = NULL;
xmlChar* content2 = NULL;
if (xmlBufGetNodeContent(&buf, &node1) == 0) {
content1 = xmlBufContent(&buf);
}
if (xmlBufGetNodeContent(&buf, &node2) == 0) {
content2 = xmlBufContent(&buf);
}
if (strcmp(content1, content2) == 0) {
/* nodes match */
}

I don't think the API calls xmlBufGetNodeContent and xmlBufContent are available, because the xmlBufPtr data type they rely on does not exist, at least not in libxml2 2.7.6.
I used different API calls instead: xmlNodeDump or xmlNodeGetContent. I hope this helps others with a similar question.

parsing for xml values

I have a simple XML string defined in the following way in C code:
char xmlstr[] = "<root><str1>Welcome</str1><str2>to</str2><str3>wonderland</str3></root>";
I want to parse xmlstr to fetch all the values assigned to the str1, str2, and str3 tags.
I am using the libxml2 library. As I am not very experienced with XML handling, I am unable to get the values of the required tags. I tried some sources from the net, but I keep ending up with wrong output.

Using the libxml2 library parsing your string would look something like this:
char xmlstr[] = ...;
char *str1, *str2, *str3;
xmlDocPtr doc = xmlReadDoc(BAD_CAST xmlstr, "http://someurl", NULL, 0);
xmlNodePtr root, child;
if(!doc)
{ /* error */ }
root = xmlDocGetRootElement(doc);
Now that we have parsed a DOM structure out of your XML string, we can extract the values by iterating over all child nodes of your root tag:
for(child = root->children; child != NULL; child = child->next)
{
if(xmlStrcmp(child->name, BAD_CAST "str1") == 0)
{
str1 = (char *)xmlNodeGetContent(child);
}
/* repeat for str2 and str3 */
...
}

I usually do XML parsing with the minixml library.
I hope this helps:
http://www.minixml.org/documentation.php/basics.html

How can libxml2 be used to parse data from XML?

I have looked around at the libxml2 code samples and I am confused on how to piece them all together.
What are the steps needed when using libxml2 to just parse or extract data from an XML file?
I would like to get hold of, and possibly store information for, certain attributes. How is this done?

I believe you first need to create a Parse tree. Maybe this article can help, look through the section which says How to Parse a Tree with Libxml2.

libxml2 provides various examples showing basic usage.
http://xmlsoft.org/examples/index.html
For your stated goals, tree1.c would probably be most relevant.
tree1.c: Navigates a tree to print element names
Parse a file to a tree, use xmlDocGetRootElement() to get the root element, then walk the document and print all the element name in document order.
http://xmlsoft.org/examples/tree1.c
Once you have an xmlNode struct for an element, the "properties" member is a linked list of attributes. Each xmlAttr object has a "name" and "children" object (which are the name/value for that attribute, respectively), and a "next" member which points to the next attribute (or null for the last one).
http://xmlsoft.org/html/libxml-tree.html#xmlNode
http://xmlsoft.org/html/libxml-tree.html#xmlAttr

I found these two resources helpful when I was learning to use libxml2 to build a rss feed parser.
Tutorial with SAX interface
Tutorial using the DOM Tree (code example for getting an attribute value included)

Here is the complete process for extracting XML/HTML data from a file on Windows.
First, download the pre-compiled .dll from http://xmlsoft.org/sources/win32/
Also download its dependencies, iconv.dll and zlib1.dll, from the same page.
Extract all the .zip files into the same directory, for example D:\demo\
Copy iconv.dll, zlib1.dll and libxml2.dll into the c:\windows\system32 directory.
Create a libxml_test.cpp file and copy the following code into it.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <libxml/HTMLparser.h>
void traverse_dom_trees(xmlNode * a_node)
{
xmlNode *cur_node = NULL;
if(NULL == a_node)
{
//printf("Invalid argument a_node %p\n", a_node);
return;
}
for (cur_node = a_node; cur_node; cur_node = cur_node->next)
{
if (cur_node->type == XML_ELEMENT_NODE)
{
/* Check here whether the current node should be excluded */
printf("Node type: Element, name: %s\n", cur_node->name);
}
else if(cur_node->type == XML_TEXT_NODE)
{
/* Process the text node here; its text is in cur_node->content */
printf("node type: Text, node content: %s, content length %d\n", (char *)cur_node->content, (int)strlen((char *)cur_node->content));
}
traverse_dom_trees(cur_node->children);
}
}
int main(int argc, char **argv)
{
htmlDocPtr doc;
xmlNode *roo_element = NULL;
if (argc != 2)
{
printf("\nInvalid argument\n");
return(1);
}
/* Macro to check API for match with the DLL we are using */
LIBXML_TEST_VERSION
doc = htmlReadFile(argv[1], NULL, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
if (doc == NULL)
{
fprintf(stderr, "Document not parsed successfully.\n");
return 0;
}
roo_element = xmlDocGetRootElement(doc);
if (roo_element == NULL)
{
fprintf(stderr, "empty document\n");
xmlFreeDoc(doc);
return 0;
}
printf("Root Node is %s\n", roo_element->name);
traverse_dom_trees(roo_element);
xmlFreeDoc(doc); // free document
xmlCleanupParser(); // Free globals
return 0;
}
Open the Visual Studio Command Prompt.
Go to the D:\demo directory.
Run: cl libxml_test.cpp /I".\libxml2-2.7.8.win32\include" /I".\iconv-1.9.2.win32\include" /link libxml2-2.7.8.win32\lib\libxml2.lib
Run the binary with: libxml_test.exe test.html (test.html may be any valid HTML file).

You can refer to this answer.
There, the data is stored in a structure and then used by passing the structure's address to a function.
You can find the detailed C code there.
code ->> this