LibXML internal and output encodings - c

I'm trying to write XML files with libxml2 in ISO-8859-1.
But from the documentation it seems that for each text node that I create I'll have to convert to UTF-8 which is libxml's internal encoding. Then when calling xmlSaveFormatFileEnc() libxml converts to the target encoding and adds the encoding attribute to the document.
Is this assumption correct?
For now my code goes roughly like this:
xmlNode *root_element = NULL, *node4 = NULL;
xmlDoc *doc = NULL;
doc = xmlNewDoc(BAD_CAST XML_DEFAULT_VERSION);
root_element = xmlNewDocNode(doc, NULL, BAD_CAST("root"),
NULL);
char * input_str = getLatin1Data();
isolat1ToUTF8(utf8_str, &file_size, input_str, &inlen);
node4 = xmlNewCDataBlock(doc, BAD_CAST list_content, xmlStrlen(BAD_CAST utf8_str));
xmlAddChild(root_element, node4);
xmlSaveFormatFileEnc("test_file.xml", doc, "UTF-8", 1);
xmlFreeDoc(doc);

Your assumption is right. When xmlChar is expected, like in xmlNewCDataBlock, xmlNewText, it is always UTF-8:
From include/libxml/xmlstring.h (libxml 2.8.0):
/**
* xmlChar:
*
* This is a basic byte in an UTF-8 encoded string.
* It's unsigned allowing to pinpoint case where char * are assigned
* to xmlChar * (possibly making serialization back impossible).
*/

Related

How to stream Apache Arrow RecordBatches in C?

I read some data from a PostgreSQL database, convert it into RecordBatches and try to send the data to a client. But I fail to properly understand the usage of Apache Arrow C/GLib.
My information sources are the C++ docs, the Apache Arrow C/GLib reference manual and the C/GLib Github files.
By following the usage description of Apache Arrow C++ and experimenting with the wrapper classes in C, I build this minimal example of writing out a RecordBatch into a buffer and (after theoretically sending and receiving the buffer) trying to read that buffer back into a RecordBatch. But it fails and i would be glad, if you could point out my mistakes!
I omitted the error catching for readability. The code errors out at creation of the GArrowRecordBatchStreamReader. If i use the arrowbuffer or the buffer from the top in creating the InputStream, the error reads [record-batch-stream-reader][open]: IOError: Expected IPC message of type schema but got record batch. If i use the testBuffer the error complains about an invalid IPC stream, so the data is just corrupt.
void testRecordbatchStream(GArrowRecordBatch *rb){
GError *error = NULL;
// Write Recordbatch
GArrowResizableBuffer *buffer = garrow_resizable_buffer_new(300, &error);
GArrowBufferOutputStream *bufferStream = garrow_buffer_output_stream_new(buffer);
long written = garrow_output_stream_write_record_batch(GARROW_OUTPUT_STREAM(bufferStream), rb, NULL, &error);
// Use buffer as plain bytes
void *data = garrow_buffer_get_data(GARROW_BUFFER(buffer));
size_t length = garrow_buffer_get_size(GARROW_BUFFER(buffer));
// Read plain bytes and test serialize function
GArrowBuffer *testBuffer = garrow_buffer_new(data, length);
GArrowBuffer *arrowbuffer = garrow_record_batch_serialize(rb, NULL, &error);
// Read RecordBatch from buffer
GArrowBufferInputStream *inputStream = garrow_buffer_input_stream_new(arrowbuffer);
GArrowRecordBatchStreamReader *sr = garrow_record_batch_stream_reader_new(GARROW_INPUT_STREAM(inputStream), &error);
GArrowRecordBatch *rb2 = garrow_record_batch_reader_read_next(sr, &error);
printf("Received RB: \n%s\n", garrow_record_batch_to_string(rb2, &error));
}
So my solution was to use the class GArrowRecordBatchStreamWriter and Reader, instead of the function garrow_output_stream_write_record_batch(), because the latter only writes a record batch without a stream header and schema. Furthermore one has to properly access the data of the GArrowBuffer after writing. (Again, error handling omitted)
GError *error = NULL;
GArrowResizableBuffer *buffer = garrow_resizable_buffer_new(4096, &error);
GArrowBufferOutputStream *bufferStream = garrow_buffer_output_stream_new(buffer);
GArrowSchema *schema = garrow_record_batch_get_schema(recordbatch);
GArrowRecordBatchStreamWriter *sw = garrow_record_batch_stream_writer_new(GARROW_OUTPUT_STREAM(bufferStream), schema, &error);
g_object_unref(bufferStream);
g_object_unref(schema);
gboolean test = garrow_record_batch_writer_write_record_batch(GARROW_RECORD_BATCH_WRITER(sw), recordbatch, &error);
GBytes *data = garrow_buffer_get_data(GARROW_BUFFER(buffer));
gint64 length = garrow_buffer_get_size(GARROW_BUFFER(buffer));
gsize datasize;
gconstpointer datap = g_bytes_get_data(data, &datasize);
GArrowBuffer *receivingBuffer = garrow_buffer_new(datap, datasize);
GArrowBufferInputStream *inputStream = garrow_buffer_input_stream_new(GARROW_BUFFER(receivingBuffer));
GArrowRecordBatchStreamReader *sr = garrow_record_batch_stream_reader_new(GARROW_INPUT_STREAM(inputStream), &error);
printf("Reading RecordBatch:\n");
GArrowRecordBatch *recordbatch2 = garrow_record_batch_reader_read_next(GARROW_RECORD_BATCH_READER(sr), &error);
printf("%s\n", garrow_record_batch_to_string(recordbatch2, &error));

How to remove '&'-words encoding from libxml2?

I have an XML file which should be parsed and processed. For that reason I'm using libxml2.
The xml file I have looks something like this:
test.xml
<root>
<tag attr1="VALUE_1 "" attr2="VALUE_2
VALUE_3" />
</root>
And I want to get the attribute contents. BUT the libxml2 seems to encode the '&'-words (don't know how to call them).
The code I use is the following one:
LIBXML_TEST_VERSION
xmlDoc *doc;
doc = xmlReadFile("test.xml", NULL, XML_PARSE_IGNORE_ENC);
xmlNode *root;
root = xmlDocGetRootElement(doc);
xmlNode *node;
node = root->children;
while (node != NULL) {
if (node->type == XML_ELEMENT_NODE) {
xmlAttr *attr;
attr = node->properties;
while (attr != NULL) {
xmlNode *child;
child = attr->children;
while (child != NULL) {
if (child->type == XML_TEXT_NODE ||
child->type == XML_CDATA_SECTION_NODE)
printf("%s\n", child->content);
child = child->next;
}
attr = attr->next;
}
}
node = node->next;
}
So basically I want to print the attribute values, BUT they are being parsed with a formatting (I guess). When I run this code than I see following output:
VALUE_1 "
VALUE_2
VALUE_3
As you can see it translated the '&'-words. How can I hint the libxml2 to not do that and give me the literal text values.
You simply can't. libxml2 will always decode numeric character references like
and predefined entities like ". But A and A, for example, are semantically equivalent. If you really need to tell them apart, you're probably doing something wrong elsewhere in your XML pipeline. If you want a literal
in an attribute value, you have to encode it as &#xA;.
Note that the expansion can be controlled for other, user-defined entities via the XML_PARSE_NOENT parser flag, but this won't affect numeric character references.

GCC C and passing integer to PostgreSQL

I know that there probably was plenty on that but after several days of searching I am unable to find how to do one simple passing of integer and char in one go to PostgreSQL from C under Linux.
In PHP it is easy, like 123, and in C using libpq it seem to be like something out of ordinary.
I had a look at PQexecParams but is seem to be not helping. Examples on the net are not helping as well and it seems to be an impossible mission.
Would someone be kind enough to translate this simple PHP statement to C and show me how to pass multiple vars of different types in one INSERT query.
col1 is INT
col2 is CHAR
$int1 = 1;
$char1 = 'text';
$query = "INSERT INTO table (col1, col2) values ('$int1',$char1)";
$result = ibase_query($query);
This would show what I am trying to do (please mind the code is very wrong):
void insert_CommsDb(PGconn *conn, PGresult *pgres, int csrv0) { const char * params[1];
params[0] = csrv0;
pgres = PQexecParams(conn, "INSERT INTO comms_db (srv0::int) values ($1)",
1,
NULL,
params,
1,
NULL,
0);
if (PQresultStatus(pgres) != PGRES_COMMAND_OK)
{
fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
exit_nicely(conn,pgres);
}
PQclear(pgres);
}
https://www.postgresql.org/docs/current/static/libpq-exec.html
As #joop commented above:
If the paramTypes argument is NULL, all the params are assumed to be strings.
So, you should transform your int argument to a string.
void insert_CommsDb(PGconn *conn, int csrv0)
{
PGresult *pgres;
char * params[1];
char buff[12];
sprintf(buff, "%d", csrv0);
params[0] = buff;
pgres = PQexecParams(conn
, "INSERT INTO comms_db (srv0::int) values ($1)" // The query (we dont need the cast here)
, 1 // number of params
, NULL // array with types, or NULL
, params // array with parameter values
, NULL // ARRAY with parameter lenghts
, NULL // array with per-param flags indicating binary/non binary
, 0 // set to 1 if we want BINARY results, 0 for txt
);
if (PQrresultStatus(pgres) != PGRES_COMMAND_OK)
{
fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
exit_nicely(conn,pgres);
}
PQclear(pgres);
}
wildplasser's answer shows the way in general.
Since you explicitly asked about several parameters, I'll add an example for that.
If you are not happy to convert integers to strings, the alternative would be to use the external binary format of the data type in question. That requires inside knowledge and probably reading the PostgreSQL source. For some data types, it can also depend on the hardware.
PGresult *res;
PGconn *conn;
Oid types[2];
char * values[2];
int lengths[2], formats[2];
int arg0;
/* connect to the database */
/*
* The first argument is in binary format.
* Apart from having to use the "external binary
* format" for the data, we have to specify
* type and length.
*/
arg0 = htonl(42); /* external binary format: network byte order */
types[0] = 23; /* OID of "int4" */
values[0] = (char *) &arg0;
lengths[0] = sizeof(int);
formats[0] = 1;
/* second argument is in text format */
types[1] = 0;
values[1] = "something";
lengths[1] = 0;
formats[1] = 0;
res = PQexecParams(
conn,
"INSERT INTO mytab (col1, col2) values ($1, $2)",
2,
types,
(const char * const *)values,
lengths,
formats,
0 /* results in text format */
);
I'd recommend that you use the text format for most data types.
The notable exception is bytea, where it usually is an advantage to use the binary format, as it saves space and CPU power. In this case, the external binary format is simply the bytes.
VS C++ not liking htonl(42):
arg0 = htonl(42); /* external binary format: network byte order */

libxml2 fails to parse from buffer but parses successfully from file

I have a function that writes an XML document to a buffer using the libxml2 writer, but when I try to parse the document from memory using xmlParseMemory, it only returns parser errors. I have also tried writing the document to a file and parsing it using xmlParseFile and it parses successfully.
This is how I initialize the writer and buffer for the xml document.
int rc, i = 0;
xmlTextWriterPtr writer;
xmlBufferPtr buf;
// Create a new XML buffer, to which the XML document will be written
buf = xmlBufferCreate();
if (buf == NULL)
{
printf("testXmlwriterMemory: Error creating the xml buffer\n");
return;
}
// Create a new XmlWriter for memory, with no compression.
// Remark: there is no compression for this kind of xmlTextWriter
writer = xmlNewTextWriterMemory(buf, 0);
if (writer == NULL)
{
printf("testXmlwriterMemory: Error creating the xml writer\n");
return;
}
// Start the document with the xml default for the version,
// encoding UTF-8 and the default for the standalone
// declaration.
rc = xmlTextWriterStartDocument(writer, NULL, ENCODING, NULL);
if (rc < 0)
{
printf
("testXmlwriterMemory: Error at xmlTextWriterStartDocument\n");
return;
}
I pass the xml document to another function to be validated using
int ret = validateXML(buf->content);
Here is the first part of validateXML
int validateXML(char *buffer)
{
xmlDocPtr doc;
xmlSchemaPtr schema = NULL;
xmlSchemaParserCtxtPtr ctxt;
char *XSDFileName = XSDFILE;
char *XMLFile = buffer;
int ret = 1;
doc = xmlReadMemory(XMLFile, sizeof(XMLFile), "noname.xml", NULL, 0);
doc is always NULL after calling this function, which means that it failed to parse the document.
Here are the errors that running the program returns
Entity: line 1: parser error : ParsePI: PI xm space expected
<?xm
^
Entity: line 1: parser error : ParsePI: PI xm never end ...
<?xm
^
Entity: line 1: parser error : Start tag expected, '<' not found
<?xm
^
I have been unable to figure this out for quite a while now and I am out of ideas. If anyone has any, I would be grateful if you would share it.
You are using sizeof to determine the size of the xml data. For a char pointer that is always going to return 4. What you probably need is strlen.
doc = xmlReadMemory(XMLFile, strlen(XMLFile), "noname.xml", NULL, 0);

parsing for xml values

I have a simple xml string defined in the following way in a c code:
char xmlstr[] = "<root><str1>Welcome</str1><str2>to</str2><str3>wonderland</str3></root>";
I want to parse the xmlstr to fetch all the values assigned to str1,str2,str3 tags.
I am using libxml2 library. As I am less experienced in xml handling, I unable get the values of the required tags. I tried some sources from net, but I am ending wrong outputs.
Using the libxml2 library parsing your string would look something like this:
char xmlstr[] = ...;
char *str1, *str2, *str3;
xmlDocPtr doc = xmlReadDoc(BAD_CAST xmlstr, "http://someurl", NULL, 0);
xmlNodePtr root, child;
if(!doc)
{ /* error */ }
root = xmlDocGetRootElement(doc);
now that we have parsed a DOM structure out of your xml string, we can extract the values by iterating over all child values of your root tag:
for(child = root->children; child != NULL; child = child->next)
{
if(xmlStrcmp(child->name, BAD_CAST "str1") == 0)
{
str1 = (char *)xmlNodeGetContent(child);
}
/* repeat for str2 and str3 */
...
}
I usual do xml parsing using minixml library
u hope this will help you
http://www.minixml.org/documentation.php/basics.html

Resources