libxml2: Not report characters like ' or " individually - c

I'm new to libxml and so far everything is good, but I noticed one thing that annoys me:
When libxml reports characters, i.e. the handler's characters function is being called, "special" characters like ' or " or reported individually.
example:
"It's a nice day today. Don't you agree?"
report:"
report: It
report: '
report: s a nice day today. Don
report: '
report: you aggree?
report: "
Is there any way to change that behavior, so it would be reported as a complete string?
Don't get me wrong, it's not a problem to use strcat to put the original string together, but that's additional work ;)
I searched the headers and the net and found no solution. Thank you in advance.
Edit: Because the handler description above needs some more explaining.
By reporting characters I mean when the handler's (htmlSAXHandler) handler.characters callback function is called, which I assigned:
void _characters(void *context, const xmlChar *ch, int len) {
printf("report: %s\n", chars);
}

You might want to look at DOM parsing instead of registering SAX callbacks, if your document isn't going to be so large that you can't hold it all in memory.
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/tree.h>
int main()
{
htmlDocPtr doc;
xmlNodePtr root, node;
char *output;
char *rawhtml = "<html><body>\"It's a nice day today. Don't you agree?\"</body></html>";
doc = htmlReadDoc(rawhtml, NULL, NULL, XML_PARSE_NOBLANKS);
root = xmlDocGetRootElement(doc);
node = root->children;
output = xmlNodeGetContent(node);
printf("output=[%s]\n", output);
if(output)
xmlFree(output);
if(doc)
xmlFreeDoc(doc);
}
produces
output=["It's a nice day today. Don't you agree?"]

I'm afraid you should live with that. If you encounter an HTML document with 100K chars do you also expect it to deliver all chars in one go? I think you should just be ready for splitting the characters at any moment. Then splitting them at special characters makes no difference.
This answer is not adequate if your software aims to read only small HTML documents, but I bet that the libxml authors were not thinking of special handling for such cases.

Related

VS2010, scanf, strange behaviour

I'm converting some source from VC6 to VS2010. The code is written in C++/CLI and it is an MFC application. It includes a line:
BYTE mybyte;
sscanf(source, "%x", &mybyte);
Which is fine for VC6 (for more than 15 years) but causing problems in VS2010 so I created some test code.
void test_WORD_scanf()
{
char *source = "0xaa";
char *format = "%x";
int result = 0;
try
{
WORD pre = -1;
WORD target = -1;
WORD post = -1;
printf("Test (pre scan): stack: pre=%04x, target=%04x, post=%04x, sourse='%s', format='%s'\n", pre, target, post, source, format);
result = sscanf(source, format, &target);
printf("Test (post scan): stack: pre=%04x, target=%04x, post=%04x, sourse='%s', format='%s'\n", pre, target, post, source, format);
printf("result=%x", result);
// modification suggested by Werner Henze.
printf("&pre=%x sizeof(pre)=%x, &target=%x, sizeof(target)=%x, &post=%x, sizeof(post)=%d\n", &pre, sizeof(pre), &target, sizeof(target), &post, sizeof(post));
}
catch (...)
{
printf("Exception: Bad luck!\n");
}
}
Building this (in DEBUG mode) is no problem. Running it gives strange results that I cannot explain. First, I get the output from the two printf statemens as expected. Then a get a run time waring, which is the unexpected bit for me.
Test (pre scan): stack: pre=ffff, target=ffff, post=ffff, source='0xaa', format='%x'
Test (post scan): stack: pre=ffff, target=00aa, post=ffff, source='0xaa', format='%x'
result=1
Run-Time Check Failure #2 - Stack around the variable 'target' was corrupted.
Using the debugger I found out that the run time check failure is triggered on returning from the function. Does anybody know where the run time check failure comes from? I used Google but can't find any suggestion for this.
In the actual code it is not a WORD that is used in sscanf but a BYTE (and I have a BYTE version of the test function). This caused actual stack corruptions with the "%x" format (overwriting variable pre with 0) while using "%hx" (what I expect to be the correct format) is still causing some problems in overwriting the lower byte of variable prev.
Any suggestion is welcome.
Note: I edited the example code to include the return result from sscanf()
Kind regards,
Andre Steenveld.
sscanf with %x writes an int. If you provide the address of a BYTE or a WORD then you get a buffer overflow/stack overwrite. %hx will write a short int.
The solution is to have an int variable, let sscanf write to that and then set your WORD or BYTE variable to the read value.
int x;
sscanf("%x", "0xaa", x);
BYTE b = (BYTE)x;
BTW, for your test and the message
Run-Time Check Failure #2 - Stack around the variable 'target' was corrupted.
you should also print out the addresses of the variables and you'll probably see that the compiler added some padding/security check space between the variables pre/target/post.

Bus Error on void function return

I'm learning to use libcurl in C. To start, I'm using a randomized list of accession names to search for protein sequence files that may be found hosted here. These follow a set format where the first line is a variable length (but which contains no information I'm trying to query) then a series of capitalized letters with a new line every sixty (60) characters (what I want to pull down, but reformat to eighty (80) characters per line).
I have the call itself in a single function:
//finds and saves the fastas for each protein (assuming on exists)
void pullFasta (proteinEntry *entry, char matchType, FILE *outFile) {
//Local variables
URL_FILE *handle;
char buffer[2] = "", url[32] = "http://www.uniprot.org/uniprot/", sequence[2] = "";
//Build full URL
/*printf ("u:%s\nt:%s\n", url, entry->title); /*This line was used for debugging.*/
strcat (url, entry->title);
strcat (url, ".fasta");
//Open URL
/*printf ("u:%s\n", url); /*This line was used for debugging.*/
handle = url_fopen (url, "r");
//If there is data there
if (handle != NULL) {
//Skip the first line as it's got useless info
do {
url_fread(buffer, 1, 1, handle);
} while (buffer[0] != '\n');
//Grab the fasta data, skipping newline characters
while (!url_feof (handle)) {
url_fread(buffer, 1, 1, handle);
if (buffer[0] != '\n') {
strcat (sequence, buffer);
}
}
//Print it
printFastaEntry (entry->title, sequence, matchType, outFile);
}
url_fclose (handle);
return;
}
With proteinEntry being defined as:
//Entry for fasta formatable data
typedef struct proteinEntry {
char title[7];
struct proteinEntry *next;
} proteinEntry;
And the url_fopen, url_fclose, url_feof, url_read, and URL_FILE code found here, they mimic the file functions for which they are named.
As you can see I've been doing some debugging with the URL generator (uniprot URLs follow the same format for different proteins), I got it working properly and can pull down the data from the site and save it to file in the proper format that I want. I set the read buffer to 1 because I wanted to get a program that was very simplistic but functional (if inelegant) before I start playing with things, so I would have a base to return to as I learned.
I've tested the url_<function> calls and they are giving no errors. So I added incremental printf calls after each line to identify exactly where the bus error is occurring and it is happening at return;.
My understanding of bus errors is that it's a memory access issue wherein I'm trying to get at memory that my program doesn't have control over. My confusion comes from the fact that this is happening at the return of a void function. There's nothing being read, written, or passed to trigger the memory error (as far as I understand it, at least).
Can anyone point me in the right direction to fix my mistake please?
EDIT: As #BLUEPIXY pointed out I had a potential url_fclose (NULL). As #deltheil pointed out I had sequence as a static array. This also made me notice I'm repeating my bad memory allocation for url, so I updated it and it now works. Thanks for your help!
If we look at e.g http://www.uniprot.org/uniprot/Q6GZX1.fasta and skip the first line (as you do) we have:
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
Which is a 60 characters string.
When you try to read this sequence with:
//Grab the fasta data, skipping newline characters
while (!url_feof (handle)) {
url_fread(buffer, 1, 1, handle);
if (buffer[0] != '\n') {
strcat (sequence, buffer);
}
}
The problem is sequence is not expandable and not large enough (it is a fixed length array of size 2).
So make sure to choose a large enough size to hold any sequence, or implement the ability to expand it on-the-fly.

Why's _itoa causing my program to crash?

The following code just keeps on crashing when it reaches the part with _itoa, I've tried to implement that function instead and then it got even weirder, it just kept on crashing when I ran it without the debugger but worked fine while working with the debugger.
# include "HNum.h"
# include <stdio.h>
# include <stdlib.h>
# include <string.h>
# include <assert.h>
# define START_value 30
typedef enum {
HNUM_OUT_OF_MEMORY = -1,
HNUM_SUCCESS = 0,
} HNumRetVal;
typedef struct _HNum{
size_t Size_Memory;
char* String;
}HNum;
HNum *HNum_alloc(){
HNum* first = (HNum*)malloc(sizeof(HNum));
if(first==NULL){
return NULL;
}
first->String =(char*)malloc(sizeof(START_value));
if(first->String==NULL){
return NULL;
}
first->Size_Memory = START_value; // slash zero && and starting from zero index;
return first;
}
HNumRetVal HNum_setFromInt(HNum *hnum, int nn){
itoa(nn,hnum->String,10);
}
void main(){
HNum * nadav ;
int h = 13428637;
nadav = HNum_alloc();
nadav->String="1237823423423434";
HNum_setFromInt(nadav,h);
printf("nadav string : %s \n ",nadav->String);
//printf("w string %s\n",w->String);
//printf("nadav string %s\n",nadav->String);
HNum_free(nadav);
}
I've been trying to figure this out for hours and couldn't come up with anything...
The IDE I'm using is Visual Studio 2012 express, the crash shows the following:
"PROJECT C.exe has stopped working
windows can check online for a solution to the program."
first->String =(char*)malloc(sizeof(START_value));
should be
first->String = malloc(START_value);
The current version allocates space for sizeof(int)-1 characters (-1 to leave space for the nul terminator). This is too small to hold your target value so _itoa writes beyond memory allocated for first->String. This results in undefined behaviour; it is quite possible for different runs to fail in different places or debug/release builds to behave differently.
You also need to remove the line
nadav->String="1237823423423434";
which leaks the memory allocated for String in HNum_alloc, replacing it with a pointer to a string literal. This new pointer should be considered to be read-only; you cannot write it it inside _itoa
Since I'm not allowed to comment:
simonc's answer is correct. If you find the following answer useful, you should mark his answer as the right one:P
I tried that code myself and the only thing missing is lets say:
strcpy(nadav->String, "1237823423423434"); INSTEAD OF nadav->String="1237823423423434";
and
first->String = malloc(START_value); INSTEAD OF first->String =(char*)malloc(sizeof(START_value));
Also, maybe you'd have to use _itoa instead of itoa, that's one of the things I had to change in my case anyhow.
If that doesn't work, you should probably consider using a different version of VS.

Checking for a blank line in C - Regex

Goal:
Find if a string contains a blank line. Whether it be '\n\n',
'\r\n\r\n', '\r\n\n', '\n\r\n'
Issues:
I don't think my current regex for finding '\n\n' is right. This is my first time really using regex outside of simple use of * when removing files in command line.
Is it possible to check for all of these cases (listed above) in one regex? or do I have to do 4 seperate calls to compile_regex?
Code:
int checkForBlankLine(char *reader) {
regex_t r;
compile_regex(&r, "*\n\n");
match_regex(&r, reader);
return 0;
}
void compile_regex(regex_t *r, char *matchText) {
int status;
regcomp(r, matchText, 0);
}
int match_regex(regex_t *r, char *reader) {
regmatch_t match[1];
int nomatch = regexec(r, reader, 1, match, 0);
if (nomatch) {
printf("No matches.\n");
} else {
printf("MATCH!\n");
}
return 0;
}
Notes:
I only need to worry about finding one blank line, that's why my regmatch_t match[1] is only one item long
reader is the char array containing the text I am checking for a blank line.
I have seen other examples and tried to base the code off of those examples, but I still seem to be missing something.
Thank you kindly for the help/advice.
If anything needs to be clarified please let me know.
It seems that you have to compile the regex as extended:
regcomp(&re, "\r?\n\r?\n", REG_EXTENDED);
The first atom, \r? is probably unnecessary, because it doesn't add to the blank-line condition if you don't capture the result.
In the above, blank line really means empty line. If you want blank line to mean a line that has no characters except for white space, you can use:
regcomp(&re, "\r?\n[ \t]*\r?\n", REG_EXTENDED);
(I don't think you can use the space character pattern, \s here instead of [ \t], because that would include carriage return and new-line.)
As others have already hinted at, the "simple use of * in the command line` is not a regular expression. This wildcard-matching is called file globbing and has different semantics.
Check what the * in a regex means. It's not like the wildcard "anything" in the command line. The * means that the previous component can appear any amount of times. The wildcard in regex is the .. So if you want to say match anything you can do .*, which would be anything, any amount of times.
So in your case you can do .*\n\n.* which would match anything that has \n\n.
Finally, you can use or in a regex and ( ) to group stuff. So you can do something like .*(\n\n|\r\n\r\n).* And that would match anything that has a \n\n or a \r\n\r\n.
Hope that helps.
Rather than looking for only \r or \n, look for not \r or \n?
Your regex would simply be
'[^\r\n]'
and a match result of false indicates a blank line to your specification.

Updated: When to "mortalize" a variable in Perl Inline::C

I am trying to wrap a C library into Perl. I have tinkered with XS but being unsuccessful I thought I should start simply with Inline::C. My question is on Mortalization. I have been reading perlguts as best as I am able, but am still confused. Do I need to call sv_2mortal on an SV* that is to be returned if I am not pushing it onto the stack?
(PS I really am working on a less than functional knowledge of C which is hurting me. I have a friend who knows C helping me, but he doesn't know any Perl).
I am providing a sample below. The function FLIGetLibVersion simply puts len characters of the library version onto char* ver. My question is will the version_return form of my C code leak memory?
N.B. any other comments on this code is welcomed.
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
use Inline (
C => 'DATA',
LIBS => '-lm -lfli',
FORCE_BUILD => 1,
);
say version_stack();
say version_return();
__DATA__
__C__
#include <stdio.h>
#include "libfli.h"
void version_stack() {
Inline_Stack_Vars;
Inline_Stack_Reset;
size_t len = 50;
char ver[len];
FLIGetLibVersion(ver, len);
Inline_Stack_Push(sv_2mortal(newSVpv(ver,strlen(ver))));
Inline_Stack_Done;
}
SV* version_return() {
size_t len = 50;
char ver[len];
FLIGetLibVersion(ver, len);
SV* ret = newSVpv(ver, strlen(ver));
return ret;
}
Edit:
In an attempt to answer this myself, I tried changing the line to
SV* ret = sv_2mortal(newSVpv(ver, strlen(ver)));
and now when I run the script I get the same output that I did previously plus an extra warning. Here is the output:
Software Development Library for Linux 1.99
Software Development Library for Linux 1.99
Attempt to free unreferenced scalar: SV 0x2308aa8, Perl interpreter: 0x22cb010.
I imagine that this means that I don't need to mortalize in this case? I suspect that the error is saying that I marked for collection something that was already in line for collection. Can someone confirm for me that that is what that warning means?
I've been maintaining Set::Object for many years and had this question, too - perhaps best to look at the source of that code to see when stuff should be mortalised (github.com/samv/Set-Object). I know Set::Object has it right after many changes. I think though, it's whenever you're pushing the SV onto the return stack. Not sure how Inline changes all that.

Resources