libxml2 SAX query - c

I am trying to parse an XML file using the SAX interface of libxml2 in C.
My problem is that whitespace between the end of one tag and the start of the next causes the "characters" callback to be executed. For example:
<?xml version="1.0"?>
<doc>
<para>Hello, world!</para>
</doc>
produces these events:
start document
start element: doc
start element: para
characters: Hello, world!
end element: para
characters:
end element: doc
characters:
end document
It would be really nice if this whitespace were not reported as "characters".
Does anybody know why this is happening, or how it can be prevented?

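This happens because whitespace between elements is significant in XML: a conforming parser must pass all character data, including whitespace-only text, on to the application. So libxml2 is just operating according to the specification.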
See, for instance, this discussion.
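If you only want to drop whitespace-only text, one option is to filter it in the callback itself. Below is a minimal sketch, not libxml2's own mechanism. The names is_blank_chunk, on_characters and struct sax_state are made up for illustration, and the buffer type is written as unsigned char so the snippet compiles without libxml2 headers (xmlChar is a typedef for unsigned char). In real code the callback would take const xmlChar * and be assigned to the characters field of your xmlSAXHandler.

```c
#include <stddef.h>

/* Returns 1 if the first len bytes of ch are all XML whitespace
 * (space, tab, CR, LF), otherwise 0. An empty chunk counts as blank.
 * The buffer handed to a SAX characters callback is NOT NUL-terminated,
 * so len must always be respected. */
int is_blank_chunk(const unsigned char *ch, int len)
{
    for (int i = 0; i < len; i++) {
        if (ch[i] != ' ' && ch[i] != '\t' && ch[i] != '\n' && ch[i] != '\r')
            return 0;
    }
    return 1;
}

/* Hypothetical user state, passed to the parser as the SAX context. */
struct sax_state {
    int text_events;   /* number of non-blank character events seen */
};

/* Same shape as libxml2's characters callback:
 *   void characters(void *ctx, const xmlChar *ch, int len); */
void on_characters(void *ctx, const unsigned char *ch, int len)
{
    struct sax_state *state = ctx;

    if (is_blank_chunk(ch, len))
        return;               /* whitespace-only chunk: ignore it */

    state->text_events++;     /* real text: handle it here */
}
```

Two caveats: the parser may deliver one run of text in several chunks, and this filter also drops whitespace-only text in places where it might actually matter to you (mixed content, for example).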

Related

Solr Data Config Error, Open quote is expected for attribute "driver"

I have a postgres data-config file.
<dataConfig>
<dataSource driver=”org.postgresql.Driver” url=”jdbc:postgresql://127.0.0.1:5432/mydb” user=”user” password=”pw” />
...
</dataConfig>
But when I run it, I get this error:
Data Config problem: Open quote is expected for attribute "driver" associated with an element type "dataSource".
What's the problem here? Is the driver information I entered wrong?
Your quotes are wrong.
” and " are not the same character (note how differently they are rendered). XML attribute values must be delimited by the straight quote " (or the straight single quote '), and the same goes for most programming contexts.
The examples in your config file seem to have been mangled by a blog or a text editor along the way.
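If you have many files damaged this way, the fix can be automated. Here is a rough sketch that assumes the file is UTF-8 and already loaded into a NUL-terminated buffer. The function name straighten_quotes is made up for this example. It rewrites the curly double quotes U+201C and U+201D to a plain " in place.

```c
#include <stddef.h>

/* Replaces UTF-8 encoded curly double quotes (U+201C and U+201D) with
 * the ASCII straight quote, in place. The result is never longer than
 * the input, so no extra buffer is needed.
 * Returns the number of quotes replaced. */
size_t straighten_quotes(char *s)
{
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = in;
    size_t replaced = 0;

    while (*in != '\0') {
        /* U+201C is E2 80 9C and U+201D is E2 80 9D in UTF-8.
         * Short-circuit evaluation keeps us from reading past the NUL. */
        if (in[0] == 0xE2 && in[1] == 0x80 &&
            (in[2] == 0x9C || in[2] == 0x9D)) {
            *out++ = '"';
            in += 3;
            replaced++;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
    return replaced;
}
```

A non-zero return value tells you the file contained curly quotes. For a single config file, retyping the quotes in a plain text editor is of course simpler.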

Indexing Zip Files with Apache Solr

I am trying to index zip files via Apache Solr.
My Zip files only contain one CSV file.
My CSV-Files look like this:
"N_NATIONKEY","N_NAME","N_REGIONKEY","N_COMMENT"
0,"ALGERIA ",0,"04.07.11"
1,"ARGENTINA ",1,"04.07.11"
2,"BRAZIL ",1,"04.07.11"
…
I was already able to index the zip file, with the following result:
post http://localhost:8983/solr/first/update/extract?literal.id=zip2&commit=true&captureAttr=true&uprefix=attr_&fmap.content=attr_content
"ignored_":["stream_size",
"461",
"X-Parsed-By",
"org.apache.tika.parser.DefaultParser",
"X-Parsed-By",
"org.apache.tika.parser.pkg.PackageParser",
"stream_content_type",
"text/plain",
"Content-Type",
"application/zip"],
"div":["embedded",
"NATION.csv",
"package-entry"],
"id":"zip2",
"stream_size":[461],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pkg.PackageParser"],
"stream_content_type":["text/plain"],
"content_type":["application/zip"],
"attr_content":[" \n \n \n \n \n \n \n \n \n \n NATION.csv \n \"N_NATIONKEY\",\"N_NAME\",\"N_REGIONKEY\",\"N_COMMENT\"\r\n0,\"ALGERIA \",0,\"04.07.11\"\r\n1,\"ARGENTINA \",1,\"04.07.11\"\r\n2,\"BRAZIL \",1,\"04.07.11\"\r\n3,\"CANADA \",1,\"04.07.11\"\r\n4,\"EGYPT \",4,\"04.07.11\"\r\n5,\"ETHIOPIA \",0,\"04.07.11\"\r\n6,\"FRANCE \",3,\"04.07.11\"\r\n7,\"GERMANY \",3,\"04.07.11\"\r\n8,\"INDIA \",2,\"04.07.11\"\r\n9,\"INDONESIA \",2,\"1\"\r\n10,\"IRAN \",4,\"04.07.11\"\r\n11,\"IRAQ \",4,\"04.07.11\"\r\n12,\"JAPAN \",2,\"04.07.11\"\r\n13,\"JORDAN \",4,\"04.07.11\"\r\n14,\"KENYA \",0,\"04.07.11\"\r\n15,\"MOROCCO \",0,\"04.07.11\"\r\n16,\"MOZAMBIQUE \",0,\"1\"\r\n17,\"PERU \",1,\"04.07.11\"\r\n18,\"CHINA \",2,\"04.07.11\"\r\n19,\"ROMANIA \",3,\"1\"\r\n20,\"SAUDI ARABIA \",4,\"04.07.11\"\r\n21,\"VIETNAM \",2,\"1\"\r\n22,\"RUSSIA \",3,\"04.07.11\"\r\n23,\"UNITED KINGDOM \",3,\"04.07.11\"\r\n24,\"UNITED STATES \",1,\"04.07.11\"\r\n \n\n \n "],
"_version_":1615098997961129984}]
What I want is this:
"N_NATIONKEY":0,
"N_NAME":"ALGERIA ",
"N_REGIONKEY":0,
"N_COMMENT":"04.07.11",
"id":"84f3e0f3-8b13-47d8-818f-52504f79d91a",
"_version_":1615098850670804992
That way I would be able to search on specific columns.
How can I index zipped files like this?
The documentation says that this should be possible with Tika, but I don't really get it.
Something like this is being done for .gz files in the upcoming Solr release (7.6), see SOLR-10981. This does not cover zip, though.
In general, you probably just want to unzip the file and stream its content directly to Solr. The bin/post command can take file content from standard input; you just need to make sure the content type is correct. Check bin/post -h for details.

Solr 5: only first word gets highlighted

Search for: "test string"
Highlighting I get: test string
This has been reported as a bug and is allegedly fixed:
Solr: Multi Word Synonyms : Only first word is highlighting
However, here's my version of Lucene:
<luceneMatchVersion>5.0.0</luceneMatchVersion>
How is it possible that I'm still getting this behaviour?
EDIT:
There are no special settings related to highlighting in my solrconfig.xml
Here is the query I use:
hl=true
&hl.simple.pre=<em>
&hl.simple.post=</em>
&hl.fl=Comments,Summary
My problem was that the application was incorrectly parsing tags returned by the highlighter - not Solr's fault at all!

Camel pollEnrich and xml 'prettyPrint'

I am attempting to use Camel's pollEnrich feature, but it is not behaving as I would like... I'm not saying it's broken, but wondering if there is a way to get the behavior I desire. That is, I have an XML (blueprint) defined route that goes something like this:
<route>
<from uri="direct:a" />
<pollEnrich uri="http:www.somewebsite.com?format=application/xml" />
<to uri="log:com.acme?level=WARN&amp;showStreams=true" />
</route>
Now, the response normally comes back just fine (e.g., in a web browser). The problem seems to be that it is not just on one line, and for some reason Camel reads each line, sequentially into the same buffer, starting at character zero... so what we end up with is one messy line in the output from the pollEnrich. That is, the to uri="log... line prints messages like:
2015-05-26 13:55:26,379 | WARN | a.distr.topic.B] | contentEnrich |
? ? | 142 - org.apache.camel.camel-core - 2.12.0.redhat-610379 |
Exchange[ExchangePattern: InOnly, BodyType:
org.apache.camel.converter.stream.InputStreamCache, Body:
<?xml versi</ElementStatus> ]pe></Status>nd>gin>ys for this element.</Reason>>ame>
(last line vertically offset for emphasis)
I cannot seem to find a way to tell Camel that the result will be in 'prettyPrint' format... Anyone know how? The documentation seems to suggest that this option does not exist--in which case, I'd consider this to be a bug... though I suppose a person could argue that a custom aggregation strategy should be used (and I'd disagree with that person, citing the simplicity of this case) :)
UPDATE#1: even using org.apache.camel.processor.aggregate.UseLatestAggregationStrategy produces the same effect. (i.e., usage as below)
<bean id="latestStrat"
class="org.apache.camel.processor.aggregate.UseLatestAggregationStrategy" />
<route>
<from uri="direct:a" />
<pollEnrich uri="http:www.somewebsite.com?format=application/xml" strategyRef="latestStrat" />
<to uri="log:com.acme?level=WARN&amp;showStreams=true" />
</route>
...going to cross fingers and try org.apache.camel.processor.aggregate.GroupedExchangeAggregationStrategy, but am guessing there is a configuration limitation with Camel always treating EOL characters as message delimiters.
UPDATE #2 - additional information:
The REST(GET) response received (tested with wget) has blank lines and null fields--but no carriage returns (^M). I've tried both the http and http4 components--same result. There is a leading <?xml version="1.0" encoding="UTF-8"?>, but no namespace/style info. I also just noticed that tab characters have been used for the pretty-ish indents. In sum, the response looks like:
<?xml version="1.0" encoding="UTF-8"?><ElementStatus>
<Flag>false</Flag>
<CODE>XYZ</CODE>
<Locale>Western</Locale>
...
(again, where the whitespace indenting has been done with tabs--AND the blank lines have a few tabs too)
...so the "answer" is that this is an apparent limitation of (or bug within) the log component's "showStreams" logic. I implemented Processor in a <bean>, routed the Exchange output from my pollEnrich to that <bean>, and logged the contents instead, and that matches exactly the output from wget.
FYI: this is camel-paxlogging (2.12.0.redhat-610379) - not sure what underlying version of camel that corresponds to, as I don't seem to have a jboss-parent-2.12.0 pom in my maven repo--which is strange, since I have other jboss-parent poms--and the red hat documentation doesn't seem to get into version composition.
FYI#2: and on a related note, when I use GroupedExchangeAggregationStrategy it does produce a List<Exchange>, BUT it behaves effectively the same as UseLatestAggregationStrategy -- i.e., 'grouped' produces a one-item List<Exchange> that only has the pollEnrich result, where 'latest' produces a standalone Exchange object that has only the pollEnrich result. Seems like an error in either GroupedExchangeAggregationStrategy or pollEnrich ... but this will likely be the topic of my next Stack-post.

C Regex capturing group

I'm having trouble understanding how regexes work in C.
Basically I have an XML file (I can't use an XML parser) containing lines like this:
<Node Bla="blabla" Name="this is my name" .... />
<Node Name="this is my name" Bla="blabla" .... />
What I would like to do is extract the name part of each line. So far I have been using the following regex:
char *regex_str = "Name=\"([^\"]*)\"";
But this gives me Name="this is my name"; I'm only looking for the this is my name part.
What am I doing wrong?
You may not need a capturing group.
Assuming your library has lookbehind (which it definitely does if it's PCRE), you can use this regex to match the name:
(?<=[Nn]ame=")[^"]+
See regex demo.
Explanation
the lookbehind (?<=[Nn]ame=") asserts that what precedes is Name=" or name="
[^"]+ matches one or more chars that are not a "
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
Just use a lookbehind to capture the characters that come right after the string Name=" up to the next " symbol:
(?<=Name=\")([^\"]*)
Explanation:
(?<=Name=\") Sets the matching position just after the string Name="
([^\"]*) Captures zero or more characters that are not ".
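If the library is actually POSIX <regex.h> (an assumption, since the question does not say which one is used), lookbehind is not available, but the pattern from the question already works. The whole match is reported in element 0 of the regmatch_t array and the capturing group in element 1, so you only need to read the second entry. A minimal sketch follows; the helper name extract_name is made up for this example.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Copies the value of the first Name="..." attribute found in line
 * into out (NUL-terminated).
 * Returns 0 on success, -1 if the pattern fails to compile, nothing
 * matches, or out is too small. */
int extract_name(const char *line, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];   /* m[0]: whole match, m[1]: first capturing group */
    int rc = -1;

    /* Same pattern as in the question. REG_EXTENDED is required so that
     * unescaped parentheses form a group. */
    if (regcomp(&re, "Name=\"([^\"]*)\"", REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, line, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < out_size) {
            memcpy(out, line + m[1].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }

    regfree(&re);
    return rc;
}
```

Call it once per line with a buffer such as char name[64]. When the regex is compiled without REG_NOSUB, rm_so and rm_eo are byte offsets into the string passed to regexec, which is why the copy starts at line + m[1].rm_so.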
