Custom format XML from Solr

I have product data in Solr, and Solr already returns it as XML in query responses.
However, I need the data in a different XML format (only the names of the XML nodes differ) so it can be supplied as a feed to another application.
Any idea how I can do this quickly from Solr?

I finally managed to do this using one of the existing response writers, so there was no need to write a new one. Here is how I did it.
I used the XSLTResponseWriter to generate the custom-format XML. You can find more details here: http://wiki.apache.org/solr/XsltResponseWriter
More information on response writers in general is here: https://wiki.apache.org/solr/QueryResponseWriter
Before you can use it, it needs to be configured.
Step 1: Define QueryResponseWriter for XSLT in your solrconfig.xml
Add the following snippet to your solrconfig.xml, after the query component section.
<!-- Changes to XSLT transforms are taken into account
     every xsltCacheLifetimeSeconds at most. -->
<queryResponseWriter name="xslt" class="org.apache.solr.response.XSLTResponseWriter">
  <int name="xsltCacheLifetimeSeconds">5</int>
</queryResponseWriter>
Step 2: Use an existing XSLT format or customize your own
You can either use one of the XSLT stylesheets provided in the default Solr download, or modify one to produce the output you want. Five example stylesheets are provided already. Suppose you use example.xsl, which renders all fields as an HTML document; then you would query as shown below.
I customized my own custom.xsl to implement my own format; I'll come back to that later.
Step 3: Query Solr using XSLT
http://localhost:8983/solr/mysolrcore/select?q=*:*&wt=xslt&tr=example.xsl&rows=10
This queries Solr and presents the data in the format defined by example.xsl. Note the wt and tr parameters; rows controls how many records are returned in the result.
Custom XML format using XSLT
Here's how I produced my custom XML format using XSLT; I hope it helps someone.
I used example_rss.xsl as a base and modified it as follows.
<?xml version='1.0' encoding='UTF-8'?>
<!--
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
-->
<!--
Sample transform of Solr query results to custom XML format
-->
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

  <xsl:output
    method="xml"
    encoding="utf-8"
    media-type="application/xml"
  />

  <xsl:template match='/'>
    <!-- element names beginning with "xml" are reserved, so use a neutral root element -->
    <items>
      <xsl:apply-templates select="response/result/doc"/>
    </items>
  </xsl:template>

  <!-- search results xslt -->
  <xsl:template match="doc">
    <xsl:variable name="id" select="str[@name='Id']"/>
    <xsl:variable name="timestamp" select="date[@name='timestamp']"/>
    <item>
      <id><xsl:value-of select="int[@name='Id']"/></id>
      <title><xsl:value-of select="str[@name='Name']"/></title>
      <link>
        http://localhost:8983/solr/mysolrcore/<xsl:value-of select="str[@name='url']"/>p-<xsl:value-of select="int[@name='Id']"/>
      </link>
      <image>
        <xsl:value-of select="str[@name='ImageURL']"/>
      </image>
      <category><xsl:value-of select="arr[@name='Category']"/></category>
      <availability><xsl:value-of select="bool[@name='StockAvailability']"/></availability>
      <description>
        <xsl:value-of select="str[@name='ShortDescription']"/>
      </description>
    </item>
  </xsl:template>
</xsl:stylesheet>
This generates a valid XML document without the need to write a custom response writer.
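For comparison, the same field-to-element renaming can also be done client-side after fetching the plain XML response. This is a minimal sketch in Python using only the standard library; the field names (Id, Name) follow the stylesheet above, and the response snippet is a trimmed, hypothetical Solr response, not real output:

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical Solr XML response (wt=xml) with two fields per doc.
solr_response = """
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <int name="Id">101</int>
      <str name="Name">Example product</str>
    </doc>
  </result>
</response>
"""

def to_feed(response_xml):
    """Rename Solr's generic <int>/<str> field elements into feed-specific elements."""
    root = ET.fromstring(response_xml)
    feed = ET.Element("items")
    for doc in root.findall("./result/doc"):
        item = ET.SubElement(feed, "item")
        # Map Solr field names to the element names the feed expects.
        ET.SubElement(item, "id").text = doc.find("int[@name='Id']").text
        ET.SubElement(item, "title").text = doc.find("str[@name='Name']").text
    return ET.tostring(feed, encoding="unicode")

print(to_feed(solr_response))
```

The XSLT approach above is still preferable when the consumer should hit Solr directly, since the transformation then happens inside Solr itself.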

You will have to write your own ResponseWriter.
The best way is to start by looking at an existing implementation, for instance CSVResponseWriter.
A quick look at the code shows that you get a SolrQueryResponse object in the write method. From the response you can get the matched SolrDocuments and other required information.

Related

Creating a .bat file to delete a particular row in any application

I want to create a .bat script file to SELECT and DELETE the following row:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The above row can appear in any application, like Notepad, MS Word, or any other application's text area.
So I will create a keyboard shortcut for the .bat file, and whenever I run it, it will search for the above row in the currently opened application and delete the entire row.
It is the same as selecting a string with the mouse and pressing Delete to remove it.
As a solution, I could write a macro, but it will not work for all applications, for example a web application with a text area.
There's no practical solution that meets your criteria. An alternative would be to remove the lines directly from the files before opening them in Notepad or other apps. Take a look at the output of help findstr. For each file that has that line at the top, rename the file, read it, discard the first line, and then write the remaining contents back to a new file with the original file name.
The line is the XML declaration, and should appear in all XML files.
The simplest way to remove it is to create an XSL transformation that removes the declaration:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
You can create an applet that allows you to run this stylesheet via drag&drop.
Removing the XML declaration is not recommended, because it contains information that is important for reading the XML file correctly. It's also not recommended to open XML files in Word, because that risks Word's various auto-layout features modifying the XML and making it invalid (i.e., unreadable by XML applications).
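If the goal is just to strip the declaration from files on disk (rather than inside another running application), a small standard-library Python function can do it. This is a sketch under the assumption that the declaration, when present, is at the very start of the file:

```python
def strip_xml_declaration(text):
    """Remove a leading XML declaration (<?xml ... ?>) if one is present."""
    stripped = text.lstrip()
    if stripped.startswith("<?xml"):
        end = stripped.find("?>")
        if end != -1:
            # Drop the declaration and any newline that followed it.
            return stripped[end + 2:].lstrip("\r\n")
    return text

doc = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<root/>'
print(strip_xml_declaration(doc))  # <root/>
```

Note the caveat above still applies: the declaration carries the document's encoding, so remove it only when you control how the file is consumed afterwards.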

how to index all metatags in nutch

I have installed Nutch 1.9 and configured it to successfully crawl with Solr 4.10.1. I am trying to set Nutch to index metadata as outlined here https://wiki.apache.org/nutch/IndexMetatags
How do I set it to index ALL of the metadata on a site? I set the value for metatags.names to * like this
<property>
<name>metatags.names</name>
<value>*</value>
<description>Names of the metatags to extract, separated by ','. Use '*' to extract all metatags. Prefixes the names with 'metatag.' in the parse-metadata. For instance to index description and keywords, you need to activate the plugin index-metadata and set the
value of the parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
</description>
</property>
but I am unsure of how to set the value for index.parse.md without listing individual metatag names. I tried this
<property>
<name>index.parse.md</name>
<value>meta*</value>
<description>Comma-separated list of keys to be taken from the parse metadata to generate fields. Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin)
</description>
</property>
but that doesn't display any metadata when running
bin/nutch indexchecker http://nutch.apache.org/
and I am sure there is metadata on that site because it returns Parse Metadata when running
bin/nutch parsechecker http://nutch.apache.org/
Any help would be greatly appreciated! Thanks
The index-metadata plugin doesn't work that way; you have to specify the complete key name there, e.g. "metatag.keywords".
Also, the '*' value for "metatags.names" is not a true wildcard; you can't put something like "meta*" there either.
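Since index.parse.md does not accept wildcards, the keys have to be listed explicitly. A working configuration might look like the following; the metatag names here are just common examples, so substitute the ones your sites actually use:

```xml
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords,metatag.author</value>
  <description>Explicit, comma-separated list of parse-metadata keys to index.</description>
</property>
```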

How to update Solr documents on the Solr server side with custom handler / plugin

I have a core with millions of records.
I want to add a custom handler which scans the existing documents and updates one of the fields based on a condition (age > 12, for example).
I prefer doing it on the Solr server side to avoid sending millions of documents to the client and back.
I was thinking of writing a Solr plugin which receives a query and updates some fields on the matching documents (like the delete-by-query handler).
I was wondering whether there are existing solutions or better alternatives.
I searched the web for a while but couldn't find examples of Solr plugins that update documents (I don't need to extend the update handler).
I've written a plugin that uses the following code; it works fine but isn't as fast as I need.
Currently I do:
AddUpdateCommand addUpdateCommand = new AddUpdateCommand(solrQueryRequest);
DocIterator iterator = docList.iterator();
SolrIndexSearcher indexReader = solrQueryRequest.getSearcher();
while (iterator.hasNext()) {
    Document document = indexReader.doc(iterator.nextDoc());
    SolrInputDocument solrInputDocument = new SolrInputDocument();
    addUpdateCommand.clear();
    addUpdateCommand.solrDoc = solrInputDocument;
    addUpdateCommand.solrDoc.setField("id", document.get("id"));
    addUpdateCommand.solrDoc.setField("my_updated_field", new_value);
    updateRequestProcessor.processAdd(addUpdateCommand);
}
But this is very expensive, since the update handler will fetch again a document that I already hold in hand.
Is there a safe way to update the Lucene document and write it back while taking into account all the Solr-related code such as caches, extra Solr logic, etc.?
I was thinking of converting it to a SolrInputDocument and then just adding the document through Solr, but first I'd need to convert all the fields.
Thanks in advance,
Avner
I'm not sure whether the following will improve performance, but I thought it might help you.
Look at SolrEntityProcessor; its description sounds very relevant to what you are searching for:
This EntityProcessor imports data from different Solr instances and cores.
The data is retrieved based on a specified (filter) query.
This EntityProcessor is useful in cases where you want to copy your Solr index
and slightly modify the data in the target index.
In some cases Solr might be the only place where all data is available.
However, I couldn't find an out-of-the-box feature to embed your logic, so you may have to extend SolrEntityProcessor (see its source code).
You probably know these already, but a couple of other points:
1) Make the entire process exploit all the CPU cores available; make it multi-threaded.
2) Use the latest version of Solr.
3) Experiment with two Solr apps on different machines with minimal network delay. This is a tough call: same machine with two processes vs. two machines with more cores but network overhead.
4) Tweak the Solr cache in a way that suits your use case and particular implementation.
5) A couple more resources: Solr Performance Problems and SolrPerformanceFactors
Hope it helps. Let me know the stats regardless; I'm curious, and your numbers might help somebody later.
To point out where to put custom logic, I suggest having a look at the SolrEntityProcessor in conjunction with Solr's ScriptTransformer.
The ScriptTransformer lets you process each entity after it is extracted from the data-import source, manipulate it, and add custom field values before the new entity is written to Solr.
A sample data-config.xml could look like this
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <script>
    <![CDATA[
      function calculateValue(row) {
        row.put("CALCULATED_FIELD", "The age is: " + row.get("age"));
        return row;
      }
    ]]>
  </script>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://localhost:8080/solr/your-core-name"
            query="*:*"
            wt="javabin"
            transformer="script:calculateValue">
      <field column="ID" name="id" />
      <field column="AGE" name="age" />
      <field column="CALCULATED_FIELD" name="update_field" />
    </entity>
  </document>
</dataConfig>
As you can see, you may perform any data transformation that you like and that is expressible in JavaScript, so this is a good place to put your logic and transformations.
You mentioned one constraint may be age > 12. I would handle this via the query attribute of the SolrEntityProcessor: you could write query=age:{12 TO *] so that only records with an age greater than 12 are read for the update.

How to differentiate structured and unstructured CDA?

I am working with CDA documents. I can validate the XML documents against the CDA schema and determine whether an XML document is CDA or not. But if it is CDA, there are two categories of CDA documents:
Structured CDA(Human readable text)
Unstructured CDA(embedded blob or referenced documents)
What is the key XML element that differentiates CDA as structured or unstructured document?
For a structured document, look for:
ClinicalDocument/component/structuredBody
For an unstructured (blob) document, look for:
ClinicalDocument/component/nonXMLBody
Use nonXMLBody/text to include the blob inline, or reference an external document using the ED datatype.
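Programmatically, the check is a simple lookup under the HL7 v3 namespace. Here's a minimal standard-library Python sketch; the two document snippets are trimmed, hypothetical examples rather than complete CDA instances:

```python
import xml.etree.ElementTree as ET

HL7 = "urn:hl7-org:v3"

def cda_body_type(xml_text):
    """Return 'structured', 'unstructured', or 'unknown' for a CDA document."""
    root = ET.fromstring(xml_text)
    # structuredBody => human-readable structured CDA
    if root.find(f"./{{{HL7}}}component/{{{HL7}}}structuredBody") is not None:
        return "structured"
    # nonXMLBody => embedded blob or referenced document
    if root.find(f"./{{{HL7}}}component/{{{HL7}}}nonXMLBody") is not None:
        return "unstructured"
    return "unknown"

structured = ('<ClinicalDocument xmlns="urn:hl7-org:v3">'
              '<component><structuredBody/></component></ClinicalDocument>')
unstructured = ('<ClinicalDocument xmlns="urn:hl7-org:v3">'
                '<component><nonXMLBody><text/></nonXMLBody></component>'
                '</ClinicalDocument>')
print(cda_body_type(structured))    # structured
print(cda_body_type(unstructured))  # unstructured
```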
You can represent an unstructured document in CDA as either C-CDA (Consolidated CDA) or HITSP C62. C62 is much more commonly supported today; a quick GitHub search does not show any unstructured C-CDA implementations.
Note: the references and examples below are from non-normative specifications. You will probably need an HL7 membership to view the normative standards.
C-CDA:
From MDHT Models documentation (account required):
SHALL contain exactly one [1..1] templateId ( CONF:7710, CONF:10054 ) such that it
a. SHALL contain exactly one [1..1] @root="2.16.840.1.113883.10.20.21.1.10"
Example
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:hl7-org:v3" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
  <realmCode code="US"/>
  <typeId root="2.16.840.1.113883.1.3"/>
  <templateId root="2.16.840.1.113883.10.20.21.1.10"/>
  <templateId root="2.16.840.1.113883.10.20.22.1.1"/>
  <code code="18842-5" codeSystem="2.16.840.1.113883.6.1" displayName="Discharge summarization note"/>
  <confidentialityCode codeSystem="2.16.840.1.113883.5.25" codeSystemName="ConfidentialityCode"/>
  <custodian>
    <assignedCustodian>
      <representedCustodianOrganization/>
    </assignedCustodian>
  </custodian>
</ClinicalDocument>
HITSP C62:
From MDHT Models documentation (account required):
SHALL contain exactly one [1..1] templateId ( ) such that it
a. SHALL contain exactly one [1..1] @root="2.16.840.1.113883.3.88.11.62.1"
Example
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:hl7-org:v3" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
  <realmCode code="US"/>
  <typeId root="2.16.840.1.113883.1.3"/>
  <!-- HITSP C62 template -->
  <templateId root="2.16.840.1.113883.3.88.11.62.1"/>
  <!-- HL7 General Header Constraints -->
  <templateId root="2.16.840.1.113883.10.20.3"/>
  <!-- IHE Medical Documents -->
  <templateId root="1.3.6.1.4.1.19376.1.5.3.1.1.1"/>
  <!-- IHE Scanned Documents (XDS-SD) -->
  <templateId root="1.3.6.1.4.1.19376.1.2.20"/>
  <code code="18842-5" codeSystem="2.16.840.1.113883.6.1" displayName="Discharge summarization note"/>
  <recordTarget>
    <patientRole>
      <patient/>
    </patientRole>
  </recordTarget>
</ClinicalDocument>
You can view some additional XML examples in the MDHT automated test results.
For a receiving organization to differentiate the content of an unstructured document, you should store the content type in the <code> element, as shown in the examples. The content type can also be stored in the <classCode> or <typeCode> elements of an XDS submission set.

Unix XML file convert into Flat file

We have multiple XML files on Unix that we need to convert into flat files. We did this for one level of XML using C (C was chosen because it can talk to Teradata FastLoad, our target system, via an inmod, so everything completes in a single parse; in other languages we would need two passes, one to convert to a flat file and one to load into Teradata). For example, the file below
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
</book>
Is converted into
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~44.95~
We achieved this by parsing the file in C. But then we saw the real format of the XML file, shown below. (Please do not consider it the exact required file; I am just giving an idea.)
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<modified>2010-01-02</modified>
<modified>2010-01-03</modified>
<price>44.95</price>
</book>
It seems this should be converted into two records:
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02~44.95~
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-03~44.95~
But now we feel that our C code will become complex for this requirement, so we are looking at other options that can easily be used on Unix. Can anyone give us working example code in other languages/tools for Unix?
You can use XSLT. I use Saxon (Java), which can be run on Unix.
This stylesheet handles both of your XML samples:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/book">
    <xsl:choose>
      <xsl:when test="modified">
        <xsl:for-each select="modified">
          <xsl:call-template name="dump-line">
            <xsl:with-param name="pos" select="position()"/>
          </xsl:call-template>
        </xsl:for-each>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="@id"/><xsl:text>~</xsl:text>
        <xsl:value-of select="author"/><xsl:text>~</xsl:text>
        <xsl:value-of select="title"/><xsl:text>~</xsl:text>
        <xsl:value-of select="genre"/><xsl:text>~</xsl:text>
        <xsl:value-of select="price"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  <xsl:template name="dump-line">
    <xsl:param name="pos"/>
    <xsl:value-of select="/book/@id"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/author"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/title"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/genre"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/modified[$pos]"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/price"/>
    <xsl:text>
</xsl:text>
  </xsl:template>
</xsl:stylesheet>
If there are no modified elements, one record is output. If there are modified elements, it outputs as many records as there are modified elements.
Sample output w/modified elements:
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02~44.95
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-03~44.95
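The same record expansion over repeated modified elements can also be sketched in Python with only the standard library, in case XSLT/Saxon is not available on the box. Field order follows the examples above:

```python
import xml.etree.ElementTree as ET

book_xml = """
<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <modified>2010-01-02</modified>
  <modified>2010-01-03</modified>
  <price>44.95</price>
</book>
"""

def to_records(xml_text):
    """Emit one '~'-delimited record per <modified> element (or a single record if none)."""
    book = ET.fromstring(xml_text)
    fixed = [book.get("id"), book.findtext("author"), book.findtext("title"),
             book.findtext("genre")]
    price = book.findtext("price")
    # One record per modification date; None stands in when there are no dates.
    mods = [m.text for m in book.findall("modified")] or [None]
    lines = []
    for mod in mods:
        fields = fixed + ([mod] if mod is not None else []) + [price]
        lines.append("~".join(fields))
    return lines

for line in to_records(book_xml):
    print(line)
```

Like the stylesheet, this handles both shapes of input: without modified elements it emits one record, and with them it emits one record per date.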
If you're loading the data into a database, and you have fields that share a many-to-one relationship with other fields, then you need to make sure your database structure is up to scratch: one table for the book and one table for the modification dates. Otherwise it will look like there are two books when in fact there is one book with two modification dates.
However, if you are loading the data into a database, why are you first converting it to a flat file? You said you wanted to avoid two parsing passes, but it looks like you'll have one pass to parse the XML and output a flat file, and another to parse the flat file and load it into the database. Why not simply parse the XML and put the data directly into the database?
There are reasons why formats like XML were invented and one is to encapsulate complicated data relationships in text based documents. By converting to a "flat file" you will lose that complexity. If you are then going to import the data into an environment that can handle that complexity and store those relationships...why not keep it?
Does your database have an API, or can it only import flat files?
---EDIT---
It's easier to reply as part of an answer than as a series of comments.
First, thanks for the clarification.
Second, no I cannot provide example code. Mostly since what you want sounds very specific.
Thirdly, I think you have two options:
1) You have a load of C code already written to parse the XML. You have to weigh the cost of throwing it all away, rewriting it in Perl, and supporting that, against the cost of improving it to import data directly into your Teradata database and maintaining it thereafter.
2) For Perl, there are many XML parsers and in my experience they make traversing an XML tree/data structure much much easier than in C. I'm not a fan of Perl, but I have written code to deal with ready parsed XML trees in C and I have never failed to hate it. By contrast, doing it in Perl is simpler and probably even quicker.
There are a huge number of Perl modules out there to parse XML. I suggest you search the internet for some reviews on them to decide which is easiest or most appropriate for you to use.
There is a Perl module called Teradata::SQL that should allow you to import the data into your Teradata database. There may be other modules that are easier/simpler/better to use; I have no experience with any of them, so I cannot make a recommendation. Search http://www.cpan.org for modules that may be useful.
Lastly, I STRONGLY recommend ensuring that you take some time to ensure that the design of your Teradata database matches the data going into it. As I stated above, you clearly have a many to one relationship between modification dates and books, so that means you need a table for modification dates and a table for books and correct many to one relationships in your table design. To put one entry per line, resulting in multiple lines for the same book with only modification date varying is very wrong. There may be other many to one relationships such as author. Imagine book B written by authors A1 and A2 with modification dates of M1 and M2. If you use the approach you discussed above of having one line for each combination, you end up having 4 entries for the same book, and it looks like you have 2 books with the same title but written by different authors.
Spend some time to ensure you understand the structure of the data in the XML files. This should be clearly defined by the DTD.
XSLT is an option; check out the xsltproc tool.
Or you can use the much easier XQuery, though you might need to coerce it into producing text. The following XQuery script does almost what you want (only a few fields are listed):
for $book in doc("book.xml")/book
for $mod in $book/modified
return concat($book/#id, "~", $book/title, "~", $mod, "
")
You can run this through Saxon with
java net.sf.saxon.Query '!method=text' script.xq
Another popular XQuery processor for Unix is XQilla, though I'm not sure it can produce non-XML output.
(There may be a smart alternative to my awkward way of generating a newline.)
How about formatting the line as bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02,2010-01-03~44.95~? Of course, special consideration must be given to the fact that the modified field can then contain a list of values, but that's about as flat as you can make it.
