We have multiple XML files on Unix that we need to convert into flat files. We did this parsing for a single-level XML file using C (C was chosen because it can talk to Teradata FastLoad, our target system, through an INMOD, so everything completes in a single parse; with other languages we would need two passes, one to convert to a flat file and one to load it into Teradata). For example, the file below
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
</book>
Is converted into
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~44.95~
We achieved this by parsing the file in C. But then we saw the actual format of the XML file, which looks like the sample below. (Please do not treat it as the exact required file; I am just giving an idea.)
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<modified>2010-01-02</modified>
<modified>2010-01-03</modified>
<price>44.95</price>
</book>
It seems this should be converted into two records:
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02~44.95~
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-03~44.95~
Now we feel that our C code is going to become too complex for this requirement, so we are looking at other options that can easily be used on Unix. Can anyone please give us working example code in other languages/tools for Unix?
You can use XSLT. I use Saxon (Java) which can be run on Unix.
This stylesheet handles both of your XML samples:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/book">
<xsl:choose>
<xsl:when test="modified">
<xsl:for-each select="modified">
<xsl:call-template name="dump-line">
<xsl:with-param name="pos" select="position()"/>
</xsl:call-template>
</xsl:for-each>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="#id"/><xsl:text>~</xsl:text>
<xsl:value-of select="author"/><xsl:text>~</xsl:text>
<xsl:value-of select="title"/><xsl:text>~</xsl:text>
<xsl:value-of select="genre"/><xsl:text>~</xsl:text>
<xsl:value-of select="price"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template name="dump-line">
<xsl:param name="pos"/>
<xsl:value-of select="/book/#id"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/author"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/title"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/genre"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/modified[$pos]"/><xsl:text>~</xsl:text>
<xsl:value-of select="/book/price"/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
If there are no modified elements, one record is output. If there are modified elements, it outputs as many records as there are modified elements.
Sample output w/modified elements:
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02~44.95
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-03~44.95
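To run the stylesheet on Unix you can invoke Saxon from the command line, roughly like this (the exact class/jar and options depend on your Saxon version, and book.xml / flat.xsl / book.txt are placeholder file names):
java net.sf.saxon.Transform -s:book.xml -xsl:flat.xsl -o:book.txt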
If you're loading the data into a database, and you have fields that share a many-to-one relationship with other fields, then you need to make sure your database structure is up to scratch, i.e. one table for the book and one table for the modification dates. Otherwise it will look like there are two books when in fact there is one book with two modification dates.
However, if you are loading the data into a database, why are you first converting it to a flat file? You said you wanted to avoid two parsing passes. Well, it looks like you'll have one pass to parse the XML and output a flat file, and another to parse the flat file and load it into the database. Why not simply parse the XML and put the data directly into the database?
There are reasons why formats like XML were invented and one is to encapsulate complicated data relationships in text based documents. By converting to a "flat file" you will lose that complexity. If you are then going to import the data into an environment that can handle that complexity and store those relationships...why not keep it?
Does your database have an API, or can it only import flat files?
---EDIT---
It's easier to reply as part of an answer than as a series of comments.
First, thanks for the clarification.
Second, no I cannot provide example code. Mostly since what you want sounds very specific.
Thirdly, I think you have two options:
1) You have a load of C code already written to parse the XML. You have to consider the cost of throwing it all away and writing it again in Perl and supporting that, against the cost of improving it to import data directly into your Teradata database and the cost of maintaining it thereafter.
2) For Perl, there are many XML parsers, and in my experience they make traversing an XML tree/data structure much, much easier than in C. I'm not a fan of Perl, but I have written code to deal with ready-parsed XML trees in C and I have never failed to hate it. By contrast, doing it in Perl is simpler and probably even quicker.
There are a huge number of Perl modules out there to parse XML. I suggest you search the internet for some reviews on them to decide which is easiest or most appropriate for you to use.
There is a Perl module called Teradata::SQL that should allow you to import the data into your Teradata database. There may be other modules that are easier/simpler/better to use. I have no experience with any of them so cannot make a recommendation. Search http://www.cpan.org for any modules that may be useful.
Lastly, I STRONGLY recommend taking some time to ensure that the design of your Teradata database matches the data going into it. As I stated above, you clearly have a many-to-one relationship between modification dates and books, so you need a table for modification dates, a table for books, and the correct many-to-one relationships in your table design. Putting one entry per line, resulting in multiple lines for the same book with only the modification date varying, is very wrong. There may be other many-to-one relationships, such as author. Imagine book B written by authors A1 and A2 with modification dates M1 and M2. If you use the approach you discussed above of having one line for each combination, you end up with 4 entries for the same book, and it looks like you have 2 books with the same title but written by different authors.
Spend some time to ensure you understand the structure of the data in the XML files. This should be clearly defined by the DTD.
XSLT is an option; check out the xsltproc tool.
Or you can use XQuery, which is arguably easier, though you might need to coerce it into producing text. The following XQuery script does almost what you want (only a few fields are listed):
for $book in doc("book.xml")/book
for $mod in $book/modified
return concat($book/@id, "~", $book/title, "~", $mod, "
")
You can run this through Saxon with
java net.sf.saxon.Query '!method=text' script.xq
Another popular XQuery processor for Unix is XQilla, though I'm not sure it can produce non-XML output.
(There may be a smart alternative to my awkward way of generating a newline.)
How about formatting the line as bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02,2010-01-03~44.95~? Of course, whatever loads the file then has to allow for the fact that the modified field can contain a list of values. That's about as flat as you can make it.
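If that flattened layout is acceptable to your loader, here is a minimal sketch of the idea in Java using only the standard DOM API (the class name and the book.xml path are placeholders, and it assumes one book element per file):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class BookToFlat {
    public static void main(String[] args) throws Exception {
        // Parse the input file (path is a placeholder; pass your own as the first argument).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(args.length > 0 ? args[0] : "book.xml"));
        Element book = doc.getDocumentElement();  // the <book> element

        // Collect all <modified> values into a single comma-separated field.
        NodeList mods = book.getElementsByTagName("modified");
        StringBuilder modified = new StringBuilder();
        for (int i = 0; i < mods.getLength(); i++) {
            if (i > 0) modified.append(',');
            modified.append(mods.item(i).getTextContent());
        }

        // Emit one ~-delimited record for the book.
        String sep = "~";
        System.out.println(book.getAttribute("id") + sep
                + text(book, "author") + sep
                + text(book, "title") + sep
                + text(book, "genre") + sep
                + modified + sep
                + text(book, "price") + sep);
    }

    // Text content of the first child element with the given name, or "" if absent.
    private static String text(Element parent, String name) {
        NodeList list = parent.getElementsByTagName(name);
        return list.getLength() > 0 ? list.item(0).getTextContent() : "";
    }
}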
Related
I have strings that are attributes of a property in my ontology like: "Foo1 hasBar Bar1, Foo2 hasBaz Baz1,..."
What I want to do is to loop through the string turning each triple separated by a comma into an actual triple. BTW, I know the first thought may be "why didn't you just process the data that way with an upload tool like Cellfie" or "call the SPARQL query from a programming language" but for my particular client they would rather just use SPARQL and the ontology is already a given.
I have written a query that does what I want for the first triple and changes the string to remove that triple. E.g., it finds the first triple, turns that into rdf and inserts it into the graph and then changes the original string property to: "Foo2 hasBaz Baz1,..."
So I can just run the query until there are no more strings to process but that's kind of a pain. I've looked through the SPARQL documentation and the examples regarding SPARQL and iteration on this site and I just don't think it is possible given the declarative nature of SPARQL but I wanted to double check. Perhaps if I did something like embed the current query in another query?
I want to create a .bat script file to select and delete the following row:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The above row can appear in any application, e.g. Notepad, MS Word, or any other application's text area.
So I will create a keyboard shortcut for the .bat file, and whenever I press that shortcut, it will search for the above row in the currently open application and delete the entire row.
It is the same as selecting a string with the mouse and then pressing the Delete key to delete it.
As a possible solution:
I could write a macro, but it will not work for all applications, for example a web application with a text area.
There's no practical solution that meets your criteria. An alternative would be to remove the lines directly from the files before opening them in Notepad or other apps. Take a look at the output of help findstr. For each file that has that line at the top, rename the file, read it, discard the first line, and then write the remaining contents back to a new file with the original file name.
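If a small helper program is acceptable instead of pure batch, a rough sketch of that read / drop-first-line / rewrite step might look like this in Java (the file name is a placeholder, and it assumes the declaration sits on the very first line):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class StripXmlDeclaration {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; pass the real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "input.xml");
        List<String> lines = new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        // Drop the first line only if it really is the XML declaration.
        if (!lines.isEmpty() && lines.get(0).trim().startsWith("<?xml")) {
            lines.remove(0);
        }
        // Write the remaining contents back (here: overwrite in place).
        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}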
The line is the XML declaration, and should appear in all XML files.
The simplest way to remove it is to create an XSL transformation that removes the declaration:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
You can create an applet that allows you to run this stylesheet via drag&drop.
Removing the XML declaration is not recommended, because it contains information that is important for reading the XML file correctly. It is also not recommended to open XML files in Word, because that runs the risk of modifying the XML via Word's various auto-layout features, which can make the XML invalid (i.e. not readable by XML applications).
An XML document has two sets of similar tags with different data.
<address>
<door_num>100</door_num>
<street>hundred street</street>
<city>XYZ</city>
</address>
<address>
<door_num>200</door_num>
<street>two hundred street</street>
<city>ABC</city>
<active>1</active>
</address>
What is the best way to index this? A search for door_num 100 and city XYZ must return the document, whereas a search for door_num 100 and city ABC must not return any document. Storing the fields as multivalued does not help here. Also note that the second address with door_num 200 may or may not be present in the XML. Please suggest.
Model this data as nested documents: the address info is stored in nested (child) docs, and then you can query them so that both door_num and city must match on the same nested doc.
Regarding how to actually get them into the index, you have several options:
Write some Java (or Groovy, or any other JVM language) code with SolrJ, build your docs on the client side, and index them (see the sketch after this list).
If you don't like Java, you can still write client-side code in any other language, build your docs as XML/JSON that Solr can ingest, and index them.
If you don't want to write any code at all, try the DataImportHandler (DIH) with XPathEntityProcessor; you might achieve all you need.
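For the SolrJ option, a rough sketch might look like the following (assumes a reasonably recent SolrJ; the core URL and field names such as doc_type are assumptions, doc_type being there only so the block-join query can tell parent docs from child docs):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class NestedAddressExample {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // Parent document.
        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "doc1");
        parent.addField("doc_type", "parent");

        // One child document per <address> block.
        SolrInputDocument addr1 = new SolrInputDocument();
        addr1.addField("id", "doc1_addr1");
        addr1.addField("door_num", 100);
        addr1.addField("city", "XYZ");
        parent.addChildDocument(addr1);

        SolrInputDocument addr2 = new SolrInputDocument();
        addr2.addField("id", "doc1_addr2");
        addr2.addField("door_num", 200);
        addr2.addField("city", "ABC");
        parent.addChildDocument(addr2);

        solr.add(parent);
        solr.commit();

        // Block-join query: return the parent only if the SAME child matches both conditions.
        SolrQuery q = new SolrQuery("{!parent which=doc_type:parent}door_num:100 AND city:XYZ");
        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());

        solr.close();
    }
}

With this layout, door_num:100 AND city:ABC matches no single child, so no parent is returned.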
I have product data in Solr. Solr already returns it as XML via a query.
However, I need the data in a different XML format (just the names of the XML nodes are different) to supply it as a feed to another application.
Any idea how I can do this quickly from Solr?
I finally managed to do this using one of the existing response writers; it does not require writing a new response writer. Here is how I did it.
I used the XSLTResponseWriter to generate custom-format XML. You can find more details here: http://wiki.apache.org/solr/XsltResponseWriter
You can find more information on how to use response writers here: https://wiki.apache.org/solr/QueryResponseWriter
Okay, now before you use it, it needs to be configured.
Step 1: Define QueryResponseWriter for XSLT in your solrconfig.xml
Add the following to your solrconfig.xml, after the query component definition.
<!--
Changes to XSLT transforms are taken into account
every xsltCacheLifetimeSeconds at most.
-->
<queryResponseWriter name="xslt" class="org.apache.solr.response.XSLTResponseWriter">
<int name="xsltCacheLifetimeSeconds">5</int>
</queryResponseWriter>
You can find its documentation at http://wiki.apache.org/solr/XsltResponseWriter
Step 2: Use one of the provided XSLT formats or customize your own
You can either use the example XSLT formats provided in the default Solr download or modify one of them to work the way you want. There are 5 example formats provided already; for instance, example.xsl generates an HTML document with all fields, and you would query it as shown in Step 3 below.
I created a custom.xsl file to implement my own format; I'll come to that later.
Step 3: Query Solr using XSLT
http://localhost:8983/solr/mysolrcore/select?q=*:*&wt=xslt&tr=default.xsl&rows=10
This queries Solr and presents the data in the format defined in default.xsl. Note the wt and tr parameters. You can control how many records you want in the result with rows.
Custom XML format using XSLT
Here's how I built my custom XML format using XSLT. Hope this is helpful to someone.
I used example_rss.xsl as a base and modified it as follows.
<?xml version='1.0' encoding='UTF-8'?>
<!--
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
-->
<!--
Sample transform of Solr query results to custom XML format
-->
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:output
method="xml"
encoding="utf-8"
media-type="application/xml"
/>
<xsl:template match='/'>
<xml version="1.0">
<xsl:apply-templates select="response/result/doc"/>
</xml>
</xsl:template>
<!-- search results xslt -->
<xsl:template match="doc">
<xsl:variable name="id" select="str[#name='Id']"/>
<xsl:variable name="timestamp" select="date[#name='timestamp']"/>
<item>
<id><xsl:value-of select="int[#name='Id']"/></id>
<title><xsl:value-of select="str[#name='Name']"/></title>
<link>
http://localhost:8983/solr/mysolrcore/<xsl:value-of select="string[#name='url']"/>p-<xsl:value-of select="int[#name='Id']"/>
</link>
<image>
<xsl:value-of select="str[#name='ImageURL']"/>
</image>
<category><xsl:value-of select="arr[#name='Category']"/></category>
<availability><xsl:value-of select="bool[#name='StockAvailability']"/></availability>
<description>
<xsl:value-of select="str[#name='ShortDescription']"/>
</description>
</item>
</xsl:template>
</xsl:stylesheet>
This generates a valid XML document without the need to write your own custom response writer.
You will have to write your own ResponseWriter.
The best way is to start by looking at an existing implementation, for instance CSVResponseWriter.
A quick look at the code shows that you get a SolrQueryResponse object in the write method. From the response you can get the matched SolrDocuments and other required information.
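For reference, the QueryResponseWriter contract itself is small; a skeleton along these lines (the XML emitted here is only a placeholder, the real work is walking the response values / doc list the way CSVResponseWriter does) would look roughly like:

import java.io.IOException;
import java.io.Writer;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.QueryResponseWriter;
import org.apache.solr.response.SolrQueryResponse;

public class MyCustomResponseWriter implements QueryResponseWriter {

    @Override
    public void init(NamedList args) {
        // Read any <queryResponseWriter> config params from solrconfig.xml here.
    }

    @Override
    public String getContentType(SolrQueryRequest request, SolrQueryResponse response) {
        return "application/xml; charset=UTF-8";
    }

    @Override
    public void write(Writer writer, SolrQueryRequest request, SolrQueryResponse response)
            throws IOException {
        // response.getValues() holds the response sections (result, facets, ...);
        // walk it / the result context to emit the documents in whatever format you need.
        writer.write("<feed>");
        // ... emit one element per document here ...
        writer.write("</feed>");
    }
}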
SGML is the superset of HTML and XML. There are rich HTML and XML parsers available. Could you please explain the following to me:
Usage of SGML (sample business scenario) in current business domains?
Is it when dealing with legacy systems?
There are HTML and XML parsers for HTML and XML documents. Why SGML parsers?
My thinking might be wrong; please give me some feedback.
Usage of SGML (Sample business scenario) in current business domains?
is it when dealing with legacy systems?
Yes, I think it is mainly for legacy systems, although you can use it for:
1. Weird syntaxes that (ab)use SGML minimization in order to provide less verbose files (when SGML was invented, people wrote SGML files by typing them in by hand, hence several features in SGML are aimed at reducing the number of characters that have to be typed). For example:
{config:
{attribute name="network":127.0.0.0/8 192.168.123.0/30;}
{attribute name="action":allow;}
;}
Instead of:
<config>
<attribute name="network">
127.0.0.0/8 192.168.123.0/30
</attribute>
<attribute name="action">
allow
</attribute>
</config>
(Of course, this use case has several disadvantages, and I'm not sure the benefit outweighs them, but it is worth mentioning.)
2. Conversion from semi-structured, human-readable formats, where parts of the text are actually tags.
For instance, I had an actual job some years ago that involved converting this:
From:
To:
This is the subject
(there is a blank line before the subject,
the subject ends with a blank line,
and everything between parentheses is a comment)
This is the message body
To this
<from>sender</from>
<to>addressee</to>
<subject>This is the subject</subject>
<!-- there is a blank line before the subject,
the subject ends with a blank line,
and everything between parentheses is a comment -->
<body>This is the message body</body>
The actual example was far more complex, with many variations and optional elements, so I found it easier to convert it via SGML than to write a parser for it.
There are HTML and XML parsers for HTML and XML documents. Why SGML parsers?
HTML is a markup language for describing the structure of a web page (BODY, DIV, TABLE, etc.), so it is not suitable for describing more general information such as a configuration file, a list of suppliers, a bibliography, etc. (i.e. you can display such information in a web page written in HTML, but it will be hard for automated systems to extract).
XML, on the other hand, is oriented for describing arbitrary data structures, decoupled from layout issues.
It is easy to parse an XML document, because XML is based on simple rules (the document must be well-formed). It is because of these rules that you cannot parse an SGML file with an XML parser (unless the SGML file happens to be a well-formed XML document).
3. Playing with IGNORE/INCLUDE marked sections, e.g.:
<!ENTITY % withAnswers "IGNORE">
What is the answer to life the universe and everything?
<![%withAnswers;[ 42 ]]>
If you want to include the answers in the produced document, just replace the first line with:
<!ENTITY % withAnswers "INCLUDE">
(But you could also use XML and a parameterized XSLT to achieve the same result)
SGML is not just legacy; a large number of organisations continue to use SGML for document publication in the aeronautical industry (think Boeing / Airbus / Embraer), i.e. their most recent revisions of data are published directly in SGML.
Industries that follow data standards, e.g. the Air Transport Association (ATA), are locked in to the format used by the standards authority, so SGML is still around in a big way.
At some point in the technical publications chain, this usually gets converted to XML and/or HTML, but as an original data source, SGML will be around for some time to come.