I am working with CDA documents. I can validate an XML document against the CDA schema to find out whether it is CDA at all. But if it is CDA, it falls into one of two categories:
Structured CDA (human-readable text)
Unstructured CDA (embedded blob or referenced documents)
What is the key XML element that differentiates a structured CDA document from an unstructured one?
Structured document look for:
ClinicalDocument/component/structuredBody
Blob - unstructured look for:
ClinicalDocument/component/nonXmlBody
Use nonXmlBody/text to embed the blob, or to reference an external document, using the ED datatype.
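As a minimal sketch of that check, assuming the document has already passed schema validation and uses the standard urn:hl7-org:v3 namespace, the two paths can be tested with Python's standard library. The function name classify_cda_body is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by CDA documents; "cda" is just a local prefix for the lookups.
CDA_NS = {"cda": "urn:hl7-org:v3"}


def classify_cda_body(xml_text):
    """Return 'structured', 'unstructured', or 'unknown' for a CDA document.

    Hypothetical helper: it only looks at ClinicalDocument/component and does
    not validate anything else about the document.
    """
    root = ET.fromstring(xml_text)  # root is expected to be ClinicalDocument
    if root.find("cda:component/cda:structuredBody", CDA_NS) is not None:
        return "structured"
    if root.find("cda:component/cda:nonXmlBody", CDA_NS) is not None:
        return "unstructured"
    return "unknown"
```

A header-only document (no component element) comes back as 'unknown', so the caller can decide how to treat it.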
You can represent an unstructured document in CDA as either C-CDA (Consolidated CDA) or HITSP C62. C62 is much more commonly supported today; a quick GitHub search does not show any unstructured C-CDA implementations.
Note: the references and examples below are from non-normative specifications. You will probably need an HL7 membership to view the normative standards.
C-CDA:
From MDHT Models documentation (account required):
SHALL contain exactly one [1..1] templateId ( CONF:7710, CONF:10054 ) such that it
a. SHALL contain exactly one [1..1] @root="2.16.840.1.113883.10.20.21.1.10"
Example
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:hl7-org:v3" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<realmCode code="US"/>
<typeId root="2.16.840.1.113883.1.3"/>
<templateId root="2.16.840.1.113883.10.20.21.1.10"/>
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<code code="18842-5" codeSystem="2.16.840.1.113883.6.1" displayName="Discharge summarization note"/>
<confidentialityCode codeSystem="2.16.840.1.113883.5.25" codeSystemName="ConfidentialityCode"/>
<custodian>
<assignedCustodian>
<representedCustodianOrganization/>
</assignedCustodian>
</custodian>
</ClinicalDocument>
HITSP C62:
From MDHT Models documentation (account required):
SHALL contain exactly one [1..1] templateId ( ) such that it
a. SHALL contain exactly one [1..1] @root="2.16.840.1.113883.3.88.11.62.1"
Example
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:hl7-org:v3" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<realmCode code="US"/>
<typeId root="2.16.840.1.113883.1.3"/>
<!-- HITSP C62 template -->
<templateId root="2.16.840.1.113883.3.88.11.62.1"/>
<!-- HL7 General Header Constraints-->
<templateId root="2.16.840.1.113883.10.20.3"/>
<!-- IHE Medical Documents -->
<templateId root="1.3.6.1.4.1.19376.1.5.3.1.1.1"/>
<!-- IHE Scanned Documents (XDS-SD) -->
<templateId root="1.3.6.1.4.1.19376.1.2.20"/>
<code code="18842-5" codeSystem="2.16.840.1.113883.6.1" displayName="Discharge summarization note"/>
<recordTarget>
<patientRole>
<patient/>
</patientRole>
</recordTarget>
</ClinicalDocument>
You can view some additional XML examples in the MDHT automated test results.
For a receiving organization to differentiate the content of an unstructured document, you should store the content type in the <code> element as shown in the examples. The content type can also be stored in the <classCode> or <typeCode> elements in an XDS submission set.
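A receiver could read these identifiers with a few lines of Python. This is only a sketch: the two template OIDs are the ones quoted in the examples above, the labels attached to them are mine, and describe_document is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

CDA_NS = {"cda": "urn:hl7-org:v3"}

# Template roots taken from the two examples; the labels are illustrative only.
KNOWN_TEMPLATES = {
    "2.16.840.1.113883.10.20.21.1.10": "C-CDA (per the example above)",
    "2.16.840.1.113883.3.88.11.62.1": "HITSP C62",
}


def describe_document(xml_text):
    """Collect the recognised templateIds and the document <code>."""
    root = ET.fromstring(xml_text)
    roots = [t.get("root") for t in root.findall("cda:templateId", CDA_NS)]
    code = root.find("cda:code", CDA_NS)
    return {
        "profiles": [KNOWN_TEMPLATES[r] for r in roots if r in KNOWN_TEMPLATES],
        "code": code.get("code") if code is not None else None,
        "display_name": code.get("displayName") if code is not None else None,
    }
```

Unrecognised templateIds are simply ignored, which matches how the C62 example carries several IHE and HL7 template roots alongside the one of interest.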
All, I have succeeded in indexing a PDF file into Solr with post.jar.
I can see the indexed file when I look at the query results.
But I was wondering where fields like id, stream_content_type, pdf_pdfversion, etc. come from. I searched for them in schema.xml but have not found them. Where are they defined? Did I miss something? Thanks.
This is the metadata produced by Apache Tika.
In addition to Tika's metadata, Solr adds a few fields of its own (defined in ExtractingMetadataConstants). See:
https://wiki.apache.org/solr/ExtractingRequestHandler#Metadata
Documentation
Metadata
As has been implied up to now, Tika produces Metadata about the
document. Metadata often contains things like the author of the file
or the number of pages, etc. The Metadata produced depends on the type
of document submitted. For instance, PDFs have different metadata from
Word docs.
In addition to Tika's metadata, Solr adds the following metadata
(defined in ExtractingMetadataConstants):
"stream_name" - The name of the ContentStream as uploaded to Solr. Depending on how the file is uploaded, this may or may not be set.
"stream_source_info" - Any source info about the stream. See ContentStream.
"stream_size" - The size of the stream in bytes(?)
"stream_content_type" - The content type of the stream, if available.
It is highly recommended that you try using the extract only option to
see what values actually get set for these.
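The "extract only" option mentioned in the quote is the extractOnly=true request parameter of the ExtractingRequestHandler. A small sketch of building such a request URL follows; the /update/extract path is the conventional place the handler is registered, and the host and core name are assumptions to adapt to your setup.

```python
from urllib.parse import urlencode


def extract_only_url(core, base_url="http://localhost:8983", wt="json"):
    """Build a URL that asks Solr to run Tika and return the result
    without indexing anything (extractOnly=true).

    The handler path /update/extract is assumed; check solrconfig.xml.
    """
    params = urlencode({"extractOnly": "true", "wt": wt})
    return f"{base_url}/solr/{core}/update/extract?{params}"
```

Posting the PDF to that URL (for example as a multipart upload) should return the extracted content and the metadata fields, which makes it easy to see where names such as stream_content_type come from.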
Can anybody explain the difference between TEI and SGML format and/or how they are related?
In short: TEI is XML, and XML is SGML.
The "G" in SGML (Standard Generalized Markup Language) means (among several other things) that a markup language may customize its syntax. For instance, you can define an SGML syntax where the tags (or elements) are like [v id:id1] instead of <v id="id1"></v>.
XML is a concrete syntax of SGML, plus several other requirements that subset SGML. In XML (and HTML too) the elements are delimited by angle brackets: <body>. Each start tag in XML must be paired with an explicit end tag, </body>, unless the element is written as an empty-element tag such as <br/>.
So far, we haven't talked about how the document is structured (the document type or schema). XML by itself does not impose restrictions on the document structure. The following is well-formed XML, even though its structure makes little sense:
<item>
<body>
<head>I don't know what I'm doing</head>
</body>
</item>
TEI defines a common structure that all TEI documents must comply with, and assigns a meaning to each tag. For instance:
The actual text (<text>) contains a single text of any kind. This
commonly contains the actual text and other encodings. A text <text>
minimally contains a text body (<body>). The body contains lower-level
text structures like paragraphs (<p>), or different structures for
text genres other than prose [source]
<text>
<body>
<p>For the first time in twenty-five years...</p>
</body>
</text>
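The difference between "well-formed" and "structured as TEI expects" can be illustrated with a rough check in Python. This is a sketch only: it tests the single rule quoted above (a text must contain a body), it ignores namespaces because the snippets here have none (real TEI files normally declare the TEI namespace), and has_minimal_tei_text is a made-up name.

```python
import xml.etree.ElementTree as ET


def has_minimal_tei_text(xml_text):
    """True if some <text> element directly contains a <body>.

    ET.fromstring raises ParseError if the input is not well-formed, so
    reaching the structural check already proves well-formedness.
    """
    root = ET.fromstring(xml_text)
    texts = [root] if root.tag == "text" else root.findall(".//text")
    return any(t.find("body") is not None for t in texts)
```

The odd item/body/head document parses without error but fails this check, while the TEI fragment passes it. A real project would validate against the TEI schema instead of hand-written rules.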
I have product data in Solr. Solr already returns it as XML in response to a query.
However, I need the data in a different XML format (only the names of the XML nodes differ) so that I can supply it as a feed to another application.
Any idea how I can do this quickly from Solr?
I finally managed to do this using one of the existing response writers. It does not require writing a new response writer. Here is how I did it.
I used XSLTResponseWriter to generate the custom XML format. You can find more details here: http://wiki.apache.org/solr/XsltResponseWriter
You can find more information on how to use response writer here: https://wiki.apache.org/solr/QueryResponseWriter
Before you can use it, it needs to be configured.
Step 1: Define a QueryResponseWriter for XSLT in your solrconfig.xml
Add the following to your solrconfig.xml, after the end of your query component definitions.
<!--
Changes to XSLT transforms are taken into account
every xsltCacheLifetimeSeconds at most.
-->
<queryResponseWriter name="xslt" class="org.apache.solr.response.XSLTResponseWriter">
<int name="xsltCacheLifetimeSeconds">5</int>
</queryResponseWriter>
You can find its documentation at http://wiki.apache.org/solr/XsltResponseWriter
Step 2: Use an existing XSLT stylesheet or customize your own
You can either use one of the XSLT stylesheets shipped with the default Solr download or modify one to work the way you want. Five example stylesheets are provided. For instance, example.xsl generates an HTML document with all fields; you select the stylesheet through the tr parameter of the query, as in Step 3.
I customized one into my own custom.xsl to implement my format. I'll come to that later.
Step 3: Query Solr using XSLT
http://localhost:8983/solr/mysolrcore/select?q=*:*&wt=xslt&tr=default.xsl&rows=10
This queries Solr and presents the data in the format defined in default.xsl. Note the wt and tr parameters. The rows parameter sets how many records are returned.
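If the feed is fetched from a script, the same query can be assembled programmatically so the parameters are URL-encoded correctly. A minimal sketch, assuming the host and core name from the URL above; xslt_query_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def xslt_query_url(core, stylesheet, query="*:*", rows=10,
                   base_url="http://localhost:8983"):
    """Build a /select URL that uses the XSLT response writer.

    wt=xslt selects the writer and tr names the stylesheet file.
    """
    params = urlencode({"q": query, "wt": "xslt", "tr": stylesheet, "rows": rows})
    return f"{base_url}/solr/{core}/select?{params}"
```

For example, xslt_query_url("mysolrcore", "custom.xsl") targets the customized stylesheet instead of default.xsl. Note that urlencode percent-encodes the *:* query, which Solr decodes back on receipt.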
Custom XML format using XSLT
Here's how I produced my custom XML format using XSLT. I hope it is helpful to someone.
I used example_rss.xsl as a starting point and modified it as follows.
<?xml version='1.0' encoding='UTF-8'?>
<!--
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
-->
<!--
Sample transform of Solr query results to custom XML format
-->
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:output
method="xml"
encoding="utf-8"
media-type="application/xml"
/>
<xsl:template match='/'>
<xml version="1.0">
<xsl:apply-templates select="response/result/doc"/>
</xml>
</xsl:template>
<!-- search results xslt -->
<xsl:template match="doc">
<xsl:variable name="id" select="str[@name='Id']"/>
<xsl:variable name="timestamp" select="date[@name='timestamp']"/>
<item>
<id><xsl:value-of select="int[@name='Id']"/></id>
<title><xsl:value-of select="str[@name='Name']"/></title>
<link>
http://localhost:8983/solr/mysolrcore/<xsl:value-of select="str[@name='url']"/>p-<xsl:value-of select="int[@name='Id']"/>
</link>
<image>
<xsl:value-of select="str[@name='ImageURL']"/>
</image>
<category><xsl:value-of select="arr[@name='Category']"/></category>
<availability><xsl:value-of select="bool[@name='StockAvailability']"/></availability>
<description>
<xsl:value-of select="str[@name='ShortDescription']"/>
</description>
</description>
</item>
</xsl:template>
</xsl:stylesheet>
This generates a well-formed XML document without the need to write your own custom response writer.
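To sanity-check the field mapping outside Solr, the same renaming can be mimicked on a saved XML response with Python's standard library. This is not the XSLT approach itself, just a client-side sketch of the same idea. It assumes the standard Solr XML response layout (response/result/doc with typed children carrying a name attribute); the output root element name feed is made up, and the link element is left out.

```python
import xml.etree.ElementTree as ET

# Solr field name -> element name in the custom feed (same mapping as the stylesheet).
FIELD_MAP = {
    "Id": "id",
    "Name": "title",
    "ImageURL": "image",
    "Category": "category",
    "StockAvailability": "availability",
    "ShortDescription": "description",
}


def solr_to_feed(solr_xml):
    """Rename the fields of each <doc> in a Solr XML response."""
    response = ET.fromstring(solr_xml)
    feed = ET.Element("feed")  # hypothetical root element name
    for doc in response.findall("result/doc"):
        item = ET.SubElement(feed, "item")
        for field in doc:
            target = FIELD_MAP.get(field.get("name"))
            if target is not None:
                # itertext() also flattens multi-valued <arr> fields
                ET.SubElement(item, target).text = "".join(field.itertext())
    return ET.tostring(feed, encoding="unicode")
```

Fields that are not in the mapping are dropped, and the output order follows the order of the fields in the Solr response rather than a fixed template.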
You will have to write your own ResponseWriter.
The best way is to start by looking at an existing implementation, for instance CSVResponseWriter.
A quick look at the code shows that the write method receives a SolrQueryResponse object. From the response you can get the matching SolrDocuments and other required information.
SGML is the superset of HTML and XML. There are rich HTML and XML parsers available. Could you please explain the following:
What is SGML used for (a sample business scenario) in current business domains?
Is it used when dealing with legacy systems?
There are HTML and XML parsers for HTML and XML documents. Why are SGML parsers needed?
My thinking might be wrong, so please give me some feedback.
Usage of SGML (Sample business scenario) in current business domains?
is it when dealing with legacy systems?
Yes, I think it is mainly for legacy systems, although you can use it for:
1. Weird syntaxes that (ab)use SGML minimization in order to produce less verbose files (when SGML was invented, people wrote SGML files by hand, so SGML has several features aimed at reducing the number of characters that have to be typed)
{config:
{attribute name="network":127.0.0.0/8 192.168.123.0/30;}
{attribute name="action":allow;}
;}
Instead of:
<config>
<attribute name="network">
127.0.0.0/8 192.168.123.0/30
</attribute>
<attribute name="action">
allow
</attribute>
</config>
(Of course, this use case has several disadvantages, and I'm not sure its benefits outweigh them, but it is worth mentioning.)
2. Conversion from semi-structured human formats, where parts of the text are actually tags.
For instance, some years ago I had a real job that involved converting from this:
From:
To:
This is the subject
(there is a blank line before the subject,
the subject ends with a blank line,
and everything between parentheses is a comment)
This is the message body
To this
<from>sender</from>
<to>addressee</to>
<subject>This is the subject</subject>
<!-- there is a blank line before the subject,
the subject ends with a blank line,
and everything between parentheses is a comment -->
<body>This is the message body</body>
The actual example was far more complex, with many variations and optional elements, so I found it easier to convert it through SGML than to write a parser for it.
There are HTML and XML parsers to HTML,xml documents. Why SGML parsers ?
HTML is a markup language for describing the structure of a web page (BODY, DIV, TABLE, etc.), so it is not suitable for describing more general information such as a configuration file, a list of suppliers, a bibliography, etc. (i.e. you can display such information in a web page written in HTML, but it will be hard for automated systems to extract)
XML, on the other hand, is oriented towards describing arbitrary data structures, decoupled from layout issues.
It is easy to parse an XML document, because XML is based on simple rules (the document must be well-formed). It is because of these rules that you cannot parse an SGML file with an XML parser (unless the SGML file is itself a well-formed XML document).
3. Playing with ignore/include marked sections
<!ENTITY % withAnswers "IGNORE">
What is the answer to life the universe and everything?
<![%withAnswers;[ 42 ]]>
If you want to include the answers in the produced document, just replace the first line with:
<!ENTITY % withAnswers "INCLUDE">
(But you could also use XML and a parameterized XSLT to achieve the same result)
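The earlier point that an XML parser rejects SGML that is not itself well-formed XML is easy to see in practice. A small sketch with Python's standard XML parser follows; parses_as_xml is a made-up helper, and the inputs imitate common SGML minimizations (an omitted end tag, an unquoted attribute value).

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text):
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

Here parses_as_xml("<list><item>one<item>two</list>") and parses_as_xml("<attribute name=action>allow</attribute>") are both rejected, while the fully tagged and quoted equivalents are accepted. Reading the minimized forms requires an SGML parser that knows the document's DTD.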
SGML is not just legacy: a large number of organisations in the aeronautical industry (think Boeing / Airbus / Embraer) continue to use SGML for document publication, i.e. the most recent revisions of their data are published directly in SGML.
Industries that follow data standards, e.g. those of the Air Transport Association (ATA), are locked in to the format used by the standards authority, so SGML is still around in a big way.
At some point in the technical publications chain this usually gets converted to XML and/or HTML, but as an original data source SGML will be around for some time to come.
I have a small set of descriptive metadata records (~50), each with a corresponding full-text file (.txt). My understanding is that the Apache Tika framework is used for detecting and extracting metadata and structured text from various types of documents. However, I would also need to implement a linkage mechanism whereby each metadata record is matched to its full text. Can this be done in Solr?
Thanks,
Ilaria
If you have metadata and the document content, you can index the metadata and store the content. Your field definition would look something like this
<field name="filename" type="text" indexed="true" stored="true"/>
... <!-- other metadata /-->
<field name="content" type="text" indexed="false" stored="true"/>
This will allow you to search by any metadata, and give you back the content. You can add as much meta information as required to search the text. I wouldn't index the full text as there is already some structured metadata available.
Apache Tika extracts metadata from HTML pages, PDFs, and similar documents. Since you already have the metadata available, you do not need Tika. Besides, although Tika can read plain text files, there is very little metadata for it to extract from them.
Edit 1:
OK, so the link between the metadata and the content will be maintained in Solr. For example, if you have
File1.txt <-> Metadata1.txt
you could have one record (document) in Solr that has (number of metadata fields + 1 plaintextcontent field) fields. This gives you the flexibility to look up the document by any metadata. For example,
q=filename:File1.txt
or
q=filesize:[1 TO 100]
where filename and filesize are example metadata fields. plaintextcontent would hold your text file content, so the link lives in your Solr schema.
Now the trick is to set up the indexing. Here's one way to do it:
Indexing the text file is very simple. You could use the DataImportHandler's PlainTextEntityProcessor.
Indexing the metadata along with it could be slightly tricky (you need to understand the structure of the metadata). You could use LineEntityProcessor or one of the DataImportHandler Transformers, depending on what suits you best.
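With only about 50 records, another option is to build the combined documents in a small script and send them to Solr as JSON. The sketch below only assembles the payload. The field names filename, content, and filesize follow the examples above, the helper names are hypothetical, and the way you load the metadata depends on your own file format.

```python
import json


def build_solr_doc(filename, content, metadata):
    """Merge one metadata record with its full text into a single Solr document."""
    doc = dict(metadata)        # copy so the caller's dict is not modified
    doc["filename"] = filename  # the key that links metadata and text
    doc["content"] = content
    return doc


def to_update_payload(docs):
    """Serialise a list of documents as a JSON array for Solr's update handler."""
    return json.dumps(docs)
```

The resulting string could be posted to the core's /update handler with Content-Type application/json. Because each record carries both the metadata fields and the text, the linkage needs no extra mechanism.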