TEI format vs. SGML format - sgml

Can anybody explain the difference between TEI and SGML format and/or how they are related?

In short: TEI is XML, and XML is SGML.
The "G" in SGML (Standard Generalized Markup Language) means (among several other things) that a markup language may customize its syntax. For instance, you can define an SGML syntax where the tags (or elements) look like [v id:id1] instead of <v id="id1"></v>.
XML is a concrete syntax of SGML, plus several other requirements that restrict it to a subset of SGML. In XML (and HTML too) the elements are delimited by angle brackets: <body>. Each start tag in XML must be paired with an explicit end tag: </body> (or be written as an empty-element tag such as <br/>).
So far, we haven't talked about how the document is structured (the document type or schema). XML by itself does not impose restrictions on the document structure. The following is well-formed XML:
<item>
<body>
<head>I don't know what I'm doing</head>
</body>
</item>
TEI defines a common structure that all TEI documents must comply with, and assigns a meaning to each tag. For instance:
The actual text (<text>) contains a single text of any kind. This
commonly contains the actual text and other encodings. A text <text>
minimally contains a text body (<body>). The body contains lower-level
text structures like paragraphs (<p>), or different structures for
text genres other than prose [source]
<text>
<body>
<p>For the first time in twenty-five years...</p>
</body>
</text>
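The difference between "well-formed" and "structured as TEI expects" can be sketched in a few lines of Python. This is only a toy illustration: the function names are made up for this example, the check covers a single rule (a <text> root with a direct <body> child), and it ignores the TEI namespace because the fragments above do not use one. Real TEI validation is done against the TEI schemas.

```python
import xml.etree.ElementTree as ET

# The two fragments from the answer, collapsed onto single lines.
GENERIC = "<item><body><head>I don't know what I'm doing</head></body></item>"
TEI_FRAGMENT = "<text><body><p>For the first time in twenty-five years...</p></body></text>"


def is_well_formed(xml_string):
    """Return True if an XML parser accepts the string at all."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True


def looks_like_tei_text(xml_string):
    """Toy structural check (an assumption, not the TEI schema):
    the root must be <text> and must have a direct <body> child."""
    root = ET.fromstring(xml_string)
    return root.tag == "text" and root.find("body") is not None
```

Both fragments pass is_well_formed, but only the second one passes looks_like_tei_text. That is the sense in which XML fixes the syntax while TEI fixes the structure.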

Related

Usage of SGML(Standard Generalized Markup Language)

SGML is the superset of HTML and XML. There are rich HTML and XML parsers available. Could you please explain the
usage of SGML (sample business scenario) in current business domains?
Is it when dealing with legacy systems?
There are HTML and XML parsers for HTML and XML documents. Why SGML parsers?
My thinking might be wrong; please give me some feedback.
Usage of SGML (Sample business scenario) in current business domains?
is it when dealing with legacy systems?
Yes, I think it is mainly for legacy systems, although you can use it for:
1. Weird syntaxes that (ab)use SGML minimization in order to produce less verbose files (when SGML was invented, people wrote SGML files by hand, so several SGML features are oriented toward reducing the number of characters that have to be typed):
{config:
{attribute name="network":127.0.0.0/8 192.168.123.0/30;}
{attribute name="action":allow;}
;}
Instead of:
<config>
<attribute name="network">
127.0.0.0/8 192.168.123.0/30
</attribute>
<attribute name="action">
allow
</attribute>
</config>
(Of course, this use case has several disadvantages, and I'm not sure the benefit outweighs the drawbacks, but it is worth mentioning.)
2. Conversion from semi-structured human formats, where parts of the text are actually tags.
For instance, some years ago I had a real job that involved converting from this:
From:
To:
This is the subject
(there is a blank line before the subject,
the subject ends with a blank line,
and everything between parentheses is a comment)
This is the message body
To this:
<from>sender</from>
<to>addressee</to>
<subject>This is the subject</subject>
<!-- there is a blank line before the subject,
the subject ends with a blank line,
and everything between parentheses is a comment -->
<body>This is the message body</body>
The actual example was far more complex, with many variations and optional elements, so I found it easier to convert it through SGML than to write a parser for it.
There are HTML and XML parsers for HTML and XML documents. Why SGML parsers?
HTML is a markup language for describing the structure of a web page (BODY, DIV, TABLE, etc.), so it is not suitable for describing more general information such as a configuration file, a list of suppliers, a bibliography, etc. (i.e. you can display it in a web page written in HTML, but such information will be hard for automated systems to extract).
XML, on the other hand, is oriented toward describing arbitrary data structures, decoupled from layout issues.
It is easy to parse an XML document, because XML is based on simple rules (the document must be well-formed). It is because of these rules that you cannot parse an SGML file with an XML parser (unless the SGML file is itself a well-formed XML document).
3. Playing with ignore/include marked sections:
<!ENTITY % withAnswers "IGNORE">
What is the answer to life the universe and everything?
<![%withAnswers;[ 42 ]]>
If you want to include the answers in the produced document, just replace the first line with:
<!ENTITY % withAnswers "INCLUDE">
(But you could also use XML and a parameterized XSLT to achieve the same result)
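The earlier point that an XML parser rejects SGML unless it happens to be well-formed XML can be checked with a small Python sketch. The two sample strings are invented for this illustration: the first relies on omitted end tags, a minimization that SGML can allow when the DTD permits it, and the second spells every end tag out.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: SGML-style markup with omitted </item> end tags,
# and the same content written as well-formed XML.
SGML_STYLE = "<list><item>one<item>two</list>"
XML_STYLE = "<list><item>one</item><item>two</item></list>"


def parses_as_xml(text):
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True
```

parses_as_xml(SGML_STYLE) is False and parses_as_xml(XML_STYLE) is True. An SGML parser that knows the document's DTD is needed to infer the missing end tags in the first sample.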
SGML is not just legacy: a large number of organisations continue to use SGML for document publication in the aeronautical industry (think Boeing / Airbus / Embraer), i.e. their most recent revisions of data are published directly in SGML.
Industries that follow data standards, e.g. those of the Air Transport Association (ATA), are locked in to the format used by the standards authority, so SGML is still around in a big way.
At some point in the technical publications chain, this usually gets converted to XML and/or HTML, but as an original data source, SGML will be around for some time to come.

Specify language in queries to Search API

When creating a document to add to a search index, you can specify the document language. I've done this, but would now like to query only those docs in a specific language. Is this possible? I assumed it would be trivial (and documented), but I can't find how to do it.
Thanks!
I don't think you currently can, but I haven't seen anything explicitly saying that. I'm inferring from these sentences that the language field is for their use and not for querying.
The language parameter for search.TextField:
Two-letter ISO 639-1 language code for the field's content, to assist in tokenization. If None, the language code of the document will be used.
And Building Queries:
Search supports all space-delimited languages as well as some languages not segmented by spaces (specifically, Chinese, Japanese, Korean, and Thai). For these languages, Search segments the text automatically.
They need to know the language so they know how to parse it into words.
My plan is to just add an additional field to my search documents that has the same value as the language field. It's slightly redundant, but simple to do.
search.Document(
    fields=[
        ...,
        search.TextField(name='language', value=lang),
    ],
    language=lang,
)
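With that extra field in place, a query can be restricted to one language by adding a field restriction to the query string. The helper below is a minimal sketch and makes several assumptions: the function name is made up, the field is called 'language' as in the snippet above, and the Search API's "field:value" and AND query syntax is used to combine the restriction with the user's query.

```python
def with_language_filter(user_query, lang):
    """Build a query string limited to documents whose 'language' field
    (the redundant field added above) matches lang."""
    if not lang.isalpha():
        # Keep the restriction simple: plain codes such as 'en' or 'fr'.
        raise ValueError("expected a plain language code such as 'en'")
    user_query = user_query.strip()
    if not user_query:
        return "language:%s" % lang
    return "language:%s AND (%s)" % (lang, user_query)


# Hypothetical usage inside an App Engine app (index name is an assumption):
#   index = search.Index(name='my-index')
#   results = index.search(with_language_filter('solar panels', 'en'))
```

Because a TextField is tokenized, an exact-match field type such as search.AtomField may be a better fit for a language code, but that is a design choice to verify against your own index.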

How to differentiate structured and unstructured CDA?

I am working with CDA documents. I am able to validate the XML documents against the CDA schema and find out whether the XML is CDA or not. But if it is CDA, there are two categories of CDA documents:
Structured CDA (human-readable text)
Unstructured CDA (embedded blob or referenced documents)
What is the key XML element that differentiates a CDA document as structured or unstructured?
For a structured document, look for:
ClinicalDocument/component/structuredBody
For a blob (unstructured document), look for:
ClinicalDocument/component/nonXmlBody
Use nonXmlBody/text to include the blob, or to reference it, using the ED datatype.
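The two paths above can be checked programmatically. The following is a minimal Python sketch, assuming the document has already been validated as CDA and uses the urn:hl7-org:v3 namespace shown in the examples in this thread; the function name and the return values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# CDA elements live in the HL7 v3 namespace.
NS = {"cda": "urn:hl7-org:v3"}


def cda_body_kind(xml_string):
    """Classify a CDA document by the child of ClinicalDocument/component.

    Returns 'structured', 'unstructured', or 'unknown' (for example when
    the document has a header only).
    """
    root = ET.fromstring(xml_string)  # root is expected to be ClinicalDocument
    if root.find("cda:component/cda:structuredBody", NS) is not None:
        return "structured"
    if root.find("cda:component/cda:nonXmlBody", NS) is not None:
        return "unstructured"
    return "unknown"
```

For example, a document containing <component><structuredBody/></component> yields "structured", and one containing <component><nonXmlBody>...</nonXmlBody></component> yields "unstructured".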
You can represent an unstructured document in CDA as either C-CDA (Consolidated CDA) or HITSP C62. C62 is much more commonly supported today; a quick GitHub search does not show any unstructured C-CDA implementations.
Note: the references and examples below are from non-normative specifications. You will probably need an HL7 membership to view the normative standards.
C-CDA:
From MDHT Models documentation (account required):
SHALL contain exactly one [1..1] templateId ( CONF:7710, CONF:10054 ) such that it
a. SHALL contain exactly one [1..1] @root="2.16.840.1.113883.10.20.21.1.10"
Example
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:hl7-org:v3" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<realmCode code="US"/>
<typeId root="2.16.840.1.113883.1.3"/>
<templateId root="2.16.840.1.113883.10.20.21.1.10"/>
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<code code="18842-5" codeSystem="2.16.840.1.113883.6.1" displayName="Discharge summarization note"/>
<confidentialityCode codeSystem="2.16.840.1.113883.5.25" codeSystemName="ConfidentialityCode"/>
<custodian>
<assignedCustodian>
<representedCustodianOrganization/>
</assignedCustodian>
</custodian>
</ClinicalDocument>
HITSP C62:
From MDHT Models documentation (account required):
SHALL contain exactly one [1..1] templateId ( ) such that it
a. SHALL contain exactly one [1..1] @root="2.16.840.1.113883.3.88.11.62.1"
Example
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:hl7-org:v3" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<realmCode code="US"/>
<typeId root="2.16.840.1.113883.1.3"/>
<!-- HITSP C62 template -->
<templateId root="2.16.840.1.113883.3.88.11.62.1"/>
<!-- HL7 General Header Constraints-->
<templateId root="2.16.840.1.113883.10.20.3"/>
<!-- IHE Medical Documents -->
<templateId root="1.3.6.1.4.1.19376.1.5.3.1.1.1"/>
<!-- IHE Scanned Documents (XDS-SD) -->
<templateId root="1.3.6.1.4.1.19376.1.2.20"/>
<code code="18842-5" codeSystem="2.16.840.1.113883.6.1" displayName="Discharge summarization note"/>
<recordTarget>
<patientRole>
<patient/>
</patientRole>
</recordTarget>
</ClinicalDocument>
You can view some additional XML examples in the MDHT automated test results.
For a receiving organization to differentiate the content of an unstructured document, you should store the content type in the <code> element as shown in the examples. The content type can also be stored in the <classCode> or <typeCode> elements in an XDS submission set.

Index every word of a text file which are delimited by whitespace in solr?

I am implementing Solr 3.6 in my application. I have the data below in my text file:
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
device_id=C010600504-TYGJD3 deployment_mode="Route"
log_id=031006209001 log_type="Anti Virus" log_component="FTP"
log_subtype="Clean" status="Denied" priority=Critical fw_rule_id=""
user_name="hemant" virus="codevirus" FTP_URL="ftp.myftp.com"
FTP_direction="download" filename="hemantresume.doc" file_size="550k"
file_path="deepti/Shortcut to virus.lnk" ftpcommand="RETR"
src_ip=10.103.6.100 dst_ip=10.103.6.66 protocol="TCP" src_port=2458
dst_port=21 dstdomain="myftp.cpm" sent_bytes=162 recv_bytes=45
message="An FTP download of File resume.doc of size 550k from server
ftp.myftp.com could not be completed as file was infected with virus
codevirus"
Now I want to split the above data into key-value pairs, and I want each value to be indexed under its key.
I want the changes to be made in the configuration files. I have gone through the tokenizers, and WhitespaceTokenizer may work, but I want the whole structure to be indexed. Can anyone please help me with this?
Thanks.
I don't know of any tokenizer that does this.
Using static fields:
You have to define all your "keys" as fields in schema.xml. They should have the relevant types (date, string, etc.).
Create a POJO with these fields, parse the key/value pairs, and populate the POJO. Add this POJO to Solr using SolrJ.
Using dynamic fields:
In this case you don't need to define the keys in the schema; use dynamic fields instead (based on the type of data). You still need to parse the key/value pairs and add them to the Solr document. These fields need to be added using the SolrInputDocument.addField method.
As you add new key/value pairs, the client would still need to know of the existence of each new key, but your indexer does not.
This cannot be done with a tokenizer. Tokenizers are called for each field, but you need processing before handing the data to a field.
A Transformer could probably do this, or you might do some straightforward conversion before submitting it as XML. It should not be hard to write something that reads that format and generates the proper XML format for Solr submissions. It sure wouldn't be hard in Python.
For this input:
date=2011-07-08 time=10:55:06 timezone="IST" device_name="CR1000i"
You would need to create the matching fields in a schema, and generate:
<doc>
<field name="date">2011-07-08</field>
<field name="time">10:55:06</field>
<field name="timezone">IST</field>
<field name="device_name">CR1000i</field>
...
Also in this pre-processing, you almost certainly want to convert the first three fields into a single datetime in UTC.
For details about the Solr XML update format, see: http://wiki.apache.org/solr/UpdateXmlMessages
The Apache wiki is down at this exact moment, so try again if there is an error page.
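The pre-processing step described in this answer could look roughly like the following Python sketch. It is an illustration under assumptions, not a drop-in tool: the function names are made up, shlex is used to honour the double-quoted values (so a value containing a stray quote character would need extra handling), and the field names are taken directly from the log keys, which must exist in the schema (or match a dynamic field pattern). Merging date, time and timezone into one UTC datetime is left out.

```python
import shlex
import xml.etree.ElementTree as ET


def parse_log_line(line):
    """Split a key=value log record into a dict, respecting quoted values."""
    fields = {}
    for token in shlex.split(line):  # shlex removes the surrounding quotes
        key, sep, value = token.partition("=")
        if sep:  # skip anything that is not a key=value pair
            fields[key] = value
    return fields


def to_solr_add_xml(fields):
    """Render the dict as a Solr XML update message: <add><doc>...</doc></add>."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

Calling to_solr_add_xml(parse_log_line(record)) on one record from the question produces a message that can be posted to Solr's update handler. Quoted values with spaces, such as file_path, stay in one field.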

Apache Solr converting to XML

I am learning how to use Solr, but I am stuck on how to upload a book in .txt format to Solr. I know I need to convert it into XML format, but I don't know how to do that or what the format looks like. Can someone explain the process step by step?
To avoid creating the input document in XML format yourself, you could use the Tika request handler (it extracts text from various formats, including plain text); see here.
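If you do want to build the XML yourself, the update format is small: an <add> element containing one <doc>, which contains one <field> per value. The Python sketch below shows one way to wrap a book's text in that format. The function name and the field names "id" and "text" are assumptions for illustration and have to match fields defined in your own schema.

```python
import xml.etree.ElementTree as ET


def book_to_solr_xml(doc_id, text):
    """Wrap plain text in a Solr XML update message.

    The field names 'id' and 'text' are hypothetical; use the names
    defined in your schema.xml.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="text").text = text
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(add, encoding="unicode")


# Hypothetical usage with a file on disk:
#   with open("book.txt", encoding="utf-8") as handle:
#       message = book_to_solr_xml("book-1", handle.read())
```

The resulting string is what gets posted to Solr's XML update handler, followed by a commit.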
