SGML is a superset of HTML and XML, and there are rich HTML and XML parsers available. Could you please explain the usage of SGML (a sample business scenario) in current business domains?
Is it when dealing with legacy systems?
There are HTML and XML parsers for HTML and XML documents. Why SGML parsers?
My thinking might be wrong, so please give me some feedback.
Usage of SGML (sample business scenario) in current business domains?
Is it when dealing with legacy systems?
Yes, I think it is mainly for legacy systems, although you can use it for:
1. Weird syntaxes that (ab)use SGML minimization in order to produce less verbose files. (When SGML was invented, people wrote SGML files by typing them by hand, so SGML has several features oriented toward reducing the number of characters that had to be typed.) For example:
{config:
{attribute name="network":127.0.0.0/8 192.168.123.0/30;}
{attribute name="action":allow;}
;}
Instead of:
<config>
<attribute name="network">
127.0.0.0/8 192.168.123.0/30
</attribute>
<attribute name="action">
allow
</attribute>
</config>
(Of course, this use case has several disadvantages, and I'm not sure the savings outweigh the drawbacks, but it is worth mentioning.)
2. Conversion from semi-structured human formats, where parts of the text are actually tags.
For instance, some years ago I had an actual job that involved converting this:
From: sender
To: addressee
This is the subject
(there is a blank line before the subject,
the subject ends with a blank line,
and everything between parentheses is a comment)
This is the message body
To this:
<from>sender</from>
<to>addressee</to>
<subject>This is the subject</subject>
<!-- there is a blank line before the subject,
the subject ends with a blank line,
and everything between parentheses is a comment -->
<body>This is the message body</body>
The actual example was far more complex, with many variations and optional elements, so I found it easier to convert it through SGML than to write a dedicated parser for it.
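To give a feel for that kind of conversion, here is a rough Python sketch that handles the simple message format above (the real job was done with SGML tooling, and the field names here just follow the example; a production converter would need to cope with all the variations mentioned):

```python
import re
from xml.sax.saxutils import escape

SAMPLE = """\
From: sender
To: addressee

This is the subject
(everything in parentheses is a comment)

This is the message body
"""

def convert(message):
    """Sketch: convert the semi-structured message format into XML."""
    # Parenthesised text becomes an XML comment.
    comments = []
    def stash(m):
        comments.append('<!-- %s -->' % m.group(1).strip())
        return ''
    message = re.sub(r'\(([^)]*)\)', stash, message)
    # Headers run up to the first blank line; the subject is the next
    # paragraph; everything after the blank line ending it is the body.
    headers, subject, body = (part.strip() for part in message.split('\n\n', 2))
    fields = dict(line.split(':', 1) for line in headers.splitlines())
    parts = ['<from>%s</from>' % escape(fields['From'].strip()),
             '<to>%s</to>' % escape(fields['To'].strip()),
             '<subject>%s</subject>' % escape(subject)]
    parts.extend(comments)
    parts.append('<body>%s</body>' % escape(body))
    return '\n'.join(parts)
```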
There are HTML and XML parsers for HTML and XML documents. Why SGML parsers?
HTML is a markup language for describing the structure of a web page (BODY, DIV, TABLE, etc.), so it is not suitable for describing more general information such as a configuration file, a list of suppliers, a bibliography, etc. (i.e. you can display such information in a web page written in HTML, but it will be hard for automated systems to extract).
XML, on the other hand, is oriented toward describing arbitrary data structures, decoupled from layout issues.
It is easy to parse an XML document, because XML is based on simple rules (the document must be well-formed). It is because of these rules that you cannot parse an SGML file with an XML parser (unless the SGML file happens to be a well-formed XML document).
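A quick illustration with Python's standard-library XML parser: the same configuration data parses fine when it is well-formed XML, but the end-tag omission that SGML permits is rejected outright.

```python
import xml.etree.ElementTree as ET

# A well-formed XML fragment parses fine:
well_formed = '<config><attribute name="action">allow</attribute></config>'
root = ET.fromstring(well_formed)

# SGML allows minimizations such as omitted end tags; the same data
# written that way is not well-formed XML and the parser rejects it.
sgml_style = '<config><attribute name="action">allow</config>'
try:
    ET.fromstring(sgml_style)
    is_well_formed = True
except ET.ParseError:
    is_well_formed = False
```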
3. Playing with IGNORE/INCLUDE marked sections:
<!ENTITY % withAnswers "IGNORE">
What is the answer to life the universe and everything?
<![%withAnswers;[ 42 ]]>
If you want to include the answers in the produced document, just replace the first line with:
<!ENTITY % withAnswers "INCLUDE">
(But you could also use XML and a parameterized XSLT to achieve the same result)
SGML is not just legacy; a large number of organisations continue to use SGML for document publication in the aeronautical industry (think Boeing / Airbus / Embraer), i.e. their most recent revisions of data are published directly in SGML.
Industries that follow data standards, e.g. the Air Transport Association (ATA), are locked into the format used by the standards authority, so SGML is still around in a big way.
At some point in the technical publications chain this usually gets converted to XML and/or HTML, but as an original data source, SGML will be around for some time to come.
Related
I have strings that are attributes of a property in my ontology like: "Foo1 hasBar Bar1, Foo2 hasBaz Baz1,..."
What I want to do is to loop through the string turning each triple separated by a comma into an actual triple. BTW, I know the first thought may be "why didn't you just process the data that way with an upload tool like Cellfie" or "call the SPARQL query from a programming language" but for my particular client they would rather just use SPARQL and the ontology is already a given.
I have written a query that does what I want for the first triple and changes the string to remove that triple. E.g., it finds the first triple, turns it into RDF, inserts it into the graph, and then changes the original string property to: "Foo2 hasBaz Baz1,..."
So I can just run the query repeatedly until there are no more strings to process, but that's kind of a pain. I've looked through the SPARQL documentation and the examples regarding SPARQL and iteration on this site, and I just don't think it is possible given the declarative nature of SPARQL, but I wanted to double-check. Perhaps I could embed the current query in another query?
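For what it's worth, if driving the update from outside the endpoint were ever acceptable, the "run until nothing changes" part is easy to script. Here is a plain-Python sketch of the same peel-one-triple-per-pass logic the query implements (string format taken from the question; no SPARQL engine involved, so this is illustration only):

```python
def peel_first_triple(s):
    """Split off the first comma-separated 'Subj pred Obj' triple.

    Returns (triple, remainder), or (None, s) when nothing is left --
    mirroring what the INSERT/DELETE query does on each run.
    """
    s = s.strip()
    if not s:
        return None, s
    head, _, rest = s.partition(',')
    subj, pred, obj = head.split()
    return (subj, pred, obj), rest.strip()

def extract_all(s):
    """Loop to a fixed point, collecting every triple in the string."""
    triples = []
    while True:
        triple, s = peel_first_triple(s)
        if triple is None:
            break
        triples.append(triple)
    return triples
```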
How do I provide an OpenNLP model for tokenization in Vespa? This mentions that "The default linguistics module is OpenNlp". Is this what you are referring to? If so, can I simply set the set_language index expression by referring to the doc? I did not find any relevant information on how to implement this feature in https://docs.vespa.ai/en/linguistics.html; could you please help me out with this?
Required for CJK support.
Yes, the default tokenizer is OpenNLP and it works with no configuration needed. It will guess the language if you don't set it, but if you know the document language it is better to use set_language (and language=...) in queries, since language detection is unreliable on short text.
However, OpenNLP tokenization (not detection) only supports Danish, Dutch, Finnish, French, German, Hungarian, Irish, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and English (where we use kstem instead). So, no CJK.
To support CJK you need to plug in your own tokenizer as described in the linguistics doc, or else use ngram instead of tokenization, see https://docs.vespa.ai/documentation/reference/schema-reference.html#gram
n-gram is often a good choice with Vespa because it doesn't suffer from the recall problems of CJK tokenization, and by using a ranking model that incorporates proximity (such as nativeRank) you'll still get good relevancy.
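A minimal schema sketch of the two ideas above (the field names are my own, and gram-size 2 is just a common choice for CJK; check the schema reference for the exact syntax in your Vespa version):

```
schema doc {
    document doc {
        field language type string {
            indexing: set_language   # tells Vespa the document language
        }
        field title type string {
            indexing: index
            match {
                gram
                gram-size: 2   # index character bigrams instead of tokens
            }
        }
    }
}
```

Note that fields are processed in order, so the set_language field should come before the text fields it applies to.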
Is there a way to concatenate a prefix onto all the results of a FORMSOF() lookup when doing a CONTAINSTABLE() query? I work in the nordic ski industry, and we sell "rollerskis" for summer training. As this is a pretty obscure word, the parser doesn't quite give me the right inflectional forms I'd like. Specifically, if I try to run a FORMSOF(INFLECTIONAL,"rollerski"), the parsing function sys.dm_fts_parser returns the following terms (no thesaurus, English language):
{"rollerski", "rollerskiing", "rollerskies", "rollerskied"}
That's close to what I need, but it's notably missing the pluralized rollerskis, which is used throughout our website, most notably in the name of several products and product categories. What I would like to do to get a more accurate list is return all the inflectional forms of "ski" and prefix each of them with "roller". That would give me the following list of terms:
{"rollerski", "rollerskis", "rollerskis'", "rollerski's", "rollerskiing", "rollerskies", "rollerskied"}
Is there a way I can achieve this within the CONTAINSTABLE() query?
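I don't know of a way to transform FORMSOF() results inside the query itself, but the approach described (take the inflectional forms of "ski", prefix each with "roller") can be done in application code before building the CONTAINSTABLE() search condition. A hedged Python sketch; the forms list is hardcoded here in place of the actual sys.dm_fts_parser call, and the search string would be passed as a parameter:

```python
def prefixed_forms(forms, prefix='roller'):
    """Prefix each inflectional form and build an OR list usable as a
    CONTAINSTABLE search condition."""
    return ' OR '.join('"%s%s"' % (prefix, form) for form in forms)

# In practice these would come from something like:
#   SELECT display_term
#   FROM sys.dm_fts_parser('FORMSOF(INFLECTIONAL, "ski")', 1033, 0, 0)
ski_forms = ['ski', 'skis', "skis'", "ski's", 'skiing', 'skies', 'skied']
search_condition = prefixed_forms(ski_forms)
```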
Can anybody explain the difference between TEI and SGML format and/or how they are related?
In short TEI is XML, XML is SGML.
The "G" in SGML (Standard Generalized Markup Language) means (among several other things) that a markup language may customize its syntax. For instance, you can define an SGML syntax where the tags (or elements) look like [v id:id1] instead of <v id="id1"></v>.
XML is a concrete syntax of SGML, plus several other requirements that subset SGML. In XML (and HTML too) the elements are delimited by angular brackets: <body>. Each tag in XML must be paired with an explicit end tag: </body>.
So far, we haven't talked about how the document is structured (the document type or schema). XML by itself does not impose restrictions on the document structure. The following is valid (i.e. well-formed) XML:
<item>
<body>
<head>I don't know what I'm doing</head>
</body>
</item>
TEI defines a common structure that all TEI documents must comply with, and assigns a meaning to each tag. For instance:
The actual text (<text>) contains a single text of any kind. This
commonly contains the actual text and other encodings. A text <text>
minimally contains a text body (<body>). The body contains lower-level
text structures like paragraphs (<p>), or different structures for
text genres other than prose [source]
<text>
<body>
<p>For the first time in twenty-five years...</p>
</body>
</text>
When creating a document to add to a search index, you can specify the document language. I've done this, but would now like to query only those docs in a specific language. Is this possible? I assumed it would be trivial (and documented), but I can't find how to do it.
Thanks!
I don't think you can currently, but I haven't seen anything explicitly saying so. I'm inferring from these sentences that the language field is for their use and not for querying.
The language parameter for search.TextField:
Two-letter ISO 639-1 language code for the field's content, to assist in tokenization. If None, the language code of the document will be used.
And Building Queries:
Search supports all space-delimited languages as well as some languages not segmented by spaces (specifically, Chinese, Japanese, Korean, and Thai). For these languages, Search segments the text automatically.
They need to know the language so they know how to parse it into words.
My plan is to just add an additional field to my search documents that has the same value as the language field. It's slightly redundant, but simple to do.
search.Document(
    fields=[
        ...,
        search.TextField(name='language', value=lang),
    ],
    language=lang,
)
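With that redundant field in place, restricting a query to one language is just an extra clause in the query string (the field name 'language' is the one chosen above; the helper below is mine):

```python
def with_language(user_query, lang):
    """Restrict an App Engine Search query string to one document language."""
    return 'language:%s AND (%s)' % (lang, user_query)

query_string = with_language('title:hello', 'en')
# then: results = index.search(query_string)
```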