How to extract certain keywords/strings in SQL Server?

I'm using Microsoft SQL Server and I have a description column that mentions various IDs that I want to extract. Here's what my input column looks like:
Finding
Lorem ipsum dolor sit amet, consectetur adipiscing elit, APPID-12345 sed do eiusmod tempor incididunt
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat APPID-3782
Duis aute irure dolor in reprehenderit APPID-374 in voluptate velit esse cillum dolore eu fugiat nulla pariatur APPID-3458
Those APPIDs are what I'm trying to extract. It's quite tricky because every ID has an inconsistent length, appears at a different position in the column, and some IDs occur more than once in a row. Here's what I've done so far:
SELECT
Finding,
SUBSTRING(Finding,CHARINDEX('APPID',Finding,1),12) AS pos
FROM STG.Issues_Inventory
WHERE Finding LIKE '%APPID%'
But this only captures the first APPID in each row and ignores any later ones. I'm pretty sure I'm looking at this the wrong way, so I would definitely appreciate your input/help.
Thank you!
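One way to capture every occurrence rather than just the first is to split each row on the APPID marker and keep the digits that follow each fragment. A minimal sketch, assuming SQL Server 2016+ (for STRING_SPLIT), that each ID is the run of digits immediately after 'APPID-', and that the sentinel character CHAR(7) never appears in Finding:
SELECT
    i.Finding,
    -- keep the leading run of digits in each fragment
    'APPID-' + LEFT(part.value, PATINDEX('%[^0-9]%', part.value + ' ') - 1) AS AppId
FROM STG.Issues_Inventory AS i
-- mark each occurrence with a sentinel character, then split on it
CROSS APPLY STRING_SPLIT(REPLACE(i.Finding, 'APPID-', CHAR(7)), CHAR(7)) AS part
WHERE i.Finding LIKE '%APPID-%'
  AND part.value LIKE '[0-9]%';  -- drops the text before the first APPID
This returns one row per extracted ID, so a Finding that mentions two APPIDs yields two rows.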

Related

Solr highlighting - terms with umlaut not found/not highlighted

I am playing with version 7.2 of Solr. I've uploaded a nice collection of German-language texts and am trying out a few queries with highlighting.
If I fire this query with highlighting:
http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.snippets=3&wt=xml&rows=1
I get a nice text back:
<response>
<lst name="responseHeader">
<bool name="zkConnected">true</bool>
<int name="status">0</int>
<int name="QTime">10</int>
<lst name="params">
<str name="hl.snippets">3</str>
<str name="q">trans:Zeit</str>
<str name="hl">true</str>
<str name="hl.q">Kundigung</str>
<str name="hl.fl">trans</str>
<str name="rows">1</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="418" start="0" maxScore="1.6969817">
<doc>
<str name="id">x</str>
<str name="trans">... Zeit ...</str>
<date name="t">2018-03-01T14:32:29.400Z</date>
<int name="l">2305</int>
<long name="_version_">1594374122229465088</long>
</doc>
</result>
<lst name="highlighting">
<lst name="x">
<arr name="trans">
<str> ... <em>Kündigung</em> ... </str>
<str> ... <em>Kündigung</em> ... </str>
</arr>
</lst>
</lst>
</response>
However, if I supply Kündigung as the highlight text, I get no highlights back, as the text/query parser replaced all the ü characters with u.
I have a feeling that I need to supply the correct qparser. How should I specify it? It seems to me that the collection was built with, and is queried with, the default LuceneQParser. How can I supply this parser in the URL above?
UPDATE:
http://localhost:8983/solr/trans/schema/fields/trans returns
{
"responseHeader":{
"status":0,
"QTime":0},
"field":{
"name":"trans",
"type":"text_de",
"indexed":true,
"stored":true}}
Update 2: So I've looked at the managed-schema of my solr installation/collection schema configuration and found the following:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
</fieldType>
The way I interpret this is that since separate query and index analyzers are omitted, the chain above is meant to apply to both query and index. Which... does not show any misconfiguration similar to the one described in the second answer below...
I remembered, though, adding the field trans with type text_de:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"trans",
"type":"text_de",
"stored":true,
"indexed":true}
}' http://localhost:8983/solr/trans/schema
I've deleted all the documents using
curl http://localhost:8983/solr/trans/update?commit=true -d "<delete><query>*:*</query></delete>"
and then reinserting them again:
curl -X POST http://localhost:8983/solr/trans/update?commit=true -H "Content-Type: application/json" -d @all.json
Is it the correct way to "rebuild" the indexes in solr?
UPDATE 3: The charset settings of the standard Java installation were not set to UTF-8:
C:\tmp>java -classpath . Hello
Cp1252
Cp1252
windows-1252
C:\tmp>cat Hello.java
public class Hello {
public static void main(String args[]) throws Exception{
// not cross-platform safe
System.out.println(System.getProperty("file.encoding"));
// jdk1.4
System.out.println(
new java.io.OutputStreamWriter(
new java.io.ByteArrayOutputStream()).getEncoding()
);
// jdk1.5
System.out.println(java.nio.charset.Charset.defaultCharset().name());
}
}
UPDATE 4: Restarted the solr with UTF8 settings:
bin\solr.cmd start -Dfile.encoding=UTF8 -c -p 8983 -s example/cloud/node1/solr
bin\solr.cmd start -Dfile.encoding=UTF8 -c -p 7574 -s example/cloud/node2/solr -z localhost:9983
Checked the JVM settings:
http://localhost:8983/solr/#/~java-properties
file.​encoding UTF8
file.​encoding.​pkg sun.io
reinserted the docs. No change: http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml gives:
<lst name="highlighting">
<lst name="32e42caa-313d-45ed-8095-52f2dd6861a1">
<arr name="trans">
<str> ... <em>Kündigung</em> ...</str>
<str> ... <em>Kündigung</em> ...</str>
</arr>
</lst>
</lst>
http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=K%C3%BCndigung&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml gives:
<lst name="highlighting">
<lst name="32e42caa-313d-45ed-8095-52f2dd6861a1"/>
</lst>
Both uchardet all.json and file -bi all.json report UTF-8.
Running from the Ubuntu subsystem under Windows:
$ export LC_ALL='en_US.UTF-8'
$ export LC_CTYPE='en_US.UTF-8'
$ curl -H "Content-Type: application/json" http://localhost:8983/solr/trans/query?hl=true\&hl.fl=trans\&fl=id -d '
{
"query" : "trans:Kündigung",
"limit" : "1", params: {"hl.q":"Kündigung"}
}'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":21,
"params":{
"hl":"true",
"fl":"id",
"json":"\n{\n \"query\" : \"trans:Kündigung\",\n \"limit\" : \"1\", params: {\"hl.q\":\"Kündigung\"}\n}",
"hl.fl":"trans"}},
"response":{"numFound":124,"start":0,"maxScore":4.3724422,"docs":[
{
"id":"b952b811-3711-4bb1-ae3d-e8c8725dcfe7"}]
},
"highlighting":{
"b952b811-3711-4bb1-ae3d-e8c8725dcfe7":{}}}
$ curl -H "Content-Type: application/json" http://localhost:8983/solr/trans/query?hl=true\&hl.fl=trans\&fl=id -d '
{
"query" : "trans:Kündigung",
"limit" : "1", params: {"hl.q":"Kundigung"}
}'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":18,
"params":{
"hl":"true",
"fl":"id",
"json":"\n{\n \"query\" : \"trans:Kündigung\",\n \"limit\" : \"1\", params: {\"hl.q\":\"Kundigung\"}\n}",
"hl.fl":"trans"}},
"response":{"numFound":124,"start":0,"maxScore":4.3724422,"docs":[
{
"id":"b952b811-3711-4bb1-ae3d-e8c8725dcfe7"}]
},
"highlighting":{
"b952b811-3711-4bb1-ae3d-e8c8725dcfe7":{
"trans":[" ... <em>Kündigung</em> ..."]}}}
UPDATE 5 Without supplying hl.q (http://localhost:8983/solr/trans/select?q=trans:Kundigung&hl=true&hl.fl=trans&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml or http://localhost:8983/solr/trans/select?q=trans:K%C3%BCndigung&hl=true&hl.fl=trans&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml):
<lst name="highlighting">
<lst name="b952b811-3711-4bb1-ae3d-e8c8725dcfe7">
<arr name="trans">
<str> ... <em>Kündigung</em> ... </str>
<str> ... <em>Kündigung</em> ... </str>
<str> ... <em>Kündigung</em> ... </str>
</arr>
</lst>
</lst>
In this case, without hl.q the highlighter took its terms from the query itself, and did a superb job.
Could be a problem with your JVM's encoding. What about -Dfile.encoding=UTF8? Check LC_ALL and LC_CTYPE too. Should be UTF-8.
What field type is the trans field? I have even indexed German text with text_en and have no problems with umlauts in highlighting or search, and I use the LuceneQParser too.
What does the response look like when you query via the Solr Admin UI (http://localhost:8983/solr/#/trans/query) with the hl checkbox activated?
Check your analyzer chain too. I get the same behaviour as you described, when I misconfigure the chain this way:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
</fieldType>
The GermanNormalizationFilterFactory and GermanLightStemFilterFactory both replace umlauts, so with this misconfiguration the index side no longer contains ü while the query side still does.
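To see exactly what each stage of a chain emits, you can ask Solr's field analysis handler (registered at /analysis/field in the default configuration); a quick check against the trans collection from the question:
curl "http://localhost:8983/solr/trans/analysis/field?analysis.fieldtype=text_de&analysis.fieldvalue=K%C3%BCndigung"
The response lists the tokens produced after each tokenizer and filter, so you can confirm whether the ü survives on the index side, the query side, or neither.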
What you need to specify is the field for which the highlighting is done. Just as in q=trans:Zeit, where you qualified the query with the trans field, you need to qualify hl.q as hl.q=trans:Kündigung. Your request then becomes:
http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=trans:Kündigung&hl.snippets=3&wt=xml&rows=1
This answer was humbly assembled from the input of David Smiley, Stefan Matheis, and Erick Erickson of the Solr community and support lists; this post is on their behalf.

Apache Solr search by "AND"

I am working on Apache Solr.
Currently, it is working fine: when I type pork AND belly it returns all documents with both pork and belly in them.
But I need to be able to search pork and belly (lowercase) and get the same result.
As it stands it does not, because it returns all results containing pork, and, or belly.
The easiest way is to change it in JavaScript before sending the query.
But is there a way to do it from Apache Solr by updating the config?
Thanks.
What I did: I tried to handle it in schema.xml by adding a PatternReplaceCharFilterFactory to the dynamic field, but obviously it failed.
Any suggestions?
The eDisMax query parser accepts lower case operators by default. In your solrconfig.xml, specify that parser and you can also explicitly tell it to accept lower case operators:
<requestHandler name="search" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<bool name="lowercaseOperators">true</bool>
</lst>
...
</requestHandler>
If you're using the (e)dismax query handler, just searching pork belly with q.op=AND should work fine. As long as you have a StopFilterFactory configured for your field type (with a proper stopword list), and will be removed automagically. The default stopwords_en.txt file bundled with Solr has and in its list.
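For example, a request along these lines (the collection name mycollection is a placeholder) combines both suggestions:
http://localhost:8983/solr/mycollection/select?defType=edismax&q.op=AND&lowercaseOperators=true&q=pork%20and%20belly
With q.op=AND the remaining terms are all required, and the lowercase and is either treated as an operator (lowercaseOperators=true) or removed by the stopword filter, so pork and belly behaves like pork AND belly.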

Highlight Matched Text for query term in solr

I have Solr 5.1.3 installed under Jetty and have indexed more than 15000 documents using Tika. I have indexed and stored the document publish date and content in Solr. I have enabled highlighting in solrconfig.xml; here is the XML of the request handler for highlighted terms:
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="hl">on</str>
<str name="hl.fl">content</str>
<str name="hl.simple.pre"><b></str>
<str name="hl.simple.post"></b></str>
<str name="f.content.hl.snippets">3</str>
<str name="f.content.hl.fragsize">200</str>
<str name="f.content.hl.maxAnalyzedChars">200000</str>
<str name="f.content.hl.alternateField">content</str>
<str name="f.content.hl.maxAlternateFieldLength">750</str>
</lst>
</requestHandler>
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
<str name="df">content</str>
<str name="hl">on</str>
<str name="hl.fl">content</str>
<str name="hl.simple.pre"><b></str>
<str name="hl.simple.post"></b></str>
<str name="f.content.hl.snippets">3</str>
<str name="f.content.hl.fragsize">200</str>
<str name="f.content.hl.maxAnalyzedChars">200000</str>
<str name="f.content.hl.alternateField">content</str>
<str name="f.content.hl.maxAlternateFieldLength">750</str>
</lst>
</requestHandler>
It returns up to three highlights, with the search text in bold. For example, if I search "Lorem" as the query term, it returns a highlight something like this:
Lorem ipsum dolor sit amet 2016, consectetur adipiscing elit. Sed volutpat metus lorem, a placerat nibh sodales in. Cras in mauris tempus, vulputate felis eu, tincidunt erat.
But when I search for docs whose publish date falls between one year ago and now, it highlights two terms. For example, if I search "Lorem" AND docPublishDate:[2015-01-20 TO 2016-01-20], then it returns highlights something like this:
Lorem ipsum dolor sit amet 2016, consectetur adipiscing elit. Sed volutpat metus lorem, a placerat nibh sodales in. Cras in mauris tempus, vulputate felis eu, tincidunt erat.
I don't want Solr to also highlight the 2016 text. I want it to bold only Lorem. What should I do to achieve this?
Use a filter query to limit the set of documents to be returned instead - filters given as fq parameters are not used for highlighting.
You can also use the hl.q parameter to use a specific query for highlighting, so you could also submit the query to the highlighter without the date part - but this case seems to be better suited to using a filter query.
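For example (a sketch, assuming docPublishDate is a date field, for which range queries need the full ISO-8601 form, and collection1 is a placeholder):
http://localhost:8983/solr/collection1/select?q=Lorem&fq=docPublishDate:[2015-01-20T00:00:00Z%20TO%202016-01-20T00:00:00Z]&hl=true&hl.fl=content
The fq clause restricts the result set exactly as the range clause in q did, but it never reaches the highlighter, so only Lorem is bolded.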

How to extract metatags from HTML files and index them in SOLR and TIKA

I am trying to extract the metatags of HTML files and index them into Solr with the Tika integration. I am not able to extract those metatags with Tika or to display them in Solr.
My HTML file looks like this.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<meta name="type" content="0xyzb"/>
<meta name="category" content="article category"/>
<meta name="first" content="details of the article"/>
<h4>title of the article</h4>
<p class="link">How cite the Article</p>
<p class="list">
<span class="listterm">Length: </span>13 to 15 feet<br>
<span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
<span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
<span class="listterm">Diet: </span>leaves and branches of trees<br>
<span class="listterm">Number of Young: </span>1<br>
<span class="listterm">Home: </span>Sahara<br>
</p>
</p>
My data-config.xml file looks like this
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="/path/to/html/files/"
fileName=".*html|xml" onError="skip"
recursive="false">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size"/>
<field column="file" name="filename"/>
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="product_id" name="product_id" meta="true"/>
<field column="assetid" name="assetid" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="type" name="type" meta="true"/>
<field column="first" name="first" meta="true"/>
<field column="category" name="category" meta="true"/>
</entity>
</entity>
</document>
</dataConfig>
In my schema.xml file I have added the following fields.
<field name="product_id" type="string" indexed="true" stored="true"/>
<field name="assetid" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
<field name="first" type="text_general" indexed="true" stored="true"/>
In my solrconfig.xml file I have added the following code.
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" />
<lst name="defaults">
<str name="config">/path/to/data-config.xml</str>
</lst>
Does anyone know how to extract those metatags from the HTML files and index them in Solr with Tika? Your help will be appreciated.
I don't think meta="true" means what you think it means. It usually refers to things that are about the file rather than the content. So, content-type, etc. Possibly http-equiv will get mapped as well.
Other than that, you need to extract the actual content. You can do that by using format="xml" and then putting an inner entity with XPathEntityProcessor and mapping the paths there. Except, even then, you are stuck because, AFAIK, DIH uses DefaultHtmlMapper, which is extremely restrictive in what it lets through and skips most of the 'class' and 'id' attributes and even elements like 'div'. You can read the list of allowed elements and attributes for yourself in the source code.
Frankly, your easier path is to have a SolrJ client and manage Tika yourself. Then you can set it to use IdentityHtmlMapper, which does not muck about with the HTML; a rough sketch follows.
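A minimal sketch of that approach, assuming the SolrJ 6+ client API and the Tika HTML parser are on the classpath; the core name mycore, the file path, and the target fields mirror the question and are placeholders:
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.BodyContentHandler;

public class IndexHtml {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        // keep all elements/attributes instead of Tika's restrictive default mapping
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
        BodyContentHandler handler = new BodyContentHandler(-1); // no output size limit
        try (InputStream in = Files.newInputStream(Paths.get("/path/to/html/files/doc.html"))) {
            new HtmlParser().parse(in, handler, metadata, context);
        }
        // <meta name="..." content="..."> pairs end up in the Metadata object
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("product_id", metadata.get("product_id"));
        doc.addField("assetid", metadata.get("assetid"));
        doc.addField("title", metadata.get("title"));
        doc.addField("first", handler.toString()); // extracted body text
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            solr.add(doc);
            solr.commit();
        }
    }
}
The IdentityHtmlMapper registration in the ParseContext is the key step: it tells Tika's HtmlParser to pass elements and attributes through rather than applying its default mapping.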
Which version of Solr are you using? If you are using Solr 4.0 or above, then Tika is embedded in it. Tika communicates with Solr using Solr Cell's ExtractingRequestHandler class, which is configured in solrconfig.xml as follows:
<!-- Solr Cell Update Request Handler
http://wiki.apache.org/solr/ExtractingRequestHandler
-->
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
Now in Solr, by default, as you can see in the above configuration, any field extracted from an HTML document that is not declared in schema.xml is prefixed with 'ignored_', i.e. it is mapped to the 'ignored_*' dynamic field in schema.xml. The default schema.xml reads as follows:
<!-- some trie-coded dynamic fields for faster range queries -->
<dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
<dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>
<dynamicField name="*_tf" type="tfloat" indexed="true" stored="true"/>
<dynamicField name="*_td" type="tdouble" indexed="true" stored="true"/>
<dynamicField name="*_tdt" type="tdate" indexed="true" stored="true"/>
<dynamicField name="*_pi" type="pint" indexed="true" stored="true"/>
<dynamicField name="*_c" type="currency" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="random_*" type="random" />
<!-- uncomment the following to ignore any fields that don't already match an existing
field name or dynamic field, rather than reporting them as an error.
alternately, change the type="ignored" to some other type e.g. "text" if you want
unknown fields indexed and/or stored by default -->
<!--dynamicField name="*" type="ignored" multiValued="true" /-->
</fields>
And following is how 'ignored' types are treated:
<!-- since fields of this type are by default not stored or indexed,
any data added to them will be ignored outright. -->
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
So, metadata extracted by Tika is by default put into 'ignored' fields by Solr Cell, and that's why it is neither indexed nor stored.
Therefore, to index and store the metadata, either change the prefix to uprefix=attr_, or create specific fields or dynamic fields for your known metadata and treat them as you want.
So, here is the corrected solrconfig.xml:
<!-- Solr Cell Update Request Handler
http://wiki.apache.org/solr/ExtractingRequestHandler
-->
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
Although this is an older question, I am replying because I recently asked a similar question (no replies or comments after several days) that I sorted out myself and which is relevant here.
Solr has changed much over the years, and the existing documentation on this topic (where it exists) is both confusing and sometimes erroneous.
While lengthy, this reply provides a solution to the question, with an example and documentation.
Briefly, my now-deleted StackOverflow question was "Extracting custom (e.g. <my_id></my_id>) tagged text from HTML using Apache Solr." Ancillary to that task was how to index HTML pages, including custom HTML elements and attributes.
The short answer is that while it is relatively easy to index "standard" HTML elements (a; div; h1; h2; li; meta; p; title; ... https://www.w3.org/TR/2005/WD-xhtml2-20050527/elements.html), it is challenging to include custom tagsets without either the rigid use of properly formatted XML files and Solr's update handlers (see, e.g.: https://lucene.apache.org/solr/guide/6_6/uploading-data-with-index-handlers.html#uploading-data-with-index-handlers), or the use of the captureAttr parameter with Apache Tika (native to Solr via the ExtractingRequestHandler, described below) or other tools such as Apache Nutch.
Standard HTML elements such as <title>Solr HTML Indexing Tests</title> are easily indexed; however, non-standard elements like <my_id>bt-ic8eew2u</my_id> are ignored.
While you could apply XML-based solutions such as <field name="my_id">bt-ic8eew2u</field>, I prefer a facile HTML-based solution -- hence, the HTML metadata approach.
Environment: Arch Linux (x86_64) command-line; Apache Solr 8.7.0; Solr Admin UI (http://localhost:8983/solr/#/gettingstarted/query) in FireFox 83.0
Test file (solr_test9.html):
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">
<head>
<meta charset="UTF-8" />
<title>Solr HTML Indexing Tests</title>
<meta name="date_created" content="2019-11-01" />
<meta name="source_url" content="/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html" />
<!-- <my_id>bt-ic8eew2u</my_id> -->
<meta name="doc_id" content="bt-ic8eeW2U" />
<meta name="date_pub" content="2020-11-16" />
</head>
<body>
<h1>Apples</h1>
<p>I like apples.</p>
<h2>Bananas</h2>
<p>I also like bananas.</p>
<p><div id="div1">This text is located in div element 1.</div></p>
<p><div id="div2">This text is located in div element 2.</div></p>
<br/>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
<br/>
<p>Suspendisse efficitur pulvinar elementum.</p>
<p>My website is BuriedTruth.com.</p>
<h1>Nova Scotia</h1>
<p>Nova Scotia is a province on the east coast of Canada.</p>
<h2>Capital of Nova Scotia</h2>
<p>Halifax is the capital of N.S.</p>
<p>Halifax is also N.S.'s largest city.</p>
<h1>British Columbia</h1>
<h2>Capital of British Columbia</h2>
<p>Victoria is the capital of B.C.</p>
<p>Vancouver is the largest city in B.C., however.</p>
<p>Non-terminated sentence (missing period)</p>
<meta name="date_current" content="2020-11-17" />
<!-- Comments like these are not indexed. -->
<p>Current date: 2020-11-17</p>
</body>
</html>
solrconfig.xml
Here are the relevant additions to my solrconfig.xml file.
<!-- SOLR CELL PLUGINS: -->
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
<!-- https://lucene.472066.n3.nabble.com/Prons-an-Cons-of-Startup-Lazy-a-Handler-td4059111.html -->
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="capture">div</str>
<str name="fmap.div">div</str>
<str name="capture">h1</str>
<str name="fmap.h1">h1</str>
<str name="capture">h2</str>
<str name="fmap.h2">h2_t</str>
<str name="capture">p</str>
<!-- <str name="fmap.p">p_t</str> -->
<str name="fmap.p">p</str>
<!-- COMMENT: note that the entries above refer to standard -->
<!-- HTML elements. As long as you have <meta/> (metadata) -->
<!-- entries ("doc-id", "date_pub" ...) in your schema then -->
<!-- Solr will automatically pick them up when indexing ... -->
<!-- (hence no need to include those, here!). -->
</lst>
</requestHandler>
<!-- https://doc.lucidworks.com/fusion-server/5.2/reference/solr-reference-guide/7.7.2/update-request-processors.html -->
<!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<!-- ======================================== -->
<!-- https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html -->
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="fieldName">p</str>
<!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies): -->
<!-- of this processor as needed: -->
<str name="pattern">\s+</str>
<str name="replacement"> </str>
<bool name="literalReplacement">true</bool>
</processor>
<!-- Solr bug? URLs parse as "rect https..." Managed-schema (Admin UI): defined p as text_general -->
<!-- but did not parse. Looking at content | title: text_general copied to string, so added -->
<!-- copyfield of p (text_general) as p_str ... regex below now works! -->
<!-- https://stackoverflow.com/questions/22178700/solr-extractingrequesthandler-extracting-rect-in-links/64882751#64882751 -->
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="fieldName">p</str>
<!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies): -->
<!-- of this processor as needed: -->
<str name="pattern">rect http</str>
<str name="replacement">http</str>
<bool name="literalReplacement">true</bool>
</processor>
<!-- ======================================== -->
<!-- This needs to be last (may need to clear documents and re-index to see changes, e.g. Solr Admin UI): -->
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
managed-schema (schema.xml):
I edited the Solr schema via the Admin UI. Basically, for whatever HTML metadata you want to index, add a similarly-named field (of the appropriate type: e.g., text_general | string | pdate | ...).
For example, to capture the "doc-id" and "date_pub" metadata I created the following (respective) schema entries:
<field name="doc_id" type="string" uninvertible="true" indexed="true" stored="true"/>
<field name="date_pub" type="pdate" uninvertible="true" indexed="true" stored="true"/>
indexing
Here's how I indexed that HTML test file,
[victoria@victoria solr-8.7.0]$ date; pwd; ls -l; echo; ls -l server/solr/gettingstarted/conf/
Tue Nov 17 02:18:12 PM PST 2020
/mnt/Vancouver/apps/solr/solr-8.7.0
total 1792
drwxr-xr-x 3 victoria victoria 4096 Nov 17 13:26 bin
-rw-r--r-- 1 victoria victoria 946955 Oct 28 02:40 CHANGES.txt
drwxr-xr-x 12 victoria victoria 4096 Oct 29 07:09 contrib
drwxr-xr-x 4 victoria victoria 4096 Nov 15 12:33 dist
drwxr-xr-x 3 victoria victoria 4096 Nov 15 12:33 docs
drwxr-xr-x 6 victoria victoria 4096 Oct 28 02:40 example
drwxr-xr-x 2 victoria victoria 36864 Oct 28 02:40 licenses
-rw-r--r-- 1 victoria victoria 12646 Oct 28 02:21 LICENSE.txt
-rw-r--r-- 1 victoria victoria 766662 Oct 28 02:40 LUCENE_CHANGES.txt
-rw-r--r-- 1 victoria victoria 27540 Oct 28 02:21 NOTICE.txt
-rw-r--r-- 1 victoria victoria 7490 Oct 28 02:40 README.txt
drwxr-xr-x 11 victoria victoria 4096 Nov 15 12:40 server
total 208
drwxr-xr-x 2 victoria victoria 4096 Oct 28 02:21 lang
-rw-r--r-- 1 victoria victoria 33888 Nov 17 13:20 managed-schema
-rw-r--r-- 1 victoria victoria 873 Oct 28 02:21 protwords.txt
-rw-r--r-- 1 victoria victoria 33788 Nov 17 11:36 schema.xml.2020-11-17.13:01
-rw-r--r-- 1 victoria victoria 59248 Nov 17 13:16 solrconfig.xml
-rw-r--r-- 1 victoria victoria 59151 Nov 17 12:59 solrconfig.xml.2020-11-17.13:01
-rw-r--r-- 1 victoria victoria 781 Oct 28 02:21 stopwords.txt
-rw-r--r-- 1 victoria victoria 1124 Oct 28 02:21 synonyms.txt
[victoria@victoria solr-8.7.0]$ solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html
Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 3511453 to stop gracefully.
Waiting up to 180 seconds to see Solr running on port 8983 [|]
Started Solr server on port 8983 (pid=3572520). Happy searching!
/usr/lib/jvm/java-8-openjdk/jre//bin/java -classpath /mnt/Vancouver/apps/solr/solr-8.7.0/dist/solr-core-8.7.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file solr_test9.html (text/html) to [base]/extract
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.755
[victoria@victoria solr-8.7.0]$
... and here is the result (Solr Admin UI: http://localhost:8983/solr/#/gettingstarted/query)
http://localhost:8983/solr/gettingstarted/select?q=*%3A*
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"*:*",
"_":"1605651674401"}},
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
"stream_size":[1428],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"date_created":"2019-11-01T00:00:00Z",
"date_current":["2020-11-17"],
"resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
"title":["Solr HTML Indexing Tests"],
"date_pub":"2020-11-16T00:00:00Z",
"doc_id":"bt-ic8eeW2U",
"source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
"dc_title":["Solr HTML Indexing Tests"],
"content_encoding":["UTF-8"],
"content_type":["application/xhtml+xml; charset=UTF-8"],
"content":[" en-us stream_size 1428 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2019-11-01 resourceName /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html date_pub 2020-11-16 doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html dc:title Solr HTML Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 Solr HTML Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
"div":[" div1 This text is located in div element 1. div2 This text is located in div element 2."],
"p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
"h1":[" Apples Nova Scotia British Columbia"],
"h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
"_version_":1683647678197530624}]
}}
UPDATE -- managed-schema >> schema.xml peculiarities:
While not related to the original question, the following content is related to my answer (above) -- specifically, peculiarities associated with switching from Solr's managed-schema to the classic (user-managed) schema.xml. It is included here to provide a complete solution.
First, add
<schemaFactory class="ClassicIndexSchemaFactory"/>
to your solrconfig.xml file.
Then edit this:
<updateRequestProcessorChain
name="add-unknown-fields-to-the-schema"
default="${update.autoCreateFields:true}"
processor="uuid,remove-blank,field-name-mutating,parse-boolean,
parse-long,parse-double,parse-date,add-schema-fields">
... to this:
<updateRequestProcessorChain
processor="uuid,remove-blank,field-name-mutating,parse-boolean,
parse-long,parse-double,parse-date">
i.e., delete
name="add-unknown-fields-to-the-schema"
default="${update.autoCreateFields:true}"
add-schema-fields
Rename managed-schema to schema.xml, and restart Solr or reload the core to effect the changes.
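Concretely, with the gettingstarted core used above (paths are illustrative):
mv server/solr/gettingstarted/conf/managed-schema server/solr/gettingstarted/conf/schema.xml
solr restart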
To further extend my example (above), here is a sample <updateRequestProcessorChain /> and the output, on the HTML code that I also provided (above).
solrconfig.xml (part):
<updateRequestProcessorChain
processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date">
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="fieldName">p</str>
<!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies): -->
<!-- of this processor as needed: -->
<str name="pattern">\s+</str>
<str name="replacement"> </str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="fieldName">p</str>
<!-- Case-sensitive (and only one pattern:replacement allowed, so use as many copies): -->
<!-- of this processor as needed: -->
<str name="pattern">rect http</str>
<str name="replacement">http</str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="pattern">[sS]olr</str>
<str name="replacement">APPLE</str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">content</str>
<str name="fieldName">title</str>
<str name="pattern">HTML</str>
<str name="replacement">BANANA</str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
output
Note:
changes to "title" (Solr >> APPLE; HTML >> BANANA)
removal of "rect " from the URL in "p" (discussed here: Solr ExtractingRequestHandler extracting "rect" in links)
{
"responseHeader":{
"status":0,
"QTime":32,
"params":{
"q":"*:*",
"_":"1605767164812"}},
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
"stream_size":[1628],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"date_created":"2020-11-11T21:36:38Z",
"date_current":["2020-11-17"],
"resourcename":["/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html"],
"title":["APPLE BANANA Indexing Tests"],
"date_pub":"2020-11-16T21:37:18Z",
"doc_id":"bt-ic8eeW2U",
"source_url":"/mnt/Vancouver/programming/datasci/solr/test/solr_test9.html",
"dc_title":["Solr HTML Indexing Tests"],
"content_encoding":["UTF-8"],
"content_type":["application/xhtml+xml; charset=UTF-8"],
"content":[" en-us stream_size 1628 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html date_created 2020-11-11T21:36:38Z resourceName /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html date_pub 2020-11-16T21:37:18Z doc_id bt-ic8eeW2U source_url /mnt/Vancouver/programming/datasci/APPLE/test/APPLE_test9.html dc:title APPLE BANANA Indexing Tests Content-Encoding UTF-8 Content-Language en-us Content-Type application/xhtml+xml; charset=UTF-8 APPLE BANANA Indexing Tests Lorem ipsum dolor sit amet, consectetur adipiscing elit. "],
"div":[" div1 This text is located in div element 1. div2 This text is located in div element 2. apple This text is located in the \"apple\" (class) div element. banana This text is located in the \"banana\" (class) div element."],
"p":[" I like apples. I also like bananas. Suspendisse efficitur pulvinar elementum. My website is https://buriedtruth.com/ BuriedTruth.com . Nova Scotia is a province on the east coast of Canada. Halifax is the capital of N.S. Halifax is also N.S.'s largest city. Victoria is the capital of B.C. Vancouver is the largest city in B.C., however. Non-terminated sentence (missing period) Current date: 2020-11-17"],
"h1":[" Apples Nova Scotia British Columbia"],
"h2_t":" Bananas Capital of Nova Scotia Capital of British Columbia",
"_version_":1683814668971278336}]
}}

Solr Arabic Search

I want to implement Arabic search in my Solr. I am able to index the documents but not able to search them. When I fetch a document by its ID I get the document back, but not when I search by Arabic words.
Search URL
http://122.166.9.144:8080/solr/tw/select/?q=تأجير الاهلي
Search Response
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">18</int>
<lst name="params">
<str name="q">تأجÙر اÙاÙÙÙ</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>
What could be the problem?
Thanks,
Rohit
Edit: Request/Response Headers
Response Headers
Server Apache-Coyote/1.1
Content-Type application/xml;charset=UTF-8
Transfer-Encoding chunked
Date Mon, 15 Aug 2011 15:37:25 GMT
Request Headers
Host 122.166.9.144:8080
User-Agent Mozilla/5.0 (Windows NT 6.0; rv:5.0) Gecko/20100101 Firefox/5.0
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-us,en;q=0.5
Accept-Encoding gzip, deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection keep-alive
Apparently the server fails to decode the Arabic text in the URL using the right charset. It looks vaguely like it got UTF-8 but thought it was Latin-1. Have you tried wiresharking the conversation to see exactly which URL bytes get sent to the server?
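The Server: Apache-Coyote/1.1 header above suggests Tomcat, which in older versions decodes request URIs as ISO-8859-1 by default. If the bytes on the wire turn out to be valid UTF-8, a common fix (an assumption here, not something the question confirms) is to set the connector's URIEncoding in conf/server.xml:
<!-- conf/server.xml: decode GET parameters as UTF-8 instead of the -->
<!-- ISO-8859-1 default (hypothetical fix; applies to Tomcat) -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
Tomcat 8 and later already default to UTF-8, so this mainly matters on older installations like the one implied here.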
