I have a field in my Solr schema like -
Type - <fieldType name="text" class="solr.TextField" positionIncrementGap="100"/>
Declaration - <field name="description" type="text" indexed="true" stored="true"/>
This description field has values for some documents and no data for the remaining. I am not explicitly setting anything when there is no data for the description field as it is not a required field.
The problem is when I query using description:*, I get a NullPointerException.
I have not changed the schema. When I query description:abc it works; however, description:* and description:abc* both give a NullPointerException.
Is it the correct behavior?
Do text fields need to be given empty ("") values when we have nothing to set?
Exception -
"error":{
"trace":"java.lang.NullPointerException\n\tat
org.apache.solr.schema.TextField.analyzeMultiTerm(TextField.java:139)\n\tat org.apache.solr.search.SolrQueryParser.analyzeIfMultitermTermText(SolrQueryParser.java:147)\n\tat
org.apache.solr.search.SolrQueryParser.getWildcardQuery(SolrQueryParser.java:203)\n\tat
org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:1049)\n\tat
org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:358)\n\tat
org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:257)\n\tat
org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:181)\n\tat
org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170)\n\tat
org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:120)\n\tat
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:72)\n\tat
org.apache.solr.search.QParser.getQuery(QParser.java:143)\n\tat
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:118)\n\tat
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:185)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:351)\n\tat
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)\n\tat
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)\n\tat
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)\n\tat
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)\n\tat
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)\n\tat
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)\n\tat
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)\n\tat
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)\n\tat
java.lang.Thread.run(Thread.java:679)\n",
"code":500}}
A field of type solr.TextField has to have a tokenizer associated with it; otherwise wildcard searches fail.
I created a Jira bug for the same here.
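For reference, a minimal analyzer chain that gives the TextField a tokenizer (a sketch using standard factories shipped with Solr; adjust the filters to your needs):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```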
It should work and does normally work.
Did you by any chance change the field type definition and then not do a full re-index (delete the old index and so on)?
I have seen this before, but it seemed to go away after the index was fully rebuilt. It may be something to do with left-over definitions at the Lucene level.
I'm trying to perform a search sort using Sitecore 9.3 and SXA. The default search configuration allows the user to sort by 'Title'. The search log shows queries with &sort=title_t desc, which is expected.
If I change the sort criterion via /sitecore/content/[site name]/Global/Settings/Facets/Title from Title to Other Title (a field other than the title), I no longer get results from the search results call.
The search log shows that Other Title is not being resolved to other_title_t, failing with the error ERROR Solr Error : [sort param field can't be found: other_title ]
The Sitecore documentation https://doc.sitecore.com/developers/93/platform-administration-and-architecture/en/using-solr-field-name-resolution.html describes the mechanism for resolving fields to the correct type using index config:
<fieldType
fieldTypeName="html|rich text|single-line text|multi-line text|text|memo|image|reference"
returnType="text"
/>
...
</fieldTypes>
which is then used with the type match to append _t to the field name:
<typeMatches hint="raw:AddTypeMatch">
<typeMatch
typeName="text"
type="System.String"
fieldNameFormat="{0}_t"
cultureFormat="_{1}"
settingType="Sitecore.ContentSearch.SolrProvider.SolrSearchFieldConfiguration,
Sitecore.ContentSearch.SolrProvider"
/>
</typeMatches>
This does not appear to be working for sort.
I've found that adding the fieldName (rather than relying on the preconfigured type mapping) works and results in other_title_t being used as the query sort.
<fieldMap type="Sitecore.ContentSearch.SolrProvider.SolrFieldMap, Sitecore.ContentSearch.SolrProvider">
<fieldNames hint="raw:AddFieldByFieldName">
<field fieldName="other title" returnType="string" />
</fieldNames>
</fieldMap>
Should sort field resolution work via type field mapping already? Is this a bug?
I have upgraded to Solr 6.6.5 and changed the luceneMatchVersion accordingly. This is apparently bringing some challenges.
Solr has trouble building the index, complaining:
Error creating document : SolrInputDocument(fields: [sqm=0, partner_id=0, price=7.5,
...
org.apache.solr.common.SolrException: ERROR: [doc=209860] Error adding field 'price'='7.5' msg=For input string: "7.5"
The field type is defined as:
<fieldType name="price" class="solr.IntPointField" sortMissingLast="true" omitNorms="true"/>
Error msg:
Caused by: java.lang.NumberFormatException: For input string: "7.5"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.apache.solr.schema.IntPointField.createField(IntPointField.java:181)
at org.apache.solr.schema.PointField.createFields(PointField.java:216)
at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:72)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:179)
What needs to be changed in the price field type to match the given Solr version?
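Since "7.5" is not an integer, solr.IntPointField cannot parse it. Assuming the values really are decimal prices, one fix (a sketch; the field name is kept from the question, and a full re-index is needed after the change) is to declare the field as a float point type, which Solr 6.6 ships with:

```xml
<fieldType name="price" class="solr.FloatPointField" sortMissingLast="true" omitNorms="true"/>
```

solr.DoublePointField is the alternative if more precision is needed.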
We are trying to index about 10,000 PDFs against a MySQL table. We are using Solr 5.2.1, Tika 1.7 and PDFBox 1.8.8. The data import handler keeps giving various errors, which halt the entire process. Most errors have to do with unreadable content or a PDF file not being found. I understand this, and would like the process to continue past, or skip, the problem files. But no matter how I set the onerror directive, it does not seem to work. We have indexed smaller sets of PDFs using the same methods with no problem, but the continuous errors on this larger store are stopping us in our tracks! I would appreciate any advice.
Here is entity from data-config.xml:
<entity name="proceedings" dataSource="proceedings_db" onerror="skip"
query="SELECT productID, title, fileName, buildDate
FROM products
WHERE status = '1'
">
<field column="productID" name="uid" />
<field column="title" name="paper_title" />
<field column="fileName" name="filename" />
<field column="buildDate" name="builddate" />
<entity name="file" dataSource="proceedings_files" processor="TikaEntityProcessor" url="${proceedings.filename}" format="text" onerror="skip">
</entity>
</entity>
I have tried setting onerror for the outer entity, for the inner, and for both (as above). I have tried skip and continue for all of those combinations. It seems to have no impact.
Here is an example of an error I get:
ERROR FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
and
java.io.EOFException
at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139)
at org.apache.fontbox.ttf.HorizontalMetricsTable.initData(HorizontalMetricsTable.java:62)
at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:411)
at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:162)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:514)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
These are not the only errors, mind you, but they are representative. I'd prefer not to hunt down and zap every offending PDF, but if that is what it takes, then fine. But as with the error above, I don't even get a filename, even if I broaden the log output settings. Sometimes I do, and I zap that file, but it does not happen each time.
I guess I do not understand what purpose the onerror settings serve, when it seems like the only working option is to stop! Again, I'd be most appreciative of any advice or suggestions as to what I am doing wrong.
Thanks in advance!
Using the DIH to index a large amount of PDF files in a real prod environment is NOT a good idea:
Some PDFs will fail to be extracted, no matter what. If you need to extract content by any means, you should catch the ones that error and run them through a second extraction library (different from PDFBox), as no two libraries fail on the same set of files.
Some will OOM; same as above, try another library if needed once your process continues.
Worse, some PDFs will hang while extracting. You need to set up some infrastructure to detect those cases and kill the extracting process so it moves on to the next one. Ditto on the second library if you feel like it.
Bottom line: you should use some other setup, either an ETL tool or some custom code that does the extraction and then the indexing into Solr, taking the above recommendations into account.
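The catch-and-fall-back loop above could be sketched like this (illustrative Python, not Solr/DIH code; the extractor functions are placeholders for calls into PDFBox, Tika, or whatever second library you pick):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def extract_with_fallback(path, extractors, timeout_s=30.0):
    """Try each (name, extract_fn) pair in turn; skip any that raises or hangs.

    Returns (extractor_name, text), or (None, None) if every library failed.
    """
    for name, extract in extractors:
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            future = pool.submit(extract, path)
            return name, future.result(timeout=timeout_s)
        except FutureTimeout:
            # A hung thread cannot truly be killed from Python; in a real
            # pipeline, run the extractor in a subprocess and terminate it.
            pass
        except Exception:
            pass  # corrupt input for this library; try the next one
        finally:
            pool.shutdown(wait=False)
    return None, None
```

The documents that come out the other end can then be posted to Solr in batches, and the (None, None) cases logged with their filenames so you finally know which PDFs are the offenders.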
When I try to re-index my Sitecore 7 site using Solr, I get the following errors in the Solr log.
10232 09:10:03 WARN Crawler : AddRecursive DoItemAdd failed - {550B5CEF-242C-463F-8ED5-983922A39863}
Exception: System.IO.IOException
Message: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.
Source: System
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
at System.Net.ConnectStream.InternalWrite(Boolean async, Byte[] buffer, Int32 offset, Int32 size, AsyncCallback callback, Object state)
at System.Net.ConnectStream.Write(Byte[] buffer, Int32 offset, Int32 size)
at SolrNet.Impl.SolrConnection.CopyTo(Stream input, Stream output)
at SolrNet.Impl.SolrConnection.PostStream(String relativeUrl, String contentType, Stream content, IEnumerable`1 parameters)
at SolrNet.Impl.SolrConnection.Post(String relativeUrl, String s)
at SolrNet.Impl.SolrBasicServer`1.SendAndParseHeader(ISolrCommand cmd)
at Sitecore.ContentSearch.SolrProvider.SolrBatchUpdateContext.AddRange(IEnumerable`1 group, Int32 groupSize)
at Sitecore.ContentSearch.SolrProvider.SolrBatchUpdateContext.AddDocument(Object itemToAdd, IExecutionContext[] executionContexts)
at Sitecore.ContentSearch.SitecoreItemCrawler.DoAdd(IProviderUpdateContext context, SitecoreIndexableItem indexable)
at Sitecore.ContentSearch.HierarchicalDataCrawler`1.CrawlItem(Tuple`3 tuple)
Nested Exception
Exception: System.Net.Sockets.SocketException
Message: An existing connection was forcibly closed by the remote host
Source: System
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
Any ideas why this would be happening?
Try looking at your Solr logs for any errors coming from the SolrCore by browsing to http://yoursolrinstance/solr/#/~logging
I found that Solr was having an issue with fields generated from the dynamicField type.
Solr in my instance was looking for 'myfieldname_t_cs' and throwing an 'unknown field' exception for items in the cs-CZ language.
The dynamic field definition in the Solr schema.xml has a field defined as <dynamicField name="*_t_cz" type="text_cz" indexed="true" stored="true" /> but not one mapping the _cs suffix, so I added <dynamicField name="*_t_cs" type="text_cz" indexed="true" stored="true" />, restarted Tomcat, rebuilt my indexes, and that error cleared.
I also have Solr errors about Polish content, as that language/region information is not defined at all, and about Norwegian, where the field is defined in the schema.xml as <dynamicField name="*_t_no" type="text_no" indexed="true" stored="true" /> but Solr is searching for the suffix '_nb' (e.g. unknown field 'id_t_nb') and throwing an unknown field exception.
There seems to be a problem with the way Sitecore and Solr are mapping languages using the region info. I will raise a ticket with Sitecore support and update the answer when I get a response.
You mention Chinese in your errors; it could be that Chinese is not defined in Solr but you have some content in Sitecore in that language.
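As a sketch, the schema.xml additions that clear these errors look like this (the Polish mapping is an assumption: the stock schema has no dedicated Polish type, so text_general is a stand-in):

```xml
<dynamicField name="*_t_cs" type="text_cz" indexed="true" stored="true"/>
<dynamicField name="*_t_nb" type="text_no" indexed="true" stored="true"/>
<!-- assumption: no Polish analyzer in the stock schema, so fall back to text_general -->
<dynamicField name="*_t_pl" type="text_general" indexed="true" stored="true"/>
```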
Update
Sitecore have confirmed this behaviour as a bug
I have found that when generating the search schema and updating indexes, different attributes are used (the name of the language and the culture). I will register this behavior as a bug and register a wish to implement full support of the languages that are supported in SOLR by default.
I have a Solr multicore instance and I'm trying to get advantage from the External File Field feature, but with no success.
Here's the fieldType definition
<fieldType name="ext_tags" class="solr.ExternalFileField" keyField="doc_id" />
Notice the reference guide reports an example with a few more attributes
defVal="" stored="false" indexed="false" valType="string"
but I guess they're from Solr version 3, because if I add them to my schema.xml, I get an error starting the instance.
Here's the field definition
<field name="deal_tags" type="ext_tags" indexed="false" stored="true" required="false" />
Here's the external data file name ( in $SOLR_HOME/data )
external_deal_tags.txt
Here's the instance dir ($SOLR_HOME)
/opt/solr-4.3.0/deal-ws/cores/it_IT/
Here's an excerpt from the data file (UTF-8 encoded, sorted by doc_id that is a MD5 hash)
003c9256f23da49233fc0b253f7a93cb=8;12
0050188629a8c0e3f89bcd6a7cb77b3a=6;7;13;33;35;38
009c3932933b173072054e3d81527b05=6
Here's the URL I call
http://localhost:8080/solr/it_IT/select?q=*:*&wt=json&fl=deal_tags&rows=3&indent=yes
Here's the response I get
{
"responseHeader":{
"status":0,
"QTime":116},
"response":{"numFound":3678,"start":0,"docs":[
{},
{},
{}]
}}
Even if I change the rows param to 4000 (I have 3678 documents in the index), I get no ext_tags.
After modifying the schema.xml file, I restarted Tomcat many times and I also restarted the hosting machine.
What have I missed?
* UPDATE *
During my quest for answers, I found out the problem was possibly in the way I queried Solr. I tried modifying the query to use field():
http://localhost:8080/solr/it_IT/select?q=*:*&wt=json&fl=field(deal_tags)&rows=3&indent=yes
and this is what I get now
{
"responseHeader":{
"status":0,
"QTime":2},
"response":{"numFound":3678,"start":0,"docs":[
{ "field(deal_tags)":0.0},
{ "field(deal_tags)":0.0},
{ "field(deal_tags)":8.0}]
}}
I expected to get strings; instead I get something formatted as a decimal number:
- first result: expected 56;57 -> 0.0
- second result: expected blank -> 0.0 (its doc_id is not in the external file)
- third result: expected 8 -> 8.0
So, it seems I need to inform Solr I expect this value to be treated as a string but I don't get where to set this configuration.
Any idea?
OK, I found out what the problem was. I took a look in the log file and here's what I got:
Caused by: org.apache.solr.common.SolrException: Only float and pfloat (Trie|Float)Field are currently supported as external field type.
So, I cannot use strings as external values.
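For what it's worth, a configuration that does work (a sketch, assuming numeric values in the external file; the field name is illustrative, and valType="float" is the kind the error above says is accepted):

```xml
<fieldType name="ext_score" class="solr.ExternalFileField" keyField="doc_id" defVal="0" valType="float"/>
```

With that, the external file lines would hold numbers (e.g. 003c9256f23da49233fc0b253f7a93cb=8.0), and string-valued tags would have to be indexed as a regular field instead.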