Solr 4.7.1 with Tomcat 6 doesn't store Romanian characters

When trying to store Romanian special characters (diacritics) in a Solr schema field such as:
<field name="description" type="text_general" indexed="true" stored="true" required="false"/>
the characters (ă, î, â, ș, ț) are replaced in Solr by ?.
To be clear, I've done everything a basic setup requires. My Solr version is 4.7.1 and I run it under Tomcat 6.

Make sure you submit data to Solr in the proper encoding.
Also consider specifying a charset in the content type, e.g. Content-Type: text/plain; charset=UTF-8.
You can also check how the data is parsed on the Solr side by debugging this method:
org.apache.solr.servlet.SolrRequestParsers.parseParamsAndFillStreams(HttpServletRequest, ArrayList<ContentStream>)
See these lines:
final String cs = ContentStreamBase.getCharsetFromContentType(req.getContentType());
final Charset charset = (cs == null) ? IOUtils.CHARSET_UTF_8 : Charset.forName(cs);
Solr should come up with UTF-8 here.
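The fallback quoted above can be sketched in plain Java. This is an illustrative re-creation, not Solr's actual implementation; the class and method names are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallbackDemo {
    // Hypothetical sketch of the logic above: extract the charset from a
    // Content-Type header, falling back to UTF-8 when none is declared.
    static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            int idx = contentType.toLowerCase().indexOf("charset=");
            if (idx >= 0) {
                // Take everything after "charset=", stopping at any further parameter.
                String cs = contentType.substring(idx + "charset=".length()).split(";")[0].trim();
                return Charset.forName(cs);
            }
        }
        return StandardCharsets.UTF_8; // Solr's default when no charset is given
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/plain; charset=ISO-8859-1")); // ISO-8859-1
        System.out.println(charsetFromContentType("text/plain"));                     // UTF-8
    }
}
```

The practical takeaway: if the client omits charset=UTF-8 and the container (here, Tomcat 6) decodes the body with its platform default, the diacritics are mangled before Solr ever sees them.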

Related

NumberFormatException in Solr for a string field

I'm new to Solr and trying to stream a large document into my SolrCloud instance. I'm not picky about types at the moment, so everything is a string.
Field in question: (taken from schema.xml)
<field name="student_count" type="string" indexed="true" stored="true"/>
Trying to insert:
2.5
Error:
java.lang.NumberFormatException:
Error adding field 'student_count'='2.5' msg=For input string: "2.5"
I'm wondering why the updates are not working. Is it not picking up my schema? Why is Solr trying to convert a string to a number when the field is already declared as a string in the schema?
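As a hint at what's happening: the text after msg= is the standard Java exception message produced when an integer parser receives a decimal string, which suggests the field is effectively being treated as an int type at index time (for example, the live schema differs from the one you edited, or the core was never reloaded). A minimal reproduction:

```java
public class ParseDemo {
    public static void main(String[] args) {
        try {
            // What a numeric (int) field type does internally with the value:
            Integer.parseInt("2.5");
        } catch (NumberFormatException e) {
            // Matches the msg= portion of the Solr error above.
            System.out.println(e.getMessage()); // For input string: "2.5"
        }
    }
}
```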

Is there a way to view search document fields that are only indexed but not stored via the solr admin panel using the query tool?

I want to view the indexed-but-not-stored fields of a Solr search document in the Solr admin query tool. Is there any provision for this?
Example Field Configuration:
<field name="product_data" type="string" indexed="true" stored="false" multiValued="false" docValues="true" />
If you're using schema version 1.6, Solr will automagically fetch the values from docValues, even if the field itself is set to stored="false". Include the field name in fl to get the values back.
However, even if you're looking for the actual tokens indexed for a document / field / value, using the Analysis page is usually the preferred way as it allows you to tweak the value and see the response quickly. The Luke Request Handler / Tool is useful if you want to explore the actual indexed tokens.

Search substring in multi value field

I have a field like the one below (taken from a query result):
"bestseller_archive_position": [
"2015-11-13_1",
"2015-11-12_2"
],
Now I need to find documents that contain the string "2015-11-13" in this field. But when I do
q=2015-11-13
or
q=2015-11-13*
I receive 0 documents. I've tested different field types. How can I perform such a search?
It's very hard to guess with so little information. However, shooting in the dark, I bet you're using a request handler with the "lucene" query parser, which means the - has a precise meaning (prohibited clause, NOT). You should escape those special characters with a leading \.
On top of that, enable debugging (add debug=true) and you will see how Solr "sees" the query being executed.
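The escaping described above can be sketched as a small helper (SolrJ ships a similar utility, ClientUtils.escapeQueryChars; the class name and character set here are an illustrative approximation, not an exhaustive list):

```java
public class QueryEscapeDemo {
    // Prefix each Lucene query-parser special character with a backslash
    // so it is treated as a literal rather than as an operator.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&;/ ".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The hyphens no longer act as prohibited-clause (NOT) operators:
        System.out.println(escape("2015-11-13")); // 2015\-11\-13
    }
}
```

Note that a trailing wildcard should be appended after escaping (e.g. escape("2015-11-13") + "*"), since escaping the * itself would make it a literal character.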
First, you should properly configure a field type for searching. For your purposes the string type will do; it stores and indexes data as-is. Put the following config in your schema.xml:
<field name="bestseller_archive_position" type="string" indexed="true" stored="false" multiValued="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
When searching, you should also specify the field to search in; otherwise the search runs against the default field. E.g.:
q=bestseller_archive_position:2015-11-13*

Solr highlighting for external fields

I would like to use Solr highlighting, but our documents are only indexed and not stored. The field values are found in a separate database. Is there a way to pass in the text to be highlighted without Solr needing to pull that text from its own stored fields? Or is there an interface that would allow me to pass in a query, a field name, a field value and get back snippets?
I'm on Solr 5.1.
Lucene supports highlighting (returning offsets) for non-stored content as well, by using docValues.
Enabling docValues for a field only requires adding docValues="true" to the field (or field type) definition, e.g.:
<field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />
(introduced in Lucene 8.5, SOLR-14194)
You could reindex the result set (read from the database) into an embedded Solr instance, run the query with the same set of keywords with highlighting turned on, and get the highlighted text back.
You could read the schema and solrconfig as resources from a local jar and extract them to a temporary Solr core directory to get this setup working.

How do I index rich-format documents contained as database BLOBs with Solr 4.0+?

I've found a few related solutions to this problem. The related solutions will not work for me as I'll explain. (I'm using Solr 4.0 and indexing data stored in an Oracle 11g database.)
Jonck van der Kogel's related solution (from 2009) is explained here. He describes creating a custom Transformer, sort of like the ClobTransformer that ships with Solr. This is going down the elegant path but is not using Tika, which is now integrated with Solr. (He uses external PDFBox and FontBox.) This creates multiple maintenance / upgrade dependencies. Also, I need to be able to index Word documents in addition to PDF.
Since Kogel's solution seems to be on the right path, is there a way to use the Tika classes included with Solr in a custom Transformer? That would allow all the Tika functionality with Kogel's elegant database approach.
Another related solution is the ExtractingRequestHandler (ERH) that ships with Solr. However, as the name suggests, this is a request handler, such as to handle HTTP posts of rich-text documents. To extract documents from the database this way has performance and security problems. I would have to make the database BLOBs accessible via HTTP. I've found no discussion of using ERH for direct ingest from a database BLOB. Is it possible to directly ingest from database BLOBs with Solr Cell?
Another related solution is to write a Transformer (like Kogel's above) to convert a byte[] to a string (from DataImportHandler FAQ). With true binary documents this is going to feed junk into the index and not properly extract the text elements like Tika does. Won't work.
A final related solution is UpdateRichDocuments offered by the RichDocumentHandler. This is deprecated and no longer available in Solr. The page refers you to the ExtractingRequestHandler (discussed above).
It seems like the right solution is to use DataImportHandler and a custom Transformer using the Tika class. How does this work?
Many hours later... First, there is a lot of misleading, wrong, and useless information on this problem. No single page provided everything in one place. All of the information is well intentioned, but between differing versions and some of it going over my head, it didn't solve the problem. Here is my collection of what I learned and the solution. To reiterate, I'm using Solr 4.0 (on Tomcat) + Oracle 11g.
Solution overview: DataImportHandler + TikaEntityProcessor + FieldStreamDataSource
Step 1: make sure you update your solrconfig.xml so that Solr can find the TikaEntityProcessor, DataImportHandler, and Solr Cell jars.
<lib dir="../contrib/dataimporthandler/lib" regex=".*\.jar" />
<!-- will include extras (where TikaEntPro is) and regular DIH -->
<lib dir="../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<lib dir="../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../dist/" regex="apache-solr-cell-\d.*\.jar" />
Step 2: modify your data-config.xml to include your BLOB table. This is where I had the most trouble, since the solutions to this problem have changed a lot across versions. Plus, using multiple data sources and plugging them together correctly was not intuitive to me. Very sleek once it's done, though. Make sure to replace your IP, SID name, username, password, table names, etc.
<dataConfig>
<dataSource name="dastream" type="FieldStreamDataSource" />
<dataSource name="db" type="JdbcDataSource"
driver="oracle.jdbc.OracleDriver"
url="jdbc:oracle:thin:@192.1.1.1:1521:sid"
user="username"
password="password"/>
<document>
<entity
name="attachments"
query="select * from schema.attachment_table"
dataSource="db">
<entity
name="attachment"
dataSource="dastream"
processor="TikaEntityProcessor"
url="blob_column"
dataField="attachments.BLOB_COLUMN"
format="text">
<field column="text" name="body" />
</entity>
</entity>
<entity name="unrelated" query="select * from another_table" dataSource="db">
</entity>
</document>
</dataConfig>
An important note here: if you're getting "No field available for name: whatever" errors when you attempt to import, the FieldStreamDataSource is not able to resolve the data field name you gave. For me, I had to put the lower-case column name in the url attribute and outside_entity_name.UPPERCASE_BLOB_COLUMN in the dataField attribute. Also, at one point I simply had the column name wrong, which causes the same error.
Step 3: modify your schema.xml to add the BLOB-column field (and any other columns you need to index/store). Modify according to your needs.
<field name="body" type="text_en" indexed="false" stored="false" />
<field name="attach_desc" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_en" indexed="true" stored="false" multiValued="true" />
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true" />
<copyField source="body" dest="text" />
<copyField source="body" dest="content" />
With that you should be well on your way to saving many hours getting your binary, rich-text documents (aka rich documents) that are stored as BLOBs in a database column indexed with Solr.
The integration of Tika and DIH is already provided with Solr via TikaEntityProcessor:
Integration: SOLR-1358
BLOB handling: SOLR-1737
You just need to find the right combination.
