Solr: how to limit the search content in a Solr query

I want to search for words up to a particular line and not beyond it using a Solr query. I have tried a proximity match, but it didn't work. My data looks like this:
Blockquote"Date: Thu, 24 Jul 2014 09:36:44 GMT\nCache-Control: private\nContent-Type: application/json; charset=utf-8\nContent-Encoding: gzip\nVary: Accept-Encoding\nP3P: CP=%20CURo TAIo IVAo IVDo ONL UNI COM NAV INT DEM STA OUR%20\nX-Powered-By: ASP.NET\nContent-Length: 570 \nKeep-Alive: timeout=120\nConnection: Keep-Alive\n\n[{%20rows%20:[],%20index%20:[],%20folders%20:[[%20Inbox%20,%20Inbox%20,%20%20,1,1,0,0,0,%20Inbox%20,0,0,%20none%20,0],[%20Drafts%20,%20Drafts%20,%20%20,1,1,0,0,0,%20Drafts%20,0,0,%20none%20,0],[%20Sent%20,%20Sent%20,%20%20,1,1,0,0,11,%20Sent%20,1,0,%20none%20,0],[%20Spam%20,%20Spam%20,%20%20,1,1,0,0,0,%20Spam%20,1,0,%20none%20,0],[%20Deleted%20,%20Trash%20,%20%20,1,1,0,7,9,%20Deleted%20,1,0,%20none%20,0],[%20Saved%20,%20Saved Mail%20,%20%20,1,1,0,0,0,%20Saved%20,1,0,%20none%20,0],[%20SavedIMs%20,%20Saved Chats%20,%20Saved%20,2,1,0,0,0,%20SavedIMs%20,1,0,%20none%20,0]],%20fcsupport%20:true,%20hasNewMsg%20:false,%20totalItems%20:0,%20isSuccess%20:true,%20foldersCanMoveTo%20:[%20Sent%20,%20Spam%20,%20Deleted%20,%20Saved%20,%20SavedIMs%20],%20indexStart%20:0}]POST /38664-816/aol-6/en-us/common/rpc/RPC.aspx?user=hl1lkgReIh&transport=xmlhttp&r=0.019667088333411797&a=GetMessageList&l=31211 HTTP/1.1\nHost: mail.aol.com\nUser-Agent: Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8\nAccept-Language: en-US,en;q=0.5\nAccept-Encoding: gzip, deflate\nContent-Type: application/x-www-form-urlencoded; charset=UTF-8\nX-Requested-With: XMLHttpRequest\nReferer: http://mail.aol.com/38664-816/aol-6/en-us/Suite.aspx\nContent-Length: 452\nCookie: mbox=PC#1405514778803-136292.22_06#1407395182|session#1406185366924-436868#1406187442|check#true#1406185642; s_pers=%20s_fid%3D55C638B5F089E6FB-19ACDEED1644FD86%7C1469344726539%3B%20s_getnr%3D1406186326569-Repeat%7C1469258326569%3B%20s_nrgvo%3DRepeat%7C1469258326571%3B; s_vi=[CS]v1|29E33A0D051D366F-60000105200097FF[CE]; UNAUTHID=1.5efb4a11934a40b8b5272557263dadfe.88c5; RSP_COOKIE=type=30&name=YWxzaGFraWIyMDE0&sn=MzRb%2FjjHIe8odpr%2FfxZR2g%3D%3D&stype=0&agrp=M; LTState=ver:5&lav:22&un:*UQo5AwAnAytffwJSYg%3d%3d&sn:*UQo5AwAnAytffwJSYg%3d%3d&uv:AOL&lc:en-us&ud:aol.com&ea:*UQo5AwAnAytffwJSCAsnWWoJASZL&prmc:825345&mt:6&ams:1&cmai:365&snt:0&vnop:False&mh:core-mia002b.r1000.mail.aol.com&br:100&wm:mail.aol.com&ckd:.mail.aol.com&ckp:%2f&ha:1NGRuUTRRxGFF2s5A4JwkuCT43Q%3d&; aolweatherlocation=10003; DataLayer=cons%3D6.107%26coms%3D629; grvinsights=69f3a2bb86ed3cd31aa1d14a1ce9e845; CUNAUTHID=1.5efb4a11934a40b8b5272557263dadfe.88c5; s_sess=%20s_cc%3Dtrue%3B%20s_sq%3Daolcmp%253D%252526pid%25253Dcmp%2525253A%25252520Help%25252520%2525257C%25252520View%25252520Article%2525253A%25252520Clear%25252520cookies%2525252C%25252520cache%2525252C%25252520history%25252520and%25252520footprints%252526pidt%25253D1%252526oid%25253Dhttp%2525253A%2525252F%2525252Fwebmail.aol.com%2525252F%2525253F_AOLLOCAL%2525253Dmail%252526ot%25253DA%2526aolsnssignin%253D%252526pid%25253Dsso%25252520%2525253A%25252520login%252526pidt%25253D1%252526oid%25253DSign%25252520In%252526oidt%25253D3%252526ot%25253DSUBMIT%3B; L7Id=31211; Context=ver:3&sid:923f783b-bc6e-4edf-87c9-e52f19b3ce67&rt:STANDARD&i:f&ckd:.mail.aol.com&ckp:%2f&ha:X80Ku4ffRKsOVSwgmEVPCfpfxeU%3d&; IDP_A=s-1-V0c3QiuO6BzQ5S6_u3s0brfUqMCktezAz7sWlVfHD90omIijDXRrMJkSM-9-xcnUcSTnXbcZ1aUCgvfuToVeJihcftKY5KtsC_nB7Y9qf6P0xUnNfCIAmWVtRf4ctSQ9JwRIzHa40dhFuULwYLu3NUPTxckeFUFAzcSS4hrmb4grhEtyOGp0qV5rIKtjs4u8; MC_CMP_ESK=NonSense; 
SNS_AA=asrc=2&sst=1406185424&type=0; _utd=gd#MzRb%2FjjHIe8odpr%2FfxZR2g%3D%3D|pr#a|st#sns.webmail.aol.com|uid#; Auth=ver:22&uas:*UQo5AwAnAytffwJSZAskRiwLBSIDWVpVXxVTVwJCLFxdSnpHUWBbeV1jcikERgl6CEYLJUweGUhdFQQLW1h%2bBAZRcllWfVl8VH4DUmRaZARoPhw%2bBFBA&idl:0&un:*UQo5AwAnAytffwJSYg%3d%3d&at:SNS&sn:*UQo5AwAnAytffwJSYg%3d%3d&wim:%252FwQCAAAAAAAEk2ihy%252BE4MMebm4R1jvxY07zNZhFOHSz2EFBnsNdOAUsl8QyZceo54kWYZ4vwVayLFF7w&sty:0&ud:aol.com&uid:hl1lkgReIh&ss:635417678271359104&svs:SNS_AA%7c1406185424&la:635417687268954835&aat:A&act:M&br:100&cbr:AOL&mt:&pay:0&mbt:G&uv:AOL&lc:en-us&bid:1&acd:1403348988&pix:3829&prmc:825345&relm:aol&mah:%2\nConnection: keep-alive\n"
I want to search for Content-Type: application/json in this data, but not match anything beyond that line. I have tried
http://192.168.0.164:8983/solr/collection_with_all_details/select?q=Content%3AContent-Typejson*&wt=json&indent=true
but it searches the entire content. I need to limit the portion of the content that is searched.

I don't think this is possible in this case. You can look at the highlighter to return only the first 200 characters in the highlighting response.
You may also need to think about writing a custom response writer to help with this.
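For the highlighter option, here is a minimal SolrJ sketch, not a drop-in solution; it reuses the collection URL and the Content field from the question, so adjust both to your schema:
// Sketch: query with highlighting limited to ~200-character fragments.
// Assumes SolrJ 4.x and the collection/field names from the question.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer(
                "http://192.168.0.164:8983/solr/collection_with_all_details");
        SolrQuery q = new SolrQuery("Content:\"Content-Type: application/json\"");
        q.setHighlight(true);            // turn on the highlighter
        q.addHighlightField("Content");  // field to build snippets from
        q.setHighlightFragsize(200);     // return ~200-character snippets only
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getHighlighting());
        solr.shutdown();
    }
}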
One more option is to create an additional field with indexed="false" stored="true", which will be more efficient.
Make your original field indexed="true" stored="false", so your index size is reduced, and make the new copy field indexed="false" stored="true":
<copyField source="text" dest="textShort" maxChars="200"/>
Check if this works out for you.

You should really, really pre-process your data to just index the part that you're going to use. Doing it after the fact will not be a good solution, as you'll have most of the content in the index already, and you're looking for a separator that's not positioned in one specific byte location (which is what maxChars would be able to do).
Depending on how you're indexing, you can either do it in the indexing step (a RegexTransformer, your own code using SolrJ, etc.) or in the analysis step of the field type, by using something like a PatternReplaceFilter. That would allow you to remove anything after the header you're looking for.
That way you should be able to index the content into one header field and one body field, for example, depending on your needs.
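As an illustration of the pre-processing route, here is a minimal SolrJ sketch; the field names header_text and body_text and the split on the first blank line are assumptions for illustration, not anything from your current setup:
// Sketch: split the raw capture into header and body before indexing,
// so queries against the header field never match body content.
// The field names and the Solr URL below are assumptions.
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PreprocessAndIndex {
    public static void main(String[] args) throws Exception {
        String raw = loadRawCapture();            // however you obtain the raw string
        int split = raw.indexOf("\n\n");          // the blank line that ends the HTTP headers
        String header = split >= 0 ? raw.substring(0, split) : raw;
        String body = split >= 0 ? raw.substring(split + 2) : "";

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "capture-1");
        doc.addField("header_text", header);      // search only this field later
        doc.addField("body_text", body);

        HttpSolrServer solr = new HttpSolrServer(
                "http://192.168.0.164:8983/solr/collection_with_all_details");
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }

    private static String loadRawCapture() {
        return "";                                // placeholder for the real data source
    }
}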

Related

Onerror directives not working with Solr data import handler / PDFBox

We are trying to index about 10,000 PDFs against a MySQL table. We are using Solr 5.2.1, Tika 1.7 and PDFBox 1.8.8. The data import handler keeps giving various errors, which halts the entire process. Most errors have to do with unreadable content or a PDF file not being found. I understand this, and would like the process to continue or skip past the problem files. But no matter how I set the onerror directive, it does not seem to work. We have indexed smaller sets of PDFs using the same methods with no problem, but the continuous errors on this larger store are stopping us in our tracks! I would appreciate any advice.
Here is the entity from data-config.xml:
<entity name="proceedings" dataSource="proceedings_db" onerror="skip"
query="SELECT productID, title, fileName, buildDate
FROM products
WHERE status = '1'
">
<field column="productID" name="uid" />
<field column="title" name="paper_title" />
<field column="fileName" name="filename" />
<field column="buildDate" name="builddate" />
<entity name="file" dataSource="proceedings_files" processor="TikaEntityProcessor" url="${proceedings.filename}" format="text" onerror="skip">
</entity>
</entity>
I have tried setting onerror for the outer entity, for the inner, and for both (as above). I have tried skip and continue for all of those combinations. It seems to have no impact.
Here is an example of an error I get:
ERROR FlateFilter FlateFilter: stop reading corrupt stream due to a DataFormatException
and
java.io.EOFException
at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139)
at org.apache.fontbox.ttf.HorizontalMetricsTable.initData(HorizontalMetricsTable.java:62)
at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:411)
at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:162)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:514)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
These are not the only errors, mind you, but they are representative. I'd prefer not to hunt down and zap every offending PDF, but if that is what it takes, then fine. But as with the error above, I don't even get a filename, even if I broaden the log output settings. Sometimes I do, and I zap that file, but it does not happen each time.
I guess I do not understand what purpose the onerror settings serve, when it seems like the only working option is to stop! Again, I'd be most appreciative of any advice or suggestions as to what I am doing wrong.
Thanks in advance!
Using the DIH to index a large amount of PDF files in a real production environment is NOT a good idea:
Some PDFs will fail to be extracted, no matter what. If you need to extract the content by any means, you should catch the ones that error out and run them through a second extraction library (different from PDFBox), as no two libraries fail on the same set of files.
Some will cause OOM errors; same as above, try another library if needed when your process continues.
Worse, some PDFs will hang while being extracted. You need to set up some infrastructure to find those cases and kill the extracting process so it moves on to the next files. Ditto for the second library if you feel like it.
Bottom line: you should use some other setup, either some ETL tool or custom code that does the extraction and then the indexing into Solr, taking the above recommendations into account.
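A minimal sketch of that custom-code route, assuming Tika's AutoDetectParser plus SolrJ; the Solr URL, the directory path and the field names are placeholders, and hang detection/timeouts are not shown:
// Sketch: extract each PDF with Tika, skip failures, and index the rest with SolrJ.
// URL, path and field names are assumptions; add timeouts or a second library as needed.
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/proceedings");
        AutoDetectParser parser = new AutoDetectParser();

        for (File pdf : new File("/data/pdfs").listFiles()) {
            try (InputStream in = new FileInputStream(pdf)) {
                BodyContentHandler text = new BodyContentHandler(-1); // no size limit
                parser.parse(in, text, new Metadata());

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("uid", pdf.getName());
                doc.addField("text", text.toString());
                solr.add(doc);
            } catch (Exception e) {
                // the whole point: log the offending file and keep going
                System.err.println("Skipping " + pdf.getName() + ": " + e);
            }
        }
        solr.commit();
        solr.shutdown();
    }
}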

Solr: where to find the Luke request handler

I'm trying to get a list of all the fields, both static and dynamic, in my Solr index. Another SO answer suggested using the Luke Request Handler for this.
It suggests finding the handler at this url:
http://solr:8983/solr/admin/luke?numTerms=0
When I try this url on my server, however, I get a 404 error.
The admin page for my core is here http://solr:8983/solr/#/mycore, so I also tried http://solr:8983/solr/#/mycore/admin/luke. This also gave me another 404.
Does anyone know what I'm doing wrong? Which url should I be using?
First of all, you have to enable the Luke Request Handler. Note that if you started from the example solrconfig.xml, you probably don't need to enable it explicitly, because
<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
does it for you.
Then, if you need to access the data programmatically, you have to make an HTTP GET request to http://solr:8983/solr/mycore/admin/luke (no hash mark!). The response is in XML by default, but by specifying the wt parameter you can obtain other formats (e.g. http://solr:8983/solr/mycore/admin/luke?wt=json).
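For example, a bare-bones Java sketch of that GET request (host, port and core name are the ones from the URLs above; adjust them to your setup):
// Sketch: fetch the Luke handler output as JSON over plain HTTP.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class LukeFields {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://solr:8983/solr/mycore/admin/luke?numTerms=0&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON describing every field in the index
            }
        }
    }
}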
If you only want to see the fields in the Solr web interface, select your core from the drop-down menu and then click on "Schema Browser".
In Solr 6, solr.admin.AdminHandlers has been removed. If your solrconfig.xml has the line <requestHandler name="/admin/" class="solr.admin.AdminHandlers" />, it will fail to load. You will see errors in the log telling you it failed to load the class org.apache.solr.handler.admin.AdminHandlers.
You must instead include the following line in your solrconfig.xml:
<requestHandler name="/admin/luke" class="org.apache.solr.handler.admin.LukeRequestHandler" />
Note that the URL is core-specific, i.e. http://your_server.com:8983/solr/your_core_name/admin/luke
You can specify the parameters fl, numTerms, id and docId as follows:
/admin/luke
/admin/luke?fl=cat
/admin/luke?fl=id&numTerms=50
/admin/luke?id=SOLR1000
/admin/luke?docId=2
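If you are working from SolrJ rather than raw URLs, the same handler can be queried with LukeRequest; here is a minimal sketch (the core URL is just an example):
// Sketch: list the index fields via SolrJ's LukeRequest instead of a raw URL.
// The base URL/core name is an example; point it at your own core.
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class ListFields {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
                new HttpSolrClient("http://your_server.com:8983/solr/your_core_name");
        LukeRequest luke = new LukeRequest();  // sends the request to /admin/luke
        luke.setNumTerms(0);                   // skip per-field top terms
        LukeResponse rsp = luke.process(client);
        for (String field : rsp.getFieldInfo().keySet()) {
            System.out.println(field);
        }
        client.close();
    }
}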
You can use the Luke tool, which allows you to explore the Lucene index directly.
You can also use the Solr admin page:
http://localhost:8983/solr/#/core/schema-browser

Solr 4.3.0 External File Field: despite following the official reference, I cannot get it to work

I have a Solr multicore instance and I'm trying to take advantage of the External File Field feature, but with no success.
Here's the fieldType definition
<fieldType name="ext_tags" class="solr.ExternalFileField" keyField="doc_id" />
Note that the reference guide shows an example with a few more attributes,
defVal="" stored="false" indexed="false" valType="string"
but I guess they're from Solr version 3, because if I add them to my schema.xml, I get an error when starting the instance.
Here's the field definition
<field name="deal_tags" type="ext_tags" indexed="false" stored="true" required="false" />
Here's the external data file name (in $SOLR_HOME/data):
external_deal_tags.txt
Here's the instance dir ($SOLR_HOME)
/opt/solr-4.3.0/deal-ws/cores/it_IT/
Here's an excerpt from the data file (UTF-8 encoded, sorted by doc_id, which is an MD5 hash):
003c9256f23da49233fc0b253f7a93cb=8;12
0050188629a8c0e3f89bcd6a7cb77b3a=6;7;13;33;35;38
009c3932933b173072054e3d81527b05=6
Here's the URL I call
http://localhost:8080/solr/it_IT/select?q=*:*&wt=json&fl=deal_tags&rows=3&indent=yes
Here's the response I get
{
"responseHeader":{
"status":0,
"QTime":116},
"response":{"numFound":3678,"start":0,"docs":[
{},
{},
{}]
}}
Even if I change the rows param to 4000 (I have 3678 documents in the index), I get no ext_tags.
After modifying the schema.xml file, I restarted Tomcat many times and I also restarted the hosting machine.
What have I missed?
* UPDATE *
During my quest for answers, I found out that the problem was possibly in the way I queried Solr. I tried modifying the query to use field():
http://localhost:8080/solr/it_IT/select?q=*:*&wt=json&fl=field(deal_tags)&rows=3&indent=yes
and this is what I get now
{
"responseHeader":{
"status":0,
"QTime":2},
"response":{"numFound":3678,"start":0,"docs":[
{ "field(deal_tags)":0.0},
{ "field(deal_tags)":0.0},
{ "field(deal_tags)":8.0}]
}}
I expected to get strings; instead I get something formatted as a decimal number:
- first result: expected 56;57 -> 0.0
- second result: expected blank -> 0.0 (its doc_id is not in the external file)
- third result: expected 8 -> 8.0
So it seems I need to tell Solr that I expect this value to be treated as a string, but I don't see where to set this configuration.
Any idea?
OK, I found out what the problem was. I took a look in the log file and here's what I got:
Caused by: org.apache.solr.common.SolrException: Only float and pfloat (Trie|Float)Field are currently supported as external field type.
So, I cannot use strings as external values.

Solr 4: disable compression on stored fields: how to actually configure custom codec?

The short question is:
I want to disable stored-field compression on a Solr 4.3.0 index. After reading:
http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1
http://wiki.apache.org/solr/SimpleTextCodecExample
http://www.opensourceconnections.com/2013/06/05/build-your-own-lucene-codec/
I've decided to follow the path described there and make my own codec. I'm pretty sure I've followed all the steps; however, when I actually try to use my codec (affectionately named "UncompressedStorageCodec"), I get the following error in the Solr log:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
From the output I gather that Solr is not picking up the jar with my custom codec, and I don't understand why.
Here are all the horrific details:
I've created a class like this:
public class UncompressedStorageCodec extends FilterCodec {
    private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();

    protected UncompressedStorageCodec() {
        super("UncompressedStorageCodec", new Lucene42Codec());
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return fieldsFormat;
    }
}
in package: "fr.company.project.solr.transformers.utils"
The FQDN of "FilterCodec" is: "org.apache.lucene.codecs.FilterCodec"
I've created a basic jar file out of this (exported it as jar from Eclipse).
The Solr installation I'm using to test this is the basic Solr 4.3.0 unzipped, started via its embedded Jetty server and using the example core.
I've placed my jar with the codec in [solrDir]\dist
In:
[solrDir]\example\solr\myCore\conf\solrconfig.xml
I've added the line:
<lib dir="../../../dist/" regex="myJarWithCodec-1.10.1.jar" />
Then in the schema.xml file, I've declared some fieldTypes that should use this codec like so:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" postingsFormat="UncompressedStorageCodec"/>
<fieldType name="string_lowercase" class="solr.TextField" positionIncrementGap="100" omitNorms="true" postingsFormat="UncompressedStorageCodec">
<!--...-->
</fieldType>
Now, if I use the DataImportHandler component to import some data into Solr, at commit time it tells me:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
What I find strange is that the above-mentioned codec jar also contains some Transformers for the DataImportHandler component, and those are picked up fine. Also, other jars placed in the dist folder (and declared in the same way in solrconfig.xml), like the JDBC driver, are picked up fine. I'm guessing that for the codec there's this SPI thingy which loads things differently, and there's something it's missing...
I've also tried placing the codec jar in:
[solrDir]\example\solr-webapp\webapp\WEB-INF\lib\
as well as inside the WEB-INF\lib folder of the solr.war file, which is found in:
[solrDir]\example\webapps\
but I'm still getting the same error.
So basically, my question is, what's missing so that my codec jar is picked up by Solr?
Thanks
I'm going to answer this question myself, since it has sort of become moot due to some benchmarks I've made. Long story short, I had arrived at the (wrong) conclusion that for really large stored fields, Solr 3.x and 4.0 (without field compression) is faster than Solr 4.1 and above (with field compression). However, that was mostly due to some errors in my benchmarks. After repeating them, I've obtained results showing that when you go from non-compressed to compressed fields, even for very large stored fields, the index time is between 0% and 15% slower, which is really not bad at all, considering that queries on the compressed-field indexes are afterwards 10-20% faster (the document-fetching part).
Also, here are some remarks on how to speed up indexing:
Use the DataImportHandler plugin. It bypasses the Solr REST (HTTP-based) API and writes directly to the Lucene index.
Check out said plugin's sources to see how it accomplishes this, and write your own plugin if the DataImportHandler doesn't meet your needs.
If for whatever reason you want to stick to the Solr REST API, use ConcurrentUpdateSolrServer and play around with the queue size and number of threads parameters. It will normally be a lot faster (up to 200% in my case) than the basic HttpSolrServer.
Don't forget to enable the javabin data serialization like this:
ConcurrentUpdateSolrServer solrServer = new ConcurrentUpdateSolrServer("http://some.solr.host:8983/solr", 100, 4);
solrServer.setRequestWriter(new BinaryRequestWriter());
I'm explicitly showing the code because I believe there might be a small bug here:
If you look at the ConcurrentUpdateSolrServer constructor, you'll see that by default it already sets the request writer to binary:
// the ConcurrentUpdateSolrServer initializes HttpSolrServer objects using this constructor:
public HttpSolrServer(String baseURL, HttpClient client) {
    this(baseURL, client, new BinaryResponseParser());
}
However, after debugging, I've noticed that if you don't explicitly call setRequestWriter with the binary writer argument, it will still use the XML serializer.
Going from XML to Binary serialization reduces the size of my documents about 3 times as they are being sent to the server. This makes my index times for this case about 150-200% faster.
I have recently tried, and succeeded in getting, something very similar to work. The only difference is that I want to enable the best compression instead of no compression, while Solr defaults to the fastest compression. I also got the "SPI class [...] does not exist" error at some point, and here is what I have found out from various articles, including the ones you have linked to.
Lucene uses SPI to find the codec classes to load. Lucene requires the list of codec classes to be declared in a file named "org.apache.lucene.codecs.Codec", and that file must be on the classpath. To get Solr to load the file, when you create your JAR file "myJarWithCodec-1.10.1.jar", make sure it contains a file at "META-INF/services/org.apache.lucene.codecs.Codec". The file should have one full class name per line, like this:
org.apache.lucene.codecs.lucene3x.Lucene3xCodec
org.apache.lucene.codecs.lucene40.Lucene40Codec
org.apache.lucene.codecs.lucene41.Lucene41Codec
org.apache.lucene.codecs.lucene42.Lucene42Codec
fr.company.project.solr.transformers.utils.UncompressedStorageCodec
And in solrconfig.xml, replace:
<codecFactory class="solr.SchemaCodecFactory" />
with:
<codecFactory class="fr.company.project.solr.transformers.utils.UncompressedStorageCodec" />
You might also need to remove postingsFormat="UncompressedStorageCodec" from schema.xml if Solr complains. I think this particular parameter is for specifying the postings format, not the codec. Hope it helps.

Solr 4.0, TextField query gives NullPointerException

I have a field in my Solr schema like this:
Type - <fieldType name="text" class="solr.TextField" positionIncrementGap="100"/>
Declaration - <field name="description" type="text" indexed="true" stored="true"/>
This description field has values for some documents and no data for the rest. I am not explicitly setting anything when there is no data for the description field, as it is not a required field.
The problem is that when I query using description:*, I get a NullPointerException.
I have not changed the schema. Also, when I query description:abc it works; however, description:* and description:abc* give a NullPointerException.
Is this the correct behavior?
Are text fields required to be given "" values when we have nothing to set?
Exception:
"error":{
"trace":"java.lang.NullPointerException\n\tat
org.apache.solr.schema.TextField.analyzeMultiTerm(TextField.java:139)\n\tat org.apache.solr.search.SolrQueryParser.analyzeIfMultitermTermText(SolrQueryParser.java:147)\n\tat
org.apache.solr.search.SolrQueryParser.getWildcardQuery(SolrQueryParser.java:203)\n\tat
org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:1049)\n\tat
org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:358)\n\tat
org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:257)\n\tat
org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:181)\n\tat
org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170)\n\tat
org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:120)\n\tat
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:72)\n\tat
org.apache.solr.search.QParser.getQuery(QParser.java:143)\n\tat
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:118)\n\tat
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:185)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:351)\n\tat
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)\n\tat
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)\n\tat
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)\n\tat
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)\n\tat
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)\n\tat
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)\n\tat
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)\n\tat
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)\n\tat
java.lang.Thread.run(Thread.java:679)\n",
"code":500}}
A field of type solr.TextField has to have a tokenizer associated with it; otherwise it fails for wildcard searches.
I created a Jira issue for this.
It should work and does normally work.
Did you by any chance change the field type definition and then not do a full reindex (delete the old index and so on)?
I have seen this before, but it seemed to go away after the index was fully rebuilt. It may be something to do with left-over definitions at the Lucene level.
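If you want to force that full rebuild from code, here is a minimal SolrJ sketch (the core URL is an example; you then re-add all documents from your source system):
// Sketch: wipe the index so documents are rebuilt cleanly against the current schema.
// The core URL is an example; re-index everything from the source system afterwards.
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class FullReindex {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
        solr.deleteByQuery("*:*"); // remove every document indexed with the old field type
        solr.commit();
        // ... now re-add all documents so they are analyzed with the current definitions ...
        solr.shutdown();
    }
}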
