SOLR 3.1 Indexing Issue - solr

We are facing some issues with SOLR search.
We are using SOLR 3.1 with Jetty. We have set up the schema according to our requirements, and we have set up data-config.xml to import records into the collection (core) from our database (SQL Server 2005).
There are 320,000 records in the database which we need to import.
After the import finished, when I try to search all the records through the SOLR admin at
http://localhost:8983/solr/Collection_201/admin/
it shows a total of 290,000 records found. So 30,000 records are missing.
Now the following questions are on my mind.
How can I know which records were not properly indexed, or which records are missing? To find out, I tried a trick: I thought I should add a field in the database that tells me which records have been imported into the SOLR collection and which have not. But the big question is how to update this database field during the import from data-config.xml, because the entity query in data-config.xml only allows search queries, in other words something that returns results. So I got another idea for updating that database field anyway: I created a stored procedure in my database containing an UPDATE query that would update the field, followed by a SELECT query that simply returns one record to satisfy that requirement. But when I tried to run DIH with that, it returned an "Index Failed. Rollback all the changes" error message and nothing was imported. When I commented out the UPDATE query in the stored procedure, it worked. So it would not let me run an UPDATE query even from a stored procedure. I tried really hard to find a way to update the database from DIH, but I could not find anything, so I gave up on the idea of updating the database.
I cleared the index and started importing the data again. This time I ran the import manually from the SOLR admin import page, 5,000 records per run. At the end, records were somehow still missing.
Is it possible that the data is not being committed properly? I read in the documentation that the import page (http://localhost:8983/solr/Collection_201/dataimport?command=full-import&clean=false) automatically commits the imported data, but I have personally noticed that sometimes it does and sometimes it does not. This is really driving me crazy.
Now I am completely frustrated and am starting to wonder whether the way I am using SOLR is right at all. If I am doing it right, is it reliable? If I am doing it wrong, please tell me what my mistake is.
Please guide me on how we can easily keep the collection in sync with our database and make sure it is 100% synced.

What field are you using for your IDs in Solr and the database? The id field needs to be unique, so if you have 30,000 records that have the same ID as some 30,000 other records, they will overwrite those records.
Also, when you run the data import handler, you can query it for status (?command=status), and that should tell you the total number of records imported on the last run.
The first thing I would do is check for non-unique IDs in your database with respect to the Solr id field.
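If you want to check that from code, a minimal JDBC sketch along these lines could do it (the connection URL, the Products table and its Id column are assumptions; substitute your real table and whatever column is mapped to Solr's uniqueKey field):

    // Minimal sketch: list source ids that occur more than once.
    // Table name "Products" and column "Id" are assumptions.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DuplicateIdCheck {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:sqlserver://localhost;databaseName=MyDb;user=sa;password=secret";
            try (Connection con = DriverManager.getConnection(url);
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT Id, COUNT(*) AS cnt FROM Products GROUP BY Id HAVING COUNT(*) > 1")) {
                while (rs.next()) {
                    System.out.println("Duplicate id: " + rs.getString("Id") + " (" + rs.getInt("cnt") + " rows)");
                }
            }
        }
    }

Any id printed here would explain an overwritten (and therefore "missing") document in Solr.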

Also be aware that when one record in a batch is wrong, the whole batch gets rolled back. So if that happened 3 times, and you are indexing 10K docs per batch, that would explain it.
At the time, I solved it with this: https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/src/java/org/apache/solr/handler/dataimport/NoRollbackDataImporter.java
but there should be a better/more elegant solution than that. I don't know how to get the missing records in your case, but if you have indexed the ids, then you can compare the indexed ids with the external source and find the gaps.
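A rough way to find those gaps from code could look like the sketch below (SolrJ for Solr 3.x; the uniqueKey field name "id", the core URL and the loadIdsFromDatabase() helper are assumptions):

    // Rough sketch: collect every id indexed in Solr, then diff against the ids
    // known to the database. Field name "id", the page size and the JDBC helper are assumptions.
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class FindMissingIds {

        static Set<String> indexedIds(String solrUrl) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer(solrUrl);
            Set<String> ids = new HashSet<String>();
            int rows = 1000;
            int start = 0;
            while (true) {
                SolrQuery q = new SolrQuery("*:*");
                q.setFields("id");          // only fetch the uniqueKey field
                q.setStart(start);
                q.setRows(rows);
                QueryResponse rsp = solr.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    ids.add(String.valueOf(doc.getFieldValue("id")));
                }
                start += rows;
                if (start >= rsp.getResults().getNumFound()) {
                    break;
                }
            }
            return ids;
        }

        public static void main(String[] args) throws Exception {
            Set<String> missing = loadIdsFromDatabase();   // hypothetical helper, e.g. SELECT Id FROM Products via JDBC
            missing.removeAll(indexedIds("http://localhost:8983/solr/Collection_201"));
            System.out.println("Missing from Solr: " + missing);
        }

        static Set<String> loadIdsFromDatabase() {
            return new HashSet<String>();                  // placeholder for the JDBC query
        }
    }

Whatever remains in the set after the removeAll call is the list of records that never made it into the index.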

Related

Solr 4 - Delete fields from FieldsInfo file

I am running Solr 4. I created a million fields in Solr using a script. I saw GC activity go very high after adding these fields, because every time a searcher is opened these fields are loaded.
Now I want to get my Solr cluster back to the state it was in before adding those fields. Even though I delete the documents that have those fields, the cluster does not return to what it was, because the fields are not getting deleted from the fieldsInfo file.
Is there a way to explicitly tell Solr to delete the fields from the fieldsInfo file?
There is a documented Schema API call that can delete a field. However, I don't know whether it is already available for Solr 4; you should try it and see if it works.
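For reference, on versions where the Schema API does support it (it requires a managed schema, and it may well not exist in Solr 4, which is exactly the uncertainty above), the call is a plain JSON POST to the core's /schema endpoint. A hedged sketch, with a placeholder core and field name:

    // Sketch of a Schema API delete-field request (managed schema, newer Solr versions).
    // Core name "collection1" and field name "my_generated_field" are placeholders.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DeleteField {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/collection1/schema");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setRequestProperty("Content-Type", "application/json");
            con.setDoOutput(true);
            String body = "{\"delete-field\": {\"name\": \"my_generated_field\"}}";
            try (OutputStream out = con.getOutputStream()) {
                out.write(body.getBytes("UTF-8"));
            }
            System.out.println("HTTP " + con.getResponseCode());
        }
    }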

Broken indexer on Azure-Search (error: multiple columns with the same name)

We are experiencing a sudden and strange issue with our Azure Search indexer. We had an index (2015-02-28-preview version) with a corresponding datasource and indexer based on a table of a SQL Azure v12 database. Change tracking was enabled and changes were properly forwarded to the index. A couple of days ago, our attention was drawn by the fact that the latest changes in the database were no longer being replicated to the index. Being in a development phase, this index was frequently rebuilt by developers, and nobody noticed exactly when things started to go wrong.
In the Azure portal, the index is displayed in red with an error message stating we have a duplicate column in the datasource ("Datasource contains multiple columns with the same name 'ProductId'"), which is false. We cleaned the database and tried several things but could not find any duplicate column. As of today, the situation is the following:
1/ After deleting and recreating everything (index, indexer and datasource), the index is filled with the 2,000 documents present in the SQL table.
2/ The index is fully populated and can be queried without any issue, though it still shows up in red with the "duplicate column" error message.
3/ Due to this error, we cannot manually force a new indexing run from the Azure portal.
4/ In order to reflect changes to the indexed table, we have to re-run the script which deletes the index, indexer and datasource and re-creates everything. After running this script, we're back at step 1 above (index queryable, but in an error state and cannot be updated without drop/recreate).
This problem seems to have occurred all of a sudden, without any change on our side, as if there had been a server-side version change. Is there any newer release of the Azure Search REST APIs available? Has anyone encountered the same issue, or any hints on things we could check?
Thanks for your help shedding some light on what may be broken here,
Problem fixed thanks to Eugene's investigation. He discovered a bug in the C# code used to generate the datasource: a casing difference between a "ProductId" column in the database and a "ProductID" field in the index.
We fixed the casing mismatch and the issue is gone. Microsoft support said that they'll "fix the issue in the coming weeks". The same code used to work properly (and still works properly on the first run), so it looks like the indexing process has somehow become more case sensitive than before.

Solr search engine Updating document

I'm using the Solr search engine and I'm new to it. I want the data to be updated automatically every time my database gets updated or new data is created in the tables. I tried delta import and full import, but with these methods I have to run the import manually whenever I need to update.
Which way is best for updating Solr documents?
How can I make it happen automatically?
Thanks for your help.
There isn't a built-in way to do this using Solr. I wouldn't recommend running a full or delta import when just updating one row in a table. What most Solr deployments do with a database is update the corresponding document when updating a row. This will be application specific, but it is the most efficient and standard way of dealing with this issue.
Full or delta imports are typically something you would run nightly or every few hours.
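As an illustration of that application-level approach, a minimal SolrJ sketch, assuming a products table and a core at /solr/mycore (the URL and field names are made up):

    // Minimal sketch: whenever the application updates a row, push the matching
    // Solr document. The core URL and field names are assumptions.
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SyncOnUpdate {
        private final SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

        public void onProductUpdated(String id, String name, double price) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);     // uniqueKey: re-adding with the same id replaces the old document
            doc.addField("name", name);
            doc.addField("price", price);
            solr.add(doc);
            solr.commit();              // or rely on autoCommit / commitWithin instead of a hard commit per update
        }
    }

The call would live wherever your application already writes the row, so the index is touched exactly once per change.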
So basically you want to process the document before adding it to Solr.
This can be achieved by adding a new update processor to the update processor chain; you can go through: Solr split joined dates to multivalue field.
There they split the data in a field and saved it as a multi-valued field.
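As a sketch of what such a processor can look like (a hypothetical factory that splits a comma-joined field into a multi-valued one; the class and field names are made up for illustration):

    // Hypothetical update processor that splits a delimited field into
    // multiple values before the document is indexed.
    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class SplitFieldProcessorFactory extends UpdateRequestProcessorFactory {
        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                SolrQueryResponse rsp, UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    Object joined = doc.getFieldValue("joined_dates");   // assumed source field
                    if (joined != null) {
                        doc.removeField("joined_dates");
                        for (String part : joined.toString().split(",")) {
                            doc.addField("dates", part.trim());          // assumed multi-valued target field
                        }
                    }
                    super.processAdd(cmd);                               // hand off to the rest of the chain
                }
            };
        }
    }

It would be compiled into a jar placed in Solr's lib directory and referenced from an updateRequestProcessorChain in solrconfig.xml.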

Solr is showing data in search for some time only, after some time results disappear

Hi, I am facing a problem with Solr. Earlier everything was fine, but after a server restart Solr only shows the documents that were indexed before the restart. Whenever I create a new document, it gets indexed, the commit works, and Solr shows it in the results, but after a few hours the result disappears from the search results. Before the server restart this was not happening. I don't know why it is behaving like this now.
I also observed the index data: when I index newly created data, the size of the index folder increases and the document shows up in the results, but after some time the size goes back to what it was before and the results disappear as well.
I am using Tomcat 6.0, the tx_Solr 2.8.0 plugin, and Solr 4.2.0.
Can anyone let me know why it is now deleting only the newly created index entries? The devlog shows a delete query gets fired, but I don't know why this is happening.
Any help would be highly appreciated.

SOLR 3.6.0, After a full re-index of a bunch of entities, some of my items are not making it into the SOLR index, but no logs are being generated

Using a StreamingUpdateSolrServer, I used the following algorithm to re-index my huge dataset into SOLR:

    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer(solrServerUrl, numDocsToAddInBatch, numOfThreads);

    for (Item item : items) {                               // items: the objects being re-indexed
        SolrInputDocument document = createDocument(item);  // build the document for this item
        server.add(document);
    }

    // when all items have been added
    server.commit();
    server.optimize();
The problem:
Some of my items are not making it into the SOLR index, but no logs are being generated to tell me what happened.
I was able to find most of the documents, but some were missing. There were no errors in any logs, and I have substantial try/catch blocks with logging around all SolrJ calls on the client side.
Verify logging is not being hidden for the SOLR WAR
You will definitely want to verify that the SOLR server log settings are not hiding the fact that documents are failing to be added to the index.
Because SOLR uses the SLF4J API, your SOLR server could be overriding the log settings, preventing you from seeing an error message when a document fails to be indexed.
If you have a custom {solr-war}/WEB-INF/classes/logging.properties, you will need to make sure that the settings are not hiding the error messages.
By default, errors in adding an item should be shown automatically. So if you did not change your SOLR log settings at any point, you should be seeing any errors during indexing in your server log file.
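For example, if the WAR is using the default java.util.logging binding, the thing to check is that no level is raised so high that indexing errors are swallowed; something along these lines (an illustrative logging.properties, not your exact file):

    # Illustrative java.util.logging settings: keep levels at INFO (or WARNING)
    # so that errors from org.apache.solr.* are not hidden.
    handlers = java.util.logging.ConsoleHandler
    .level = INFO
    java.util.logging.ConsoleHandler.level = INFO
    org.apache.solr.level = INFO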
Troubleshoot why Documents are failing to be indexed
In order to investigate this, it is helpful to follow this verification step any time after the indexing is complete (a SolrJ sketch of the check appears below):
Initialize new log log_fromsolr
Initialize new log log_notfound
For each Item…
-->Search SOLR for the item. If SOLR has the object, log each of the item's fields on a single line in log_fromsolr. This should include the uniqueKey for your document if you have one.
-->If the document cannot be found in SOLR for this item, write a line to log_notfound with all the fields from the database object, again putting the uniqueKey first.
Once the verification step has completed, log_notfound contains a list of all documents that failed to be added to the index.
You can use log_fromsolr to compare the document fields of an item that made it into the index with one that did not.
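A rough SolrJ version of that check, assuming the uniqueKey field is called "id" and that Item exposes its id and a useful toString() (the core URL and the Item type are assumptions):

    // Rough sketch of the verification pass: look each source item up in Solr by
    // uniqueKey and write it to one of two logs.
    import java.io.PrintWriter;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocumentList;

    public class VerifyIndex {

        // placeholder for whatever domain object you index
        interface Item {
            String getId();
        }

        public static void verify(List<Item> items) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            try (PrintWriter fromSolr = new PrintWriter("log_fromsolr.txt");
                 PrintWriter notFound = new PrintWriter("log_notfound.txt")) {
                for (Item item : items) {
                    SolrQuery q = new SolrQuery("id:" + item.getId());      // lookup by uniqueKey
                    SolrDocumentList hits = solr.query(q).getResults();
                    if (hits.getNumFound() > 0) {
                        fromSolr.println(item.getId() + " | " + hits.get(0)); // fields as stored in the index
                    } else {
                        notFound.println(item.getId() + " | " + item);        // item missing from the index
                    }
                }
            }
        }
    }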
Verify it is not an intermittent issue
Sometimes it might be the case that it is not the same Items failing to be added to the index each time you try to index.
If you find objects in the log_notfound log, you will want to back up the current notfound log and run the indexing process again from scratch. Use a diff tool to see the differences between the first notfound log and the second notfound log.
An intermittent problem is evident when you see large numbers of differences in these files (Note: some differences are to be expected if new objects are being created in the database in between the first and second re-indexing).
If your problem is intermittent, it most certainly points at the application code with respect to your SOLR transactions not being committed correctly.
The same documents consistently come up missing each time it indexes
At this point we have to compare documents that are being found in the SOLR index with documents that are not getting into the Lucene index. Usually a field-by-field comparison of the object will start turning up some suspicious values that may be causing issues when adding the document to the index.
Try eliminating all the suspicious fields and then re-indexing the entire thing again. See if the documents are still failing to be indexed. If this worked, you will want to start re-introducing the fields that you removed and see if you can pinpoint the one that is the issue.
