CFINDEX throwing attribute validation error exception - solr

I am upgrading to ColdFusion 11 from ColdFusion 8, so I need to rebuild my search indices to work with Solr instead of Verity. I cannot find any reliable way to import my old Verity collections, so I'm attempting to build the new indices from scratch. I am using the following code to index some items along with their corresponding documents, which are located on the server:
<cfsetting requesttimeout="3600" />
<cfquery name="qDocuments" datasource="#APPLICATION.DataSource#">
SELECT DISTINCT
ID,
Status,
'C:\Documents\'
CONCAT ID
CONCAT '.PDF' AS File
FROM tblDocuments
</cfquery>
<cfindex
query="qDocuments"
collection="solrdocuments"
action="fullimport"
type="file"
key="document_file"
custom1="ID"
custom2="Status" />
A very similar setup was used with Verity for years without a problem.
When I run the above code, I get the following exception:
Attribute validation error for CFINDEX.
The value of the FULLIMPORT attribute is invalid.
Valid values are: UPDATE, DELETE, PURGE, REFRESH, FULL-IMPORT,
DELTA-IMPORT,STATUS, ABORT.
This makes absolutely no sense, since there is no "FULLIMPORT" attribute for CFINDEX.
I am running ColdFusion 11 Update 3 with Java 1.8.0_25 on Windows Server 2008R2/IIS7.5.

You should believe the error message. Try this:
<cfindex
query="qDocuments"
collection="solrdocuments"
action="FULL-IMPORT"
type="file"
key="document_file"
custom1="ID"
custom2="Status" />
It's referring to the value of the attribute action.

This is definitely a bug. In the ColdFusion documentation, fullimport is not an attribute of cfindex.

I know this is an old thread, but in case anyone else has the same question: it's just a poorly worded description in the documentation. The "fullimport" action is only available when using type="dih" (i.e. the Data Import Handler). When using the query attribute, use action="refresh" instead (see the example after the documentation excerpt below).
Source: CFIndex Documentation:
...
When type="dih", these actions are used:
abort: Aborts an ongoing indexing task.
deltaimport: For partial indexing. For instance, for any updates in the database, instead of a full import, you can perform delta import to update your collection.
fullimport: To index full database. For instance, when you index the database for the first time.
status: Provides the status of indexing, such as the total number of documents processed and status such as idle or running.
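Applied to the code in the question, a hedged rewrite of the tag would keep everything else and only change the action. Note that with type="file" the key attribute should name the query column that holds the file path, which the query above aliases as File; adjust if your real column name differs:
<cfindex
query="qDocuments"
collection="solrdocuments"
action="refresh"
type="file"
key="File"
custom1="ID"
custom2="Status" />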

Related

Solr deletions with custom full import

I'm trying to use the DataImportHandler to keep my index in sync with a SQL database (what I would think is a pretty vanilla thing to do). Since my database will be pretty large, I want to use incremental imports using this method http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport so the calls are of the form http://localhost:8983/solr/Items/dataimport?command=full-import&clean=false. This works perfectly well for adding items.
I have a separate DeletedItems table in my database which contains the primary keys of the items that have been deleted from the Items table, along with when they were deleted. As part of the DataImport call, I had hoped to be able to delete the relevant items from my index based on a query along the lines of
SELECT Id FROM DeletedItems WHERE WasDeletedOn > '${dataimporter.last_index_time}'
but I can't figure out how to do this. The link above alludes to it with the cryptic
In this case it means obviously that in case you also want to use deletedPkQuery then when running the delta-import command is still necessary.
but setting deletedPkQuery to the above SQL query doesn't seem to work. I then read that deletedPkQuery only works with delta-imports, so am I forced to make two requests to my Solr server as part of the sync process? This doesn't seem right, as the operations are parameterized by the dataimporter.last_index_time property, which changes. Both steps would need to be done in one "atomic" action, surely? Any ideas?
You must use the import handler special commands
https://wiki.apache.org/solr/DataImportHandler#Special_Commands
With these commands you can alter the boost of, or delete, a document coming from the recordset of the full-import query. Be aware that you must use the $skipDoc field to prevent the document from being indexed again, and that you must repeat the id in the $deleteDocById field.
You can use a union query:
select
    id,
    text,
    'false' as [$deleteDocById],
    'false' as [$skipDoc]
from [rows to update or add]
union
select
    id,
    '' as text,
    id as [$deleteDocById],
    'true' as [$skipDoc]
from [rows to delete]
or a CASE WHEN:
select
    id,
    text,
    CASE
        when deleted = 1 then id
        else 'false'
    END as [$deleteDocById],
    CASE
        when deleted = 1 then 'true'
        else 'false'
    END as [$skipDoc]
from [rows to update, add or delete]
where updated > '${dih.last_index_time}'
The deletedPkQuery is run as part of the regular call to delta-import, so you don't have to run anything twice (and when doing a full-import, there's no need to run deletedPkQuery, since the whole collection is cleared before importing anyway).
The deletedPkQuery should be configured on the same element as your main query. Be sure to match the field names exactly as well, and that the id produced by your deletedPkQuery matches the one provided by the main query.
There's a minimal example on solr.pl for importing and deleting documents using the same kind of deleted-entries table structure as you have here:
<entity
name="album"
query="SELECT * from albums"
deletedPkQuery="SELECT deleted_id as id FROM deletes WHERE deleted_at > '${dataimporter.last_index_time}'"
>
Also make sure that the format of the deleted_at-field is comparable against the value produced by last_index_time. The default is yyyy-MM-dd HH:mm:ss.
And lastly, remember that the last_index_time property isn't available before the second time the task is run, since there's no "previous index time" the first time an index is being populated (but the deletedPkQuery shouldn't run before that anyway).
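Adapting the solr.pl-style example above to the question's DeletedItems table might look roughly like this; the entity name, the SELECT * main query, and the assumption that the Solr uniqueKey field is called id are all placeholders:
<entity
name="item"
query="SELECT * FROM Items"
deletedPkQuery="SELECT Id AS id FROM DeletedItems WHERE WasDeletedOn > '${dataimporter.last_index_time}'"
>
Since deletedPkQuery is only picked up by delta-import, the sync would then be the full-import (clean=false) call for adds plus a delta-import call for deletes, unless you fold the deletes into the main query with the $deleteDocById / $skipDoc approach shown earlier.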

Solr: To get all records

I am trying to upgrade from Solr 4.x to a Solr 5.2.1 SolrCloud implementation. I had written the following code to get all the results from a Solr query, and it works well in single-instance mode.
SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.addSort("agent_status", ORDER.desc);
query.addFilterQuery("account_id:\"" + accountId + "\"");
query.set("rows", Integer.MAX_VALUE);
But the code does not work in the SolrCloud implementation. It throws the following exception:
2015-08-14 16:44:45,648 ERROR [solr.core.SolrCore] - [http-8080-8] : java.lang.NegativeArraySizeException
at org.apache.lucene.util.PriorityQueue.<init>(PriorityQueue.java:58)
at org.apache.lucene.util.PriorityQueue.<init>(PriorityQueue.java:39)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.<init>(ShardDoc.java:113)
at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:972)
at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:750)
at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:729)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:388)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
I found that it is failing because of the query.set("rows", Integer.MAX_VALUE) statement. People suggested using pagination.
But I cannot afford to do pagination, as there would be too many changes on the UI side.
There is one more way: I can first query with some small row count, get the total number of documents using the response.getResults().getNumFound() method, and then pass that value to setRows. But this approach adds one more call to the server.
Is there any other way I can solve this problem?
You can always set your rows to a large value that would encompass your results. Integer.MAX_VALUE will not work due to the size limits of Java arrays (see here) and of the Lucene PriorityQueue (see lines 42-58).
SOLR-534 requested essentially what you're asking for; there is some good conversation there about why such a feature would or would not be a good idea.
A better question might be: how many documents can the UI hold without becoming unusable? However many documents that is would be a good value for your query to return.
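For reference, here is a rough SolrJ sketch of the two-request fallback the question describes (first ask only for the count, then fetch exactly that many rows). The endpoint URL, collection name, and account value are assumptions, and on SolrJ 5.x you would construct the client without the Builder:
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FetchAllDocs {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and collection name - adjust to your cluster.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/agents").build();
        String accountId = "42"; // illustrative value

        SolrQuery query = new SolrQuery("*:*");
        query.addSort("agent_status", ORDER.desc);
        query.addFilterQuery("account_id:\"" + accountId + "\"");

        // First call: rows=0, just to learn how many documents match.
        query.setRows(0);
        long numFound = solr.query(query).getResults().getNumFound();

        // Second call: request exactly that many rows, so the distributed merge
        // never tries to allocate an Integer.MAX_VALUE-sized priority queue.
        query.setRows((int) Math.min(numFound, Integer.MAX_VALUE - 8));
        QueryResponse response = solr.query(query);
        System.out.println("Fetched " + response.getResults().size() + " documents");
        solr.close();
    }
}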

App Engine no longer updating index.yaml

The index.yaml file of my GAE app is no longer updated by the development server.
I have recently added a new kind to my app and a handler that queries this kind like so:
from google.appengine.ext import ndb
class MyKind(ndb.Model):
thing = ndb.TextProperty()
timestamp = ndb.DateTimeProperty(auto_now_add=True)
and in the handler I have a query
query = MyKind.query()
query.order(-MyKind.timestamp)
logging.info(query.iter().index_list())
entities = query.fetch(100)
for entity in entities:
# do something
AFAIK, the development server should create an index for this query and update index.yaml accordingly. However, it doesn't. It just looks like this:
indexes:
# AUTOGENERATED
The logging.info(query.iter().index_list()) call should output the index used for the query, but it just says 'None'. Also, the SDK console says 'Datastore contains no indexes.'
Running the query returns the entities unsorted. I have two questions:
is there some syntax error in my code that causes the query results to be unsorted, or is it the missing index?
if it's the missing index, is there a way to manually force the dev server to update index.yaml? Other suggestions?
Thank you
Your call to order() returns a new query:
query = MyKind.query()
query = query.order(-MyKind.timestamp)
To clarify: query.order(-MyKind.timestamp) does not change the query in place; it returns a new one, so you need to use the query returned by that method. As written, query.order(-MyKind.timestamp) in your code does nothing.

BadArgumentError: _MultiQuery with cursors requires __key__ order in ndb

I can't understand what this error means, and apparently no one on the internet has ever gotten the same error:
BadArgumentError: _MultiQuery with cursors requires __key__ order
This happens here:
return SocialNotification.query().order(-SocialNotification.date).filter(SocialNotification.source_key.IN(nodes_list)).fetch_page(10)
The property source_key is obviously a key and nodes_list is a list of entity keys previously retrieved.
What I need is to find all the SocialNotifications that have a field source_key that match one of the keys in the list.
The error message tries to tell you that queries involving IN and cursors must be ordered by __key__ (which is the internal name for the key of the entity). (This is needed so that the results can be properly merged and made unique.) In this case you have to replace your .order() call with .order(SocialNotification._key).
It seems that this also happens when you filter for an inequality and try to fetch a page.
(e.g. MyModel.query(MyModel.prop != 'value').fetch_page(...)). This basically means (unless I missed something) that you can't use fetch_page with an inequality filter, because on one hand you need the sort to be on MyModel.prop, but on the other hand you need it to be on MyModel._key, which is hard :)
I found the answer here: https://developers.google.com/appengine/docs/python/ndb/queries#cursors
You can change your query to:
SocialNotification.query().order(-SocialNotification.date, SocialNotification.key).filter(SocialNotification.source_key.IN(nodes_list)).fetch_page(10)
in order to get this to work. Note that it seems to be slow (18 seconds) when nodes_list is large (1000 entities), at least on the development server. I don't have a large amount of test data on a test server.
You need both the property you want to order on and the key:
.order(-SocialNotification.date, SocialNotification.key)
I had the same error when filtering without a group.
The error occurred every time my filter returned more than one result.
To fix it I actually had to add ordering by key.
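For completeness, here is a rough sketch of paging the corrected query with cursors, so the UI never needs all results at once; the webapp2-style request handling and the page size of 10 are assumptions:
from google.appengine.datastore.datastore_query import Cursor

# Resume from the cursor passed by the client, if any (hypothetical 'cursor' parameter).
cursor_param = self.request.get('cursor')
curs = Cursor(urlsafe=cursor_param) if cursor_param else None

query = (SocialNotification.query()
         .filter(SocialNotification.source_key.IN(nodes_list))
         .order(-SocialNotification.date, SocialNotification.key))

# fetch_page returns the page of results, a cursor pointing past them,
# and a flag saying whether more results are likely to exist.
notifications, next_curs, more = query.fetch_page(10, start_cursor=curs)
# Hand next_curs.urlsafe() back to the client so it can request the next page.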

How can I modify the Solr Update Handler to not simply overwrite existing documents?

I'm working with Solr indexing data from two sources: a real-time "pump" that inserts (and updates) documents into Solr, and a database which holds backups of those documents.
The problem we encountered looks like this: if we run a data import from the database while the pump is performing inserts, we may index a doc from the pump and later overwrite it with the doc extracted from the database, which is a backup and therefore probably a little outdated.
If we stop the pump, import from the database, and start the pump again, it will probably cause instabilities in our application.
What I'd like to do is tell Solr not to overwrite the document automatically, but to do so conditionally (for example based on the value of a 'last_modified_date' field).
My question is - how can I do it? Do I have to modify Solr source, make a new class overwriting some update processor, or just add some magic lines to solrconfig?
Sorry, but there is not an option or config setting to tell Solr not to automatically update documents and instead perform some conditional check. The current model for Solr is that if you insert a document with the same unique id as one already in the index, it will "update" that document by a delete/add operation. Solr also does not currently support updating only specific fields of an existing indexed document. Please see issue SOLR-139 for more details.
Based on the scenario you have described, I would suggest that you create a process outside of Solr that handles the retrieval of items from your data sources, performs the conditional check against what is already in the index, and determines whether an update to the index is necessary.
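A rough SolrJ sketch of such an external gatekeeper; the core URL, the id uniqueKey, and the last_modified_date field name are assumptions based on the question's wording, not a definitive implementation:
import java.util.Date;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class ConditionalIndexer {

    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();

    // Adds the candidate document only if the copy already in the index
    // (matched on the "id" field) is missing or older than lastModified.
    public void addIfNewer(String id, Date lastModified, SolrInputDocument candidate) throws Exception {
        SolrQuery query = new SolrQuery("id:\"" + id + "\"");
        query.setFields("id", "last_modified_date");
        query.setRows(1);

        SolrDocumentList existing = solr.query(query).getResults();
        if (!existing.isEmpty()) {
            Date indexed = (Date) existing.get(0).getFieldValue("last_modified_date");
            if (indexed != null && !indexed.before(lastModified)) {
                return; // the indexed copy is at least as fresh - skip the backup row
            }
        }
        solr.add(candidate); // overwrites (delete/add) with the newer version
        solr.commit();       // or rely on autoCommit instead
    }
}
Both the pump and the database importer would then funnel their writes through something like addIfNewer, so whichever copy is older is simply skipped.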
You can use Solr script update processors to check whether the document already exists and proceed accordingly.
The code below only works when Solr runs on Java 8, because it relies on the Nashorn JavaScript engine (Java.type):
function processAdd(cmd) {
    var doc = cmd.solrDoc; // the SolrInputDocument being added
    var previousDoc = null;
    try {
        // create a Lucene Term for the field that uniquely identifies the document
        var Term = Java.type("org.apache.lucene.index.Term");
        var termObject = new Term("fieldForSearchTryUnique", "Value of field");
        // retrieve the internal document id from Solr; returns -1 if not present
        var previousDocId = req.getSearcher().getFirstMatch(termObject);
        if (previousDocId != -1) {
            // get the complete stored document from Solr for that id
            previousDoc = req.getSearcher().doc(previousDocId);
            // do the required processing here, e.g. compare dates and decide
            // whether to keep or drop cmd.solrDoc
        }
    } catch (err) {
        logger.error("error in update processor " + err);
    }
}
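If you go the script-processor route, the script also has to be wired into an update chain in solrconfig.xml, roughly like the sketch below (the chain and script file names are placeholders), and that chain then has to be selected with the update.chain request parameter or made the default for your update handler:
<updateRequestProcessorChain name="conditional-update">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">conditional-update.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>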
