Solr's documentation for DataImportHandler gives this table for the entity query attributes.
That's not very descriptive. Can someone explain the difference and interaction between these query attributes? I have seen some configurations use deltaQuery and parentDeltaQuery to support nested entities, and others use deltaQuery and deltaImportQuery.
What is the reason for choosing one over the other?
I see it now in the Solr Wiki:
* The query gives the data needed to populate fields of the Solr document in full-import
* The deltaImportQuery gives the data needed to populate fields when running a delta-import
* The deltaQuery gives the primary keys of the current entity which have changes since the last index time
* The parentDeltaQuery uses the changed rows of the current table (fetched with deltaQuery) to give the changed rows in the parent table. This is necessary because whenever a row in the child table changes, we need to re-generate the document which has that field.
I missed this explanation on the first pass, and expected that information to show up in the table I posted. Strangely enough, Solr In Action spent less than 1 page of 600 explaining how to use DataImportHandler to read a database.
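The interaction between query, deltaQuery and deltaImportQuery can be sketched outside of Solr. The following is only a simulation in Python with an in-memory SQLite database, not DataImportHandler itself; the `item` table, its columns and the function names are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical source table; names are illustrative, not taken from the Solr docs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [
        (1, "alpha", "2024-01-01 00:00:00"),
        (2, "beta", "2024-01-01 00:00:00"),
        (3, "gamma", "2024-01-05 00:00:00"),
    ],
)

# Stand-ins for the three entity attributes.
QUERY = "SELECT id, name FROM item"
DELTA_QUERY = "SELECT id FROM item WHERE last_modified > ?"
DELTA_IMPORT_QUERY = "SELECT id, name FROM item WHERE id = ?"


def full_import(conn):
    """Role of `query`: fetch every row and build one document per row."""
    return [{"id": i, "name": n} for i, n in conn.execute(QUERY)]


def delta_import(conn, last_index_time):
    """Role of `deltaQuery` followed by `deltaImportQuery`."""
    # Step 1: deltaQuery returns only the primary keys of changed rows.
    changed_ids = [row[0] for row in conn.execute(DELTA_QUERY, (last_index_time,))]
    # Step 2: deltaImportQuery runs once per key to fetch the full field data.
    docs = []
    for pk in changed_ids:
        for i, n in conn.execute(DELTA_IMPORT_QUERY, (pk,)):
            docs.append({"id": i, "name": n})
    return docs


full_docs = full_import(conn)
delta_docs = delta_import(conn, "2024-01-03 00:00:00")
```

In this sketch the delta run touches only the row modified after the last index time, while the full run rebuilds every document.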
I am new to Apache Solr. So far I have imported a single table into Solr and queried its data.
Now I want to do the following:
Query across multiple tables. For example, if I search for a word, it should return all occurrences in multiple tables.
Search in all fields of a table. For example, query by a word across all fields of a single table.
Do I need to create a single document by importing data from multiple tables using joins in data-config.xml? And then querying over it?
Any leads and guidance are welcome.
TIA.
Do I need to create a single document by importing data from multiple tables using joins in data-config.xml? And then querying over it?
Yes. Solr uses a document model (rather than a relational model) and the general approach is to index a single document with the fields that you need for searching.
From the Apache Solr guide:
Solr’s basic unit of information is a document, which is a set of data
that describes something. A recipe document would contain the
ingredients, the instructions, the preparation time, the cooking time,
the tools needed, and so on. A document about a person, for example,
might contain the person’s name, biography, favorite color, and shoe
size. A document about a book could contain the title, author, year of
publication, number of pages, and so on.
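As a rough illustration of the document model described above, the sketch below flattens two joined tables into one document per parent row and adds a catch-all `text` field so that a single word can be matched across all values. It is plain Python with SQLite, not Solr; the `author` and `book` tables, the `text` field and the naive `search` helper are all hypothetical.

```python
import sqlite3

# Hypothetical relational source with a one-to-many relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ada Solr'), (2, 'Bob Lucene');
    INSERT INTO book VALUES
        (10, 1, 'Search Basics'),
        (11, 1, 'Indexing Tables'),
        (12, 2, 'Search at Scale');
""")


def build_documents(conn):
    """Join the tables and collapse the result into one document per author."""
    docs = {}
    rows = conn.execute(
        "SELECT a.id, a.name, b.title FROM author a "
        "LEFT JOIN book b ON b.author_id = a.id ORDER BY a.id, b.id"
    )
    for author_id, name, title in rows:
        doc = docs.setdefault(
            author_id, {"id": author_id, "name": name, "titles": []}
        )
        if title is not None:
            doc["titles"].append(title)  # multi-valued field
    for doc in docs.values():
        # Catch-all field holding every searchable value of the document.
        doc["text"] = " ".join([doc["name"], *doc["titles"]]).lower()
    return list(docs.values())


def search(docs, word):
    """Naive stand-in for a search over the catch-all field."""
    return [d["id"] for d in docs if word.lower() in d["text"].split()]


docs = build_documents(conn)
```

Here `search(docs, "indexing")` matches only the first author, while `search(docs, "search")` matches both, because the word occurs in data that originally lived in the child table.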
We have around 100k products on our website, and each product has around 30 indexed attributes. Most of the time we only update the price of a product, but we still have to reindex the whole product. Is it possible in hybris to index only the price attribute (or the description attribute) for all 100k products?
It has been possible since Solr 4.0. The feature is called partial update (Solr's documentation also calls it atomic update), and it lets you update only the fields that changed, in your case price and description.
The official documentation is here.
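At the plain Solr level (this is not the hybris API), a partial update is sent as a document that carries the unique key plus a modifier such as `set` for each field to change. The sketch below only builds the JSON body; the product ids, the `price` field name and the helper names are assumptions, and the body would still have to be posted to the update handler of your own core.

```python
import json


def price_update(doc_id, price):
    """One atomic-update document: only `price` is modified."""
    return {"id": doc_id, "price": {"set": price}}


def build_update_body(prices):
    """Build the JSON body for a batch of price-only updates."""
    return json.dumps(
        [price_update(doc_id, price) for doc_id, price in prices.items()]
    )


# Hypothetical product ids and prices.
body = build_update_body({"P-1001": 19.99, "P-1002": 5.5})
```

Note that Solr rebuilds the whole document internally when applying such an update, so the other fields generally need to be stored (or have docValues) for their values to survive.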
Marco is right. You can do a Partial Update.
For Hybris, there is some documentation in Creating and Configuring Indexed Types. The SolrIndexerQuery.type attribute lets you choose PARTIAL_UPDATE.
You have the following values to choose from:
FULL: recreates the index
UPDATE: updates some documents in the index
PARTIAL_UPDATE: allows you to select the fields for the update
DELETE: deletes documents from the index
I'm building a Java app that uses a relational database, and I wish to map its primary data to one or more Solr indexes. However, I'm not sure how to map the components of a database. At the moment I've mapped a single row cell to a Solr/Lucene Document.
A doc would be something like this (each line is a field):
schema: "schemaName"
table: "tableName"
column: "columnName"
row: "rowNumber"
data: "data on schemaName.tableName.columnName.row"
This allows me to have a "fixed" Solr schema.xml (as far as I know it has to be defined before creating indexes). Also, dynamic fields don't seem to serve my purpose.
What I've found while searching is that a single row is usually mapped to a Solr Document and each column is mapped to a Field. But how can I add the column names as fields to schema.xml when I don't know which columns a table has? Also, I would need the data to be queryable as if it were SQL, for example searching all rows of a column in a table.
With my current "solution" I can run those kinds of queries, but I'm worried about performance, as I'm new to Solr and don't know what implications it may have.
So, what do you say about my "solution"? Is there another way to map a database to a Solr index, given that the schema.xml fields should be set before indexing? I've also "heard" that a table is usually mapped to an index: how could I achieve that?
Maybe I'm just being a noob, but from the research I did I don't see how I can map a database row to a Solr Document without touching the schema.xml fields.
I would appreciate any thoughts :) Regards.
You can specify your table columns in the schema beforehand, or use dynamic fields, and then use the Solr DIH to import the data from the database into Solr. In the DIH queries, select the columns under names that match your dynamic fields.
Please go through the Solr DIH documentation for database integration.
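To make the dynamic-field idea concrete, here is a minimal sketch that turns one database row into one document without knowing the columns in advance, by appending a type suffix to each column name. It assumes the schema declares dynamic fields for the suffixes `_s`, `_i`, `_d` and `_b` (as Solr's example configurations do); the `table_s` field and the function name are made up for illustration.

```python
# Assumed dynamic-field suffixes; adjust to whatever your schema declares.
SUFFIX_BY_TYPE = {bool: "_b", int: "_i", float: "_d", str: "_s"}


def row_to_document(table, pk_column, row):
    """Map one row (a dict of column -> value) to one document."""
    doc = {
        "id": f"{table}:{row[pk_column]}",  # unique across tables
        "table_s": table,  # lets queries filter by source table
    }
    for column, value in row.items():
        if value is None:
            continue  # omit NULLs instead of indexing empty values
        suffix = SUFFIX_BY_TYPE.get(type(value))
        if suffix is None:
            # Unknown types fall back to a string field.
            suffix, value = "_s", str(value)
        doc[f"{column}{suffix}"] = value
    return doc


doc = row_to_document(
    "customer",
    "id",
    {"id": 7, "name": "Ann", "balance": 12.5, "active": True, "note": None},
)
```

With documents shaped like this, "all rows of a column in a table" becomes a query restricted to `table_s` and the suffixed field name, and no per-table edit of schema.xml is needed.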
I'm working with Solr and indexing data from a DB.
When I import the data using an SQL query, I get some rows with the same key.
I need a way for Solr to generate a new field with a unique key.
How can I do that?
Thanks
I am not sure if this is possible or not, but maybe you need to reconsider your logic here.
An indexing operation into Solr should be re-runnable. Imagine that one day you decide to change the schema of your core.
If you generate a new key every time you import a document, you will end up creating duplicate items when you re-run your data import.
Maybe you need to revisit your DB design to have a unique key, or maybe in the select query you can create a derived column whose value is calculated from multiple columns. But I am sure that pushing this problem to Solr is not the solution.
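One way to get such a derived key is to hash the columns that together identify a row. Because the result depends only on the data, re-running the import produces the same key and overwrites the existing document instead of duplicating it. This is a sketch under assumptions: the column names are hypothetical, and the same derivation could just as well be done inside the select query.

```python
import hashlib


def composite_key(row, key_columns):
    """Derive a deterministic unique key from several column values."""
    parts = [str(row[column]) for column in key_columns]
    # A separator keeps ("ab", "c") and ("a", "bc") from colliding.
    raw = "\x1f".join(parts)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


# Hypothetical rows that share order_id but differ in line_no.
row_a = {"order_id": 5, "line_no": 1, "sku": "X"}
row_b = {"order_id": 5, "line_no": 2, "sku": "X"}

key_a = composite_key(row_a, ["order_id", "line_no"])
key_b = composite_key(row_b, ["order_id", "line_no"])
```

The two rows get different keys, and calling `composite_key` again on `row_a` returns the same value as before.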
Ideally the unique key should come from the DB (are you sure you cannot get one, for example by composing some columns?).
But if you cannot, Solr supports UUID generation for this. Look here to see how it works depending on your Solr version.
What is pk in a Solr DIH delta import? I am trying to delta-index multiple fields in Solr.
I believe it is whatever field you specify in your schema.xml file as the id field.
It is the name of the Solr field that serves as the unique key for that record. You define your mapping from the source to that Solr column and then, after mapping, Solr checks its presence and values based on the pk field you specified.
It is different from primaryKey because you may be generating the primaryKey, or it may be unsuitable for some reason. But it could be the same. I think the clearest Wiki explanation may be in the example for HttpDataSource.
I believe you may also be able to define a compound pk for when you are flattening inner source entries into one Solr entry.
I think the problem is in your delta-query for the child entity. You have given,
deltaQuery="select id from cc_gadget_lang where '${cc_gadget.last_modified_date}' > '${dataimporter.last_index_time}'"
I think the where condition in the above query always evaluates to TRUE and so serves no purpose: it compares two substituted values rather than a column of cc_gadget_lang, so it gives the same result for every row.
The solution I would suggest is to add a separate "last_modified_date" column to the "cc_gadget_lang" table in your database and use it in the delta query of your child entity.
I also believe that there is no need to have the "pk" of the child entity in your schema file, because those keys are used only temporarily during delta-imports and do not need to be stored permanently in the index.
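The suggestion can be checked with a small simulation in Python and SQLite (not DIH itself). It assumes the child table has its own last_modified_date column and a hypothetical gadget_id foreign key pointing at cc_gadget; the constant comparison from the original deltaQuery is included to show that it selects either every row or none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cc_gadget (
        id INTEGER PRIMARY KEY, name TEXT, last_modified_date TEXT);
    CREATE TABLE cc_gadget_lang (
        id INTEGER PRIMARY KEY, gadget_id INTEGER, title TEXT,
        last_modified_date TEXT);  -- gadget_id is an assumed column name
    INSERT INTO cc_gadget VALUES
        (1, 'clock', '2024-01-01 00:00:00'),
        (2, 'radio', '2024-01-01 00:00:00');
    INSERT INTO cc_gadget_lang VALUES
        (10, 1, 'Uhr', '2024-01-01 00:00:00'),
        (11, 2, 'Radio', '2024-01-09 00:00:00');
""")

# Child deltaQuery based on the child's own modification column.
CHILD_DELTA_QUERY = "SELECT id FROM cc_gadget_lang WHERE last_modified_date > ?"
# Stand-in for parentDeltaQuery: map a changed child row to its parent key.
PARENT_DELTA_QUERY = "SELECT gadget_id FROM cc_gadget_lang WHERE id = ?"
# Shape of the original condition: two values, no column involved.
CONSTANT_CONDITION = "SELECT id FROM cc_gadget_lang WHERE ? > ? ORDER BY id"


def parents_to_reindex(conn, last_index_time):
    """Return the parent ids whose documents must be regenerated."""
    child_ids = [r[0] for r in conn.execute(CHILD_DELTA_QUERY, (last_index_time,))]
    parent_ids = set()
    for child_id in child_ids:
        for (gadget_id,) in conn.execute(PARENT_DELTA_QUERY, (child_id,)):
            parent_ids.add(gadget_id)
    return sorted(parent_ids)


last_index_time = "2024-01-05 00:00:00"
changed_parents = parents_to_reindex(conn, last_index_time)
# The constant comparison ignores the rows entirely.
every_row = [r[0] for r in conn.execute(
    CONSTANT_CONDITION, ("2024-01-09 00:00:00", last_index_time))]
no_row = [r[0] for r in conn.execute(
    CONSTANT_CONDITION, ("2024-01-01 00:00:00", last_index_time))]
```

Only the gadget whose language row changed after the last index time is picked up by the column-based query, whereas the constant comparison cannot distinguish changed rows from unchanged ones.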