ElasticSearch - workaround for a unique constraint

I am thinking about a smart workaround for the "no unique constraint" problem in ElasticSearch.
I can't use _id to store my unique field, because I am using _id for another purpose.
I crawl Internet pages and store them in an ElasticSearch index. My rule is that a URL must be unique (only one document with a given URL in the index), and since ElasticSearch doesn't allow a unique constraint on a field, I must query the index before inserting a new page to check whether a document with that URL already exists.
So adding a new page currently looks like this:
Query (match) the index in ES to check whether there is a document with the given url field.
If not, insert the new document.
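For illustration, this is roughly what the current flow looks like (a minimal sketch with the Python elasticsearch client; the index, type and field names are just placeholders, and it assumes a pre-7.x cluster where hits.total is a plain number):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    def insert_page(url, page_body):
        # Step 1: extra query to check whether a document with this url already exists.
        result = es.search(index='pages', size=0,
                           body={'query': {'match': {'url': url}}})
        if result['hits']['total'] == 0:
            # Step 2: insert. Race window: a concurrent insert of the same url
            # before the next refresh is invisible to the search above.
            es.index(index='pages', doc_type='page', body=dict(page_body, url=url))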
This solution has two disadvantages:
I must execute an extra query to check whether a document with the given URL already exists. It slows down the inserting process and generates extra load.
If I try to add two documents with the same URL within a short time and the index doesn't refresh before the second document is added, the second query reports that there is no document with the given URL, and I end up with two documents with the same URL.
So I am looking for something else. Please tell me if you have any other ideas, or what you think about these solutions:
Solution 1
Use another database system (or maybe another ES index with the URL as _id) where I store only URLs, and query it to check whether a URL already exists.
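A minimal sketch of this idea (the secondary "claim" index and all names are made up): the URL becomes the _id of a small extra index, and op_type=create makes Elasticsearch reject a second document with the same _id, even before a refresh:

    from elasticsearch import Elasticsearch
    from elasticsearch.exceptions import ConflictError

    es = Elasticsearch()

    def insert_page(url, page_body):
        try:
            # Claim the url first: op_type=create fails with 409 Conflict
            # if a document with this _id already exists.
            es.index(index='crawled-urls', doc_type='url', id=url,
                     body={}, op_type='create')
        except ConflictError:
            return  # url already stored, skip the page
        # Only the first claimer gets here and stores the page itself.
        es.index(index='pages', doc_type='page', body=dict(page_body, url=url))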
Solution 2
Queue documents before inserting them; another process then works through the queue and adds the queued documents to the index, with index refreshing disabled while it does so.

You've hit upon one of the things that Elasticsearch does not do well (secondary indexes and constraints) compared to some other NoSQL solutions. In addition to Solution 1 and Solution 2, I'd suggest you look at Elasticsearch rivers:
Rivers
A river is a pluggable service running within the elasticsearch cluster, pulling data (or being pushed with data) that is then indexed into the cluster.
For example, you could use the MongoDB river and insert your data into MongoDB. MongoDB supports unique secondary indexes, so you can prevent the insertion of duplicate URLs. The river then takes care of pushing the data to Elasticsearch in near real time.
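For example, a quick sketch of the MongoDB side with pymongo (database and collection names are made up); the river then mirrors whatever ends up in MongoDB into Elasticsearch:

    from pymongo import MongoClient
    from pymongo.errors import DuplicateKeyError

    pages = MongoClient().crawler.pages

    # Unique secondary index on url: MongoDB rejects a second document with the same url.
    pages.create_index('url', unique=True)

    def save_page(url, html):
        try:
            pages.insert_one({'url': url, 'html': html})
        except DuplicateKeyError:
            pass  # url already crawled, nothing to do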
https://github.com/richardwilly98/elasticsearch-river-mongodb
ES officially supports a CouchDB river, and a number of other databases have rivers too.

Related

Custom sharding or auto Sharding on SolrCloud?

I want to set up a SolrCloud cluster for over 10 million news articles. After reading this article, Shards and Indexing Data in SolrCloud, I have the following plan:
Add a prefix such as ED2001! to the document ID, where ED is a newspaper source and 2001 is the year part of the article's published date, i.e. I want to put all news articles from a specific newspaper source published in a specific year into one shard.
Create the collection with router.name set to compositeId.
Add documents?
Query Collection?
Practically, I have some questions:
How do I add documents based on this plan? Do I have to specify special parameters when updating the collection/core?
Is this called "custom sharding"? If not, what is "custom sharding"?
Is auto sharding a better choice for my case, since there's a shard-splitting feature for auto sharding when a shard gets too big?
Can I query without the _route_ parameter?
EDIT # 2015/9/2:
This is how I think SolrCloud will behave: "The number of news articles from a specific newspaper source in a specific year tends to stay around a fixed number, e.g. every year ED has around 80,000 articles, so each shard's size won't increase dramatically. For next year's ED articles, I only have to add the prefix 'ED2016!' to the document ID, SolrCloud will create a new shard for me (which contains all ED2016 articles), and later the leader will spread the replicas of this new shard to other nodes (one replica per node other than the leader?)." Am I right? If yes, it seems there is no need for shard splitting.
Answer-1: If you have the schema (structure) of the documents, you can provide it in the schema.xml configuration, or you can use Solr's schemaless mode for indexing. Schemaless mode automatically identifies the fields in your documents and indexes them. Its configuration is a little different from the schema-based configuration mode in Solr. Afterwards, you need to send the documents to Solr for indexing, using curl or the SolrJ Java API. Essentially, Solr provides REST endpoints for all the different operations, so you can write the client in whichever language suits you best.
Answer-2: What you have described in your plan, the use of compositeId, is called custom sharding, because you are deciding which shard a particular document should go to.
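For illustration, a rough sketch of indexing with composite IDs (pysolr, with made-up collection and field names); everything before the "!" is the shard key that Solr hashes to pick the shard, and the same prefix can optionally be passed as _route_ to restrict a query:

    import pysolr

    # Collection assumed to have been created with router.name=compositeId.
    solr = pysolr.Solr('http://localhost:8983/solr/news')

    solr.add([
        {'id': 'ED2001!article-0001', 'source': 'ED', 'title': 'First article'},
        {'id': 'ED2001!article-0002', 'source': 'ED', 'title': 'Second article'},
    ], commit=True)

    # Optional: limit the query to the shard(s) covering this shard key.
    results = solr.search('title:article', _route_='ED2001!')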
Answer-3: I would suggest going with the auto-sharding feature if you are not certain how much data you will need to index now and in the future. As the index size increases, you can split the shards and scale Solr horizontally.
Answer-4: I went through the Solr documentation and did not find _route_ mentioned anywhere as a mandatory parameter. But in some situations it may improve query performance, because it avoids the network latency of querying all the shards.
Answer-5: Auto-sharding means routing a document to a shard based on the hash range assigned when the shards were created. It does not create new shards automatically just because you specify a new prefix in the compositeId. So once the index grows large enough, you might need to split a shard. Check here for more.
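A sketch of such a split via the Collections API (host, collection and shard names are hypothetical):

    import requests

    resp = requests.get('http://localhost:8983/solr/admin/collections', params={
        'action': 'SPLITSHARD',   # splits shard1 into two sub-shards covering the same hash range
        'collection': 'news',
        'shard': 'shard1',
        'wt': 'json',
    })
    print(resp.json())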
This is actually a guide to answering my own question:
I now understand some of the concepts:
"Custom sharding" IS NOT "custom hashing".
Solr splits the hash range evenly across shards as its default hashing behavior.
The compositeId router applies "custom hashing", because it changes the default hashing behavior by prefixing shard_key/num-of-bits.
The implicit router applies "custom sharding", since we manually specify which shards our docs are sent to.
The compositeId router is still auto sharding, since it is Solr that sees the shard_key prefix and routes the docs to specific shards.
The compositeId router requires the numShards parameter (possibly because Solr needs to distribute the hash value ranges across the shards).
So obviously my strategy doesn't work, since I need to keep adding each new year's news articles to Solr and there is no way to predict the number of shards in advance. The implicit router therefore looks like a possible choice for me (we create the shards we need and add docs to the shards we intend), as sketched below.
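A rough sketch of how that could look (Collections API plus an update request; host, collection and shard names are made up): the collection is created with router.name=implicit and an explicit shard list, further shards can be added later with CREATESHARD as new years arrive, and each document is sent to its shard via the _route_ parameter:

    import requests

    base = 'http://localhost:8983/solr'

    # Create the collection with explicitly named shards (implicit router).
    requests.get(base + '/admin/collections', params={
        'action': 'CREATE', 'name': 'news',
        'router.name': 'implicit', 'shards': 'ED2015,ED2016',
        'replicationFactor': 2, 'maxShardsPerNode': 4, 'wt': 'json',
    })

    # Next year: add a shard instead of splitting one.
    requests.get(base + '/admin/collections', params={
        'action': 'CREATESHARD', 'collection': 'news', 'shard': 'ED2017', 'wt': 'json',
    })

    # Send documents to the shard we intend.
    requests.post(base + '/news/update',
                  params={'_route_': 'ED2016', 'commit': 'true', 'wt': 'json'},
                  json=[{'id': 'article-0001', 'source': 'ED', 'title': 'Some title'}])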

List all document keys in a Solr index for the purpose of database synchronisation

I need to synchronize a Solr index with a database table. At any given time, the Solr index may need to have documents added or removed. The nature of the database prevents the Data Import Handler's Delta Import functionality from being able to detect changes.
My proposed solution was to retrieve a list of all primary keys of the database table and all unique keys of the Solr index (which contain the same integer value) and compare these lists. I would use SolrJ for this.
However, getting all Solr documents requires the infamous approach of hard-coding the maximum integer value as the result count limit. This approach seems to be frowned upon. Does my situation justify ignoring this advice, or is there another approach?
You can execute two queries to list all keys from Solr in one batch: the first with rows=0 to get the number of hits, the second with that number as the rows parameter. It's not a very optimal solution, but it works.
A second possibility is to store an update date in the Solr index and fetch only the documents changed since the last synchronisation.
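A sketch of the first suggestion (shown here with pysolr for brevity; the same two queries work with SolrJ, and the core name and key field are placeholders):

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/mycore')

    # Query 1: rows=0 only to learn how many documents the index holds.
    total = solr.search('*:*', rows=0).hits

    # Query 2: fetch just the unique key field for every document.
    solr_ids = {int(doc['id']) for doc in solr.search('*:*', rows=total, fl='id')}

    # db_ids comes from e.g. "SELECT id FROM the_table"; then diff the two sets:
    # to_add    = db_ids - solr_ids
    # to_remove = solr_ids - db_ids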

What to do when a cache and a db index get very different?

I use memcache and datastore indexes with the Google Search API in GAE. A practical problem is how to refresh an index after an entity has been deleted, since the entity appears to still be in the index even though it has been deleted. And a more hypothetical scenario: how should I handle it if memcache and the index start to contain very different contents for the "same" data set, i.e. a list of entities that could be displayed from memcache, from the index, or from a datastore round trip?
For the first problem, I would recommend using the entity's key as the doc_id for the index; since you then have a reference to the document, you can delete it in a _pre_delete_hook. You can also keep the data up to date the same way (e.g. with a _post_put_hook that creates the corresponding search document), since adding a new document with an existing doc_id to the index overwrites the existing one.
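A sketch of those hooks (Python NDB plus the Search API; the model, field and index names are made up):

    from google.appengine.api import search
    from google.appengine.ext import ndb

    _INDEX = 'pages'  # hypothetical search index name

    class Page(ndb.Model):
        title = ndb.StringProperty()

        def _post_put_hook(self, future):
            # Same doc_id as the entity key: a put overwrites any existing document.
            search.Index(name=_INDEX).put(search.Document(
                doc_id=self.key.urlsafe(),
                fields=[search.TextField(name='title', value=self.title)]))

        @classmethod
        def _pre_delete_hook(cls, key):
            # The entity is going away, so drop its search document as well.
            search.Index(name=_INDEX).delete(key.urlsafe())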
For the second, it's probably better to make sure you never run into that kind of situation than to try to remedy it afterwards by reconciling the two.

Partial Update of documents

We have a requirement that documents we currently index in SOLR may periodically need to be partially updated. The updates can either:
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
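For reference, a rough sketch of what such a Solr 4 atomic update looks like (field names, core and host are hypothetical); only the listed fields change, and Solr rebuilds the rest of the document from its stored fields:

    import requests

    update = [{
        'id': 'doc-42',
        'title': {'set': 'New title'},   # replace the field value
        'tags': {'add': 'reviewed'},     # append to a multiValued field
    }]
    requests.post('http://localhost:8983/solr/collection1/update',
                  params={'commit': 'true', 'wt': 'json'}, json=update)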
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past, but they weren't the right fit for all our use cases.
2. Alternatively, can Elasticsearch, LinkedIn's SenseiDB or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Thanks
Solr has the stored-fields limitation, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: it never goes back to modify existing ones, so it only marks documents as deleted and removes them for real when a merge happens.
Search servers on top of Lucene try to work around this problem by exposing a single endpoint that deletes the old document and reindexes the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
Elasticsearch works around it by storing the source document by default, in a special field called _source. That is exactly the document you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that makes Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge).
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
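A sketch of both variants with the Python client (index, type and field names are made up; the inline script syntax shown is the old pre-5.x form):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # 1) Partial document: merged into the existing _source, then the whole document is reindexed.
    es.update(index='articles', doc_type='article', id='42',
              body={'doc': {'title': 'Updated title', 'reviewed': True}})

    # 2) Script: modifies the existing _source before reindexing (scripting must be enabled).
    es.update(index='articles', doc_type='article', id='42',
              body={'script': 'ctx._source.views += 1'})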

Index file content and custom metadata separately with Solr3.3

I am doing a POC on content/text search using Solr 3.3.
I have a requirement where documents, together with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, users can change the custom metadata of the documents. However, once a document has been added to the index, its content cannot be updated. When a user updates the custom metadata, the index has to be updated so the metadata changes are reflected in searches.
But during the index update, even though the content of the file has not changed, the content is indexed again as well, which causes delays in the metadata update.
So I wanted to check whether there is a way to avoid indexing the content and update just the metadata.
Or do I have to store the content and metadata in separate indexes, i.e. documentId and content in one index, and documentId and custom metadata in another? In that case, how can I query across these two indexes and return a combined result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. Its like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two solr indexes are considered as two databases, not two tables, if you had DB-like joins in mind. I just answered on it on How to start and Stop SOLR from A user created windows service .
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=text:"on a dark lonely night"&fq=type:contents
To do metadata searches:
http://localhost:8080/app/select?q=year:1984&fq=type:metadata
Note the type:xx filter in each query.
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique it.
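A small sketch of the two-document approach with pysolr (the endpoint is the one from the URLs above, and the data is the example data above):

    import pysolr

    solr = pysolr.Solr('http://localhost:8080/app')

    solr.add([
        {'id': 'docs_contents_watson', 'type': 'contents', 'text': 'text of the file'},
        {'id': 'docs_metadata_watson', 'type': 'metadata', 'author': 'A J Crown', 'year': 1984},
    ], commit=True)

    # Content search, restricted to contents documents:
    solr.search('text:"on a dark lonely night"', fq='type:contents')

    # Metadata search:
    solr.search('year:1984', fq='type:metadata')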
We did try this and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Compress and serialize the object, and then assign it to one more field in your schema. Make that field a binary field.
So when you want to update one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed the document to Lucene.
Never forget to store the XML (the text extracted by Tika, which is used for search/indexing) as one of the fields inside the SolrInputDocument.
The only negative: your index size will grow a little, but you get what you want without re-feeding the data.
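A rough sketch of the idea, using a stored JSON field in place of the compressed binary field just to keep it short (pysolr; the field names are made up):

    import json
    import pysolr

    solr = pysolr.Solr('http://localhost:8080/app')

    def index_file(fields):
        # Keep a stored copy of everything that was sent, so the document can be rebuilt later.
        doc = dict(fields)
        doc['raw_json'] = json.dumps(fields)
        solr.add([doc], commit=True)

    def update_metadata(doc_id, **metadata):
        # Rebuild the original input from the stored copy, change only the metadata, re-feed.
        raw = solr.search('id:%s' % doc_id, fl='raw_json').docs[0]['raw_json']
        fields = json.loads(raw)
        fields.update(metadata)
        index_file(fields)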
