Why do search engines need to reindex periodically when databases don't?

For example, search engines such as Sphinx and Lucene must merge their indexes periodically, but a database index can be updated dynamically. Why must a search engine's index be merged?

I don't know much about Sphinx, but I believe the answer to this question does not depend on it.
First, why don't databases need periodic updates? Because most of the time the database is the application's primary data store. Whenever you create, update, or delete data, you are operating on a database record: you delete the record to remove the data from the application, and you read the record before updating it because the old version lives there. The database is therefore updated all the time, and the data in it is always current.
Why does a search engine's index need periodic reindexing? The index is the search engine's data store: you process your data, put it into the index, and then retrieve it through your search system. That index is a secondary data source. This does not hold for all applications, but most of the time the database is the primary source, kept in sync with the application as explained above, while the index does not receive every change in real time. The data in the index therefore lags a little behind the database, and the reindexing step is what brings the two data sources back into agreement.
As I said, this explanation does not hold for all applications, but it should give you the basic idea.
PS: Your question mentions the "index of database", but that is a totally different topic.

Related

Migrate data between search services

I am trying to move an Azure Search service from the standard pricing tier to basic. I can't find a way to do that other than creating another service and manually moving the data across. I am about to create a temporary console project that selects all data from the source service and uploads it to the destination service. Is there no data migration tool for this?
Unfortunately, we do not yet have migration support between tiers in Azure Search and it does require re-creating the index in a new service. Please know that we understand the importance of this and have it high on our priority list.
Also, when you migrate your index, there are some things you will need to keep in mind.
First off, when you export the data, you will likely be using our paging (skip and top), but note that this paging is limited to 100K documents. As a result, if you have more than 100K docs, you will need some sort of filtering. For example, if you have a State or Province field, you could search with a $filter such as State eq 'WA'.
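The partitioned export described above could be sketched as follows. This is only an illustration under stated assumptions: fetch_page is a hypothetical callable that you would implement on top of whichever Azure Search client you use, and the names export_partition, export_all, and PAGE_SIZE are made up for this example. The only facts taken from the answer are the skip/top paging, the 100K limit, and filtering on a field such as State.

```python
from typing import Callable, Iterable, Iterator, List

PAGE_SIZE = 1000     # documents requested per call (assumed value)
MAX_SKIP = 100_000   # paging limit mentioned in the answer above

# Hypothetical signature: fetch_page(filter_expr, skip, top) -> list of documents.
FetchPage = Callable[[str, int, int], List[dict]]


def export_partition(fetch_page: FetchPage, filter_expr: str) -> Iterator[dict]:
    """Yield every document matching one filter, page by page."""
    skip = 0
    while True:
        if skip >= MAX_SKIP:
            # We cannot page any further, so this partition may be truncated.
            raise RuntimeError(
                f"partition {filter_expr!r} reached the paging limit; split it further"
            )
        page = fetch_page(filter_expr, skip, PAGE_SIZE)
        yield from page
        if len(page) < PAGE_SIZE:
            return  # a short page means the partition is exhausted
        skip += len(page)


def export_all(fetch_page: FetchPage, field: str, values: Iterable[str]) -> Iterator[dict]:
    """Export the whole index one filter value at a time."""
    for value in values:
        escaped = value.replace("'", "''")  # OData escapes a quote by doubling it
        yield from export_partition(fetch_page, f"{field} eq '{escaped}'")
```

Counting the exported documents and comparing the total with the source index's document count is a cheap way to check that nothing was lost.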
If you happen to have the original data for the index in a different location (such as SQL), you will find it easier to do this re-loading from there.
Finally, taking all of the above into account, I have been working on a sample here that shows how to export and reload the schema and data. It should help for smaller indexes (fewer than 100K docs), but ultimately it is really important to verify that all of the documents were migrated successfully.
Also, it would be great if you could vote for this feature.

What is a good web application SQL Server data mart implementation in ElasticSearch?

Coming from an RDBMS background and trying to wrap my head around ElasticSearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, geographic location that pertains to the rest of the searchable record, title and description (which are free text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is, how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices all over the place?
Is it so bad to have one index per application, and type per SQL Server table?
Since in SQL Server we have other tables for user configuration, keyed by user ID, I take it that I could then create new ES types in the user indices for configuration. Is this a recommended pattern? I would rather not have two database systems for this web application.
Suggestions welcome, thank you.
I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data, where all of the information resides in each document, unlike with a star schema. If you can live with denormalized data, that is fine, but I assume that since you already have a star schema, denormalizing is not an option because you don't want to update millions of documents each time, for example, a location name changes (if I understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think about how to put star-schema-like data into a system like Elasticsearch. There are a few options in the documentation; the main ones I focused on were:
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html . With nested objects the entire information is kept in a single document, meaning one location and its related users would be in a single document. That may not be optimal because the document will be huge and, again, a change in the location name will require updating the entire document. So this is better but still not optimal.
Parent - Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html . In this case the location and the user records would be kept as separate documents, similar to rows in separate tables of a relational database. This seems to be the right modeling for what we need. The only major issue with this option is that Kibana 4 does not provide ways to manipulate or aggregate documents based on a parent/child relationship as of this writing. So if your main driver for using Elasticsearch is Kibana (this was mine), that more or less eliminates the option. If you want to benefit from the speed of Elasticsearch as an engine, this seems to be the desired option for your use case.
In my opinion, once you get the data modeling right, all of your questions will be easier to answer.
Regarding the organization of the servers themselves, we run a separate cluster of 3 Elasticsearch nodes behind a load balancer (all of it hosted in a cloud) and have all of our web applications connect to that cluster using the Elasticsearch API.
Hope that helps.

What NoSQL database (categories) support versioning?

I thought that regardless of whether a NoSQL aggregate store is a key-value, column-family or document database, it would support versioning of values. After a bit of Googling, I'm concluding that this assumption is wrong and that it just depends on the DBMS implementation. Is this true?
I know that Cassandra and BigTable support it (both column-family stores). It SEEMS that HBase (column-family) and Riak (key-value) do but Redis and Hadoop (key-value) do not. Couchbase does but MongoDB does not (document stores). I don't see any pattern here. Is there a rule of thumb? (for example, "key-value stores generally do not have versioning, while column-family and document databases do")
What I'm trying to do: I want to create a database of website screenshots mapping a URL to a PNG image. I'd rather use a key-value store since, versioning aside, it is the simplest solution that satisfies the problem. But when a website changes or is decommissioned and I update my database, I don't want to lose old images. Even if I select a key-value database that has versioning, I want to have the luxury of switching to a different key-value database without the constraint that many key-value DBs do not support versioning. So I'm trying to understand at what level of sophistication in the continuum of aggregate NoSQL databases versioning becomes a feature implicit to the data model.
You don't really need versioning support from the Key-Value store.
The only thing you really need from the data Store is an efficient scanning/range query feature.
This means the datastore can retrieve entries in lexicographical order.
Most KV-stores do, so this is easy.
This is how you do it:
Create versioned keys.
In case you can't hash the original name to a fixed length, prepend the length of the original key. Then put in the hash of the key or the original key itself, and end with a fixed-length encoded version number (inverted against the max version, so that keys sort lexicographically from the highest version to the lowest).
Query
Do a range query from the maximum possible version down to version 0, but retrieve exactly one key. That key is the latest version.
Done
If you don't need explicit versions, you can also use a timestamp, so you can insert without first fetching the last version.
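A minimal sketch of the versioned-key scheme, assuming a store that can scan keys in lexicographic order. The OrderedStore class below is a hypothetical in-memory stand-in built on bisect, not the API of any real key-value database, and MAX_VERSION, the 4-digit length prefix, and the function names are choices made for this example only.

```python
import bisect

MAX_VERSION = 2**32 - 1   # assumed upper bound on version numbers
VERSION_WIDTH = 10        # number of digits in MAX_VERSION


def make_key(name: str, version: int) -> str:
    """Length prefix + original key + inverted, zero-padded version."""
    inverted = MAX_VERSION - version  # higher versions sort first
    return f"{len(name):04d}{name}{inverted:0{VERSION_WIDTH}d}"


class OrderedStore:
    """Toy ordered key-value store standing in for a real one."""

    def __init__(self):
        self._keys = []     # kept sorted so range scans are possible
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, limit):
        """Return up to `limit` (key, value) pairs with key >= start."""
        i = bisect.bisect_left(self._keys, start)
        return [(k, self._values[k]) for k in self._keys[i:i + limit]]


def put_version(store, name, version, value):
    store.put(make_key(name, version), value)


def get_latest(store, name):
    """Range query starting at the max version, fetching exactly one key."""
    prefix = f"{len(name):04d}{name}"
    hits = store.scan(make_key(name, MAX_VERSION), limit=1)
    if hits and hits[0][0].startswith(prefix):
        key, value = hits[0]
        return MAX_VERSION - int(key[-VERSION_WIDTH:]), value
    return None  # the scan ran past this name: no versions stored
```

The length prefix keeps one name from being mistaken for a prefix of a longer one, which is why the prefix check after the scan is enough to tell whether the returned key belongs to the requested name.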
A really interesting approach to this is the Datomic database. Rather than storing versions, Datomic has no updates, only inserts. The entire database is immutable, meaning that on connect you can specify the moment in time at which you want to see the database, and the entire history will appear to contain only the changes made up to that point. Or, to think of it another way, anything inserted into the database can be queried for its history looking backward. You can also branch the database and create data in one branch that isn't in the other (in programming terms it is like a database based on git, where multiple histories can be created).

Solr denormalization and update of referenced data

Consider the following situation. We have a database which stores writers and books in two separate tables. One book obviously stores the reference to the writer who wrote the book.
For Solr I have to denormalize this structure into one big document where every book contains the details of the associated writer. This index is now used for querying books.
One user of the system now decides to update a writer record. Because many books can be associated with that writer, I have to update every document in Solr that has embedded data from this writer record. This is very painful because, as far as I know, I have to delete and re-add every affected document.
Is there any better way of doing this? I need near realtime update of the index in the system if one of the referenced data gets modified.
This would be a perfect use case for nested documents. As far as I know, Lucene supports nested documents but Solr doesn't; I'm not totally sure about the current state of this feature.
This feature is available in Elasticsearch, though. You might want to have a look at it; there's an article I just wrote that may be interesting if you want to know what I think is so cool about Elasticsearch. Your question just reminded me that I didn't mention the nested documents feature in my article, which is really cool too. You can use the nested type in your mapping. If you want to know more, you can have a look at this article. By the way, it contains exactly the books/authors example.
Elasticsearch also helps you when updating documents. You don't need to resend the whole document; you can send only the changes through a script. Because Elasticsearch stores the source document that was indexed, it retrieves that document internally, updates it by running the script, and reindexes it. The full reindex is unavoidable because Lucene's index segments are write-once. With Solr 4, which will be released soon, you can update documents by providing only the changes, but as far as I know this works only if all your fields are stored, since fields that are not stored cannot be retrieved from the index.
As for near-real-time updates, Elasticsearch uses the Lucene Near Real Time API and automatically refreshes the index reader every second. Solr 3 doesn't use those APIs yet, but Solr 4 does.
For updating nested types in Solr you can use the DataImportHandler and delta imports. The example at https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example shows how this would work. Obviously, Solr would then need access to your database.

Configure Lucene.Net with SQL Server

Has anyone used Lucene.NET rather than the full-text search that comes with SQL Server?
If so, I would be interested in how you implemented it.
Did you, for example, write a Windows service that queried the database every hour and then saved the results to the Lucene.NET index?
Yes, I've used it for exactly what you are describing. We had two services - one for read, and one for write, but only because we had multiple readers. I'm sure we could have done it with just one service (the writer) and embedded the reader in the web app and services.
I've used Lucene.NET as a general database indexer, so what I got back was basically DB IDs (pointing to indexed email messages), and I've also used it to get back enough info to populate search results and the like without touching the database. It has worked great in both cases, though the SQL can get a little slow, as you pretty much have to get an ID, select by that ID, and so on. We got around this by creating a temp table (with just the ID column in it), bulk-inserting into it from a file (the output from Lucene), and then joining it to the message table. That was a lot quicker.
Lucene isn't perfect, and you do have to think a little outside the relational database box, because it TOTALLY isn't one, but it's very very good at what it does. Worth a look, and, I'm told, doesn't have the "oops, sorry, you need to rebuild your index again" problems that MS SQL's FTI does.
BTW, we were dealing with 20-50million emails (and around 1 million unique attachments), totaling about 20GB of lucene index I think, and 250+GB of SQL database + attachments.
Performance was fantastic, to say the least - just make sure you think about, and tweak, your merge factors (which control when index segments are merged). There is no issue in having more than one segment, but there can be a BIG problem if you try to merge two segments which have 1 million items each while a watcher thread kills the process if it takes too long (yes, that kicked our arse for a while). So keep the max number of documents per segment LOW (i.e., don't set it to maxint like we did!).
EDIT Corey Trager documented how to use Lucene.NET in BugTracker.NET here.
I have not done it against a database yet, and your question is kind of open.
If you want to search a DB and are free to choose Lucene, I guess you also control when data is inserted into the database.
If so, there is little reason to poll the DB to find out whether you need to reindex. Just index as you insert, or create a queue table that tells Lucene what to index.
I don't think we need another indexer that is ignorant of what it is doing and either reindexes everything every time or uses resources wastefully.
I have also used Lucene.NET as a storage engine, because it is easier to distribute an index and set it up on alternate machines than a database: it's just a filesystem copy. You can index on one machine and simply copy the new files to the other machines to distribute the index. All the searches and detail views are served from the Lucene index, and the database is used only for editing. This setup has proven to be a very scalable solution for our needs.
Regarding the differences between SQL Server and Lucene, the principal problem with SQL Server 2005 full-text search is that the service is decoupled from the relational engine, so joins, ordering, aggregates, and filters between the full-text results and the relational columns are very expensive in performance terms. Microsoft claims that these issues have been addressed in SQL Server 2008 by integrating full-text search into the relational engine, but I haven't tested it. They also made the whole full-text search much more transparent: in previous versions the stemmers, stopwords, and several other parts of the indexing were like a black box and difficult to understand, and in the new version it is easier to see how they work.
In my experience, if SQL Server meets your requirements, it will be the easiest way. If you expect a lot of growth or complex queries, or need a lot of control over the full-text search, you might consider working with Lucene from the start because it will be easier to scale and customize.
I used Lucene.NET along with MySQL. My approach was to store the primary key of the DB record in the Lucene document along with the indexed text. In pseudocode it looks like this:
Store record:
insert text, other data to the table
get latest inserted ID
create lucene document
put (ID, text) into lucene document
update lucene index
Querying
search lucene index
for each lucene doc in result set load data from DB by stored record's ID
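The pseudocode above can be turned into a small runnable sketch. Note the substitutions: sqlite3 stands in for MySQL, and ToyIndex is a deliberately tiny inverted index standing in for Lucene. Neither is the original setup, and all names here are invented for the example.

```python
import sqlite3
from collections import defaultdict


class ToyIndex:
    """Minimal inverted index (token -> record IDs) standing in for Lucene."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, record_id, text):
        for token in text.lower().split():
            self._postings[token].add(record_id)

    def search(self, query):
        """Return the IDs of records containing every query token."""
        tokens = query.lower().split()
        if not tokens:
            return []
        matches = set.intersection(
            *(self._postings.get(token, set()) for token in tokens)
        )
        return sorted(matches)


def store_record(conn, index, text):
    """Insert into the table, take the new ID, put (ID, text) into the index."""
    with conn:
        cur = conn.execute("INSERT INTO records (text) VALUES (?)", (text,))
    index.add(cur.lastrowid, text)
    return cur.lastrowid


def search_records(conn, index, query):
    """Search the index, then load each hit from the DB by its stored ID."""
    results = []
    for record_id in index.search(query):
        row = conn.execute(
            "SELECT id, text FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        if row is not None:  # the record may have been deleted since indexing
            results.append(row)
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
index = ToyIndex()
store_record(conn, index, "Lucene merges index segments")
store_record(conn, index, "Sphinx is another search engine")
hits = search_records(conn, index, "index segments")
```

Loading hits one ID at a time is the simplest form of the lookup. For large result sets, fetching the rows in a single query is usually faster.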
Just to note, I switched from Lucene to Sphinx due to its superb performance.
