Should I Use Apache Solr?

I am using Oracle 11gR2, and our schema has a few tables with more than 100 million rows (some columns are VARCHAR2(100)). We frequently have to run LIKE-based searches on those tables, and sometimes we also need to join them. Other applications also insert and update these tables very frequently (around 1000 inserts/updates per second).
So my question is: for my user interface, should I use Apache Solr to let users search these tables instead of SQL queries? I have tried SQL and it is really slow (given the amount of data in my database).
My requirements are,
Results should come back fast and they should be accurate.
It should have the latest data.
Can you suggest whether I should go with Apache Solr, or with another solution for my problem?
EDIT: Should I also consider Cassandra? Does it support joins between tables? Does it support LIKE-based searches on VARCHAR columns?

You should consider using Hibernate Search for this requirement. You can use this link to get started. Since you are already working with relational tables in your application, you get both ORM facilities and full-text search through this module. If you want to use Apache Solr instead, you will have to write the code needed to synchronize data from your database to Apache Solr yourself.
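For reference, a minimal sketch of what the Hibernate Search mapping and query can look like (this assumes the Hibernate Search 5.x API; the Order entity and its description field are hypothetical, not taken from the question):

```java
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import org.apache.lucene.search.Query;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.Search;

@Entity
@Indexed
class Order {
    @Id
    Long id;

    @Field            // indexed for full-text search instead of LIKE '%term%'
    String description;
}

public class OrderSearch {
    @SuppressWarnings("unchecked")
    public static List<Order> search(EntityManager em, String term) {
        // Wrap the JPA EntityManager to get access to the full-text query API.
        FullTextEntityManager ftem = Search.getFullTextEntityManager(em);
        Query luceneQuery = ftem.getSearchFactory().buildQueryBuilder()
                .forEntity(Order.class).get()
                .keyword().onField("description").matching(term)
                .createQuery();
        // Runs against the Lucene index, then hydrates matching Order entities.
        return ftem.createFullTextQuery(luceneQuery, Order.class).getResultList();
    }
}
```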
Hope this helps you.
Thanks

I think you can use Solr in your case, and you will get fast and accurate results. About the latest data: there will be some lag, as data needs to be added/updated in the Solr index.
I have used Solr in my application (with the DataImportHandler).
I update the Solr index every 20 minutes, which means there is up to a 20 minute lag behind the latest updates in the database.
But users are fine with it, as it is much faster than the Oracle search. :)
It has been three years; it is working fine and there have been no issues with it.
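For illustration, a rough client-side sketch of this setup using SolrJ: a scheduled delta-import plus a user-facing query. The core name, field names, and 20-minute interval are assumptions mirroring the answer; the DataImportHandler itself is configured in Solr, not shown here.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.ModifiableSolrParams;

public class SolrRefreshAndSearch {
    // Hypothetical core URL; adjust to your Solr installation.
    static final String CORE_URL = "http://localhost:8983/solr/orders";

    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(CORE_URL).build();

        // Trigger a DataImportHandler delta-import every 20 minutes,
        // so the index lags the database by at most that interval.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("qt", "/dataimport");
                params.set("command", "delta-import");
                solr.query(params);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 20, TimeUnit.MINUTES);

        // A user-facing search that replaces the slow LIKE query.
        SolrQuery query = new SolrQuery("description:jacket");
        query.setRows(20);
        QueryResponse response = solr.query(query);
        response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
    }
}
```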

Related

Synchronizing data from MSSQL to Elasticsearch using Apache Kafka

I'm currently running a text search in SQL Server, which is becoming a bottleneck, and I'd like to move things to Elasticsearch for obvious reasons; however, I know that I have to denormalize data for best performance and scalability.
Currently, my text search includes some aggregation and joins across multiple tables to get the final output. The tables that are joined aren't that big (up to 20 GB per table) but are changed (inserted, updated, deleted) irregularly (two of them once a week, the other one on demand x times a day).
My plan would be to use Apache Kafka together with Kafka Connect in order to read CDC from my SQL Server, join this data in Kafka, and persist it in Elasticsearch; however, I cannot find any material telling me how deletes would be handled when data is being persisted to Elasticsearch.
Is this even supported by the default driver? If not, what are the possibilities? Apache Spark, Logstash?
I am not sure whether this is already possible in Kafka Connect now, but it seems that this can be resolved with NiFi.
Hopefully I understand the need; here is the documentation for deleting Elasticsearch records with one of the standard NiFi processors:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-5-nar/1.5.0/org.apache.nifi.processors.elasticsearch.DeleteElasticsearch5/
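If you end up handling deletes yourself rather than through a connector or NiFi, the common convention is to treat a CDC tombstone (non-null key, null value) as a delete. Below is a rough hand-rolled sketch of that idea with the Kafka consumer and the Elasticsearch high-level REST client; the topic name, index name, and document-ID convention are all assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.http.HttpHost;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class CdcToElasticsearch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "cdc-to-es");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             RestHighLevelClient es = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            consumer.subscribe(Collections.singletonList("orders-cdc")); // assumed CDC topic

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    if (record.value() == null) {
                        // Tombstone: the row was deleted upstream, so delete the ES document too.
                        es.delete(new DeleteRequest("orders", record.key()), RequestOptions.DEFAULT);
                    } else {
                        // Insert/update: index (upsert) the denormalized JSON document.
                        es.index(new IndexRequest("orders").id(record.key())
                                .source(record.value(), XContentType.JSON), RequestOptions.DEFAULT);
                    }
                }
            }
        }
    }
}
```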

Need Suggestions: Utilizing columnar database

I am working on a project for a high-performance dashboard where the results are mostly aggregated data mixed with non-aggregated data. The first page is loaded by 8 different complex queries fetching this mixed data. The dashboard is served by a centralized database (Oracle 11g) which receives data from many systems in real time (using a replication tool). The data shown is produced by very complex queries (multiple joins, counts, group by, and many where conditions).
The issue is that as the data grows, the DB queries are taking more time than defined/agreed. I am thinking of moving the aggregation (all the counts) to a columnar database, say HBase, while the remaining detail data would still be fetched from Oracle. Both sets of data would be merged on a key at the application layer. I need expert opinions on whether this is the correct approach.
There are a few things which are not clear to me:
1. Will Sqoop be able to load data based on a query/view, or only tables? On a continuous basis or one time only?
2. If a record is modified (e.g. its status changes), how will HBase get to know?
My two cents: HBase is a NoSQL database built for fast lookup queries, not for aggregated, ad-hoc queries.
If you are planning to use a Hadoop cluster, you can try Hive with the Parquet storage format. If you need near real-time queries, you can go with an MPP database. Commercial options are Vertica, or maybe Redshift from Amazon. For an open-source solution, you can use Infobright.
These columnar options are going to give you great aggregate query performance.

Index/Statistics on volatile tables

One of my applications has the following use case:
the user inputs some filters and conditions about orders (delivery date ranges, ...) to analyze
the application computes a lot of data and saves it in several support tables (potentially thousands of records for each analysis)
the application starts a report engine that uses data from these tables
when exiting, the application deletes the computed records from the support tables
I'm currently analyzing how to enhance query performance by adding indexes/statistics to the support tables, and SQL Profiler suggests I create 3-4 indexes and 20-25 statistics.
The records in the support tables are constantly created and removed: is it correct to create all these indexes/statistics, or is there a risk that they will quickly become outdated (with the only result being a constant overhead for maintaining the indexes/statistics)?
DB server: SQL Server 2005+
App language: C# .NET
Thanks in advance for any hints/suggestions!
First, this seems like a good situation for a data cube. Second, yes, you should update stats once the support tables are populated, before running your query. You should disable your indexes when inserting the data; then the rebuild command will bring your indexes and stats up to date in one go. Profiler these days is usually quite good at these suggestions, but test the combinations to see what actually gives the best performance gains. To look at open-source cubes, see What are the open source tools and techniques to build a complete data warehouse platform?
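A rough sketch of that disable / load / rebuild sequence (shown via JDBC purely for illustration, although the question's app is C#; the index, table, and connection details are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class SupportTableLoad {
    public static void main(String[] args) throws Exception {
        // Connection string and object names are placeholders.
        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=Reports;user=app;password=secret")) {

            try (Statement st = con.createStatement()) {
                // Disable the nonclustered index before the bulk insert.
                // (Do not disable the clustered index: that would make the table inaccessible.)
                st.execute("ALTER INDEX IX_SupportTable_OrderDate ON dbo.SupportTable DISABLE");
            }

            // Bulk-insert the computed rows for this analysis.
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO dbo.SupportTable (AnalysisId, OrderId, DeliveryDate) VALUES (?, ?, ?)")) {
                // ... addBatch() for each computed record, then:
                ps.executeBatch();
            }

            try (Statement st = con.createStatement()) {
                // REBUILD re-enables the index and refreshes its statistics in one go.
                st.execute("ALTER INDEX IX_SupportTable_OrderDate ON dbo.SupportTable REBUILD");
                // Refresh the column statistics Profiler suggested as well.
                st.execute("UPDATE STATISTICS dbo.SupportTable");
            }
        }
    }
}
```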

Sync solr documents with database records

I wonder if there is a proper way to keep Solr documents in sync with database records. I usually have this problem: there are Solr documents for which the database records they reference no longer exist. It seems some DB records have been deleted, but no trigger was in place to update Solr. I want to write a rake task that runs periodically to remove such documents from Solr.
Any suggestions?
Chamnap
Yes, there is one.
You have to use the DataImportHandler with the delta import feature.
Basically, you specify a query that fetches only the rows that have been modified, so only those documents are refreshed instead of rebuilding the whole index. Here's an example.
Otherwise you can add a feature in your application that simply triggers the removal of the documents via HTTP in both your DB and in your index.
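For that second option, a minimal SolrJ sketch of the application-side removal (the core URL and ID-based deletion are assumptions about your setup; the DB-side delete is omitted):

```java
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class RecordRemover {
    // Hypothetical core URL; point it at your own index.
    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

    // Call this right after the row has been deleted from the database,
    // so the index never drifts far from the DB.
    public void removeFromIndex(String id) throws Exception {
        solr.deleteById(id);
        solr.commit();
    }

    // Periodic cleanup variant: remove documents whose DB rows are gone,
    // given the list of orphaned IDs found by a scheduled job (e.g. a rake task).
    public void removeOrphans(List<String> orphanedIds) throws Exception {
        if (!orphanedIds.isEmpty()) {
            solr.deleteById(orphanedIds);
            solr.commit();
        }
    }
}
```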
I'm using Java + Java DB + Lucene (which Solr is based on) for my text search and database records. My solution is to back up and then recreate (delete + create) the Lucene index to sync it with my records in Java DB. This seems to be the easiest approach; the only problem is that it is not advisable to run it often. It also means that your records are not updated in real time. I run my batch job nightly so that all changes are reflected the next day. Hope this helps.
Also read an article about syncing Solr and DB records here under "No synchronization". It states that it's not easy, but possible in some cases. It would be helpful if you specified your programming language so more people can help you.
In addition to the above, "soft" deletion by setting a deleted or deleted_at column is a great approach. That way you can run a script to periodically clear out deleted records from your Solr index as needed.
You mention using a rake task — is this a Rails app you're working with? Most Solr clients for Rails apps should support deleting records via an after_destroy hook.

Help figuring out approaches to (near) real time multi dimensional data querying

I have a system that involves numerous related tables. Think of a standard category/product/order/customer/orderitem scenario. Some tables are self-referencing (like Categories). None of the tables are particularly large (around 100k rows, with an estimated scale to around 1 million rows). There are a lot of dimensions to this data that I need to consider, and it must be queried in a near real-time way. I also don't know which dimensions a particular user is interested in; it can be one or many criteria across numerous tables. Requests can range from
Give me everything with a category of Jackets
Give me everything with a category of Jackets->Parkas having a red color purchased in the last month in New York
Give me everything which wasn't purchased in New York which costs over $100.
Currently we have a very long stored procedure which uses a "cascading data" approach: we go table by table, filtering everything into a temp table using whatever criteria were specified for that table. For the next table, we join the current temp table to whatever table we're using and apply a new filter set into a new temp table. It works, but it is hard to maintain and performance is slow. I need something better.
I need a new approach to this problem. It is clearly a case for OLAP, possibly using a star schema. Does this work in real time? Can it be configured to work in real time? Should I use indexed views to create a set of denormalized tables? Should I offload this outside of the database completely?
FYI, we're using SQL Server.
As you say, this is perfect for OLAP.
With SQL Server 2005 and 2008 you can set up an almost real-time solution. You should:
Create a denormalized star schema
Build an OLAP cube using that schema
Enable proactive caching to update the cube when the underlying data source changes.
It's not a trivial job, and you need the Enterprise edition of SQL Server to use proactive caching. You also need some front-end tool (maybe Excel would do) to consume the cube.
It would probably be better to build a dynamic query in your code with all the joins you need, customized to each individual request (properly parameterized for security, of course).
You would use much of the same cascading logic you have now, but you move it into the code instead of the database. Then you only submit the exact query you need.
The performance would beat using all of the temp tables, and you might get some caching benefit after a few queries have run.
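As a rough sketch of that dynamic, parameterized approach (the table and column names are invented to roughly match a category/product/order schema, not taken from the question):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class OrderSearchQuery {
    // Builds only the joins and filters the current request actually needs,
    // and binds every user-supplied value as a parameter.
    public static ResultSet search(Connection con, String categoryName,
                                   String excludeCity, Double minPrice) throws Exception {
        StringBuilder sql = new StringBuilder(
                "SELECT oi.* FROM OrderItem oi JOIN Product p ON p.ProductId = oi.ProductId");
        List<Object> params = new ArrayList<>();

        if (categoryName != null) {
            sql.append(" JOIN Category c ON c.CategoryId = p.CategoryId");
        }
        sql.append(" WHERE 1 = 1");

        if (categoryName != null) {
            sql.append(" AND c.Name = ?");
            params.add(categoryName);
        }
        if (excludeCity != null) {
            sql.append(" AND oi.OrderId NOT IN (SELECT OrderId FROM [Order] WHERE City = ?)");
            params.add(excludeCity);
        }
        if (minPrice != null) {
            sql.append(" AND oi.Price > ?");
            params.add(minPrice);
        }

        PreparedStatement ps = con.prepareStatement(sql.toString());
        for (int i = 0; i < params.size(); i++) {
            ps.setObject(i + 1, params.get(i));
        }
        return ps.executeQuery();
    }
}
```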
Your dilemma sounds to me like "Is it better to achieve the same result by performing complex processing every time I need it, or should I do it once only for each new piece of data?".
