Synchronizing data from MSSQL to Elasticsearch using Apache Kafka - sql-server

I'm currently running a text search in SQL Server, which is becoming a bottleneck, and I'd like to move things to Elasticsearch for obvious reasons; however, I know that I have to denormalize the data for best performance and scalability.
Currently, my text search involves some aggregation and joins across multiple tables to produce the final output. The joined tables aren't that big (up to 20 GB per table) but are changed (inserted, updated, deleted) irregularly (two of them once a week, the other one on demand, x times a day).
My plan is to use Apache Kafka together with Kafka Connect to read CDC from SQL Server, join this data in Kafka, and persist it in Elasticsearch. However, I cannot find any material telling me how deletes would be handled when the data is persisted to Elasticsearch.
Is this even supported by the default driver? If not, what are the possibilities? Apache Spark, Logstash?

I am not sure whether this is possible in Kafka Connect yet, but it seems it can be handled with NiFi.
Hopefully I understand the need; here is the documentation for deleting Elasticsearch records with one of the standard NiFi processors:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-5-nar/1.5.0/org.apache.nifi.processors.elasticsearch.DeleteElasticsearch5/
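For completeness, if you stay within Kafka Connect, the Confluent Elasticsearch sink connector can, as far as I know, be configured to treat Kafka tombstone records (null values, which Debezium's CDC connectors emit for deleted rows) as document deletions via its behavior.on.null.values option. Below is a minimal, hedged sketch of registering such a connector through the Connect REST API; the topic name, hosts, and connector name are placeholders, not anything from your setup.

```python
import json
import requests  # assumes the 'requests' package

# Hypothetical connector config; topic name, hosts and connector name are
# placeholders for your own environment.
connector = {
    "name": "orders-elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "server1.dbo.Orders",
        "connection.url": "http://elasticsearch:9200",
        "key.ignore": "false",                # use the Kafka record key as the document _id
        "behavior.on.null.values": "delete",  # treat tombstone records as deletions
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "transforms.unwrap.drop.tombstones": "false",
    },
}

# Register the connector with the Kafka Connect REST API (default port 8083).
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```

The exact connector classes and options depend on the versions you run, so treat this as a starting point rather than a verified configuration.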

Related

Setup Flink application with multiple different SQL queries (aggregation)

I need to build a pipeline with different column aggregations from the same source, say one by userId, another by productId, etc. I also want aggregations at different granularities, say hourly and daily. Each aggregation will have a different sink, say a different NoSQL table.
It seems simple to build a SQL query with the Table API, but I would like to reduce the operational overhead of managing too many Flink apps, so I am thinking of putting all the different SQL queries into one PyFlink app.
This is the first time I am building a Flink app, so I am not sure how feasible this is. In particular, I'd like to know:
Reading the Flink docs, I see there are concepts of application vs. job. Is each SQL aggregation query a separate Flink job?
Will the overall performance degrade because of too many queries in one Flink app?
Since the queries share the same source (from Kinesis), will each query get a copy of the source? Basically, I want to make sure each event is processed by every SQL aggregation query.
Thanks!
You can put multiple queries into one job if you use a statement set:
https://docs.ververica.com/user_guide/sql_development/sql_scripts.html
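To sketch what that looks like in PyFlink (the table names, schemas and connectors below are placeholders I made up, not anything from your pipeline): every INSERT added to a statement set is optimized and submitted together as one job.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment; the source/sink tables below are hypothetical
# stand-ins for the real Kinesis source and NoSQL sinks.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE events (
        userId STRING,
        productId STRING,
        amount DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen')   -- stand-in for the real Kinesis source
""")
t_env.execute_sql("""
    CREATE TABLE hourly_by_user (
        userId STRING, window_start TIMESTAMP(3), total DOUBLE
    ) WITH ('connector' = 'print')     -- stand-in for the real NoSQL sink
""")
t_env.execute_sql("""
    CREATE TABLE daily_by_product (
        productId STRING, window_start TIMESTAMP(3), total DOUBLE
    ) WITH ('connector' = 'print')
""")

# All queries go into one statement set -> one Flink job.
stmt_set = t_env.create_statement_set()
stmt_set.add_insert_sql("""
    INSERT INTO hourly_by_user
    SELECT userId, TUMBLE_START(ts, INTERVAL '1' HOUR), SUM(amount)
    FROM events GROUP BY userId, TUMBLE(ts, INTERVAL '1' HOUR)
""")
stmt_set.add_insert_sql("""
    INSERT INTO daily_by_product
    SELECT productId, TUMBLE_START(ts, INTERVAL '1' DAY), SUM(amount)
    FROM events GROUP BY productId, TUMBLE(ts, INTERVAL '1' DAY)
""")
stmt_set.execute()
```

Each add_insert_sql call becomes a sink in the same job graph, so the planner can share the source scan across the aggregations instead of you running a separate app per query.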

Loading data from SQL Server to Elasticsearch

I'm looking for suggestions on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs and as a result, there is a 24 hour data latency.
We are looking to build something out that would allow for more real time availability of the data, similar to SSRS, for our Clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data over the weekend and then replicate writes in real time during weekdays.
Thanks.
ElasticSearch's main use case is providing search capabilities on top of large volumes of unstructured, text-based data. For example, if you were ingesting large batches of emails into your data store every day, ElasticSearch is a good tool to parse out pieces of those emails based on rules you set up, to enable searching (and to some degree querying) of those email messages.
If your data is already in SQL Server, it sounds like it's structured already and therefore there's not much gained from ElasticSearch in terms of reportability and availability. Rather you'd likely be introducing extra complexity to your data workflow.
If you have structured data in SQL Server already, and you are experiencing issues with reporting directly off of it, you should look at building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features you could look into are AlwaysOn Availability Groups, Replication, and SSIS.
Each option above (in addition to other out-of-the-box features of SQL Server) has different pros and cons. For example, AlwaysOn Availability Groups are very easy to set up and offer the ability to automatically fail over if your main server has an outage, but they clone the entire database to a replica. Replication lets you more granularly choose to copy only specific Tables and Views, but then you can't as easily fail over if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
I've been down both pathways, implementing ElasticSearch and a data warehouse using all three of the main data synchronization features above, for structured data and for unstructured large text data, and have seen the proper use cases for both. One data warehouse I've managed in the past had Tables with billions of rows in them (each Table terabytes big), and it was highly performant for reporting on fairly modest hardware in AWS (we weren't even using Redshift).
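To make the columnstore suggestion concrete, here is a rough sketch using pyodbc; the table, columns and connection string are hypothetical stand-ins, not anything from the question, and whether a nonclustered columnstore index is appropriate depends on your write patterns and SQL Server version.

```python
import pyodbc  # assumes the Microsoft ODBC Driver 17 for SQL Server is installed

# Connection details are placeholders for your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=Reporting;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# A nonclustered columnstore index covering the columns the reports aggregate on.
# Table and column names here are hypothetical.
cursor.execute("""
    CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Sales_Reporting
    ON dbo.Sales (SaleDate, StoreId, ProductId, Quantity, Amount);
""")
conn.commit()

# A typical reporting aggregate that can now be served from the columnstore
# index instead of scanning the rowstore.
cursor.execute("""
    SELECT StoreId, SUM(Amount) AS TotalAmount
    FROM dbo.Sales
    WHERE SaleDate >= DATEADD(DAY, -30, GETDATE())
    GROUP BY StoreId;
""")
for row in cursor.fetchall():
    print(row.StoreId, row.TotalAmount)
```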

SQL Server vs. No-SQL Database

I have inherited a legacy content delivery system and I need to re-design & re-build it. The content is delivered by content suppliers (e.g. Sony Music) and is ingested by a legacy .NET app into a SQL Server database.
Each content has some common properties (e.g. Title & Artist Name) as well as some content-type specific properties (e.g. Bit Rate for MP3 files and Frame Rate for video files).
This information is stored in a relational database across multiple tables. These tables may have null values in some of their fields because those fields do not apply to every content type. The database is constantly under write load because the content ingestion system is continuously receiving content files from the suppliers and adding their metadata to the database.
There is also a public-facing web application which lets end users buy the ingested content (music, videos, etc.). This web application relies entirely on an Elasticsearch index; in fact it does not see the database at all and uses the Elasticsearch index as its source of data. The reason is that SQL Server does not perform as fast or as efficiently as Elasticsearch when it comes to text search.
To keep the database and Elasticsearch in sync, there is a Windows service which reads the updates from SQL Server and writes them to the Elasticsearch index!
As you can see there are a few problems here:
The data is saved in a relational database, which makes it hard to manage. For example, there is a table of 3 billion records storing the metadata of each content item as key-value pairs! To me, using a NoSQL database or index would make a lot more sense, as they allow storing documents with different structures.
The Elasticsearch index needs to be kept in sync with the database. If the Windows service stops working for any reason, the index will not get updated. Also, when there are many inserts/updates in the database, it takes a while for the index to catch up.
We need to maintain two sources of data, which has a cost overhead.
Now my question: is there a NoSQL database which has these characteristics?
Allows me to store documents with different structures in it?
Provides good text-search functions and performance? e.g. Fuzzy search etc.
Allows multiple updates to be made to its data concurrently? Based on my experience Elasticsearch has problems with concurrent updates.
It can be installed and used on Amazon AWS infrastructure, because our new products will be hosted on AWS. Auto-scaling and clustering are important, e.g. DynamoDB.
It would have a kind of GUI so that support staff or developers could modify the data to some extent.
A combination of DynamoDB and ElasticSearch may work for your use case.
DynamoDB certainly supports characteristics 1, 3, 4, and 5.
There is now a Logstash Input Plugin for DynamoDB that can be combined with an ElasticSearch output plugin to keep your table and index in sync in real time. ElasticSearch provides characteristic 2.
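If the Logstash plugin doesn't fit, the same synchronization can also be approximated with a small consumer of DynamoDB Streams, for example as an AWS Lambda handler. The sketch below is only illustrative and rests on assumptions of mine: a NEW_IMAGE stream view, a string partition key named id, an index named content, and a recent elasticsearch-py client.

```python
from elasticsearch import Elasticsearch, NotFoundError  # assumes elasticsearch-py

# Hypothetical Elasticsearch endpoint and index name.
es = Elasticsearch("http://localhost:9200")
INDEX = "content"


def handler(event, context):
    """AWS Lambda handler for a DynamoDB Stream (NEW_IMAGE view type assumed)."""
    for record in event["Records"]:
        # Assumes a string partition key named 'id'.
        doc_id = record["dynamodb"]["Keys"]["id"]["S"]

        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # Naive flattening of the DynamoDB attribute-value format;
            # real code would convert numbers, maps, lists, etc. properly.
            doc = {k: list(v.values())[0] for k, v in image.items()}
            es.index(index=INDEX, id=doc_id, document=doc)
        elif record["eventName"] == "REMOVE":
            # This branch is what keeps deletes in sync with the table.
            try:
                es.delete(index=INDEX, id=doc_id)
            except NotFoundError:
                pass
```

The REMOVE branch is the part that batch-oriented pipelines usually miss, and it addresses the sync concern in the question.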

Need Suggestions: Utilizing columnar database

I am working on a highly performance-sensitive dashboard where the results are mostly aggregated data mixed with non-aggregated data. The first page is loaded by 8 different complex queries returning mixed data. The dashboard is served by a centralized database (Oracle 11g) which receives data from many systems in real time (using a replication tool). The data shown is produced by very complex queries (multiple joins, counts, group-bys and many where conditions).
The issue is that as the data grows, the DB queries are taking more time than defined/agreed. I am thinking of moving the aggregation functionality (all the counts) to a columnar database, say HBase, while the rest of the row-level data is fetched from Oracle. The two result sets would be merged on a key in the application layer. I need expert opinions on whether this is the correct approach.
There are few things which are not clear to me:
1. Will Sqoop be able to load data based on a query/view, or only tables? On a continuous basis, or one time only?
2. If a record is modified (e.g. its status changes), how will HBase find out?
My two cents: HBase is a NoSQL database built for fast lookup queries, not for aggregated, ad-hoc queries.
If you are planning to use a Hadoop cluster, you can try Hive with the Parquet storage format. If you need near real-time queries, you can go with an MPP database; commercial options are Vertica or maybe Redshift from Amazon, and for an open-source solution you can use Infobright.
These columnar options are going to give you great aggregate query performance.
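To make the Hive/Parquet option concrete, here is a rough sketch using the PyHive client. The table, columns and HiveServer2 host are invented for the example, so treat it as an illustration of the shape of the solution rather than a working setup.

```python
from pyhive import hive  # assumes PyHive and a reachable HiveServer2

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cursor = conn.cursor()

# Columnar Parquet table, partitioned by day so aggregations only scan
# the partitions they need. Names are hypothetical.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS dashboard_events (
        event_id   BIGINT,
        system_id  STRING,
        status     STRING,
        amount     DOUBLE
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
""")

# The kind of aggregate the dashboard needs, answered from the columnar store.
cursor.execute("""
    SELECT system_id, status, COUNT(*) AS cnt, SUM(amount) AS total
    FROM dashboard_events
    WHERE event_date >= DATE_SUB(CURRENT_DATE, 30)
    GROUP BY system_id, status
""")
for system_id, status, cnt, total in cursor.fetchall():
    print(system_id, status, cnt, total)
```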

Is it better to send data to hbase via one stream or via several servers concurrently?

I'm sorry if this question is basic (I'm new to NoSQL). Basically, I have a large mathematical process that I'm splitting up, having different servers compute the pieces and send the results to an HBase database. Each server computing the data is also an HBase region server and has Thrift on it.
I was thinking of having each server process its data and then update HBase locally (via Thrift). I'm not sure if this is the best approach because I don't fully understand how the master (named) node will handle the upload/splitting.
What is the best practice when uploading large amounts of data (in total I suspect it'll be several million rows)? Is it okay to send it to the region servers, or should everything go through the master?
From this blog post,
The general flow is that a new client contacts the Zookeeper quorum (a separate cluster of Zookeeper nodes) first to find a particular row key. It does so by retrieving the server name (i.e. host name) that hosts the -ROOT- region from Zookeeper. With that information it can query that server to get the server that hosts the .META. table. Both of these two details are cached and only looked up once. Lastly it can query the .META. server and retrieve the server that has the row the client is looking for.

Once it has been told where the row resides, i.e. in what region, it caches this information as well and contacts the HRegionServer hosting that region directly. So over time the client has a pretty complete picture of where to get rows from without needing to query the .META. server again.
I am assuming you use the Thrift interface directly. In that case, even if you issue a mutation against a particular region server, that region server only acts as a client: it looks up (via ZooKeeper and the .META. table, as described above) which region server hosts the affected row and forwards the write there, just as any other client would.
Is it okay to send it to the region servers, or should everything go through the master?
Both amount to the same thing. You cannot really choose which region server a write lands on; the region holding the row key has to be located first, and the write is routed to whichever region server hosts that region. The master is not involved in the normal read/write path.
If you are using a Hadoop MapReduce job with the Java API, you can write through the normal API with TableOutputFormat, or generate HFiles directly with HFileOutputFormat2 and bulk load them, which bypasses the regular write path and is roughly ~10x faster than going through the API.
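For the Thrift route from Python, the happybase client is a common choice; a minimal sketch is below, where the Thrift host, table name and column family are assumptions of mine. Batching the mutations usually matters far more for throughput than which machine you happen to connect to.

```python
import happybase  # assumes the happybase Thrift client and a running HBase Thrift server

# Any Thrift gateway will do; it only relays to whichever region servers host the rows.
connection = happybase.Connection(host="thrift-gateway.example.com", port=9090)
table = connection.table("computation_results")  # hypothetical table with column family 'cf'

# Send mutations in batches instead of one RPC per row.
with table.batch(batch_size=1000) as batch:
    for i in range(100000):
        row_key = f"result-{i:010d}".encode("utf-8")
        batch.put(row_key, {
            b"cf:value": str(i * i).encode("utf-8"),
            b"cf:source": b"worker-01",
        })

connection.close()
```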

Resources