We are moving our data from RDS to Elasticsearch; the data volume is around 80 GB, with around 90 million records.
We have been using the Elasticsearch bulk API for indexing the data, but we now want to take a full dump of the Elasticsearch records and compare them with our RDS data, to make sure everything was moved over correctly. In Elasticsearch we combine multiple RDS tables into a single index. Is there any way to dump an Elasticsearch index into a document or into multiple files?
There's no out-of-the-box solution, but there's an old post on the Elastic website which explains some approaches:
https://www.elastic.co/blog/elasticsearch-verifying-data-integrity-with-external-data-stores
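In the meantime, a common way to take such a dump is to page through the index with the scroll API and write the documents out to files. Below is a minimal Python sketch using the official elasticsearch client's helpers.scan; the host, index name, batch size, and file layout are assumptions you would adapt to your cluster.

```python
# Sketch: dump an Elasticsearch index into newline-delimited JSON files
# so it can be diffed against the RDS source. Host, index name, batch
# size and file layout are assumptions - adjust for your cluster.
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://localhost:9200"])   # assumed endpoint
INDEX = "combined_rds_index"                    # assumed index name
DOCS_PER_FILE = 1_000_000

buffer, file_no = [], 0
for hit in scan(es, index=INDEX, query={"query": {"match_all": {}}}, size=5000):
    buffer.append(hit["_source"])
    if len(buffer) >= DOCS_PER_FILE:
        with open(f"dump_{file_no:04d}.ndjson", "w") as f:
            for doc in buffer:
                f.write(json.dumps(doc) + "\n")
        buffer, file_no = [], file_no + 1

if buffer:  # flush the remainder
    with open(f"dump_{file_no:04d}.ndjson", "w") as f:
        for doc in buffer:
            f.write(json.dumps(doc) + "\n")
```

If you would rather not script it, the elasticdump npm utility can also export an index to JSON files, though for ~90 million records a script that does the comparison against RDS in the same pass is usually more practical.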
Hope it helps.
We have an on-premises Oracle database installed on a server. We have to create some charts/dashboards with Tableau CRM on that on-premises data. Note that Tableau CRM is not Tableau Online; it is the Tableau product for the Salesforce ecosystem.
Tableau CRM has APIs, so we can push data to it or upload CSVs to it programmatically.
So, what can be done is:
Run a Node.js app on the on-premises server, pull data from the Oracle DB, and then push it to Tableau CRM via the TCRM API.
Run a Node.js app on the on-premises server, pull data from the Oracle DB, create a CSV, and push the CSV via the TCRM API.
I have tested the 2nd option and it is working fine.
But, as you know, it is not efficient: I have to run a cron job and schedule the process multiple times a day, and I have to query the full table every time.
I am looking for a better approach. Do you know of any other tools/technologies that would give a smoother sync process?
Thanks
The second method you described in the question is a good solution. However, you can optimize it a bit.
"I have to query the full table all the time."
This can be avoided. If you take a look at the documentation for the SObject InsightsExternalData, you can see that it has a field named Operation which takes one of these values: Append, Delete, Overwrite, Upsert.
What you have to do is, when you push data to Tableau CRM, use the Append operation and push only the records that don't yet exist in TCRM. That way you only query the delta records from your database. This reduces the size of the CSV you have to push, and since it is smaller it takes less time to upload into TCRM.
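For illustration, here is a rough sketch of an Append upload through the InsightsExternalData and InsightsExternalDataPart objects, shown in Python with the simple-salesforce library for brevity (the same logic applies in your Node.js app). The credentials, dataset alias, and CSV file name are placeholders, and field names should be double-checked against the Analytics External Data API documentation.

```python
# Sketch: push a delta CSV into a Tableau CRM dataset with Operation=Append.
# Uses the simple-salesforce library; credentials, dataset alias and file
# name are placeholders. Verify field names against the Analytics
# External Data API documentation before relying on this.
import base64

from simple_salesforce import Salesforce

sf = Salesforce(username="user@example.com", password="...",
                security_token="...")                     # assumed auth

with open("delta_records.csv", "rb") as f:
    csv_b64 = base64.b64encode(f.read()).decode("ascii")

# 1. Create the header record describing the upload.
header = sf.InsightsExternalData.create({
    "Format": "Csv",
    "EdgemartAlias": "Oracle_Dataset",    # assumed dataset alias
    "Operation": "Append",                # only add the new delta rows
    "Action": "None",
})

# 2. Attach the CSV as one or more parts (each part must stay under ~10 MB).
sf.InsightsExternalDataPart.create({
    "InsightsExternalDataId": header["id"],
    "PartNumber": 1,
    "DataFile": csv_b64,
})

# 3. Trigger processing of the uploaded parts.
sf.InsightsExternalData.update(header["id"], {"Action": "Process"})
```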
However, to implement this solution you need two things on the database side.
A unique identifier that uniquely identifies every record in the database
A DateTime field
Once you have these two, you can write a query that sorts the records in ascending order of the DateTime field and takes only the records after the last one you pushed into TCRM. That way your result set contains only the delta records that you don't yet have in TCRM. After that you can use the same pipeline you built to push the data.
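And a minimal sketch of the delta query itself, again in Python for brevity, assuming a hypothetical sales_data table with id and modified_at columns and the python-oracledb driver; swap in your own table, columns, and wherever you persist the last-pushed bookmark.

```python
# Sketch: fetch only the rows changed since the last successful push.
# The table and column names (sales_data, id, modified_at) and the
# bookmark value are placeholders - use whatever you persist between runs.
import datetime

import oracledb  # the python-oracledb driver

def fetch_delta(conn, last_pushed_at):
    sql = """
        SELECT id, amount, modified_at
        FROM   sales_data
        WHERE  modified_at > :last_ts
        ORDER  BY modified_at ASC
    """
    with conn.cursor() as cur:
        cur.execute(sql, last_ts=last_pushed_at)
        return cur.fetchall()

conn = oracledb.connect(user="app", password="...", dsn="dbhost/orclpdb1")
rows = fetch_delta(conn, datetime.datetime(2024, 1, 1))
# write `rows` out to delta_records.csv, then push it with Operation=Append
```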
Our team is trying to create an ETL pipeline into Redshift to serve as our data warehouse for some reporting. We are using Microsoft SQL Server and have partitioned our database into 40+ data sources. We are looking for a way to pipe the data from all of these identical data sources into one Redshift DB.
Looking at AWS Glue, it doesn't seem possible to achieve this out of the box. Since Glue exposes the job script for developers to edit, I was wondering if anyone else has had experience with looping through multiple databases and transferring the same table into a single data warehouse. We are trying to avoid having to create a job for each database... unless we can programmatically loop through and create multiple jobs, one per database.
We've taken a look at DMS as well, which is helpful for getting the schema and current data over to Redshift, but it doesn't seem like it would solve the multiple partitioned data source issue either.
This sounds like an excellent use-case for Matillion ETL for Redshift.
(Full disclosure: I am the product manager for Matillion ETL for Redshift)
Matillion is an ELT tool - it will Extract data from your (numerous) SQL Server databases and Load them, via an efficient Redshift COPY, into some staging tables (which can be stored inside Redshift in the usual way, or can be held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star schemas for your reporting users.
If the table schemas on your 40+ databases are very similar (your question doesn't clarify how you are breaking your data down into those servers - horizontally or vertically), you can parameterise the connection details in your jobs and use iteration to run them over each source database, either serially or with a level of parallelism.
Pushing down transformations to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and User queries can happen concurrently.
Also, you may have other sources of data you want to mash-up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.
You can use AWS DMS for this.
Steps:
Set up and configure a DMS replication instance.
Set up a target endpoint for Redshift.
Set up a source endpoint for each SQL Server instance (see https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html).
Set up a task for each SQL Server source; you can specify the tables to copy/synchronise, and you can use a transformation rule to specify which schema name(s) on Redshift you want to write to.
You will then have all of the data in identical schemas on Redshift.
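As a rough illustration of the task-per-source pattern (not a complete setup), here is a Python/boto3 sketch that creates one replication task per source endpoint, with a table-mapping transformation that renames the schema for each source; all ARNs, names, and identifiers below are placeholders.

```python
# Sketch: one DMS task per SQL Server source, each writing to its own
# schema on Redshift via a rename-schema transformation rule.
# All ARNs, names and identifiers below are placeholders.
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

REPLICATION_INSTANCE_ARN = "arn:aws:dms:...:rep:EXAMPLE"      # placeholder
REDSHIFT_ENDPOINT_ARN = "arn:aws:dms:...:endpoint:TARGET"     # placeholder
SOURCE_ENDPOINT_ARNS = {                                      # one per SQL Server DB
    "shard01": "arn:aws:dms:...:endpoint:SRC01",
    "shard02": "arn:aws:dms:...:endpoint:SRC02",
    # ... 40+ entries
}

def table_mappings(target_schema):
    """Select everything in dbo and rename the schema on the target."""
    return json.dumps({"rules": [
        {"rule-type": "selection", "rule-id": "1", "rule-name": "include-dbo",
         "object-locator": {"schema-name": "dbo", "table-name": "%"},
         "rule-action": "include"},
        {"rule-type": "transformation", "rule-id": "2", "rule-name": "rename-schema",
         "rule-target": "schema",
         "object-locator": {"schema-name": "dbo"},
         "rule-action": "rename", "value": target_schema},
    ]})

for name, source_arn in SOURCE_ENDPOINT_ARNS.items():
    dms.create_replication_task(
        ReplicationTaskIdentifier=f"mssql-to-redshift-{name}",
        SourceEndpointArn=source_arn,
        TargetEndpointArn=REDSHIFT_ENDPOINT_ARN,
        ReplicationInstanceArn=REPLICATION_INSTANCE_ARN,
        MigrationType="full-load-and-cdc",
        TableMappings=table_mappings(target_schema=name),
    )
```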
If you want to query all of those together, you can do that either by running some transformation code inside Redshift to combine them into new tables, or you may be able to use views.
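For example, a UNION ALL view over the per-source schemas is one way to query them together; here is a short psycopg2 sketch with hypothetical schema, table, and view names.

```python
# Sketch: combine the per-source schemas into one view for reporting.
# Schema, table and view names are hypothetical.
import psycopg2

VIEW_SQL = """
CREATE OR REPLACE VIEW reporting.orders_all AS
    SELECT 'shard01' AS source_db, * FROM shard01.orders
    UNION ALL
    SELECT 'shard02' AS source_db, * FROM shard02.orders
    -- ... one SELECT per source schema
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dw", user="etl", password="...")
with conn, conn.cursor() as cur:
    cur.execute(VIEW_SQL)
```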
Azure's documentation suggests that we should leverage blobs to be able to index documents like MS Word, PDF, etc. We have an Azure SQL database with thousands of documents stored in a table's nvarchar(MAX) field. The content of each record is plain English text; in fact, the application converted the PDF / MS Word files to plain text and stored that in the database.
My question is: would it be possible to index the stored "documents" in the database in the same way Azure would against blobs? I know how to create an Azure SQL indexer, but I'd like to make sure that the underlying search behaves the same for documents stored in a database table as it does for blobs.
Thanks in advance!
This is not currently possible - document extraction can only be done on blobs stored in Azure storage.
[Background]
I am currently creating a WCF service for storing and retrieving our university's articles.
I need to save the files and their metadata.
My WCF service will be used by about 1,000 people a day.
The storage will contain about 60,000 articles.
I have three different ways to do it.
I can save the metadata (file name, file type) in SQL Server (to create a unique ID) and save the files in Azure Blob storage.
I can save both the metadata and the data in SQL Server.
I can save both the metadata and the data in Azure Blob storage.
Which way would you choose, and why?
If you suggest your own solution, that would be wonderful.
P.S. Both of them use Azure.
I would recommend going with option 1: save the metadata in the database but save the files in blob storage. Here are my reasons:
Blob storage is meant for exactly this purpose. As of today an account can hold 500 TB of data and each blob can be up to 200 GB, so space is not a limitation.
Compared to SQL Server, storing data in blob storage is extremely cheap.
The reason I recommend storing the metadata in the database is that blob storage is a simple object store without any querying capabilities. So if you want to search for files, you can query your database to find them and then return the blob URLs to your users.
However, please keep in mind that because these (database server and blob storage) are two distinct data stores, you won't be able to achieve transactional consistency. When creating files, I would recommend uploading the file to blob storage first and then creating the record in the database. Likewise, when deleting files, I would recommend deleting the record from the database first and then removing the blob. If you're concerned about orphaned blobs (i.e. blobs without a matching record in the database), I would recommend running a background task which finds the orphaned blobs and deletes them.
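A minimal sketch of the create path under those assumptions, using the azure-storage-blob Python SDK; the container name, connection string, and the insert_file_record helper are placeholders for your own storage account and data access code.

```python
# Sketch of the "upload blob first, then create the DB record" ordering.
# Container name, connection string and insert_file_record() are
# placeholders - swap in your own storage account and data access code.
import uuid

from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = blob_service.get_container_client("articles")   # assumed container

def save_article(file_name: str, content: bytes, insert_file_record) -> str:
    blob_name = f"{uuid.uuid4()}-{file_name}"

    # 1. Upload the file to blob storage first...
    blob_client = container.get_blob_client(blob_name)
    blob_client.upload_blob(content)

    # 2. ...then record the metadata (name, type, blob URL) in the database.
    #    If this step fails, the blob becomes an orphan that the cleanup
    #    task described above can remove later.
    insert_file_record(file_name=file_name, blob_url=blob_client.url)
    return blob_client.url
```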
I'm new to Elasticsearch and I have a basic question.
I want to load data from a database and search it using Elasticsearch in an MVC.NET project, but because of the kind of data I have in my database tables I can't simply convert all of it to JSON and search it with Elasticsearch. How should I fill Elasticsearch with data from the database in an MVC.NET project? I don't want the whole solution, because that's impossible here, just a general and brief explanation. Thank you very much.
First of all, you should be able to model your SQL data for Elasticsearch, since Elasticsearch is a NoSQL, document-oriented database/search engine.
You need an indexer to index the SQL data into Elasticsearch.
Get all the columns associated with one record that you want to search in Elasticsearch from your SQL database (use joins if the data is in multiple tables).
Use a dedicated stored procedure to get only the needed data, construct a document class, serialize it to JSON, and index it into your Elasticsearch cluster.
Use the Elasticsearch.Net client, as it very neatly exposes the bulk index APIs; a rough sketch of the whole flow follows below.
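The sketch is in Python with pyodbc and the official elasticsearch client rather than Elasticsearch.Net, purely to keep it brief; the connection string, query, and index name are placeholders.

```python
# Sketch: pull joined rows from SQL Server, turn each row into a document,
# and bulk-index them into Elasticsearch. The connection string, query,
# and index name are placeholders.
import pyodbc
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

sql = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                     "SERVER=dbhost;DATABASE=appdb;UID=app;PWD=...")
es = Elasticsearch(["http://localhost:9200"])

QUERY = """
    SELECT a.Id, a.Title, a.Body, u.Name AS Author
    FROM Articles a JOIN Users u ON u.Id = a.AuthorId
"""  # hypothetical joined query

def documents():
    """Yield one bulk action per SQL row."""
    cursor = sql.cursor()
    for row in cursor.execute(QUERY):
        yield {
            "_index": "articles",        # assumed index name
            "_id": row.Id,
            "_source": {"title": row.Title, "body": row.Body, "author": row.Author},
        }

indexed, errors = bulk(es, documents())
print(f"indexed {indexed} documents")
```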
Hope this will get you started. Have fun