We are working on a requirement where we want to fetch incremental data from one Redshift cluster "row wise", process it as required, and insert it into another Redshift cluster. We want to do it "row wise", not as a batch operation. For that we are writing a generic service that will do row processing from Redshift -> Redshift. So it is like Redshift -> Service -> Redshift.
For inserting the data we will use INSERT queries, committing after each batch rather than after each row, for performance.
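Roughly, the service would look something like the sketch below (psycopg2 works here because Redshift speaks the PostgreSQL wire protocol; the table names, watermark, and transform are placeholders):

```python
import psycopg2  # Redshift accepts PostgreSQL-protocol clients

BATCH_SIZE = 500      # commit after this many rows, not per row
LAST_WATERMARK = 0    # placeholder incremental threshold (e.g. max id already copied)

def transform(row):
    # placeholder for the per-row business logic
    return row

src = psycopg2.connect(host="source-cluster-endpoint", port=5439,
                       dbname="db", user="user", password="password")
dst = psycopg2.connect(host="target-cluster-endpoint", port=5439,
                       dbname="db", user="user", password="password")

with src.cursor(name="incremental") as read_cur, dst.cursor() as write_cur:
    # named (server-side) cursor so rows are streamed instead of fetched all at once
    read_cur.execute(
        "SELECT id, payload, updated_at FROM source_table WHERE id > %s ORDER BY id",
        (LAST_WATERMARK,),
    )
    pending = 0
    for row in read_cur:
        write_cur.execute(
            "INSERT INTO target_table (id, payload, updated_at) VALUES (%s, %s, %s)",
            transform(row),
        )
        pending += 1
        if pending >= BATCH_SIZE:
            dst.commit()   # per-batch commit, not per-row
            pending = 0
    dst.commit()           # flush the final partial batch
```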
But I am a bit worried about the performance of running that many individual INSERT queries. Or is there another tool available that does this? There are many ETL tools available, but they all do batch processing; we want to process row wise. Can someone please suggest an approach?
Based on experience, I can guarantee that your approach will not be efficient. You can refer to this link for detailed best practices:
https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
But I would suggest that you do the following:
Write a Python script to UNLOAD the data from your source Redshift cluster to S3 based on a query condition that filters the data per your requirement, i.e. on some threshold such as a time or date. This operation should be fast, and you can schedule the script to run every minute or every couple of minutes, generating multiple files.
Now you basically have a continuous stream of files in S3, where the size of each file, or batch size, can be controlled by the frequency of the previous script.
Now all you have to do is set up a service that keeps polling S3 for objects/files as they are created, processes them as needed, and puts each processed file in another bucket. Let's call this bucket B2.
Set up another Python script or ETL step that remotely executes a COPY command from bucket B2 into your target cluster.
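A minimal sketch of those pieces, assuming boto3 and psycopg2 are available; the bucket names, table names, and IAM role ARN are placeholders:

```python
import boto3
import psycopg2  # used only to run UNLOAD/COPY on the clusters

IAM_ROLE = "arn:aws:iam::123456789012:role/my-redshift-role"  # placeholder

def unload_increment(src_conn, watermark):
    """Unload rows newer than the watermark from the source cluster to S3."""
    sql = (
        "UNLOAD ('SELECT * FROM source_table WHERE id > {w}') "
        "TO 's3://raw-bucket/incremental/{w}/part_' "
        "IAM_ROLE '{role}' DELIMITER '|' ALLOWOVERWRITE"
    ).format(w=watermark, role=IAM_ROLE)
    with src_conn.cursor() as cur:
        cur.execute(sql)
    src_conn.commit()

def poll_and_process(process_file):
    """Poll the raw bucket, process each file, and drop the result into bucket B2."""
    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket="raw-bucket", Prefix="incremental/")
    for obj in listing.get("Contents", []):
        body = s3.get_object(Bucket="raw-bucket", Key=obj["Key"])["Body"].read()
        s3.put_object(Bucket="b2-processed", Key=obj["Key"], Body=process_file(body))

def copy_processed(dst_conn):
    """COPY the processed files from B2 into the target cluster."""
    sql = (
        "COPY target_table FROM 's3://b2-processed/incremental/' "
        "IAM_ROLE '{role}' DELIMITER '|'"
    ).format(role=IAM_ROLE)
    with dst_conn.cursor() as cur:
        cur.execute(sql)
    dst_conn.commit()
```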
This is just an initial idea, though. You will have to evolve and optimize this approach. Best of luck!
We have an ETL process that ingests data every 5 minutes from different source systems (AS400, Oracle, SAP, etc.) into our SQL Server database, and from there we ingest data into an Elasticsearch index every 5 minutes so that both stay in sync.
I want to tighten the timeframe to seconds rather than 5 minutes and make sure both are in sync at all times.
I am using a control log table to make sure the Elasticsearch ingestion and the SSIS ETL are not running at the same time, since that could put them out of sync. This is a poor solution and does not let me achieve near-real-time data capture.
I am looking for a better solution to sync the SQL Server database and the Elasticsearch index in near real time rather than doing it manually.
Note: I am currently using Python scripts to pump the data from SQL Server into the Elasticsearch index.
One approach would be to have an event stream coming out of your database, or even directly out of the SSIS package run (which might actually be simpler to implement), that feeds directly into your Elasticsearch index. ELK handles streaming log files, so it should handle an event stream pretty well.
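As a hedged sketch of what the Elasticsearch side could look like, assuming the official elasticsearch-py client; the index name, key field, and document shape are placeholders:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def sync_events(change_events):
    """change_events: an iterable of row-change dicts emitted by the ETL run."""
    actions = (
        {
            "_op_type": "index",           # overwrite-by-id keeps SQL row and ES doc aligned
            "_index": "orders",            # placeholder index name
            "_id": event["primary_key"],   # assumed key field in each event
            "_source": event,
        }
        for event in change_events
    )
    bulk(es, actions)  # ships the events in small batches as they arrive
```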
We have a product that uses a MySQL database as the data-store. The data-store holds a large amount of data. The problem we are facing is that the response time of the application is very slow. The database queries are very basic, with very simple joins, if any. According to some senior employees, the root cause of the slow response time is the database operations on the huge data-store.
Another team in our company worked on a project in the past where they processed large fixed-format files using Hadoop and dumped the contents of these files into database tables. Borrowing from this project, some team members feel that we could migrate from the MySQL database to simple fixed-format files that hold the data instead, with one file corresponding to each table in the database. We would then build another data interaction layer that provides interfaces for performing DML operations on the contents of these files. This layer would be developed using Hadoop and the MapReduce programming model.
At this point, several questions come to my mind.
1. Does the problem statement fit into the kind of problems that are solved using Hadoop?
2. How will the application ask the data interaction layer to fetch/update/delete the required data? As far as I understand, the files containing the data will reside on HDFS. We will spawn a Hadoop job that processes the required file (similar to a table in the db) and fetches the required data. This data will be written to an output file on HDFS, and we will have to parse that file to get the required content.
3. Will the approach of using fixed-format files and processing them with Hadoop truly solve the problem?
I have managed to set up a simple cluster with two Ubuntu machines, but after playing around with Hadoop for a while, I feel that the problem statement is not a good fit for Hadoop. I could be completely wrong, and therefore want to know whether Hadoop fits this scenario or whether it is just a waste of time because the problem statement is not in line with what Hadoop is meant for.
I would suggest going straight to Hive (http://hive.apache.org/). It is a SQL engine / data warehouse built on top of Hadoop MapReduce.
In a nutshell, it gets Hadoop's scalability and Hadoop's high latency.
I would consider storing the bulk of the data there, doing all the required transformations, and moving only the summarized data to MySQL to serve queries. It is usually not a good idea to translate user requests into Hive queries: they are too slow, and running jobs in parallel is not trivial.
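As a rough sketch of that split, assuming the PyHive and PyMySQL client libraries are available; the table names and the aggregation are placeholders:

```python
from pyhive import hive   # assumed Hive client library
import pymysql            # assumed MySQL client library

hive_conn = hive.connect(host="hive-server", port=10000)
mysql_conn = pymysql.connect(host="mysql-host", user="app",
                             password="secret", database="reporting")

# The heavy aggregation runs on Hadoop, where the bulk of the data lives.
hive_cur = hive_conn.cursor()
hive_cur.execute(
    "SELECT customer_id, SUM(amount) FROM raw_events GROUP BY customer_id"
)
summary = hive_cur.fetchall()

# Only the small, summarized result is pushed to MySQL to serve user queries.
mysql_cur = mysql_conn.cursor()
mysql_cur.executemany(
    "REPLACE INTO customer_totals (customer_id, total_amount) VALUES (%s, %s)",
    summary,
)
mysql_conn.commit()
```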
If you are planning to update data frequently, then storing it directly in Hadoop may not be a good option for you. To update a file in Hadoop you may have to rewrite the file, then delete the old file and copy the new file into HDFS.
However, if you are just searching and joining the data, then it is a good option. If you use Hive, you can write SQL-like queries.
In Hadoop, your workflow could be something like the following (a rough sketch in code follows the list):
1. You run a Hadoop job for your query.
2. Your Hadoop program parses the query and executes a job that joins and reads files based on the query and its input parameters.
3. The output is generated in HDFS.
4. You copy the output to the local file system and then show it to your program.
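A hedged sketch of driving that workflow from Python; the jar name, main class, and paths are placeholders:

```python
import subprocess

def run_query_job(query_input_path, hdfs_output_dir):
    # Submit the Hadoop job that parses the query and joins/reads the files.
    subprocess.run(
        ["hadoop", "jar", "query-engine.jar", "com.example.QueryJob",
         query_input_path, hdfs_output_dir],
        check=True,
    )
    # The job writes its result to HDFS; copy it back to the local file system.
    subprocess.run(
        ["hdfs", "dfs", "-get", hdfs_output_dir, "/tmp/query_result"],
        check=True,
    )
    # Typical MapReduce output lands in part-* files inside the output directory.
    with open("/tmp/query_result/part-r-00000") as fh:
        return fh.read()
```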
Any tips for speeding up the import process? There are a lot of joins in the db.
Also, when an SSIS task is completed, is it best to handle the next steps in code or by using the email notification that SSIS provides?
Here is a sample that I have used to illustrate loading 1 million rows in under 3 minutes from a text file to a SQL Server database. The package in the sample was created using SSIS 2008 R2 and was executed on a single-core 2.5 GHz Xeon CPU with 3.00 GB RAM.
Import records on SSIS after lookup
One of the main bottlenecks in importing a large number of rows will be the destination component. The faster the destination component can insert the rows, the faster the preceding source or transformation components can process them. It is different if you happen to have components like the Sort transformation, because Sort holds up all the data before sending it down the pipeline.
Sending email depends on what you would like to do.
If you need a simple success or failure notification, you could use the Send Mail task. Another option is to enable alert notifications on the SQL Agent job from which you schedule the package to run on a regular basis.
If more information needs to be added to the email, you might need to use a Script Task to formulate the message body. After creating the message body, you can send the mail from within the Script Task or use the Send Mail task.
Hopefully that example, along with the article @Nikhil S provided, will help you fine-tune your package.
This SimpleTalk article discusses ways to optimize your Data Flow task.
Horizontally partition the data to be transferred into N data flows, where N is the number of CPU cores available on the server where SSIS is installed.
Play with the SSIS buffer size properties (DefaultBufferSize and DefaultBufferMaxRows) to figure out the settings that are optimal for your kind of data.
There is a project in flight at my organization to move customer data and all the associated records (billing transactions, etc) from one database to another, if the customer has not had account activity within a certain timeframe.
The total number of rows in all the tables is in the millions. Perhaps 100 million rows, with all the various tables combined. The schema is more-or-less normalized. The project's designers have decided on SSIS to execute this and initial analysis is showing 5 months of execution time.
Basically, the process:
Fills an "archive" database that has the same schema as the database of origin
Deletes the original rows from the source database
I can provide more detail if necessary. What I'm wondering is, is SSIS the correct approach? Is there some sort of canonical way to move very large quantities of data around? Are there common performance pitfalls to avoid?
I just can't believe that this is going to take months to run and I'd like to know if there's something else that we should be looking into.
SSIS is just a tool. You can write a 100M-row transfer in SSIS that takes 24 hours, or you can write one that takes 5 months. The problem is what you write (i.e. the workflow, in SSIS's case), not SSIS.
There isn't anything specific to SSIS that would dictate that the transfer cannot be done faster than 5 months.
The guiding principles for such a task (logically partition the data, process each logical partition in parallel, eliminate access and update contention between processing steps, batch-commit changes, don't transfer more data over the wire than necessary, use set-based processing as much as possible, be able to suspend and resume, and so on) can be implemented in SSIS just as well as in any other technology, if not better.
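To make a couple of those principles concrete (logical partitioning plus batched commits), here is a minimal sketch using pyodbc; the table names, key ranges, and connection strings are placeholders, not a prescription:

```python
import pyodbc

BATCH = 10_000  # rows per committed transaction

def archive_partition(src_cs, dst_cs, low_id, high_id):
    """Copy one logical partition [low_id, high_id); run several of these in parallel."""
    src = pyodbc.connect(src_cs)
    dst = pyodbc.connect(dst_cs)
    dst.autocommit = False
    read = src.cursor()
    write = dst.cursor()
    read.execute(
        "SELECT id, payload FROM dbo.Billing WHERE id >= ? AND id < ? ORDER BY id",
        low_id, high_id,
    )
    while True:
        rows = read.fetchmany(BATCH)
        if not rows:
            break
        write.executemany(
            "INSERT INTO Archive.dbo.Billing (id, payload) VALUES (?, ?)", rows
        )
        dst.commit()  # small, batched transactions are easy to suspend and resume
```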
For the record, the ETL world speed record stands at about 2 TB per hour, using SSIS. And just as a matter of fact, I just finished a transfer of 130M rows, ~200 GB of data, that took some 24 hours (I'm lazy and not shooting for the ETL record).
I would understand 5 months for development, testing, and deployment, but not 5 months of actual processing. That is roughly 7 rows a second, which is really, really slow.
SSIS is probably not the right choice if you are simply deleting records.
This might be of interest: Performing fast SQL Server delete operations
UPDATE: as Remus correctly points out, SSIS can perform well or badly depending on how the flows are written, and there have been some huge benchmarks (on high-end systems). But for just deletes there are simpler ways, such as a SQL Agent job running a T-SQL delete in batches.
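A minimal sketch of that batched-delete loop, driven here from Python with pyodbc, though the same loop could live directly in a SQL Agent job; the table name and cutoff are placeholders:

```python
import pyodbc

def delete_in_batches(conn_string, cutoff_date, batch_size=5000):
    conn = pyodbc.connect(conn_string, autocommit=True)
    cur = conn.cursor()
    while True:
        # DELETE TOP (n) keeps each transaction, and the log growth, small.
        cur.execute(
            f"DELETE TOP ({int(batch_size)}) FROM dbo.Billing "
            "WHERE last_activity < ?",
            cutoff_date,
        )
        if cur.rowcount == 0:
            break  # nothing left to delete
```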
What is the fastest method to fill a database table with 10 million rows? I'm asking about the technique, but also about any specific database engine that would allow doing this as fast as possible. I'm not requiring this data to be indexed during this initial table population.
Using SQL to load a lot of data into a database will usually result in poor performance. To do things quickly, you need to go around the SQL engine. Most databases (including Firebird, I think) have the ability to back up all the data into a text (or maybe XML) file and to restore the entire database from such a dump file. Since the restoration process doesn't need to be transaction-aware and the data isn't represented as SQL, it is usually very quick.
I would write a script that generates a dump file by hand, and then use the database's restore utility to load the data.
After a bit of searching I found FBExport, which seems to do exactly that: you'll just need to generate a CSV file and then use the FBExport tool to import the data into your database.
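A hedged sketch of generating such a CSV by hand before handing it to FBExport; the column layout and row count here are made up:

```python
import csv
import random

with open("bulk_load.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for i in range(10_000_000):
        # id, name, and a random value; match the columns of your target table
        writer.writerow([i, f"name_{i}", random.randint(1, 100)])
```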
The fastest method is probably running an INSERT statement with a SELECT FROM. I've generated test data to populate tables from other databases, and even from the same database, a number of times. But it all depends on the nature and availability of your own data. In my case I had enough rows of collected data that a few select/insert routines, with random row selection applied half-cleverly against real data, yielded decent test data quickly. In some cases where the table data was uniquely identifying, I used intermediate tables and frequency-distribution sorting to eliminate things like uncommon names (dropping instances where a COUNT with GROUP BY was less than or equal to 2).
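For example, a hedged sketch of that kind of INSERT ... SELECT multiplication, here against SQL Server via pyodbc; the seed and target tables are placeholders:

```python
import pyodbc

conn = pyodbc.connect("DSN=target_db")  # placeholder connection
cur = conn.cursor()
cur.execute("""
    INSERT INTO test_orders (customer_name, amount)
    SELECT a.customer_name, b.amount
    FROM seed_orders AS a
    CROSS JOIN seed_orders AS b   -- ~3,200 seed rows squared is ~10 million combinations
""")
conn.commit()
```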
Also, Red Gate actually provides a utility to do just what you're asking. It's not free, and I think it's SQL Server-specific, but their tools are top-notch and well worth the cost. There's also a free trial period.
If you don't want to pay for their utility, you could conceivably build your own pretty quickly. What they do is not magic by any means; a decent developer should be able to knock out a similarly featured, though alpha/hardcoded, version of the app in a day or two...
You might be interested in the answers to this question. It looks at uploading a massive CSV file to a SQL Server (2005) database. For SQL Server, it appears that an SSIS DTS package is the fastest way to bulk-import data into a database.
It entirely depends on your DB. For instance, Oracle has something called direct path load (http://download.oracle.com/docs/cd/B10501_01/server.920/a96652/ch09.htm), which effectively disables indexing and, if I understand correctly, builds the binary structures that will be written to disk on the client side rather than sending SQL over.
Combined with partitioning and rebuilding indexes per partition, we were able to load a 1 billion row (I kid you not) database in relatively short order. 10 million rows is nothing.
Use MySQL or MS SQL and built-in functions to generate records inside the database engine, or generate a text file (in a CSV-like format) and then use bulk copy functionality.