Nifi ExecuteSQLRecord: how to capture performance bottleneck - sql-server

Objective
I have an Apache NiFi Docker container on an Azure VM with an attached premium, very high-throughput SSD disk. I have an MSSQL Server 2012 database on AWS. NiFi-to-database communication happens through the mssql JDBC jar v6.2, over a high-throughput AWS Direct Connect MPLS network.
Within the NiFi flow only one processor is executed - ExecuteSQLRecord. It uses only one thread/CPU and has 4 GB of JVM heap space available. ExecuteSQLRecord executes a query that returns 1 million rows, which equals a 60 MB FlowFile. The query uses table indexes, so there is nothing to optimize on the DB side. The query looks like: SELECT * FROM table WHERE id BETWEEN x AND y.
The issue
ExecuteSQLRecord with 1 thread/CPU and 1 query retrieves 1M rows (60 MB) in 40 seconds.
At the same time, the same query run from SSMS on the database's internal network takes 18 seconds.
The query is already optimized on the DB side (with indexes), and throughput scales linearly with an increasing number of threads/CPUs - the network is not a bottleneck.
Questions
Is this performance okay for NiFi with 1 CPU? Is it okay that NiFi spends 22 seconds (out of 40) retrieving the results and storing them in the Content Repository?
How does NiFi pull the data from MSSQL Server? Is this a pull approach? If yes, maybe we have too many round trips?
How can I check how much time NiFi spends converting the result set to CSV, and how much time writing it to the Content Repository?

Are you using the latest Docker image (1.11.4)? If so, you should be able to set the fetch size on the ExecuteSQLRecord processor (https://issues.apache.org/jira/browse/NIFI-6865).
I got a couple of different results when I searched for the default fetch size for the MSSQL driver; one site said 1 and another said 32. In your case, for that many records, I'd imagine you'd want it to be way higher (see https://learn.microsoft.com/en-us/previous-versions/sql/legacy/aa342344(v=sql.90)?redirectedfrom=MSDN#use-the-appropriate-fetch-size for setting the appropriate fetch size).

To add to Matt's answer, you can examine the provenance data for each FlowFile and see the lineage duration (the amount of time) it spent in each segment of the flow. You can also see the status history for every processor, so you can examine the data in/out by size and number of FlowFiles, CPU usage, etc. for each processor.

Related

How to transfer 100 GB of data over the network every night at high speed?

There is an external database in which we have access to 7 views. Every morning we want to pull all the records from those views. It appears the data is around 100 GB. I have tried Spring Batch, but it takes almost 15 hours to pull that amount of data. I am looking for a solution that does two things: 1. speed up this process so it takes 1 or 2 hours max, and 2. if there is a network failure, email the stakeholders about the failure. We need the data both in Elasticsearch and in MS SQL Server.
These are the things I have tried:
Apache Kafka with the JDBC Source connector: didn't work because the views have neither a primary key column nor a timestamp column.
Spring Batch with JdbcItemReader and RepositoryItemWriter: pretty slow (MS SQL Server to MS SQL Server).
Spring Batch with JdbcItemReader and KafkaItemWriter, plus a Kafka consumer doing Elasticsearch bulk indexing: this is the fastest, at around 15 hours. Chunk sizes of 10k and 5k both take around the same amount of time.
The Debezium source connector for Kafka: didn't work since the source DB has CDC disabled.
What are my options?

Generate SQL Server CPU usage for last 7 days

I have a requirement to generate CPU usage reports for my SQL Server for the previous 7 days. I will use a graph to represent it.
Also, I have to keep track of the top 10 queries which consumed the most CPU each day.
I found the post below, but I have a few doubts.
CPU utilization by database?
Doubt:
How will I know what the overall CPU usage was yesterday? Do I have to add up the AvgCPU time for the distinct queries that ran yesterday?
There is no reliable way of getting CPU usage per day / for the last 5 days. SQL Server has the columns below:
select
creation_time,
last_worker_time,
total_worker_time,
execution_count,
last_execution_time
from sys.dm_exec_query_stats
From the values reported on my test instance, we can't reliably get the count of times a particular query was executed on a particular day. Moreover, all of this data gets reset if you restart SQL Server.
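For the "top 10 queries by CPU" part of the requirement, a minimal T-SQL sketch against sys.dm_exec_query_stats could look like the following (the figures are cumulative since each plan was cached, not per day, and total_worker_time is reported in microseconds):
-- top 10 statements by cumulative CPU since they entered the plan cache
SELECT TOP 10
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms,
    st.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;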
If you really want to show data on a daily basis, you could use Perfmon. Here are some tutorials which may help you:
1. Collecting Performance Data into a SQL Server Table
2. Using PerfMon for SQL Server Reporting Services Performance Management
You may also take a look at MS SQL Data Collection Sets. They are easy to deploy, easy to use for keeping the necessary data (as long as it is stored in a dedicated DB), and they really fit your requirement for the top 10 CPU-expensive queries.
You can also slightly modify the T-SQL for the collector agent and the target tables on the collector server in order to obtain some extra CPU info if you need it.

Warehouse PostgreSQL database architecture recommendation

Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has a natural unique group key, so the data with one group key is totally independent from the data with other group keys. Reports are built using only one group key at a time, so all of the queries use a "WHERE groupKey = X" clause. The data in PostgreSQL is updated intensively via parallel processes which add data to different groups, but I don't need real-time reports. One update per 30 minutes is fine.
Problem:
There are already about 4 GB of data, and I have found that some reports take significant time to generate (up to 15 seconds), because they need to query not a single table but 3-4 of them.
What I want to do is to reduce the time it takes to create a report without significantly changing the technologies or schemes of the solution.
Possible solutions
What I was thinking about is:
Splitting the one database into several databases, one database per group key. Then I would get rid of WHERE groupKey = X (though I have an index on that column in each table), and the number of rows to process each time would be significantly smaller.
Creating a slave database for reads only. Then I would have to sync the data with PostgreSQL's replication mechanism, for example once per 15 minutes (can I actually do that, or do I have to write custom code?).
I don't want to change the database to NoSQL because I would have to rewrite all the SQL queries, and I don't want to do that. I might switch to another SQL database with column-store support if it is free and runs on Windows (sorry, I don't have a Linux server, but I might get one if I have to).
Your ideas
What would you recommend as the first simple steps?
Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that are precomputed results of the queries your users are likely to run, e.g. a table containing the counts and sums grouped by the various dimensions. This can be an automated process: a DB function (or script) run via your job scheduler of choice refreshes the data every N minutes (see the sketch after this list).
2). Regarding replication, if you are using Streaming Replication (PostgreSQL 9+), the changes in the master db are replicated to the slave databases (hot standby = read only) for reporting.
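As a minimal sketch of point 1), assuming PostgreSQL 9.3+ (for materialized views) and a hypothetical events fact table with group_key, created_at and amount columns; on older versions the same idea works with an ordinary summary table rebuilt by a scheduled job:
-- precomputed counts and sums per group key and day
CREATE MATERIALIZED VIEW report_summary AS
SELECT group_key,
       date_trunc('day', created_at) AS day,
       count(*)    AS row_count,
       sum(amount) AS total_amount
FROM events
GROUP BY group_key, date_trunc('day', created_at);

CREATE INDEX ON report_summary (group_key);

-- run from cron / pg_cron every 30 minutes
REFRESH MATERIALIZED VIEW report_summary;
Reports then read report_summary WHERE group_key = X instead of joining the 3-4 base tables on every request.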
Tune the report queries. Use EXPLAIN. Avoid procedures when you can do it in pure SQL.
Tune the server: memory, disk, processor. Take a look at the server config.
Upgrade the Postgres version.
Run VACUUM.
Out of these 4, only the first will require significant changes in the application.

Database solution for 200million writes/day, monthly summarization queries

I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.)
I need to log around 200 million rows (or more) per 8 hour workday to a database, then perform weekly/monthly/yearly summary queries on that data. The summary queries would be for collecting data for things like billing statements, eg. "How many transactions of type A did each user run this month?" (could be more complex, but that's the general idea).
I can spread the database amongst several machines, as necessary, but I don't think I can take old data offline. I'll definitely need to be able to query a month's worth of data, maybe a year. These queries would be for my own use, and wouldn't need to be generated in real-time for an end-user (they could run overnight, if needed).
Does anyone have any suggestions as to which databases would be a good fit?
P.S. Cassandra looks like it would have no problem handling the writes, but what about the huge monthly table scans? Is anyone familiar with Cassandra/Hadoop MapReduce performance?
I'm working on a very similar process at present (a web domain crawling database) with the same kind of significant transaction rates.
At these ingest rates, it is critical to get the storage layer right first. You're going to be looking at several machines connecting to the storage in a SAN cluster. A single database server can support millions of writes a day; what matters is the amount of CPU used per "write" and the speed at which the writes can be committed.
(Network performance is also often an early bottleneck.)
With clever partitioning, you can reduce the effort required to summarise the data. You don't say how up-to-date the summaries need to be, and this is critical. I would try to push back from "realtime" and suggest overnight (or, if you can get away with it, monthly) summary calculations.
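To illustrate the partitioning idea, here is a hedged T-SQL sketch; it assumes SQL Server 2005+ (Enterprise Edition for table partitioning) and a hypothetical dbo.TransactionLog table, but the same approach applies to any engine with range partitioning:
-- one partition per day, so summary queries only scan the days they need
CREATE PARTITION FUNCTION pfByDay (datetime)
    AS RANGE RIGHT FOR VALUES ('2010-06-01', '2010-06-02', '2010-06-03');

CREATE PARTITION SCHEME psByDay
    AS PARTITION pfByDay ALL TO ([PRIMARY]);

CREATE TABLE dbo.TransactionLog (
    TxId     bigint IDENTITY NOT NULL,
    UserId   int      NOT NULL,
    TxType   char(1)  NOT NULL,
    LoggedAt datetime NOT NULL
) ON psByDay (LoggedAt);

-- a monthly billing summary then touches only that month's partitions
SELECT UserId, TxType, COUNT(*) AS TxCount
FROM dbo.TransactionLog
WHERE LoggedAt >= '2010-06-01' AND LoggedAt < '2010-07-01'
GROUP BY UserId, TxType;

-- as new days arrive, mark a NEXT USED filegroup on the scheme and SPLIT the function's range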
Finally, we're using a 2-CPU, 4 GB RAM Windows 2003 virtual SQL Server 2005 instance and a single-CPU, 1 GB RAM IIS web server as our test system, and we can ingest 20 million records in a 10-hour period (the storage is RAID 5 on a shared SAN). We get ingest rates of up to 160 records per second, batched in blocks of 40 records per network round trip.
Cassandra + Hadoop does sound like a good fit for you. 200M/8h is 7000/s, which a single Cassandra node could handle easily, and it sounds like your aggregation stuff would be simple to do with map/reduce (or higher-level Pig).
Greenplum or Teradata would be a good option. These databases are MPP and can handle petabyte-scale data. Greenplum is a distributed PostgreSQL DB and also has its own MapReduce. While Hadoop may solve your storage problem, it wouldn't be helpful for performing summary queries on your data.

Tracking down data load performance issues in SSIS package

Are there any ways to determine what differences between databases affect an SSIS package's load performance?
I've got a package which loads and does various bits of processing on ~100k records; against my laptop database it finishes in about 5 minutes.
I tried the same package and the same data on the test server, which is a reasonable box in both CPU and memory, and it's still running ... about 1 hour so far :-(
I checked the package with a small set of data, and it ran through OK.
I've had similar problems over the past few weeks, and here are several things you could consider, listed in decreasing order of importance according to what made the biggest difference for us:
Don't assume anything about the server.
We found that our production server's RAID was misconfigured (HP sold us disks with firmware mismatches) and the disk write speed was literally a 50th of what it should have been. So check out the server metrics with Perfmon.
Check that enough RAM is allocated to SQL Server. Inserts of large datasets often require use of RAM and TempDB for building indices, etc. Ensure that SQL has enough RAM that it doesn't need to swap out to Pagefile.sys.
As per the holy grail of SSIS, avoid manipulating large datasets using T-SQL statements. All T-SQL statements cause changed data to be written to the transaction log, even if you use the Simple Recovery Model. The difference is that the Simple model automatically truncates the log at each checkpoint rather than waiting for a log backup. This means that large datasets, when manipulated with T-SQL, thrash the log file, killing performance.
For large datasets, do data sorts at the source if possible. The SSIS Sort component chokes on reasonably large datasets, and the only viable alternative (nSort by Ordinal, Inc.) costs $900 for a non-transferable per-CPU license. So... if you absolutely have to sort a large dataset, consider loading it into a staging database as an intermediate step.
Use the SQL Server Destination if you know your package is going to run on the destination server, since it offers roughly 15% performance increase over OLE DB because it shares memory with SQL Server.
Increase the network packet size to 32767 on your database connection managers. This allows large volumes of data to move faster from the source server(s), and can noticeably improve reads on large datasets.
If using Lookup transforms, experiment with cache sizes - between using a Cache connection or Full Cache mode for smaller lookup datasets, and Partial / No Cache for larger datasets. This can free up much needed RAM.
If combining multiple large datasets, use either RAW files or a staging database to hold your transformed datasets, then combine and insert all of a table's data in a single data flow operation, and lock the destination table. Using staging tables or RAW files can also help relieve table locking contention.
Last but not least, experiment with the DefaultBufferSize and DefaultBufferMaxRows properties. You'll need to monitor your package's "Buffers Spooled" performance counter using Perfmon.exe, and adjust the buffer sizes upwards until you see buffers being spooled (paged to disk), then back off a little.
The point above about locking the destination table is especially important on very large datasets, since you can only achieve a minimally logged bulk insert operation if:
The destination table is empty, and
The table is locked for the duration of the load operation.
The database is in the Simple / Bulk Logged recovery model.
This means that subsequent bulk loads into a table will always be fully logged, so you want to get as much data as possible into the table on the first data load.
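As a hedged T-SQL illustration of those conditions (the MyWarehouse database, dbo.FactSales table and staging file path are all hypothetical):
-- switch to a bulk-logged (or simple) recovery model for the load window
ALTER DATABASE MyWarehouse SET RECOVERY BULK_LOGGED;

-- empty destination table + table lock => the load can be minimally logged
TRUNCATE TABLE dbo.FactSales;

BULK INSERT dbo.FactSales
FROM 'D:\staging\factsales.dat'   -- hypothetical staging file
WITH (TABLOCK);
The OLE DB Destination's fast-load Table Lock option is aiming for the same effect from inside the SSIS data flow.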
Finally, if you can partition your destination table and then load the data into each partition in parallel, you can achieve up to 2.5 times faster load times, though this isn't usually a feasible option out in the wild.
If you've ruled out network latency, your most likely culprit (with real quantities of data) is your pipeline organisation. Specifically, what transformations you're doing along the pipeline.
Data transformations come in four flavours:
streaming (entirely in-process/in-memory)
non-blocking (but still using I/O, e.g. lookup, oledb commands)
semi-blocking (blocks a pipeline partially, but not entirely, e.g. merge join)
blocking (blocks a pipeline until it's entirely received, e.g. sort, aggregate)
If you have a few blocking transforms, they will significantly hurt your performance on large datasets. Even semi-blocking transforms, on unbalanced inputs, will block for long periods of time.
In my experience the biggest performance factor in SSIS is Network Latency. A package running locally on the server itself runs much faster than anything else on the network. Beyond that I can't think of any reasons why the speed would be drastically different. Running SQL Profiler for a few minutes may yield some clues there.
CozyRoc over at the MSDN forums pointed me in the right direction ...
- used SSMS / Management / Activity Monitor and spotted lots of TRANSACTION entries
- that got me thinking; read up on the OLE DB connector and unchecked the Table Lock option
- WHAM ... data loads fine :-)
Still don't understand why it works fine on my laptop DB and stalls on the test server?
- I was the only person using the test DB, so it's not as if there should have been any contention for the tables??
