Importing large data regularly in SQL Server - sql-server

We have an application that uses SQL Server and needs to be refreshed with a source system's data at a regular interval. There are millions of records in the source system and we refresh it every 30 mins. We are currently using openquery and cursors to import the data and keep it fresh. However, this approach seems time consuming and not very reliable.
Does anyone know any other options we can use?
Also, for some source tables there are hooks, such as a last-modified datetime column, that we can use to fetch only the rows changed since the last refresh. But that isn't very reliable either: the field doesn't seem to get updated every time, and not all tables have one. So it is really a pain to deal with the tables that lack such hooks.
Do you think we could use big data solutions like Hadoop, MapReduce, etc. in any way? My impression was that they are useful for storing and fetching legacy data and/or for handling bigger data, such as files. I'm not sure how they'd come into play in just importing table data.
Any suggestion is greatly appreciated.
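For the tables that do have a trustworthy last-modified column, the incremental refresh described above can be sketched as a set-based copy driven by a watermark, rather than a row-by-row cursor. This is only a minimal sketch: it uses Python's built-in sqlite3 as a stand-in for SQL Server, and the `items` table, its columns, and the `import_changed_rows` helper are all hypothetical names. It does nothing for tables where the column is missing or not maintained.

```python
import sqlite3


def make_db():
    # Hypothetical schema; the real source and target tables will differ.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT, last_modified TEXT)"
    )
    return db


def import_changed_rows(source, target, last_sync):
    """Copy rows modified after last_sync and return the new watermark."""
    rows = source.execute(
        "SELECT id, payload, last_modified FROM items "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_sync,),
    ).fetchall()
    # One batched statement instead of a per-row cursor loop.
    target.executemany(
        "INSERT OR REPLACE INTO items (id, payload, last_modified) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    # If nothing changed, keep the old watermark.
    return rows[-1][2] if rows else last_sync


source, target = make_db(), make_db()
source.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (1, "a", "2024-01-01 10:00:00"),
        (2, "b", "2024-01-01 10:20:00"),
        (3, "c", "2024-01-01 10:40:00"),
    ],
)
watermark = import_changed_rows(source, target, "2024-01-01 10:10:00")
copied = target.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(watermark, copied)
```

The returned watermark would be stored and passed in on the next 30-minute run.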

Related

Which one is better, iterate and sort data in backend or let the database handle it?

I'm trying to design a database schema for a Django REST framework web application.
At some point, I have two choices:
1- Choose a schema in which, in one or several APIs, I have to get a queryset from the database and iterate over and order it with Python. (For example, I can store some data in an array-typed column, fetch it from the database and sort it with Python.)
2- Store the data in another table and insert a fairly large number of rows with each insert. This way, I can get the data in my preferred format with far fewer lines of ORM code.
I tried some basic tests and benchmarks to see which way is faster, and letting the database handle more of the job (the second way) didn't let me down. But I don't have the means to set up a more realistic situation, so here's the question:
Is it still a good idea to let the database handle the job when it also has to handle hundreds of requests from other APIs and clients each second?
Is the database (and ORM) usually faster and more reliable than the backend?
As a general rule, you want to let the database do work when the work is appropriate for the database. Sorting result sets would be in that category.
Keep in mind:
The database is running on a server, often on a distributed system and so it has access to more resources.
Databases are designed to handle large data, so they are not limited by the memory in a single thread.
When this question comes up, often more data needs to be passed back to the application than is strictly needed. Consider a problem such as getting the top 10 of something.
Mixing processing in the application and the database often requires multiple queries and passing data back and forth, which is expensive.
(And there are no doubt other considerations.)
There are some situations where it might be more efficient or convenient to do work in the application. A common example is formatting result sets for the application -- say turning 1234.56 into $1,234.56. Other examples would be when the application language has capabilities that are not directly in SQL or are hard to implement in SQL.
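To illustrate the split described above, here is a small sketch using Python's built-in sqlite3 module (the `sales` table and its contents are made up for the example). The database sorts and trims to the top 10, so only ten rows travel to the application, while the currency formatting stays in application code.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
db.executemany(
    "INSERT INTO sales (amount) VALUES (?)",
    [(float(i),) for i in range(1, 1001)],
)

# Database does the work: only 10 rows are returned.
top_ten = db.execute(
    "SELECT amount FROM sales ORDER BY amount DESC LIMIT 10"
).fetchall()

# Application does the work: all 1000 rows are transferred, then sorted.
all_rows = db.execute("SELECT amount FROM sales").fetchall()
top_ten_in_app = sorted(all_rows, reverse=True)[:10]

# Presentation is a reasonable job for the application.
formatted = f"${1234.56:,.2f}"
print(top_ten[0][0], formatted)
```

Both approaches give the same ten rows here; the difference is how much data has to be moved to get them.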

Using application's internal cache while working with Cassandra

As I've been working with traditional relational databases for a long time, moving to NoSQL, especially Cassandra, is a big change. I usually design my application so that everything in the database is loaded into the application's internal caches on startup, and if there is any update to a database table, its corresponding cache is updated as well. For example, if I have a table Student, on startup, all data in that table is loaded into StudentCache, and when I want to insert/update/delete, I will call a service which updates both of them at the same time. The aim of my design is to prevent selecting directly from the database.
In Cassandra, as the idea is to build tables containing all needed data so that joins are unnecessary, I wonder if my favorite design is still useful, or whether it is more effective to query data directly from the database (i.e. from one table) when required.
Based on the use case you describe, I'd say that querying data as you need it avoids storing data you don't need. Plus, what if your dataset is 5 GB? Are you still going to load the entire dataset?
Maybe consider a design where you don't load all the data on startup, but load it as needed, then store it and check this store before querying again, like what a cache does!
Cassandra is built to scale; your design can't handle scaling, and you'll reach a point where your dataset is too large. Based on that, you should think about a tradeoff: lots of on-the-fly querying vs. storing everything in the client. I would advise direct queries, but store the data when you do carry out a query; don't discard it and then carry out the same query again!
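The "load as needed, then check the store before querying again" idea can be sketched as a read-through cache. This is a minimal sketch, not a Cassandra client: `fetch_student` and `FAKE_TABLE` are hypothetical stand-ins for a real query, and the cache has no eviction, so a real one would need a size limit or expiry to avoid the memory problem described above.

```python
from typing import Any, Callable, Dict, Hashable


class ReadThroughCache:
    """Fetch a value on first access, then serve it from memory."""

    def __init__(self, fetch: Callable[[Hashable], Any]):
        self._fetch = fetch
        self._store: Dict[Hashable, Any] = {}
        self.misses = 0  # number of times the database was actually hit

    def get(self, key: Hashable) -> Any:
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._fetch(key)
        return self._store[key]

    def invalidate(self, key: Hashable) -> None:
        # Call this after an insert/update/delete so the next read refetches.
        self._store.pop(key, None)


# Hypothetical stand-in for a database query.
FAKE_TABLE = {1: "Alice", 2: "Bob"}


def fetch_student(student_id):
    return FAKE_TABLE[student_id]


students = ReadThroughCache(fetch_student)
first = students.get(1)   # goes to the "database"
second = students.get(1)  # served from the cache
print(first, second, students.misses)
```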
I would suggest querying the data directly, as keeping all the data in the application makes the application's performance depend on the size of the input. Now this might be a good thing if you know that the amount of data will never exceed your target machine's memory.
Should you, however, decide that this limit must change (higher!), you will be faced with a problem. Taking this approach will be fast when it comes to searching (assuming you sort the result at startup) but will pretty much kill maintainability.
Your former favorite approach is, however, still useful should you choose to go that way.

Heavy vs light queries

When you have an application that is constantly querying a database for information, is it better, in terms of performance and database usage, to have one big query that pulls in all the data at once, or a bunch of smaller queries that pull in the data one piece at a time? Does it matter?
I'm trying to figure out whether I should query my entire class once any value in the class is requested, or only query individual values as they are needed.
If others are paying for your usage, I would recommend pulling as much data as you can in a single connection and work with it locally. However, I don't know what 'usage' is defined as. It could be connections, bandwidth, operations, etc.

Storing large amounts of data in a database

I'm currently working on a home-automation project which provides the user with the possibility to view their energy usage over a period of time. Currently we request data every 15 minutes and we are expecting around 2000 users for our first big pilot.
My boss is requesting that we store at least half a year of data. A quick sum leads to estimates of around 35 million records. Though these records are small (around 500 bytes each), I'm still wondering whether storing these in our database (Postgres) is the correct decision.
Does anyone have some good reference material and/or advice about how to deal with this amount of information?
For now, 35M records of 0.5 KB each means about 17.5 GB of data. This fits in a database for your pilot, but you should also think about the next step after the pilot. Your boss will not be happy if the pilot is a big success and you then tell him that you cannot add 100,000 users to the system in the coming months without redesigning everything. Moreover, what about a new feature for VIP users to request data every minute...
This is a complex issue and the choice you make will restrict the evolution of your software.
For the pilot, keep it as simple as possible to get the product out as cheaply as possible --> a database is OK. But tell your boss that you cannot open up the service like that and that you will have to change things before taking on 10,000 new users per week.
One thing for the next release: have several data repositories: one for your user data that is updated frequently, one for your queries/statistics system, ...
You could look at RRD for your next release.
Also keep in mind the update frequency: 2000 users updating data every 15 minutes means 2.2 updates per second --> OK; 100,000 users updating data every 5 minutes means 333.3 updates per second. I am not sure a simple database can keep up with that, and a single web service server definitely cannot.
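The sizing figures can be checked with a few lines of arithmetic. The sketch below takes "half a year" as 182 days and uses the 500 bytes per record from the question; both are assumptions, and the storage figure ignores index and row overhead.

```python
SECONDS_PER_MINUTE = 60


def records_stored(users, interval_minutes, days):
    """Number of samples kept for all users over the retention period."""
    samples_per_day = 24 * 60 // interval_minutes
    return users * samples_per_day * days


def updates_per_second(users, interval_minutes):
    """Average write rate if every user reports once per interval."""
    return users / (interval_minutes * SECONDS_PER_MINUTE)


pilot_records = records_stored(users=2000, interval_minutes=15, days=182)
pilot_bytes = pilot_records * 500  # raw payload only
pilot_rate = updates_per_second(2000, 15)
scaled_rate = updates_per_second(100_000, 5)

print(f"{pilot_records:,} records")          # 34,944,000 records
print(f"{pilot_bytes / 1e9:.1f} GB")         # 17.5 GB
print(f"{pilot_rate:.1f} updates/s")         # 2.2 updates/s
print(f"{scaled_rate:.1f} updates/s")        # 333.3 updates/s
```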
We frequently hit tables that look like this. Obviously structure your indexes based on usage (do you read or write a lot, etc), and from the start think about table partitioning based on some high level grouping of the data.
Also, you can implement an archiving idea to keep the live table thin. Historical records are either never touched, or reported on, both of which are no good to live tables in my opinion.
It's worth noting that we have tables around 100m records and we don't perceive there to be a performance problem. A lot of these performance improvements can be made with little pain afterwards, so you could always start with a common-sense solution and tune only when performance is proven to be poor.
With appropriate indexes to avoid slow queries, I wouldn't expect any decent RDBMS to struggle with that kind of dataset. Lots of people are using PostgreSQL to handle far more data than that.
It's what databases are made for :)
First of all, I would suggest that you run a performance test: write a program that generates test entries corresponding to the number of entries you'll see over half a year, insert them, and check whether query times are satisfactory. If not, try indexing as suggested by other answers. It is, by the way, also worth testing write performance to ensure that you can actually insert the amount of data you generate in 15 minutes in 15 minutes or less.
Making a test will avoid the mother of all problems - assumptions :-)
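Such a test could be sketched roughly as follows. This is only an outline under stated assumptions: it uses Python's built-in sqlite3 in memory as a stand-in for Postgres, the `readings` table and its columns are hypothetical, and the default sizes are tiny so it runs instantly. For a meaningful result you would point it at the real database and raise the numbers to the half-year volume.

```python
import random
import sqlite3
import time


def run_load_test(users=50, samples_per_user=200):
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE readings (user_id INTEGER, taken_at INTEGER, watts REAL)"
    )
    # One synthetic reading per user every 900 seconds (15 minutes).
    rows = [
        (user, sample * 900, random.uniform(0, 5000))
        for user in range(users)
        for sample in range(samples_per_user)
    ]

    start = time.perf_counter()
    db.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    db.commit()
    insert_seconds = time.perf_counter() - start

    db.execute("CREATE INDEX idx_user_time ON readings (user_id, taken_at)")

    # Typical read: one user's readings for one day (96 samples).
    start = time.perf_counter()
    matched, _average = db.execute(
        "SELECT COUNT(*), AVG(watts) FROM readings "
        "WHERE user_id = ? AND taken_at BETWEEN ? AND ?",
        (0, 0, 900 * 95),
    ).fetchone()
    query_seconds = time.perf_counter() - start

    return len(rows), insert_seconds, query_seconds, matched


total, insert_s, query_s, matched = run_load_test()
print(f"inserted {total} rows in {insert_s:.4f}s, query took {query_s:.6f}s")
```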
Also think about production performance - your pilot will have 2000 users - will your production environment have 4000 users or 200000 users in a year or two?
If we're talking about a really big environment, you need to think about a solution that allows you to scale out by adding more nodes instead of relying on always being able to add more CPU, disk and memory to a single machine. You can either do this in your application by keeping track of which of multiple database machines hosts the details for a specific user, or you can use one of the PostgreSQL clustering methods, or you could go down a completely different path: the NoSQL approach, where you walk away completely from the RDBMS and use systems that are built to scale horizontally.
There are a number of such systems. I only have personal experience of Cassandra. You have to think completely differently compared to what you're used to from the RDBMS world, which is something of a challenge: think more about how you want to access the data rather than how to store it. For your example, I think it would make sense to store the data with the user ID as the key and then add one column per sample, with the column name being the timestamp and the column value being your data for that timestamp. You can then ask for slices of those columns, for example for graphing results in a web UI. Cassandra has good enough response times for UI applications.
The upside of investing time in learning and using a nosql system is that when you need more space - you just add a new node. Same thing if you need more write performance, or more read performance.
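The access pattern described above (one row per user, one column per timestamp, read back as slices) can be modeled in plain Python to make the idea concrete. This is not Cassandra code or its API; `WideRowStore` is a hypothetical in-memory model that only shows why keeping a user's samples ordered by timestamp makes range reads cheap.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class WideRowStore:
    """In-memory model: row key = user ID, column name = timestamp."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # user_id -> sorted timestamps
        self._values = defaultdict(dict)      # user_id -> {timestamp: value}

    def add(self, user_id, timestamp, value):
        if timestamp not in self._values[user_id]:
            insort(self._timestamps[user_id], timestamp)
        self._values[user_id][timestamp] = value

    def slice(self, user_id, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        timestamps = self._timestamps.get(user_id, [])
        low = bisect_left(timestamps, start)
        high = bisect_right(timestamps, end)
        return [(t, self._values[user_id][t]) for t in timestamps[low:high]]


store = WideRowStore()
store.add("user-42", 900, 1.2)
store.add("user-42", 2700, 0.9)
store.add("user-42", 1800, 1.5)

window = store.slice("user-42", 900, 1800)
print(window)  # [(900, 1.2), (1800, 1.5)]
```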
Are you not better off not keeping individual samples for the full period? You could possibly implement some sort of consolidation mechanism, which concatenates weekly/monthly samples into one record. And run said consolidation on a schedule.
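A consolidation job of this kind might look like the sketch below, which rolls 15-minute samples up into one total per user per ISO week. The `(user_id, unix_timestamp, kwh)` sample layout and the choice of a simple sum are assumptions; depending on the reports you need, you might keep min/max/average instead.

```python
from collections import defaultdict
from datetime import datetime, timezone


def consolidate_weekly(samples):
    """Sum samples into one record per (user_id, ISO year, ISO week)."""
    totals = defaultdict(float)
    for user_id, unix_ts, kwh in samples:
        iso = datetime.fromtimestamp(unix_ts, tz=timezone.utc).isocalendar()
        totals[(user_id, iso[0], iso[1])] += kwh
    return dict(totals)


DAY = 86400
samples = [
    (1, 0, 0.5),          # 1970-01-01, ISO week 1
    (1, 900, 0.25),       # same day, 15 minutes later
    (1, 7 * DAY, 1.0),    # 1970-01-08, ISO week 2
]
weekly = consolidate_weekly(samples)
print(weekly)  # {(1, 1970, 1): 0.75, (1, 1970, 2): 1.0}
```

Run on a schedule, the job would write the weekly records to a summary table and then delete or archive the raw samples it has consumed.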
Your decision has to depend on the type of queries you need to be able to run on the database.
There are lots of techniques to handle this problem. You will only get good performance if you touch the minimum number of records. In your case you can use the following techniques.
Try to keep old data in a separate table. Here you can use table partitioning, or a different approach where you store your old data in the file system and serve it directly from your application without connecting to the database; this way your database stays free. I am doing this for one of my projects, and it already has more than 50 GB of data but runs very smoothly.
Try to index table columns, but be careful, as it will affect your insertion speed.
Try batch processing for your insert or select queries. You can handle this issue very smartly here.
Example: suppose you get a request to insert a record into a table every second. You can build a mechanism that processes these requests in batches of 5 records, so that you hit your database only once every 5 seconds, which is much better. Yes, you can make users wait 5 seconds for their record to be inserted, as in Gmail, where you send an email and it asks you to wait while it is processing. For selects, you can periodically write your result set to the file system and serve it directly to the user without touching the database, as most stock market data companies do.
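The batching mechanism could be sketched like this. It is a minimal illustration using Python's built-in sqlite3; the `events` table and the `BatchingWriter` class are hypothetical names, and a real version would also flush on a timer so that a half-full batch does not wait forever.

```python
import sqlite3


class BatchingWriter:
    """Buffer rows and write them to the database in groups."""

    def __init__(self, db, batch_size=5):
        self._db = db
        self._batch_size = batch_size
        self._pending = []
        self.flushes = 0  # number of times the database was hit

    def add(self, row):
        self._pending.append(row)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        self._db.executemany(
            "INSERT INTO events (user_id, message) VALUES (?, ?)", self._pending
        )
        self._db.commit()
        self._pending.clear()
        self.flushes += 1


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, message TEXT)")

writer = BatchingWriter(db, batch_size=5)
for i in range(12):
    writer.add((i, f"event {i}"))
writer.flush()  # write the 2 rows left over after two full batches

stored = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(stored, writer.flushes)  # 12 3
```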
You can also use an ORM like Hibernate. It will use caching techniques to speed up access to your data.
For any further query you can mail me on ranjeet1985#gmail.com

Dataset retrieving data from another dataset

I work with an application that is switching from file-based data storage to database-based storage. It has a very large amount of code written specifically for the file-based system. To make the switch I am implementing functionality that works like the old system; the plan is then to make more optimal use of the database in new code.
One problem is that the file-based system often read single records, and read them repeatedly for reports. This has become a lot of queries to the database, which is slow.
The idea I have been trying to flesh out is using two datasets. One dataset to retrieve an entire table, and another dataset to query against the first, thereby decreasing communication overhead with the database server.
I've tried to look at the DataSource property of TADODataSet but the dataset still seems to require a connection, and it asks the database directly if Connection is assigned.
The reason I would prefer to get the result in another dataset, rather than navigating the first one, is that there is already implemented a good amount of logic for emulating the old system. This logic is based on having a dataset containing only the results as queried with the old interface.
The functionality only has to support reading data, not writing it back.
How can I use one dataset to supply values for another dataset to select from?
I am using Delphi 2007 and MSSQL.
You can use a ClientDataSet/DataSetProvider pair to fetch data from an existing DataSet. You can use filters on the source dataset, filters on the ClientDataSet and provider events to trim the dataset only to the interesting records.
I've used this technique with success in a couple of migration projects and to mitigate a similar situation where an old SQL Server 7 database was queried thousands of times to retrieve individual records, with painful performance costs. Querying it only once and then fetching individual records from the client dataset was, at the time, not only an elegant solution but a great performance boost for that particular application. The best example was an 8-hour process reduced to 15 minutes... the poor users loved me that time.
A ClientDataSet is just a TDataSet you can seamlessly integrate into existing code and UI.
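The underlying idea (query the table once, then answer the individual record lookups from memory) is independent of Delphi. The sketch below shows it in Python with the built-in sqlite3 module; it is an analogy, not the ClientDataSet/DataSetProvider API, and the `customers` table and helper names are hypothetical.

```python
import sqlite3


def load_table(db):
    """One round trip: fetch the whole table and index rows by primary key."""
    cursor = db.execute("SELECT id, name, city FROM customers")
    columns = [column[0] for column in cursor.description]
    return {row[0]: dict(zip(columns, row)) for row in cursor}


def filter_local(table, **criteria):
    """Select matching rows from the in-memory copy, without touching the database."""
    return [
        row
        for row in table.values()
        if all(row[field] == value for field, value in criteria.items())
    ]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
db.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", "Oslo"), (2, "Ben", "Bergen"), (3, "Cid", "Oslo")],
)

customers = load_table(db)                        # single query
single = customers[2]                             # record lookup, no query
in_oslo = filter_local(customers, city="Oslo")    # filtered subset, no query
print(single["name"], [row["name"] for row in in_oslo])  # Ben ['Ann', 'Cid']
```

Since the use case is read-only, the copy never has to be written back; it only needs to be reloaded when the report should see fresh data.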
