Daily data generation and insertion - database

I'm facing a problem that perhaps someone around here can help me with.
I work in a business intelligence company and I'd like to simulate the whole usage cycle of our product the way our clients use it.
The short version is that our customers insert some 20 million records into their database on a daily basis, and our product crunches the new data at the end of the day.
I would like to automatically create around 20 million records and insert them into a database (probably MSSQL) every day.
I should point out that the number of records should vary from day to day, somewhere between 15 and 25 million. Other than that, the data needs to be inserted into 6 tables linked with foreign keys.
I usually use Redgate's SQL Data Generator to create data, but as far as I can tell it's designed for one-time data generation rather than the ongoing generation I'm looking for.
If anyone knows of methods/tools adequate to this situation, please let me know.
Thanks!

You could also write a small Java (or similar) program to get the starting ID from the database, pick a random number of rows to insert, and then execute the data-generation tool as a child process.
For example, see Runtime.exec():
http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html
You can then run your program as a scheduled task or cron job.
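A rough sketch of such a program (the JDBC URL, table name, and generator command line below are placeholders, not anything specific to your setup):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ThreadLocalRandom;

public class DailyLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and table name -- adjust for your environment.
        String url = "jdbc:sqlserver://localhost;databaseName=SimDB;integratedSecurity=true";

        long startId;
        try (Connection con = DriverManager.getConnection(url);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT MAX(Id) FROM dbo.FactTable")) {
            rs.next();
            startId = rs.getLong(1) + 1;   // next free ID
        }

        // Somewhere between 15 and 25 million rows, varying from day to day.
        long rowCount = ThreadLocalRandom.current().nextLong(15_000_000L, 25_000_001L);

        // Hand the parameters to whatever generation tool you use as a child process.
        // The executable name and its arguments here are purely illustrative.
        Process p = Runtime.getRuntime().exec(new String[] {
                "datagen.exe", "--start-id", Long.toString(startId),
                "--rows", Long.toString(rowCount)
        });
        int exitCode = p.waitFor();
        System.out.println("Generator exited with code " + exitCode
                + ", rows requested: " + rowCount);
    }
}
```

Scheduling that class once a day from Task Scheduler or cron covers the "every day" part, and randomising the row count covers the 15-25 million variation.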

Related

Developing an optimal solution/design that sums many database rows for a reporting engine

Problem: I am developing a reporting engine that displays data about how many bees a farm detected (bees are just an example here).
I have 100 devices that each count, every minute, how many bees were detected on the farm. Each reading ends up as a row in the DB with the device, the time it was collected (time_collected_bees), and the bee count.
So there can be hundreds of thousands of rows in a given week.
The farmer wants a report that shows for a given day how many bees came each hour. I developed two ways to do this:
1. The server pulls all ~100,000 rows for that day from the DB and filters them down itself. This uses a large amount of memory, and it feels like a brute-force solution.
2. A stored procedure returns a temporary table with the number of bees collected per device per hour already totalled, so the server doesn't have to process the 100,000 raw rows. This returns only (24 * 100) rows, but it takes much longer than I expected (around 30 seconds).
What are some good candidate approaches for consolidating and summing this data without taking 30 seconds just to sum a day of data (especially since I may need a month's worth, broken down by day)?
If performance is your primary concern here, there's probably quite a bit you can do directly in the database. I would start by indexing the table on time_collected_bees so it can narrow down to the ~100K relevant rows quickly. I would guess that that's where your slowdown is, if the database is scanning the whole table to find the relevant entries.
If you're using SQL Server, you can try looking at your execution plan to see what's actually slowing things down.
Give database optimization more of a look before you architect something really complex and hard to maintain.
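For example, an index on the timestamp column plus a GROUP BY that pushes the hourly rollup into the database might look like this from Java (table and column names other than time_collected_bees are guesses, so adapt them to your schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.LocalDate;

public class HourlyBeeReport {
    // One-time setup (run once, outside this program) so the day filter becomes an index seek:
    //   CREATE INDEX IX_readings_time
    //   ON bee_readings (time_collected_bees) INCLUDE (device_id, bee_count);

    public static void main(String[] args) throws Exception {
        // Placeholder connection string -- adjust for your environment.
        String url = "jdbc:sqlserver://localhost;databaseName=Farm;integratedSecurity=true";

        // Let the database do the hourly rollup instead of pulling 100K rows into the app.
        String sql =
              "SELECT device_id, DATEPART(hour, time_collected_bees) AS hr, SUM(bee_count) AS bees "
            + "FROM bee_readings "
            + "WHERE time_collected_bees >= ? AND time_collected_bees < DATEADD(day, 1, ?) "
            + "GROUP BY device_id, DATEPART(hour, time_collected_bees) "
            + "ORDER BY device_id, hr";

        LocalDate day = LocalDate.now().minusDays(1);   // report for yesterday
        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setObject(1, day);                        // needs a JDBC 4.2 driver for LocalDate
            ps.setObject(2, day);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("device %d, hour %02d: %d bees%n",
                            rs.getInt("device_id"), rs.getInt("hr"), rs.getLong("bees"));
                }
            }
        }
    }
}
```

With the index in place, the execution plan should show an index seek on the day range instead of a full table scan.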

Which data store is best for my scenario

I'm working on an application that involves a very high volume of update/select queries against the database.
I have a base table (A) which will hold about 500 records for an entity for a given day. For every user in the system, a variation of this entity is created based on some of the user's preferences and stored in another table (B). This is done by a cron job that runs at midnight every day.
So if there are 10,000 users and 500 records in table A, there will be 5M records in table B for that day. I always keep data for one day in these tables and at midnight I archive historical data to HBase. This setup is working fine and I'm having no performance issues so far.
There has been a change in the business requirements lately: some attributes in base table A (15-20 records) now change every 20 seconds, and based on that I have to recalculate some values for all of the corresponding variation records in table B, for all users. Even though only 20 master records change, I need to recalculate and update 200,000 user records, which takes more than 20 seconds; by then the next update has already arrived, and eventually all the SELECT queries get queued up. I'm getting about 3 GET requests every 5 seconds from online users, which results in 6-9 SELECT queries. To respond to an API request, I always use the fields in table B.
I can buy more processing power and solve this situation but I'm interested in having a properly scaled system which can handle even a million users.
Can anybody here suggest a better alternative? Does NoSQL + a relational database help me here? Are there any platforms/datastores which will let me update data frequently without locking and, at the same time, give me the flexibility of running select queries on various fields in an entity?
Cheers
Jugs
I recommend looking at an in-memory DBMS that fully implements MVCC, to eliminate the blocking issues. If your application currently uses SQL, there's no reason to move away from that to NoSQL. The performance requirements you describe can certainly be met by an in-memory, SQL-capable DBMS.
From what I understand, you are updating 200K records every 20 seconds, so within about 10 minutes you will have rewritten almost all of your data. In that case, why write that state to the database at all if it changes so frequently? I don't know the details of your requirements, but why not just calculate it on demand from the data in table A?
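A very rough sketch of that idea, just to make it concrete (all class and method names here are made up, and the actual "variation" logic obviously depends on your preference model):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Instead of materialising 200,000 derived rows in table B every 20 seconds,
// keep the ~500 base records (table A) cached in memory and derive the
// user-specific view only when an API request actually asks for it.
public class OnDemandVariations {

    record BaseRecord(long id, double value) {}            // stand-in for a table A row
    record UserView(long baseId, double adjustedValue) {}  // stand-in for a table B row

    private volatile Map<Long, BaseRecord> baseCache = Map.of();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Re-read table A every 20 seconds; loadTableA() would be a simple SELECT.
        scheduler.scheduleAtFixedRate(this::refresh, 0, 20, TimeUnit.SECONDS);
    }

    private void refresh() {
        baseCache = loadTableA().stream()
                .collect(Collectors.toUnmodifiableMap(BaseRecord::id, r -> r));
    }

    // Called per API request: derive the user's variation from the cached base rows.
    public List<UserView> viewFor(UserPreferences prefs) {
        return baseCache.values().stream()
                .map(r -> new UserView(r.id(), prefs.adjust(r.value())))
                .toList();
    }

    // Placeholders -- wire these up to your actual schema and preference model.
    private List<BaseRecord> loadTableA() { return List.of(); }
    interface UserPreferences { double adjust(double baseValue); }
}
```

Whether this works depends on how expensive the per-user calculation is; if it is cheap relative to a SELECT on table B, the 200K-row write every 20 seconds disappears entirely.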

Recommended ETL approaches for large to big data

I've been reading up on this (via Google search) for a while now and still not getting a clear answer, so I finally decided to post.
I'm trying to get a clear idea of what a good way of setting up an automated ETL process looks like. Let's take the following problem:
Data:
Product Code, Month, Year, Sales, Flag
15 million rows of data, covering around 5,000 products. Given this data, calculate whether the cumulative sales for a particular product exceed a threshold X, and set Flag = 1 on each row where, at that point in time, the threshold had been exceeded.
How would people approach this task? My approach was to attempt it in SQL Server, but that was painfully slow at times. In particular, one step of the transformation would have required me to write a stored procedure that creates an index on a temp table on the fly in order to speed things up, all of which seemed like a lot of bother.
Should I have coded it in Java or Python? Should I have used Alteryx or Lavastorm? Is this something I should ideally be doing with Hadoop?
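For what it's worth, the core transformation is just a per-product running total, which is only a few lines of plain Java if the 15 million rows fit in memory (a sketch, with the threshold X passed in as a parameter):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the cumulative-sales flag: sort each product's rows by (year, month),
// keep a running total, and flag every row from the point the threshold is crossed.
public class CumulativeFlag {

    static class Row {
        String productCode;
        int month, year;
        double sales;
        int flag;          // set to 1 once cumulative sales exceed the threshold
    }

    static void applyFlags(List<Row> rows, double threshold) {
        Map<String, List<Row>> byProduct = new HashMap<>();
        for (Row r : rows) {
            byProduct.computeIfAbsent(r.productCode, k -> new ArrayList<>()).add(r);
        }
        for (List<Row> productRows : byProduct.values()) {
            productRows.sort(Comparator.comparingInt((Row r) -> r.year)
                                       .thenComparingInt(r -> r.month));
            double running = 0;
            for (Row r : productRows) {
                running += r.sales;
                r.flag = running > threshold ? 1 : 0;
            }
        }
    }
}
```

SQL Server 2012 and later can also express the running total with a window function (SUM(Sales) OVER (PARTITION BY ProductCode ORDER BY Year, Month)), which should avoid the temp-table-plus-index dance entirely.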

How to save frequently received data in a database?

Ten other students and I are doing a big project where we need to receive temperature data from hardware nodes; the data should be uploaded to and stored on a server. As we are all embedded systems engineers with only minor database knowledge, I'm turning to you guys.
I want to receive data from the nodes, let's say, every 30 seconds. The table storing that data would quickly become very long if you store [nodeId, time, temp] per row. Do you have any suggestions for storing the data in another way?
One option could be to store it like that for a period of time and then somehow compress it into a matrix of some sort? I still want to be able to reach the old data.
One row every 30 seconds is not a lot of data. It's 2880 rows per day per node. I once designed a database which had 32 million rows added per day, every day. I haven't looked at it for a while but I know it's currently got more than 21 billion rows in it.
The only thing to bear in mind is that you need to think about how you're going to query it, and make sure it has appropriate indexes.
Have fun!
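As a concrete starting point (the connection string and DDL below assume PostgreSQL purely for illustration; the same shape works in any relational database):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.sql.Timestamp;
import java.time.Instant;

public class ReadingStore {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; any relational database will do at this volume.
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/sensors")) {

            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS readings ("
                         + "  node_id  INT       NOT NULL,"
                         + "  ts       TIMESTAMP NOT NULL,"
                         + "  temp     REAL      NOT NULL)");
                // Index matches the typical query: 'readings for node X between t1 and t2'.
                st.execute("CREATE INDEX IF NOT EXISTS ix_readings_node_ts ON readings (node_id, ts)");
            }

            // Batch the inserts so a burst of node reports is a single round trip.
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO readings (node_id, ts, temp) VALUES (?, ?, ?)")) {
                for (int node = 1; node <= 10; node++) {
                    ps.setInt(1, node);
                    ps.setTimestamp(2, Timestamp.from(Instant.now()));
                    ps.setFloat(3, 21.5f);       // dummy reading
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }
}
```

At 2,880 rows per node per day, even a year of data from a hundred nodes is only roughly 100 million rows; a plain indexed table handles that comfortably, and you can always roll old data up into hourly or daily summary tables later if queries get slow.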

Printing the names of all the people older than 18?

This was a pretty good question that was posed to me recently. Suppose we have a hypothetical (insert your favorite data storage tool here) database that holds the names, ages and addresses of all the people residing on this planet. Your task is to print the names of all the people whose age is greater than 18 in an HTML table. How would you go about doing that? Let's say that, hypothetically, the population is growing at a rate of 1,200 per second and the database is updated accordingly (don't ask how). What would be your strategy for printing the names of all these people and their addresses in an HTML table?
Storing ages in a DB table sounds like a recipe for trouble to me; the column would be impossible to maintain. You would be better off storing birth dates and building an index on that column/attribute.
You have to get an initial dump of the table for the display. Just calculate the date 18 years ago (let's call it D0) and query for any person born on or before that date.
Use DB triggers to receive notifications about deaths, so that you can remove them from the table immediately.
Since people only get older (unfortunately?), you can use ranged queries to pick up new additions (i.e. people who have turned 18 since you last queried the table). E.g. if you want to update the display the next day, you only issue a query for the people born on day D0 + 1; there is no need to request the whole table again.
You could even prefetch the people who reach 18 years of age the next day, keep the entries in memory, and add them to the display at the exact moment they reach that age.
BTW, even with 2 KB of data per person, you get an 18 TB database (assuming 50% overhead). Any slightly beefed-up server should be able to handle a DB of that size. On the other hand, the thought of a 12 TB HTML table terrifies me...
Oh, and beware of timezone and DST issues - time is such a relative thing these days...
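To make the date arithmetic concrete (the people table and birth_date column are hypothetical, and time zones are deliberately ignored here, per the warning above):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.LocalDate;

public class AdultsReport {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: people(name, birth_date, address), indexed on birth_date.
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/world")) {

            // D0: anyone born on or before this date is already an adult.
            LocalDate d0 = LocalDate.now().minusYears(18);

            // Initial dump for the display.
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT name, address FROM people WHERE birth_date <= ?")) {
                ps.setObject(1, d0);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // emit one HTML table row per person
                    }
                }
            }

            // Next day's incremental update: only the people who turn 18 that day,
            // i.e. those born exactly on D0 + 1. No need to rescan the whole table.
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT name, address FROM people WHERE birth_date = ?")) {
                ps.setObject(1, d0.plusDays(1));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // append these rows to the existing display
                    }
                }
            }
        }
    }
}
```

The second query is what keeps the daily update cheap: it touches only the handful of people whose 18th birthday falls on that day.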
I don't see what the problem is. You don't have to worry about new records being added at all, since none of them will be included in your query unless that query takes 18 or more years to run. If you have an index on age (and presumably any DB technology capable of handling that much data and 1,200 inserts a second keeps its indexes updated on insert), it should just work.
In the real world, using existing technology or something like it, I would create a snapshot once a day and run queries against that read-only snapshot. It would not include records added that day, but it would certainly be good enough for this query, and for most others.
Are you forced to aggregate all of the entries into one table?
It would be simpler to create a table for each age group (only around 120 tables would be needed) and insert new entries into the appropriate one: choosing among 120 tables at insert time is computationally much cheaper than searching 6,000,000,000 rows at query time.
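A sketch of how that routing could look if you partition by birth year rather than by current age (so a row never has to move between tables as people get older; all names are illustrative):

```java
import java.time.LocalDate;

public class BirthYearRouter {

    // One table per birth-year cohort (roughly 120 of them), e.g. people_1990.
    // Partitioning by birth year instead of current age means a row never has
    // to migrate to another table as the person ages.
    static String tableFor(LocalDate birthDate) {
        return "people_" + birthDate.getYear();
    }

    // At query time, "older than 18" only touches the cohort tables for
    // birth years up to (current year - 18), instead of one 6-billion-row table.
    static String[] tablesForAdults(LocalDate today) {
        int newestAdultYear = today.minusYears(18).getYear();
        int oldestYear = today.getYear() - 120;          // assume nobody is older than 120
        String[] tables = new String[newestAdultYear - oldestYear + 1];
        for (int y = oldestYear; y <= newestAdultYear; y++) {
            tables[y - oldestYear] = "people_" + y;
        }
        return tables;
    }
}
```

Most databases can also do this kind of split natively with table partitioning on the birth-date column, which gives the same pruning without managing 120 physical tables by hand.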
