I need BigQuery to efficiently analyse terabyte-sized datasets.
Every couple of seconds I need to check the maximum and minimum value of a column in a BigQuery table. Even with appropriate clustering this has become quite expensive.
Would it make sense to store these values in a separate Firestore database? Any thoughts? Thanks
Presumably you know the min and max values from the previous time you checked, so you should only need to check the min and max values from the new data that has arrived since the last check. You don't need to query the whole table every time.
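A minimal sketch of that idea in Python, assuming the table is append-only (no updates or deletes) and that every row carries some monotonically increasing ingestion marker. The class name, the marker and the row shape are all made up for illustration; in practice the new rows would come from a query that filters on that marker.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class RunningExtremes:
    """Cached min/max plus the highest ingestion marker already folded in."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    last_seen: int = 0  # hypothetical marker, e.g. an ingestion sequence number

    def update(self, new_rows: Iterable[Tuple[int, float]]) -> None:
        """Fold in rows of the form (marker, value) that arrived after last_seen."""
        cutoff = self.last_seen  # fixed for the whole batch, so row order does not matter
        for marker, value in new_rows:
            if marker <= cutoff:
                continue  # already accounted for in an earlier check
            self.minimum = value if self.minimum is None else min(self.minimum, value)
            self.maximum = value if self.maximum is None else max(self.maximum, value)
            self.last_seen = max(self.last_seen, marker)
```

Only the small cached state needs to be stored between checks, wherever is convenient. If rows can be deleted or changed, the cached extremes can go stale and an occasional full recomputation is still needed.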
Are there any techniques to calculate the actual data size used per SQL table row, including enabled indexes and log records?
Summing the field sizes would not be correct, because some fields can be empty or hold less data than the field size.
The goal is to know exactly how much data each user uses.
I could probably do this on the handler side.
With the word "exactly", I have to say "no".
Change that to "approximately", and I say: run SHOW TABLE STATUS and look at Avg_row_length. This info is also available in information_schema.TABLES.
But, that is just an average. And not a very accurate average at that.
Do you care about a hundred bytes here or there? Do users own rows in a single table? What the heck is going on?
There are some crude formulas for computing the size of Data rows and Index rows, but nothing on Log records. One of the problems is that if there is a "block split" in a BTree because someone else inserted a row, do you divvy up the new block evenly across all users? Or what?
I'm looking for a way to store data with a timestamp.
Each timestamp might have 1 to 10 data fields.
Can I store the data as (time, key, value) using a simple data solution or SQL? How would that compare to a NoSQL solution like MongoDB, where I can store {time:.., key1:..., key2:...}?
It will store about 10 data points per second, each with at most around 10 fields. The data might be collected for as long as 10 years, easily adding up to a billion records. The database should be able to help with graphing data using time range queries.
It should be able to handle a heavy write frequency, ~100 per second (ok, this is not that high, but still..), while at the same time being able to handle queries that return about a million records (maybe even more).
The data itself is very simple: just electronic measurements. Some need to be measured at a high frequency (every ~100 milliseconds), and others every 1 min or so.
Can anyone who used something like this comment on the pluses and minuses of the method they used?
(Obviously this is a very specific scenario, so this is definitely not intended to turn into a "what's the best database" kind of question.)
Sample data:
{ _id: Date(2013-05-08 18:48:40.078554),
V_in: 2.44,
I_in: .00988,
I_max: 0.11 },
{ _id: Date(2013-05-08 18:48:40.078325),
I_max: 0.100 },
{ _id: Date(2001-08-09 23:48:43.083454),
V_out: 2.44,
I_in: .00988,
I_max: 0.11 }
Thank you.
For simplicity, I would just make a table of timestamps with a column for each measurement point. An integer primary key is technically redundant, since the timestamp uniquely identifies a measurement point, but it's easier to refer to a particular row by number than by timestamp. You will have nulls for any parameter that was not measured at that timestamp, which will take up a few extra bits per row (log base 2 of the number of columns, rounded up), but you also won't have to do any joins. It is true that you will have to alter the table if you decide you want to add columns later, but that's really not too difficult, and you could instead just make another separate table that keys on this one.
Please see here for an example with your data: http://www.sqlfiddle.com/#!2/e967c/4
I would recommend making some dummy databases of large size to make sure whatever structure you use still performs adequately.
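A rough sketch of that wide-table layout, using Python's built-in sqlite3 module purely so it runs anywhere; the same DDL idea applies to other SQL engines. The table name, the column types and the choice to store timestamps as ISO-8601 text are assumptions of this sketch, and the rows are loosely based on the sample data in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        id    INTEGER PRIMARY KEY,   -- convenience row number
        ts    TEXT NOT NULL UNIQUE,  -- ISO-8601 text sorts chronologically; UNIQUE also indexes it
        V_in  REAL,                  -- NULL when not measured at this timestamp
        V_out REAL,
        I_in  REAL,
        I_max REAL
    )
    """
)
rows = [
    ("2013-05-08 18:48:40.078554", 2.44, None, 0.00988, 0.11),
    ("2013-05-08 18:48:40.078325", None, None, None, 0.100),
    ("2001-08-09 23:48:43.083454", None, 2.44, 0.00988, 0.11),
]
with conn:
    conn.executemany(
        "INSERT INTO measurements (ts, V_in, V_out, I_in, I_max) VALUES (?, ?, ?, ?, ?)",
        rows,
    )


def i_max_between(start, end):
    """Time range query for one measured parameter, oldest row first."""
    cur = conn.execute(
        "SELECT ts, I_max FROM measurements WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start, end),
    )
    return cur.fetchall()


may_2013 = i_max_between("2013-05-01", "2013-06-01")
```

The range query needs no joins, and the index on ts lets the engine skip everything outside the requested window.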
The (time,key,value) suggestion smells like EAV, which I would avoid if you're planning on scaling.
I'm designing a project where I'll be storing (potentially hundreds of thousands of) lat/lon pairs in a database. The pairs are associated with other data. The catch is that in addition to users manipulating this data, I also want the locations to change over time. My initial instinct was to set up a cron job that will adjust every lat/lon by a certain amount every day, but I realize that such an operation would be insanely inefficient.
So, any ideas on how to efficiently adjust a bunch of lat/lon pairs over time? My best thought so far is associating a "last changed" timestamp with each pair and have a process running that fires every few seconds, grabs n (maybe order 100? 1000?) pairs with the oldest timestamps, adjusts those pairs and updates the times. This way I'm constantly moving small amounts of data, instead of moving an overwhelming amount once a day. I'm still not convinced this is the best way to go, though.
Thanks in advance!
Store the amount that is added to each pair somewhere else, and rather than using the values in the database directly, add this stored offset amount whenever you retrieve and subtract it whenever you insert.
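Something like this, as a sketch only. It assumes every pair drifts by the same amount, and it uses an in-memory dict as a stand-in for the real table; the class and method names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DriftingLocations:
    """Stores raw pairs once; the shared drift lives in a single offset."""
    offset: Tuple[float, float] = (0.0, 0.0)
    _stored: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def insert(self, key: str, lat: float, lon: float) -> None:
        # Subtract the current offset so the value reads back unchanged right now.
        d_lat, d_lon = self.offset
        self._stored[key] = (lat - d_lat, lon - d_lon)

    def get(self, key: str) -> Tuple[float, float]:
        # Add the current offset on every read.
        lat, lon = self._stored[key]
        d_lat, d_lon = self.offset
        return (lat + d_lat, lon + d_lon)

    def drift(self, d_lat: float, d_lon: float) -> None:
        # "Moving" every pair is a single write, however many pairs exist.
        self.offset = (self.offset[0] + d_lat, self.offset[1] + d_lon)
```

If different groups of pairs need to move differently, the same trick works with one offset per group rather than one global offset.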
Your instinct is right; that is probably the best way of adjusting latitude/longitude gradually. The update query shouldn't hammer the server too hard, though. And you're not moving the data, just changing it?
I need some inspiration for a solution...
We are running an online game with around 80.000 active users - we are hoping to expand this and are therefore setting a target of achieving up to 1-500.000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500.000 users we need to load data from the database in the order of 25-30.000.000 rows totalling around 1.5-2gb of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force: load the 30 million records every 30 minutes, calculate the values and rank them, and write them into the database. But I'm worried about the strain this will put on the database, the application server and the network, and whether it's even possible.
I'm thinking the solution might be to break up the problem somehow, but I can't see how. So I'm looking for some inspiration on possible alternative solutions based on this information:
We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users' values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value. It takes around 20 minutes to load the data and generate the highscore for 25.000 users, which is too slow if this should scale to 500.000.
I'm assuming that hardware size will not be an issue (within reasonable limits).
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
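To make that concrete, here is a small sketch using Python's sqlite3 module as a stand-in for the real database. The table and column names are invented, and compute_score is only a placeholder for the real in-code calculation, which the question doesn't describe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (team_id INTEGER PRIMARY KEY, score INTEGER NOT NULL)")
conn.execute("CREATE INDEX idx_teams_score ON teams (score)")


def compute_score(events):
    """Placeholder for the real score calculation done in application code."""
    return sum(events)


def record_events(team_id, events):
    """Recompute and store one team's score whenever its source data changes."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO teams (team_id, score) VALUES (?, ?)",
            (team_id, compute_score(events)),
        )


def top_teams(n):
    """Let the database do the ranking via the index on score."""
    cur = conn.execute(
        "SELECT team_id, score FROM teams ORDER BY score DESC, team_id LIMIT ?",
        (n,),
    )
    return cur.fetchall()
```

Each write touches a single team's row, so the cost is spread over normal traffic instead of being paid in one 30-minute batch.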
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
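As a sketch, and assuming the score is additive (a team's total is just the sum of points over its records, which may not hold for the real calculation), the "add the difference" step could look like this. The function name and record shape are illustrative.

```python
from typing import Dict, Iterable, List, Tuple


def apply_delta(cached_totals: Dict[int, int],
                new_records: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Return updated totals given (team_id, points) records added since the last run."""
    updated = dict(cached_totals)  # leave the cached snapshot untouched
    for team_id, points in new_records:
        updated[team_id] = updated.get(team_id, 0) + points
    return updated


def ranking(totals: Dict[int, int]) -> List[Tuple[int, int]]:
    """Rank teams by total, highest first; ties broken by team id."""
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

Only the records created since the last calculation have to be read; the bulk of the 2GB stays where it is.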
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
First and foremost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply querying the database for the top scores? That way the computation is done on the server side, not on the client side, and there is no need to move the millions of records.
It sounds pretty straightforward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for caching, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.
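The top-100 scan described above could be sketched like this in Python. The function name is made up, and the list is kept in ascending order so that its first element is the cutoff to beat.

```python
import bisect
from typing import Iterable, List


def top_k(scores: Iterable[int], k: int = 100) -> List[int]:
    """Return the k highest scores, highest first, in one pass over the input."""
    if k <= 0:
        return []
    best: List[int] = []  # sorted ascending; best[0] is the current cutoff
    for score in scores:
        if len(best) < k:
            bisect.insort(best, score)
        elif score > best[0]:
            bisect.insort(best, score)  # insert in sorted position
            best.pop(0)                 # drop the old cutoff
        # otherwise: below the cutoff, nothing to do (the common case)
    return best[::-1]
```

The standard library's heapq.nlargest does the same job if you would rather not hand-roll it.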
I have a sproc that puts 750K records into a temp table through a query as one of its first actions. If I create indexes on the temp table before filling it, the item takes about twice as long to run compared to when I index after filling the table. (The index is an integer in a single column, the table being indexed is just two columns each a single integer.)
This seems a little off to me, but then I don't have the firmest understanding of what goes on under the hood. Does anyone have an answer for this?
If you create a clustered index, it affects the way the data is physically ordered on the disk. It's better to add the index after the fact and let the database engine reorder the rows when it knows how the data is distributed.
For example, let's say you needed to build a brick wall with numbered bricks so that those with the highest number are at the bottom of the wall. It would be a difficult task if you were just handed the bricks in random order, one at a time - you wouldn't know which bricks were going to turn out to be the highest numbered, and you'd have to tear the wall down and rebuild it over and over. It would be a lot easier to handle that task if you had all the bricks lined up in front of you, and could organize your work.
That's how it is for the database engine - if you let it know about the whole job, it can be much more efficient than if you just feed it a row at a time.
It's because the database server has to do calculations each and every time you insert a new row. Basically, you end up reindexing the table each time. It doesn't seem like a very expensive operation, and it's not, but when you do that many of them together, you start to see the impact. That's why you usually want to index after you've populated your rows, since it will just be a one-time cost.
Think of it this way.
unorderedList = {5, 1,3}
orderedList = {1,3,5}
add 2 to both lists.
unorderedList = {5, 1,3,2}
orderedList = {1,2,3,5}
Which list do you think is easier to add to?
Btw, ordering your input before the load will give you a boost.
You should NEVER EVER create an index on an empty table if you are going to massively load it right afterwards.
Indexes have to be maintained as the data on the table changes, so imagine as if for every insert on the table the index was being recalculated (which is an expensive operation).
Load the table first and create the index after finishing with the load.
That's where the performance difference is going.
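If you want to try both orderings yourself, here is a small harness. It uses Python's sqlite3 module rather than SQL Server, so the absolute numbers (and possibly the size of the gap) will not match a real sproc; the table name, index name and row count are arbitrary.

```python
import random
import sqlite3
import time


def build(rows, index_first):
    """Fill a two-column integer table, creating the index before or after the load."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE temp_pairs (a INTEGER, b INTEGER)")
    if index_first:
        conn.execute("CREATE INDEX idx_a ON temp_pairs (a)")
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO temp_pairs (a, b) VALUES (?, ?)", rows)
    if not index_first:
        conn.execute("CREATE INDEX idx_a ON temp_pairs (a)")
    return conn


def timed_build(rows, index_first):
    """Return the populated connection and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    conn = build(rows, index_first)
    return conn, time.perf_counter() - start


rng = random.Random(0)
rows = [(rng.randrange(1_000_000), i) for i in range(50_000)]  # random key order
conn_before, secs_before = timed_build(rows, index_first=True)
conn_after, secs_after = timed_build(rows, index_first=False)
print(f"index first: {secs_before:.3f}s, index last: {secs_after:.3f}s")
```

Both variants end up with the same table and the same index; only the amount of index maintenance done along the way differs.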
After performing large data manipulation operations, you frequently have to update the statistics on the underlying indexes. You can do that by using the UPDATE STATISTICS [table] statement.
The other option is to drop and recreate the index which, if you are doing large data insertions, will likely perform the inserts much faster. You can even incorporate that into your stored procedure.
This is because if the data you insert is not in the order of the index, SQL Server will have to split pages to make room for additional rows and keep them together logically.
This is due to the fact that when SQL Server indexes a table that already contains data, it is able to produce exact statistics for the values in the indexed column. SQL Server will recalculate statistics at certain moments, but when you perform massive inserts, the distribution of values may change after the statistics were last calculated.
Out-of-date statistics can be spotted in Query Analyzer: on a given table scan, the expected number of rows differs too much from the actual number of rows processed.
You should use UPDATE STATISTICS to recalculate distribution of values after you insert all the data. After that no performance difference should be observed.
If you have an index on a table, as you add data to the table SQL Server will have to re-order the table to make room in the appropriate place for the new records. If you're adding a lot of data, it will have to reorder it over and over again. By creating an index only after the data is loaded, the re-order only needs to happen once.
Of course, if you are importing the records in index order it shouldn't matter so much.
In addition to the index overhead, running each query as a transaction is a bad idea for the same reason. If you run chunks of inserts (say 100) within 1 explicit transaction, you should also see a performance increase.
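A sketch of chunked inserts, again with Python's sqlite3 module standing in for the real server. The helper name, the table name and the default chunk size of 100 are just for illustration.

```python
import sqlite3
from itertools import islice


def insert_in_chunks(conn, rows, chunk_size=100):
    """Insert rows in explicit transactions of chunk_size rows; return rows inserted."""
    it = iter(rows)
    total = 0
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return total
        with conn:  # commits the chunk on success, rolls it back on error
            conn.executemany("INSERT INTO temp_pairs (a, b) VALUES (?, ?)", chunk)
        total += len(chunk)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_pairs (a INTEGER, b INTEGER)")
inserted = insert_in_chunks(conn, ((i, i * 2) for i in range(250)))
```

With 250 rows and the default chunk size this commits three times instead of 250 times. The right chunk size depends on the engine and the workload, so it is worth measuring.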