I've got approximately 1000 rows (4 fields) in a MySQL database. I know for a fact that the data in the database is not going to change very often (they are GPS coordinates). Is it better for me to call this information from the database every time the appropriate script is loaded, or would it be better for me to "hard code" the data into the script, and when I do make a change to the database, simply update the hard coded data too?
I'm wondering if this improves performance, but part of me thinks this may not be best practice.
Thanks
Hard coding coordinates into a script is not a good idea.
I would read the 1000 coordinates into an array at start-up, either from the SQL database or from a file.
But do that reading only once at start-up, not at each calculation step.
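A minimal sketch of the "read once, reuse afterwards" idea. Python's built-in sqlite3 stands in for MySQL here, and the table name, the column names and the `nearest` helper are made up for illustration:

```python
import sqlite3
from functools import lru_cache

# Assumption: an in-memory SQLite table stands in for the real MySQL table.
_conn = sqlite3.connect(":memory:")
_conn.execute(
    "CREATE TABLE coordinates (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL)"
)
_conn.executemany(
    "INSERT INTO coordinates (name, lat, lon) VALUES (?, ?, ?)",
    [(f"point-{i}", 50.0 + i * 0.001, 8.0 + i * 0.001) for i in range(1000)],
)


@lru_cache(maxsize=None)
def load_coordinates():
    """Run the query on the first call; later calls return the cached tuple."""
    rows = _conn.execute(
        "SELECT id, name, lat, lon FROM coordinates ORDER BY id"
    ).fetchall()
    return tuple(rows)


def nearest(lat, lon):
    """Example calculation step: it reuses the in-memory rows, no new query."""
    return min(
        load_coordinates(),
        key=lambda r: (r[2] - lat) ** 2 + (r[3] - lon) ** 2,
    )


print(nearest(50.0, 8.0))
```

If the data does change, clearing the cache (`load_coordinates.cache_clear()`) or restarting the script picks up the new rows, so there is nothing to keep in sync by hand.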
Given the fact that changes might occur once or twice per month, and the fact that 0.0063 seconds isn't very much (at least not from my point of view, if it would be a matter of life or death or very important Wall Street stock data that would be another matter), my recommendation is that you use the SQL. Of course, as long as you perform the query only once per script execution.
Indeed, hard-coding the data into your script could improve performance by a few milliseconds. But ask yourself the question: how much extra work is needed to maintain the hard-coded data? If you really want to be sure, then make a version of the script where you hard-code the data, execute the script 1000 times and measure the time difference. (However, just making this test would probably take more time than it would save...)
If your script is run 5000 times per day and each time the SQL takes an extra 0.01 seconds compared to having hard-coded values, that's a sum of 50 seconds per day in total for your users. However, for each user they will most likely not notice any difference.
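If you do want to measure it, the test suggested above could look roughly like this. It is only a sketch: sqlite3 replaces MySQL (so the absolute numbers will differ from yours), and the table layout is invented:

```python
import sqlite3
import timeit

# Hypothetical data: 1000 rows of (id, name, lat, lon).
ROWS = [(i, f"point-{i}", 50.0 + i * 0.001, 8.0 + i * 0.001) for i in range(1000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coordinates (id INTEGER, name TEXT, lat REAL, lon REAL)")
conn.executemany("INSERT INTO coordinates VALUES (?, ?, ?, ?)", ROWS)


def from_database():
    """Variant 1: fetch all rows with one query."""
    return conn.execute(
        "SELECT id, name, lat, lon FROM coordinates ORDER BY id"
    ).fetchall()


def from_hard_coded():
    """Variant 2: the data is already part of the script."""
    return list(ROWS)


RUNS = 1000
db_seconds = timeit.timeit(from_database, number=RUNS)
hc_seconds = timeit.timeit(from_hard_coded, number=RUNS)
print(f"database:   {db_seconds / RUNS * 1000:.3f} ms per load")
print(f"hard-coded: {hc_seconds / RUNS * 1000:.3f} ms per load")
```

Multiply the per-load difference by the number of script runs per day to get the total, the same way as the 5000 x 0.01 s = 50 s estimate.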
TL;DR
I have a table with about 2 million WRITEs over the month and 0 READs. Every 1st day of a month, I need to read all the rows written on the previous month and generate CSVs + statistics.
How to work with DynamoDB in this scenario? How to choose the READ throughput capacity?
Long description
I have an application that logs client requests. It has about 200 clients. The clients need to receive on every 1st day of a month a CSV with all the requests they've made. They also need to be billed, and for that we need to calculate some stats with the requests they've made, grouping by type of request.
So in the end of the month, a client receives a report like:
I've already come up with two solutions, but I'm still not convinced by either of them.
1st solution: ok, every last day of the month I increase the READ throughput capacity and then I run a map reduce job. When the job is done, I decrease the capacity back to the original value.
Cons: not fully automated, risk of the DynamoDB capacity not being available when the job starts.
2nd solution: I can break the generation of CSVs + statistics to small jobs in a daily or hourly routine. I could store partial CSVs on S3 and on every 1st day of a month I could join those files and generate a new one. The statistics would be much easier to generate, just some calculations derived from the daily/hourly statistics.
Cons: I feel like I'm turning something simple into something complex.
Do you have a better solution? If not, what solution would you choose? Why?
Having been in a similar place myself before, I used, and now recommend to you, to process the raw data:
as often as you reasonably can (start with daily)
to a format as close as possible to the desired report output
with as much calculation/CPU intensive work done as possible
leaving as little to do at report time as possible.
This approach is entirely scalable - the incremental frequency can be:
reduced to as small a window as needed
parallelised if required
It also makes it possible to re-run past months' reports on demand, as the report generation time should be quite small.
In my example, I shipped denormalized, pre-processed (financial calculations) data every hour to a data warehouse, then reporting just involved a very basic (and fast) SQL query.
This had the additional benefit of spreading the load on the production database server into lots of small bites, instead of bringing it to its knees once a week at invoice time (30000 invoices produced every week).
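To make the idea concrete, here is a rough sketch in Python. The row layout (day, client, request type) and the function names are assumptions for illustration, not the schema from the question:

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical raw log rows: (day, client_id, request_type).
raw_requests = [
    (date(2024, 1, 1), "client-a", "search"),
    (date(2024, 1, 1), "client-a", "search"),
    (date(2024, 1, 1), "client-b", "lookup"),
    (date(2024, 1, 2), "client-a", "lookup"),
    (date(2024, 1, 2), "client-a", "search"),
]


def summarize_day(rows, day):
    """Incremental job: count one day's requests per (client, request type)."""
    return Counter((client, kind) for d, client, kind in rows if d == day)


def monthly_report(daily_summaries):
    """Report time: only add up the small, pre-computed daily summaries."""
    totals = defaultdict(Counter)
    for summary in daily_summaries:
        for (client, kind), count in summary.items():
            totals[client][kind] += count
    return {client: dict(kinds) for client, kinds in totals.items()}


daily = [summarize_day(raw_requests, date(2024, 1, d)) for d in (1, 2)]
report = monthly_report(daily)
print(report)
```

The daily job touches only one day of raw rows, and the monthly step never reads the raw data at all, which is what keeps report generation cheap.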
I would use the Kinesis service to produce daily, almost real-time billing.
For this purpose I would create a special DynamoDB table just for the calculated data.
(Another option is to run it on flat files.)
Then I would add a process that sends events to the Kinesis service just after you update the regular DynamoDB table.
Thus, when you reach the end of the month, you can just execute whatever post-billing calculations you have and create your CSV files from the already calculated table.
I hope that helps.
Take a look at Dynamic DynamoDB. It will increase/decrease the throughput when you need it without any manual intervention. The good news is you will not need to change the way the export job is done.
Currently I have a project (written in Java) that reads sensor output from a microcontroller and writes it across several Postgres tables every second using Hibernate. In total I write about 130 columns' worth of data every second. Once the data is written it will stay static forever. This system seems to perform fine under the current conditions.
My question is regarding the best way to query and average this data in the future. There are several approaches I think would be viable but am looking for input as to which one would scale and perform best.
Since we gather and write data every second, we end up generating more than 2.5 million rows per month. We currently plot this data via a JDBC select statement writing to a JChart2D (i.e. SELECT pressure, temperature, speed FROM data WHERE time_stamp BETWEEN startTime AND endTime). The user must be careful not to specify too long a time period (startTime and endTime delta < 1 day), or else they will have to wait several minutes (or longer) for the query to run.
The future goal would be to have a user interface similar to the Google visualization API that powers Google Finance, in particular with regard to time scaling: the longer the time period, the "smoother" (or more averaged) the data becomes.
Options I have considered are as follows:
Option A: Use the SQL avg function to return the averaged data points to the user. I think this option would get expensive if the user asks to see the data for say half a year. I imagine the interface in this scenario would scale the amount of rows to average based on the user request. I.E. if the user asks for a month of data the interface will request an avg of every 86400 rows which would return ~30 data points whereas if the user asks for a day of data the interface will request an avg of every 2880 rows which will also return 30 data points but of more granularity.
Option B: Use SQL to return all of the rows in a time interval and use the Java interface to average out the data. I have briefly tested this for kicks and I know it is expensive because I'm returning 86400 rows/day of interval time requested. I don't think this is a viable option unless there's something I'm not considering when performing the SQL select.
Option C: Since all this data is static once it is written, I have considered using the Java program (with Hibernate) to also write tables of averages along with the data it is currently writing. In this option, I have several java classes that "accumulate" data then average it and write it to a table at a specified interval (5 seconds, 30 seconds, 1 minute, 1 hour, 6 hours and so on). The future user interface plotting program would take the interval of time specified by the user and determine which table of averages to query. This option seems like it would create a lot of redundancy and take a lot more storage space but (in my mind) would yield the best performance?
Option D: Suggestions from the more experienced community?
Option A won't tend to scale very well once you have large quantities of data to pass over; Option B will probably tend to start relatively slow compared to A and scale even more poorly. Option C is a technique generally referred to as "materialized views", and you might want to implement this one way or another for best performance and scalability. While PostgreSQL doesn't yet support declarative materialized views (but I'm working on that this year, personally), there are ways to get there through triggers and/or scheduled jobs.
To keep the inserts fast, you probably don't want to try to maintain any views off of triggers on the primary table. What you might want to do is to periodically summarize detail into summary tables from crontab jobs (or similar). You might also want to create views to show summary data by using the summary tables which have been created, combined with detail table where the summary table doesn't exist.
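As a rough illustration of such a scheduled summary job, here is a sketch using Python's sqlite3 in place of PostgreSQL. The tables `data` and `data_1min`, the integer `time_stamp` in seconds and the `summarize` function are assumptions; with a real timestamp column you would more likely group by something like `date_trunc('minute', time_stamp)`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (time_stamp INTEGER PRIMARY KEY, pressure REAL, temperature REAL)"
)
conn.execute(
    "CREATE TABLE data_1min (bucket_start INTEGER PRIMARY KEY, pressure REAL, temperature REAL)"
)

# Ten minutes of hypothetical one-row-per-second detail data.
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(t, 100.0 + t % 60, 20.0) for t in range(600)],
)


def summarize(conn, start, end, bucket_seconds=60):
    """Scheduled job: average detail rows in [start, end) into one row per bucket."""
    conn.execute(
        """
        INSERT OR REPLACE INTO data_1min (bucket_start, pressure, temperature)
        SELECT (time_stamp / ?) * ?, AVG(pressure), AVG(temperature)
        FROM data
        WHERE time_stamp >= ? AND time_stamp < ?
        GROUP BY time_stamp / ?
        """,
        (bucket_seconds, bucket_seconds, start, end, bucket_seconds),
    )
    conn.commit()


summarize(conn, 0, 600)
rows = conn.execute(
    "SELECT bucket_start, pressure FROM data_1min ORDER BY bucket_start"
).fetchall()
print(rows[:2])
```

The job only reads the time range it is asked to summarize, so the inserts into the detail table are not slowed down by trigger work, and re-running it for the same range simply overwrites the same summary rows.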
The materialized view approach would probably work better for you if you partition your raw data by date range. That's probably a really good idea anyway.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
I will try to describe my challenge and operation:
I need to calculate stocks price indices over historical period. For example, I will take 100 stocks and calc their aggregated avg price each second (or even less) for the last year.
I need to create many different indices like this where the stocks are picked dynamically out of 30,000~ different instruments.
The main consideration is speed. I need to output a few months of this kind of index as fast as I can.
For that reason, I think a traditional RDBMS is too slow, so I am looking for a sophisticated and original solution.
Here is something I had in mind, using a NoSQL or column-oriented approach:
Distribute all stocks into some kind of a key value pairs of time:price with matching time rows on all of them. Then use some sort of a map reduce pattern to select only the required stocks and aggregate their prices while reading them line by line.
I would like some feedback on my approach, suggestions for tools and use cases, or a suggestion of a completely different design pattern. My guidelines for the solution are price (I would like to use open source), the ability to handle huge amounts of data and, again, fast lookup (I don't care about inserts since they are made only once and never change).
Update: by fast lookup I don't mean real time, but a reasonably quick operation. Currently it takes me a few minutes to process each day of data, which translates to a few hours per yearly calculation. I want to achieve this within minutes or so.
In the past, I've worked on several projects that involved the storage and processing of time series using different storage techniques (files, RDBMS, NoSQL databases). In all these projects, the essential point was to make sure that the time series samples are stored sequentially on the disk. This made sure reading several thousand consecutive samples was quick.
Since you seem to have a moderate number of time series (approx. 30,000) each having a large number of samples (1 price a second), a simple yet effective approach could be to write each time series into a separate file. Within the file, the prices are ordered by time.
You then need an index for each file so that you can quickly find certain points of time within the file and don't need to read the file from the start when you just need a certain period of time.
With this approach you can take full advantage of today's operating systems which have a large file cache and are optimized for sequential reads (usually reading ahead in the file when they detect a sequential pattern).
Aggregating several time series involves reading a certain period from each of these files into memory, computing the aggregated numbers and writing them somewhere. To fully leverage the operating system, read the full required period of each time series one by one and don't try to read them in parallel. If you need to compute a long period, then don’t break it into smaller periods.
You mention that you have 25,000 prices a day when you reduce them to a single one per second. It seems to me that in such a time series, many consecutive prices would be the same as few instruments are traded (or even priced) more than once a second (unless you only process S&P 500 stocks and their derivatives). So an additional optimization could be to further condense your time series by only storing a new sample when the price has indeed changed.
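A small sketch of that condensing step, assuming samples are (timestamp, price) pairs already sorted by time. The function names are made up for illustration:

```python
from bisect import bisect_right


def condense(samples):
    """Keep a (timestamp, price) sample only when the price has changed."""
    condensed = []
    for ts, price in samples:
        if not condensed or condensed[-1][1] != price:
            condensed.append((ts, price))
    return condensed


def price_at(condensed, ts):
    """Return the price in effect at `ts`: the last stored sample at or before it."""
    times = [t for t, _ in condensed]
    i = bisect_right(times, ts) - 1
    if i < 0:
        raise KeyError(ts)
    return condensed[i][1]


samples = [(0, 10.0), (1, 10.0), (2, 10.5), (3, 10.5), (4, 10.5), (5, 10.0)]
condensed = condense(samples)
print(condensed)
print(price_at(condensed, 3))
```

Nothing is lost for aggregation purposes: the price at any second can still be looked up, it just is no longer stored once per second.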
On a lower level, the time series files could be organized as binary files consisting of sample runs. Each run starts with the time stamp of the first price and the length of the run. After that, the prices for the several consecutive seconds follow. The file offset of each run could be stored in the index, which could be implemented with a relational DBMS (such as MySQL). This database would also contain all the meta data for each time series.
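One possible layout for such a file, sketched in Python. The exact field sizes (64-bit timestamp, 32-bit run length, 64-bit float prices) are my assumption, and the index is kept in a plain list here instead of a database table:

```python
import io
import struct
from bisect import bisect_right

# Run header: first timestamp in seconds (int64) and number of prices (uint32).
HEADER = struct.Struct("<qI")


def write_runs(stream, runs):
    """Write each run as header + prices; return index entries (first_ts, offset)."""
    index = []
    for first_ts, prices in runs:
        index.append((first_ts, stream.tell()))
        stream.write(HEADER.pack(first_ts, len(prices)))
        stream.write(struct.pack(f"<{len(prices)}d", *prices))
    return index


def read_run_at(stream, index, ts):
    """Seek straight to the latest run that starts at or before `ts` and read it."""
    pos = bisect_right([first for first, _ in index], ts) - 1
    if pos < 0:
        raise KeyError(ts)
    stream.seek(index[pos][1])
    first_ts, count = HEADER.unpack(stream.read(HEADER.size))
    prices = struct.unpack(f"<{count}d", stream.read(8 * count))
    return first_ts, list(prices)


buf = io.BytesIO()  # stands in for a real file opened in binary mode
index = write_runs(buf, [(1000, [1.0, 1.5, 2.0]), (2000, [3.0, 3.25])])
print(index)
print(read_run_at(buf, index, 2001))
```

The index lookup replaces reading the file from the start: one seek lands on the right run, and from there the reads are sequential.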
(Do stay away from memory mapped files. They're slower because they aren’t optimized for sequential access.)
If the scenario you described is the ONLY requirement, then there are "low tech" simple solutions which are cheaper and easier to implement. The first that comes to mind is LogParser. In case you haven't heard of it, it is a tool which runs SQL queries on simple CSV files. It is unbelievably fast - typically around 500K rows/sec, depending on row size and the IO throughput of the HDs.
Dump the raw data into CSVs, run a simple aggregate SQL query via the command line, and you are done. Hard to believe it can be that simple, but it is.
More info about logparser:
Wikipedia
Coding Horror
What you really need is a relational database that has built-in time series functionality. IBM released one very recently: Informix 11.7 (note it must be 11.7 to get this feature). Even better news is that, for what you are doing, the free version, Informix Innovator-C, will be more than adequate.
http://www.freeinformix.com/time-series-presentation-technical.html
i work on sql 2005 server
I have almost 350 000 insert scripts.. The insert script has 10 columns to be inserted.
So how many rows should I select to be executed at one click. "Execute" click..
Please tell me an average number according to an average system configuration..
Win XP
Core 2 Duo
3,66 Gb ram
OK, let's get some things straight here:
Win XP Core 2 Duo 3,66 Gb ram
Not average but outdated. On top of that, it totally misses the most important number for a DB, which is the speed/number of discs.
i work on sql 2005 server I have
almost 350 000 insert scripts..
I seriously doubt you have 350.000 insert SCRIPTS. This would be 350.000 FILES that contain insert commands. This is a lot of files.
The insert script has 10 columns to be
inserted.
I order a pizza. How much fuel does my car require per km? Same relation. 10 columns is nice, but you don't say how many insert commands your scripts contain.
So, at the end the only SENSIBLE interpretation is you have to insert 350.000 rows, and try to do it from a program (i.e. there are no scripts to start with), but this is pretty much absolutely NOT what you say.
So how many rows should I select to be
executed at one click
How many pizzas should I order with one telephone? The click here is irrelevant. It would also not get faster when you use a command line program to do the inserts.
The question is how to get inserts into the db fastest.
For normal SQL:
Batch the inserts. Like 50 or 100 into one statement (yes, you can write more than one insert into one command).
Submit them interleaved async, preparing the next statement while the previous one executes.
This is very flexible as you can do real SQL etc.
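The batching part could look roughly like this. It is a sketch only: Python with sqlite3 stands in for a SQL Server client, the `items` table is made up, and the async interleaving is left out:

```python
import sqlite3


def batched(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_in_batches(conn, rows, batch_size=100):
    """Send one multi-row INSERT per batch; return the number of statements issued."""
    statements = 0
    for batch in batched(rows, batch_size):
        placeholders = ", ".join(["(?, ?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        conn.execute(
            f"INSERT INTO items (id, name, qty) VALUES {placeholders}", params
        )
        statements += 1
    conn.commit()
    return statements


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)")
rows = [(i, f"item-{i}", i % 7) for i in range(350)]
statements = insert_in_batches(conn, rows)
print(statements, conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])
```

Here 350 rows go to the server in 4 statements instead of 350, which is where the saving comes from: far fewer round trips and far less per-statement overhead.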
For real mass inserts:
Forget the idea of writing insert statements. Prepare the data properly as per the table structure and use SqlBulkCopy to mass insert them.
Less flexible - but a LOT faster.
The latter approach on my SMALL (!) database computer would handle this in about 3-5 seconds when the fields are small (a field can be a 2 GB binary data thing, you know). I handle about 80.000 row inserts per second without a lot of optimization, but I have small and somewhat fewer fields. This is 4 processor cores (irrelevant, they never get busy), 8 GB RAM (VERY small for a database server, irrelevant as well in this context), and 6 VelociRaptors for the data in a RAID 10 (again, a small configuration for a database, but very relevant). I get a peak insert rate in the 150 MB per second range here in Activity Monitor. I will do a lot of optimization here, as at the moment I open/close a DB connection every 20.000 items... bad batching.
But then, you don't seem to have a database system at all, just a database installed on a low-end workstation, and this means your IO is going to be REALLY slow compared to database servers, and insert speed / update speed is IO bound. Desktop discs suck, and you have data AND logs on the same discs.
But... in the end you don't really tell us anything about your problem.
And... the timeout CAN be set programmatically on the connection object.
I'm pretty sure the timeout can be set by the user by going Server Properties -> Connections -> Remote query timeout. If you set this sufficiently high (or to 0 which should mean it never times out) then you can run as many scripts as you like.
Obviously this is only ok if the database is not yet live - and you're simply needing to populate. If the data is coming from another MS SQL Server however you might just want to take a Full backup and restore - this will be both simpler and quicker.
This may be of help.
The general rule of thumb is to not exceed 0.1 seconds per UI operation for excellent performance. You are going to need to benchmark to find out what that is.
I need some inspiration for a solution...
We are running an online game with around 80.000 active users - we are hoping to expand this and are therefore setting a target of achieving up to 1-500.000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500.000 users we need to load data from the database in the order of 25-30.000.000 rows totalling around 1.5-2gb of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 mio records every 30 minutes, calculate the values and rank them, and write them in to the database, but I'm worried about the strain this will cause on the database, the application server and the network - and if it's even possible.
I'm thinking the solution to this might be to break up the problem some how, but I can't see how. So I'm seeking for some inspiration on possible alternative solutions based on this information:
We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load data and generate the highscore for 25.000 users, which is too slow if this should scale to 500.000.
I'm assuming that hardware size will not be an issue (within reasonable limits)
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
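A minimal sketch of that approach, with Python's sqlite3 as a stand-in database. The `teams` table, the additive scoring rule and the function names are all assumptions; your real score calculation would replace `calculate_score`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT, score INTEGER NOT NULL DEFAULT 0)"
)
# Index on the calculated value so sorting by score does not scan the table.
conn.execute("CREATE INDEX idx_teams_score ON teams (score DESC)")
conn.executemany(
    "INSERT INTO teams (id, name) VALUES (?, ?)",
    [(i, f"team-{i}") for i in range(1, 6)],
)


def calculate_score(events):
    """Hypothetical scoring rule; the real calculation lives in application code."""
    return sum(points for _, points in events)


def record_events(conn, team_id, events):
    """On every data change, update the stored score in the team's own row."""
    delta = calculate_score(events)  # assumption: scores are additive
    conn.execute("UPDATE teams SET score = score + ? WHERE id = ?", (delta, team_id))
    conn.commit()


def top_teams(conn, n):
    """Let the database do the ranking on the indexed score column."""
    return conn.execute(
        "SELECT name, score FROM teams ORDER BY score DESC, id LIMIT ?", (n,)
    ).fetchall()


record_events(conn, 3, [("goal", 5), ("assist", 2)])
record_events(conn, 1, [("goal", 4)])
record_events(conn, 3, [("goal", 1)])
print(top_teams(conn, 2))
```

If the score is not additive, `record_events` would recompute it from that one team's data instead of adding a delta; the point is that only the affected team is touched on each write.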
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
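Roughly, the "cache the totals and only add the difference" idea could be sketched like this. The record layout (id, team, points) and the class name are assumptions, and in practice you would ask the database only for records above the last processed id rather than filter in code:

```python
from collections import defaultdict


class CachedTotals:
    """Running totals per team plus the id of the last record already counted."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.last_id = 0

    def apply_new(self, records):
        """Add only records newer than the last counted id; return how many were added."""
        fresh = [r for r in records if r[0] > self.last_id]
        for _, team, points in fresh:
            self.totals[team] += points
        if fresh:
            self.last_id = max(r[0] for r in fresh)
        return len(fresh)


log = [(1, "red", 10), (2, "blue", 7), (3, "red", 5)]
cache = CachedTotals()
cache.apply_new(log)          # first, full calculation
log.append((4, "blue", 2))
added = cache.apply_new(log)  # later run: only record 4 is counted
print(dict(cache.totals), added)
```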
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
First and foremost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply query the database for the top scores (so that the computation is done on the server side, not on the client side.. and thus there is no need to move the millions of records).
It sounds pretty straightforward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for caching, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.
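The top-100 scan described above could be sketched like this in Python (the function name and the use of `bisect.insort` for the insertion step are my choices):

```python
from bisect import insort


def top_n(scores, n=100):
    """Scan once, keeping the n highest scores in an ascending sorted list."""
    best = []
    for score in scores:
        if len(best) < n:
            insort(best, score)
        elif score > best[0]:
            # Beats the current lowest entry: insert it and drop that lowest one.
            insort(best, score)
            del best[0]
        # Otherwise the score is below the cut-off and nothing needs to happen.
    return best[::-1]  # highest first


print(top_n([5, 1, 9, 3, 7, 9, 2], n=3))
```

For most elements only the single comparison against `best[0]` runs, which is why the scan stays cheap even over half a million records.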