I am trying to use the Datameer tool to build some reports. I have about 300 million records, and I need to do a lot of complex calculations.
I am wondering if someone knows which approach will give me better performance:
multiple columns with simple formulas, where each column depends on the previous calculation, or
a single column with a complex formula that calculates everything in one shot.
It will depend somewhat on your environment, data sources and data structure. In general, you may want to start with Best Practices - Efficiently Handling Multiple Data Sources.
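Purely as an illustration of the two styles (Datameer's formula language is spreadsheet-like, so the SQL below is only an analogy, and the table and column names are invented):

```sql
-- Style 1: several simple, staged calculations, each building on the previous one
WITH step1 AS (
    SELECT id, amount * rate AS gross FROM orders   -- hypothetical table and columns
),
step2 AS (
    SELECT id, gross - 5.0 AS net FROM step1
)
SELECT id, net * 1.2 AS final_value FROM step2;

-- Style 2: one complex formula that calculates everything in a single shot
SELECT id, ((amount * rate) - 5.0) * 1.2 AS final_value
FROM orders;
```

Many engines will collapse the staged form into much the same work as the single expression, which is why the only dependable answer is to measure both in your own environment.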
I want to store static data, but I don't know which database is fastest for storing 10-100 million rows.
This really is not the correct forum for this question. I would suggest padding out your requirements and posting this over at https://softwarerecs.stackexchange.com/
Generally speaking, all DB platforms will hold 100 million rows of data with no problem at all, given careful consideration to the design on the platform of choice, i.e. the correct indexes etc.
What I am saying here is that the correct mechanism to store this data will depend on your requirements; for example, Stack Overflow is known for using a mix of technologies from SQL Server to Redis. Pick the correct tool for the job based on requirements, and until we know your requirements a little better, it is hard to provide an overview of each.
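As a small illustration of "the correct indexes etc.", on SQL Server that design consideration is often nothing more than a clustered primary key plus an index covering the columns you filter by (the table and column names here are placeholders, not a recommendation for your schema):

```sql
-- Hypothetical static-data table; the point is the indexing, not the schema itself
CREATE TABLE dbo.Measurements (
    MeasurementId BIGINT        NOT NULL PRIMARY KEY,  -- clustered by default on SQL Server
    SensorId      INT           NOT NULL,
    RecordedAt    DATETIME2     NOT NULL,
    Value         DECIMAL(18,4) NOT NULL
);

-- Covering index for the typical "per sensor, per time range" lookup
CREATE INDEX IX_Measurements_Sensor_Time
    ON dbo.Measurements (SensorId, RecordedAt)
    INCLUDE (Value);
```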
I have a question: in terms of performance, is it better to retrieve "pre-calculated" data from a large MySQL database (50k+ rows and growing) or to perform the calculations on the fly?
The calculations are simply divisions and multiplications of data already retrieved from the DB, but there is a lot of it (a minimum of ~300 calculations per load).
Another note: this "page" can be loaded multiple times (it is not a one-off load).
I think that calculating the values (e.g. using a background task) and then only retrieving them from the DB is the best solution in terms of performance, right?
Is there a big difference between the two approaches?
I'd play to the strengths of each system.
Aggregating, joining and filtering logic obviously belongs on the data layer. It's faster, not only because most DB engines have 10+ years of optimisation for doing just that, but also because you minimise the data shifted between your DB and web server.
On the other hand, most DB platforms I've used have very poor functionality for working with individual values. Things like date formatting and string manipulation just suck in SQL; you're better off doing that work in PHP.
Basically, use each system for what it's built to do.
In terms of maintainability, as long as the division between what happens where is clear, separating these two types of logic shouldn't cause many problems, and certainly not enough to outweigh the benefits. In my opinion, code clarity and maintainability are more about consistency than about putting all the logic in one place.
TO SUM UP:
I'd definitely use the database for your problem.
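To make the division of labour concrete, here is the kind of work that belongs in MySQL rather than PHP; the table and column names are invented for the example:

```sql
-- Let the database multiply, divide and aggregate in one pass,
-- so only a handful of result rows travel to the web server.
SELECT
    p.product_id,
    SUM(oi.quantity * oi.unit_price)                    AS revenue,
    SUM(oi.quantity * oi.unit_price) / SUM(oi.quantity) AS avg_price_paid
FROM order_items AS oi
JOIN products    AS p ON p.product_id = oi.product_id
GROUP BY p.product_id;
```

If even that turns out to be too slow for a page that is loaded repeatedly, the background-task idea from the question is a sensible next step: run the query on a schedule and write its output into a small summary table that the page reads directly.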
I need a general piece of advice, but for the record I use JPA.
I need to generate usage statistics, e.g. a breakdown of user purchases per product, etc. I see three possible strategies: 1) generate the stats on the fly each time they are viewed, 2) create a specific table for stats that I would update each time there is a change, or 3) do offline processing at regular time intervals.
All have issues and advantages, e.g. cost vs. out-of-date data, and I was wondering if anyone with experience in this field could provide some advice. I am aware the question is pretty broad; I can refine my use case if needed.
I've done a lot of reporting and the first question I always want to know is if the stakeholder needs the data in real time or not. This definitely shifts how you think and how you'll design a reporting system.
Based on the size of your data, I think it's possible to do real time reporting. If you had data in the millions, then maybe you'd need to do some pre-processing or data warehousing (your options 2/3).
Some general recommendations:
If you want to do real time reporting, think about making a copy of the database so you aren't running reports against production data. Some reports can use queries that are heavy, so it's worth looking into replicating production data to some other server where you can run reports.
Use intermediate structures a lot for reports. Write views, stored procedures, etc. so every report isn't just one huge, complex query (see the sketch after these recommendations).
If the reports start to get too complex to do at the database level, make sure you move the report logic into the application layer. I've been bitten by this many times: I start writing a report with queries purely from the database, and eventually it gets too complex and I have to jump through hoops to make it work.
Shoot for real time and then go to stale data if necessary. Databases are capable of doing a lot more than you'd think. Quite often you can make changes to your database structures that will give you a big yield in performance.
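As a sketch of the "intermediate structures" recommendation, a view for the purchases-per-product breakdown mentioned in the question could look like this (the schema is assumed for the example):

```sql
-- Hypothetical schema: purchases(user_id, product_id, amount, purchased_at)
CREATE VIEW product_purchase_stats AS
SELECT
    product_id,
    COUNT(*)                AS purchase_count,
    COUNT(DISTINCT user_id) AS distinct_buyers,
    SUM(amount)             AS total_amount
FROM purchases
GROUP BY product_id;

-- The report (or a JPA native query) then stays simple:
SELECT * FROM product_purchase_stats ORDER BY total_amount DESC;
```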
I work with databases containing spatial data. Most of these databases are in a proprietary format created by ESRI for use with their ArcGIS software. We store our data in a normalized data model within these geodatabases.
We have found that the performance of this database is quite slow when dealing with relationships (i.e. relating several thousand records to several thousand records can take several minutes).
Is there any way to improve performance without completely flattening/denormalizing the database or is this strictly limited by the database platform we are using?
There is only one way: measure. Try to obtain a query plan, and try to read it. Try to isolate a query from the logfile, edit it to an executable (non-parameterised) form, and submit it manually (in psql). Try to tune it, and see where it hurts.
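For example, in psql the plan of a suspect query can be inspected directly; the table names below are placeholders:

```sql
-- Planner estimates only
EXPLAIN
SELECT * FROM parcels WHERE owner_id = 42;

-- Actually execute the query and report real row counts and timings per plan node
EXPLAIN ANALYZE
SELECT p.*, z.name
FROM parcels p
JOIN zones   z ON z.zone_id = p.zone_id;
```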
Geometry joins can be costly in terms of CPU if many (big) polygons have to be joined and their bounding boxes have a big chance of overlapping. In the extreme case, you'll have to do a preselection on other criteria (e.g. zipcode, if available) or maintain cache tables of matching records.
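A rough sketch of that preselection idea in PostGIS terms, assuming (purely for the example) that both tables carry a zipcode column:

```sql
-- The cheap equality join narrows the candidate pairs first;
-- the expensive geometric test then runs on far fewer rows.
SELECT a.parcel_id, b.zone_id
FROM parcels a
JOIN zones   b ON a.zipcode = b.zipcode      -- preselection on a non-spatial key
WHERE ST_Intersects(a.geom, b.geom);         -- bounding-box check via the index, then the exact test

-- A GiST index on the geometry columns is what makes the bounding-box check cheap
CREATE INDEX parcels_geom_gist ON parcels USING GIST (geom);
```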
EDIT:
BTW: do you have statistics and autovacuum? IIRC, ESRI is still tied to postgres-8.3-something, where these were not run by default.
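If autovacuum turns out to be off, the statistics can at least be refreshed by hand; a minimal check and workaround in plain PostgreSQL (nothing ESRI-specific, table names as in the earlier sketch):

```sql
-- Is autovacuum enabled at all?
SHOW autovacuum;

-- Refresh planner statistics (and reclaim dead rows) for the tables involved
VACUUM ANALYZE parcels;
VACUUM ANALYZE zones;
```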
UPDATE 2014-12-11
ESRI does not interfere with non-GIS stuff. It is perfectly OK to add PK/FK relations or additional indexes to your schema. The DBMS will pick them up if appropriate, and ESRI will ignore them. (ESRI only uses its own meta-catalogs, ignoring the system catalogs.)
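For example, adding a foreign key and a supporting index to such a schema is plain DDL (names again hypothetical):

```sql
-- Declare the parent/child relationship in the system catalogs (and have it enforced)
ALTER TABLE inspections
    ADD CONSTRAINT inspections_parcel_fk
    FOREIGN KEY (parcel_id) REFERENCES parcels (parcel_id);

-- Index the joining column; PostgreSQL does not create one automatically for a FK
CREATE INDEX inspections_parcel_idx ON inspections (parcel_id);
```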
When I had to deal with spatial data, I tended to precalculate the values and store them. Yes, that makes for a big table, but it is much faster to query when you only do the complex calculation once, on data entry. Data entry does take longer, though. I was in a situation where all my spatial data came from a monthly load, so precalculating wasn't too bad.
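A sketch of that precalculation pattern during a monthly load, with invented names: compute the expensive spatial values once and store them next to the row:

```sql
-- One-time schema change: columns to hold the precalculated values
ALTER TABLE parcels ADD COLUMN area_m2   double precision;
ALTER TABLE parcels ADD COLUMN perimeter double precision;

-- Run once per load instead of once per query
UPDATE parcels
SET area_m2   = ST_Area(geom),
    perimeter = ST_Perimeter(geom);
```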
I have two tables (~4 million rows) on which I have to perform insert/update actions for matching and non-matching records. I am pretty confused about which method to use for the incremental load. Should I use the Lookup component or the new SQL Server MERGE statement? And will there be much of a performance difference?
I've run into this exact problem a few times, and I've always had to resort to loading the complete dataset into SQL Server via ETL and then manipulating it with stored procs. It always seemed to take way, way too long updating the data on the fly in SSIS transforms.
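The "stage it, then finish in a stored proc" approach usually boils down to a MERGE between the staging table and the target; the table and the single business key below are illustrative assumptions:

```sql
-- Incremental load: the staging table holds the incoming ~4 million rows
MERGE dbo.Customers AS tgt
USING dbo.Customers_Staging AS src
    ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
    UPDATE SET tgt.Name = src.Name,
               tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, Name, City)
    VALUES (src.CustomerKey, src.Name, src.City);
```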
The SSIS Lookup has three caching modes which are key to getting the best performance from it. If you are looking up against a large table, FULL Cache mode will eat up a lot of your memory and could hinder performance. If your lookup destination is small, keep it in memory. You've also got to decide if the data you are looking up against is changing as you process data. If it is, then you don't want to cache.
Can you give us some more info on what you are doing so I can formulate a more precise answer?
Premature optimization is the root of all evil. I don't know about SSIS, but it's always too early to think about this.
4 million rows could be "large" or "small", depending on the type of data, and the hardware configuration you're using.