Alternative to map:map in MarkLogic

In my application I need to load billions of entries into a map:map, which takes a very long time to execute. Is there any alternative to map:map that would take less execution time?

With billions of anything, the answer is probably to put them into the database and work with them using range indexes. It might be appropriate to use https://github.com/marklogic/semantic/ or at least borrow some of its concepts.

I concur with Justin that more details would help to give a more accurate answer.
In general, the main issue with map:map is that it needs to be initialized on each execution of the module. With many entries it is wise to store the complete map:map somewhere to speed this up. You could put it in a server field, so it would only need to be recalculated after a restart. You could also store it in a database, but that would require a database round-trip to retrieve it.
However, a map:map with a billion entries might not perform well at all. As an alternative you could store each entry as a separate document in the database; MarkLogic can handle that very well. You can then use cts functions to retrieve the appropriate entries, and because indexes are kept in memory, lookups are very fast.
HTH!

Related

Couchbase retrieve data from vbucket

I'm new to Couchbase and am wondering whether there is any way to implement a parallel read from a bucket. A bucket contains 1024 vbuckets by default, so would it be possible to split a N1QL query such as select * from b1 into several queries, where one query reads data only from vbucket1 through vbucket100? Since the partition key decides which node a value is persisted on, I think it should be possible to read part of the data in a bucket according to a range of partition keys. Could someone help me out with this?
Thanks
I don't recommend proceeding down this route. If you are just starting out, you should be worrying about how to represent your data in JSON, how to write effective N1QL queries against it, and how to get a useful set of indexes that support those queries and let them run quickly. You should also make sure that your cluster is properly set up, and you have a proper mix of KV, N1QL, and indexing nodes, with none of them as an obvious bottleneck. And of course you should be measuring performance. Exotic strategies like query partitioning should come after that, if you are still unsatisfied with performance.
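For example, a secondary index that matches the query's filter is usually the first thing to get right. A minimal N1QL sketch, assuming hypothetical type, customerId and orderDate fields on the b1 bucket from the question:

CREATE INDEX idx_orders_by_customer ON b1(customerId, orderDate) WHERE type = "order";

SELECT * FROM b1 WHERE type = "order" AND customerId = "c123";

With an index like this the query service does an index scan instead of a full bucket scan, which is usually a far bigger win than trying to split reads across vbuckets by hand.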

Should I Replace Multiple Float Columns with a BLOB?

How would a single BLOB column in SQL Server compare (performance wise), to ~20 REAL columns (20 x 32-bit floats)?
I remember Martin Fowler recommending using BLOBs for persisting large object graphs (in Patterns of Enterprise Application Architecture) to remove multiple joins in queries, but does it make sense to do something like this for a table with 20 fixed columns (which are never used in queries)?
This table is updated really often, around 100 times per second, and INSERT statements get rather large with all the columns specified in the query.
I presume the first answer is going to be "profile it yourself", but I'd like to know if someone already has experience with this stuff.
Typically you should not, unless you have established that doing so is critical to meeting your performance requirements.
If you store it all in one blob, any change to the object structure (like adding or removing a field) means converting the whole database. If you keep separate columns, future database refactorings and deployments will be much easier.
I can't fully speak to the performance of the SELECT, and you'll need to test that, but I highly doubt it will cause any performance issues there because you wouldn't be reading any more data than before. With regard to the INSERT, however, you should see a performance gain (of what size I'm unsure), because there will likely not be any statistics or indexes to update. Of course that depends on a lot of settings, and I'm just throwing my opinion out there. This question is pretty subjective, and not nearly enough information is available to truly tell you whether you will see performance issues from the change.
Now, in practice I'm going to say, leave it be unless you're seeing real performance issues. Further, if you're seeing real performance issues, analyze those before choosing this type of solution, there are probably other ways to fix them.
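For concreteness, a minimal sketch of the two layouts being compared, with hypothetical table and column names:

-- option A: twenty explicit float columns
CREATE TABLE dbo.ReadingWide (
    ReadingId bigint IDENTITY PRIMARY KEY,
    Value01   real NOT NULL,
    Value02   real NOT NULL
    -- ... Value03 through Value20 ...
);

-- option B: one fixed-size binary column holding the packed floats (20 x 4 bytes = 80 bytes)
CREATE TABLE dbo.ReadingBlob (
    ReadingId bigint IDENTITY PRIMARY KEY,
    Payload   varbinary(80) NOT NULL
);

Worth noting: 80 bytes is small enough to stay in-row, so option B is not really a BLOB in the varbinary(max) sense; its main cost is that packing and unpacking the floats moves into application code, which is exactly where the schema-change pain described above shows up.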

Does storing aggregated data go against database normalization?

On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If measurement shows a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance so in the real world you have to find your balance.
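For illustration only, one common way to keep such a running tally inside the database itself (PostgreSQL-style syntax; the questions/answers tables and column names are hypothetical):

ALTER TABLE questions ADD COLUMN answer_count integer NOT NULL DEFAULT 0;

CREATE FUNCTION bump_answer_count() RETURNS trigger AS $$
BEGIN
    -- keep the denormalised tally in step with the detail rows
    UPDATE questions SET answer_count = answer_count + 1 WHERE id = NEW.question_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER answers_count_trg
    AFTER INSERT ON answers
    FOR EACH ROW EXECUTE FUNCTION bump_answer_count();

A DELETE/UPDATE counterpart is needed too if answers can be removed; that extra maintenance is exactly the cost you weigh against the saved COUNT(*) queries.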
I've handled a situation like this in two ways.
1) Using DB2, I used an MQT (Materialized Query Table), which works like a view except that its query results are stored as a table, and you can schedule how often you want it to refresh, e.g. every 5 minutes. That table then stored the count values (see the sketch at the end of this answer).
2) In the software package itself, I stored information like that in a system-wide variable, refreshed every 5 minutes (you can do this under Apache, for instance). The value is then only roughly up to date, but you're only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it, and it's been a while, but I think in PHP it was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value, it saves you from running the aggregate query on every request.
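A minimal sketch of the MQT approach from (1), assuming hypothetical answers/question_id names; the scheduling itself happens outside the DDL, e.g. a cron job issuing the REFRESH every 5 minutes:

-- DB2: a deferred-refresh materialized query table holding per-question answer counts
CREATE TABLE question_stats AS (
    SELECT question_id, COUNT(*) AS answer_count
    FROM answers
    GROUP BY question_id
) DATA INITIALLY DEFERRED REFRESH DEFERRED;

-- run on whatever schedule suits you
REFRESH TABLE question_stats;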

Using SQL server partitioning when storing 1000 records a second

I read your article ("SQL Server partitioning: not the answer to everything")
and I am wondering whether to use partitioning in my case.
I have to store about 1000 records per second of location data for mobile nodes, and this data makes my database very large.
Do you think I should partition my database? (I will have a lot of reporting to do in the future.)
1000 a second isn't that much.
Is it every second of 24/7?
In a defined window?
Is it a peak of 1000 per second but usually less?
We have a recent system growing at 20 million rows/month (after tidy ups of say another 50-80 million) and we're not thinking of anything like partitioning.
That's a lot of data.
What is the lifecycle of the data i.e. do you only need to store the records for a finite amount of time? For example after a month, perhaps certain data can be archived off or moved to a Data warehouse?
Given the volume of data that you intend to work with, you are probably going to want an architecture that scales easily. For this reason you may want to look at cloud services such as Amazon EC2, or SQL Data Services on the Azure platform.
http://aws.amazon.com/ec2/
http://www.microsoft.com/azure/data.mspx
Perhaps if you provide more specific details about what it is you are actually looking to do i.e. what business process you are looking to support, we may be able to provide more specific assistance.
Without such details it is not possible to ascertain whether or not SQL Server Partitioning would be an appropriate design approach for you.
You might need to look at a different RDBMS. I would take a look at Vertica.
Presuming the table in question is indexed, one of two options is certainly warranted once any of the indexes outgrow the available RAM. Not surprisingly, one of them is to increase RAM. The other, of course, is vertical partitioning.
gbn's answer raises some good points to consider that you have not mentioned, such as how many records per month (or week, or day) are being added. Richard's comment about how big the average record is is also significant, particularly in terms of how big the average index entries are, presuming the indexes do not include every field of the table.
gbn's answer does, however, also seem a bit reckless to me: growing at 20 million rows per month and not even "thinking of anything like partitioning". Without sufficient metrics, as alluded to above, that is a possible recipe for disaster. You should at least be thinking about it, even if just to determine how long you can sustain your current and/or expected rate of growth before needing to consider more RAM or partitioning.
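For reference, a minimal sketch of what SQL Server table partitioning looks like if you do go that route (hypothetical table and column names; the monthly boundary values are placeholders):

CREATE PARTITION FUNCTION pf_location_month (datetime2)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME ps_location_month
    AS PARTITION pf_location_month ALL TO ([PRIMARY]);

CREATE TABLE dbo.NodeLocation (
    NodeId     int       NOT NULL,
    RecordedAt datetime2 NOT NULL,
    Latitude   real      NOT NULL,
    Longitude  real      NOT NULL
) ON ps_location_month (RecordedAt);

Sliding-window maintenance (adding a new boundary each month and switching old partitions out for archiving) is where most of the ongoing effort goes.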

How to gain performance when maintaining historical and current data?

I want to maintain the last ten years of stock market data in a single table. Certain analyses need only the last month of data, and when I do this short-term analysis it takes a long time to complete.
To overcome this I created another table to hold the current year's data alone. When I perform the analysis against this table it is 20 times faster than the previous one.
Now my question is:
Is a separate table the right way to handle this kind of problem? (Or should we use a separate database instead of a table?)
If I have a separate table, is there any way to update the secondary table automatically?
Or can we use something like a materialized view to gain performance?
Note: I'm using Postgresql database.
You want table partitioning. This will automatically split the data between multiple tables, and will in general work much better than doing it by hand.
I'm working on nearly the exact same issue.
Table partitioning is definitely the way to go here. I would segment by more than just year, though; it gives you a greater degree of control. Set up your partitions and then constrain them by month (or some other date range). In postgresql.conf you'll need to turn constraint_exclusion = on to really get the benefit. The additional benefit here is that you can index only the exact tables you really want to pull information from. If you're batch importing large amounts of data into this table, you may get slightly better results with a Rule than with a Trigger, and for partitioning I find rules easier to maintain. For smaller transactions, though, triggers are much faster. The PostgreSQL manual has a great section on partitioning via inheritance.
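A minimal sketch of the inheritance-based setup that manual section describes (a hypothetical ticks table with one monthly child; the boundary dates are placeholders):

CREATE TABLE ticks (
    symbol    text        NOT NULL,
    traded_at timestamptz NOT NULL,
    price     numeric     NOT NULL
);

-- one child per month, with a CHECK constraint the planner can use to skip it
CREATE TABLE ticks_2024_01 (
    CHECK (traded_at >= DATE '2024-01-01' AND traded_at < DATE '2024-02-01')
) INHERITS (ticks);

CREATE INDEX ON ticks_2024_01 (traded_at);

-- and in postgresql.conf: constraint_exclusion = on

Inserts are then routed to the right child with a rule or trigger on ticks, as described above; recent PostgreSQL versions also offer declarative partitioning (PARTITION BY RANGE), which handles the routing for you.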
I'm not sure about PostgreSQL, but I can confirm that you are on the right track. When dealing with large data volumes partitioning data into multiple tables and then using some kind of query generator to build your queries is absolutely the right way to go. This approach is well established in Data Warehousing, and specifically in your case stock market data.
However, I'm curious why you need to update your historical data. If you're dealing with stock splits, it's common to implement that using a separate multiplier table that is used in conjunction with the raw historical data to give an accurate price per share.
It is perfectly sensible to use a separate table for historical records. It's much more problematic with a separate database, as it is not simple to write cross-database queries.
Automatic updates - that's a job for a cron job.
You can use partial indexes for such things - they do a wonderful job.
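For example, a partial index that covers only the recent window the short-term analysis touches (hypothetical names; note the predicate is fixed at creation time, so the index needs re-creating as the window moves):

CREATE INDEX ticks_recent_idx
    ON ticks (symbol, traded_at)
    WHERE traded_at >= DATE '2024-01-01';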
Frankly, you should check your execution plans and try fixing your queries or indexing before taking more radical steps.
Indexing comes at very little cost (unless you do a lot of insertions) and your existing code will be faster (if you index properly) without modifying it.
Other measures such as partitioning come after that...
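As a first step, EXPLAIN ANALYZE on the slow one-month query will show whether the existing table and indexes are actually being used sensibly. A sketch with hypothetical names:

EXPLAIN (ANALYZE, BUFFERS)
SELECT symbol, avg(price)
FROM ticks
WHERE traded_at >= now() - interval '1 month'
GROUP BY symbol;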
