I am looking for a data structure that can store points containing position data ((x, y), (latitude, longitude), etc.) and time data. I was planning on using a 3-dimensional k-d tree, but I'm running into problems because of the time data. Since points are added as they come in and the time almost always increases, each new point is inserted almost linearly (to the right of the previous one), which unbalances the tree.
I want to be able to perform insertions, deletions, and nearest neighbor queries on the data.
The technical term is spatiotemporal database; that should allow you to look up related research and algorithms.
To avoid the degeneration problem with the k-d tree, note that some k-d tree implementations provide a rebalance() function, which may be helpful.
Also, how about using an R-tree (self-balancing) or a PH-Tree (does not require rebalancing, and depth is inherently limited to 64; disclaimer: self-advertisement)?
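If you stick with a k-d tree rather than switch structures, the simplest fix is often to rebuild: buffer incoming points and re-run a median-split build once enough new points have accumulated, so the monotonically increasing time coordinate no longer degrades the tree. Below is a minimal sketch of that idea in C++; the names, the flat-array layout, and the rebuild threshold are my own illustration choices, not taken from any particular library.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One point: x, y and time packed into an array so the split axis can be indexed.
struct Point { double c[3]; };

// Recursively place the median along `axis` at the middle of [lo, hi); the
// result is a balanced k-d tree laid out implicitly in the vector.
void buildKd(std::vector<Point>& pts, std::size_t lo, std::size_t hi, int axis) {
    if (hi - lo <= 1) return;
    std::size_t mid = lo + (hi - lo) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point& a, const Point& b) {
                         return a.c[axis] < b.c[axis];
                     });
    buildKd(pts, lo, mid, (axis + 1) % 3);
    buildKd(pts, mid + 1, hi, (axis + 1) % 3);
}

// Append points as they arrive and rebuild (scapegoat style) once the number of
// unorganised insertions reaches half the tree size, keeping rebuild cost amortised.
class RebuildingKdTree {
public:
    void insert(const Point& p) {
        pts_.push_back(p);
        if (++sinceRebuild_ * 2 > pts_.size()) {
            buildKd(pts_, 0, pts_.size(), 0);
            sinceRebuild_ = 0;
        }
    }
    const std::vector<Point>& points() const { return pts_; }
private:
    std::vector<Point> pts_;
    std::size_t sinceRebuild_ = 0;
};
```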
I'm trying to find a nearest neighbor for a given data point, then add the data point without rebuilding the entire NN database, as that would be too costly. Are there any efficient methods that let me add the data point without rebuilding? Are there any optimization workarounds?
Many spatial indexes allow kNN search and modification (add/remove/relocate): kd-tree, R-tree, quadtree, LSH, and so on.
Some spatial indexes allow modifications in principle, but modifications are probably prohibitively slow, e.g. the cover tree.
All of the above give precise answers to NN queries.
I don't know how well approximate NN (ANN) indexes can be updated.
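To make the "modification plus kNN" point concrete, here is a small self-contained sketch of the first structure in that list: a 2-D k-d tree with point-at-a-time insertion (no global rebuild) and an exact nearest-neighbour query. The node layout, names, and the squared-Euclidean metric are my own illustration choices, not from any specific library.

```cpp
#include <cstdio>
#include <memory>

struct Node {
    double p[2];
    std::unique_ptr<Node> left, right;
};

// Insert by walking down the tree, comparing one coordinate per level.
void insert(std::unique_ptr<Node>& root, const double p[2], int axis = 0) {
    if (!root) {
        root = std::make_unique<Node>();
        root->p[0] = p[0];
        root->p[1] = p[1];
        return;
    }
    if (p[axis] < root->p[axis])
        insert(root->left, p, 1 - axis);
    else
        insert(root->right, p, 1 - axis);
}

double sqDist(const double a[2], const double b[2]) {
    double dx = a[0] - b[0], dy = a[1] - b[1];
    return dx * dx + dy * dy;
}

// Exact NN: descend toward the query first, then backtrack into the other
// subtree only if the splitting plane is closer than the best match so far.
void nearest(const Node* node, const double q[2], int axis,
             const Node*& best, double& bestD) {
    if (!node) return;
    double d = sqDist(node->p, q);
    if (d < bestD) { bestD = d; best = node; }
    double diff = q[axis] - node->p[axis];
    const Node* nearSide = diff < 0 ? node->left.get() : node->right.get();
    const Node* farSide  = diff < 0 ? node->right.get() : node->left.get();
    nearest(nearSide, q, 1 - axis, best, bestD);
    if (diff * diff < bestD)   // the other side could still hold a closer point
        nearest(farSide, q, 1 - axis, best, bestD);
}

int main() {
    std::unique_ptr<Node> root;
    double pts[][2] = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
    for (auto& p : pts) insert(root, p);

    double q[2] = {9, 2};
    const Node* best = nullptr;
    double bestD = 1e300;
    nearest(root.get(), q, 0, best, bestD);
    std::printf("nearest to (9,2): (%g, %g)\n", best->p[0], best->p[1]);
    return 0;
}
```

Insertions like this are cheap, but without occasional rebalancing the tree can degrade if the data arrives in a pathological order, which is where the self-balancing structures listed above earn their keep.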
I'm learning to work with large amounts of data.
I've generated a file of 10,000,000 ints. I want to perform a number of sorts on the data and time the sorts (maybe plot the values for performance analysis?) but I've never worked with big data before and don't know how to sort (say, even bubble sort!) data that isn't in memory! I want to invoke the program like so:
./mySort < myDataFile > myOutFile
How would I go about sorting data that can't fit into a linked list or an array?
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
One of the best references on this, though rather technical and dense, is Donald Knuth's treatment of tape-sorting algorithms. Back in the day when data was stored on tape and could only be read sequentially and then written out to other tapes, this kind of sorting was often done by repeatedly shuffling data back and forth between different tape drives.
Depending upon the size and type of dataset you are working with, it may be worthwhile to load the data into a dedicated database, or to make use of a cloud-based service like Google's BigQuery. BigQuery charges nothing to upload or download your dataset; you just pay for the processing, the first TB of processed data each month is free, and your dataset is well under one GB anyway.
Edit: Here's a very nice set of undergraduate lecture notes on external sorting algorithms. http://www.math-cs.gordon.edu/courses/cs321/lectures/external_sorting.html
You need to use external sorting.
Bring in part of the data at a time, sort it in memory, and then merge the sorted runs.
More details here:
http://en.m.wikipedia.org/wiki/External_sorting
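For what it's worth, here is a minimal external merge sort sketch that fits the `./mySort < myDataFile > myOutFile` usage, assuming the input is whitespace-separated integers in text form; the chunk size and temp-file naming are arbitrary choices for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

int main() {
    const std::size_t kChunk = 1000000;   // how many ints we keep in memory at once
    std::vector<std::string> runFiles;
    std::vector<int> buf;
    buf.reserve(kChunk);
    int value;

    // Phase 1: read fixed-size chunks, sort each in memory, write it out as a run.
    auto flushRun = [&]() {
        if (buf.empty()) return;
        std::sort(buf.begin(), buf.end());
        std::string name = "run_" + std::to_string(runFiles.size()) + ".tmp";
        std::ofstream out(name);
        for (int v : buf) out << v << '\n';
        runFiles.push_back(name);
        buf.clear();
    };
    while (std::cin >> value) {
        buf.push_back(value);
        if (buf.size() == kChunk) flushRun();
    }
    flushRun();

    // Phase 2: k-way merge of all runs to stdout using a min-heap of (value, run index).
    std::vector<std::ifstream> runs;
    for (const std::string& name : runFiles) runs.emplace_back(name);

    typedef std::pair<int, std::size_t> Item;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    for (std::size_t i = 0; i < runs.size(); ++i)
        if (runs[i] >> value) heap.push(Item(value, i));

    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        std::cout << top.first << '\n';
        if (runs[top.second] >> value) heap.push(Item(value, top.second));
    }

    // Clean up the temporary run files.
    for (const std::string& name : runFiles) std::remove(name.c_str());
    return 0;
}
```

Only kChunk integers are ever in memory at once; everything else lives in the temporary run files until the final merge streams it to stdout.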
I'm new to databases and have been reading that adding an index to a field you need to search over can dramatically speed up search times. I understand this is true, but am curious as to how it actually works. I've searched a bit on the subject, but haven't found any good, concise, and not overly technical answer to how it works.
I've read the analogy of it being like an index at the back of a book, but in the case of a data field of unique elements (such as e-mail addresses in a user database), the back-of-the-book analogy would provide the same linear look-up time as a non-indexed search.
What is going on here to speed up search times so much? I've read a little bit about searching using B+-trees, but the descriptions were a bit too in-depth. What I'm looking for is a high-level overview of what is going on, something to help my conceptual understanding of it, not technical details.
Expanding on the search-algorithm efficiencies, a key factor in database performance is how fast the data can be accessed.
In general, reading data from disk is a lot slower than reading data from memory.
To illustrate the point, let's assume everything is stored on disk. If you need to search through every row of data in a table looking for certain values in a field, you still need to read the entire row of data from the disk to see if it matches - this is commonly referred to as a 'table scan'.
If your table is 100MB, that's 100MB you need to read from disk.
If you now index the column you want to search on, in simplistic terms the index will store each unique value of the data and a reference to the exact location of the corresponding full row of data. This index may now only be 10MB compared to 100MB for the entire table.
Reading 10MB of data from the disk (and maybe a bit extra to read the full row data for each match) is roughly 10 times faster than reading the 100MB.
Different databases will store indexes or data in memory in different ways to make these things much faster. However, if your data set is large and doesn't fit in memory then the disk speed can have a huge impact and indexing can show huge gains.
Even when the data is all in memory there can still be large performance gains (amongst other efficiencies), but they are less dramatic.
In general, that's why you may not notice any tangible difference from indexing a small dataset that easily fits in memory.
The underlying details will vary between systems and actually will be a lot more complicated, but I've always found the disk reads vs. memory reads an easily understandable way of explaining this.
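As a rough back-of-the-envelope version of the numbers above (the throughput figure is an assumption purely for illustration):

```cpp
#include <cstdio>

int main() {
    const double diskMBperSec = 100.0;              // assumed sequential read speed
    const double tableMB = 100.0, indexMB = 10.0;   // figures from the example above

    std::printf("full table scan: ~%.1f s\n", tableMB / diskMBperSec);
    std::printf("index scan:      ~%.1f s (plus a few row reads per match)\n",
                indexMB / diskMBperSec);
    return 0;
}
```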
Okay, after a bit of research and discussion, here is what I have learned:
Conceptually, an index is a sorted copy of the data field it is indexing, where each index value points to its original (unsorted) row. Because the database knows how the values are sorted, it can apply more sophisticated search algorithms than just scanning for the value from start to finish. Binary search is a simple example of a search algorithm for sorted lists, and it reduces the maximum search time from O(n) to O(log n).
As a side note: a decent sorting algorithm generally takes O(n log n) to complete, which means (as we've all probably heard before) you should only put indexes on fields you will search often, as it's a bit more expensive to add the index (which includes a sort) than it is to do a full search a few times. For example, in a large database of more than 1,000,000 entries it's in the range of 20x more expensive to sort than to search once (log2 of a million is about 20).
Edit:
See @Jarod Elliott's answer for a more in-depth look at search efficiencies, specifically in regard to read-from-disk operations.
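As a toy illustration of the "sorted copy that points back to its row" idea (all names and data here are made up):

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct User { std::string email; std::string name; };

int main() {
    // The "table": rows stored in arrival (unsorted) order.
    std::vector<User> table = {
        {"zoe@example.com", "Zoe"},
        {"amir@example.com", "Amir"},
        {"lena@example.com", "Lena"},
    };

    // The "index": (key, row number) pairs kept sorted by key.
    std::vector<std::pair<std::string, std::size_t>> index;
    for (std::size_t row = 0; row < table.size(); ++row)
        index.emplace_back(table[row].email, row);
    std::sort(index.begin(), index.end());           // O(n log n), paid once

    // Indexed lookup: binary search the index, then jump straight to the row.
    std::string key = "lena@example.com";
    auto it = std::lower_bound(index.begin(), index.end(),
                               std::make_pair(key, std::size_t{0}));
    if (it != index.end() && it->first == key)
        std::cout << "found " << table[it->second].name << '\n';
    return 0;
}
```

The lookup touches O(log n) index entries plus one row, instead of scanning every row of the table.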
To continue your back-of-the-book analogy, if the pages were in order by that element it would be the same look-up time as a non-indexed search, yes.
However, what if your book were a list of book reviews ordered by author, but you only knew the ISBN? The ISBN is unique, yes, but you'd still have to scan each review to find the one you are looking for.
Now, add an index at the back of the book, sorted by ISBN. Boom, fast search time. This is analogous to the database index, going from the index key (ISBN) to the actual data row (in this case a page number of your book).
I was recently on the OEIS (Online Encyclopedia of Integer Sequences), trying to look up a particular sequence I had on hand.
Now, this database is fairly large. The website states that if the 2006 edition (already five years old) were printed, it would occupy 750 volumes of text.
I'm sure this is the same sort of issue Google has to handle as well. But, they also have a distributed system where they take advantage of load balancing.
Neglecting load balancing, however, how much time does a query take relative to the database size?
Or in other words, what is the time complexity of a query with respect to DB size?
Edit: To make things more specific, assume the input query is simply looking up a string of numbers such as:
1, 4, 9, 16, 25, 36, 49
It strongly depends on the query, the structure of the database, contention, and so on. But in general most databases will find a way to use an index, and that index will either be some kind of tree structure (see http://en.wikipedia.org/wiki/B-tree for one option), in which case access time is proportional to log(n), or else a hash table, in which case access time is O(1) on average (see http://en.wikipedia.org/wiki/Hash_function#Hash_tables for an explanation of how they work).
So the answer is typically O(1) or O(log(n)) depending on which type of data structure is used.
This may cause you to wonder why we don't always use hash functions. There are multiple reasons. Hash functions make it hard to retrieve ranges of values. If the hash function fails to distribute data well, it is possible for access time to become O(n). Hashes need resizing occasionally, which is potentially very expensive. And log(n) grows slowly enough that you can treat it as being reasonably close to constant across all practical data sets. (From 1000 to 1 petabyte it varies by a factor of 5.) And frequently the actively requested data shows some sort of locality, which trees do a better job of keeping in RAM. As a result trees are somewhat more commonly seen in practice. (Though hashes are by no means rare.)
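A small way to see the trade-off with standard containers: std::map is a balanced tree (ordered, O(log n) lookups, cheap range scans), while std::unordered_map is a hash table (O(1) average lookups, no ordered ranges). A quick sketch:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

int main() {
    std::map<int, std::string> tree = {{1, "a"}, {4, "b"}, {9, "c"}, {16, "d"}};
    std::unordered_map<int, std::string> hashTable(tree.begin(), tree.end());

    // Point lookups work on both: O(log n) for the tree, O(1) average for the hash.
    std::cout << tree.at(9) << ' ' << hashTable.at(9) << '\n';

    // A range query such as "keys in [2, 10)" is natural on the tree...
    for (auto it = tree.lower_bound(2); it != tree.lower_bound(10); ++it)
        std::cout << it->first << ' ';
    std::cout << '\n';
    // ...but on the hash table the only option is to scan every bucket.
    return 0;
}
```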
That depends on a number of factors including the database engine implementation, indexing strategy, specifics of the query, available hardware, database configuration, etc.
There is no way to answer such a general question.
A properly designed and implemented database with terabytes of data may actually outperform a badly designed little database (particularly one with no indexing that uses badly performing non-sargable queries and things such as correlated subqueries). This is why anyone expecting to have large amounts of data needs to hire an expert on database design for large databases to do the initial design, not later when the database is already large. You may also need to invest in the type of equipment you need to handle the size as well.
I have to implement an algorithm to decompose 3D volumes into voxels. The algorithm starts by identifying which vertices are on each side of the cutting plane and, in a second step, which edges cross the cutting plane.
This process could be optimized by taking advantage of sorted lists. Identifying the split point is O(log n). But I would have to maintain one such sorted list per axis, and this for both vertices and edges. Since this is to be implemented for the GPU, I also have some constraints on memory management (i.e. CUDA). Intrusive lists/trees and C are imposed.
With a complete "voxelization" I expect to end up with ~4000 points and 12000 edges. Fortunately this can be optimized by using a smarter strategy to get rid of processed voxels and to order the cutting of residual volumes so as to keep their number to a minimum. In this case I would expect to have fewer than 100 points and 300 edges. This makes the process more complex to manage but could end up being more efficient.
The question is thus to help me identify the criteria to determine when the benefit of using a sorted data structure is worth the effort and complexity overhead compared to simple intrusive linked lists.
chmike, this really sounds like the sort of thing you want to do the simpler way first, and see how it behaves. Any sort of GPU voxelization approach is pretty fragile to system details once you get into big volumes at least (which you don't seem to have). In your shoes I'd definitely want the straightforward implementation first, if for no other reason than to check against...
The question will ALWAYS boil down to which operation is most common: accessing or adding.
If you have an unordered list, adding to it takes no time, and accessing particular items takes extra time.
If you have a sorted list, adding to it takes more time, but accessing it is quicker.
Most applications spend most of their time accessing the data rather than adding to it, which means that the (running) time overhead of maintaining a sorted list is usually balanced or covered by the time saved in accessing the list.
If there is a lot of churn in your data (which it doesn't sound like there is), then maintaining a sorted list isn't necessarily advisable, because you will be constantly re-sorting the list at considerable CPU cost.
The complexity of the data structures only matters if they cannot be sorted in a useful way. If they can be sorted, then you'll have to go by the heuristic of
number of accesses:number of changes
to determine if sorting is a good idea.
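As a back-of-the-envelope example of that ratio (the operation counts and cost model here are assumptions purely for illustration, not measurements):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double n        = 300.0;    // ~300 edges, the smaller case in the question
    const double accesses = 10000.0;  // assumed number of lookups
    const double changes  = 300.0;    // assumed number of insertions

    // Unsorted intrusive list: O(1) insert, ~n/2 comparisons per lookup.
    double unsortedCost = changes * 1.0 + accesses * (n / 2.0);
    // Balanced sorted structure: ~log2(n) comparisons for both insert and lookup.
    double sortedCost = changes * std::log2(n) + accesses * std::log2(n);

    std::printf("unsorted: ~%.0f comparisons, sorted: ~%.0f comparisons\n",
                unsortedCost, sortedCost);
    return 0;
}
```

With numbers like these the sorted structure wins easily; if the access count drops toward the change count, the unsorted list becomes competitive again.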
After considering all the answers I found out that the latter method, used to avoid duplicate computation, would end up being less efficient because of the effort to maintain and navigate the data structure. Besides, the initial method is straightforward to parallelize with a few small kernel routines and is thus more appropriate for GPU implementation.
Reviewing my initial method, I also found significant optimization opportunities that leave the volume-cut method well behind.
Since I had to pick one answer I chose devinb's, because he answered the question, but Simon's comment, backed up by Tobias Warre's comment, was just as valuable to me.
Thanks to all of you for helping me sort out this issue.
Stack overflow is an impressive service.