I am new to R-Tree concept. Sorry If I ask a very basic question related to Rtree. I have read a few literature on R-Tree to get the basic concept of R-Tree. However, I could not understand the clustering or grouping steps in MBR. What's bothering me is:
How many points or object could fit in each MBR? i could see that the number of object stored in each MBR is varies. So is there any condition or procedure or formula or anything to determine how many objects will be stored in each MBR?
Thanks for your help! Gracias!
Read the R-tree publication, or a book on index structures.
You fix a page size (because the R-tree is a disk-oriented data structure, this should be something such as e.g. 8kb).
If a page gets too empty, it will be removed. If a page is too full, it will be split.
Just like with pretty much any other page-based tree, actually (e.g. B-tree).
Related
I am writing a Reddit like comment store using Couchbase. For each comment, I am storing its parentId and a list of its childrenIds. Each top-level comment on a web page will have its parentId as null.
I want to retrieve a comment block efficiently. By a comment block, I mean a top-level comment along with all its children comments. So the first step in this can be to write a map function that emits the ids of all top-level comments.
How do I go about fetching the entire tree once I have the root. A very naive approach would be to find the children, and query them recursively. But this defeats the purpose of not using a relational database for this project (since I am dealing with highly nested data and relational databases are terrible at storing them).
Can someone guide me on this?
OK, so each top-level comment has a tree of sub-comments below it. I think you could safely put the entire tree of sub-comments in the document of the top-level comment in most cases. The default limit on document sizes is 20 MB, which is a heck of a lot for text.
The question then is what to do with comments that inspire a LOT of sub-comments. I suggest you start spilling parts of the sub-comment tree to other documents when that happens, so strictly speaking there can be a tree of sub-comment trees, although typically you only need the one document. Design things so these subsidiary documents only get fetched on demand, and you never have to fetch absolutely the entire sub-comment tree.
i have a scenario where i have to set few records with field values to a constant and then later access them one by one sequentially .
The records can be random records.
I dont want to use link list as it will be costly and don't want to traverse the whole buffer.
please give me some idea to do that.
When you say "set few records with field values to a constant" is this like a key to the record? And then "later access them one by one" - is this to recall them with some key? "one-by-one sequentially" and "don't want to traverse the whole buffer" seems to conflict, as sequential access sounds a lot like traversal.
But I digress. If you in fact do have a key (and it's a number), you could use some sort of Hash Table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range, then add it to the list there. This might increase performance assuming you have a good distribution of keys (your records spread across the array well).
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.
I'm a grad student in astrophysics. I run big simulations using codes mostly developed by others over a decade or so. For examples of these codes, you can check out gadget http://www.mpa-garching.mpg.de/gadget/ and enzo http://code.google.com/p/enzo/. Those are definitely the two most mature codes (they use different methods).
The outputs from these simulations are huge. Depending on your code, your data is a bit different, but it's always big data. You usually take billions of particles and cells to do anything realistic. The biggest runs are terabytes per snapshot and hundreds of snapshots per simulation.
Currently, it seems that the best way to read and write this kind of data is to use HDF5 http://www.hdfgroup.org/HDF5/, which is basically an organized way of using binary files. It's a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this.
I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?
If it helps, we typically store data columnwise. That is, you have a block of all particle id's, block of all particle positions, block of particle velocites, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume.
edit: Sorry for being vague about the issues. Steve is right that this might just be an issue of data structure rather than the data storage method. I have to run now, but I will provide more details late tonight or tomorrow.
edit 2: So the more I look into this, the more I realize that this probably isn't a datastore issue anymore. The main issue with unformatted binary was all the headaches reading the data correctly (getting the block sizes and order right and being sure about it). HDF5 pretty much fixed that and there isn't going to be a faster option until the file system limitations are improved (thanks Matt Turk).
The new issues probably come down to data structure. HDF5 is as performant as we can get, even if it is not the nicest interface to query against. Being used to databases, I thought it would be really interesting/powerful to be able to query something like "give me all particles with velocity over x at any time". You can do something like that now, but you have to work at a lower level. Of course, given how big the data is and depending on what you are doing with it, it might be a good thing to work at a low level for performance sake.
MongoDB: http://www.mongodb.org/
Netezza
Products:
http://www.netezza.com/data-warehouse-appliance-products/skimmer.aspx
Hadoop: http://hadoop.apache.org/
Wikipedia's List of Distributed File
Systems:
http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems
EDIT
Rationale for my lack of explanation / etc.:
OP says: "[HDF5]'s a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this."
What does "better" mean? Better structured? He seems to allude to the "unformatted binary files" as being an issue - so maybe that's what he means by better. If so, he'll need something with some structure - hence the first couple suggestions.
OP says: "I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?"
Yes, there are several. Both structured, and "unstructured" - does he want structure, or is he happy to leave them in some sort of "unformatted binary format"? We still don't know - so I suggest checking out some Distributed File Systems.
OP says: "If it helps, we typically store data columnwise. That is, you have a block of all particle id's, block of all particle positions, block of particle velocites, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume."
Again, Does the OP want better structure, or doesn't he? Seems like he wants both - better structure AND faster.... maybe scaling OUT will give him this. This further reinforces the first few options I listed.
OP says (in comments): "I don't know if we can take the hit on io though."
Are there IO requirements? Cost restrictions? What are they?
We can't get something for nothing here. There is no "silver-bullet" storage solution. All we have to go on here for requirements is "lots of data" and "I don't know if I like the lack of structure, but I'm not willing to increase my IO to accommodate any additional structure"... so I don't know what kind of answer he's expecting. He hasn't listed a single complaint about the current solution he has other than the lack of structure - and he's already said he's not willing to pay any overhead to do anything about that... so.... ?
This program I'm doing is about a social network, which means there are users and their profiles. The profiles structure is UserProfile.
Now, there are various possible Graph implementations and I don't think I'm using the best one. I have a Graph structure and inside, there's a pointer to a linked list of type Vertex. Each Vertex element has a value, a pointer to the next Vertex and a pointer to a linked list of type Edge. Each Edge element has a value (so I can define weights and whatever it's needed), a pointer to the next Edge and a pointer to the Vertex owner.
I have a 2 sample files with data to process (in CSV style) and insert into the Graph. The first one is the user data (one user per line); the second one is the user relations (for the graph). The first file is quickly inserted into the graph cause I always insert at the head and there's like ~18000 users. The second file takes ages but I still insert the edges at the head. The file has about ~520000 lines of user relations and takes between 13-15mins to insert into the Graph. I made a quick test and reading the data is pretty quickly, instantaneously really. The problem is in the insertion.
This problem exists because I have a Graph implemented with linked lists for the vertices. Every time I need to insert a relation, I need to lookup for 2 vertices, so I can link them together. This is the problem... Doing this for ~520000 relations, takes a while.
How should I solve this?
Solution 1) Some people recommended me to implement the Graph (the vertices part) as an array instead of a linked list. This way I have direct access to every vertex and the insertion is probably going to drop considerably. But, I don't like the idea of allocating an array with [18000] elements. How practically is this? My sample data has ~18000, but what if I need much less or much more? The linked list approach has that flexibility, I can have whatever size I want as long as there's memory for it. But the array doesn't, how am I going to handle such situation? What are your suggestions?
Using linked lists is good for space complexity but bad for time complexity. And using an array is good for time complexity but bad for space complexity.
Any thoughts about this solution?
Solution 2) This project also demands that I have some sort of data structures that allows quick lookup based on a name index and an ID index. For this I decided to use Hash Tables. My tables are implemented with separate chaining as collision resolution and when a load factor of 0.70 is reach, I normally recreate the table. I base the next table size on this Link.
Currently, both Hash Tables hold a pointer to the UserProfile instead of duplication the user profile itself. That would be stupid, changing data would require 3 changes and it's really dumb to do it that way. So I just save the pointer to the UserProfile. The same user profile pointer is also saved as value in each Graph Vertex.
So, I have 3 data structures, one Graph and two Hash Tables and every single one of them point to the same exact UserProfile. The Graph structure will serve the purpose of finding the shortest path and stuff like that while the Hash Tables serve as quick index by name and ID.
What I'm thinking to solve my Graph problem is to, instead of having the Hash Tables value point to the UserProfile, I point it to the corresponding Vertex. It's still a pointer, no more and no less space is used, I just change what I point to.
Like this, I can easily and quickly lookup for each Vertex I need and link them together. This will insert the ~520000 relations pretty quickly.
I thought of this solution because I already have the Hash Tables and I need to have them, then, why not take advantage of them for indexing the Graph vertices instead of the user profile? It's basically the same thing, I can still access the UserProfile pretty quickly, just go to the Vertex and then to the UserProfile.
But, do you see any cons on this second solution against the first one? Or only pros that overpower the pros and cons on the first solution?
Other Solution) If you have any other solution, I'm all ears. But please explain the pros and cons of that solution over the previous 2. I really don't have much time to be wasting with this right now, I need to move on with this project, so, if I'm doing to do such a change, I need to understand exactly what to change and if that's really the way to go.
Hopefully no one fell asleep reading this and closed the browser, sorry for the big testament. But I really need to decide what to do about this and I really need to make a change.
P.S: When answering my proposed solutions, please enumerate them as I did so I know exactly what are you talking about and don't confuse my self more than I already am.
The first approach is the Since the main issue here is speed, I would prefer the array approach.
You should, of course, maintain the hash table for the name-index lookup.
If I understood correctly, you only process the data one time. So there is no dynamic data insertion.
To deal with the space allocation problem, I would recommend:
1 - Read once the file, to get the number of vertex.
2 - allocate that space
If you data is dynamic, you could implement some simple method to increment the array size in steps of 50%.
3 - In the Edges, substitute you linked list for an array. This array should be dynamically incremented with steps of 50%.
Even with the "extra" space allocated, when you increment the size with steps of 50%, the total size used by the array should only be marginally larger than with the size of the linked list.
I hope I could help.
I kinda need to decide on this to see if I can achieve it in a couple of hours before the deadline for my school project is due but I don't understand much about data structures and I need suggestions...
There's 2 things I need to do, they will probably use different data structures.
I need a data structure to hold profile records. The profiles must be search able by name and social security number. The SSN is unique, so I probably can use that for my advantage? I suppose hash maps is the best bet here? But how do I use the SSN in an hash map to use that as an advantage in looking for a specific profile? A basic and easy to understand explanation would be much appreciated.
I need a data structure to hold records about cities. I need to know which are cities with most visitors, cities less visited and the clients (the profile is pulled from the data structure in #1 for data about the clients) that visit a specific city.
This is the third data structure I need for my project and it's the data structure that I have no idea where to begin. Suggestions as for which type of data structure to use are appreciated, if possible, with examples on how to old the data above in bold.
As a note:
The first data structure is already done (I talked about it in a previous question). The second one is posted here on #1 and although the other group members are taking care of that I just need to know if what we are trying to do is the "best" approach. The third one is #2, the one I need most help.
The right answer lies anywhere between a balanced search tree and an array.
The situation you have mentioned here and else-thread misses out on a very important point: The size of the data you are handling. You choose your data structure and algorithm(s) depending on the amount of data you have to handle. It is important that you are able to justify your choice(s). Using a less efficient general algorithm is not always bad. Being able to back up your choices (e.g: choosing bubble-sort since data size < 10 always) shows a) greater command of the field and b) pragmatism -- both of which are in short supply.
For searchability across multiple keys, store the data in any convenient form, and provides fast lookup indexes on the key(s).
This could be as simple as keeping the data in an array (or linked list, or ...) in the order of creation, and keeping a bunch of {hashtables|sorted arrays|btrees} of maps (key, data*) for all the interesting keys (SSN, name, ...).
If you had more time, you could even work out how to not have a different struct for each different map...
I think this solution probably applies to both your problems.
Good luck.
For clarity:
First we have a simple array of student records
typedef
struct student_s {
char ssn[10]; // nul terminated so we can use str* functions
char name[100];
float GPA;
...
} student;
student slist[MAX_STUDENTS];
which is filled in as you go. It has no order, so search on any key is a linear time operation. Not a problem for 1,000 entries, but maybe a problem for 10,000, and certainly a problem for 1 million. See dirkgently's comments.
If we want to be able to search fast we need another layer of structure. I build a map between a key and the main data structure like this:
typedef
struct str_map {
char* key;
student *data;
} smap;
smap skey[MAX_STUDENTS]
and maintain skey sorted on the key, so that I can do fast lookups. (Only an array is a hassle to keep sorted, so we probably prefer a tree, or a hashmap.)
This complexity isn't needed (and should certainly be avoided) if you will only want fast searches on a single field.
Outside of a homework question, you'd use a relational database for
this. But that probably doesn't help you…
The first thing you need to figure out, as others have already pointed
out, is how much data you're handling. An O(n) brute-force search is
plenty fast as long a n is small. Since a trivial amount of data would
make this a trivial problem (put it in an array, and just brute-force
search it), I'm going to assume the amount of data is large.
Storing Cities
First, your search requirements appear to require the data sorted in
multiple ways:
Some city unique identifier (name?)
Number of visitors
This actually isn't too hard to satisfy. (1) is easiest. Store the
cities in some array. The array index becomes the unique identifier
(assumption: we aren't deleting cities, or if we do delete cities we can
just leave that array spot unused, wasting some memory. Adding is OK).
Now, we also need to be able to find most & fewest visits. Assuming
modifications may happen (e.g., adding cities, changing number of
visitors, etc.) and borrowing from relational databases, I'd suggest
creating an index using some form of balanced tree. Databases would
commonly use a B-Tree, but different ones may work for you: check Wikipedia's
article on trees. In each tree node, I'd just keep a pointer (or
array index) of the city data. No reason to make another copy!
I recommend a tree over a hash for one simple reason: you can very
easily do a preorder or reverse order traversal to find the top or
bottom N items. A hash can't do that.
Of course, if modifications may not happen, just use another array (of
pointers to the items, once again, don't duplicate them).
Linking Cities to Profiles
How to do this depends on how you have to query the data, and what form
it can take. The most general is that each profile can be associated
with multiple cities and each city can be associated with multiple
profiles. Further, we want to be able to efficiently query from either
direction — ask both "who visits Phoenix?" and "which cities does Bob
visit?".
Shamelessly lifting from databases again, I'd create another data
structure, a fairly simple one along the lines of:
struct profile_city {
/* btree pointers here */
size_t profile_idx; /* or use a pointer */
size_t city_idx; /* for both indices */
};
So, to say Bob (profile 4) has visited Phoenix (city 2) you'd have
profile_idx = 4 and city_idx = 2. To say Bob has visited Vegas (city
1) as well, you'd add another one, so you'd have two of them for Bob.
Now, you have a choice: you can store these either in a tree or a
hash. Personally, I'd go with the tree, since that code is already
written. But a hash would be O(n) instead of O(logn) for lookups.
Also, just like we did for the city visit count, create an index for
city_idx so the lookup can be done from that side too.
Conclusion
You now have a way to look up the 5 most-visited cities (via an in-order
traversal on the city visit count index), and find out who visits those
cities, by search for each city in the city_idx index to get the
profile_idx. Grab only unique items, and you have your answer.
Oh, and something seems wrong here: This seems like an awful lot of code for your instructor to want written in several hours!