I have GPS points for a million users, and I want to select users whose paths have crossed, meaning that they have points within a distance threshold (50 meters) and a time threshold (15 minutes) of each other. Each user has on average 80 points spread over multiple weeks, and each point has these attributes:
lat, lon, arrival_time, leave_time
This comparison needs to be done for every possible pair of users out of a million. I am currently building a Cartesian product of all the points of each pair of users, which makes the processing huge. Can anyone suggest a method that compares only points that are temporally or spatially close?
I am hoping to achieve this using PostGIS and Python.
I have a constant number of columns. They correspond to real-time coordinates of a few, maybe even a few hundred, points in space (a constant id plus the x, y coordinates of a detected pose in an OpenCV image; the images are analyzed grid by grid, so a lot of data comes in at once).
I read that Redis runs in RAM and that you can set a time after which the data is deleted.
Cassandra stores data in columns next to each other, so it should be suitable for fixed coordinates.
It would be nice to be able to perform operations on them, such as subtraction or multiplication.
I'm looking for a database that can write and read this data quickly without being resource-intensive.
thanks
I'm not too sure what the best place for this is
I'm working on an app that requires me to find points of interest that are within a specific radius of a user's location.
For example, I grab the user's location as lat and long coordinates and want to find all the items within a 20 mile radius.
Right now I have a MySQL database with 450,000 records, each containing a lat and long. I then run a prepared statement to grab X records within a 20 mile radius.
This is quite slow and intensive on the database.
Are there better ways to optimise lookups when using MySQL or is there a purpose built system?
Right now this is a hobby project so affording a service that does this may be out of my $0 budget.
Any and all suggestions are appreciated.
I was asked by the interviewer to design a system to store gigabytes of data and the system also has to support some kind of query.
Description:
A massive number of records is generated in an IDC. Each record is composed of a url, an IP which visits the url, and the time when the visit occurs. A record can probably be expressed as a struct like this, but I'm not sure which data types I should pick to represent the fields:
struct Record {
url; //char *
IP; //int?
visit_time; //time_t or simply a number?
}
Requirements:
Design a system to store 100 billion records. The system must also support at least 2 kinds of query:
First, given a time period (t1, t2) and an IP, query how many urls this IP has visited in the given period.
Second, given a time period (t1, t2) and a url, query how many times this url has been visited.
I was stumped, and here is my stupid solution:
Analysis:
Every query is performed over a given period of time, so:
1. Create a set, put all visit times into the set, and keep the set ordered by time from oldest to latest.
2. Create a hash table using hash(visit_time) as the key. This hash table is called the time-hash-table. Each node in a specific bucket has 2 pointers pointing to 2 other hash tables.
3. The 2 other hash tables are an ip-hash-table and a url-hash-table.
ip-hash-table uses hash(ip) as the key, and all the ips in the same ip-hash-table have the same visit-time;
url-hash-table uses hash(url) as the key, and all the urls in the same url-hash-table have the same visit-time.
Give a drawing as follows:
time_hastbl
[]
[]
[]-->[visit_time_i]-->[visit_time_j]...[visit_time_p]-->NIL
[] | |
[] ip_hastbl url_hastbl
[] []
: :
[] []
[] []
So, when doing the query upon (t1, t2):
find the closest match from the time set, let's say the match is (t1', t2'), then all the valid visit time will fall into the part of set starting from t1' to t2';
for each visit-time t in the time set[t1':t2'], do hash(t) and find t's ip_hastbl or url_hastbl, then count and log how many times the given ip or url appears.
Questions:
1. My solution is stupid; I hope you can give me another solution.
2. With respect to how to store the massive records on disk, any advice? I thought of a B-tree, but how would I use it, and is a B-tree even applicable in this system?
I believe the interviewer was expecting a solution based on distributed computing, especially since "100 billion records" are involved. With the limited knowledge of distributed computing I have, I would suggest you look into distributed hash tables and map-reduce (for parallel query processing).
In my opinion, create a B+ tree using time as the key to help you quickly locate on disk the range of records within the given time period (t1, t2). Then use the records within (t1, t2) to build the IP and URL hash tables respectively.
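To make the idea above concrete, here is a minimal in-memory sketch. It is not a B+ tree: a list sorted by time plus binary search (`bisect`) stands in for the time-keyed index, and a set / plain count stands in for the hash tables. The sample records and the function names are made up for illustration, and I am assuming "how many urls" means distinct urls.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample data: (visit_time, ip, url), kept sorted by visit_time.
# A real system would keep this ordering on disk in a time-keyed B+ tree.
records = sorted([
    (100, "10.0.0.1", "/a"),
    (105, "10.0.0.2", "/a"),
    (110, "10.0.0.1", "/b"),
    (120, "10.0.0.1", "/a"),
    (130, "10.0.0.2", "/c"),
])
times = [r[0] for r in records]


def records_between(t1, t2):
    """Locate the records with t1 <= visit_time <= t2 via two binary searches."""
    lo = bisect_left(times, t1)
    hi = bisect_right(times, t2)
    return records[lo:hi]


def urls_visited_by_ip(ip, t1, t2):
    """Query 1: number of distinct urls this ip visited in [t1, t2]."""
    return len({url for _, rec_ip, url in records_between(t1, t2) if rec_ip == ip})


def visits_to_url(url, t1, t2):
    """Query 2: number of visits to this url in [t1, t2]."""
    return sum(1 for _, _, rec_url in records_between(t1, t2) if rec_url == url)
```

Note that each query still walks every record in the time window, so the cost grows with the width of (t1, t2), not with the number of matches.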
Old question, but recently bumped so here's a few other things to think about:
What you need to consider is a few very simple boundary limits beyond your listed requirements, assuming you have no further indexes:
First, given a time period (t1, t2) and an IP, query how many urls this IP has visited in the given period.
If you have 10k users, then you can expect that, at worst, a scan of all records in a time window returns only about 1 in every 10k records accessed (on average).
Second, given a time period (t1, t2) and a url, query how many times this url has been visited.
Say you have 1000 urls in the system; then this again means that a simple scan results in 999 of every 1000 records scanned not being returned.
Let's say you have only 100,000 unique urls. You could greatly reduce the space consumed by the database (by using a guid / int foreign key instead); this also means the average url is accessed 1M times across your 100Bn records.
Even all this tells us nothing conclusive, because we don't have numbers / statistics on how clustered by time the records are for the given search times. Are we getting 1000 page requests every second and searching a 12 month time range, or are we getting 100 requests per second and searching a 1 hour time block (360k requests)?
Assuming the 100Bn represents 12 months of data that's 3170 requests per second. Does that sound reasonable?
Why is this important? Because it highlights one key thing you overlooked in your answer.
With 100Bn records in the past 12 months, that means in 12 months' time you'll have 200Bn records to deal with. If the 100Bn records cover 20 years then it's not such an issue; you can expect to grow by only another 25-30Bn in the next 5 years... but it's unlikely that your existing data spans such a long time frame.
Your solution only answers one side of the equation (reading data); you don't consider any complications with writing that much data. The vast majority of the time you will be inserting data into whatever data store you create. Will it be able to handle a constant 3k insert requests per second?
If you insert 3k records per second and each record is just 3x 64bit integers representing time (in ticks), IP address and a foreign key to the url, then that is only ~75KB/s of written data, which will be fine to maintain. If every URL is assumed to be unique, then you could easily run into performance issues due to IO speeds (ignoring the space requirements).
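The numbers above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch, assuming (as above) that the 100Bn records span 12 months of 365 days and that each record is three 64-bit integers; the constant names are mine.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # 31,536,000 seconds
TOTAL_RECORDS = 100_000_000_000          # 100Bn, assumed to cover 12 months
BYTES_PER_RECORD = 3 * 8                 # three 64-bit integers

# Average insert rate if the records arrive evenly over the year.
inserts_per_second = TOTAL_RECORDS / SECONDS_PER_YEAR

# Raw payload written per second, ignoring index and storage overhead.
write_bytes_per_second = inserts_per_second * BYTES_PER_RECORD

print(f"{inserts_per_second:.0f} inserts/s")
print(f"{write_bytes_per_second / 1024:.1f} KiB/s")
```

This is an average; real traffic is bursty, so peak insert rates could be several times higher.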
One other thing the interviewer would be interested in seeing is your thoughts on supporting IPv6.
Lastly, if you provided a solution like you have then the interviewer should have asked a followup question. "How would your system perform if I now want to know when a specific ip address last accessed a specific url?"
So yes, if you don't know about MapReduce and other distributed processing query systems then yours should be a reasonable answer.
It will be an interval tree which is also a B-Tree. An interval tree because all the queries have input as time interval only, and B-Tree due to the size of the input(billions).
I've got a rather tricky one, bear with me as I try not to stumble over my words here. I'm doing some research, and my group is transitioning to a cassandra database. Our research used MySQL before, but the data outgrew the database (192 million rows in memory # 16G -- it was the only way to query the data fast enough). The data itself is kinda-sorta static. There's a whole lot of it, but any new data is a somewhat slow trickle at this point.
The data consists of a boatload of classifier-score pairs. We formulate queries for the database which basically say, "give me the top 500 for the following classifiers". Then the database returns that many scores. For example, if we ask for the top 500 scores for 2 classifiers, we get back 1000 rows (each row consisting of a classifier ID and a score -- i.e. [4, 9100]). The scores themselves are non-uniform (the distribution tends to clump toward one end of the values -- which by the way are from -10000 to 10000)
As we transition to cassandra, there are a number of requirements. First of all, we need to be able to query for the top and bottom N scores on a per-classifier basis. Normally I can see that an ordered partitioner would be appropriate for this; however, like I said, the scores tend to clump at the extremes (which would put too much of a burden on one node). So my first question is: how do I evenly distribute the classifier/score pairs while still being able to query for the top or bottom N?
There is a secondary requirement which pretty much screws up the first one. Sometimes it is necessary to find all scores that are near another score. So if I see classifier 6 with a score of 400, I might ask, show me 500 scores that are the closest to that (all within classifier 6). I'm absolutely stumped about this one. I've read that cassandra supports secondary indices (yay) but only hash type (boo - no ranges). Do we create a separate ColumnFamily for this use case?
And finally, speed is paramount. The data is being used in an interactive GUI application. Ideally, queries should only take a few seconds. And if data all gets stuck on one particular node, it will slow things down.
We've tried all kinds of clever tricks. Our best idea was to put the data into buckets, so that the top 500 went into bucket 1, the next 500 went into bucket 2, and so on. The advantage is that to get the top 500 we just ask for bucket 1. Also all of the data WOULD be evenly distributed using a random partitioner. However since MOST of our queries are interested only in bucket 1, it would put a lot of burden on just one node (remember, if N classifiers are involved, it's actually 500 * N scores per bucket). The real disadvantage of this scheme is that it falls apart when we need to query based on nearness to a score (we'd have to do some kind of weird binary search over the buckets to find our starting value).
At this point we're running low on ideas. Everything I've seen about cassandra makes me wonder if it's even appropriate for this task. We chose it mainly because of its horizontal scalability, which is important (much easier to add a node than to shard an RDBMS). So I suppose my overall question is: how would you approach this? If cassandra, please address any of the above issues. Otherwise any insight or wisdom would be appreciated. Thanks.
Why not store the classifier as the column family row key and the score in the column name? Since columns are sorted, it is really fast to query the top/bottom 500 columns for a given classifier. The second type of query is also possible: when you are looking for the scores near s, you can for instance select the 500 columns before s and the 500 columns after s, and then filter down to the 500 columns nearest s.
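To illustrate the access pattern (not the Cassandra API), here is a small Python sketch where a dict of sorted lists plays the role of rows with sorted columns. The data and the function names are hypothetical; the slicing just mimics what the column slice queries would return.

```python
from bisect import bisect_left

# Hypothetical stand-in for the column family: classifier ID (row key) ->
# scores kept in sorted order, mimicking columns sorted by name within a row.
rows = {6: sorted([-9000, -50, 120, 390, 400, 405, 420, 800, 9100])}


def top_n(classifier, n):
    """Highest n scores, best first (a reversed slice from the end of the row)."""
    scores = rows[classifier]
    return scores[max(len(scores) - n, 0):][::-1]


def bottom_n(classifier, n):
    """Lowest n scores (a slice from the start of the row)."""
    return rows[classifier][:n]


def nearest_n(classifier, s, n):
    """The n scores closest to s: take n below s and n from s upward, then filter."""
    scores = rows[classifier]
    i = bisect_left(scores, s)
    window = scores[max(i - n, 0): i + n]
    return sorted(window, key=lambda v: abs(v - s))[:n]
```

For example, `nearest_n(6, 400, 3)` only ever looks at six neighbouring scores, regardless of how many scores the classifier has.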
I have a real estate application and a "house" contains the following information:
house:
- house_id
- address
- city
- state
- zip
- price
- sqft
- bedrooms
- bathrooms
- geo_latitude
- geo_longitude
I need to perform an EXTREMELY fast (low latency) retrieval of all homes within a geo-coordinate box.
Something like the SQL below (if I were to use a database):
SELECT * FROM houses
WHERE geo_latitude BETWEEN xxx AND yyy
AND geo_longitude BETWEEN www AND zzz
Question: What would be the quickest way for me to store this information so that I can perform the fastest retrieval of data based on latitude & longitude? (e.g. database, NoSQL, memcache, etc)?
This is a typical query for a Geographical Information System (GIS) application. Many of these are solved by using quad-tree, or similar spatial, indices. The tiling mentioned is how these often end up being implemented.
If an index containing the coordinates could fit into memory and the DBMS had a decent optimiser, then a table scan could provide a Cartesian distance from any point of interest with tolerably low overhead. If this is too slow, then the query could be pre-filtered by comparing each coordinate axis separately before doing the full distance calculation.
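A rough Python sketch of that pre-filter idea follows. Two things here are my assumptions rather than part of the answer: I use the haversine great-circle formula in place of a plain Cartesian distance (coordinates are lat/long), and the bounding box is only valid away from the poles and the 180° meridian. The function names are made up.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(points, lat, lon, radius_m):
    """Return the (lat, lon) points within radius_m of the center."""
    # Half-widths in degrees of a box that encloses the search circle.
    delta = radius_m / EARTH_RADIUS_M
    dlat = math.degrees(delta)
    ratio = math.sin(delta) / math.cos(math.radians(lat))
    dlon = math.degrees(math.asin(min(1.0, ratio)))
    hits = []
    for p_lat, p_lon in points:
        # Cheap per-axis comparison first; most points are rejected here.
        if abs(p_lat - lat) > dlat or abs(p_lon - lon) > dlon:
            continue
        # Full distance calculation only for points inside the box.
        if haversine_m(lat, lon, p_lat, p_lon) <= radius_m:
            hits.append((p_lat, p_lon))
    return hits
```

In a database the per-axis comparison is the part an ordinary index on each coordinate can serve; the distance calculation then only runs on the rows that survive it.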
MongoDB supports geospatial indexes, but there are ways to reduce the computation time for things like this. Depending on how your data is arranged, you can place houses in identifiable 'tiles' and then fetch all houses for a given tile and, from that reduced dataset, sort based on distance from whatever coordinates you have.
Depending on how many tiles there are, you can use bitmasks to find houses that may be near or overlap multiple tiles.
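A minimal sketch of the tiling idea, applied to the bounding-box query from the question. The tile size, the function names and the sample dicts are all hypothetical; only the `geo_latitude` / `geo_longitude` field names come from the question. It uses a plain dict keyed by tile rather than bitmasks.

```python
import math
from collections import defaultdict

TILE_DEG = 0.01  # hypothetical tile size in degrees; tune to your data density


def tile_of(lat, lon):
    """Map a coordinate to the (row, col) of the tile that contains it."""
    return (math.floor(lat / TILE_DEG), math.floor(lon / TILE_DEG))


def build_index(houses):
    """Group houses by tile so a query only touches nearby tiles."""
    index = defaultdict(list)
    for house in houses:
        index[tile_of(house["geo_latitude"], house["geo_longitude"])].append(house)
    return index


def houses_in_box(index, lat_min, lat_max, lon_min, lon_max):
    """Return houses inside the box by visiting only the tiles it overlaps."""
    r0, c0 = tile_of(lat_min, lon_min)
    r1, c1 = tile_of(lat_max, lon_max)
    found = []
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            for h in index.get((r, c), ()):
                # Edge tiles are only partly covered, so check each house exactly.
                if (lat_min <= h["geo_latitude"] <= lat_max
                        and lon_min <= h["geo_longitude"] <= lon_max):
                    found.append(h)
    return found
```

The tile size is a trade-off: small tiles mean fewer wasted comparisons per tile but more tiles to visit for a large box.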
I'm going to assume that you're doing lots more reads than writes, and you don't need to have your database distributed across dozens of machines. If so, you should go for a read-optimized database like SQLite (my personal preference) or MySQL, and use exactly the SQL query you suggest.
Most (not all) NoSQL databases end up being overly complicated for queries of this sort, since they're better at looking up exact values in their indexes rather than ranges.
It's nice that you're looking for a bounding box instead of cartesian distance; the latter would be harder for a SQL database to optimize (although you could narrow it to a bounding box, then do the slower cartesian distance calculation).