PostGIS: How to efficiently query a large number of polygons in chunks? - postgis

I am developing a CAD program that allows a user to pan around very large layers of polygons. If the user is zoomed out, for example, then I may need to render millions of polygons. I would like to draw this scene for the user gradually, rendering a few polygons every frame, but that means I will need a query that is able to select a rectangular area of polygons from a PostGIS database in chunks. Here is the table and index that I'm using:
create table POLYGON(ID serial8 primary key,
LAYER_ID bigint not null,
GEOM geometry(polygon,0) not null);
-- Note: a GiST index that mixes a bigint and a geometry column needs the
-- btree_gist extension (create extension btree_gist;) for the bigint part.
create index on POLYGON using gist(LAYER_ID, GEOM);
I want to perform the following query:
select POLYGON.ID, ST_AsBinary(POLYGON.GEOM) from POLYGON
where LAYER_ID = ? and (ST_MakeEnvelope(?, ?, ?, ?, 0) && GEOM);
But that query may take a very long time and return a large number of results. Is there some way I can divide this huge query into smaller queries?
I considered using OFFSET and LIMIT, but I don't think that will work, because my results are not ordered. Also, as far as OFFSET goes, I believe that Postgres still has to scan everything that it skips.
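One workaround might be keyset pagination: impose an order on ID and carry the largest ID from the previous chunk forward, so each chunk picks up where the last one left off. A rough sketch (the chunk size and the extra ID placeholder are arbitrary, and I don't know how well the planner copes with the per-chunk sort):
-- Fetch one chunk at a time; pass the largest ID from the previous chunk
-- back in as the last parameter (0 on the first call).
select POLYGON.ID, ST_AsBinary(POLYGON.GEOM) from POLYGON
where LAYER_ID = ? and (ST_MakeEnvelope(?, ?, ?, ?, 0) && GEOM) and POLYGON.ID > ?
order by POLYGON.ID
limit 1000;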
I could try dividing the polygons into groups myself, so that I could retrieve one group at a time, but hasn't the "gist" index already internally partitioned my geometries in some way? I don't want to build another system on top of the "gist" index unless I absolutely have to.
Any suggestions would be greatly appreciated.

Related

Azure Database Large Table Group By Performance

I'm looking for design and/or index recommendations for the problem listed below.
I have a couple of denormalized tables in an Azure S1 Standard (20 DTU) database. One of those tables has ~20 columns and a million rows. My application requirements need me to support sub-second (or at least close to it) querying of this table by any combination of columns in my WHERE clause, as well as sub-second (or at least close to it) querying of DISTINCT values in each column.
In order to picture the use case behind this, here is an example. Imagine you were using an HR application that allowed you to search for employees and view employee information. The employee table might have 5 columns and millions of rows. The application allows you to filter by any column, and provides an interface to allow this. Therefore, the underlying SQL queries that must be made are:
A GROUP BY (or DISTINCT) query for each column, which provides the interface with the available filter options
A general employee search query, that filters all rows by any combination of filters
In order to solve performance issues on the first set of queries, I've implemented the following:
Index columns with a large variety of values
Full-Text index columns that require string matching (So CONTAINS querying instead of LIKE)
Do not index columns with a small variety of values
In order to solve the performance issues on the second query, I've implemented the following:
Forcing the front end to use pagination, implemented using SELECT * FROM table ORDER BY <sort column> OFFSET 0 ROWS FETCH NEXT n ROWS ONLY, and ensuring the ORDER BY column is indexed
Locally, this seemed to work fine. Unfortunately, an Azure Standard database doesn't have the same performance as my local machine, and I'm seeing issues. Specifically, the columns I am not indexing (the ones with a very small set of distinct values) are taking 30+ seconds to query. Additionally, while the paging is initially very quick, the query takes longer and longer the higher I increase the offset.
So I have two targeted questions, but any other advice or design suggestions would be most welcome:
How bad is it to index every column in the table? Note that the table does need to be updated, but the columns that I update won't actually be part of any filters or WHERE clauses. Will the indexes still need to be rebuilt on update? You can also safely assume that the table will not see any inserts/deletes, except for once a month when the entire table is truncated and rebuilt from scratch.
Regarding the paging getting slower and slower the deeper I go, I've read this is expected, but the performance becomes unacceptable at a certain point. Outside of making my clustered column the sort-by column, are there any other suggestions to get this working?
Thanks,
-Tim
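For the slow DISTINCT/GROUP BY queries on the low-cardinality columns, one option worth trying is an indexed view that pre-aggregates each filterable column. This is only a sketch -- dbo.Employees and Department are stand-in names -- but the SCHEMABINDING, COUNT_BIG(*) and unique clustered index shown are what SQL Server requires before a view can be materialized:
CREATE VIEW dbo.vw_DistinctDepartments
WITH SCHEMABINDING
AS
SELECT Department, COUNT_BIG(*) AS RowCnt
FROM dbo.Employees
GROUP BY Department;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_DistinctDepartments
    ON dbo.vw_DistinctDepartments (Department);
GO
-- The filter-option query then reads a handful of rows instead of scanning
-- the million-row table:
SELECT Department FROM dbo.vw_DistinctDepartments WITH (NOEXPAND);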

Cluster on a spatial index

I am trying to cluster on spatial locality (not just create a spatial index), but SQL Server does not allow this. To create a spatial index it first requires a clustered primary key, and nothing makes sense to cluster on. I want to create a spatial index and then cluster on spatial location in some way.
My idea is to bin each geometry into a grid cell that is assigned some integer, then set that as the required clustered primary key; that way at least some of my data is clustered close together spatially.
I am kind of baffled that SQL Server doesn't do this already, so either I am missing how to do it, or (more likely) someone has already thought of this and can propose a good enough solution.
I want to cluster on spatial location because I am dealing with big data and the first filter I do is by spatial location (creating tiles of maps), without clustering on spatial location my pages are now scattered based on some meaningless auto increment integer.
If a simple implementation of binning by spatial location hasn't been proposed, I figured I could just cut the bounds of my geometry into equal squares and then for each center point run a distance formula that includes all geometries that intersect that bin.
This is not specific to SQL Server per se; I am looking for general approaches to this indexing/clustering on spatial location. I assume non-MSSQL databases may come with this functionality built in.
I don't see how this would be possible, regardless of implementation. Specifically, the idea of a clustering key is so that you (the db engine) can tell the order in which rows should be stored. This is possible with every other datatype (and combination thereof) because ultimately you can say whether a given tuple is bigger, smaller, or equal to another. What metric would you use for generalized spatial data to say that one instance is bigger or smaller than another? Size? Proximity to the origin? Some other measure? There isn't a well-defined sense of that in the general case, and so you can't do it.
But all is not lost. Just assign an arbitrary identifier to your rows (i.e. an identity column or a column populated by a sequence) and cluster on that. Then you can put a spatial index on that and go to town. Looking at your problem, if your bins are pre-defined, you can put those in another table and do a join using STIntersects. But that may be putting the cart before the horse.
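As a concrete illustration of the binning idea, the bin number can be computed by the loader and used as the leading column of the clustered primary key, with the spatial index layered on top. A sketch only -- the table name, grid size and bounding box are all invented:
-- BinId is a grid-cell number computed at load time, e.g.
-- floor(x / 1000) * 100000 + floor(y / 1000), so nearby geometries tend to
-- land on nearby pages.
CREATE TABLE dbo.Features
(
    BinId int      NOT NULL,
    Id    bigint   IDENTITY(1,1) NOT NULL,
    Geom  geometry NOT NULL,
    CONSTRAINT PK_Features PRIMARY KEY CLUSTERED (BinId, Id)
);
-- The spatial index still does the actual intersection filtering; the
-- BOUNDING_BOX below is a placeholder for the real data extent.
CREATE SPATIAL INDEX SIX_Features_Geom
    ON dbo.Features (Geom)
    USING GEOMETRY_GRID
    WITH (BOUNDING_BOX = (0, 0, 100000, 100000));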

Even partitioning of nonuniform ranged data in cassandra

I've got a rather tricky one, bear with me as I try not to stumble over my words here. I'm doing some research, and my group is transitioning to a Cassandra database. Our research used MySQL before, but the data outgrew the database (192 million rows in memory at 16 GB -- it was the only way to query the data fast enough). The data itself is kinda-sorta static. There's a whole lot of it, but any new data is a somewhat slow trickle at this point.
The data consists of a boatload of classifier-score pairs. We formulate queries for the database which basically say, "give me the top 500 for the following classifiers". Then the database returns that many scores. For example, if we ask for the top 500 scores for 2 classifiers, we get back 1000 rows (each row consisting of a classifier ID and a score -- e.g. [4, 9100]). The scores themselves are non-uniform (the distribution tends to clump toward one end of the range, which by the way runs from -10000 to 10000).
As we transition to Cassandra, there are a number of requirements. First of all, we need to be able to query for the top and bottom N scores on a per-classifier basis. Normally I can see that an ordered partitioner would be appropriate for this; however, like I said, the scores tend to clump at the extremes (which would put too much of a burden on one node). So my first question is: how do I evenly distribute the classifier/score pairs while still being able to query for the top or bottom N?
There is a secondary requirement which pretty much screws up the first one. Sometimes it is necessary to find all scores that are near another score. So if I see classifier 6 with a score of 400, I might ask: show me the 500 scores that are closest to that (all within classifier 6). I'm absolutely stumped about this one. I've read that Cassandra supports secondary indices (yay), but only hash type (boo -- no ranges). Do we create a separate ColumnFamily for this use case?
And finally, speed is paramount. The data is being used in an interactive GUI application. Ideally, queries should only take a few seconds. And if data all gets stuck on one particular node, it will slow things down.
We've tried all kinds of clever tricks. Our best idea was to put the data into buckets, so that the top 500 went into bucket 1, the next 500 went into bucket 2, and so on. The advantage is that to get the top 500 we just ask for bucket 1. Also all of the data WOULD be evenly distributed using a random partitioner. However since MOST of our queries are interested only in bucket 1, it would put a lot of burden on just one node (remember, if N classifiers are involved, it's actually 500 * N scores per bucket). The real disadvantage of this scheme is that it falls apart when we need to query based on nearness to a score (we'd have to do some kind of weird binary search over the buckets to find our starting value).
At this point we're running low on ideas. Everything I've seen about Cassandra makes me wonder if it's even appropriate for this task. We chose it mainly because of its horizontal scalability, which is important (much easier to add a node than to shard an RDBMS). So I suppose my overall question is: how would you approach this? If Cassandra, please address any of the above issues. Otherwise any insight or wisdom would be appreciated. Thanks.
Why not store the classifier as the column family row key and the score as the column name? Since columns are sorted, it is really fast to query the top/bottom 500 columns for a given classifier. The second type of query is also possible: when you are looking for the scores near s, you can, for instance, select the 500 columns before s and the 500 columns after s, and then filter down to the 500 columns nearest s.
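In CQL terms -- a sketch, with table and column names of my own choosing; the answer above describes the same layout in the older column-family vocabulary -- that is one wide partition per classifier with the score as a clustering column:
-- Partitions are spread evenly by the default partitioner; scores are kept
-- sorted inside each partition, and item_id just disambiguates equal scores.
CREATE TABLE scores_by_classifier (
    classifier int,
    score      int,
    item_id    bigint,
    PRIMARY KEY (classifier, score, item_id)
);
-- Bottom 500 for classifier 6 (clustering order is ascending by default):
SELECT score, item_id FROM scores_by_classifier
WHERE classifier = 6 LIMIT 500;
-- Top 500 for classifier 6:
SELECT score, item_id FROM scores_by_classifier
WHERE classifier = 6 ORDER BY score DESC, item_id DESC LIMIT 500;
-- 500 scores just above and just below 400; merge the two halves client-side:
SELECT score, item_id FROM scores_by_classifier
WHERE classifier = 6 AND score >= 400 LIMIT 500;
SELECT score, item_id FROM scores_by_classifier
WHERE classifier = 6 AND score <= 400 ORDER BY score DESC, item_id DESC LIMIT 500;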

Database/NoSQL - Lowest latency way to retrieve the following data

I have a real estate application and a "house" contains the following information:
house:
- house_id
- address
- city
- state
- zip
- price
- sqft
- bedrooms
- bathrooms
- geo_latitude
- geo_longitude
I need to perform an EXTREMELY fast (low latency) retrieval of all homes within a geo-coordinate box.
Something like the SQL below (if I were to use a database):
SELECT * FROM houses
WHERE geo_latitude BETWEEN xxx AND yyy
AND geo_longitude BETWEEN www AND zzz
Question: What would be the quickest way for me to store this information so that I can perform the fastest retrieval of data based on latitude & longitude? (e.g. database, NoSQL, memcache, etc)?
This is a typical query for a Geographical Information System (GIS) application. Many of these are solved by using quad-tree, or similar spatial, indices. The tiling mentioned is how these often end up being implemented.
If an index containing the coordinates could fit into memory and the DBMS had a decent optimiser, then a table scan could provide a Cartesian distance from any point of interest with tolerably low overhead. If this is too slow, then the query could be pre-filtered by comparing each coordinate axis separately before doing the full distance calculation.
MongoDB supports geospatial indexes, but there are ways to reduce the computation time for things like this. Depending on how your data is arranged, you can place houses in identifiable 'tiles' and then fetch all houses for a given tile and, from that reduced dataset, sort based on distance from whatever coordinates you have.
Depending on how many tiles there are, you can use bitmasks to find houses that may be near or overlap multiple tiles.
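A minimal sketch of that tiling idea in SQL (MySQL-flavoured; the 0.01-degree cell size and the tile encoding are arbitrary choices):
-- Precompute a tile number for each house, index it, and fetch whole tiles.
ALTER TABLE houses ADD COLUMN tile_id INT;
UPDATE houses
SET tile_id = FLOOR(geo_latitude / 0.01) * 100000
            + FLOOR(geo_longitude / 0.01);
CREATE INDEX idx_houses_tile ON houses (tile_id);
-- At query time, work out the few tile ids that cover the viewport in
-- application code and fetch them in one shot:
SELECT * FROM houses WHERE tile_id IN (?, ?, ?, ?);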
I'm going to assume that you're doing lots more reads than writes, and you don't need to have your database distributed across dozens of machines. If so, you should go for a read-optimized database like sqlite (my personal preference) or mysql, and use exactly the SQL query you suggest.
Most (not all) NoSQL databases end up being overly complicated for queries of this sort, since they're better at looking up exact values in their indexes rather than ranges.
It's nice that you're looking for a bounding box instead of Cartesian distance; the latter would be harder for a SQL database to optimize (although you could narrow the search to a bounding box first, then do the slower Cartesian distance calculation).
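If the plain relational route is enough, the minimal version is just a composite index plus the bounding-box query from the question (a sketch; the index name and bind parameters are placeholders):
-- Works in SQLite or MySQL; the index lets the latitude range be seeked,
-- with the longitude filter applied on top of it.
CREATE INDEX idx_houses_lat_lng ON houses (geo_latitude, geo_longitude);
SELECT house_id, address, price
FROM houses
WHERE geo_latitude  BETWEEN :lat_min AND :lat_max
  AND geo_longitude BETWEEN :lng_min AND :lng_max;
The composite index only really narrows on the leading latitude column, with longitude filtered on top; that is usually fine for box queries, and it is exactly the gap the tiling and quad-tree approaches above are meant to close.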

Why does SQL Server work faster when you index a table after filling it?

I have a sproc that puts 750K records into a temp table through a query as one of its first actions. If I create indexes on the temp table before filling it, the procedure takes about twice as long to run compared to when I index after filling the table. (The index is on a single integer column; the table being indexed has just two columns, each a single integer.)
This seems a little off to me, but then I don't have the firmest understanding of what goes on under the hood. Does anyone have an answer for this?
If you create a clustered index, it affects the way the data is physically ordered on the disk. It's better to add the index after the fact and let the database engine reorder the rows when it knows how the data is distributed.
For example, let's say you needed to build a brick wall with numbered bricks so that those with the highest number are at the bottom of the wall. It would be a difficult task if you were just handed the bricks in random order, one at a time - you wouldn't know which bricks were going to turn out to be the highest numbered, and you'd have to tear the wall down and rebuild it over and over. It would be a lot easier to handle that task if you had all the bricks lined up in front of you, and could organize your work.
That's how it is for the database engine - if you let it know about the whole job, it can be much more efficient than if you just feed it a row at a time.
It's because the database server has to do calculations each and every time you insert a new row. Basically, you end up reindexing the table each time. It doesn't seem like a very expensive operation, and it's not, but when you do that many of them together, you start to see the impact. That's why you usually want to index after you've populated your rows, since it will just be a one-time cost.
Think of it this way.
Given
unorderedList = {5, 1, 3}
orderedList = {1, 3, 5}
add 2 to both lists.
unorderedList = {5, 1, 3, 2}
orderedList = {1, 2, 3, 5}
What list do you think is easier to add to?
By the way, ordering your input before the load will give you a boost.
You should NEVER EVER create an index on an empty table if you are going to massively load it right afterwards.
Indexes have to be maintained as the data in the table changes, so imagine that for every insert into the table the index has to be updated (which is an expensive operation).
Load the table first and create the index after the load has finished.
That's where the performance difference is going.
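A sketch of that pattern for the temp-table case in the question (the names are placeholders):
-- Bulk-load first, with no index to maintain row by row...
CREATE TABLE #Pairs (KeyId int NOT NULL, Val int NOT NULL);
INSERT INTO #Pairs (KeyId, Val)
SELECT KeyId, Val
FROM dbo.SourceTable;   -- the ~750K-row query goes here
-- ...then build the index once over the finished data.
CREATE CLUSTERED INDEX IX_Pairs_KeyId ON #Pairs (KeyId);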
After performing large data manipulation operations, you frequently have to update the underlying statistics. You can do that by using the UPDATE STATISTICS [table] statement.
The other option is to drop and recreate the index which, if you are doing large data insertions, will likely perform the inserts much faster. You can even incorporate that into your stored procedure.
This is because, if the data you insert is not in index order, SQL Server will have to split pages to make room for the additional rows in order to keep them together logically.
This is due to the fact that when SQL Server indexes a table that already contains data, it is able to produce accurate statistics about the values in the indexed column. SQL Server will recalculate statistics from time to time, but when you perform massive inserts, the distribution of values may have changed since the statistics were last calculated.
The fact that statistics are out of date can be discovered in Query Analyzer: you will see that, for a certain table scan, the expected number of rows differs too much from the actual number of rows processed.
You should use UPDATE STATISTICS to recalculate the distribution of values after you insert all the data. After that, no performance difference should be observed.
If you have an index on a table, then as you add data to the table SQL Server has to re-order the rows to make room in the appropriate place for the new records. If you're adding a lot of data, it has to do that reordering over and over again. By creating an index only after the data is loaded, the re-ordering only needs to happen once.
Of course, if you are importing the records in index order it shouldn't matter so much.
In addition to the index overhead, running each insert as its own transaction is a bad idea for the same reason. If you run chunks of inserts (say 100) within one explicit transaction, you should also see a performance increase.
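Roughly, with an invented table, that means wrapping each chunk in one explicit transaction instead of letting every insert auto-commit:
BEGIN TRANSACTION;
INSERT INTO dbo.Target (KeyId, Val) VALUES (1, 10);
INSERT INTO dbo.Target (KeyId, Val) VALUES (2, 20);
-- ... roughly 100 inserts per batch ...
COMMIT TRANSACTION;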
