How to efficiently query large multi-polygons with PostGIS

I am working with radio maps that seem to be too fragmented to query efficiently. The response time is 20-40 seconds when I ask if a single point is within the multipolygon (I have tested "within"/"contains"/"overlaps"). I use PostGIS, with GeoDjango to abstract the queries.
The multi-polygon column has a GiST index, and I have tried VACUUM ANALYZE. I use PostgreSQL 8.3.7 and Django 1.2.
The maps stretch over large geographical areas. They were originally generated by a topography-aware radio tool, and the radio cells/polygons are therefore fragmented.
My goal is to query for points within the multipolygons (i.e. houses that may or may not be covered by the signals).
All the radio maps are made up of between 100,000 and 300,000 vertices in total, with a wildly varying number of polygons. Some of the maps have fewer than 10 polygons; from there it jumps to between 10,000 and 30,000 polygons. The ratio of polygons to vertices does not seem to affect the query time very much.
I use a projected coordinate system, and use the same system for both houses and radio sectors. Qgis shows that the radio sectors and maps are correctly placed in the terrain.
My test queries are with only one house at a time within a single radio map. I have tested queries like "within"/"contains"/"overlaps", and the results are the same:
Sub-second response if the house is "far from" the radio map (I guess this is because it is outside the bounding box that is automatically used in the query).
20-40 seconds response time if the house/point is close to or within the radio map.
Do I have alternative ways to optimize queries, or must I change/simplify the source material in some way? Any advice is appreciated.
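For reference, the kind of query being timed looks roughly like this in GeoDjango; the model and field names below are simplified placeholders, not the actual code from the project:

```python
# Hypothetical GeoDjango models mirroring the setup described in the question.
from django.contrib.gis.db import models

class RadioMap(models.Model):
    name = models.CharField(max_length=100)
    coverage = models.MultiPolygonField(srid=3006)  # projected CRS; SRID is a placeholder

class House(models.Model):
    location = models.PointField(srid=3006)

# The slow query: is this house inside the radio map's multipolygon?
def is_covered(house, radio_map):
    return RadioMap.objects.filter(
        pk=radio_map.pk,
        coverage__contains=house.location,
    ).exists()
```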

Hello
The first thing I would do is split the multipolygons into single polygons and create a new index. Then the index will work a lot more effectively. Right now the whole multipolygon has one big bounding box, and the index can do nothing more than tell whether the house is inside that bounding box. So, the smaller the polygons are in relation to the whole dataset, the more effective the index use. There are even techniques for splitting single polygons into smaller ones with a grid, to make the index part of the query even more effective. But the first step would be to split the multipolygons into single ones with ST_Dump(). If you have a lot of attributes in the same table, it would be wise to move them into a separate table and keep only an ID telling which radio map each polygon belongs to. Otherwise you will get a lot of duplicated attribute data.
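A rough sketch of that split, run from Python with psycopg2 since you are on GeoDjango anyway; the table and column names (radiomap, id, geom) are placeholders for your own:

```python
# Rough sketch: split every multipolygon into its component polygons and
# index the result. Table/column names (radiomap, id, geom) are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=gis")   # adjust connection parameters
cur = conn.cursor()

# One row per simple polygon, keeping only an id pointing back to the radio map.
cur.execute("""
    CREATE TABLE radiomap_split AS
    SELECT id AS radiomap_id,
           (ST_Dump(geom)).geom AS geom
    FROM radiomap;
""")

# A GiST index over the small polygons gives much tighter bounding boxes.
cur.execute("CREATE INDEX radiomap_split_geom_idx "
            "ON radiomap_split USING GIST (geom);")
cur.execute("ANALYZE radiomap_split;")
conn.commit()
```

A point-in-polygon test against the new table then only has to visit the few small polygons whose bounding boxes actually touch the point, e.g. WHERE radiomap_id = %s AND ST_Contains(geom, %s).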
HTH
Nicklas

Related

How do I design a database schema to compute IOU for two regions (coordinates)

Short Question
I want to store the coordinates of multiple polygons (regions) in a Postgres database. What I need is to be able to get all region pairs which have an IOU (Intersection over Union) > 0.5 (or any other threshold). If one region has more than one matching region, pick the one with the highest IOU.
It would be very helpful if someone could give me an approach for how the schema should look and what kind of SQL queries would be needed to achieve this.
Long Question
Context: users and AI models add annotations to files (images) on our platform. Let's say the AI draws 2 boxes on an image: Box1 with label l1 and Box2 with label l2. The user draws 1 box on the same image, Box3, with label l1.
There could be millions of such files and we want to compute various detection and classification metrics from the above information.
Detection metrics would be based on whether a box detected by the AI matches the user's box; we rely on IOU to decide whether two boxes match.
Classification metrics would be computed on top of the boxes determined to be correct by IOU, by checking whether the label given by the user is among the labels given by the AI.
I want an approach for what kind of DB schema should be used for this kind of problem, and a sense of how complex the SQL queries would be in terms of performance.
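One possible sketch, assuming the PostGIS extension is available so that ST_Intersection and ST_Union can compute the IOU pairwise; the table, column, and label names below are invented for illustration, not a definitive schema:

```python
import psycopg2

conn = psycopg2.connect("dbname=annotations")
cur = conn.cursor()

# One row per box: which file it belongs to, who drew it (ai/user), and its label.
cur.execute("""
    CREATE TABLE IF NOT EXISTS regions (
        id      serial PRIMARY KEY,
        file_id integer NOT NULL,
        source  text    NOT NULL,          -- 'ai' or 'user'
        label   text    NOT NULL,
        geom    geometry(Polygon) NOT NULL
    );
    CREATE INDEX IF NOT EXISTS regions_geom_idx ON regions USING GIST (geom);
""")

# For every user box, the best-matching AI box on the same file with IOU > 0.5.
cur.execute("""
    SELECT DISTINCT ON (u.id)
           u.id AS user_box, a.id AS ai_box,
           ST_Area(ST_Intersection(u.geom, a.geom)) /
           ST_Area(ST_Union(u.geom, a.geom)) AS iou
    FROM regions u
    JOIN regions a
      ON a.file_id = u.file_id
     AND u.source = 'user' AND a.source = 'ai'
     AND u.geom && a.geom                      -- index-assisted bounding-box check
    WHERE ST_Area(ST_Intersection(u.geom, a.geom)) /
          ST_Area(ST_Union(u.geom, a.geom)) > 0.5
    ORDER BY u.id, iou DESC;                   -- keep the highest IOU per user box
""")
for row in cur.fetchall():
    print(row)
```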

Pre-Process polygons and linestrings into grid areas to partition data

Due to the size, number, and performance of my polygon queries (polygon in polygon), I would like to pre-process my data and separate the polygons into grid cells. My data is pretty uniform in my area of interest, so roughly 12 even grid cells would work well; I may adjust this number later based on performance. Basically I am going to create 12 tables with associated spatial indexes, or possibly just a single table with my grid as the partition key. This will reduce my total index size 12x and hopefully increase performance. From the query side I will direct the query to the appropriate table.
The key is for me to be able to figure out how to group polygons into these grids. If the polygon falls within multiple grids then I would likely create a record in each and de-duplicate upon query. I wouldn't expect this to happen very often.
Essentially I will have a "grid" that I want to intersect with my polygons to figure out which grid cells each polygon falls in.
Thanks
My process would be something like this:
Find the MIN/MAX ordinate values for your whole data set (both axes)
Extend those values by a margin that seems appropriate (in case the ordinates when combined don't form a regular rectangular shape)
Write a small loop that generates polygons at a set interval within those MIN/MAX ordinates - i.e. create one polygon per grid square
Use SDO_COVERS to see which of the grid squares cover each polygon. If multiple grid squares cover a polygon, you should see multiple matches, as you describe (a rough sketch of these steps follows below).
I also agree with your strategy of partitioning the data within a single table. I have heard positive comments about this, but I have never personally tried it. The overhead of going to multiple tables seems like something you'll want to avoid though.
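To make the grid steps above concrete, here is a rough, database-agnostic sketch using Shapely rather than SDO_COVERS; the grid dimensions, margin, and names are arbitrary:

```python
# Rough sketch: build a regular grid over the data extent and record which
# grid cell(s) each polygon touches. Uses Shapely; all sizes are arbitrary.
from shapely.geometry import box

def build_grid(minx, miny, maxx, maxy, nx=4, ny=3, margin=0.01):
    """One rectangle per grid square, with the extent padded by a small margin."""
    pad_x, pad_y = (maxx - minx) * margin, (maxy - miny) * margin
    minx, miny, maxx, maxy = minx - pad_x, miny - pad_y, maxx + pad_x, maxy + pad_y
    w, h = (maxx - minx) / nx, (maxy - miny) / ny
    return {(i, j): box(minx + i * w, miny + j * h,
                        minx + (i + 1) * w, miny + (j + 1) * h)
            for i in range(nx) for j in range(ny)}

def assign(polygons, grid):
    """Yield (cell_key, polygon); a polygon spanning cells appears once per cell."""
    for poly in polygons:
        for key, cell in grid.items():
            if cell.intersects(poly):
                yield key, poly

# Toy data: two rectangles somewhere inside a 0..100 extent.
polygons = [box(10, 10, 20, 20), box(48, 48, 60, 52)]
grid = build_grid(0, 0, 100, 100)
for key, poly in assign(polygons, grid):
    print(key, poly.bounds)
```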

Databasing to feed ~9k data points to High Charts

I am working on a freelance project that captures an audio file, runs some Fourier analysis, and spits out three charts (x-y plots). Each chart has about 3,000 data points, which I plan to display with High Charts in the browser.
What database techniques do you recommend for storing and accessing this much data? Should I be storing the points in an array or in multiple rows? I'm considering Mongo too. The plan is to use Rails, so I was hoping to use a single database for both data and authentication.
I haven't dealt with queries accessing this much data for a single page, and this may very well be a tiny amount of data overall. In addition, this is an MVP for demonstration to investors, so making it scalable to huge levels isn't of immediate concern.
My initial thought is that using Postgres and having one large table of data points, stored per-row, will be fine, and that a bunch of doubles is not going to be too memory-intensive relative to images and such.
Realistically, I may just pull 100 evenly-spaced data points to make the chart, but the original data must still be stored.
I've done a lot of Mongo work and I can tell you what I would do if I were you.
One of the very nice properties of your data is that the x,y coordinates are generally of a fixed size. In other words, it's not like you are storing comments from users, which can vary greatly in size.
With Mongo I would first make a sample document with the 3,000 points. Just a simple array of x,y points. I would see how big that document is and how my front end handled it - in other words can High Charts handle that?
I would also try to stick to the easiest conceptual model to manage, which is one document per chart, each chart having 3k points. This is a natural way to think of the data and I would start there and see if there were any performance hits. Mongo can easily store those documents, so I think the biggest pain would be in the UI with rendering the data.
Mongo would handle authentication well. I think it's a good choice for general data storage for an MVP.
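As a rough illustration of the one-document-per-chart idea with pymongo (the database, collection, and field names here are made up):

```python
# Sketch: one document per chart, with the ~3,000 (x, y) points embedded as a
# single array. A document this size is far below MongoDB's 16 MB limit.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
charts = client.audio_demo.charts          # database/collection names assumed

points = [{"x": i * 0.01, "y": 0.0} for i in range(3000)]  # placeholder data
charts.insert_one({
    "recording_id": "demo-001",
    "kind": "fft_magnitude",
    "points": points,
})

# Fetch everything the front end needs for one chart in a single round trip.
doc = charts.find_one({"recording_id": "demo-001", "kind": "fft_magnitude"})
print(len(doc["points"]))
```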

Map displaying many measures

I need to create an SSRS report with many measures in a map.
For example, how do I display Total sales in colour and Prices in bullets?
There is a map component in SSRS, but you'll need a spatial data set that defines the points or shapes you want your measures mapped to. It comes with a bunch of US-centric ones, and maybe a world one at the country level.
I'm not sure about how much information you could put on the annotations, but you can also combine traditional reporting techniques with the map and have a side bar with your metrics.
One caveat: I think the maximum size of the map component is 36 inches (the last time I used it, anyway); if you're looking for something larger, you'll likely need to use some GIS software instead.

How to best do server-side geo clustering?

I want to do pre-clustering for a set of approx. 500,000 points.
I haven't started yet, but this is what I thought I would do:
store all points in a localSOLR index
determine "natural cluster positions" according to some administrative information (big cities for example)
and then calculate a cluster for each city:
for each city
for each zoom level
query the index to get the points contained in a radius around the city (the length of the radius depends on the zoom level)
This should be quite efficient because there are only 100 major cities and SOLR queries are very fast. But a little more thinking revealed it was wrong:
there may be clusters of points that are more "near" one another than near a city: they should get their own cluster
at some zoom levels, some points will not be within the acceptable distance of any city, and so they will not be counted
some cities are near one another and therefore, some points will be counted twice (added to both clusters)
There are other approaches:
examine each point and determine to which cluster it belongs; this eliminates problems 2 and 3 above, but not 1, and is also extremely inefficient
make a (rectangular) grid (for each zoom level); this works but will result in crazy / arbitrary clusters that don't "mean" anything
I guess I'm looking for a general purpose geo-clustering algorithm (or idea) and can't seem to find any.
Edit to answer comment from Geert-Jan
I'd like to build "natural" clusters, yes, and yes I'm afraid that if I use an arbitrary grid, it will not reflect the reality of the data. For example if there are many events that occur around a point that is at or near the intersection of two rectangles, I should get just one cluster but will in fact build two (one in each rectangle).
Originally I wanted to use localSOLR for performance reasons (and because I know it, and have better experience indexing a lot of data into SOLR than loading it in a conventional database); but since we're talking of pre-clustering, maybe performance is not that important (although it should not take days to visualize a result of a new clustering experiment). My first approach of querying lots of points according to a predefined set of "big points" is clearly flawed anyway, the first reason I mentioned being the strongest: clusters should reflect the reality of the data, and not some other bureaucratic definition (they will clearly overlap, sure, but data should come first).
There is a great clusterer for live clustering that has been added to the core Google Maps API: Marker Clusterer. I wonder if anyone has tried to run it "offline": run it for however long it needs, and then store the results?
Or is there a clusterer that examines each point, point after point, and outputs clusters with their coordinates and number of points included, and which does this in a reasonable amount of time?
You may want to look into advanced clustering algorithms such as OPTICS.
With a good database index, it should be fairly fast.
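For example, a minimal offline sketch with scikit-learn's OPTICS implementation; the parameter values are guesses that would need tuning, and a small random sample stands in here for the real ~500,000 points:

```python
# Sketch: cluster lat/lon points once, offline, then store the labels/centroids.
import numpy as np
from sklearn.cluster import OPTICS

# points: shape (n, 2) as (lat, lon) in degrees; random data stands in here.
rng = np.random.default_rng(0)
points = rng.uniform(low=[55.0, 10.0], high=[60.0, 15.0], size=(5000, 2))

# The haversine metric expects radians; max_eps of ~2 km on an Earth radius of ~6371 km.
clustering = OPTICS(
    min_samples=20,
    max_eps=2.0 / 6371.0,
    metric="haversine",
).fit(np.radians(points))

labels = clustering.labels_          # -1 marks noise points
for cluster_id in set(labels) - {-1}:
    members = points[labels == cluster_id]
    print(cluster_id, len(members), members.mean(axis=0))  # size and rough centroid
```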
