Due to the size, number, and performance of my polygon queries (polygon in polygon), I would like to pre-process my data and separate the polygons into grid cells. My data is pretty uniform in my area of interest, so something like 12 even grid cells would work well; I may adjust this number later based on performance. Basically I am going to create 12 tables with associated spatial indexes, or possibly a single table with the grid cell as the partition key. This will reduce my total index size 12x and hopefully increase performance. On the query side I will direct each query to the appropriate table.
The key is for me to figure out how to group polygons into these grid cells. If a polygon falls within multiple cells, I would likely create a record in each and de-duplicate upon query. I wouldn't expect this to happen very often.
Essentially I will have a "grid" that I want to intersect with my polygons, to figure out which grid cells each polygon falls in.
Thanks
My process would be something like this:
Find the MIN/MAX ordinate values for your whole data set (both axes)
Extend those values by a margin that seems appropriate (in case the combined extents don't form a regular rectangular shape)
Write a small loop that generates polygons at a set interval within those MIN/MAX ordinates - i.e. create one polygon per grid square
Use SDO_COVERS to see which grid square covers each polygon. Note that a polygon straddling a cell boundary won't be fully covered by any single square, so for those you would switch to an interaction test such as SDO_ANYINTERACT, which returns one match per cell the polygon touches - the multiple matches you describe. See the sketch below.
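A rough, untested sketch of the last two steps in Oracle Spatial (which SDO_COVERS implies you're on). The table names (grid_cells, my_polygons), columns, the 4 x 3 layout and the cell sizes are illustrative assumptions, not taken from your setup:

    DECLARE
      x_min  NUMBER := 0;   y_min  NUMBER := 0;    -- your padded MIN ordinates
      cell_w NUMBER := 10;  cell_h NUMBER := 10;   -- grid cell size on each axis
    BEGIN
      FOR gx IN 0 .. 3 LOOP        -- 4 columns
        FOR gy IN 0 .. 2 LOOP      -- 3 rows => 12 cells
          INSERT INTO grid_cells (grid_id, cell_geom)
          VALUES (gx * 3 + gy + 1,
                  SDO_GEOMETRY(2003, NULL, NULL,
                    SDO_ELEM_INFO_ARRAY(1, 1003, 3),   -- optimized rectangle
                    SDO_ORDINATE_ARRAY(
                      x_min + gx * cell_w,       y_min + gy * cell_h,
                      x_min + (gx + 1) * cell_w, y_min + (gy + 1) * cell_h)));
        END LOOP;
      END LOOP;
      COMMIT;
    END;
    /

    -- One row per (polygon, cell it touches); a straddling polygon gets several rows.
    -- SDO_ANYINTERACT assumes a spatial index on my_polygons.geom; swap in
    -- SDO_COVERS(g.cell_geom, p.geom) if you only want cells that fully contain a polygon.
    SELECT p.poly_id, g.grid_id
    FROM   my_polygons p, grid_cells g
    WHERE  SDO_ANYINTERACT(p.geom, g.cell_geom) = 'TRUE';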
I also agree with your strategy of partitioning the data within a single table. I have heard positive comments about this, but I have never personally tried it. The overhead of going to multiple tables seems like something you'll want to avoid though.
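For reference, a rough, untested sketch of the single-table, partition-key variant (again Oracle, with made-up names; LOCAL spatial index support differs between Oracle releases, so check the Spatial documentation for your version):

    CREATE TABLE polygons_gridded (
      poly_id  NUMBER,
      grid_id  NUMBER NOT NULL,
      geom     SDO_GEOMETRY
    )
    PARTITION BY RANGE (grid_id) (
      PARTITION p_grid_01 VALUES LESS THAN (2),
      PARTITION p_grid_02 VALUES LESS THAN (3),
      -- ... one partition per grid cell ...
      PARTITION p_grid_12 VALUES LESS THAN (13)
    );

    -- LOCAL gives one spatial index tree per partition, so a query that supplies
    -- grid_id only has to search that partition's (roughly 12x smaller) index.
    -- Remember the USER_SDO_GEOM_METADATA entry before creating the index.
    CREATE INDEX polygons_gridded_sidx ON polygons_gridded (geom)
      INDEXTYPE IS MDSYS.SPATIAL_INDEX LOCAL;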
Description of the data:
I'm creating a simple geospatial database in SQL Server. It will contain regions (polygons) for all the countries in the world plus their administrative regions. The data will come from OpenStreetMap. For each polygon in the database I will only store a unique ID and a name. This database will be very rarely written to, if ever (I'll just load the data once and that's it).
Even though I only get these general boundaries, there's still plenty of data - the full world is around 4GB as GeoJSON. The database will probably be more efficient, yet with all the indexes and everything I still expect it to be over 1GB in size.
The polygons are organized by their "administrative levels". Level 0 contains all the countries; Level 1 contains the largest administrative subdivisions; and so on. This does however vary by country - very few countries use ALL the levels; most use some 3 to 4 levels, but which ones depends on the country.
I believe that a polygon in a "lower" level will always belong to (fit completely inside) a polygon in a "higher" level. However, it might not be the immediately preceding numeric level. For example, if a country uses levels 1, 2 and 5, then the polygons in level 5 will belong to polygons in level 2, which will belong to polygons in level 1. At the very least I expect every polygon in levels greater than 0 to fit inside a polygon in level 0 (countries). I haven't tested this, but if there are any conflicts (disputed territories, maybe?), then I'll figure out some way to resolve them in order to establish the parent-child relationship.
Description of the requirements:
I need to query this data in the following ways:
By specifying a level and a bounding box, all the polygons that are in that administrative level and inside the specified bounding box should be returned (along with their IDs and names).
By specifying a parent polygon ID, I want to fetch all direct child polygons, no matter to which level they belong (but only direct children, not grandchildren).
Where I'm stuck:
How do I create a data structure that can query this efficiently?
Querying by bounding box can be done efficiently only with a geospatial index, but that alone does not filter by administrative level. I don't want to select all the shapes via the geospatial index and then do a full scan over them - this will fail miserably when I request the whole world at level 0.
I could split each level into its own table, which would work nicely for the level+bbox criteria, but how then do I select all children of a parent if they are potentially smeared out over several tables? I can't even create an indexed view, because indexed views don't support UNIONs.
I could store everything in two copies - split across per-level tables for the first query, and in a single table for the second. But that would double the amount of data in the database, which is already fairly large.
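For reference, the single-table layout being weighed above would look something like this (SQL Server; all names, the use of GEOGRAPHY, and the @level/@bbox/@parent_id parameters are assumptions of mine, not a claim that this resolves the index question):

    CREATE TABLE admin_region (
        region_id   INT           NOT NULL PRIMARY KEY,
        parent_id   INT           NULL REFERENCES admin_region (region_id),
        admin_level TINYINT       NOT NULL,
        name        NVARCHAR(200) NOT NULL,
        boundary    GEOGRAPHY     NOT NULL
    );

    CREATE SPATIAL INDEX six_admin_region_boundary ON admin_region (boundary);
    CREATE INDEX ix_admin_region_parent_level ON admin_region (parent_id, admin_level);

    -- Query 1: level + bounding box (@bbox is a GEOGRAPHY rectangle). The level
    -- filter is applied on top of the spatial seek, which is exactly the concern above.
    SELECT region_id, name
    FROM   admin_region
    WHERE  admin_level = @level
      AND  boundary.STIntersects(@bbox) = 1;

    -- Query 2: direct children of a parent, whatever their level.
    SELECT region_id, name
    FROM   admin_region
    WHERE  parent_id = @parent_id;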
Any ideas?
I need to create an SSRS report with many measures in a map.
For example, how do I display Total sales in colour and Prices in bullets?
There is a map component in SSRS, but you'll need a spatial data set that defines the points or shapes you want your measures mapped to. It comes with a bunch of US-centric ones, and maybe a world one at the country level.
I'm not sure how much information you can put in the annotations, but you can also combine traditional reporting techniques with the map and have a sidebar with your metrics.
One caveat: I think the maximum size of the map component is 36 inches (the last time I used it, anyway), so if you're looking for something larger you'll likely need to use some GIS software instead.
I have a large database which contains data about countries, such as country name, position, HDI (Human Development Index) and population. I need to classify this data into some number K of groups based on the population. A friend of mine suggested that k-means clustering would be useful here. But I am thinking this could be done directly by sorting the data by population and then dividing the sorted data into groups. Is there any difference between these two approaches?
Thanks
Recursively splitting along one dimension results in a decision tree, which is a different data structure. All the cuts between the groups are axis-aligned (horizontal or vertical). K-means can do better because its cuts are not necessarily horizontal or vertical (most of the time they aren't).
Actually, the decision tree approach is also very useful. You might well try it.
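To make the contrast concrete, the "sort and divide" approach from the question is a one-dimensional, equal-count split; a sketch in SQL (the table and column names are assumptions):

    -- NTILE(4) orders the rows by population and cuts them into 4 equal-sized groups.
    -- One-dimensional k-means would instead place the cut points where the population
    -- values cluster, so the resulting groups need not be equal-sized.
    SELECT country_name,
           population,
           NTILE(4) OVER (ORDER BY population) AS population_group
    FROM   country;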
When working with spatial data in a database like PostGIS, is it a good idea to calculate the intersection of two polygons, or the area of a polygon, on every SELECT? Or is it better for performance to do the calculations on INSERT, UPDATE or DELETE and save the results in a column of the table? What is the usual approach in large spatial databases?
Thanks for an answer.
The question is too abstract.
Of course, if you use the intersection area (ST_Intersection), you should store the resulting ST_Intersection geometry. But in practice we often have to calculate the intersection on the fly, because the input arguments depend on dynamic parameters (e.g. the intersection of an area with temperature < 30 °C with an area of wind > 20 m/s). In that case you can use a VIEW to simplify the query.
Of course, if your table contains both geometry arguments, or one of them is constant, it's better to store the intersection. In particular, you can then build a spatial index on that column.
There are no fixed rules; you should be guided by your practical conditions: the size of the database, the type of use, etc. For example, I store the generated ellipse (belief zone) for each lightning-strike point, but I don't store the fact of intersection (a boolean) with power lines, because those intersections may be parametrized.
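A hedged PostGIS sketch of both options (the table and column names are placeholders):

    -- Option 1: compute on the fly, wrapped in a VIEW to keep the query simple.
    CREATE VIEW zone_overlap AS
    SELECT a.id AS zone_a_id, b.id AS zone_b_id,
           ST_Intersection(a.geom, b.geom)          AS overlap_geom,
           ST_Area(ST_Intersection(a.geom, b.geom)) AS overlap_area
    FROM   zone_a a
    JOIN   zone_b b ON ST_Intersects(a.geom, b.geom);

    -- Option 2: when both inputs are static, store the result once and index it.
    CREATE TABLE zone_overlap_cached AS
    SELECT a.id AS zone_a_id, b.id AS zone_b_id,
           ST_Intersection(a.geom, b.geom) AS overlap_geom
    FROM   zone_a a
    JOIN   zone_b b ON ST_Intersects(a.geom, b.geom);

    CREATE INDEX zone_overlap_cached_gix
        ON zone_overlap_cached USING GIST (overlap_geom);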
I am working with radio maps that seem to be too fragmented to query efficiently. The response time is 20-40 seconds when I ask if a single point is within the multipolygon (I have tested "within"/"contains"/"overlaps"). I use PostGIS, with GeoDjango to abstract the queries.
The multipolygon column has a GiST index, and I have tried VACUUM ANALYZE. I use PostgreSQL 8.3.7 and Django 1.2.
The maps stretch over large geographical areas. They were originally generated by a topography-aware radio tool, and the radio cells/polygons are therefore fragmented.
My goal is to query for points within the multipolygons (i.e. houses that may or may not be covered by the signals).
All the radio maps are made up of between 100,000 and 300,000 vertices (total), with a wildly varying number of polygons. Some of the maps have fewer than 10 polygons; from there it jumps to between 10,000 and 30,000 polygons. The ratio of polygons to vertices does not seem to affect the time the queries take to complete very much.
I use a projected coordinate system, the same for both the houses and the radio sectors. QGIS shows that the radio sectors and maps are correctly placed in the terrain.
My test queries are with only one house at a time within a single radio map. I have tested queries like "within"/"contains"/"overlaps", and the results are the same:
Sub-second response if the house is "far from" the radio map (I guess this is because it is outside the bounding box that is automatically used in the query).
20-40 seconds response time if the house/point is close to or within the radio map.
Do I have alternative ways to optimize queries, or must I change/simplify the source material in some way? Any advice is appreciated.
Hello
The first thing I would do is split the multipolygons into single polygons and create a new index. Then the index will work a lot more effectively. Right now the whole multipolygon has one big bounding box, and the index can do nothing more than tell whether the house is inside that bounding box. So, the smaller the polygons are in relation to the whole dataset, the more effective the index use. There are even techniques for splitting single polygons into smaller ones with a grid, to make the index part of the query even more effective. But the first thing would be to split the multipolygons into single ones with ST_Dump(); something like the sketch below.
If you have a lot of attributes in the same table, it would be wise to put them into another table and only keep an ID telling which radio map each polygon belongs to. Otherwise you will get a lot of duplicated attribute data.
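A minimal sketch of that split-and-reindex step, assuming made-up names (a radio_map table with an id and a multipolygon geom column):

    -- Explode each multipolygon into single polygons, keeping only a reference
    -- to the radio map it came from.
    CREATE TABLE radio_cell AS
    SELECT r.id AS radiomap_id,
           (ST_Dump(r.geom)).geom AS geom
    FROM   radio_map r;

    CREATE INDEX radio_cell_gix ON radio_cell USING GIST (geom);
    ANALYZE radio_cell;

    -- The point-in-coverage test now hits many small bounding boxes instead of
    -- one huge one. The SRID (32633) and the coordinates are just placeholders.
    SELECT DISTINCT radiomap_id
    FROM   radio_cell
    WHERE  ST_Contains(geom, ST_SetSRID(ST_MakePoint(599000, 6640000), 32633));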
HTH
Nicklas