When working with spatial data in a database like PostGIS, is it a good approach to calculate the intersection of two polygons, or the area of a polygon, on every SELECT? Or is it better for performance to do the calculations in an INSERT, UPDATE, or DELETE statement and save the results in a column of the table? What is the usual approach in large spatial databases?
Thanks for an answer.
The question is too abstract.
Of course, if you use the intersection area (ST_Intersection), you should store the resulting geometry. But in practice we often have to calculate the intersection on the fly, because the input arguments depend on dynamic parameters (e.g. the intersection of an area with temperature < 30 °C and an area with wind > 20 m/s). By the way, you can use a VIEW to simplify such a query.
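For illustration, a minimal sketch of that VIEW approach, assuming hypothetical temperature_zones and wind_zones tables, each with a geom column and the attribute used as a threshold:

-- Hypothetical tables and thresholds; the view hides the on-the-fly intersection.
CREATE VIEW cold_and_windy AS
SELECT t.id AS temp_zone_id,
       w.id AS wind_zone_id,
       ST_Intersection(t.geom, w.geom) AS geom
FROM temperature_zones t
JOIN wind_zones w ON ST_Intersects(t.geom, w.geom)
WHERE t.temperature < 30
  AND w.wind_speed > 20;

The consumer then just selects from cold_and_windy; if the thresholds really are dynamic, they would instead go into the query itself or a set-returning function rather than a fixed view.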
Of course, if your table contains both geometry arguments as columns, or one of them is constant, it is better to store the intersection. In particular, you can then build a spatial index on that column.
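A sketch of that stored variant, with a hypothetical parcels table carrying two geometry columns geom_a and geom_b; the precomputed intersection is written on INSERT/UPDATE and indexed, so SELECTs don't recompute it:

-- Add and backfill the precomputed intersection, then index it.
ALTER TABLE parcels ADD COLUMN geom_intersection geometry;

UPDATE parcels SET geom_intersection = ST_Intersection(geom_a, geom_b);

CREATE INDEX idx_parcels_intersection ON parcels USING gist (geom_intersection);

-- Keep the stored value current when the source columns change.
CREATE OR REPLACE FUNCTION parcels_set_intersection() RETURNS trigger AS $$
BEGIN
    NEW.geom_intersection := ST_Intersection(NEW.geom_a, NEW.geom_b);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_parcels_intersection
BEFORE INSERT OR UPDATE OF geom_a, geom_b ON parcels
FOR EACH ROW EXECUTE FUNCTION parcels_set_intersection();  -- EXECUTE PROCEDURE on PostgreSQL < 11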
There are no fixed rules. You should be guided by your practical conditions: the size of the database, the type of usage, and so on. For example, I store the generated ellipse (confidence zone) for each lightning-strike point, but I don't store the fact of intersection (a boolean) with power lines, because those intersections may be parametrized.
I have the following use case: 5 billion+ geospatial data points, which need to be queried based on 3 predicate ranges: latitude, longitude, and date. A bounding-box geospatial query usually returns 500K-1M rows, of which only about 0.4% are valid, once filtered by the date range.
The ideal structure for this is a 3D index: a k-d tree, octree, etc., but PostgreSQL's (and most other databases') geospatial index is a 2D structure. Does anyone have experience representing this type of query in a 3D index, perhaps as a point cloud, using the chronological value as the 'Z' component? (Note: even though the current environment is PostgreSQL, suggestions based on other engines are more than welcome.)
EDIT: Another possibility I'm considering is reducing the date resolution to a discrete value rather than a range. Then (in theory) I could use a DB product that flattens geospatial data into a standard B-Tree (using a tiling approach), and create a simple compound index, i.e. something along the lines of:
WHERE dateyear = 2015 AND location_tile = xxxxxxxxx
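For illustration, the compound index behind a predicate like that might be declared as follows (table and column names are hypothetical, taken from the WHERE clause above):

-- A plain B-tree compound index: the year first (equality), then the tile key.
CREATE INDEX idx_points_year_tile ON points (dateyear, location_tile);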
I assume that you don't really query on latitude and longitude, but use a geometric query, like "overlaps with this bounding box" or "is no more than a certain distance from a given point".
The best approach for this might be to use partitioning by date range. Then the date condition will lead to partition pruning, so you only have to perform a GiST index scan on those partitions that match the date condition. In addition, partitioning makes it easy to get rid of old data.
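A minimal sketch of that layout with declarative range partitioning (PostgreSQL 10 or later; the table, column, and partition names are illustrative):

-- Partition the points by observation date; one GiST index per partition.
CREATE TABLE points (
    id       bigserial,
    observed date NOT NULL,
    geom     geometry(Point, 4326)
) PARTITION BY RANGE (observed);

CREATE TABLE points_2015 PARTITION OF points
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');
CREATE TABLE points_2016 PARTITION OF points
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');

CREATE INDEX ON points_2015 USING gist (geom);
CREATE INDEX ON points_2016 USING gist (geom);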
You can build multi-column GiST indexes. With the help of the btree_gist extension and the postgis extension, you could make one GiST index over both the date and the geography, which would be more or less equivalent to an octree. But from your description it is hard to tell what you actually have: a date, a date range, or some other aspect of the time dimension.
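A hedged sketch of such a combined index, assuming the same hypothetical points(observed, geom) table as in the partitioning sketch above (the two approaches are alternatives, not meant to be stacked):

CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- One GiST index covering both the scalar date and the geometry column.
CREATE INDEX idx_points_date_geom ON points USING gist (observed, geom);

-- Both predicates can then be served from the same index:
SELECT count(*)
FROM points
WHERE observed BETWEEN '2015-01-01' AND '2015-12-31'
  AND geom && ST_MakeEnvelope(-74.3, 40.5, -73.7, 40.9, 4326);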
Due to the size, number, and performance of my polygon queries (polygon in polygon), I would like to pre-process my data and separate the polygons into grid cells. My data is pretty uniform in my area of interest, so something like 12 even grid cells would work well. I may adjust this number later based on performance. Basically, I am going to create 12 tables, each with an associated spatial index, or possibly I will just create a single table with my grid cell as the partition key. This will reduce my total index size 12x and hopefully increase performance. On the query side, I will direct the query to the appropriate table.
The key is for me to be able to figure out how to group polygons into these grid cells. If a polygon falls within multiple cells, then I would likely create a record in each and de-duplicate at query time. I wouldn't expect this to happen very often.
Essentially I will have a "grid" that I want to intersect with my polygons, to figure out which grid cells each polygon falls in.
Thanks
My process would be something like this:
Find the MIN/MAX ordinate values for your whole data set (both axes)
Extend those values by a margin that seems appropriate (in case the ordinates when combined don't form a regular rectangular shape)
Write a small loop that generates polygons at a set interval within those MIN/MAX ordinates - i.e. create one polygon per grid square
Use SDO_COVERS to see which of the grid squares covers each polygon. If multiple grid squares match a polygon, you should see multiple matches, as you describe (see the sketch below).
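A hedged sketch of the process above in PostGIS terms (the answer mentions Oracle's SDO_COVERS; ST_Covers/ST_Intersects are the PostGIS analogues). The table name, grid dimensions, margin, and SRID are all illustrative; ST_Intersects is used so that a polygon straddling a cell boundary matches every cell it touches:

-- Build a 4 x 3 grid (~12 cells) over the padded data extent,
-- then record which cell(s) each polygon falls in.
WITH extent AS (
    SELECT ST_Expand(ST_Extent(geom)::geometry, 1000.0) AS box   -- margin in map units
    FROM polygons
),
grid AS (
    SELECT (i * 3 + j + 1) AS cell_id,
           ST_MakeEnvelope(
               ST_XMin(box) + i       * (ST_XMax(box) - ST_XMin(box)) / 4.0,
               ST_YMin(box) + j       * (ST_YMax(box) - ST_YMin(box)) / 3.0,
               ST_XMin(box) + (i + 1) * (ST_XMax(box) - ST_XMin(box)) / 4.0,
               ST_YMin(box) + (j + 1) * (ST_YMax(box) - ST_YMin(box)) / 3.0,
               4326) AS cell                                      -- use your data's SRID
    FROM extent,
         generate_series(0, 3) AS i,
         generate_series(0, 2) AS j
)
SELECT p.id, g.cell_id
FROM polygons p
JOIN grid g ON ST_Intersects(g.cell, p.geom);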
I also agree with your strategy of partitioning the data within a single table. I have heard positive comments about this, but I have never personally tried it. The overhead of going to multiple tables seems like something you'll want to avoid though.
We're in a bit of internal conflict on this issue, and can't seem to come to a happy conclusion.
We'll only be storing latitudes and longitudes, and possibly simple polygons. All we need it for is computing distance between two points (and possibly to see if a point is within a polygon), and the entirety of the data is in such close proximity to make planar estimations acceptable.
Since our requirements are so relaxed, half of the dev team suggests using the SqlGeometry type, which is apparently simpler. I'm having trouble accepting this, though, since we're storing geographic data, and it seems like storing it in SqlGeography is the right thing to do. Also, I'm not finding any substantive evidence that the SqlGeometry data type is that much easier to work with than the SqlGeography type.
Does anyone have advice as to which type would be more appropriate for this relatively simple scenario?
It's not a question of comparing features, or accuracy, or simplicity - the two spatial datatypes are for working with different sorts of data.
As an analogy, suppose you were choosing the best datatype for a column that contained a unique identifier for each row. If that UID only contained integer values, you'd use int, whereas if it was a 6-character alphanumeric value you'd use char(6). And if it had variable-length unicode values, you'd use nvarchar instead, right?
The same logic goes for spatial data - you choose the appropriate datatype based on the values that that column contains; if you're working with geographic (i.e. latitude/longitude) coordinates, use the SqlGeography datatype. It's that simple.
You can use SqlGeometry to store latitude/longitude values, but it would be like using nvarchar(max) to store an integer... and I promise you it will lead to further problems down the line (when all your area calculations come out measured in square degrees, for example).
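To make that concrete, a small T-SQL illustration (values approximate): the same nominal one-degree square gives a unitless planar area from geometry but a real surface area from geography. Note the ring order, which matters for geography:

DECLARE @geom geometry = geometry::STGeomFromText(
    'POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))', 4326);
DECLARE @geog geography = geography::STGeomFromText(
    'POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))', 4326);

SELECT @geom.STArea() AS area_square_degrees,  -- 1.0: a planar, meaningless unit
       @geog.STArea() AS area_square_metres;   -- roughly 1.23e10 m^2 at the equator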
The SqlGeography type has fewer methods available than SqlGeometry (especially in SQL Server 2008).
SqlGeography reference
SqlGeometry reference
For example, suppose you want to get the centroid of a polygon in SQL Server 2008. You have a native method for that in geometry, but not in geography.
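For instance (a minimal sketch; the polygon is made up):

DECLARE @poly geometry = geometry::STGeomFromText(
    'POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))', 0);
SELECT @poly.STCentroid().ToString() AS centroid;  -- POINT (5 5)

With geography in SQL Server 2008 there is no equivalent method, so you would have to round-trip through geometry (losing the geodetic semantics) to get a comparable result.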
Also, it has the following limitations:
You can't have a geography exceeding one hemisphere
The ring-order matters when creating the polygon
Also, most APIs and libraries available (that I know of) handle geometries better than geographies.
That said, if the distance calculations have to be precise, the distances are large, and the coordinates span the whole world, geography would probably be a better fit. Otherwise, and according to your description of the problem, you would be well served by the geometry type.
Regarding your question "is it that much easier to work with?": it depends. As a rule of thumb, for simple scenarios I typically opt for SqlGeometry.
Anyway, IMHO you shouldn't worry too much about that decision. It's relatively easy to create a new column with the other type and migrate the data if necessary.
Four years later, it has become apparent that we should have stored the data in SqlGeometry instead of SqlGeography.
Why?
We were importing information from legislative district maps, and their data was stored in SqlGeometry. When determining if a particular lat/long was within a certain legislative district boundary, we'd get inconsistent results when the point was close to two boundaries.
This required us to do additional work to identify locations that were "close" to a boundary, and manually verify that they were assigned to the proper district. Not ideal.
Moral of the story: if you're relying on someone else's data, consider what type it's stored in to help guide your decision.
I have been experimenting with the geography data type lately and just love it. But I can't decide whether I should convert from my current schema, which stores latitude and longitude in two separate numeric(9,5) fields, to the geography type. I have calculated the size of both types: the lat/long way of representing a point is 28 bytes for a single point, whereas the geography type is 26. Not a big gain in space, but a huge improvement in performing geospatial operations (intersection, distance measurement, etc.), which are currently handled using awkward stored procedures and scalar functions. What I wonder about is the indices. Will the geography data type require more space for indexing the data? I have a feeling that it will; even though the actual data stored in the columns is smaller, I think the way geospatial indices work will eventually result in a larger space allocation for them.
P.S. As a side note, it seems that SQL Server 2008 (not R2) does not automatically seek through geospatial indices unless explicitly told to with a WITH(INDEX()) hint.
In my opinion you should definitely use the spatial types only. The spatial types are optimized for spatial queries, and if spatial queries are what you need, then I think it is an easy choice.
As a side effect, you can get rid of your geographical functions and procedures, since they are (probably) built into SQL Server 2008. One caveat, though: you might have to spend some time optimizing the spatial indexes, but this depends on your specific case.
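A hedged sketch of the conversion and index setup (table and column names are made up; the table is assumed to already have the clustered primary key that spatial indexes require). The hint at the end relates to the P.S. in the question about SQL Server 2008 not always picking the spatial index on its own:

-- Replace the numeric(9,5) pair with a geography column.
ALTER TABLE places ADD geo geography;

UPDATE places
SET geo = geography::Point(latitude, longitude, 4326);

CREATE SPATIAL INDEX six_places_geo ON places (geo);

-- Distance query; the explicit index hint can be needed on SQL Server 2008 (non-R2).
DECLARE @here geography = geography::Point(59.33, 18.06, 4326);
SELECT TOP (10) *
FROM places WITH (INDEX (six_places_geo))
WHERE geo.STDistance(@here) < 5000       -- metres
ORDER BY geo.STDistance(@here);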
I understand that you are trying to decide which of the two to keep, but you might want to consider keeping both. If you export your data into shapefiles, it's common practice to keep the lat/lon fields alongside the geom field.
I would keep both. It can be useful to easily query the original coordinates of a particular feature without requiring spatial operations. You have the benefit of knowing the original points as well as the ability to create a new geometry from them in case you need it in a different coordinate system (like if you have your geometry in a particular projection that will lose a lot of precision going to another).
In relation to my previous question, where I was asking for database suggestions: it just occurred to me that I don't even know whether what I'm trying to store is appropriate for a database, or whether some other data storage method should be used.
I have some physical model test data (let's say wind tunnel data, or something similar) where for every model (M-1234) I have:
name (M-1234)
length L
breadth B
height H
L/B ratio
L/H ratio
...
lots of other ratios and dimensions ...
a force-versus-speed curve, given as a large number of points for x-y plotting
...
a few other similar curves (all of them x-y).
Now, what I'm trying to accomplish is to store all that in some reasonable way, so that the user of the database can come and see which are the ten models closest to L/B=2.5 (or some similar request), and then somehow get all the data of those models, including the curve data (in a plain-text file format).
Is an SQL database (or any other kind, for that matter) an appropriate way of handling something like this? Or should I take some other approach?
I have about a month to finish this, and in that time I also have to learn enough about databases, so please give your suggestions bearing that in mind. Assume no previous knowledge of the subject whatsoever.
I think what you're looking for is possible. I'm using PostgreSQL here, but any database should work. This is my test database:
CREATE TABLE test (
id serial primary key,
ratio double precision
);
COPY test (id, ratio) FROM stdin;
1 0.29999999999999999
2 0.40000000000000002
3 0.59999999999999998
4 0.69999999999999996
\.
Then, to find the nearest values to a particular ratio
select id,ratio,abs(ratio-0.5) as score from test order by score asc limit 2;
In this case, I'm looking for the 2 nearest to 0.5
I'd probably do a data model where you have one table for the main data (the ratios and so on), and then a second table which holds the curve points, since I'm assuming that the curves aren't all the same length.
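A minimal sketch of that two-table layout (names and columns are illustrative, in the same PostgreSQL flavour as the example above):

-- One row per model; scalar dimensions and ratios live here.
CREATE TABLE model (
    id       serial PRIMARY KEY,
    name     text NOT NULL,              -- e.g. 'M-1234'
    length   double precision,
    breadth  double precision,
    height   double precision,
    lb_ratio double precision,           -- could also be computed as length / breadth
    lh_ratio double precision
);

-- Many rows per model: one row per (curve, x) point.
CREATE TABLE curve_point (
    model_id integer NOT NULL REFERENCES model(id),
    curve    text NOT NULL,              -- e.g. 'force_vs_speed'
    x        double precision NOT NULL,  -- e.g. speed
    y        double precision NOT NULL,  -- e.g. force
    PRIMARY KEY (model_id, curve, x)
);

The nearest-ten query from earlier then works against model.lb_ratio, and the curve data for the selected models can be pulled from curve_point and written out as plain text.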
Yes, a database is probably the best approach for this.
A relational database (which usually uses SQL for data access) is suitable for data that is more or less structured as tables.
To give you an idea:
You could have a main table model with fields name, width, etc. Then subtable(s) for any values which can appear more than once, each referring back to model (look up "foreign key").
Then a subtable for your actual curves, again referring back to model.
How to actually model the curves in the DB I can't say, as I don't know how you model them. But if it's lots of numbers, it can go into the DB.
It seems you know little about relational DBMSs. Consider reading something on Wikipedia, or doing a few simple DBMS tutorials (PostgreSQL has some: http://www.postgresql.org/docs/8.4/interactive/tutorial.html, but there are many others). Then pick a DBMS to try out (PostgreSQL is probably not a bad choice, but again there are many others).
Then try implementing a simple table schema, and get back to us with any detail questions (which you'll probably have).
One more thing: Those questions are probably more appropriate to serverfault.com.
This is arguably scientific data, so you might find libraries/formats intended for arbitrary scientific data useful, such as HDF5 (http://www.hdfgroup.org/). (Note: I am not an expert.)