I have the following use case: 5 billion+ geospatial data points, which need to be queried based on 3 predicate ranges: latitude, longitude, and date. A bounding-box geospatial query usually returns 500K-1M rows, of which only about 0.4% are valid, once filtered by the date range.
The ideal structure for this is a 3D index: a k-d tree, octree, etc., but PostgreSQL's geospatial index (like that of most other databases) is a 2D structure. Does anyone have experience representing this type of query in a 3D index, perhaps as a point cloud, using the chronological value as the 'Z' component? (Note: even though the current environment is PostgreSQL, suggestions based on other engines are more than welcome.)
EDIT: Another possibility I'm considering is reducing the date resolution to a discrete value rather than a range. Then (in theory) I could use a DB product that flattens geospatial data into a standard B-Tree (using a tiling approach), and create a simple compound index, i.e. something along the lines of:
WHERE dateyear = 2015 AND location_tile = xxxxxxxxx
I assume that you don't really query on latitude and longitude directly, but use a geometric query like "overlaps with this bounding box" or "is no more than this distance from a certain point".
The best approach for this might be to use partitioning by date range. Then the date condition will lead to partition pruning, so you only have to perform a GiST index scan on those partitions that match the date condition. In addition, partitioning makes it easy to get rid of old data.
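A rough sketch of that setup (with a hypothetical table named points and a geometry column geom; adjust to your actual schema):

-- Partition by year, with a GiST index on the geometry.
CREATE TABLE points (
    id          bigserial,
    measured_on date NOT NULL,
    geom        geometry(Point, 4326) NOT NULL
) PARTITION BY RANGE (measured_on);

CREATE TABLE points_2015 PARTITION OF points
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');

-- On PostgreSQL 11+ an index on the parent cascades to every partition.
CREATE INDEX points_geom_gist ON points USING gist (geom);

-- The date condition prunes partitions; the bounding box uses the GiST index.
SELECT * FROM points
WHERE measured_on BETWEEN '2015-03-01' AND '2015-06-30'
  AND geom && ST_MakeEnvelope(-74.3, 40.5, -73.7, 40.9, 4326);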
You can build multi-column GiST indexes. With the help of the btree_gist and postgis extensions, you could make one GiST index over both the date and the geography, which would be more or less equivalent to an octree. But it is hard to tell from your description what you actually have: a single date, a date range, or some other aspect of the time dimension.
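Assuming a single date column, the combined index could look something like this (again using the hypothetical points table from above):

CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE EXTENSION IF NOT EXISTS postgis;

-- One GiST index covering both the temporal and the spatial dimension.
CREATE INDEX points_date_geom_gist
    ON points USING gist (measured_on, geom);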
I'm trying to use Apache Superset to create a dashboard that will display the average rate of X/Y at different entities such that the time grain can be changed on the fly. However, all I have available as raw data is daily totals of X and Y for the entities in question.
It would be simple to do if I could just get a line chart that displayed sum(X)/sum(Y) as its own metric, where the sum range would change with the time grain, but that doesn't seem to be supported.
Creating a function in SQLAlchemy that calculates the daily rates and then uses that as the raw data is also an insufficient solution, since taking the average of that over different time ranges would not be properly weighted.
Is there a workaround I'm not seeing?
Is there a way to use Druid or some other tool to make displaying a quotient over a variable range possible?
My current best solution is to just set up different charts for each time grain size (day, month, quarter, year), but that's extremely inelegant and I'm hoping to do better.
There are multiple ways to do this. One is using the Metric editor as shown below; in this case the metric definition is stored as part of the chart.
Another way is to define a metric in the "datasource editor", where the metric will be stored with the datasource definition and become reusable for any chart using that datasource, as shown here.
Side note: depending on the database you use, you may have to CAST from, say, an integer to a numeric type as I did in the example, or multiply by 100, in order to get a useful result.
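As a rough sketch (x and y are placeholder column names for your daily totals), the custom SQL expression for such a metric could be along the lines of:

SUM(CAST(x AS NUMERIC)) / SUM(CAST(y AS NUMERIC))

Because both sums are re-aggregated at whatever time grain is selected, the quotient stays correctly weighted.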
My goal is to be able to write a query such that I can find all of the rows in a table between a certain radius of a lat and long.
So a query like this:
SELECT * FROM some_table
WHERE lat > someVariableMinLat AND lat < someVariableMaxLat
  AND lng > someVariableMinLng AND lng < someVariableMaxLng;
along those lines.
Now, my thought is that these columns should of course be indexed, and I just wanted to confirm that; any related reading or info would be great, thank you!
Your query requires ALLOW FILTERING to run, assuming you've set lat and lng as secondary indices.
Since you're interested in related readings and information, I'll gladly share my little knowledge with you. Let me start with ALLOW FILTERING. You've created a rather complex query that (1) uses < and > instead of = and (2) does so on more than one non-primary-key column.
What ALLOW FILTERING does is fetch a set of rows first and then apply the remaining conditions to them. Therefore, it's far from efficient if performance is your concern.
Speaking of performance, it's important to note that a column with many distinct values (high cardinality) is not a good candidate for a secondary index. You may find out more about this topic here.
How would I do that?
I'm not sure about your requirements, but you could consider using a geohash. A geohash is an encoded form of both longitude and latitude, and it can get pretty precise. With geohash strings you can trade off the length of the geohash in characters against its precision (the longer the string, the more precise it becomes). You could set the geohash as your index column, keeping in mind that the longer the geohash, the more distinct values the column will have. You may even consider making it part of the primary key to take performance to a higher level.
Or you could use two key columns: one to keep a short geohash, and another to keep the longer hash of the same location, if you want different levels of precision :)
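As a rough CQL sketch (hypothetical table name and geohash lengths; pick prefix lengths that match your precision needs):

-- Short geohash prefix as the partition key, full geohash as a clustering column.
CREATE TABLE points_by_geohash (
    geohash4  text,    -- 4-character geohash prefix
    geohash12 text,    -- full-precision geohash
    lat       double,
    lng       double,
    PRIMARY KEY (geohash4, geohash12)
);

-- Fetch one coarse cell, then refine client-side or by clustering range.
SELECT * FROM points_by_geohash WHERE geohash4 = 'dr5r';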
I have a system producing about 5TB of time-tagged numeric data every year. The fields tend to be different for each row, and to avoid having heaps of NULLs I'm thinking of using Postgres as a document store with JSONB.
However, GIN indexes on JSONB fields don't seem to be made for numerical and datetime data. There are no inequality or range operators for numbers and dates.
Here they suggest making special constructs with LATERAL to treat JSON values as normal numeric columns, and here someone proposes using a "sortable" string format for dates and filtering on string ranges.
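For instance, the kind of construct I mean would be roughly (my own sketch, with a hypothetical docs table and a data JSONB column):

-- Expose a JSONB value as a numeric column via LATERAL...
SELECT d.*
FROM docs d,
     LATERAL (SELECT (d.data->>'temperature')::numeric AS temperature) t
WHERE t.temperature BETWEEN 20 AND 30;

-- ...or index the extracted value with an expression index.
CREATE INDEX docs_temperature_idx ON docs (((data->>'temperature')::numeric));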
These solutions sound a bit hacky and I wonder about their performance. Perhaps this is not a good application for JSONB?
An alternative I can think of with a relational DB is to use sixth normal form, making one table for each (optional) field, of which, however, there would be hundreds. It sounds like a big JOIN mess, and new tables would have to be created on the fly whenever a new field pops up. But maybe it's still better than a super-slow JSONB implementation.
Any guidance would be much appreciated.
More about the data
The data are mostly sensor readings, physical quantities and boolean flags. Which subset of these is present in each row is unpredictable. The index is an integer, and the only field that always exists is the corresponding date.
There would probably be one write for each value and almost no updates. Reads can be frequent and sliced based on any of the fields (some are more likely to be in a WHERE statement than others).
Suppose the following: as input, one gets a record consisting of N numbers and booleans. This vector has to be compared against a database of vectors, which include M additional "result" elements. That is, the database holds P vectors of size N+M.
Each vector in the database holds a boolean as its last element. The aim of the exercise is to find, as fast as possible, the record(s) that most closely match the input vector AND have a result vector ending with a TRUE boolean.
To make the above a bit more comprehensible, consider the following example:
A database with personal health information, consisting of records holding:
age
gender
weight
length
heart issues (boolean)
lung issues (boolean)
residence
alternative plan chosen (if any)
accepted offer
The program would then get an input like
36 Male 185pound 68in FALSE FALSE NYC
It would then find out which plan would be the best to offer the client, based on what's in the database.
I know of a few methods that would help with this, e.g. the Levenshtein distance. However, most methods would involve searching the entire database for the best matches.
Are there any algorithms or methods that would cut back on the processing power/time required? I can't imagine that, e.g., insurance agencies don't use more efficient methods to search their databases...
Any insights into this area would be greatly appreciated!
Assumption: this is a relational database. If instead it were NOSQL then please provide more info on which db.
Do you have the option to create bitmap indexes? They can cut down the number of records returned. That is useful for almost all of the columns, since their cardinalities are low.
After that the only one left is the residence, and you should use a Geo distance for that.
If you are unable to create bitmap indexes then what are your filtering options? If none then you have to do a full table scan.
For each of the components, e.g. age, gender, etc., you need to
(a) determine a distance metric, and
(b) determine how to compute that metric and the resulting distance between different records.
I'm not sure a Levenshtein distance would work here: you need to take each field separately to find its contribution to the overall distance measure.
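As a rough sketch of what that could look like in SQL (hypothetical table and column names, arbitrary normalization weights; the input vector from the example is hard-coded):

-- Per-field distances, normalized and summed; keep only accepted offers.
-- The input booleans are FALSE, so any TRUE in the database counts as a mismatch.
SELECT *,
       ABS(age - 36) / 100.0
     + CASE WHEN gender = 'Male' THEN 0 ELSE 1 END
     + ABS(weight_lb - 185) / 300.0
     + ABS(height_in - 68) / 80.0
     + CASE WHEN heart_issues THEN 1 ELSE 0 END
     + CASE WHEN lung_issues  THEN 1 ELSE 0 END AS distance
FROM policies
WHERE accepted_offer = TRUE
ORDER BY distance
LIMIT 10;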
When working with spatial data in a database like PostGIS, is it a good approach to calculate the intersection of two polygons, or the area of polygons, on every SELECT? Or is it better for performance to do the calculations in an INSERT, UPDATE or DELETE statement and save the results in a column of the table? What is the usual approach in big spatial databases?
Thanks for an answer.
The question is too abstract.
Of course, if you use the intersection area (ST_Intersection), you should store the resulting ST_Intersection geometry. But in practice we often have to calculate the intersection on the fly, because the input arguments depend on dynamic parameters (e.g. the intersection of an area with temperature < 30 °C with an area of wind > 20 m/s). By the way, you can use a VIEW to simplify such a query.
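For example (a rough sketch with hypothetical tables temperature_zones and wind_zones, each with a geometry column geom):

CREATE VIEW hot_and_windy AS
SELECT t.id AS temp_id,
       w.id AS wind_id,
       ST_Intersection(t.geom, w.geom)          AS geom,
       ST_Area(ST_Intersection(t.geom, w.geom)) AS area
FROM temperature_zones t
JOIN wind_zones w ON ST_Intersects(t.geom, w.geom);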
Of course, if your table contains both geometry columns used as arguments, or one of them is constant, it's better to store the intersection. In particular, you can then build a spatial index on that column.
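A rough sketch of that approach (hypothetical table and column names):

ALTER TABLE zones ADD COLUMN intersection_geom geometry;
UPDATE zones SET intersection_geom = ST_Intersection(geom_a, geom_b);
CREATE INDEX zones_intersection_gist ON zones USING gist (intersection_geom);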
There are no hard and fast rules. You should be guided by practical conditions: the size of the database, the type of usage, etc. For example, I store the generated ellipse (belief zone) for each lightning-stroke point, but I don't store the fact of intersection (boolean) with power lines, because those intersections may be parametrized.