Cassandra Geolocation, to index or not to index? - database

My goal is to be able to write a query such that I can find all of the rows in a table between a certain radius of a lat and long.
So a query like this:
SELECT * FROM some_table WHERE lat > someVariableMinLat AND
lat < someVariableMaxLat AND
lng > someVariableMinLng AND lng < someVariableMaxLng;
along those lines.
Now, my thought is that of course these columns should be indexed. I just wanted to confirm that, and any related reading or info would be great, thank you!

Your query requires ALLOW FILTERING to run, assuming you've set lat and lng as secondary indices.
Since you're interested in related reading and information, I'll gladly share what little knowledge I have. Let me start with ALLOW FILTERING. You've written a rather demanding query that (1) uses < and > instead of = and (2) does so on more than one non-primary-key column.
What ALLOW FILTERING does is fetch a broad set of rows first and then apply the rest of your conditions to them. It is therefore far from efficient if performance is a concern.
Speaking of performance, it's important to note that a column with many distinct values (high cardinality) is not a good candidate for a secondary index in Cassandra. You may find out more about this topic here.
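For concreteness, this is roughly the shape of the setup being discussed; the table mirrors the question and the index names are just placeholders:

CREATE TABLE some_table (
    id  uuid PRIMARY KEY,
    lat double,
    lng double
);

CREATE INDEX some_table_lat_idx ON some_table (lat);
CREATE INDEX some_table_lng_idx ON some_table (lng);

-- Range predicates on indexed, non-primary-key columns still need
-- ALLOW FILTERING, i.e. Cassandra fetches rows and filters them afterwards.
SELECT * FROM some_table
WHERE lat > -33.9 AND lat < -33.8
  AND lng > 151.1 AND lng < 151.3
ALLOW FILTERING;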
How would I do that?
I'm not sure about your exact requirements, but you could consider using a geohash. A geohash encodes latitude and longitude together into a single string, and it can get quite precise. With geohash strings you can trade length against precision: the longer the string, the more precise it is. You could set the geohash as your index column, which also means that the longer the geohash, the more distinct values the column will have. You might even make it part of the primary key to take performance a level higher.
Or you could put two geohash columns into the primary key: one holding a short hash and the other a longer hash for the same location, if you want different levels of precision :)
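A rough CQL sketch of that idea (table and column names are only illustrative; geohash4 is a short, coarse hash and geohash8 a longer, more precise one):

CREATE TABLE points_by_geohash (
    geohash4 text,    -- coarse cell, used as the partition key
    geohash8 text,    -- finer cell, used as a clustering column
    id       uuid,
    lat      double,
    lng      double,
    PRIMARY KEY (geohash4, geohash8, id)
);

-- All points in one coarse cell, narrowed to a range of finer cells.
SELECT * FROM points_by_geohash
WHERE geohash4 = 'dr5r'
  AND geohash8 >= 'dr5ru000' AND geohash8 < 'dr5rv000';

To cover a radius you would compute the geohash cells that intersect the circle on the client side and run one such query per cell.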

Related

How to Index time-series based geospatial data

I have the following use case: 5 billion+ geospatial data points, which need to be queried based on 3 predicate ranges: latitude, longitude, and date. A bounding-box geospatial query usually returns 500K-1M rows, of which only about 0.4% are valid, once filtered by the date range.
The ideal structure for this is a 3D index: k-d tree/octree, etc., but PostgreSQL's (and most other databases') geospatial index is a 2D structure. Does anyone have experience representing this type of query in a 3D index, perhaps as a point cloud, using the chronological value as the 'Z' component? (Note: even though the current environment is PostgreSQL, suggestions based on other engines are more than welcome.)
EDIT: Another possibility I'm considering is reducing the date resolution to a discrete value rather than a range. Then (in theory) I could use a DB product that flattens geospatial data into a standard B-Tree (using a tiling approach), and create a simple compound index, i.e. something along the lines of:
WHERE dateyear = 2015 AND location_tile = xxxxxxxxx
I assume that you don't really query on raw latitude and longitude, but use a geometric query, like "overlaps with this bounding box" or "is within a certain distance of a given point".
The best approach for this might be to use partitioning by date range. Then the date condition will lead to partition pruning, so you only have to perform a GiST index scan on those partitions that match the date condition. In addition, partitioning makes it easy to get rid of old data.
You can build multi-column GiST indexes. With the help of the btree_gist extension and the postgis extension, you could make one GiST index over both the date and the geography, which would be more or less equivalent to an octree. But it is hard to figure out what you actually have, a date, a date range, or some other aspect of the time dimension.
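As a sketch only, assuming a timestamptz column and a PostGIS geography column (names and partition bounds are invented):

CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE observations (
    observed timestamptz NOT NULL,
    geom     geography(Point, 4326) NOT NULL
) PARTITION BY RANGE (observed);

CREATE TABLE observations_2015 PARTITION OF observations
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');

-- btree_gist covers the timestamp, PostGIS covers the geography,
-- so one GiST index per partition handles both dimensions.
CREATE INDEX ON observations_2015 USING gist (observed, geom);

SELECT * FROM observations
WHERE observed >= '2015-03-01' AND observed < '2015-04-01'
  AND ST_DWithin(geom,
                 ST_SetSRID(ST_MakePoint(-73.98, 40.75), 4326)::geography,
                 5000);

The date predicate prunes to the matching partitions, and within each partition the combined GiST index serves both the time filter and the distance filter.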

New to database design (Postgres) seeking advice

I am new to designing databases. I come from a front end background. I am looking to design a database that stores performance metrics of wind turbines. I was given an excel file. The list of metrics is nearly 200 and you can see the first few in this image.
I can't think of the best way to represent this data in a database. My first thought was to import this table as-is into a database table and add a turbine-id column to it. My second thought was to create a table for each metric and add a turbine-id column to each of those tables. What do you guys think? What is the best way for me to store this data so that I end up with a performant database? Thank you for your help and input.
One way to do it would be something like this:
TURBINE
    ID_TURBINE INTEGER PK
    LATITUDE DECIMAL
    LONGITUDE DECIMAL

METRIC
    ID_METRIC INTEGER PK
    METRIC_NAME VARCHAR UNIQUE
    VALUE_TYPE VARCHAR
        Allowed values = ('BOOLEAN', 'PERCENTAGE', 'INTEGER', 'DOUBLE', 'STRING')

TURBINE_METRIC
    ID_TURBINE_METRIC INTEGER PK
    ID_TURBINE INTEGER
        FOREIGN KEY TO TURBINE
    METRIC_NAME VARCHAR
        FOREIGN KEY TO METRIC
    BOOLEAN_VALUE BOOLEAN
    PERCENTAGE_VALUE DOUBLE
    INTEGER_VALUE INTEGER
    DOUBLE_VALUE DOUBLE
    STRING_VALUE VARCHAR
Flesh this out however you need it to be. I have no idea how long your VARCHAR fields should be, etc, but this allows you flexibility in terms of which metrics you store for each turbine. I suppose you could make the LATITUDE and LONGITUDE metrics as well - I just added them to the TURBINE table to show that there may be fixed info which is best stored as part of the TURBINE table.
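In Postgres DDL that sketch might look roughly like this (types and lengths are guesses; adjust them to your actual data):

CREATE TABLE turbine (
    id_turbine integer PRIMARY KEY,
    latitude   numeric(9,6),
    longitude  numeric(9,6)
);

CREATE TABLE metric (
    id_metric   integer PRIMARY KEY,
    metric_name varchar(100) NOT NULL UNIQUE,
    value_type  varchar(10)  NOT NULL
        CHECK (value_type IN ('BOOLEAN', 'PERCENTAGE', 'INTEGER', 'DOUBLE', 'STRING'))
);

CREATE TABLE turbine_metric (
    id_turbine_metric integer PRIMARY KEY,
    id_turbine        integer      NOT NULL REFERENCES turbine (id_turbine),
    metric_name       varchar(100) NOT NULL REFERENCES metric (metric_name),
    boolean_value     boolean,
    percentage_value  double precision,
    integer_value     integer,
    double_value      double precision,
    string_value      varchar(255)
);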
You want one table to represent turbines (things true of the turbine, like its location) and one or more of turbine metrics that arrive over time. If different groups of metrics arrive at different intervals, put them in different tables.
One goal I would have would be to minimize the number of nullable columns. Ideally, every column is defined NOT NULL, and invalid inputs are set aside for review. What is and is not nullable is controlled by promises made by the system supplying the metrics.
That's how it's done: every table has one or more keys that uniquely identify a row, and all non-key columns are information about the entity defined by the row.
It might seem tempting and "more flexible" to use one table of name-value pairs, so you never have to worry about new properties if the feed changes. That would be a mistake, though (a classic one, which is why I mention it). It's actually not more flexible, because changes upstream will require changes downstream, no matter what. Plus, if the upstream changes in ways that aren't detected by the DBMS, they can subtly corrupt your data and results.
By defining as tight a set of rules about the data as possible in SQL, you guard against missing, malformed, and erroneous inputs. Any verification done by the DBMS is verification the application can skip, and verification that no application can silently get around.
For example, you're given min/max values for wind speed and so on. These promises can form constraints in the database. If you get negative wind speed, something is wrong. It might be a sensor problem or (more likely) a data alignment error because a new column was introduced or the input was incorrectly parsed. Rather than put the wind direction mistakenly in the wind speed column, the DBMS rejects the input, and someone can look into what went wrong.
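For example (the table and column names here are invented), documented limits become CHECK constraints:

ALTER TABLE turbine_reading
    ADD CONSTRAINT wind_speed_in_range
        CHECK (wind_speed_mps BETWEEN 0 AND 120),
    ADD CONSTRAINT wind_direction_in_range
        CHECK (wind_direction_deg >= 0 AND wind_direction_deg < 360);

A row that violates either constraint is rejected at insert time, which is exactly the point: the bad input surfaces immediately instead of polluting later analysis.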
Don't forget to have fun. You have an opportunity to create a new database in a growing industry, and learn about database technology and theory at the same time. Doesn't happen every day!

Is JSONB a good choice for numerical data?

I have a system producing about 5TB of time-tagged numeric data every year. The fields tend to be different for each row, and to avoid having heaps of NULLs I'm thinking of using Postgres as a document store with JSONB.
However, GIN indexes on JSONB fields don't seem to be made for numerical and datetime data. There are no inequality or range operators for numbers and dates.
Here they suggest making special constructs with LATERAL to treat JSON values as normal numeric columns, and here someone proposes using a "sortable" string format for dates and filter string ranges.
These solutions sound a bit hacky and I wonder about their performance. Perhaps this is not a good application for JSONB?
An alternative I can think of that stays relational is sixth normal form: one table for each (optional) field, of which there would be hundreds. It sounds like a big JOIN mess, and new tables would have to be created on the fly any time a new field pops up. But maybe it's still better than a super-slow JSONB implementation.
Any guidance would be much appreciated.
More about the data
The data are mostly sensor readings, physical quantities and boolean flags. Which subset of these is present in each row is unpredictable. The index is an integer, and the only field that always exists is the corresponding date.
There would probably be one write for each value and almost no updates. Reads can be frequent and sliced based on any of the fields (some are more likely to be in a WHERE statement than others).
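For concreteness, the per-key expression-index workaround I've seen looks something like this (table and key names invented):

CREATE INDEX readings_temp_idx
    ON readings (((payload->>'temperature')::numeric));

SELECT * FROM readings
WHERE (payload->>'temperature')::numeric BETWEEN 20.0 AND 25.0;

That gives proper range scans, but it needs one index per field, which starts to look a lot like the 6NF option again.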

Efficiently search large DB for similar records

Suppose the following: as input, one gets a record consisting of N numbers and booleans. This vector has to be compared against a database of vectors, each of which includes M additional "result" elements. That means the database holds P vectors of size N+M.
Each vector in the database holds a boolean as its last element. The aim of the exercise is to find, as fast as possible, the record(s) that are the closest match to the input vector AND whose result vector ends with a TRUE boolean.
To make the above a bit more comprehensible, consider the following example:
A database with personal health information, consisting of records holding:
age
gender
weight
length
heart issues (boolean)
lung issues (boolean)
residence
alternative plan chosen (if done)
accepted offer
The program would then get an input like
36 Male 185pound 68in FALSE FALSE NYC
It would then find out which plan would be the best to offer the client, based on what's in the database.
I know of a few methods that would help to do this, e.g. the Levenshtein distance. However, most methods would involve searching the entire database for the best matches.
Are there any algorithms or methods that would cut back on the processing power/time required? I can't imagine that, say, insurance agencies don't use more efficient methods to search their databases...
Any insights into this area would be greatly appreciated!
Assumption: this is a relational database. If instead it were NOSQL then please provide more info on which db.
Do you have the option to create bitmap indexes? They can cut down the number of records returned. That is useful for almost all of these columns, since their cardinalities are low.
After that the only one left is the residence, and you should use a geo distance for that.
If you are unable to create bitmap indexes then what are your filtering options? If none then you have to do a full table scan.
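For example, on Oracle (the table and column names below are assumed) a bitmap index is a one-liner per low-cardinality column:

CREATE BITMAP INDEX idx_profile_gender ON health_profile (gender);
CREATE BITMAP INDEX idx_profile_heart  ON health_profile (heart_issues);
CREATE BITMAP INDEX idx_profile_lung   ON health_profile (lung_issues);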
For each of the components, e.g. age, gender, etc., you need to
(a) determine a distance metric
(b) determine how to compute that metric and the distance between different records.
I'm not sure a Levenshtein distance would work here - you need to take each field separately to find its contribution to the overall distance measure.
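As a crude, Postgres-flavored illustration (weights, table and column names are all invented, and a real system would normalize each field first), the per-field contributions can simply be summed and sorted on:

SELECT *,
       0.5 * abs(age - 36)
     + 2.0 * (CASE WHEN gender = 'M' THEN 0 ELSE 1 END)
     + 0.1 * abs(weight_lb - 185)
     + 0.3 * abs(length_in - 68)
     + 1.0 * (CASE WHEN heart_issues THEN 1 ELSE 0 END)
     + 1.0 * (CASE WHEN lung_issues THEN 1 ELSE 0 END) AS distance
FROM health_profile
WHERE accepted_offer = TRUE
ORDER BY distance
LIMIT 10;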

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a 750k record minimum table of product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for Coldfusion to do this? I envision a google like single entry search where a customer can type in the part number, or the sizing, etc, and get hits on any or all relevant results.
Currently if I run a 'LIKE' comparison query, it seems to take ages (OK, a few seconds, but still), and that is too long - at times making a user sit there and wait up to 10 seconds for queries & page loads.
Or are there any SQL techniques to help accomplish this? I want to use a proven method to search the data, not just a simple SQL LIKE or = comparison operation.
So this is a multi-approach question: should I attack this at the SQL level (as it ultimately looks like I should), or is there a plug-in/module for ColdFusion that I can grab that will give me speedy, advanced search capability?
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether even trying it would be worthwhile depends a lot on how often you update the records you need to search. If you update them rarely, you could do a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the webserver and will certainly offset any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically the search of textual fields (as I surmise from your mention of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of your records from the main table to a set of words (one word per row) of the textual field. If it matters, add the field of origin as a third column in the index table, and if you want "relevance" features you may want to record a word count as well.
Populate the index table either from a trigger (doing the word splitting there) or from your app - the latter might be simpler: just call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately and drastically speed up textual search, as it will no longer use "LIKE" and will be able to use indexes on the index table (no pun intended) without interfering with the indexing on SKU and the like on the main table.
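A minimal sketch of such an index table and lookup, SQL Server flavored (all names invented, assuming an integer key on the main product table):

CREATE TABLE product_word_index (
    product_id   int          NOT NULL,   -- key of the row in the main table
    word         varchar(100) NOT NULL,   -- one word per row
    source_field varchar(50)  NOT NULL,   -- which column the word came from
    word_count   int          NOT NULL DEFAULT 1
);

CREATE INDEX ix_product_word ON product_word_index (word, product_id);

-- matches use an index seek instead of LIKE '%...%' on the main table
SELECT DISTINCT product_id
FROM product_word_index
WHERE word IN ('steel', 'bracket');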
Also, ensure that all the relevant fields are fully indexed - not necessarily in the same compound index (SKU, sizing, etc.) - and any field that is searched as a range (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximately increasing order of that field, or you don't care as much about insert/update speed).
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow, and the query plans you currently have for those slow queries.
Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable" - your comment mentioned "is it boxed" in the set of text fields. If so, I assume the values are "yes"/"no" or some other very limited set. In that case, simply store a numeric code for the valid values, do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
I've done this using SQL Server's full text indexes. This requires very few application changes and no changes to the database schema except for the addition of the full text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The value of the column designated as the unique key of the full text index
Rank - A relative rank of the match (0 - 1000, with a higher rank meaning a better match).
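To get the matching rows themselves, join the key column back to the source table, e.g. (assuming Bugs has an int primary key column named BugId):

SELECT b.*, ft.[RANK]
FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"') AS ft
JOIN Bugs AS b ON b.BugId = ft.[KEY]
ORDER BY ft.[RANK] DESC;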
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution then you should just go with Google itself. It sounds like you're doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup then you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding; it will not be dependent on your database, or even ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a user's role, permissions, or group, then it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data lives, that is where your search performance issue is going to be. Make sure you have indexes on the columns you are searching on. Note that with LIKE, an index cannot be used for a query like this: SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But an index can be used if you write it like this: SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key here is to leave as many of the leading characters as possible free of wildcards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173
