Modeling data for maximum searchability - database

What's the best way to model this data:
public class Item {
Double price;
String geoHash;
Long startAvailabilty; // timestamp
Long endAvailabilty; // timestamp
Set<String> keywords;
String category;
String dateCreated; // iso date
String dateUpdated; // iso date
Integer likes;
Boolean isActive;
}
It should then be possible to query and search on the following:
Price range
String "starts with" query (geohash search)
Timestamp range (similar to price range)
Keyword search (search within a set of strings)
Equality (for category, etc.)
ISO date string search
Boolean search (find which items are active)
The target is the Riak KV database.

From the above, I can imagine this being implemented either via maps (possibly containing sets) combined with MapReduce or secondary indexes, or by using Yokozuna (Solr).
That said, the data set above looks like a better fit for a relational database than for a NoSQL database. For something in the middle, you might want to consider Riak TS, which is similar to KV but has a thin SQL layer on top. The current version (1.5.1) is a bit old, but we plan to launch an improved version later this year with additional SQL features.

Related

Fast matching of input string with strings in db

I have a string-matching algorithm that works with concrete strings (e.g. "123456789") and string patterns (e.g. "1*******9").
The string patterns are not regular expressions or SQL LIKE patterns. The only placeholder they provide is "*", which means "a single digit or letter".
So the algorithm treats these values as "equal":
12ABCDE89
12A***E89
**A****8*
*********
The data is stored in a relational database (MS SQL Server), and a .NET Core app accesses it via Entity Framework Core.
The required scenario is to take 500 input strings (each either a concrete string or a pattern) and find the matching rows in the database (in a table containing 1 million rows).
I first implemented it using LIKE pattern matching (I transformed the input strings into LIKE patterns and then built a predicate for the WHERE clause), but tests showed unacceptable performance.
Can I implement this task using the FULL-TEXT SEARCH feature of MSSQL? What would the predicate look like in that case? Any other ideas for the implementation?
You can try the CLR user-defined function approach. That way you don't need to express the comparison in SQL at all: just compare the two strings using your own algorithm. In theory, this approach should be faster.
[SqlFunction(
    DataAccess = DataAccessKind.None,
    SystemDataAccess = SystemDataAccessKind.None,
    IsPrecise = true,
    IsDeterministic = true)]
// Must be public and static
public static bool CustomIsEqualTo(string baseString, string stringToCompare)
{
    // Placeholder: put the actual comparison logic here
    return true;
}
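The comparison rule described in the question can be sketched in a few lines. This is a minimal illustration in Python rather than the C# body of the CLR function, and it rests on assumptions not stated in the question: both strings must have the same length, and the check that non-wildcard characters are digits or letters is omitted. The name `patterns_match` is hypothetical.

```python
WILDCARD = "*"


def patterns_match(left: str, right: str) -> bool:
    """Return True when the two strings agree at every position.

    A "*" on either side matches any single character on the other side.
    Assumption: strings of different lengths never match.
    """
    if len(left) != len(right):
        return False
    return all(
        a == b or a == WILDCARD or b == WILDCARD
        for a, b in zip(left, right)
    )


# The four values from the question should all be pairwise "equal".
values = ["12ABCDE89", "12A***E89", "**A****8*", "*********"]
all_equal = all(patterns_match(a, b) for a in values for b in values)
```

Because either side may contain wildcards, the check is symmetric, which is why it cannot be expressed as a single LIKE pattern against a stored column that itself holds patterns.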

Composite or FilterPredicate query on Ref'd entity

Here's what I have:
class A {
    Ref<foo> b;
    Ref<foo> c;
    int code;
    Date timestamp;
}
The pseudo "where" clause of the SQL statement would look like this:
where b = object or (c = object and code = 1) order by timestamp
In plain English: give me all the records of A where b equals the specified object, or where c equals the specified object and code equals 1. Order the result by timestamp.
Is the composite query part even possible with the datastore (Objectify)? I really don't want to do two queries and merge the results, because I have to sort by timestamp.
Any help is appreciated.
P.S. I already tried
new FilterPredicate(b, EQUAL, object)
This didn't work, because the entity type is not a supported type.
Thanks!
Pass a native datastore Key object to the FilterPredicate. The Google SDK Key, not the generic Objectify Key<?>.
Normally when filtering on properties, Objectify translates Ref<?> and Key<?> objects to native datastore keys for you. With the google-supplied FilterPredicate, that isn't an option. So you have to do the translation manually.
Objectify stores all Key<?> and Ref<?> fields as native datastore Keys, so you can freely interchange them (or even change the type of fields if you want).

store strings of arbitrary length in Postgresql

I have a Spring application which uses JPA (Hibernate), initially created with Spring Roo. I need to store Strings of arbitrary length, so I've annotated the field with @Lob:
public class MyEntity {
    @NotNull
    @Size(min = 2)
    @Lob
    private String message;
    ...
}
The application works fine on localhost, but after I deployed it to an external server an encoding problem appeared. For that reason I'd like to check whether the data stored in the PostgreSQL database is correct. The application creates/updates the tables automatically, and for that field (message) it has created a column of type:
text NOT NULL
The problem is that after storing data, if I browse the table or just do a SELECT on that column, I see numbers instead of the text. Those numbers seem to be identifiers pointing to "somewhere" where the information is actually stored.
Can anyone tell me exactly what these identifiers are, and whether there is any way to see the data stored in a @Lob column from pgAdmin or a SELECT statement?
Is there any better way to store Strings of arbitrary length in JPA?
Thanks.
I would recommend skipping the @Lob annotation and using columnDefinition like this:
@Column(columnDefinition="TEXT")
See if that helps with viewing the data while browsing the database itself.
Use the @Lob annotation; it is correct. The column stores an OID that refers to an entry in the pg_largeobject table (in pgAdmin: Catalogs -> PostgreSQL -> Tables -> pg_largeobject).
The binary data is stored there efficiently, and JPA will correctly store and retrieve it for you, treating this as an implementation detail.
Old question, but here is what I found when I encountered this:
http://www.solewing.org/blog/2015/08/hibernate-postgresql-and-lob-string/
Relevant parts below.
@Entity
@Table(name = "note")
@Access(AccessType.FIELD)
class NoteEntity {
    @Id
    private Long id;
    @Lob
    @Column(name = "note_text")
    private String noteText;
    public NoteEntity() { }
    public NoteEntity(String noteText) { this.noteText = noteText; }
}
The Hibernate PostgreSQL9Dialect stores @Lob String attribute values by explicitly creating a large object instance, and then storing the OID of the object in the column associated with the attribute.
Obviously, the text of our notes isn't really in the column. So where is it? The answer is that Hibernate explicitly created a large object for each note, and stored the OID of the object in the column. If we use some PostgreSQL large object functions, we can retrieve the text itself.
Use this to query:
SELECT id,
convert_from(loread(
lo_open(note_text::int, x'40000'::int), x'40000'::int), 'UTF-8')
AS note_text
FROM note

GQL query with "like" operator [duplicate]

Simple one really. In SQL, if I want to search a text field for a couple of characters, I can do:
SELECT blah FROM blah WHERE blah LIKE '%text%'
The documentation for App Engine makes no mention of how to achieve this, but surely it's a common enough problem?
BigTable, which is the database back end for App Engine, will scale to millions of records. Due to this, App Engine will not allow you to do any query that will result in a table scan, as performance would be dreadful for a well populated table.
In other words, every query must use an index. This is why you can only do =, > and < queries. (In fact you can also do !=, but the API does this using a combination of > and < queries.) This is also why the development environment monitors all the queries you do and automatically adds any missing indexes to your index.yaml file.
There is no way to index for a LIKE query so it's simply not available.
Have a watch of this Google IO session for a much better and more detailed explanation of this.
I'm facing the same problem, but I found something in the Google App Engine docs:
Tip: Query filters do not have an explicit way to match just part of a string value, but you can fake a prefix match using inequality filters:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2",
"abc",
u"abc" + u"\ufffd")
This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html
maybe this could do the trick ;)
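To see why the two inequality filters select exactly the prefix matches, the same bounds can be applied to an ordinary sorted list. This is a minimal sketch in plain Python 3, not datastore code: the sorted list stands in for the property index, and `prefix_bounds` and `prefix_scan` are hypothetical helper names.

```python
import bisect


def prefix_bounds(prefix):
    """Return (lower, upper) so that lower <= value < upper selects prefix matches.

    The upper bound appends U+FFFD, as in the tip quoted above.
    """
    return prefix, prefix + "\ufffd"


def prefix_scan(sorted_values, prefix):
    """Slice a sorted list the way an index range scan would."""
    lower, upper = prefix_bounds(prefix)
    start = bisect.bisect_left(sorted_values, lower)
    end = bisect.bisect_left(sorted_values, upper)
    return sorted_values[start:end]


# Stand-in for the sorted property index.
index = sorted(["ab", "abc", "abcd", "abd", "abz", "b"])
matches = prefix_scan(index, "abc")
```

Only a contiguous slice of the sorted values is read, which is why this form of "LIKE 'abc%'" is indexable while "LIKE '%abc%'" is not.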
Although App Engine does not support LIKE queries, have a look at the properties ListProperty and StringListProperty. When an equality test is done on these properties, the test is actually applied to all list members, e.g., list_property = value tests whether the value appears anywhere in the list.
Sometimes this feature might be used as a workaround to the lack of LIKE queries. For instance, it makes it possible to do simple text search, as described on this post.
You need to use the Search service to perform full-text search queries similar to SQL LIKE.
Gaelyk provides a domain-specific language for writing more user-friendly search queries. For example, the following snippet finds the first ten books, sorted from the latest, with a title containing fern and a genre exactly matching thriller:
def documents = search.search {
    select all from books
    sort desc by published, SearchApiLimits.MINIMUM_DATE_VALUE
    where title =~ 'fern'
    and genre = 'thriller'
    limit 10
}
LIKE is written as Groovy's match operator =~.
It supports functions such as distance(geopoint(lat, lon), location) as well.
App Engine launched a general-purpose full-text search service in version 1.7.0 that supports the datastore.
Details in the announcement.
More information on how to use this: https://cloud.google.com/appengine/training/fts_intro/lesson2
Have a look at Objectify, a Datastore access API. Its FAQ addresses this question specifically; here is the answer:
How do I do a like query (LIKE "foo%")
You can do something like a startsWith, or an endsWith if you reverse the string when storing and searching. You do a range query with the starting value you want, and a value just above the one you want.
String start = "foo";
... = ofy.query(MyEntity.class).filter("field >=", start).filter("field <", start + "\uFFFD");
Just follow this:
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/search/init.py#354
It works!
class Article(search.SearchableModel):
    text = db.TextProperty()
    ...
article = Article(text=...)
article.save()
To search the full text index, use the SearchableModel.all() method to get an
instance of SearchableModel.Query, which subclasses db.Query. Use its search()
method to provide a search query, in addition to any other filters or sort
orders, e.g.:
query = article.all().search('a search query').filter(...).order(...)
I tested this with the GAE Datastore low-level Java API and it works perfectly:
Query q = new Query(Directorio.class.getSimpleName());
Filter filterNombreGreater = new FilterPredicate("nombre", FilterOperator.GREATER_THAN_OR_EQUAL, query);
Filter filterNombreLess = new FilterPredicate("nombre", FilterOperator.LESS_THAN, query+"\uFFFD");
Filter filterNombre = CompositeFilterOperator.and(filterNombreGreater, filterNombreLess);
// Apply the combined range filter built above
q.setFilter(filterNombre);
In general, even though this is an old post, one way to emulate a 'LIKE' or 'ILIKE' is to gather all results from a '>=' query, then loop over the results in Python (or Java) and keep the elements containing what you're looking for.
Let's say you want to filter users given q='luigi':
users = []
qry = self.user_model.query(ndb.OR(self.user_model.name >= q.lower(),
                                   self.user_model.email >= q.lower(),
                                   self.user_model.username >= q.lower()))
for _qry in qry:
    # Keep only the entities that actually contain the search text
    if (q.lower() in _qry.name.lower()
            or q.lower() in _qry.email.lower()
            or q.lower() in _qry.username.lower()):
        users.append(_qry)
It is not possible to do a LIKE search on the App Engine datastore; however, storing the words in an ArrayList does the trick if you need to search for a word within a string.
@Index
public ArrayList<String> searchName;
Then search the index using Objectify:
List<Profiles> list1 = ofy().load().type(Profiles.class).filter("searchName =",search).list();
This gives you a list of all the items that contain the word you searched for.
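The answer does not show how the word list is filled. A minimal sketch of that step, in Python for illustration and with hypothetical helper names (`search_terms`, `matches_term`), could look like this. It assumes that lower-casing and splitting on non-word characters is acceptable for your data.

```python
import re


def search_terms(text):
    """Split text into unique lower-case words for an indexed list property."""
    return sorted(set(re.findall(r"\w+", text.lower())))


def matches_term(terms, word):
    """Mimic the datastore's list equality filter: exact match on one member."""
    return word.lower() in terms


terms = search_terms("Luigi Mario, luigi!")
```

Note that this only finds whole words: a search for "lui" does not match "luigi", so it is a substitute for keyword search rather than for LIKE '%text%'.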
If the LIKE '%text%' always compares against a word or a few words (think permutations) and your data changes slowly (slowly meaning that it's not prohibitively expensive, both price-wise and performance-wise, to create and update indexes), then a Relation Index Entity (RIE) may be the answer.
Yes, you will have to build an additional datastore entity and populate it appropriately. Yes, there are some constraints that you will have to work around (one is the limit of 5000 on the length of a list property in the GAE datastore). But the resulting searches are lightning fast.
For details see my RIE with Java and Objectify and RIE with Python posts.
"Like" is often used as a poor man's substitute for text search. For text search, it is possible to use Whoosh-AppEngine.

What's the best way to store co-ordinates (longitude/latitude, from Google Maps) in SQL Server?

I'm designing a table in SQL Server 2008 that will store a list of users and a Google Maps co-ordinate (longitude & latitude).
Will I need two fields, or can it be done with 1?
What's the best (or most common) data-type to use for storing this kind of data?
Fair Warning! Before taking the advice to use the GEOGRAPHY type, make sure you are not planning on using Linq or Entity Framework to access the data because it's not supported (as of November 2010) and you will be sad!
Update Jul 2017
For those reading this answer now: it is obsolete, as it refers to an outdated technology stack. See comments for more details.
Take a look at the new Spatial data-types that were introduced in SQL Server 2008. They are designed for this kind of task and make indexing and querying much easier and more efficient.
More information:
MS TechNet: SQL Server 2008 Spatial Data Types,
MSDN: Working with Spatial Data (Database Engine).
I don't know the answer for SQL Server but...
In MySQL save it as FLOAT( 10, 6 )
This is the official recommendation from the Google developer documentation.
CREATE TABLE `coords` (
  `lat` FLOAT( 10, 6 ) NOT NULL ,
  `lng` FLOAT( 10, 6 ) NOT NULL
) ENGINE = MYISAM ;
The way I do it: I store the latitude and longitude, and then I have a third column that is a computed geography value derived automatically from the first two columns. The table looks like this:
CREATE TABLE [dbo].[Geopoint]
(
[GeopointId] BIGINT NOT NULL PRIMARY KEY IDENTITY,
[Latitude] float NOT NULL,
[Longitude] float NOT NULL,
[ts] ROWVERSION NOT NULL,
[GeographyPoint] AS ([geography]::STGeomFromText(((('POINT('+CONVERT([varchar](20),[Longitude]))+' ')+CONVERT([varchar](20),[Latitude]))+')',(4326)))
)
This gives you the flexibility of spatial queries on the geoPoint column and you can also retrieve the latitude and longitude values as you need them for display or extracting for csv purposes.
I hate to be a contrarian to those who said "here is a new type, let's use it". The new SQL Server 2008 spatial types have some pros, namely efficiency, but you can't blindly say "always use that type". It really depends on some bigger-picture issues.
As an example, take integration. This type has an equivalent type in .NET, but what about interop? What about supporting or extending older versions of .NET? What about exposing this type across the service layer to other platforms? What about normalization of data? Maybe you are interested in lat or long as standalone pieces of information. Perhaps you've already written complex business logic to handle long/lat.
I'm not saying that you shouldn't use the spatial type - in many cases you should. I'm just saying you should ask some more critical questions before going down that path. For me to answer your question most accurately I would need to know more about your specific situation.
Storing long/lat separately or in a spatial type are both viable solutions, and one may be preferable to the other depending on your own circumstances.
What you want to do is store the Latitude and Longitude as the new SQL2008 Spatial type -> GEOGRAPHY.
Here's a screen shot of a table, which I have.
(Screenshot: http://img20.imageshack.us/img20/6839/zipcodetable.png)
In this table, we have two fields that store geography data.
Boundary: this is the polygon that is the zip code boundary
CentrePoint: this is the Latitude / Longitude point that represents the visual middle point of this polygon.
The main reason why you want to save it to the database as a GEOGRAPHY type is so you can then leverage all the SPATIAL methods off it -> eg. Point in Poly, Distance between two points, etc.
BTW, we also use Google's Maps API to retrieve lat/long data and store that in our Sql 2008 DB -- so this method does work.
SQL Server has support for spatial related information. You can see more at http://www.microsoft.com/sqlserver/2008/en/us/spatial-data.aspx.
Alternatively, you can store the information as two basic fields. A float is usually the standard data type reported by most devices and is accurate enough for within an inch or two, which is more than adequate for Google Maps.
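The precision claim can be checked with a quick calculation. The sketch below estimates the gap between adjacent representable values near 180 degrees for 64-bit and 32-bit IEEE 754 floats and converts it to metres. The figure of 111,320 metres per degree is an assumed round value for one degree of latitude, and `spacing_in_degrees` is a hypothetical helper.

```python
import math

# Approximate length of one degree of latitude (assumed round figure).
METERS_PER_DEGREE = 111_320


def spacing_in_degrees(value, mantissa_bits):
    """Gap between adjacent binary floats near value for a given mantissa width."""
    # frexp returns (m, e) with value = m * 2**e and 0.5 <= m < 1,
    # so e - 1 is the exponent of the normalised form 1.x * 2**exponent.
    exponent = math.frexp(value)[1] - 1
    return 2.0 ** (exponent - mantissa_bits)


# 52 fraction bits: 64-bit double (SQL Server float defaults to float(53)).
double_error_m = spacing_in_degrees(180.0, 52) * METERS_PER_DEGREE
# 23 fraction bits: 32-bit single (SQL Server real).
single_error_m = spacing_in_degrees(180.0, 23) * METERS_PER_DEGREE
```

Under these assumptions a 64-bit float resolves far below a millimetre, while a 32-bit float resolves to somewhere between one and two metres at the largest longitudes, so the width of the column type matters more than the choice between one field and two.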
NOTE: This is a recent answer based on recent SQL Server and .NET stack updates.
Latitude and longitude from Google Maps should be stored as Point (note the capital P) data in SQL Server, under the geography data type.
Assuming your current data is stored in a table Sample as varchar in columns lat and lon, the query below will convert it to geography:
alter table Sample add latlong geography
go
update Sample set latlong= geography::Point(lat,lon,4326)
go
PS: The next time you do a select on this table with geography data, apart from the Results and Messages tabs, you will also get a Spatial results tab for visualization.
If you are using Entity Framework 5 or later, you can use DbGeography. Example from MSDN:
public class University
{
public int UniversityID { get; set; }
public string Name { get; set; }
public DbGeography Location { get; set; }
}
public partial class UniversityContext : DbContext
{
public DbSet<University> Universities { get; set; }
}
using (var context = new UniversityContext ())
{
context.Universities.Add(new University()
{
Name = "Graphic Design Institute",
Location = DbGeography.FromText("POINT(-122.336106 47.605049)"),
});
context.Universities.Add(new University()
{
Name = "School of Fine Art",
Location = DbGeography.FromText("POINT(-122.335197 47.646711)"),
});
context.SaveChanges();
var myLocation = DbGeography.FromText("POINT(-122.296623 47.640405)");
var university = (from u in context.Universities
orderby u.Location.Distance(myLocation)
select u).FirstOrDefault();
Console.WriteLine(
"The closest University to you is: {0}.",
university.Name);
}
https://msdn.microsoft.com/en-us/library/hh859721(v=vs.113).aspx
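Something I struggled with when I started using DbGeography was the coordinateSystemId. See the answer below for an excellent explanation and the source of the code below.
public class GeoHelper
{
    public const int SridGoogleMaps = 4326;
    public const int SridCustomMap = 3857;
    public static DbGeography FromLatLng(double lat, double lng)
    {
        // WKT expects longitude first, then latitude.
        // InvariantCulture (from System.Globalization) keeps "." as the
        // decimal separator regardless of the server's locale.
        return DbGeography.PointFromText(
            "POINT("
            + lng.ToString(CultureInfo.InvariantCulture) + " "
            + lat.ToString(CultureInfo.InvariantCulture) + ")",
            SridGoogleMaps);
    }
}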
https://stackoverflow.com/a/25563269/3850405
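A common stumbling block with the well-known text (WKT) form used above is argument order: WKT puts longitude (x) before latitude (y), whereas geography::Point in SQL Server takes latitude first. A minimal sketch of a formatter that guards against swapped arguments, in Python for illustration (`wkt_point` is a hypothetical name):

```python
def wkt_point(lat, lng):
    """Format a WKT point. WKT order is longitude (x) first, then latitude (y)."""
    lat, lng = float(lat), float(lng)
    # Range checks catch many cases of accidentally swapped arguments.
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude out of range: %r" % lat)
    if not -180.0 <= lng <= 180.0:
        raise ValueError("longitude out of range: %r" % lng)
    # repr() gives the shortest text that round-trips the float exactly.
    return "POINT(%r %r)" % (lng, lat)


seattle = wkt_point(47.605049, -122.336106)
```

The range check only catches swaps where the longitude lies outside the range from -90 to 90, so it is a safety net rather than a guarantee.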
If you are just going to substitute it into a URL I suppose one field would do - so you can form a URL like
http://maps.google.co.uk/maps?q=12.345678,12.345678&z=6
but as it is two pieces of data I would store them in separate fields
Store both as float, and put a unique constraint on them, i.e.:
create table coordinates(
coord_uid counter primary key,
latitude float,
longitude float,
constraint la_long unique(latitude, longitude)
);

Resources