OrientDB - Select from multiple vertices using graph database

I am just starting out with nosql databases, and OrientDB in particular. My prior experience is with relational databases, mostly with SQL Server.
In MSSQL, I can do something like this:
SELECT s.url, p.title
FROM Site s
JOIN Page p ON s.Id = p.SiteId
And it will give me a table of all pages, along with the url of the site they belong to:
url | title
----------------------------
site1.com | page1
site1.com | page2
site2.com | page1
Now, in OrientDB, my understanding is that you should use a link for a one-way relationship, or an edge for a bidirectional relationship. Since I want to know which pages belong to a site, as well as which site a particular page belongs to, I have decided to use an edge in this case. The Site and Page classes/vertices are already created in a similar fashion, but I can't figure out how to get a similar result set. From the documentation (https://orientdb.com/docs/2.2/SQL.html):
OrientDB allows only one class (classes are equivalent to tables in this discussion) as opposed to SQL, which allows for many tables as the target. If you want to select from 2 classes, you have to execute 2 sub-queries and join them with the UNIONALL function.
Their example SELECT FROM E, V then becomes SELECT EXPAND( $c ) LET $a = ( SELECT FROM E ), $b = ( SELECT FROM V ), $c = UNIONALL( $a, $b ), but that's not what I want. That results in something along the lines of
url | title
----------------------------
site1.com |
site1.com |
site2.com |
| page1
| page2
| page1
How would I go about creating the original result set, like in MSSQL?
Additional consideration: My training and experience with MSSQL dictate that database operations should be done in the database, rather than in application code. For example, I could have done one database call to get the s.url and s.id fields, then a second call to get the p.title and p.SiteId fields, and then matched them up in application code. The reason I avoid this is that multiple database calls are less efficient time-wise than returning the extra/redundant information (in my example, site1.com is returned twice).
Is this perhaps not the case for OrientDb, or even graph/nosql databases in general? Should I instead be making two separate calls to get all of the data I need, i.e. SELECT FROM Site WHERE Url = "site1.com" AND SELECT EXPAND(OUT("HasPages")) FROM Site WHERE Name = "site1.com"?
Thank you

Try this:
select Url, out("HasPages").title as title from Site unwind title
Hope it helps
Regards


Join in Laravel's Eloquent

I have a visit Model and I'm getting the data I want like that:
$app_visits = Visit::select([
        'start',
        'end',
        'machine_name'
    ])->where('user_id', $chosen_id)->get();
But I want to add points for every visit. Every visit has interactions, but there is no visit_id on them (because of another system, I cannot add it).
Last developer left it like that:
foreach ($app_visits as $key => $app_visit) {
    $interactions = Interaction::where([
        'machine_name' => $app_visit->machine_name,
    ])->whereBetween('date', [$app_visit->start, $app_visit->end])->get();

    $points = 0;
    foreach ($interactions as $interaction) {
        $points += (int)$interaction->app_stage;
    }
    $app_visits[$key]['points'] = $points;
}
But I really don't like it as it's slow and messy. I wanted to just add 'points' sum to the first query, to touch database only once.
Edit: as someone asked for the database structure:
visit:
|id | start | end | machine_name | user_id
interaction:
|id | time | machine_name | points
You can use a few things in eloquent. Probably the most useful for this case, is the select(DB::raw(sql...)) as you will have to add a bit of raw sql to retrieve a count.
For example:
return $query
    ->join(...)
    ->where(...)
    ->select(DB::raw(
        'COUNT(DISTINCT res.id) AS count'
    ))
    ->groupBy(...);
Failing that, I'd just replace the eloquent with raw sql. We've had to do that a fair bit, as our data sets are massive, and eloquent model building has proven a little slow.
Update, as you've added the structure: why not just add a relation to Interaction based upon machine_name (or even a custom method using raw SQL that calculates the points), and use Visits::with('interaction.visitPoints')->...blah?
Take a look at the DB facade instead of Eloquent, for more complex and efficient queries:
https://laravel.com/docs/5.6/queries
There is also the possibility to use raw SQL with this facade.

Google Datastore - Search Optimization Technique

I am dealing with a real-estate app. A Home will have typical properties like Price, Bed Rooms, Bath Rooms, SqFt, Lot Size, etc. A user will search for Homes, and such a query will require multiple inequality filters, like: Price between x and y, rooms greater than z, bathrooms more than p, etc.
I know that multiple inequality filters are not allowed. I also do not want to perform any filtering in my code and/because I want to be able to use Cursors.
So I have come up with two solutions. I am not sure if these are right, so I wonder if gurus can shed some light.
Solution 1: I will discretize the values of each attribute and save them in a list field, then use IN. For example, if a home has 3 bedrooms, instead of storing beds=3, I will store beds = [1,2,3]. Now if a user searches for homes with, say, at least two bedrooms, then instead of writing the filter as beds > 2, I will write the filter as "beds IN [2]". My home above ([1,2,3]) will qualify, and so will any home with 2 beds ([1,2]), 4 beds ([1,2,3,4]), and so on.
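The list trick in Solution 1 can be sketched in plain Python; the dicts just stand in for Datastore entities, and discretize/at_least are hypothetical helper names, not Datastore API calls:

```python
def discretize(value, cap=20):
    """Store the range 1..value, so an IN filter can emulate 'at least'."""
    return list(range(1, min(value, cap) + 1))

# A home with 3 bedrooms is stored with beds = [1, 2, 3], and so on.
homes = [
    {"id": "A", "beds": discretize(3)},
    {"id": "B", "beds": discretize(2)},
    {"id": "C", "beds": discretize(1)},
]

def at_least(homes, field, minimum):
    """Emulates the Datastore filter `field IN [minimum]` on a list property."""
    return [h["id"] for h in homes if minimum in h[field]]

print(at_least(homes, "beds", 2))  # ['A', 'B'] -- every home with 2+ bedrooms
```

Note that each value in a list property becomes its own index entry in Datastore, which is where the write-cost difference between the two solutions would show up.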
Solution 2: It is similar to the first one, but instead of creating a list property, I will actually add attributes (columns) to the home. So a home with 3 bedrooms will have the following attributes/columns/properties: col-bed-1:true, col-bed-2:true, col-bed-3:true. Now if a user searches for homes with, say, at least two bedrooms, then instead of writing the filter as beds > 2, I will write the filter as "col-bed-2 = true". My home will qualify, and so will any home with 2 beds, 3 beds, 4 beds, and so on.
I know both solutions will work, but I want to know:
1. Which one is better both from a performance and google pricing perspective
2. Is there a better solution to do this?
I do almost exactly your use case with a Python GAE app that lists posts with housing advertisements (similar to Craigslist). I wrote it in Python, and searching with a filter is working and straightforward.
You should choose a language: Python, Java or Go, and then use the Google Search API (that has built-in filtering for equalities or inequalities) and build datastore indexes that you can query using the search API.
For instance, you can use a python class like the following to populate the datastore and then use the Search API.
class Home(db.Model):
    address = db.StringProperty(verbose_name='address')
    number_of_rooms = db.IntegerProperty()
    size = db.FloatProperty()
    added = db.DateTimeProperty(verbose_name='added', auto_now_add=True)  # read-only
    last_modified = db.DateTimeProperty(required=True, auto_now=True)
    timestamp = db.DateTimeProperty(auto_now=True)
    image_url = db.URLProperty()
I definitely think that you should avoid storing permutations, for several reasons: permutations can explode in size, and they make the code difficult to read. Instead, you should do as I did and find examples where someone else has already solved the same or a similar problem.
This appengine demo might help you.

How do you modify a UNION query in CakePHP 3?

I want to paginate a union query in CakePHP 3.0.0. By using a custom finder, I have it working almost perfectly, but I can't find any way to get limit and offset to apply to the union, rather than either of the subqueries.
In other words, this code:
$articlesQuery = $articles->find('all');
$commentsQuery = $comments->find('all');
$unionQuery = $articlesQuery->unionAll($commentsQuery);
$unionQuery->limit(7)->offset(7); // nevermind the weirdness of applying this manually
produces this query:
(SELECT {article stuff} ORDER BY created DESC LIMIT 7 OFFSET 7)
UNION ALL
(SELECT {comment stuff})
instead of what I want, which is this:
(SELECT {article stuff})
UNION ALL
(SELECT {comment stuff})
ORDER BY created DESC LIMIT 7 OFFSET 7
I could manually construct the correct query string like this:
$unionQuery = $articlesQuery->unionAll($commentsQuery);
$sql = $unionQuery->sql();
$sql = "($sql) ORDER BY created DESC LIMIT 7 OFFSET 7";
but my custom finder method needs to return a \Cake\Database\Query object, not a string.
So,
Is there a way to apply methods like limit() to an entire union query?
If not, is there a way to convert a SQL query string into a Query object?
Note:
There's a closed issue that describes something similar to this (except using paginate($unionQuery)) without a suggestion of how to overcome the problem.
Apply limit and offset to each subquery?
scrowler kindly suggested this option, but I think it won't work. If limit is set to 5 and the full result set is this:
Article 9 --|
Article 8 |
Article 7 -- Page One
Article 6 |
Article 5 --|
Article 4 --|
Comment 123 |
Article 3 -- Here be dragons
Comment 122 |
Comment 121 --|
...
Then the query for page 1 would work, because (the first five articles) + (the first five comments), sorted manually by date and trimmed to the first five of the combined result, yields the first five articles (Articles 9 through 5).
But page 2 won't work, because the offset of 5 would be applied to both articles and comments, meaning the first 5 comments (which weren't included on page 1) will never show up in the results.
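The failure mode can be sketched in Python, with sorted lists standing in for the two subqueries (the timestamps and labels are made up for illustration):

```python
# Each row is (created, label); both lists are already sorted by created DESC.
articles = [(19, 'A9'), (18, 'A8'), (17, 'A7'), (16, 'A6'), (15, 'A5'),
            (14, 'A4'), (12, 'A3')]
comments = [(13, 'C123'), (11, 'C122'), (10, 'C121')]

limit, page = 5, 2
lo, hi = (page - 1) * limit, page * limit

# Correct: union everything, then sort, then apply LIMIT/OFFSET.
merged = sorted(articles + comments, reverse=True)
correct_page2 = [label for _, label in merged[lo:hi]]

# Broken: apply LIMIT/OFFSET to each subquery before the union.
trimmed = sorted(articles[lo:hi] + comments[lo:hi], reverse=True)
broken_page2 = [label for _, label in trimmed[:limit]]

print(correct_page2)  # ['A4', 'C123', 'A3', 'C122', 'C121']
print(broken_page2)   # ['A4', 'A3'] -- the comments never appear on any page
```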
Applying these clauses directly to the query returned by unionAll() is not possible AFAIK; it would require changes to the API that would make the compiler aware of where to put the SQL, be it via options, a new type of query object, or whatever.
Query::epilog() to the rescue
Luckily it's possible to append SQL to queries using Query::epilog(), be it raw SQL fragments:
$unionQuery->epilog('ORDER BY created DESC LIMIT 7 OFFSET 7');
or query expressions
$unionQuery->epilog(
    $connection->newQuery()->order(['created' => 'DESC'])->limit(7)->offset(7)
);
This should give you the desired query.
It should be noted that, according to the docs, Query::epilog() expects either a string or a concrete \Cake\Database\ExpressionInterface implementation in the form of a \Cake\Database\Expression\QueryExpression instance, not just any ExpressionInterface implementation. So, theoretically, the latter example is invalid, even though the query compiler works with any ExpressionInterface implementation.
Use a subquery
It's also possible to utilize the union query as a subquery. This would make things easier in the context of the pagination component, as you wouldn't have to take care of anything other than building and injecting the subquery; the paginator component can simply apply order/limit/offset on the main query.
/* @var $connection \Cake\Database\Connection */
$connection = $articles->connection();

$articlesQuery = $connection
    ->newQuery()
    ->select(['*'])
    ->from('articles');

$commentsQuery = $connection
    ->newQuery()
    ->select(['*'])
    ->from('comments');

$unionQuery = $articlesQuery->unionAll($commentsQuery);

$paginatableQuery = $articles
    ->find()
    ->from([$articles->alias() => $unionQuery]);
This could of course also be moved into a finder.

Joining with a tags table -- should I join in PHP or on the DB server?

Using the Toxi solution, how should I select the tags for a certain "Bookmark" (to keep the Delicious theme):
I can either:
1) Join in a single query:
SELECT bookmarks.title, bookmarks.url,
    (
        SELECT GROUP_CONCAT(tags.name)
        FROM taggings INNER JOIN tags
            ON taggings.tagId_fk = tags.tagId
        WHERE taggings.bookmarkId_fk = bookmarks.bookmarkId
    ) AS tagNames
FROM bookmarks
WHERE bookmarks.bookmarkId = 1;
^^ That gives:
title   | url            | tagNames
A bkmrk | http://url.com | tag1,tag2,tag3
2) Use two queries: one to retrieve the bookmark id's to display, then another to retrieve the tags for those bookmarks. The results can then be merged in PHP.
So really this question is: in general, efficiency/database-load-wise, is it better to do more joining in a single query, or to use multiple queries?
How do you make that kind of decision? Or do you simply not think about it until load causes a problem?
Server side is more efficient.
In both cases, the server must read all of the tags.
If you bring them to PHP, then they must all travel over the wire and PHP has to fiddle with them.
If you do them on the server, the finished answer (smaller) comes over the wire ready for PHP to pass it up to the UI.

Best way to store user-submitted item names (and their synonyms)

Consider an e-commerce application with multiple stores. Each store owner can edit the item catalog of his store.
My current database schema is as follows:
item_names: id | name | description | picture | common(BOOL)
items: id | item_name_id | picture | price | description
item_synonyms: id | item_name_id | name | error(BOOL)
Notes: error indicates a wrong spelling (e.g. "Ericson"). description and picture in the item_names table are "globals" that can optionally be overridden by the "local" description and picture fields of the items table (in case the store owner wants to supply a different picture for an item). common helps separate unique item names ("Jimmy Joe's Cheese Pizza") from common ones ("Cheese Pizza").
I think the bright side of this schema is:
Optimized searching & Handling Synonyms: I can query the item_names & item_synonyms tables using name LIKE %QUERY% and obtain the list of item_name_ids that need to be joined with the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")
Autocompletion: Again, a simple query to the item_names table. I can avoid the usage of DISTINCT, and it minimizes the number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson").
The down side would be:
Overhead: When inserting an item, I query item_names to see if this name already exists. If not, I create a new entry. When deleting an item, I count the number of entries with the same name. If this is the only item with that name, I delete the entry from the item_names table (just to keep things clean; accounts for possible erroneous submissions). And updating is the combination of both.
Weird Item Names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CDs + Magic Hat". There's something off about having so much overhead to accommodate cases like this. This would perhaps be the prime reason I'm tempted to go for a schema like this:
items: id | name | picture | price | description
(... with item_names and item_synonyms as utility tables that I could query)
Is there a better schema you would suggest?
Should item names be normalized for autocomplete? Is this probably what Facebook does for "School", "City" entries?
Is the first schema or the second better/optimal for search?
Thanks in advance!
References: (1) Is normalizing a person's name going too far?, (2) Avoiding DISTINCT
EDIT: In the event of 2 items being entered with similar names, an Admin who sees this simply clicks "Make Synonym" which will convert one of the names into the synonym of the other. I don't require a way to automatically detect if an entered name is the synonym of the other. I'm hoping the autocomplete will take care of 95% of such cases. As the table set increases in size, the need to "Make Synonym" will decrease. Hope that clears the confusion.
UPDATE: To those who would like to know what I went ahead with... I've gone with the second schema but removed the item_names and item_synonyms tables in hopes that Solr will provide me with the ability to perform all the remaining tasks I need:
items: id | name | picture | price | description
Thanks everyone for the help!
The requirements you state in your comment ("Optimized searching", "Handling Synonyms" and "Autocomplete") are not things that are generally associated with an RDBMS. It sounds like what you're trying to solve is a searching problem, not a data storage and normalization problem. You might want to start looking at some search architectures like Solr.
Excerpted from the solr feature list:
Faceted Searching based on unique field values, explicit queries, or date ranges
Spelling suggestions for user queries
More Like This suggestions for given document
Auto-suggest functionality
Performance Optimizations
If there were more attributes exposed for mapping, I would suggest using a fast search-index system. There is no need to set up aliases as the records are added; the attributes simply get indexed, and each search issued returns matches with a relevance score. Take the top X% as valid matches and display those.
Creating and storing aliases seems like a brute-force, labor intensive approach that probably won't be able to adjust to the needs of your users.
Just an idea.
One thing that comes to my mind is sorting the characters in the name and synonym, throwing away all whitespace. This is similar to the solution of finding all anagrams of a word. The end result is the ability to quickly find similar entries. As you pointed out, all synonyms should converge into one single term, or name. The search is performed against synonyms using, again, the sorted input string.
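A minimal sketch of that normalization in Python (the function name and sample strings are illustrative): sorting the characters after stripping whitespace and case means reordered or re-spaced variants collapse to the same key:

```python
def normalize(name):
    """Sort the characters of a name, ignoring case and whitespace,
    so anagram-like variants map to the same lookup key."""
    compact = "".join(name.lower().split())  # drop all whitespace
    return "".join(sorted(compact))

# Variants of the same product name all normalize to one key:
variants = ["Xperia X10", "X10 Xperia", "xperia x 10"]
keys = {normalize(v) for v in variants}
print(len(keys))  # 1 -- all three variants collapse to the same entry
```

Note the trade-off: genuinely different names that happen to be anagrams of each other collide too, so this works as a candidate-finding step rather than an exact match.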
