(n)-->(m)<--(n) Cypher query gives me a cartesian product

I have a Cypher query that goes something like this:
MATCH (t1:Team)-[:PLAYS_ON]->(m:Match)<-[:PLAYS_ON]-(t2:Team)
RETURN t1 AS `Team 1`, m AS `Match`, t2 AS `Team 2`
My goal is to have a query that lets me see the matches and which teams play each other, in a sporting context.
Assuming teams Team 1, Team 2, and Team 3, and the matches Team 1 vs Team 2 and Team 2 vs Team 3, my expected output is:
+------+-----+------+
|Team 1|match|Team 2|
|------+-----+------|
|Team 1|date |Team 2|
|Team 2|date |Team 3|
+------+-----+------+
But I get:
+------+-----+------+
|Team 1|match|Team 2|
|------+-----+------|
|Team 1|date |Team 2|
|Team 2|date |Team 1|
|Team 2|date |Team 3|
|Team 3|date |Team 2|
+------+-----+------+
I'm relatively new to Cypher/Neo4j, so I would not be surprised if it turns out I'm committing a very obvious and stupid mistake, but I don't have the brains to see it.
Thank you for your answers!

One way to do it is:
MATCH (t:Team)-[:PLAYS_ON]->(m:Match)
WITH collect(t) AS t, m
RETURN t[0] AS t1, m, t[1] AS t2
Which on this sample data:
MERGE (a:Team{name: 'Team1'})
MERGE (b:Team{name: 'Team2'})
MERGE (c:Team{name: 'Team3'})
MERGE (d:Match{Date: '2022-05-11'})
MERGE (e:Match{Date: '2022-05-12'})
MERGE (a)-[:PLAYS_ON]->(d)
MERGE (b)-[:PLAYS_ON]->(d)
MERGE (b)-[:PLAYS_ON]->(e)
MERGE (c)-[:PLAYS_ON]->(e)
Will give you this:
╒════════════════╤═════════════════════╤════════════════╕
│"t1" │"m" │"t2" │
╞════════════════╪═════════════════════╪════════════════╡
│{"name":"Team1"}│{"Date":"2022-05-11"}│{"name":"Team2"}│
├────────────────┼─────────────────────┼────────────────┤
│{"name":"Team2"}│{"Date":"2022-05-12"}│{"name":"Team3"}│
└────────────────┴─────────────────────┴────────────────┘
In order to understand this solution, you can read about the concept of cardinality.
Basically, since the first MATCH yields two rows per (:Match) node (two teams play in each match), the query returns two rows per match. Since you want only one row per match, you can use collect to group these two rows into one.
In other words, your query returns every way the pattern can be bound, which is two per match. The query here gets all rows per match and then "groups by" match, creating one list of teams per match.
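If you want to see this cardinality for yourself, you can count the rows the bare MATCH produces per match on the sample data above; a minimal sketch:
MATCH (t:Team)-[:PLAYS_ON]->(m:Match)
RETURN m.Date AS matchDate, count(t) AS rowsPerMatch
// rowsPerMatch is 2 for every match: the pattern binds once per team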

An alternative way to achieve what you want is this query (using Nimrod Serok's graph):
MATCH (t1:Team)-[:PLAYS_ON]->(m:Match)<-[:PLAYS_ON]-(t2:Team)
WHERE id(t1) < id(t2)
RETURN t1, m, t2
This works because the WHERE clause keeps only one of the two directions in which the symmetric pattern matches each pair of teams; that double match (not a true cartesian product) is what duplicated your rows.
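If you are on Neo4j 5, note that id() is deprecated in favor of elementId(); it returns a string rather than an integer, but the comparison trick works the same way, since all we need is a consistent ordering:
MATCH (t1:Team)-[:PLAYS_ON]->(m:Match)<-[:PLAYS_ON]-(t2:Team)
WHERE elementId(t1) < elementId(t2)
RETURN t1, m, t2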

Related

Work around the SQLite parameter limit in a select query

I have a GUI application with a list of people which contains the person's database id and their attributes. Something like this:
+----+------+
| ID | Name |
+----+------+
| 1 | John |
| 2 | Fred |
| 3 | Mary |
[...]
This list can be filtered, so the number and set of people shown varies from time to time. To get a list of Peewee Person objects, I first get the list of visible IDs and use the following query:
ids = [row[0] for row in store]
Person.select().where(Person.id.in_(ids))
Which in turn translates to the following SQL:
('SELECT "t1"."id", "t1"."name" FROM "person" AS "t1" WHERE ("t1"."id" IN (?, ?, ?, ...))', [1, 2, 3, ...])
This throws an OperationalError: too many SQL variables error on Windows with more than 1000 people. This is documented in the Peewee and SQLite docs. Workarounds given online usually relate to bulk inserts and ways to split the action in chunks. Is there any way to work around this limitation with the mentioned SELECT ... WHERE ... IN query?
Getting the separate objects in a list comprehension is too slow:
people = [Person.get_by_id(row[0]) for row in store]
Maybe split the list of IDs into chunks of at most 1000 items, run the select query on each chunk, and then combine the results somehow?
Where are the IDs coming from? The best answer is to avoid using that many parameters, of course. For example, if your list of IDs could be represented as a query of some sort, then you can just write a subquery, e.g.
my_friends = (Relationship
              .select(Relationship.to_user)
              .where(Relationship.from_user == me))
tweets_by_friends = Tweet.select().where(Tweet.user.in_(my_friends))
In the above, we could get all the user IDs from the first query and pass them en masse as a list into the second query. But since the first query ("all my friends") is itself a query, we can just compose them. You could also use a JOIN instead of a subquery, but hopefully you get the point.
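For illustration, the JOIN variant might look roughly like this (same hypothetical models as above):
tweets_by_friends = (Tweet
                     .select()
                     .join(Relationship, on=(Tweet.user == Relationship.to_user))
                     .where(Relationship.from_user == me))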
If this is not possible and you seriously have a list of >1000 IDs...how is such a list useful in a GUI application? Over 1000 anything is quite a lot of things.
To try and answer the question you asked -- you'll have to chunk them up. Which is fine. Just:
user_ids = list_of_user_ids
accum = []
# Fetch 100 at a time to stay well under the SQLite variable limit.
for i in range(0, len(user_ids), 100):
    query = User.select().where(User.id.in_(user_ids[i:i + 100]))
    accum.extend(query)
return accum
But seriously, I think there's a problem with the way you're implementing this that makes it even necessary to filter on so many ids.

Solr max results for particular field type

I am using Solr 4.10.3.
I have various fields in the schema, e.g. id, title, content, type, etc. Many documents share the same type value.
id | title | content | type
1 | pro | My | abc
2 | ver | name | ht
3 | art | is | abc
and so on.
When I query Solr, I want 10 results in total (the default), but at most two of them with type:abc. The remaining 8 results can be of any type except abc, and more than one result of the same type is fine.
Is there any possible solution?
Make two queries: one with rows=2 and type:abc, and a second with rows=8 and -type:abc; the rest of the query can be identical. Then combine the results before you show them to users.
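A minimal client-side sketch of that idea, assuming a hypothetical Solr URL and collection name, and using fq for the type restriction:

import requests

BASE = 'http://localhost:8983/solr/mycollection/select'  # hypothetical endpoint

def search(user_query):
    common = {'q': user_query, 'wt': 'json'}
    # At most two docs of type abc...
    abc = requests.get(BASE, params=dict(common, fq='type:abc', rows=2)).json()
    # ...and up to eight docs of any other type.
    rest = requests.get(BASE, params=dict(common, fq='-type:abc', rows=8)).json()
    # Combine before display; how you interleave them is up to you.
    return abc['response']['docs'] + rest['response']['docs']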
EDIT: After some research on what comes next in Solr features, I believe that combining the results will become possible once the streaming expressions are part of Solr (maybe in 5.2). See https://issues.apache.org/jira/browse/SOLR-7377
Starting with Solr 5.2 you can accomplish this using the Streaming Expressions (and the Streaming API under the covers). The expression would be
top(n=10,
    merge(
        top(n=2,
            search(fl="id, title, content, type", q="type:abc", sort="title asc")
        ),
        search(fl="id, title, content, type", q="-type:abc", sort="title asc"),
        on="title asc"
    )
)
What's going on here is that first you find at most two documents of type "abc", then you find all documents that are not of type "abc". Notice that both streams are sorted by "title asc" (it's important that they be sorted by the same field(s)). The merge then uses the title field, in ascending order, to decide which document comes next in the combined stream, picking documents one at a time from whichever stream has the next title. The outer top ensures that only 10 documents are returned, and because the "type:abc" stream was limited to at most 2 documents, the final merged stream will contain at most 2 "type:abc" documents.
Obviously you can change the sort criteria to whatever is best for you but it is critical with a merge that the incoming streams are sorted in the same order. And in fact, the Streaming API will validate that for you and reject the expression if that is not the case.
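For completeness, streaming expressions are submitted to the collection's /stream handler as the expr parameter; a sketch with a hypothetical URL and collection name:

import requests

expr = '...'  # the top/merge expression above
resp = requests.post('http://localhost:8983/solr/mycollection/stream',
                     data={'expr': expr})
print(resp.json())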

Relevance and Solr Grouping

Say I have the following collection of webpages in a Solr index:
+-----+----------+----------------+--------------+
| ID | Domain | Path | Content |
+-----+----------+----------------+--------------+
| 1 | 1.com | /hello1.html | Hello dude |
| 2 | 1.com | /hello2.html | Hello man |
| 3 | 1.com | /hello3.html | Hello fella |
| 4 | 2.com | /hello1.html | Hello sir |
...
And I want a query for hello to show results grouped by domain like:
Results from 1.com:
/hello1.html
/hello2.html
/hello3.html
Results from 2.com:
/hello1.html
How is ordering determined if I sort by score? I use a combination of TF/IDF and PageRank for my results normally, but since that calculates scores for each individual item, how does it determine how to order the groups? What if 1.com/hello3.html and 1.com/hello2.html have very low relevance but two results, while 2.com/hello1.html has really high relevance and only one result? Or vice versa? Or is relevance summed when there are multiple items in a grouping field?
I've looked around, but haven't been able to find a good answer to this.
Thanks.
It sounds to me like you are using Result Grouping. If that's the case, then the groups are sorted according to the sort parameter, and the records within each group are sorted according to the group.sort parameter. If you sort the groups by sort=score desc (this is the default, so you wouldn't actually need to specify it), then it sorts the groups according to the score of each group. How this score is determined isn't made very clear, but if you look through the examples in the linked documentation you can see this statement:
The groups are sorted by the score of the top document within each group.
So, in your example, if 2.com's hello1.html was the most relevant document in your result set, "Results from 2.com" would be your most relevant group even though "Results from 1.com" includes three times the document count.
If this isn't what you want, your best options are to provide a different sort parameter or to post-process the results. For example, on one project I was involved in (where we had a very modest number of groups), we chose to pull the top three results for each group, and in post-processing we calculated our own sort order for the groups based on a combination of their scores and numFound values. This sort of strategy might be prohibitive with too many groups, and may not be a good idea if the more numerous groups risk making the most relevant documents harder to find.
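To illustrate that post-processing step, here is a rough Python sketch; the 0.1 weight is made up purely for illustration (ours was tuned to the project), and the response shape assumes Solr's grouped JSON output:

# groups: the 'groups' list from Solr's grouped response, where each entry
# looks like {'groupValue': '1.com', 'doclist': {'numFound': 3, 'maxScore': 1.2, 'docs': [...]}}
def group_sort_key(group):
    doclist = group['doclist']
    # Blend the best score in the group with how many documents it holds.
    return doclist['maxScore'] + 0.1 * doclist['numFound']

groups.sort(key=group_sort_key, reverse=True)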

Retrieving data from 2 tables that have a 1 to many relationship - more efficient with 1 query or 2?

I need to selectively retrieve data from two tables that have a 1 to many relationship. A simplified example follows.
Table A is a list of events:
Id | TimeStamp | EventTypeId
--------------------------------
1 | 10:26... | 12
2 | 11:31... | 13
3 | 14:56... | 12
Table B is a list of properties for the events. Different event types have different numbers of properties. Some event types have no properties at all:
EventId | Property | Value
------------------------------
1 | 1 | dog
1 | 2 | cat
3 | 1 | mazda
3 | 2 | honda
3 | 3 | toyota
There are a number of conditions that I will apply when I retrieve the data; however, they all revolve around table A. For instance, I may want only events on a certain day, or only events of a certain type.
I believe I have two options for retrieving the data:
Option 1
Perform two queries: first query table A (with a WHERE clause) and store the data somewhere, then query table B (joining on table A in order to use the same WHERE clause) and "fill in the blanks" in the data retrieved from table A.
This option requires SQL Server to perform two searches of table A, but the resulting two data sets contain no duplicated data.
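A sketch of Option 1 (table and column names follow the simplified example; the WHERE clause stands in for whatever conditions apply):
-- Query 1: the events themselves
SELECT Id, TimeStamp, EventTypeId
FROM TableA
WHERE EventTypeId = 12;
-- Query 2: the properties of exactly those events
SELECT b.EventId, b.Property, b.Value
FROM TableB b
JOIN TableA a ON a.Id = b.EventId
WHERE a.EventTypeId = 12;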
Option 2
Perform a single query, joining table A to table B with a LEFT JOIN.
This option requires only one search of table A, but the resulting data set will contain many duplicated values.
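And a sketch of Option 2 as a single query:
SELECT a.Id, a.TimeStamp, a.EventTypeId, b.Property, b.Value
FROM TableA a
LEFT JOIN TableB b ON b.EventId = a.Id
WHERE a.EventTypeId = 12;
-- The event columns repeat once per property row; events with no
-- properties still appear, with NULL in Property and Value.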
Conclusion
Is there a "correct" way to do this or do I need to try both ways and see which one is quicker?
For example, a JOIN query like this:
SELECT E.Id, E.Name FROM Employee E JOIN Dept D ON E.DeptId = D.Id
and a subquery something like this:
SELECT E.Id, E.Name FROM Employee E WHERE E.DeptId IN (SELECT Id FROM Dept)
Considering performance, which of the two queries would be faster, and why?
I would expect the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by OR (WHERE x = Y OR x = Z OR ...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
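On SQL Server that looks something like this (the DBCC commands flush the caches, so only run them on a test box):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DBCC FREEPROCCACHE;     -- clear cached query plans
DBCC DROPCLEANBUFFERS;  -- clear the buffer cache

SELECT E.Id, E.Name FROM Employee E JOIN Dept D ON E.DeptId = D.Id;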

Ranking of Full Text Search (SQL Server)

For the last couple of hours I have been messing with all sorts of variations of SQL Server full-text search. However, I am still unable to figure out how the ranking works. I have come across a couple of examples that really confuse me as to how they rank higher than others. For example:
I have a table with 5 indexed columns, plus more that are not indexed; all are nvarchar fields.
I am running this query (well, almost; I retyped it with different names):
SET @SearchString = REPLACE(@Name, ' ', '*" OR "') -- split words with an OR between them
SET @SearchString = '"' + @SearchString + '*"'
PRINT @SearchString;

SELECT ms.ID, ms.LastName, ms.DateOfBirth, ms.Aka, KEY_TBL.RANK, ms.MiddleName, ms.FirstName
FROM View_MemberSearch AS ms
INNER JOIN CONTAINSTABLE(View_MemberSearch, (LastName, FirstName, MiddleName, Aka, DateOfBirth), @SearchString) AS KEY_TBL
    ON ms.ID = KEY_TBL.[KEY]
WHERE KEY_TBL.RANK > 0
ORDER BY KEY_TBL.RANK DESC;
Thus if I search for 11/05/1964 JOHN JACKSON, the search string becomes "11/05/1964*" OR "JOHN*" OR "JACKSON*" and I get these results:
ID | First Name | Middle Name | Last Name | AKA  | Date of Birth | SQL Server RANK
---+------------+-------------+-----------+------+---------------+----------------
 1 | DAVE       | JOHN        | MATHIS    | NULL | 11/23/1965    | 192
 2 | MARK       | JACKSON     | GREEN     | NULL | 05/29/1998    | 192
 3 | JOHN       | NULL        | JACKSON   | NULL | 11/05/1964    | 176
 4 | JOE        | NULL        | JACKSON   | NULL | 10/04/1994    | 176
So finally my question: I don't see how rows 1 and 2 are ranked above row 3, or why row 3 is ranked the same as row 4. Row 3 should have the highest rank by far, seeing as the search string matches the first name and the last name as well as the date of birth.
If I change the OR to AND I don't get any results.
I've found AND and OR clauses don't apply across columns. Create an indexed view that merges the columns and you'll get better results. Look at my past questions and you'll find information that suits your scenario.
I also have found I'm better off not appending a '*'. I thought it'd turn up more matches, but it tended to return worse results (particularly for short words). As a middle ground you might append a * only to longer words.
The example case you give is definitely weird.
It's not entirely equivalent, but perhaps this question I asked (How-to: Ranking Search Results) could be of assistance?
What happens if you remove the DoB criteria?
MS full-text search is really a black box that's hard to understand and customize.
You pretty much take it as is; Lucene, by contrast, is great for customization.
Thank you, guys.
Frank, you were correct that AND and OR do not apply across columns; this was something I did not notice at first.
To get the best results I had to merge all 5 columns into 1 column in a view, then search on that single column (a sketch of this setup follows below). Doing so gave me exactly the results I wanted, without any extras.
My actual search string, after conversion, ended up being "Word1*" AND "Word2*".
Using the % sign still did not do what MSDN said it should: if I searched for the word josh and it got changed to "Josh%", then "Joshua" would not be found. Oddly enough, with "Josh*", Joshua would be found.
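For anyone who wants to reproduce the merged-column setup, here is a rough sketch; the base table name and the ISNULL handling are my assumptions, and full-text indexing a view requires SCHEMABINDING plus a unique clustered index (and an existing full-text catalog):
CREATE VIEW dbo.View_MemberSearch WITH SCHEMABINDING AS
SELECT ID,
       ISNULL(FirstName, '') + ' ' + ISNULL(MiddleName, '') + ' ' +
       ISNULL(LastName, '')  + ' ' + ISNULL(Aka, '') + ' ' +
       ISNULL(DateOfBirth, '') AS SearchText
FROM dbo.Member;  -- hypothetical base table
GO
CREATE UNIQUE CLUSTERED INDEX IX_MemberSearch ON dbo.View_MemberSearch (ID);
GO
CREATE FULLTEXT INDEX ON dbo.View_MemberSearch (SearchText)
    KEY INDEX IX_MemberSearch;
GO
-- Then search the single merged column:
-- CONTAINSTABLE(View_MemberSearch, SearchText, '"Word1*" AND "Word2*"')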
