PostgreSQL: How to implement GIN? - c

I want to bring my text hash function into GIN indexer.
See the Extensibility below:
http://www.postgresql.org/docs/9.0/static/gin-extensibility.html
I can understand about compare.
int compare(Datum a, Datum b)
However how about extractValue, extractQuery and consistent.
Datum *extractValue(Datum inputValue, int32 *nkeys)
Datum *extractQuery(Datum query, int32 *nkeys, StrategyNumber n, bool **pmatch, Pointer **extra_data)
bool consistent(bool check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], bool *recheck)
The manual doesn't help me to implement them.
I want to know how to implement them. In detail:
What's passed to inputValue of extractValue?
What's returned by extractValue?
What's passed to query of extractQuery?
What's returned by extractQuery?
What's passed to query of consistent?
What's passed to check of consistent?
The index storage (hashed key) will be int4. The input type is text.

Did you read all the documentation? Do you know what an inverted index does? I can't answer your question fully because you haven't specified what your queries will look like. But here is an attempt (based on the information at http://www.postgresql.org/docs/9.2/static/gin-extensibility.html and http://www.sai.msu.su/~megera/wiki/Gin). Also, look at the tsearch example.
The input to compare is two key values, so two integers.
The input to extractValue is your input type, text. The output is an array of keys: in your case apparently an array of integers.
extractQuery gets as input your query type, which might be a string (you don't specify), and returns a list of interesting keys you want the system to find matches for. It can also return extra information for the consistent method.
After the system has found values that are interesting according to what you returned from extractQuery, the consistent method returns whether the value actually matched the query.
Since you haven't specified your query type I'll give an example with fulltext search.
There, for example, the query type is a string like 'foo and bar'. extractQuery would return the keys 'foo' and 'bar' plus some data so the consistent function knows that both terms must be present.
But really, this is all described on the above pages.
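To make the division of labour concrete, here is a toy model in Python rather than the real C interface. It is only a sketch under assumptions: whitespace tokenization, a CRC32 hash truncated to a signed 32-bit value standing in for the int4 key, a query syntax of the form 'foo and bar', and the helper names hash_key, build_index and search are all made up for illustration and are not part of PostgreSQL.

```python
import zlib


def hash_key(token):
    """Hash a token to a signed 32-bit integer, standing in for an int4 key."""
    h = zlib.crc32(token.encode("utf-8"))
    return h - (1 << 32) if h >= (1 << 31) else h


def compare(a, b):
    """Three-way comparison of two keys (two integers)."""
    return (a > b) - (a < b)


def extract_value(text):
    """Indexed value (text) -> list of distinct keys stored in the index."""
    return sorted({hash_key(tok) for tok in text.lower().split()})


def extract_query(query):
    """Query -> keys to look up, plus extra data for consistent()."""
    terms = [t for t in query.lower().split() if t != "and"]
    return [hash_key(t) for t in terms], {"require_all": True}


def consistent(check, extra_data):
    """check[i] says whether query key i was found for the candidate row.

    Returns (matches, recheck). recheck is True because different tokens
    can hash to the same key, so the original text must be verified.
    """
    matches = all(check) if extra_data["require_all"] else any(check)
    return matches, True


def build_index(rows):
    """Inverted index: key -> set of row ids containing that key."""
    index = {}
    for row_id, text in rows.items():
        for key in extract_value(text):
            index.setdefault(key, set()).add(row_id)
    return index


def search(index, rows, query):
    """Look up the query keys, then let consistent() decide per candidate row."""
    keys, extra = extract_query(query)
    candidates = set().union(*(index.get(k, set()) for k in keys))
    hits = []
    for row_id in sorted(candidates):
        check = [row_id in index.get(k, set()) for k in keys]
        matches, recheck = consistent(check, extra)
        if matches and recheck:
            # Recheck against the stored text to rule out hash collisions.
            words = rows[row_id].lower().split()
            matches = all(t in words for t in query.lower().split() if t != "and")
        if matches:
            hits.append(row_id)
    return hits
```

With rows such as {1: "foo bar baz", 2: "foo qux"}, search(build_index(rows), rows, "foo and bar") only keeps row 1, because consistent() sees a False entry in check for row 2.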

Related

NOT IN within a cypher query

I'm attempting to find all values that match any item within a list of values within cypher. Similar to a SQL query with in and not in. I also want to find all values that are not in the list in a different query. The idea is I want to assign a property to each node that is binary and indicates whether the name of the node is within the predefined list.
I've tried the following code blocks:
MATCH (temp:APP) - [] -> (temp2:EMAIL_DOMAIN)
WHERE NOT temp2.Name IN ['GMAIL.COM', 'YAHOO.COM', 'OUTLOOK.COM', 'ICLOUD.COM', 'LIVE.COM']
RETURN temp
This block returns nothing, but should return a rather large amount of data.
MATCH (temp:APP) - [] -> (temp2:EMAIL_DOMAIN)
WHERE temp2.Name NOT IN ['GMAIL.COM', 'YAHOO.COM', 'OUTLOOK.COM', 'ICLOUD.COM', 'LIVE.COM']
RETURN temp
This code block returns an error in relation to the NOT's position. Does anyone know the correct syntax for this statement? I've looked around online and in the neo4j documentation, but there are a lot of conflicting ideas with version changes. Thanks in advance!
Neo4j is case sensitive so you need to check the data to ensure that the EMAIL_DOMAIN.Name is all upper case. If Name is mixed case, you can convert the name using toUpper(). If Name is all lower case, then you need to convert the values in your query.
MATCH (temp:APP) - [] -> (temp2:EMAIL_DOMAIN)
WHERE NOT toUpper(temp2.Name) IN ['GMAIL.COM', 'YAHOO.COM', 'OUTLOOK.COM', 'ICLOUD.COM', 'LIVE.COM']
RETURN temp

How to search in array with LIKE operator

id | name | ipAddress
----+----------+-------------------------
1 | testname | {192.168.1.60,192.168.1.65}
I want to search ipAddress with LIKE. I tried:
{'$mac_ip_addresses.ip_address$': { [OP.contains]: [searchItem]}},
This one also:
{'$mac_ip_addresses.ip_address$': { [OP.Like] : { [OP.any]: [searchItem]}}},
The data type of ipAddress is text[]. I want to search in ipAddress with LIKE.
searchItem contains the IP that needs to be searched for in the ipAddress field, so I want to search in the array with LIKE.
I don't know Sequelize but I can answer from postgres side.
There is no short syntax to search for a pattern inside array in PostgreSQL.
If you want to check pattern for each array element individually, then you need to unfold the array using unnest:
SELECT id, name, ipaddress
FROM testing
WHERE EXISTS (
SELECT 1 FROM unnest(ipaddress) AS ip
WHERE ip LIKE '8.8.8.%'
);
If the array is frequently searched this way, it's better to store the data in normalized form.
However, there is a short syntax (plus GIN index support) for equality-based search (see @> and other operators here).
SELECT id, name, ipaddress
FROM testing
WHERE ipaddress @> ARRAY['8.8.8.8'];
What you asked
~~ is the operator used internally to implement SQL LIKE. There is no commutator for it, that is, no operator that works with left and right operands switched.
That's the one you'd need for your attempt to use the ANY construct with the pattern to the left.
You can create the operator, though, and it's pretty simple:
CREATE OR REPLACE FUNCTION reverse_like (text, text)
RETURNS boolean LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT $2 LIKE $1';
CREATE OPERATOR <~~ (function = reverse_like, leftarg = text, rightarg = text);
Inspired by Jeff Janes' idea here:
Match string pattern to any array element
Then your query can have the pattern to the left of the operator:
SELECT *
FROM mac_ip_addresses
WHERE '192.168.2%.255' <~~ ANY (ipaddress);
Simple, but considerably slower than the EXISTS expression demonstrated by filiprem.
Then again, either query is excruciatingly slow for big tables, since neither can use an index. A normalized DB design with a n:1 table holding one IP each would allow that. It would also occupy several times the space on disk. Still, the much cleaner implementation ...
While stuck with your current design, there is still a way: create a trigram GIN index on a text representation of the array and add a redundant, "sargable" predicate to the query additionally. Confused? Here's the recipe:
First, trigram indexes? Read this if you are not familiar:
PostgreSQL LIKE query performance variations
Neither the cast from text[] to text nor array_to_string() is immutable, but an expression index requires immutable functions. Long story short, fake it with an immutable wrapper function:
CREATE OR REPLACE FUNCTION f_textarr2text(text[])
RETURNS text LANGUAGE sql IMMUTABLE AS $$SELECT array_to_string($1, ',')$$;
CREATE INDEX iparr_trigram_idx ON iparr
USING gin (f_textarr2text(iparr) gin_trgm_ops);
Related answer with the long story (and why it's safe):
Indexing an array for full text search
Then your query can be:
SELECT *
FROM mac_ip_addresses
WHERE '192.168.9%.255' <~~ ANY (ipaddress)
AND f_textarr2text(ipaddress) LIKE '%192.168.9%.255%'; -- logically redundant
The added predicate is logically redundant, but can tap into the power of the trigram index. The pattern is wrapped in % because a matching element can sit anywhere in the joined string.
Much faster for big tables. Still a bit faster, yet:
SELECT *
FROM mac_ip_addresses
WHERE EXISTS (SELECT FROM unnest(ipaddress) ip WHERE ip LIKE '192.168.9%.255')
AND f_textarr2text(ipaddress) LIKE '%192.168.9%.255%';
But that's minor now.
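Why the extra predicate is redundant, yet not sufficient on its own, can be checked outside the database. The following is a rough Python sketch, not PostgreSQL code: like() is a simplified stand-in for SQL LIKE (only % and _ are handled, no escape character), and the sample arrays are invented. It assumes the joined-string pattern is wrapped in % so it can match anywhere in the comma-separated text.

```python
import re


def like(value, pattern):
    """Simplified SQL LIKE: % matches any run of characters, _ exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, re.DOTALL) is not None


def any_element_like(arr, pattern):
    """The exact predicate: at least one array element matches the pattern."""
    return any(like(elem, pattern) for elem in arr)


def joined_like(arr, pattern):
    """The index-friendly predicate on the comma-joined text of the array."""
    return like(",".join(arr), "%" + pattern + "%")


pattern = "192.168.9%.255"
rows = [
    ["192.168.1.60", "192.168.95.255"],  # one element matches
    ["10.0.0.1"],                        # nothing matches
    ["192.168.9.1", "10.0.0.255"],       # no element matches, joined text does
]
exact = [any_element_like(r, pattern) for r in rows]
prefilter = [joined_like(r, pattern) for r in rows]
```

Every row that passes the exact predicate also passes the joined-string predicate, so adding it never drops a result. The third row shows the reverse does not hold, which is why the exact predicate has to stay in the query.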
db<>fiddle here
I addressed the question asked, as I took an interest. Might be of interest to the general public. Most probably not what you need, though.
What you need
I want to search in ipAddress with LIKE. searchItem contains the IP that need to be searched in the ipAddress field so I want to search in array with LIKE.
That should probably read:
"I want to search a given IP address (searchItem) in the array ipAddress. My first idea is to use LIKE ..."
Well, LIKE is for pattern matching. To find complete IP addresses in an array, it's the wrong tool. filiprem's second query with array operators is the way to go. Probably good enough.
Using the built-in data type cidr instead of text would be better. And the ip4 data type of the additional ip4r module would be much better, yet. All in combination with standard array operators as demonstrated.
Finally, converting IPv4 addresses to integer and using that with the additional intarray module should be stellar, as far as performance is concerned.
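As a rough illustration of the integer conversion (done here on the application side in Python, not in the database): an IPv4 address is an unsigned 32-bit number, while intarray works on arrays of signed 4-byte integers. The fixed offset used below to move addresses into the signed range is my own assumption for this sketch, as are the function names.

```python
import ipaddress

# Assumption: shift unsigned 32-bit addresses into the signed int4 range.
INT4_OFFSET = 1 << 31


def ip_to_int4(ip):
    """Dotted-quad IPv4 string -> signed 32-bit integer."""
    return int(ipaddress.IPv4Address(ip)) - INT4_OFFSET


def int4_to_ip(value):
    """Signed 32-bit integer -> dotted-quad IPv4 string."""
    return str(ipaddress.IPv4Address(value + INT4_OFFSET))
```

The mapping preserves ordering and round-trips, so equality and containment checks on the integer array give the same answers as on the original addresses.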

Neo4j output format

After working with neo4j and now coming to the point of considering making my own entity manager (object manager) to work with the fetched data in the application, I wonder about neo4j's output format.
When I run a query it's always returned as tabular data. Why is this?
Sure, tables keep a big place in data and processing, but it seems so strange that a graph database can only output in this format.
Now when I want to create an object graph in my application I would have to hydrate all the objects, and this is not really good for performance and doesn't leverage true graph performance.
Consider MATCH (A)-->(B) RETURN A, B when there is one A and three B's, it would return:
A B
1 1
1 2
1 3
That's the same A passed down 3 times over the database connection, while I only need it once and I know this before the data is fetched.
Something like this seems great http://nigelsmall.com/geoff
A load2neo is nice, a load-from-neo would also be nice! Either in the geoff format or any other format out there https://gephi.org/users/supported-graph-formats/
Each language could then implement its own functions to create the objects directly.
To clarify:
Relations between nodes are lost in tabular data
Redundant (non-optimal) format for graphs
Edges (relations) and vertices (nodes) are usually not in the same table. (makes queries more complex?)
Another consideration (which might deserve its own post): what's a good way to model relations in an object graph? As objects? Or as data/methods inside the node objects?
@Kikohs
Q: What do you mean by "Each language could then implement its own functions to create the objects directly."?
A: With a (partial) graph provided by the database (as the result of a query), a language such as PHP could provide a factory method (preferably in C) to construct the object graph (this is usually an expensive operation). But only if the object graph is well defined in a standard format (because this function should be simple and universal).
Q: Do you want to export the full graph or just the result of a query?
A: The result of a query. However a query like MATCH (n) OPTIONAL MATCH (n)-[r]-() RETURN n, r should return the full graph.
Q: you want to dump to the disk the subgraph created from the result of a query ?
A: No, existing interfaces like REST are preferred to get the query result.
Q: do you want to create the subgraph which comes from a query in memory and then request it in another language ?
A: No, I want the result of the query in a format other than tabular (examples mentioned).
Q: You make a query which only returns the name of a node, in this case, would you like to get the full node associated or just the name ? Same for the edges.
A: Nodes don't have names. They have properties, labels and relations. I would like enough information to retrieve A) the node ID, its labels, its properties and B) the relations to other nodes which are in the same result.
Note that the first part of the question is not a concrete "how-to" question, rather "why is this not possible?" (or if it is, I'd like to be proven wrong on this one). The second is a real "how-to" question, namely "how to model relations". The two questions have in common that they both try to find the answer to "how to get graph data efficiently in PHP."
@Michael Hunger
You have a point when you say that not all result data can be expressed as an object graph. It is reasonable to say that an alternative output format would only complement the table format, not replace it.
I understand from your answer that the natural (rawish) output format from the database is the result format with duplicates in it ("streams the data out as it comes"). In that case I understand that it's now left to another program (of the dev stack) to do the mapping. So my conclusion on neo4j implementing something like this:
Pros - not having to do this in every implementation language (of the application)
Cons - 1) no application-specific mapping is possible, 2) no performance gain if the implementation language is fast
"Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results."
I don't understand this point entirely. Are you saying that these formats are not able to hold deduplicated results (in certain cases)? So in fact there is no possible textual format with which a graph can be described without duplication?
"There is also the questions on what you want to include in your output?"
I was under the assumption that the cypher language was powerful enough to specify this in the query. And so the output format would have whatever the database can provide as result.
"You could just return the paths that you get, which are unique paths through the graph in themselves".
Useful suggestion, i'll play around with this idea :)
"The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it".
Does the enriching process fetch additional data from the database or is the data already contained in the initial result?
There is more to it.
First of all as you said tabular results from queries are really commonplace and needed to integrate with other systems and databases.
Secondly oftentimes you don't actually return raw graph data from your queries, but aggregated, projected, sliced, extracted information out of your graph. So the relationships to the original graph data are already lost in most of the results of queries I see being used.
The only time that people need / use the raw graph data is when exporting subgraph data from the database as a query result.
The problem of doing that as a de-duplicated graph is that the db has to fetch all the result data into memory first to deduplicate, extract the needed relationships etc.
Normally it just streams the data out as it comes and uses little memory with that.
Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results (which are returned as paths with potential duplicate nodes and relationships).
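The memory cost can be seen in a small client-side sketch. This is Python with an invented row shape (each row is a (start node, relationship type, end node) triple, and nodes are plain dicts with an "id" key), not the Neo4j API. The point is only that the nodes and edges collections must stay in memory for the whole result before anything deduplicated can be emitted.

```python
def deduplicate(rows):
    """Collapse (start, rel_type, end) rows into unique nodes and edges.

    Both collections grow with the size of the full result, which is the
    memory cost that plain row-by-row streaming avoids.
    """
    nodes = {}     # node id -> node properties, kept once per node
    edges = set()  # (start id, relationship type, end id)
    for start, rel_type, end in rows:
        nodes.setdefault(start["id"], start)
        nodes.setdefault(end["id"], end)
        edges.add((start["id"], rel_type, end["id"]))
    return nodes, edges


# One A connected to three B's arrives as three rows, each repeating A.
a = {"id": 1, "label": "A"}
rows = [(a, "LINKS", {"id": n, "label": "B"}) for n in (2, 3, 4)]
nodes, edges = deduplicate(rows)
```

Here the three rows collapse to four nodes and three edges, with A stored once.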
There is also the question of what you want to include in your output. Just the nodes and rels returned? Or additionally all the other rels between the nodes that you return? Or all the rels of the returned nodes (but then you also have to include the end-nodes of those relationships)?
You could just return the paths that you get, which are unique paths through the graph in themselves:
MATCH p = (n)-[r]-(m)
WHERE ...
RETURN p
Another way to address this problem in Neo4j is to use sensible aggregations.
E.g. what you can do is to use collect to aggregate data per node (i.e. kind of subgraphs)
MATCH (n)-[r]-(m)
WHERE ...
RETURN n, collect([r,type(r),m])
or use the new literal map syntax (Neo4j 2.0)
MATCH (n)-[r]-(m)
WHERE ...
RETURN {node: n, neighbours: collect({ rel: r, type: type(r), node: m})}
The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it and then outputting it as cypher create statement(s).
A similar approach can be used for other output formats too if you need it. But so far there hasn't been the need.
If you really need this functionality it makes sense to write a server-extension that uses cypher for query specification, but doesn't allow return statements. Instead you would always use RETURN *, aggregate the data into an in-memory structure (SubGraph in the org.neo4j.cypher packages). And then render it as a suitable format (e.g. JSON or one of those listed above).
These could be starting points for that:
https://github.com/jexp/cypher-rs
https://github.com/jexp/cypher_websocket_endpoint
https://github.com/neo4j-contrib/rabbithole/blob/master/src/main/java/org/neo4j/community/console/SubGraph.java#L123
There are also other efforts, like GraphJSON from GraphAlchemist: https://github.com/GraphAlchemist/GraphJSON
And the d3 json format is also pretty useful. We use it in the neo4j console (console.neo4j.org) to return the graph visualization data that is then consumed by d3 directly.
I've been working with neo4j for a while now and I can tell you that if you are concerned about memory and performance you should drop Cypher altogether, and use indexes and the other graph-traversal methods instead (e.g. retrieve all the relationships of a certain type from or to a start node, and then iterate over the found nodes).
As the documentation says, Cypher is not intended for in-app usage, but more as an administration tool. Furthermore, in production-scale environments, it is VERY easy to crash the server by running the wrong query.
Secondly, there is no mention in the docs of an API method to retrieve the output as a graph-like structure. You will have to process the output of the query and build it.
That said, in the example you give you say that there is only one A and that you know it before the data is fetched, so you don't need to do:
MATCH (A)-->(B) RETURN A, B
but just
MATCH (A)-->(B) RETURN B
(you don't need to receive A three times because you already know these are the nodes connected with A)
or better (if you need info about the relationships) something like
MATCH (A)-[r]->(B) RETURN r

Retrieve last row from the Google DataStore Java

I want to retrieve the last row from the data store. How can I do that?
I know the long method, i.e.
int cnt = 0;
for (Table_name e : results)
{
cnt++;
}
// the last element is at index cnt - 1, not cnt
results.get(cnt - 1).getvalue();
I have a String (holding a number) as the primary key. Can I use it to get descending order?
Is there any method through which I can get the last row?
You should probably sort in the opposite order (if possible for your query, the data store has some restrictions here) and get the first result of that.
Also, if you store numbers in String fields the order may not be what you want it to be (you might need padding here).
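The padding issue is easy to reproduce without the data store. This is a plain Python sketch of string ordering, not Datastore code, and the width of 5 digits is an arbitrary assumption:

```python
keys = ["9", "10", "100"]

# Strings sort character by character, so "9" comes out as the largest key.
plain_last = sorted(keys, reverse=True)[0]

# Zero-padding to a fixed width makes string order agree with numeric order.
padded = [k.zfill(5) for k in keys]
padded_last = sorted(padded, reverse=True)[0]
```

Sorting descending and taking the first element gives "9" for the unpadded keys but "00100" for the padded ones, which is the row you actually want.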

Mongoid Syntax Questions

1) Finding by instance object
Assuming I have the instance object called @topic. I want to retrieve the answers for this given topic. I was thinking I should be able to pass in :topics=>@topic, but I had to do the very ugly query below.
@answers = Answers.where(:topic_ids => {"$in" => [@topic.id]})
2) Getting the string representation of the id. I have a custom function (shown below). But shouldn't this be a very common requirement?
def sid
return id.to_s
end
If your associations are set up correctly, you should be able to do:
@topic.answers
It sounds like the above is what you are looking for. Make sure you have set up your associations correctly. Mongoid is very forgiving when defining associations, so it can seem that they are set up right when there is in fact a problem like mismatched names in references_many and referenced_in.
If there's a good reason why the above doesn't work and you have to use a query, you can use this simple query:
@answers = Answer.where(:topic_ids => @topic.id)
This will match any Answer record whose topic_ids include the supplied ID. The syntax is the same for array fields as for single-value fields like Answer.where(:title => 'Foo'). MongoDB will interpret the query differently depending on whether the field is an array (check if supplied value is in the array) or a single value (check if the supplied value is a match).
Here's a little more info on how MongoDB handles array queries:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-ValueinanArray
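The matching rule described above can be mimicked in a few lines. This is a simplified Python model of the behaviour for a scalar query value, not MongoDB or Mongoid code, and the function and sample documents are invented for illustration:

```python
def field_matches(document, field, value):
    """Model of an equality query such as where(field => value)."""
    stored = document.get(field)
    if isinstance(stored, list):
        # Array field: match if the supplied value is one of the elements.
        return value in stored
    # Single-value field: match on plain equality.
    return stored == value


answers = [
    {"title": "Foo", "topic_ids": [1, 2]},
    {"title": "Bar", "topic_ids": [3]},
]
for_topic_2 = [a["title"] for a in answers if field_matches(a, "topic_ids", 2)]
titled_foo = [a["title"] for a in answers if field_matches(a, "title", "Foo")]
```

The same call shape works for both kinds of field, which is why the explicit "$in" with a one-element list is not needed.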

Resources