Mixing Gremlin, OpenCypher and SPARQL - graph-databases

Since AWS Neptune is, at a lower level, a triple store (presumably a BlazeGraph fork), it can be queried via SPARQL. At the same time, one can use Gremlin and openCypher, but as far as I can tell these do not access the triple data (and vice versa). That is, it looks like the property graph is kept separate from the triples. Is this correct, or am I missing something obvious? Can one fetch property-graph data with SPARQL, and can one query triples with Gremlin/openCypher?

Property graph and RDF data are stored separately inside Neptune. If you write data via the /sparql endpoint, it can only be accessed via SPARQL thereafter. The situation is similar if you write via the /gremlin or /oc endpoints, although property graph data can be accessed via either Gremlin or openCypher interchangeably.
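To make the split concrete, here is a minimal sketch assuming a hypothetical cluster endpoint with IAM authentication disabled; the hostname, port and data are placeholders. It simply shows that the same cluster exposes an RDF view at /sparql and a property-graph view at /gremlin, and that each query only sees data written through its own side.

import requests
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Hypothetical cluster endpoint -- replace with your own.
NEPTUNE = "my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com:8182"

# RDF side: POST a SPARQL query to the /sparql endpoint.
sparql = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
resp = requests.post("https://%s/sparql" % NEPTUNE,
                     data={"query": sparql},
                     headers={"Accept": "application/sparql-results+json"})
print(resp.json())          # only triples written via /sparql appear here

# Property-graph side: connect to the /gremlin endpoint of the same cluster.
conn = DriverRemoteConnection("wss://%s/gremlin" % NEPTUNE, "g")
g = traversal().withRemote(conn)
print(g.V().limit(5).valueMap(True).toList())   # only property-graph data appears here
conn.close()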

Related

Accessibility of data using different APIs

Can I access the same data using different APIs?
Example: Can I insert the data using YCQL and read it using YEDIS?
The short answer is no. Even though the APIs share a common distributed document store, the data modeling and query constructs they offer are significantly different. This means that data inserted or managed by one API cannot be queried by the other API. More on this in the note here: https://docs.yugabyte.com/latest/introduction/#what-client-apis-are-supported-by-yugabyte-db

How do I query heterogeneous JSON data in S3?

We have an Amazon S3 bucket that contains around a million JSON files, each one around 500KB compressed. These files are put there by AWS Kinesis Firehose, and a new one is written every 5 minutes. These files all describe similar events and so are logically all the same, and are all valid JSON, but have different structures/hierarchies. Also their format & line endings are inconsistent: some objects are on a single line, some on many lines, and sometimes the end of one object is on the same line as the start of another object (i.e., }{).
We need to parse/query/shred these objects and then import the results into our on-premise data warehouse SQL Server database.
Amazon Athena can't deal with the inconsistent spacing/structure. I thought of creating a Lambda function that would clean up the spacing, but that still leaves the problem of different structures. Since the files are laid down by Kinesis, which forces you to put the files in folders nested by year, month, day, and hour, we would have to create thousands of partitions every year. The limit to the number of partitions in Athena is not well known, but research suggests we would quickly exhaust this limit if we create one per hour.
I've looked at pumping the data into Redshift first and then pulling it down. Amazon Redshift external tables can deal with the spacing issues, but can't deal with nested JSON, which almost all these files have. COPY commands can deal with nested JSON, but require us to know the JSON structure beforehand, and don't allow us to access the filename, which we would need for a complete import (it's the only way we can get the date). In general, Redshift has the same problem as Athena: the inconsistent structure makes it difficult to define a schema.
I've looked into using tools like AWS Glue, but they just move data around, and they can't move it into our on-premise server, so we have to find some sort of intermediary, which increases cost, latency, and maintenance overhead.
I've tried cutting out the middleman and using ZappySys' S3 JSON SSIS task to pull the files directly and aggregate them in an SSIS package, but it can't deal with the spacing issues or the inconsistent structure.
I can't be the first person to face this problem, but I just keep spinning my wheels.
Rumble is an open-source (Apache 2.0) engine that allows you to use the JSONiq query language to directly query JSON (specifically, JSON Lines files) stored on S3, without having to move it anywhere else or import it into any data store. Internally, it uses Spark and DataFrames.
It was successfully tested on collections of more than 20 billion objects (10+ TB), and it also works seamlessly if the data is nested and heterogeneous (missing fields, extra fields, different types in the same field, etc). It was also tested with Amazon EMR clusters.
Update: Rumble also works with Parquet, CSV, ROOT, AVRO, text, and SVM, and on HDFS, S3, and Azure.
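As an illustration of the JSONiq angle, here is a rough sketch, not an official client: it assumes RumbleDB has been started in its HTTP server mode and that its query endpoint is exposed at /jsoniq on port 8001 (check the docs of the version you run), and the bucket, prefix and field names are made up.

import requests

# Made-up bucket, prefix and field names, purely for illustration.
JSONIQ_QUERY = """
for $e in json-file("s3://my-bucket/firehose/2020/*")
where $e.eventType eq "click"
return { "id": $e.id, "when": $e.timestamp }
"""

# Assumed endpoint of a locally running RumbleDB server -- adjust as needed.
resp = requests.post("http://localhost:8001/jsoniq", data=JSONIQ_QUERY)
print(resp.text)   # query results, one JSON object per item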
I would suggest two types of solutions.
I believe MongoDB/DynamoDB/Cassandra are good at processing heterogeneous JSON structures. I am not sure about the inconsistencies in your JSON, but as long as each document is valid JSON, it should be ingestible into one of the above databases. Please provide a sample JSON if possible. These tools do have their own advantages and disadvantages, and data modelling for these NoSQL stores is entirely different from traditional SQL.
I am not sure why your Lambda is not able to do the cleanup. Presumably you would trigger a Lambda whenever an S3 PUT happens in the bucket; that should be able to clean up the JSON unless there are complex processes involved (a sketch of such a cleanup step follows this answer).
Unless the JSON is in a proper format, no tool will be able to process it perfectly. More than Athena or Spectrum, I believe MongoDB/DynamoDB/Cassandra would be the right fit for this use case.
It would also be great if you could share the limitations you faced when you created a lot of partitions.
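On the Lambda cleanup point above, here is a minimal sketch (not tied to any particular trigger wiring) of how the concatenated, irregularly spaced objects from the question could be split into individual JSON documents using only the standard library; the sample string is made up.

import json

def split_concatenated_json(blob):
    """Yield every JSON object found in a string, regardless of whitespace or '}{' joints."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(blob):
        # Skip whitespace between objects (raw_decode chokes on leading whitespace).
        while pos < len(blob) and blob[pos].isspace():
            pos += 1
        if pos >= len(blob):
            break
        obj, pos = decoder.raw_decode(blob, pos)   # parse one object, get its end offset
        yield obj

sample = '{"a": 1}{"b": {"c": 2}}\n{"d": 3}'
print(list(split_concatenated_json(sample)))
# -> [{'a': 1}, {'b': {'c': 2}}, {'d': 3}]

Each yielded object can then be re-serialized as one JSON document per line (JSON Lines), which is the shape Athena, Redshift Spectrum and the tools mentioned above handle best.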

manual serialization / deserialization of AppEngine Datastore objects

Is it possible to manually define the logic of the serialization used for AppEngine Datastore?
I am assuming Google is using reflection to do this in a generic way. This works but proves to be quite slow. I'd be willing to write (and maintain) quite some code to speed up the serialization / deserialization of datastore objects (I have large objects and this consumes quite some percentage of the time).
The datastore uses Protocol Buffers internally, and there is no way around that, as it is the only way your application can communicate with the datastore.
(The implementation can be found in the SDK at google/appengine/datastore/entity_pb.py.)
If you think (de)serialization is too slow in your case, you probably have two choices (a sketch of the first follows below):
Move to a lower-level DB API. Next to the two well-documented ext.db and ext.ndb APIs there is another one at google.appengine.datastore. It lacks all the fancy model stuff and provides a simple (and hopefully faster) dictionary-like API, while keeping your datastore layout compatible with the other two DB APIs.
Serialize the object yourself and store it in a dummy entity consisting of just a text field. But you'll probably need to duplicate some data into your base entity, as you cannot filter or sort by data inside your self-serialized text.
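To make the first option concrete, here is a rough sketch on the legacy Python runtime. In the old SDK the dictionary-like wrapper lives in google.appengine.api.datastore (with the protocol-buffer layer underneath in google.appengine.datastore), so treat the exact module path and the kind/property names below as things to verify against your SDK version.

from google.appengine.api import datastore

# Write: an Entity behaves like a dict, no model classes or reflection involved.
entity = datastore.Entity('Document')        # 'Document' is a made-up kind name
entity['title'] = 'quarterly report'
entity['size'] = 12345
key = datastore.Put(entity)

# Read it back; the result is again a plain dict-like Entity.
fetched = datastore.Get(key)
print(fetched['title'], fetched['size'])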

Is OData suitable for multi-tenant LOB application?

I'm working on a cloud-based line-of-business application. Users can upload documents and other types of objects to the application. Users upload quite a number of documents, and together there are several million docs stored. I use SQL Server.
Today I have a somewhat-RESTful API which allows users to pass in a DocumentSearchQuery entity where they supply keywords together with the requested sort order and paging info. They get a DocumentSearchResult back, which is essentially a sorted collection of references to the actual documents.
I now want to extend the search API to other entity types than documents, and I'm looking into using OData for this. But I get the impression that if I use OData, I will face several problems:
There's no built-in limit on what fields users can query, which means that either performance will depend on whether or not they query an indexed field, or I will have to implement my own parsing of incoming OData requests to ensure they only query indexed fields. (Since it's a multi-tenant application and tenants share physical hardware, slow queries are not really acceptable, since they affect other customers.)
Whatever I use to access data in the backend needs to support IQueryable. I'm currently using Entity Framework, which does this, but I will probably use something else in the future, which means it's likely that I will need to do my own parsing of incoming queries again.
There's no built-in support for limiting what data users can access. I need to validate incoming OData queries to make sure they only access data they actually have permission to access.
I don't think I want to go down the road of manually parsing incoming expression trees to make sure they only try to access data which they have access to. This seems cumbersome.
My question is: Considering the above, is using OData a suitable protocol in a multi-tenant environment where customers write their own clients accessing the entities?
I think it is suitable here. Let me give you some opinions about the problems you think you will face:
"There's no built-in limit on what fields users can query, which means that either performance will depend on whether or not they query an indexed field, or I will have to implement my own parsing of incoming OData requests to ensure they only query indexed fields. (Since it's a multi-tenant application and tenants share physical hardware, slow queries are not really acceptable, since they affect other customers.)"
True. However, you can inspect the fields referenced in the $filter and either allow the operation or deny it (a sketch of that check follows).
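As a language-agnostic illustration of that check (the service in the question is .NET/WCF Data Services, so this is only a sketch of the idea, not of that API): extract the property names referenced by an incoming $filter and reject the request if any of them is not in the per-entity whitelist of indexed fields. The entity and field names are made up.

import re

# Hypothetical whitelist of indexed fields per entity set.
INDEXED_FIELDS = {"Document": {"Title", "CreatedDate", "OwnerId"}}

# OData operators/keywords that are not property names.
ODATA_KEYWORDS = {"eq", "ne", "gt", "ge", "lt", "le", "and", "or", "not", "true", "false", "null"}

def filter_is_allowed(entity_set, odata_filter):
    # Drop string literals, then tokenise very naively -- enough to show the
    # whitelist check, not a full OData expression parser.
    stripped = re.sub(r"'[^']*'", "", odata_filter)
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", stripped))
    referenced = tokens - ODATA_KEYWORDS
    return referenced <= INDEXED_FIELDS.get(entity_set, set())

print(filter_is_allowed("Document", "Title eq 'report' and CreatedDate gt 2019-01-01"))  # True
print(filter_is_allowed("Document", "Body eq 'secret'"))                                 # False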
"Whatever I use to access data in the backend needs to support IQueryable. I'm currently using Entity Framework, which does this, but I will probably use something else in the future, which means it's likely that I will need to do my own parsing of incoming queries again."
Yes, there is a provider for EF. That means that if you use something else in the future, you will need to write your own provider. If you already expect to move away from EF, you have probably made the decision too early, and I don't recommend WCF DS in that case.
"There's no built-in support for limiting what data users can access. I need to validate incoming OData queries to make sure they only access data they actually have permission to access."
There isn't any out-of-the-box support for that in WCF Data Services, right. However, that is part of the authorization mechanism you will need to implement anyway. The good news is that it is pretty easy to do with QueryInterceptors: you simply intercept the query and restrict it based on the user's privileges. This is something you would have to implement regardless of the technology you use.
My answer: considering the above, WCF Data Services is a suitable technology for a multi-tenant environment where customers write their own clients to access the entities, at least as long as you stay with EF. And you should keep in mind the huge amount of effort it saves you.

Is it possible to store graphs in HBase? If so, how do you model the database to support a graph structure?

I have been playing around with using graphs to analyze big data. It's been working great and has been really fun, but I'm wondering what to do as the data gets bigger and bigger.
Let me know if there's any other solution, but I thought of trying HBase because it scales horizontally and I can get Hadoop to run analytics on the graph (most of my code is already written in Java). However, I'm unsure how to structure a graph in a NoSQL database. I know each node can be an entry in the database, but I'm not sure how to model edges and add properties to them (names of nodes, attributes, PageRank, weights on edges, etc.).
Seeing how HBase/Hadoop is modeled after Bigtable and MapReduce, I suspect there is a way to do this, but I'm not sure how. Any suggestions?
Also, does what I'm trying to do make sense, or are there better solutions for big-data graphs?
You can store an adjacency list in HBase/Accumulo in a column oriented fashion. I'm more familiar with Accumulo (HBase terminology might be slightly different) so you might use a schema similar to:
SrcNode(RowKey) EdgeType(CF):DestNode(CFQ) Edge/Node Properties(Value)
Where CF=ColumnFamily and CFQ=ColumnFamilyQualifier
You might also store node/vertex properties as separate rows using something like:
Node(RowKey) PropertyType(CF):PropertyValue(CFQ) PropertyValue(Value)
The PropertyValue could be either in the CFQ or the Value
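A small sketch of that schema using the happybase client, assuming an HBase Thrift server on localhost; the table, column family and node names ('graph', 'follows', 'prop', 'alice', 'bob') are made up for illustration, and the edge/node property values are stored as cell values as described above.

import happybase

conn = happybase.Connection('localhost')
if b'graph' not in conn.tables():
    # One column family per edge type plus one for vertex properties,
    # mirroring SrcNode(RowKey)  EdgeType(CF):DestNode(CFQ)  properties(Value).
    conn.create_table('graph', {'follows': {}, 'prop': {}})
table = conn.table('graph')

# Edge: alice -follows-> bob, with edge properties (e.g. a weight) as the cell value.
table.put(b'alice', {b'follows:bob': b'{"weight": 0.8}'})

# Vertex properties for alice, mirroring Node(RowKey)  PropertyType(CF):...  (Value).
table.put(b'alice', {b'prop:name': b'Alice', b'prop:pagerank': b'0.153'})

# All out-edges of alice are one single-row read.
print(table.row(b'alice', columns=[b'follows']))   # {b'follows:bob': b'{"weight": 0.8}'}

Note the caveat in the answer further down about multiple column families: if you have many edge types, folding the edge type into the column qualifier of a single family is the safer layout.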
From a graph-processing perspective, as mentioned by Arnon Rotem-Gal-Oz, you could look at Apache Giraph, which is an implementation of Google's Pregel. Pregel is the approach Google uses for large-scale graph processing.
Using HBase/Accumulo as input to Giraph was recently submitted (7 Mar 2012) as a new feature request: HBase/Accumulo Input and Output formats (GIRAPH-153).
You can store the graph in HBase as an adjacency list, so for example each row would have columns for general properties (name, PageRank, etc.) and a list of keys of adjacent nodes (if it is a directed graph, then just the nodes you can reach from this node, or an additional column with the direction of each edge).
Take a look at Apache Giraph (you can also read a little more about it here); while this isn't about HBase, it is about handling graphs in Hadoop.
Also, you may want to look at Hadoop 0.23 (and up), as the YARN engine (aka MapReduce 2) is more open to non-MapReduce algorithms.
I would not use HBase in the way "Binary Nerd" recommended, as HBase does not perform very well when handling multiple column families.
Best performance is achieved with a single column family (a second one should only be used if you very often access the content of only one column family and the data stored in the other column family is very large).
There are graph databases built on top of HBase that you could try and/or study.
Apache S2Graph
provides a REST API for storing and querying graph data represented as edges and vertices. There you can find a presentation where the construction of row/column keys is explained, along with an analysis of the operations' performance that influenced, or is influenced by, the design.
Titan
can use other storage backends besides HBase, and has integration with analytics frameworks. It is also designed with big data sets in mind.
