I want to delete a vertex and all vertices that point to it, but only those whose edges to it carry labels from a given set - not necessarily every incoming vertex.
In a relational database you can configure automatic deletion of related entities using cascading deletes on foreign keys.
Is there a similar configuration expressible in Titan DB or with Gremlin?
There is no such feature in Titan at this time. You would have to manage such logic manually as part of your delete of the vertex:
// remove only the vertices that point to vertex 123 over the given edge labels
g.v(123).in('some','labels','only').remove()
// then remove vertex 123 itself
g.v(123).remove()
I'm looking for an open source data store that scales as easily as Cassandra but whose data can be queried via documents, like MongoDB.
Are there currently any databases out there that do this?
On the website http://nosql-database.org you can find a list of many NoSQL databases sorted by datastore type; you should check the document stores there.
I'm not naming any specific database to avoid a biased/opinion-based answer, but if you are interested in a data store as scalable as Cassandra, you probably want to check those that use a master-master/multi-master/masterless architecture (call it what you like, the idea is the same), where both writes and reads can be spread across all nodes in the cluster.
I know Cassandra is optimized towards writes rather than reads, but without further details in the question I can't refine the answer any further.
Update:
Disclaimer: I haven't used CouchDB at all, and haven't tested its performance either.
Since you mentioned CouchDB, I'll add what I found in the official documentation, in the section on distributed databases and replication.
CouchDB is a peer-based distributed database system. It allows users and servers to access and update the same shared data while disconnected. Those changes can then be replicated bi-directionally later.
The CouchDB document storage, view and security models are designed to work together to make true bi-directional replication efficient and reliable. Both documents and designs can replicate, allowing full database applications (including application design, logic and data) to be replicated to laptops for offline use, or replicated to servers in remote offices where slow or unreliable connections make sharing data difficult.
The replication process is incremental. At the database level, replication only examines documents updated since the last replication. Then for each updated document, only fields and blobs that have changed are replicated across the network. If replication fails at any step, due to network problems or crash for example, the next replication restarts at the same document where it left off.
Partial replicas can be created and maintained. Replication can be filtered by a javascript function, so that only particular documents or those meeting specific criteria are replicated. This can allow users to take subsets of a large shared database application offline for their own use, while maintaining normal interaction with the application and that subset of data.
That looks quite scalable to me: it seems you can add new nodes to the cluster and the data gets replicated to them.
Partial replicas also seem an interesting option for really big data sets, although I'd configure them very carefully to avoid situations where a query might not yield valid results, for example during a network partition when only a partial set is reachable.
First of all thanks for reading.
I need to replicate a subset of data based on a join filter, that is, a filter based on a join with another table (Microsoft: "Using join filters, you can extend a row filter from one published table to another."). This is the setting:
SQL Server 2012;
the replication source is itself a subscription of a transactional replication;
replication needs to be a one-way sync (from publisher to subscriber);
only one subscriber/subscription;
small dataset with not many transactions;
WAN network.
What I established so far:
Option 1 - Create views and replicate those to tables via Transactional replication.
pros: no triggers are used;
cons: objects like keys and constraints are not replicated.
Option 2 - Use Merge replication with the join filter and set @subscriber_upload_options = 2 (download only); see the sketch after this list.
pros: native MS functionality, all objects are replicated;
cons: merge replication uses triggers, which won't fire on bulk loads.
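To make option 2 concrete, here is a minimal T-SQL sketch of a download-only merge publication with a join filter. The publication name (FilteredPub), the tables (dbo.Customer, dbo.SalesOrder) and the filter clauses are hypothetical placeholders; the stored procedures and parameters are the standard merge replication ones, but adapt everything to your own schema:
-- Create the merge publication (minimal options shown).
EXEC sp_addmergepublication
    @publication = N'FilteredPub',
    @description = N'Download-only merge publication with a join filter';

-- Parent article carries the row filter; @subscriber_upload_options = 2 makes it download only.
EXEC sp_addmergearticle
    @publication = N'FilteredPub',
    @article = N'Customer',
    @source_owner = N'dbo',
    @source_object = N'Customer',
    @type = N'table',
    @subset_filterclause = N'[Region] = ''EU''',
    @subscriber_upload_options = 2;

-- Child article, also download only.
EXEC sp_addmergearticle
    @publication = N'FilteredPub',
    @article = N'SalesOrder',
    @source_owner = N'dbo',
    @source_object = N'SalesOrder',
    @type = N'table',
    @subscriber_upload_options = 2;

-- The join filter extends the row filter on Customer to SalesOrder.
EXEC sp_addmergefilter
    @publication = N'FilteredPub',
    @article = N'SalesOrder',
    @filtername = N'SalesOrder_Customer',
    @join_articlename = N'Customer',
    @join_filterclause = N'[Customer].[CustomerID] = [SalesOrder].[CustomerID]',
    @join_unique_key = 1;  -- the join is on Customer's unique key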
The results of these two approaches are exactly the same; only the technique differs, for example the agents that are used. To my understanding, merge replication is aimed especially at server-client architectures, which is not my case, but it works.
Because the result is the same, I am a bit in doubt about which approach to follow. I was hoping you could give me some points to consider, or advise me on which approach to take.
For the setup given in this question, both Transactional and Merge replication types are good.
The only things for you to consider are:
If latency for data transfer to the Subscriber should be minimal, choose Transactional Replication.
If you require access to intermediate data states, choose Transactional Replication.
For example, if a row changes five times, transactional replication allows an application to respond to each change (such as firing a trigger), not simply the net data change to the row.
However, the type of replication you choose for an application depends on many factors.
Here are links to relevant articles on learn.microsoft.com:
"Types of Replication"
"Transactional Replication"
"Merge Replication"
In phpMyAdmin, the tables do not show any relationships, and relationships cannot be implemented in MyISAM tables. Does that mean all database relationships and data integrity need to be taken care of at the code level rather than at the database level? Are there any advantages or disadvantages to this?
I think the lack of relationships is probably a legacy from the early days of MySQL. I have seen that the latest versions of Joomla use InnoDB tables instead of MyISAM, which might indicate that they are thinking of adding relationships (InnoDB supports foreign keys and triggers, while MyISAM does not).
Obviously, it is much easier to maintain data integrity by using relationships and triggers than by doing everything at the application level. The only disadvantage I can think of is that it might require a bit more work in the code to make sure the application doesn't give errors when you try to delete items referenced by foreign keys.
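To illustrate the difference, here is a minimal MySQL sketch with two hypothetical tables; on InnoDB the constraint is actually enforced, whereas MyISAM has historically parsed the FOREIGN KEY clause and silently ignored it:
-- Parent table: one row per author.
CREATE TABLE author (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
) ENGINE = InnoDB;

-- Child table: the relationship is enforced by the database, not by application code.
CREATE TABLE article (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    author_id INT UNSIGNED NOT NULL,
    title     VARCHAR(200) NOT NULL,
    CONSTRAINT fk_article_author
        FOREIGN KEY (author_id) REFERENCES author (id)
        ON DELETE RESTRICT   -- deleting a referenced author fails instead of leaving orphans
) ENGINE = InnoDB;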
regards Jonas
We are new to Vertica and found it relatively surprising that only one database at a time can be UP/active. In our research work we need to access multiple databases at a time, so I'd like to know how other Vertica users manage this limitation. The only approaches I've thought of so far are a) taking turns (start and stop databases as needed), or b) (mis-)using schemas to group tables into logical databases. Thanks for your help!
You can have multiple databases. Each database would need dedicated nodes. With a 6 node cluster:
DB1 on node1, node2, node3
DB2 on node4, node5, node6
In order to maintain high availability, each database would require at least 3 nodes for a K-safety level of 1. If a database loses one node at K-safety 1, it will continue to run normally.
The way Vertica is designed is intended for a single database instance. Vertica falls under the MPP (Massively Parallel Processing) category, and multiple databases would compete for resources on the cluster. The parallel design distributes storage and workload across the nodes. The best design is to logically create your schemas like you would databases, for example:
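A minimal sketch (hypothetical schema and table names) of one Vertica database with schemas standing in for the separate logical databases:
-- Each research project gets its own schema instead of its own database.
CREATE SCHEMA project_a;
CREATE SCHEMA project_b;

CREATE TABLE project_a.measurements (id INT, taken_at TIMESTAMP, reading FLOAT);
CREATE TABLE project_b.measurements (id INT, taken_at TIMESTAMP, reading FLOAT);

-- Queries stay separated per "logical database" simply by schema-qualifying the tables.
SELECT COUNT(*) FROM project_a.measurements;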
You can run more than one Vertica database even on the same node; you just need to change the port number the database runs on.
But as @FreshPrinceOfSO said, Vertica is quite hungry for resources (memory in particular), so it is recommended to keep your cluster running a single database.
I prefer to create a new cluster instead of mixing up the schemas. If you do choose to create schemas that behave like database repositories, you need a good knowledge of your resource management tasks.
I'm getting my first exposure to data warehousing, and I'm wondering whether it is necessary to have foreign key constraints between facts and dimensions. Are there any major downsides to not having them? I'm currently working with a relational star schema. In traditional applications I'm used to having them, but I started to wonder if they are needed in this case. I'm currently working in a SQL Server 2005 environment.
UPDATE: For those interested, I came across a poll asking the same question.
Most data-warehouses (DW) do not have foreign keys implemented as constraints, because:
In general, a foreign key constraint would be checked on an insert into a fact table, on any key update, and on a delete from a dimension table.
During loading, indexes and constraints are dropped to speed-up the loading process, data integrity is enforced by the ETL application.
Once tables are loaded, DW is essentially read-only -- the constraint does not trigger on reads.
Any required indexes are re-built after the loading.
Deleting in a DW is a controlled process. Before deleting rows from dimensions, fact tables are queried for the keys of the rows to be deleted -- deleting is allowed only if those keys do not exist in any of the fact tables.
Just in case, it is common to periodically run queries to detect orphan records in fact tables (a sketch follows below).
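As a rough illustration of the load and orphan-check steps above, using SQL Server syntax and hypothetical fact/dimension and constraint names:
-- Disable the FK for the bulk load; the ETL process guards integrity during this window.
ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_DimCustomer;

-- ... bulk load dbo.FactSales here ...

-- Re-enable it WITH CHECK so existing rows are re-validated and the constraint is trusted again.
ALTER TABLE dbo.FactSales WITH CHECK CHECK CONSTRAINT FK_FactSales_DimCustomer;

-- Periodic orphan check: fact rows whose customer key has no matching dimension row.
SELECT f.CustomerKey, COUNT(*) AS orphan_rows
FROM dbo.FactSales AS f
LEFT JOIN dbo.DimCustomer AS d ON d.CustomerKey = f.CustomerKey
WHERE d.CustomerKey IS NULL
GROUP BY f.CustomerKey;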
We use them, and we're happy with it.
Is it good practice to have foreign keys in a datawarehouse (relationships)?
There is overhead, but you can always disable the constraint during load and then re-enable it.
Having the constraint in place can catch ETL bugs and modelling defects.
I think in theory you need them, but it depends on how you spread your data across databases. If everything is in the same database, a foreign key can help you, because setting a foreign key will help the database select faster based on the indexing. If you spread tables over many databases, you need to check integrity at the application level.
You can have your database check it for you, but it can be slow. And generally, in a data warehouse, we don't care as much about redundancy or integrity: we already have a lot of data, and a little redundancy or a few integrity issues will not affect the overall aggregate data.
I don't know about necessary, but I feel they are good for data integrity reasons. You want to make sure that your fact table is always pointing to a valid record in the dimension table. Even if you are sure this will happen, why not have the database validate the requirement for you?
The reasons for using integrity constraints in a data warehouse are exactly the same as in any other database: to guarantee the integrity of the data. Assuming you and your users care about the data being accurate then you need some way of ensuring that it remains so and that business rules are being correctly applied.
As far as I know, FKs speed up queries. Also, many BI solutions exploit them in their integration layer. So for me they are a must in DWs.
Hope this thread is still active.
My thinking is: for large fact tables with many dimensions and records, foreign keys will slow inserts and updates, so the fact table becomes too slow to load, especially as it grows. Indexes are used for querying AFTER the table is loaded, so they can be disabled during inserts/updates and then rebuilt. The foreign key RELATION is what matters, NOT the foreign key constraint itself: the relation is really implicit in the ETL process. I have found that foreign keys make things TOO slow in a real-world data warehouse. You need a VIRTUAL foreign key: the relation is there but not the constraint. If you damage the foreign key relations in a data warehouse, you are doing something wrong.
If you disable them during inserts and there is a mismatch or orphan, you won't be able to re-enable them, so what's the point?
The whole point of the DW is fast access and querying. Foreign keys make that impossible.
Interesting debate: not easy to find this question on the Net
Kev