Versioned RDF store

Let me try rephrasing this:
I am looking for a robust RDF store or library with the following features:
Named graphs, or some other form of reification.
Version tracking (probably at the named graph level).
Privacy between groups of users, either at named graph or triple level.
Human-readable data input and output, e.g. TriG parser and serialiser.
I've played with Jena, Sesame, Boca, RDFLib, Redland and one or two others some time ago but each had its problems. Have any improved in the above areas recently? Can anything else do what I want, or is RDF not yet ready for prime-time?
Reading around the subject a bit more, I've found that:
Jena: nothing further.
Sesame: nothing further.
Boca: does not appear to be maintained any more and seems only really designed for DB2. OpenAnzo, an open-source fork, appears more promising.
RDFLib: nothing further.
Redland: nothing further.
Talis Platform appears to support changesets (wiki page and reference in Kniblet Tutorial Part 5), but it's a hosted-only service. I may still look into it, though.
SemVersion sounded promising, but appears to be stale.

Talis is the obvious choice, but privacy may be an issue, or a perceived issue anyway, since it's a SaaS offering. I say obvious because the three emboldened features in your list are core features of their platform, IIRC.
They don't have a features list as such - which makes it hard to back up this answer, but they do say that stores of data can be individually secured. I suppose you could - at a pinch - sign up to a separate store on behalf of each of your own users.
Human-readable input is often best supported by writing custom interfaces for each user task, so you'd best be prepared to do that as needs demand.
Regarding prime-time readiness. I'd say yes for some applications but otherwise "not quite". Mostly the community needs to integrate with existing developer toolsets and write good documentation aimed at "ordinary" developers - probably OO developers using Java, .NET and Ruby/Groovy - and then I predict it will snowball.
See also Temporal Scope for RDF triples

From: http://www.semanticoverflow.com/questions/453/how-to-implement-semantic-data-versioning/748#748
I personally quite like the pragmatic approach which Freebase has adopted.
Browse and edit views for humans:
http://www.freebase.com/view/guid/9202a8c04000641f80000000041ecebd
http://www.freebase.com/edit/topic/guid/9202a8c04000641f80000000041ecebd
The data model is exposed here:
http://www.freebase.com/tools/explore/guid/9202a8c04000641f80000000041ecebd
Strictly speaking, it's not RDF (it's probably a superset of it), but part of it can be exposed as RDF:
http://rdf.freebase.com/rdf/guid.9202a8c04000641f80000000041ecebd
Since it's a community-driven website, not only do they need to track who said what and when, but they are probably keeping the history as well (never delete anything):
http://www.freebase.com/history/view/guid/9202a8c04000641f80000000041ecebd
To conclude, the way I would tackle your problem is very similar and pragmatic. AFAIK, you will not find a solution which works out of the box. But you could use a "tuple" store with records wider than triples or quads, since 3 or 4 columns aren't enough to keep history at the finest granularity (i.e. triples|quads).
I would use the TDB code as a library (since it gives you B+Trees and a lot of other useful things you need) and a data model which allows me to count quads and assign to each quad an owner, a timestamp, and the previous/next quad(s) if available:
[ id | g | s | p | o | user | timestamp | prev | next ]
Where:
id - long (unique identifier; the same (g,s,p,o) asserted again gets a different id). This takes a lot of space, but it means you can count quads, and when you have a community-driven website (like this one) counting things is important.
g - URI (or blank node?|absent (i.e. default graph))
s - URI|blank node
p - URI
o - URI|blank node|literal
user - URI
timestamp - when the quad was created
prev - id of the previous quad (if present)
next - id of the next quad (if present)
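As a rough illustration of this record layout and the version chain it enables, here is a minimal sketch in TypeScript. The type names and the in-memory Map are assumptions for the example, not TDB's actual API:

    // Sketch of the versioned-quad record described above.
    // Names and types are illustrative, not TDB's API.
    type NodeId = string; // URI, blank node label, or literal

    interface VersionedQuad {
        id: number;          // unique per assertion; re-asserting (g,s,p,o) gets a new id
        g: NodeId | null;    // null = default graph
        s: NodeId;
        p: NodeId;
        o: NodeId;
        user: NodeId;        // who asserted the quad
        timestamp: number;   // when the quad was created
        prev: number | null; // id of the previous version, if present
        next: number | null; // id of the next version, if present
    }

    // Walk a quad's history backwards along the prev links.
    function history(store: Map<number, VersionedQuad>, id: number): VersionedQuad[] {
        const chain: VersionedQuad[] = [];
        let cur = store.get(id);
        while (cur) {
            chain.push(cur);
            cur = cur.prev != null ? store.get(cur.prev) : undefined;
        }
        return chain;
    }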
Then you need to think about which indexes you need, which depends on the way you want to expose and access your data.
You do not need to expose all your internal structures/indexes to external users/people/applications. And if and when RDF vocabularies or ontologies for representing versioning emerge, you will be able to expose your data using them quickly (if you want to).
Be warned: this is not common practice, and if you look at it with your "semantic web glasses" on, it's probably wrong, bad, etc. But I am sharing the idea since I believe it's not harmful, it provides a solution to your question (though it will be slower and use more space than a plain quad store), and part of it can be exposed to the semantic web as RDF / Linked Data.
My 2 (heretic) cents.

LMF comes with a versioning module: http://code.google.com/p/lmf/wiki/ModuleVersioning
The Linked Media Framework is an easy-to-set-up server application developed in Java EE that bundles core Semantic Web technologies to offer many advanced services.

Take a look to see if Virtuoso's RDF support meets your needs; it sounds as though it might go quite a way, and it plays nicely with XML and web services too. There's a commercial and a GPL'd version.

Mulgara/Fedora-Commons might fit the bill. I believe that privacy is currently a major project, and I understand that it supports versioning, but it might be too much in that it is an object store too.

(years later)
I think both Oracle's RDF store:
http://www.oracle.com/technetwork/database/options/semantic-tech/index.html
and the recently announced graph store in IBM's DB2 support much of this:
http://www-01.ibm.com/software/data/db2/linux-unix-windows/graph-store.html

Related

How do you handle collection and storage of new data in an existing system?

I am new to system design and have been asked to solve a problem.
Given a car rental service website, I need to work on a new feature.
The company has come up with some more data that they would like to capture and analyze along with the data that they already have.
This new data can be something like time and cost to assemble a car.
I need to understand the following:
1: How should I approach the problem, from an API design perspective?
2: Is changing the schema of your tables going to do any good, if that is an option?
3: Which databases can be used?
The values, once stored, can be changed. For example, the time to assemble can decrease or increase, hence users should be able to update the values.
To answer your question, let's divide it into two parts: the ideal architecture, and Q&As.
Architecture:
A typical system consists of many technologies working together to solve a practical problem. Problems can be solved in many ways and may have more than one solution. We are not talking about the efficiency and effectiveness of any particular architecture here, as that is a whole new subject to explore. But it's always wise to choose what's best for your use case.
Since you already have existing software built, it's always helpful to follow its existing design patterns. This will help you understand the existing code in detail and allow you to create logical blocks that fit in nicely and actually help in integrating the functionality instead of working against it.
With the pre-planning phase covered, let's discuss what solution I think is ideal for your use case.
Q&As
1. How should I approach the problem, from an API design perspective?
There will be lots of assumptions here. Any system consisting of an API should have basic authentication and authorization functionality wherever needed. Apart from that, try to stick to the full REST specification, which will allow API consumers to follow standard paths; integration then has minimal friction when deciding what endpoints should look like and what they expect from the consumer.
Regardless, not all systems are ideal candidates for this, so it's up to the system designer to decide how much of the system is compatible with standard practices.
Naming conventions matter: a newer version of the API gets api/v2 paths while the old one keeps api/v1. This is good practice for routing new functionality and allows the system to expand seamlessly.
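As a sketch of that convention with Express (the framework choice, routes, and payloads are assumptions for illustration, not part of the question):

    import express from "express";

    const app = express();

    // v1: the original endpoint, kept intact for existing consumers.
    app.get("/api/v1/cars/:id", (req, res) => {
        res.json({ id: req.params.id, model: "Sedan" });
    });

    // v2: adds the newly captured assembly data without breaking v1 clients.
    app.get("/api/v2/cars/:id", (req, res) => {
        res.json({
            id: req.params.id,
            model: "Sedan",
            assembly: { timeHours: 30, costUsd: 12000 }, // the new fields
        });
    });

    app.listen(3000);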
2: Is changing the schema of your tables going to do any good, if that is an option?
In the short term, when you do not have much data, migrating data is relatively easy. When the data becomes huge, migration is much more painful and resource-intensive.
Good practices can prevent scenarios where you would need to migrate data at all.
Database normalization becomes crucial when the data structure is likely to grow rapidly; it requires attention up front.
Regardless of whether you use a SQL or NoSQL solution, a good data structure will always be helpful in both data management and programming implementation.
In my opinion, getting data structures near-perfect is always a good idea, because it reduces the future costs and frustration of migration. Still, some use cases require adding columns, and it's okay to add them as long as they do not have much impact on existing code. Otherwise, the additional fields can always be decoupled into a separate table.
3: Which databases can be used?
Typically any RDBMS is enough for this kind of task. You might be surprised by case studies of companies producing large amounts of data that still run MySQL in clusters.
So the answer is: as long as you have a normal scenario, go ahead and pick any database of your choice, until you hit its single-instance scalability limits. And those limits are pretty generous for small- to mid-scale apps.
How should I approach the problem, from an API design perspective?
Design a good data model which is appropriate for the data it needs to store. The API design will follow from the data model.
Is changing the schema of your tables going to do any good, if that is an option?
Does the new data belong in the existing tables? Then maybe you should store it there. Except: can you add new columns without breaking any existing applications? Maybe you can but the regression testing you'll need to undertake to prove it may be ruinous for your timelines. Separate tables are probably the safer option.
Which databases can be used?
You're rather vague about the nature of the data you're working with, but it seems structured (numbers?). That suggests a SQL database with strong datatypes would be the best fit. Beyond that, use whatever data platform is currently in use. Any perceived benefits from a different product will be swept away by the complexities and hassle of deploying it.
Last word. Talk this over with your boss (or whoever set you this task). Don't rely on the opinions of some random stranger on the interwebs.

Database of scientific paper abstracts

I am trying to find a database of scientific papers that will allow me to:
1. Get metadata of papers by DOI (including abstracts);
2. Do this regularly (e.g. daily updates);
3. Download the whole existing database.
I know about the Crossref API; however, only 3% of all publications presented there have abstracts (and none of the biggest publishers like Springer or Elsevier provide them). On the other hand, I see some projects like Dimensions or Researcher which have already implemented the mentioned functionality. So the question is: does somebody know of such services (possibly paid) and have experience working with them?
Have you looked at Semantic Scholar (https://www.semanticscholar.org/)? They have an API that supports the first of your requirements (http://api.semanticscholar.org/) and also provide the "Open Research Corpus" (http://labs.semanticscholar.org/corpus/) which should satisfy your third requirement. It is a smaller database than what is provided by Scopus or Web of Science, but both of those require subscriptions to fully use their APIs and don't (as far as I know) have a real way for you to purchase a full download of the database.
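For the DOI lookup, a minimal sketch against the API linked above (the endpoint path reflects its docs at the time and may have changed, so treat it as an assumption):

    // Fetch paper metadata (including the abstract, when available) by DOI.
    async function getPaperByDoi(doi: string): Promise<unknown> {
        const url = `https://api.semanticscholar.org/v1/paper/${encodeURIComponent(doi)}`;
        const res = await fetch(url);
        if (!res.ok) throw new Error(`Lookup failed: ${res.status}`);
        return res.json(); // title, authors, abstract, citations, ...
    }

    // Example DOI from the API's own documentation.
    getPaperByDoi("10.1093/mind/LIX.236.433").then(console.log);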

Are RDF / Triple Stores suited for storing application data? (as opposed to the graph Metadata)

I'm trying to create a small web application for a "personal information manager" / wiki kind of tool where I can take notes in the form of HTML snippets (or maybe Markdown), annotate them with some https://schema.org/ microdata and store both the snippet and the metadata somewhere for querying.
My understanding so far is that most semantic data stores (triple/quad stores, or databases supporting RDF) are better suited for storing and querying mainly the metadata. So I'll probably also want some traditional store of some sort (relational, document store, key-value, or even a non-rdf graph db) where I can store the full text of each note and maybe some other bits like time of last access, user-id that owns the note, etc, and also perform traditional (non-semantic) fulltext queries.
I started looking for stores that would allow me to store both data and metadata in a single place. I found a few: Ontotext GraphDB, Stardog, MarkLogic, etc. All of these seem to do exactly what I want, but have some pretty limiting free license terms that really discourage me from studying them in depth: I prefer to study open technologies that I could potentially use on a real product.
Before digging deeper, I was wondering:
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
If there's any setup involving free/open-source software that developers with experience in RDF/SPARQL can recommend, given the problem I describe.
Right now I'm just leaning towards using Apache Jena for the RDF store and SPARQL queries, and something completely independent for the rest of the data (PostgreSQL most likely).
Before digging deeper, I was wondering:
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
Not necessarily, no, though there certainly are some cases in which that distinction may be useful. But most RDF databases offer scalable storage for both data and metadata. The only requirement is that your (meta)data is represented as RDF. If you're worried about performance of things like text queries, most of them offer support for full-text indexing through Lucene, Solr, or Elasticsearch.
If there's any setup involving free/open-source software that developers with experience in RDF/SPARQL can recommend, given the problem I describe.
This is really not the right place to ask this question. Tool recommendations are considered off-topic on Stack Overflow since they attract biased answers. But as said, there are plenty of tools, both open-source/free and commercial, that you can look into. I suggest you pick one you like the look of, experiment a bit, and perhaps talk to that particular tool's community to explain what you're trying to do. Apache Jena and Eclipse RDF4J are two popular open-source projects, but there are plenty of others.

Data masking for data in AWS RDS

I have an AWS RDS (Aurora DB) instance and I want to mask the data in the DB. Does Amazon provide any service for data masking?
I have seen RDS encryption, but I am looking for data masking because the database contains sensitive data. So I want to know: is there any service they provide for data masking, or any other tool that can be used to mask the data and load it manually into the DB?
A list of tools that can be used for data masking in my case would be most appreciated, because I need to mask the data for testing, as the original DB contains sensitive information like PII (Personally Identifiable Information). I also have to transfer this data to my co-workers, so I consider data masking an important factor.
Thanks.
This is a fantastic question, and I think your proactive approach to securing the most valuable asset of your business is something that a lot of people should heed, especially if you're sharing the data with your co-workers. Letting people see only what they need to see is an undeniably good way to reduce your attack surface. Standard cyber-security methods are no longer enough in my opinion, as demonstrated by numerous attacks and by people losing laptops/USBs with sensitive data on them. We are just humans, after all. With the GDPR coming into force in May next year, any company with customers in the EU will have to demonstrate privacy by design, and anonymisation techniques such as masking have been cited as a way to show this.
NOTE: I have a vested interest in this answer because I am working on such a service you're talking about.
We've found that the right masking method depends on your exact use case and on the size and contents of your data set. If your data set has minimal fields and you know where the PII is, you can run standard queries to replace sensitive values, e.g. John -> XXXX. If you want to maintain some human readability, there are libraries such as Python's Faker that generate random locale-based PII you can replace your sensitive values with. (PHP Faker, Perl Faker and Ruby Faker also exist.)
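For example, a minimal sketch of that value-replacement approach using the JavaScript Faker port (@faker-js/faker; the row shape and column names are made up for the example, and the API shown is from recent versions of the library):

    import { faker } from "@faker-js/faker";

    // Invented row shape for the example.
    interface CustomerRow { id: number; firstName: string; email: string; }

    // Replace sensitive values with realistic-looking fakes, e.g. John -> Miles,
    // keeping the data human-readable for testing.
    function maskRow(row: CustomerRow): CustomerRow {
        return {
            id: row.id,                          // non-sensitive key stays stable
            firstName: faker.person.firstName(),
            email: faker.internet.email(),
        };
    }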
DISCLAIMER: Straightforward masking doesn't guarantee total privacy. Think of someone identifying individuals in a masked Netflix data set by cross-referencing it with time-stamped IMDb data, or Guardian reporters identifying a judge's porn preferences from masked ISP data.
Masking does get tedious as your data set grows in fields/tables, and you may want to set up different levels of access for different co-workers, e.g. data science gets lightly anonymised data while marketing gets access to heavily anonymised data. PII in free-text fields is annoying, and understanding what data is available in the world that attackers could use for cross-referencing is a big task.
The service I'm working on aims to alleviate all of these issues by automating the process with NLP techniques and a good understanding of the mathematics of anonymisation. We're bundling this up into a web service and are keen to launch on the AWS Marketplace. I would love to hear more about your use case; we're in private beta at the moment, so let me know if you want early access.
If you are exporting or importing data using CSV or JSON files (i.e. to share with your co-workers), then you could use FileMasker. It can be run as an AWS Lambda function reading/writing CSV/JSON files on S3.
It's still in development but if you would like to try a beta now then contact me.
Disclaimer: I work for DataVeil, the developer of FileMasker.

I need a client-side browser database. What are my options?

I'm creating a web site that I think must have a client-side database. The other option would be to stick everything on the server at the expense of increased complexity and decreased scalability. What options do I have? Must I build a plugin? Must I wait until everybody's HTML5-compliant?
Update: There have been a lot of comments about why I would actually need this. Here are my thoughts; tell me if I'm being silly:
The clients will have a large and complex state that will require something like a database to provide the data interaction that I need. Therefore (I think) cookies are out of the picture.
This data is transient, so the client won't care if it gets erased as soon as they close a session. However they will need to keep the data if they go to a different web page and then come back. Therefore (I think) somehow storing the data in some sort of a javascript SQL implementation will not work.
I can certainly do everything that I want to do on the server, and servers can scale to manage the load (Facebook). But (I think) I'd rather build a plugin than pay for the infrastructure to support this load. This is for a bare bones startup. (The richer the startup is, the barer my bones will be.)
Indexed Database (Can I use)
Web SQL (Can I use)
localStorage
I'm about 5 years late in answering this, but given that there are errors and outdated data in some of the existing answers, and unaddressed points in the original question, I figured I'd throw in my two cents.
First, contrary to what others have implied on here, localStorage is not a database. It is (or should be perceived as) a persistent, string-based key-value store...
...which may be perfectly fine for your needs (and brings me to my second point).
Do you need explicit or implicit relationships between your data items?
How about the ability to query over said items?
Or more than 5 MB in space?
If you answered "no" to all of the above, go with localStorage and save yourself from the headaches that are the WebSQL and IndexedDB APIs. Well, maybe just the latter headache, since the former has been deprecated.
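To make the string-based, key-value nature concrete, here's a minimal sketch (the key name and data shape are arbitrary):

    // localStorage stores strings only, so structured data must be serialized.
    const state = { cart: ["sku-123"], lastPage: "/checkout" };
    localStorage.setItem("appState", JSON.stringify(state));

    // Later, e.g. after navigating away and coming back:
    const raw = localStorage.getItem("appState");
    const restored = raw ? JSON.parse(raw) : null;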
There are also several other client-side storage facilities (native and non-native) you may want to look in to, some of which are deprecated* but still see support from some browsers:
userData*
The rest of webStorage (sessionStorage and globalStorage*)
HTML5 File System*
Flash Locally Shared Objects
Silverlight Isolated Storage
Check out BakedGoods if you want to utilize any of these facilities, and more, without having to write low-level storage operation code. With it, placing data in one (or more) of them, for example, is as simple as:
bakedGoods.set({
    data: [{key: "key1", value: "val1"}, {key: "key2", value: "val2"}],
    storageTypes: ["silverlight", "fileSystem", "localStorage"],
    options: optionsObj,
    complete: function(byStorageTypeStoredKeysObj, byStorageTypeErrorObj){}
});
Oh, and for the sake of complete transparency, BakedGoods is maintained by this guy right here :) .
Use PouchDB.
PouchDB is an open-source JavaScript database inspired by Apache CouchDB that is designed to run well within the browser.
It helps you build applications that work online as well as offline.
Basically, it stores the last-fetched data in an in-browser database (using IndexedDB or WebSQL under the hood) and then syncs again when the network becomes active.
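A minimal sketch of that pattern (the database name and remote URL are placeholders):

    import PouchDB from "pouchdb";

    async function demo() {
        // Local, in-browser database (IndexedDB or WebSQL under the hood).
        const db = new PouchDB("notes");

        // Writes land locally first, so the app keeps working offline.
        await db.put({ _id: "note-1", text: "hello", updatedAt: Date.now() });

        // Sync continuously with a remote CouchDB once the network is available.
        db.sync("https://example.com/db/notes", { live: true, retry: true });
    }

    demo();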
I came across a JavaScript database, http://www.taffydb.com/. I'm still trying it out myself; hope this helps.
If you are looking for a NoSQL-style db on the client you can check out http://www.forerunnerdb.com. It supports the same query language as MongoDB and has a data-binding module if you want your DOM to reflect changes to your data automatically.
It is also open source, is constantly being updated with new features and the community around it is growing rapidly.
Disclaimer, I'm the lead developer of the project.
If you feel like you need it then use it for the clients that support it and implement a server-side fallback for clients that don't.
An alternative is to use Flash and Local Shared Objects, which can store a lot more information than a cookie, work in all browsers with Flash (which is pretty much all browsers), and store typed data. You don't have to do the whole app in Flash; you can just write a tiny utility to read/write LSO data. This can be done as a straight ActionScript project without any framework and will give you a tiny 5-15 kB SWF.
There are two APIs you'll primarily need: SharedObject.getLocal(), to get access to an LSO and read/write its data, and ExternalInterface.addCallback(), which you can use to register an AS3 method as a callback that JavaScript can call to reach your read/write LSO methods.
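On the JavaScript side, calling into the SWF then looks roughly like this (the element id and method names below are made up for the sketch):

    // Assuming the AS3 utility registered callbacks with:
    //   ExternalInterface.addCallback("readLso", readHandler);
    //   ExternalInterface.addCallback("writeLso", writeHandler);
    // the SWF's <object>/<embed> element exposes them to the page.
    const swf = document.getElementById("lsoBridge") as any; // hypothetical id

    // Read previously stored LSO data (shape depends on your utility).
    const prefs = swf.readLso("userPrefs");

    // Write a value back into the Local Shared Object.
    swf.writeLso("userPrefs", JSON.stringify({ theme: "dark" }));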
SharedObject
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/flash/net/SharedObject.html?filter_flex=4.1&filter_flashplayer=10.1&filter_air=2
ExternalInterface
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/flash/external/ExternalInterface.html
These links are to Flex references, but for this you can just create an ActionScript project with no need for the Flex framework, and therefore a greatly reduced SWF size. There are a number of good IDEs, including free open-source ones like FlashDevelop.
FlashDevelop
http://www.flashdevelop.org/
Check out HTML5 Local Storage:
http://people.w3.org/mike/localstorage.html
You may also find this helpful:
HTML5 database storage (SQL lite) - few questions
When Windows 98 first came out, there were a lot of us still stuck on MS-DOS 6.22. Naturally, there were really cool features on the new operating system that wouldn't run in MS-DOS.
There comes a time when some things must be left behind to make room for innovation. If your application is really innovative and will offer cool new functionality that uses the latest and greatest technologies, then some older browsers will naturally need to be left behind.
The advantage that you have is that, unlike upgrading an operating system, upgrading from IE7 to Chrome 8 or Firefox 3.6 is a more reachable goal for the average user of your app, especially if you provide a link and upgrade instructions.
I would try Mozilla's localForage. https://localforage.github.io/localForage/
