How can I set up a database that is able to import large amounts of data automatically on a daily basis? [closed]

I am new to databases and programming and am now supposed to create a database that can store large amounts of data. The critical problem for me is that I need to update the database every day, adding 150 sets of data to 60 different tables. The datasets all come in different formats, though (.csv, .row, .sta, ...).
I would like to be able to create a GUI that can automatically import the datasets every day and update the tables. So far I have found a lot of info on exporting data from databases, not so much on importing data though.
Does someone have a pointer?

You haven't given a reasonable definition of "best"; what are your criteria? Is cost a major concern (I'm guessing not if you're considering Matlab, but maybe)? Is the UI for you only, or are you turning end users loose on it? If end users, how smart/reliable are they? When you say manual import, do you mean a mostly automatic process that you occasionally manually initiate, or will it have to ask a lot of questions and deal with many different combinations?
I import lots of data daily and from different sources. Sometimes I have to manually re-launch a process because a user has made a change and needs to see it reflected immediately, but my set of defined sources doesn't change often. I've had good luck using the SSIS (SQL Server Integration Services) tool in Microsoft's SQL Server, and it can certainly handle large amounts of data.
The basic functionality is that you write a "package" that defines what your source is and how it's configured (e.g. if you are importing from a text file: the name and path, whether it's fixed-width or delimited, the delimiter or width of each field, which fields to skip, how many rows to skip, etc.) and where to put the data (DB name and table, field mappings, etc.). I then set the schedule (mine all run overnight) in SQL Server Agent, and it is all automatic from there unless something changes, in which case you edit your package to account for the changes.
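SSIS packages are built in a graphical designer rather than in code, but for a rough idea of the work such a package automates, here is a minimal Python sketch of the same nightly CSV-to-table load (file name, delimiter, table name and connection string are all hypothetical placeholders):

    # Minimal sketch of a nightly delimited-file import: read the source
    # file with an explicit delimiter and rows to skip, keep only the
    # mapped fields, and append into the destination table.
    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical destination connection string.
    engine = create_engine(
        "mssql+pyodbc://user:password@myserver/MyDb"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )

    df = pd.read_csv(
        "C:/imports/daily_prices.csv",          # source file name and path
        sep=";",                                # delimiter
        skiprows=1,                             # rows to skip before the data
        usecols=["date", "symbol", "price"],    # fields to keep (skip the rest)
    )

    # Destination: table name plus a 1:1 column mapping.
    df.to_sql("DailyPrices", engine, if_exists="append", index=False)

Scheduling a script like this nightly (cron or Windows Task Scheduler) plays the role that SQL Server Agent plays for SSIS packages.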
I can also manually start any package at any time, with very little effort.
And the range of import sources is pretty impressive. I pull in data from CSV files, Lotus Notes, and DB2, and it's all automatic every night. It's also a fairly graphical "builder", which is frustrating for a hardcore coder, but if you're new to programming it's probably easier than a more code- or script-oriented approach.
Good luck, and welcome to the dark side. We have cookies.

Related

Data masking for data in AWS RDS [closed]

I have an AWS RDS (AuroraDB) instance and I want to mask the data in the DB. Does Amazon provide any service for data masking?
I have seen RDS encryption, but I am looking for data masking because the database contains sensitive data. So I want to know whether they provide any service for data masking, or whether there is another tool that can be used to mask the data so that I can load it into the DB manually.
A list of tools that can be used for data masking would be most appreciated for my case. I need to mask the data for testing, as the original DB contains sensitive information like PII (Personally Identifiable Information). I also have to transfer the data to my co-workers, so I consider data masking an important factor.
Thanks.
This is a fantastic question, and I think your pro-active approach to securing the most valuable asset of your business is something that a lot of people should heed, especially if you're sharing the data with your co-workers. Letting people see only what they need to see is an undeniably good way to reduce your attack surface. Standard cyber security methods are no longer enough, IMO, as demonstrated by numerous attacks and by people losing laptops/USBs with sensitive data on them. We are just humans, after all. With the GDPR coming into force in May next year, any company with customers in the EU will have to demonstrate privacy by design, and anonymisation techniques such as masking have been cited as a way to show this.
NOTE: I have a vested interest in this answer because I am working on the kind of service you're talking about.
We've found that your masking method will depend on your exact use case, the size of the data set, and its contents. If your data set has minimal fields and you know where the PII is, you can run standard queries to replace sensitive values, e.g. John -> XXXX. If you want to maintain some human readability, there are libraries such as Python's Faker that generate random locale-based PII you can replace your sensitive values with. (PHP Faker, Perl Faker and Ruby Faker also exist.)
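For example, a minimal sketch of that kind of value replacement with Faker, run against a copy of the database (table and column names are hypothetical; sqlite3 is used only to keep the example self-contained):

    # Replace real names, emails and addresses with fake, locale-aware values.
    import sqlite3
    from faker import Faker

    fake = Faker()                    # e.g. Faker("de_DE") for German-style data
    conn = sqlite3.connect("masked_copy.db")

    for (row_id,) in conn.execute("SELECT id FROM customers").fetchall():
        conn.execute(
            "UPDATE customers SET name = ?, email = ?, address = ? WHERE id = ?",
            (fake.name(), fake.email(), fake.address(), row_id),
        )

    conn.commit()
    conn.close()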
DISCLAIMER: Straightforward masking doesn't guarantee total privacy. Think of someone identifying individuals from a masked Netflix data set by cross-referencing it with time-stamped IMDb data, or Guardian reporters identifying a judge's porn preferences from masked ISP data.
Masking does get tedious as your data set grows in fields/tables, and you may want to set up different levels of access for different co-workers, e.g. data science gets lightly anonymised data while marketing gets access to heavily anonymised data. PII in free-text fields is annoying, and understanding what data is available in the world that attackers could use for cross-referencing is generally a big task.
The service I'm working on aims to alleviate all of these issues by automating the process with NLP techniques and a good understanding of the mathematics of anonymisation. We're bundling this up into a web service and are keen to launch on the AWS Marketplace. I would love to hear more about your use case, and if you want early access, we're in private beta at the moment, so let me know.
If you are exporting or importing data using CSV or JSON files (e.g. to share with your co-workers), then you could use FileMasker. It can be run as an AWS Lambda function reading/writing CSV/JSON files on S3.
It's still in development but if you would like to try a beta now then contact me.
Disclaimer: I work for DataVeil, the developer of FileMasker.

Strategies to building a database of 30m images [closed]

Summary
I am facing the task of building a searchable database of about 30 million images (of different sizes) associated with their metadata. I have no real experience with databases so far.
Requirements
There will be only a few users; the database will be almost read-only (if anything gets written, it will be by a controlled automatic process); and downtime for maintenance should be no big issue. We will probably perform more or less complex queries on the metadata.
My Thoughts
My current idea is to save the images in a folder structure and build a relational database on the side that contains the metadata as well as links to the images themselves. I have read about document-based databases. I am sure they are reliable, but presumably the images would then only be accessible through a database query; is that true? In that case I am worried that future users of the data might be faced with the problem of learning how to query the database before actually getting things done.
Question
What database could/should I use?
Storing big fields outside the "lookup table" when they are not used in queries is recommended for certain database systems, so it does not seem unusual to store the 30M images in the file system.
As to "which database", that depends on the frameworks you intend to work with, how complicated your queries usually are, and what resources you have available.
I had some complicated queries run for minutes on MySQL that were done in seconds on PostgreSQL and vice versa. Didn't do the tests with SQL Server, which is the third RDBMS that I have readily available.
One thing I can tell you: Whatever you can do in the DB, do it in the DB. You won't even nearly get the same performance if you pull all the data from the database and then do the matching in the framework code.
A second thing I can tell you: Indexes, indexes, indexes!
It doesn't sound like the data is very relational, so a non-relational DBMS like MongoDB might be the way to go. With any DBMS you will have to use queries to get information from it. However, if you're worried about future users, you could put a software layer between the user and the DB that makes querying easier.
Storing images in the filesystem and metadata in the DB is a much better idea than storing large BLOBs in the DB (IMHO). I would also note that filesystem performance will be better if you have many folders and subfolders rather than 30M images in one big folder (citation needed).
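As a rough sketch of that layout (hypothetical metadata columns; SQLite used only to keep the example self-contained): hash each image, write the file into a sharded subfolder so that no single directory ends up holding all 30M files, and keep only the path plus the queryable, indexed metadata in the database.

    # Store the image bytes on disk in a sharded folder structure and keep
    # only the path and metadata in an indexed table.
    import hashlib
    import sqlite3
    from pathlib import Path

    conn = sqlite3.connect("images.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS images (
            id       INTEGER PRIMARY KEY,
            sha1     TEXT UNIQUE,
            path     TEXT NOT NULL,
            width    INTEGER,
            height   INTEGER,
            camera   TEXT,
            taken_at TEXT
        )
    """)
    # Index the metadata columns that the queries actually filter on.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_images_camera ON images(camera)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_images_taken ON images(taken_at)")

    def store_image(data, width, height, camera, taken_at):
        sha1 = hashlib.sha1(data).hexdigest()
        # Shard by the first two hex pairs: images/ab/cd/abcd....jpg
        target = Path("images") / sha1[:2] / sha1[2:4] / (sha1 + ".jpg")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        conn.execute(
            "INSERT OR IGNORE INTO images "
            "(sha1, path, width, height, camera, taken_at) VALUES (?, ?, ?, ?, ?, ?)",
            (sha1, str(target), width, height, camera, taken_at),
        )
        conn.commit()
        return str(target)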

Flexible way/database to keep a very large amount of data

I have a very large SQLite database (~10MB). Currently I'm using this database with a custom PHP interface at localhost, but I'm looking for a different storage and a different interface as well. The data inside does not change too often, so reading data has the priority.
I'm not very happy with the current setup, because the database schema is "custom" and I don't want to have to change it later (maybe I will find a "better way"). I think it's not standard, and that may bother me in the future.
My question is: what is a good way to store data that is very flexible? I'm looking for a standard data storage that I will fill just one time. Is SQLite still the best choice?
What is inside this database:
I'm going to describe the structure in general terms.
Table A is full of questions;
Table B is full of answers. Each question has different answers;
The tables in C are full of data regarding answer X to question Y. Each answer has a lot of different metadata available.
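In relational terms this boils down to three small tables; a minimal sketch in SQLite (hypothetical column names, not the actual schema) could be:

    # The A/B/C structure as a normalised SQLite schema.
    import sqlite3

    conn = sqlite3.connect("qa.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS questions (
            id   INTEGER PRIMARY KEY,
            text TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS answers (
            id          INTEGER PRIMARY KEY,
            question_id INTEGER NOT NULL REFERENCES questions(id),
            text        TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS answer_metadata (
            answer_id INTEGER NOT NULL REFERENCES answers(id),
            key       TEXT NOT NULL,
            value     TEXT
        );
        CREATE INDEX IF NOT EXISTS idx_answers_question ON answers(question_id);
        CREATE INDEX IF NOT EXISTS idx_metadata_answer  ON answer_metadata(answer_id);
    """)

    # Example join: all answers to question 1 together with their metadata.
    rows = conn.execute("""
        SELECT a.text, m.key, m.value
        FROM answers a
        LEFT JOIN answer_metadata m ON m.answer_id = a.id
        WHERE a.question_id = ?
    """, (1,)).fetchall()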
What is extremely important:
fast queries;
portability through *NIX without having to install a lot of software;
not too complex queries, but still I will need some joins;
don't have to change the schema every two months.
What is not important:
frequent updates (I update the database once a year, more or less).
I thought of going with files and folders, but I'm not sure it's the best thing.
Your problem can be modelled well as a graph problem. What about a graph-based database like OrientDB or Neo4j? In a portion of the graph there can be two nodes: one for the question, say A, and one for the answer, say B, and the edge connecting them can carry your table C data or metadata. Each answer node can have multiple incoming edges from several question nodes, and there can be multiple answers to a single question. You don't require any schema; the databases I mentioned are NoSQL databases. Search would be extremely fast, as they can even be integrated with the Lucene search engine for full-text searches, and I believe that since you have Q&A data you will require full-text search.
Here is an example:
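A minimal sketch of this model with the official neo4j Python driver (node labels, properties and connection details are hypothetical):

    # One question node, one answer node, and an edge that carries the
    # metadata which would otherwise live in the C tables.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        session.run(
            """
            MERGE (q:Question {text: $question})
            MERGE (a:Answer   {text: $answer})
            MERGE (q)-[r:ANSWERED_BY]->(a)
            SET r += $metadata
            """,
            question="What is a graph database?",
            answer="A database whose data model is nodes and edges.",
            metadata={"score": 42, "source": "table C"},
        )

        # Simple lookup: every answer to questions mentioning "graph".
        result = session.run(
            """
            MATCH (q:Question)-[r:ANSWERED_BY]->(a:Answer)
            WHERE toLower(q.text) CONTAINS 'graph'
            RETURN q.text AS question, a.text AS answer, r.score AS score
            """
        )
        for record in result:
            print(record["question"], "->", record["answer"])

    driver.close()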

What's the best tool for data integration [closed]

I'm looking for the best tool for data integration. I need the following features:
Customized loading/matching and clearing of data from different sources (including MSSQL Server, PostgreSQL, web services, Excel, text files in various formats). The receiver of the data is MSSQL Server 2008.
Ability to configure data conversion rules externally (e.g. config files or visual tools)
Support for Unicode, logging, multithreading and fault tolerance
Scalability (very important)
Ability to process large volumes of data (more than 100 MB per day)
I am looking at SQL Server Integration Services 2008, but I'm not sure that it fits these criteria. What do you think?
It looks like Integration Services (SSIS) should handle your requirements. It should definitely be high on your list because it has good integration with SQL Server and is extremely cost effective compared to most alternatives.
As far as scalability is concerned, your data sounds very small (100MB per day is not much these days) so it's well within the capabilities of SSIS, even for complex data flows. For fault tolerance, SSIS has restartability features out of the box but if high availability is important to you then you may want to consider clustering / mirroring.
I only know SSIS from first-hand experience, so I can't say how it compares to other solutions. But I'd say it's a good solution for all the points you ask about.
The only thing that is a bit tricky is this requirement: "Ability to configure data conversion rules externally (e.g. config files or visual tools)". I'm not sure I get this right.
You can store configuration parameters for SSIS in external files or even in a SQL table, but you would still need to specify the actual rules inside the package, unless you write your own script component (inside of which you can, of course, interpret the formatting rules you store externally).
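Script components themselves are written in C# or VB.NET rather than Python, but the general pattern (rules kept in an external config file and interpreted at load time) looks roughly like this sketch (rule names, field names and file paths are hypothetical):

    # Conversion rules live in an external JSON file and are interpreted at
    # load time, so changing a rule does not require editing the load logic.
    import csv
    import json

    with open("conversion_rules.json", encoding="utf-8") as f:
        RULES = json.load(f)
    # e.g. {"amount": "decimal_comma", "name": "trim_upper"}

    def convert(field, value):
        rule = RULES.get(field)
        if rule == "decimal_comma":        # "1.234,56" -> "1234.56"
            return value.replace(".", "").replace(",", ".")
        if rule == "trim_upper":
            return value.strip().upper()
        return value                       # no rule: pass through unchanged

    with open("source.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=";"):
            cleaned = {field: convert(field, value) for field, value in row.items()}
            # ... hand `cleaned` to the destination load step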

Is there a database with git-like qualities? [closed]

I'm looking for a database where multiple users can contribute and commit new data; other users can then pull that data into their own database repository, all in a git-like manner. A transcriptional database, if you like; does such a thing exist?
My current thinking is to dump the database to a single file as SQL, but that could well get unwieldy once it is of any size. Another option is to dump the database and use the filesystem, but again it gets unwieldy once of any size.
There's Irmin: https://github.com/mirage/irmin
Currently it's only offered as an OCaml API, but there are future plans for a GraphQL API and a Cap'n Proto one.
Despite the complex API and the still-scarce documentation, it allows you to plug in any backend (in-memory, Unix filesystem, Git in-memory and Git on-disk). Therefore, it runs even in unikernels and browsers.
It also offers a bidirectional model where changes in the local Git repository are reflected in the application state and vice versa. With the complex API, you can operate at any Git level:
Append-only Blob storage.
Transactional/compound Tree layer.
Commit layer featuring chain of changes and metadata.
Branch/Ref/Tag layer (local-only, but remotes are also offered) for mutability.
In the documentation, the immutable store usually refers to the blobs + trees + commits.
Due to the content-addressable storage inherited from Git, Irmin allows deduplication and thus reduced memory consumption. Some functional persistent data structures fit perfectly in this database, and the 3-way merge is a novel approach to handling merge conflicts in a CRDT style.
Answer from: How can I put a database under version control?
I have been looking for the same feature for Postgres (or SQL databases in general) for a while, but I found no tools to be suitable (simple and intuitive) enough. This is probably due to the binary nature of how data is stored. Klonio sounds ideal but looks dead. Noms DB looks interesting (and alive). Also take a look at Irmin (OCaml-based with Git-properties).
Though this doesn't answer the question in the sense that it doesn't work with Postgres, check out the Flur.ee database. It has a "time-travel" feature that allows you to query the data from an arbitrary point in time. I'm guessing it should be able to work with a "branching" model.
This database has recently been developed for blockchain purposes. Due to the nature of blockchains, the data needs to be recorded in increments, which is exactly how Git works. They are targeting an open-source release in Q2 2019.
Because each Fluree database is a blockchain, it stores the entire history of every transaction performed. This is part of how a blockchain ensures that information is immutable and secure.
It's not SQL, but CouchDB supports replicating the database and pushing/pulling changes between users in a way similar to what you describe.
Some more information in the chapter on replication in the O'Reilly CouchDB book.
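For a sense of what that looks like in practice, CouchDB exposes replication over plain HTTP: pulling a colleague's database into your own is a single POST to the _replicate endpoint (hostnames, database names and credentials below are hypothetical).

    # Pull changes from a remote CouchDB database into a local one.
    import requests

    resp = requests.post(
        "http://localhost:5984/_replicate",
        json={
            "source": "http://alice.example.com:5984/projectdb",
            "target": "projectdb",
            "create_target": True,   # create the local DB if it doesn't exist yet
            "continuous": False,     # one-off pull; set True to keep syncing
        },
        auth=("admin", "password"),
    )
    print(resp.json())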
