Problem
Every day we receive a new set of data files from our back-office application. This application is not able to produce an incremental changeset, so all it can do is dump everything to a large file.
Currently, every morning we drop our old MySQL tables and load the data into our database.
One of the problems with this approach is that we are unable to act on specific changes in the data. We also use CQRS and would benefit considerably from an incremental changeset.
File format is currently CSV
Data size per file is up to 10GB
Number of rows per file is up to 40 million
Approximately 30 data files
On average, less than 1% of the rows change each day
Most files either have no primary key or a combined primary key. For many, the full row is the only thing that makes them unique.
The order of data is not fixed. Rows may switch positions
Desired situation
When we receive the new data, we calculate the difference and push a message into Kafka for each changed (if a row identifier exists), added or removed row.
Technology
We use AWS and are able to use all technologies AWS offers
We are not limited to a certain amount of hardware. We can just start up some new servers in AWS
Cost is only a very limited factor. We have quite a large budget and the ability to have an incremental set offers us quite a lot of value.
We have a running Kubernetes cluster
Question
So the main question is: what would be the best way to compare these two large files and create an incremental set? We need it to be fast, preferably within the hour or close to that.
Are there database types that support this natively, or are there technologies that can do this for us?
"...The order of data is not fixed. Rows may switch positions..." That is the one that makes it hard. If the rows did not change a git diff or text file comparison tool would work.
Spitballing here, but (a code sketch follows after these steps):
For each row, create a SHA hash
Use the hash as a unique ID
Store each unique hash and its associated data in a DB table
After processing the file, dump the table into a text file (CSV/SQL/etc.)
Commit the file changes to source control
When you receive a new data set, check whether each row's hash exists
If no: append the hash (and its row) to the table
If yes: ignore it
Dump the table into a text file (CSV/SQL/etc.)
git diff the commits to see the change sets
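Below is a minimal sketch of that idea in Python, under a few assumptions that are not in the original post: both dumps fit on local disk, the kafka-python library is available, and the topic name row-changes plus the file names are hypothetical. Rows whose hash appears only in the new file are treated as added, rows whose hash appears only in the old file as removed.

import csv
import hashlib
import json

from kafka import KafkaProducer  # pip install kafka-python (assumes a reachable broker)

def row_hashes(path):
    """Map SHA-256 digest -> raw row for every line of a CSV dump."""
    hashes = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            digest = hashlib.sha256("\x1f".join(row).encode("utf-8")).hexdigest()
            hashes[digest] = row
    return hashes

old = row_hashes("dump_yesterday.csv")  # hypothetical file names
new = row_hashes("dump_today.csv")

added = [new[h] for h in new.keys() - old.keys()]
removed = [old[h] for h in old.keys() - new.keys()]

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for row in added:
    producer.send("row-changes", {"op": "added", "row": row})
for row in removed:
    producer.send("row-changes", {"op": "removed", "row": row})
producer.flush()

For 40 million rows per file you would keep only the hashes (or hash to byte offset) in memory, or shard the work by hash prefix across several workers; without a row identifier, a modified row simply shows up as one removal plus one addition.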
Might be able to do this with AWS Glue...
Bonus:
To make it even easier, create a location the back-office app can upload the file to, and a cron job to process the report at a given time.
This process is a typical ETL (Extract-Transform-Load) task. You are extracting data from one source/format, changing it, and loading/inserting it into a different source/format.
Let me know if any of this was helpful.
So, I am trying to create a database that can store thousands of malware binary files, with sizes ranging anywhere from a few KB to 50 MB. I am currently testing with Cassandra using blobs, but of course with files that big Cassandra isn't handling it that well. Does anyone have any good ideas, maybe a better database, or maybe a better way to go about using Cassandra? I am relatively new to databases, so please be as detailed as possible.
Thank You
If you have your heart set on Cassandra, you would want to store the blob files outside Cassandra, as the large file sizes will cause problems with your compaction and repairs. Ideally you would store the blob files on a network store somewhere outside Cassandra. That said, apparently Walmart did do it previously.
Cassandra setup:
CREATE TABLE IF NOT EXISTS malware_table (
    malware_hash varchar,
    filepath varchar,
    date_found timestamp,
    object blob,
    -- other columns as needed
    PRIMARY KEY (malware_hash, filepath)
);
What we're doing here is creating a composite key based on the malware hash, so you can do SELECT * FROM malware_table WHERE malware_hash = ?. If there is a collision, you have two files to look at. Additionally, this lookup will be super fast as it's a key-value lookup. Keep in mind that with Cassandra you can only query by your primary key.
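For illustration only, here is a sketch of that lookup from Python with the DataStax driver; the contact point, the keyspace name malware_ks and the example hash value are my assumptions, not part of the schema above.

from cassandra.cluster import Cluster  # pip install cassandra-driver

def find_by_hash(session, malware_hash):
    # Key-value style lookup on the partition key; a hash collision simply
    # returns more than one row (one per clustering column 'filepath').
    return session.execute(
        "SELECT malware_hash, filepath, date_found "
        "FROM malware_table WHERE malware_hash = %s",
        [malware_hash],
    )

cluster = Cluster(["127.0.0.1"])         # assumed contact point
session = cluster.connect("malware_ks")  # assumed keyspace name
for row in find_by_hash(session, "ab12"):  # hypothetical digest value
    print(row.filepath, row.date_found)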
As it's not likely that you're going to be updating files in the past, you're going to want to run size-tiered compaction for faster lookups in the long run. This will be more expensive on hard-drive space, since you'll need to keep 50% of your hard drives free at any given time.
Alternative solution:
I would probably store this in S3/GCS or some other network store. Create a folder (key prefix) from the hash of the file and then store the file inside it. Use the API to determine whether the file is there. If this is something being hit thousands of times a second, you would want to create a caching layer in front of it to reduce lookup times. The cost of an object store is going to be VASTLY cheaper than a Cassandra cluster and will likely scale better.
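As a rough sketch of that alternative with boto3, assuming an existing bucket named malware-store (the bucket name and the use of SHA-256 as the object key are my assumptions): the content hash doubles as the key, so an existence check is a single HEAD request.

import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "malware-store"  # assumed bucket name

def store_sample(path):
    # Upload a file keyed by its SHA-256 digest; skip the upload if the key already exists.
    with open(path, "rb") as f:
        data = f.read()
    key = hashlib.sha256(data).hexdigest()
    try:
        s3.head_object(Bucket=BUCKET, Key=key)  # cheap existence check
        return key                              # already stored
    except ClientError as err:
        if err.response["Error"]["Code"] != "404":
            raise
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
    return key

A caching layer (even an in-memory set of known hashes) in front of the head_object call keeps lookup latency down if this is hit thousands of times a second.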
I just want to ask about transaction logs in SQL Server. We can take a backup of those log files in .bak format at any location on our system.
The problem is extracting the SQL statement/query from a transaction log backup file. We can do this using the fn_dump_dblog function, but what we want is to extract the query, or the data the transaction acted on, from the logs.
I want to do it manually, the same way the "Apex" tool does for SQL Server, and I don't want to use any third-party tool.
Right now I am able to extract the table name and operation type from the logs, but I am still searching for a way to extract the SQL statement.
Decoding the contents of the transaction log is exceptionally tricky - there is a reason Apex gets to charge money for the tool that does it - it's a lot of work to get it right.
The transaction log itself is a record of the changes that occurred, not a record of what the query was that was executed to make the change. In your question you mention extracting the query - that is not possible, only the data change can be extracted.
For simple insert / delete transactions it is possible to decode them, but the complexity of just doing that is too large to reproduce here in detail. The simpler scenario of decoding the live log using fn_dblog is analogous, and the complexity of that should give you an idea of how difficult it is. You can extract the operation type plus the hex data in the RowLog Contents; depending on the type of operation, the RowLog Contents can be 'relatively' simple to decode, since it is the same format as a row at a binary / hex level on the page.
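To make that starting point concrete, here is a hedged sketch that pulls the operation type and the raw row image from the current log via the undocumented fn_dblog, wrapped in Python/pyodbc purely for illustration; the connection string is an assumption, and decoding the binary payload is the hard part the linked articles below walk through.

import pyodbc

# Assumed connection details; fn_dblog is undocumented and reads the active log,
# so treat this as exploration, not a supported API.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes"
)
cursor = conn.cursor()
cursor.execute(
    """
    SELECT [Current LSN], Operation, AllocUnitName, [RowLog Contents 0]
    FROM fn_dblog(NULL, NULL)
    WHERE Operation IN ('LOP_INSERT_ROWS', 'LOP_DELETE_ROWS')
    """
)
for lsn, op, alloc_unit, payload in cursor.fetchall():
    # 'payload' is the binary row image; decoding it requires knowing the table's
    # physical row format, which is exactly what makes this so much work.
    print(lsn, op, alloc_unit, payload[:16] if payload else None)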
I am loath to use a link as an example / answer, but the length of the explanation for even a simple scenario is non-trivial. My only redemption for the link answer is that it is my article, so that's also the disclaimer. The length and complexity really make this question un-answerable with a positive answer!
https://sqlfascination.com/2010/02/03/how-do-you-decode-a-simple-entry-in-the-transaction-log-part-1/
https://sqlfascination.com/2010/02/05/how-do-you-decode-a-simple-entry-in-the-transaction-log-part-2/
There have been further articles published which built on this and tried to automate the logic in T-SQL itself.
https://raresql.com/2012/02/01/how-to-recover-modified-records-from-sql-server-part-1/
The time and effort you would spend attempting to write your own decoder is sufficiently high that, compared to the cost of a license, I would never recommend writing your own software to do this unless you planned on selling it.
You may also wish to consider alternative tracing mechanisms placed in line with the running application code, rather than something you try to reverse-engineer out of a backup.
We have a service that currently runs on top of a MySQL database and uses JBoss to run the application. The rate of database growth is accelerating, and I am looking to change the setup to improve scaling. The issue is not a large number of rows nor (yet) a particularly high volume of queries, but rather the large number of BLOBs stored in the DB. In particular, the time it takes to create or restore a backup (we use mysqldump and Percona XtraBackup) is a concern, as is the fact that we will need to scale horizontally to keep expanding the disk space in the future. At the moment the DB size is around 500 GB.
The kind of arrangement that I figure would work well for our future needs is a hybrid setup that uses both MySQL and some key-value database. The latter would only store the BLOBs. The metadata, as well as the data for user management and the business logic of the application, would remain in the MySQL DB and benefit from structured tables and full consistency. The application itself would handle consistency between the databases.
The question is which database to use? There are lots of NoSQL databases to choose from. Here are some points on what qualities I am looking for:
Distributed over multiple nodes, which are flexible to add or remove.
Redundancy of storage, with the database automatically making sure each value object is stored on at least two different nodes.
Value objects' size could range from a few dozen bytes to around 100MB.
The database is accessed from a Java EJB application on top of JBoss, as well as from a program written in C++ that processes the data in the DB. Some sort of connector for each would be needed.
No need for structure for the data. A single string or even just a large integer would suffice for the key, pure byte array for the value.
No updates for the value objects are needed, only inserts and deletes. If a particular object is made obsolete by a new object that fulfills the same role, the old object is deleted and a new object with a new key is inserted.
Having looked around a bit, Riak sounds good except for its problems with storing large value objects. Which database would you recommend?
I am new to databases and programming and am now supposed to create a database that can store large amounts of data. The critical problem for me is that I need to update the database every day, adding 150 sets of data to 60 different tables. The datasets all come in different formats though (.csv, .row, .sta...).
I would like to be able to create a GUI that can automatically import the datasets every day and update the tables. So far I have found a lot of info on exporting data from databases, but not so much on importing data.
Does someone have a pointer?
You haven't given a reasonable definition of "best"; what are your criteria? Is cost a major concern (I'm guessing not if you're considering Matlab, but maybe)? Is the UI for you only, or are you turning end users loose on it? If end users, how smart/reliable are they? When you say manual import, do you mean a mostly automatic process that you occasionally manually initiate, or will it have to ask a lot of questions and handle many different combinations?
I import lots of data daily and from different sources; I sometimes have to manually re-launch a process because a user has made a change and needs to see it reflected immediately, but my set of defined sources doesn't change often. I've had good luck using the SSIS (SQL Server Integration Services) tool in Microsoft's SQL Server, and it can certainly handle large amounts of data.
The basic functionality is that you write a "package" that contains definitions of what your source is, how it's configured (i.e. if you are importing from a text file, tell it the name and path, whether it is fixed-width or delimited, what the delimiter or width of each field is, which fields to skip, how many rows to skip, etc.), and where to put it (DB name and table, field mapping, etc.). I then set the schedule (mine all run overnight) in SQL Agent, and it is all automatic from there unless something changes, in which case you edit your package to account for the changes.
I can also manually start any package at any time with very little effort.
And the range of imports sources is pretty impressive. I pull in data from CSV files, Lotus Notes, and DB2, and it's all automatic every night. It's also a fairly graphical "builder", which is frustrating for a hardcore coder, but if you're new to programming it's probably easier than a more code or script oriented approach.
Good luck, and welcome to the dark side. We have cookies.
I have around 10 tables containing millions of rows. Now I want to archive 40% of the data due to size and performance problems.
What would be the best way to archive the old data while keeping the web application running? And what if, in the near future, I need to show the old data alongside the existing data?
Thanks in advance.
There is no single solution for every case. It depends a lot on your data structure and application requirements. The most common cases seem to be as follows:
If your application can't be redesigned and instant access is required to all your data, you need a more powerful hardware/software solution.
If your application can't be redesigned but some of your data can be considered obsolete because it is requested relatively rarely, you can split the data and configure two applications to access different data sets.
If your application can't be redesigned but some of your data can be considered insensitive and can be minimized (consolidated, packed, etc.), you can perform some data transformation while keeping the full data in another place for special requests.
If it's possible to redesign your application, there are many ways to solve the problem. In general you will implement some kind of archive subsystem, and in general it's a complex problem, especially if not only your data but also your data structure changes over time.
If it's possible to redesign your application, you can optimize your data structure using new supporting tables, indexes and other database objects and algorithms.
Create an archive database and, if possible, maintain it on a separate archive server; this data won't be needed much but still needs to be kept for future purposes, so moving it reduces load and space on the main server.
Move all of the table's old data to that location (a sketch of the move step follows after this list). Later you can retrieve it back in a number of ways:
Changing the path of the application
or updating the live table from the archive table
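As a hedged illustration of the move step, here is a minimal Python/PyMySQL sketch; the table name orders, the created_at cutoff column and the archive_db schema are assumptions about your data, it presumes the archive schema lives on the same MySQL server (with a separate archive server you would dump and reload instead), and in production you would batch the copy/delete to avoid long locks.

import pymysql  # pip install pymysql

CUTOFF = "2022-01-01"  # hypothetical archive boundary

conn = pymysql.connect(host="localhost", user="app", password="secret", database="live_db")
try:
    with conn.cursor() as cur:
        # Copy old rows into the archive schema (assumed to contain an identical table).
        cur.execute(
            "INSERT INTO archive_db.orders SELECT * FROM live_db.orders WHERE created_at < %s",
            (CUTOFF,),
        )
        # Then remove them from the live table.
        cur.execute("DELETE FROM live_db.orders WHERE created_at < %s", (CUTOFF,))
    conn.commit()
finally:
    conn.close()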