My scenario:
There's a Flask web app where a user can log in and upload a CSV file.
I need to store the data from the CSV and be able to recall it.
Each user can have a different number of columns and rows in their CSV.
I was thinking I could create a unique table per user and recreate it each time the user uploads a new dataset.
However, maybe it makes more sense to just dump it into a NoSQL database?
Looking for advice on the best approach for my scenario.
Since you are only storing and recalling CSV data, I presume you do not change anything within the CSV file. In that case you do not really need a database; in my opinion the simplest solution is to store the CSV file in a folder (say each user has their own folder on the server).
Even if you have multiple CSV files for each user, you can store them in the same folder (with different names; you can come up with your own naming convention).
When you recall the data, you can scan that user's folder and give them the list of files (if there is more than one), so they can pick whichever CSV they want to get back from the server.
Let's talk about databases: a SQL database needs a fixed structure (with some exceptions), so you would need a consistent dataset in order to add it to a SQL DB. In your case it would be easier to use a NoSQL database, since with NoSQL you don't care about the structure; it's just arbitrary data that you store.
But again ... if you don't do anything with the data other than storing it, just store it as it is: CSV files.
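A minimal sketch of the folder-per-user idea using Flask's built-in upload handling; the upload root, route names and the hard-coded user id are placeholders for illustration, not anything from the question:

    import os
    from flask import Flask, request
    from werkzeug.utils import secure_filename

    app = Flask(__name__)
    UPLOAD_ROOT = "/var/data/csv_uploads"   # assumed storage location

    @app.route("/upload", methods=["POST"])
    def upload_csv():
        user_id = "demo_user"               # stand-in for the real logged-in user's id
        user_dir = os.path.join(UPLOAD_ROOT, user_id)
        os.makedirs(user_dir, exist_ok=True)
        # Save the uploaded CSV into this user's own folder.
        f = request.files["file"]
        f.save(os.path.join(user_dir, secure_filename(f.filename)))
        return "ok"

    @app.route("/files")
    def list_csvs():
        user_id = "demo_user"
        user_dir = os.path.join(UPLOAD_ROOT, user_id)
        # List whatever CSVs this user has uploaded so they can pick one to recall.
        files = sorted(os.listdir(user_dir)) if os.path.isdir(user_dir) else []
        return {"files": files}

Re-uploading a dataset then just means writing another file (or overwriting an existing one) in that user's folder.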
If you have more info, we can brainstorm further.
Regards,
Mike
Problem
Every day we receive a new set of data files from our back-office application. This application is not able to produce an incremental changeset, so all it can do is dump everything to a large file.
Currently, every morning we drop our old MySQL tables and load the data into our database.
One of the problems we have here is that we are unable to act on specific changes in the data. We also use CQRS and would benefit considerably from having an incremental list.
File format is currently CSV
Data size per file is up to 10GB
Number of rows per file is up to 40 million
Approximately 30 data files
On average, less than 1% of rows change each day
Most files either have no primary key or a combined primary key. For many, the full row is the only thing that makes them unique.
The order of data is not fixed; rows may switch positions
Desired situation
When we receive the new data, we want to calculate the difference and push a message onto Kafka for each changed (if a row identifier exists), added, or removed row.
Technology
We use AWS and are able to use all technologies AWS offers
We are not limited to a certain amount of hardware. We can just start up some new servers in AWS
Cost is only a minor factor. We have quite a large budget, and the ability to produce an incremental set offers us quite a lot of value.
We have a running Kubernetes cluster
Question
So the main question is: what would be the best way to compare these two large files and create an incremental set? We need it to be fast, preferably within an hour or close to that.
Are there database types that support this natively, or are there technologies that can do this for us?
"...The order of data is not fixed. Rows may switch positions..." That is the one that makes it hard. If the rows did not change a git diff or text file comparison tool would work.
Spitballing here but:
For each row, create a SHA hash
Use the hash as a unique ID
Store each unique hash and its associated data in a DB table
After processing the file, dump the table into a text file (CSV/SQL/etc.)
Commit the file changes to source control
When you receive a new data set, check whether each hash already exists
If no: append the hash to the end of the table
If yes: ignore it
Dump the table into a text file (CSV/SQL/etc.)
git diff the commits to see the change sets (a rough sketch of the hashing/diff step follows below)
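A rough sketch of that hashing/diff idea in Python, assuming the whole hash set fits in memory (for 10 GB dumps you would spill the hashes to the DB table described above); the file names, topic names and the kafka-python dependency are assumptions:

    import hashlib

    def row_hashes(path):
        # One SHA-256 hash per raw line of the dump; the hash acts as the row's unique ID.
        hashes = {}
        with open(path, "rb") as f:
            for line in f:
                line = line.rstrip(b"\r\n")
                hashes[hashlib.sha256(line).hexdigest()] = line
        return hashes

    old = row_hashes("dump_yesterday.csv")   # assumed file names
    new = row_hashes("dump_today.csv")

    # Rows whose hash appears only in one of the two dumps.
    added   = [new[h] for h in new.keys() - old.keys()]
    removed = [old[h] for h in old.keys() - new.keys()]

    # Pushing the changes to Kafka would look roughly like this
    # (requires the kafka-python package):
    # from kafka import KafkaProducer
    # producer = KafkaProducer(bootstrap_servers="localhost:9092")
    # for row in added:
    #     producer.send("rows-added", row)
    # for row in removed:
    #     producer.send("rows-removed", row)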
Might be able to do this with AWS Glue...
Bonus:
To make it even easier, create a location the back-office app can upload the file to, and create a cron job to process the report at a given time.
This process is a typical ETL (Extract-Transform-Load) task. You are extracting data from one source/format, changing it, and loading/inserting it into a different source/format.
Let me know if any of this was helpful.
We are developing an e-commerce website using ASP.NET and SQL Server. The customer can view and order a wide variety of switches and light fittings.
As we need to display images of these products for each category, the number of images may rise to over 500. We are a bit confused about whether we should save these images as the image type in SQL Server, or whether it's better to store the path of the image. In the latter case, is storing all images under a single folder better?
The image data type has nothing to do with images. I don't know why it exists. It is legacy.
Store the images like you would store any other blob: varbinary(max).
The issue of whether to store blobs in the database at all has been discussed before. Note that the answers there are very opinionated and subjective. It is clearly wrong to say that one should always store blobs either inside or outside of the database.
You can use the following data types for BLOBs on SQL Server:
binary: fixed size, up to 8,000 bytes.
varbinary(n): variable size, up to 8,000 bytes (n specifies the max size).
varbinary(max): variable size, up to 2 GB.
What Is a BLOB?
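The thread is ASP.NET, but just to make the varbinary(max) option concrete, here is a minimal sketch in Python with pyodbc; the connection string, file name, table and column names are made up for illustration:

    import pyodbc

    # Illustration only: insert an image file into a varbinary(max) column.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=Shop;Trusted_Connection=yes;"   # assumed connection
    )
    with open("switch_01.jpg", "rb") as f:                         # assumed file name
        image_bytes = f.read()

    cur = conn.cursor()
    cur.execute(
        "INSERT INTO ProductImages (ProductId, ImageData) VALUES (?, ?)",  # assumed table
        42, image_bytes,
    )
    conn.commit()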
I would suggest saving the image URLs in the database against your product ID. But if you are requesting up to 500 URLs at a time, consider introducing a CDN in the middle to cache the images. It will have a big impact on performance.
Storing the image in a folder and the path of that file in SQL Server is a good idea. However, two files with the same name may overwrite each other, so save them with a timestamp. I think that will help you; otherwise go the other way and save to the image data type in SQL Server.
To save data to the SQL Server image data type, follow the guideline at the link below.
Go to Link
Let's think about this. If you do store the images in a blob column, make sure you create a separate filegroup for that data. If you decide to store them as files and keep only a path, use FILESTREAM: the database will manage the path, the data will be handled in transactions, and it will be backed up with the database.
Saving images directly into the database is not that good an idea. Better would be saving image and media/document files to an S3 bucket/CDN; if you don't want to pay that cost, save the images into a folder and store their paths in the database in a varchar column.
I work on a project that has very well defined lines of responsibility. There are about six to ten of us and we currently do all of our work in Excel, building a single spreadsheet with maintenance requirements for ships. A couple of times during the project process we stop all work and compile all of the individual spreadsheets into one spreadsheet. Since each person had a well defined area, we don't have to worry about one person overwriting another person's work. It only takes an hour, so it isn't that huge of a deal. Less than optimal, sure, but it gets the job done.
But each person fills out their data differently. I think moving to a database would serve us well by making the data more regimented with validation rules. The problem is that we do not have any type of shared drive or database server where we can host the database, and that won't change. I was wondering if there was a simple solution similar to the way we were handling the Excel spreadsheet. I envisioned a process where I would wipe the old data and then import the new data. But I suspect that will bring up other problems.
I am pretty comfortable building small databases and using VBA and whatnot. This project would probably have about six tables, and probably three that would have the majority of the data for any given project (the others would be reference tables and slow-to-change data). Bottom line is, I am wondering if it is worth it, or should I stick with Excel?
Access 2007 onwards has an option for "Collecting email replies", which can organise flat data, but only a single query can be populated, so it might be a bit limiting.
The only solution I can think of that's easier than what you currently use is to create the DB with some VBA modules that export all new/updated data to an XML/CSV file and attach it to an email. You'd then have to create a VBA module that imports the data from these files into the current tables.
It's a fair amount of work to get set up, but once working it might be fairly quick and robust.
Edit, just to add, I have solved a similar problem but I solved it with VB.net and XML files rather than Access.
You can link Access databases to other databases (or import from them). So you can distribute a template database for users to add records to and then email back. When receiving back, you would either import or link them to a master database and do whatever you needed to do with the combined data.
I have around 10 tables containing millions of rows. Now I want to archive 40% of the data due to size and performance problems.
What would be the best way to archive the old data while keeping the web application running? In the near future I may also need to show the old data along with the existing data.
Thanks in advance.
There is no single solution for every case. It depends a lot on your data structure and application requirements. The most general cases seem to be as follows:
If your application can't be redesigned and instant access is required to all your data, you need to use a more powerful hardware/software solution.
If your application can't be redesigned but some of your data can be considered obsolete because it is requested relatively rarely, you can split the data and configure two applications to access the different data sets.
If your application can't be redesigned but some of your data is non-critical and can be minimized (consolidated, packed, etc.), you can perform some data transformation while keeping the full data in another place for special requests.
If it's possible to redesign your application, there are many ways to solve the problem. In general you will implement some kind of archive subsystem, and that is a complex problem, especially if not only your data changes over time but the data structure changes too.
If it's possible to redesign your application, you can also optimize your data structure using new supporting tables, indexes, and other database objects and algorithms.
Create an archive database and, if possible, keep it on a separate archive server. This data won't be needed often but still has to be kept for future purposes, so moving it out reduces load and storage on the main server.
Move the old table data to that location. Later you can retrieve it in a number of ways:
Changing the application's path/connection to point at the archive, or
Updating the live table from the archive table. A rough sketch of the copy-then-delete step is shown below.
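A minimal sketch of that copy-then-delete step, assuming the archive table lives in the same MySQL instance and using mysql-connector-python; the table, column and connection details are made-up placeholders:

    import mysql.connector

    # Illustration only: move rows older than a cutoff into the archive table.
    conn = mysql.connector.connect(
        host="localhost", user="app", password="secret", database="app_db"  # assumed
    )
    cur = conn.cursor()
    cutoff = "2014-01-01"   # assumed cutoff date and created_at column below

    # Copy old rows to the archive, then remove them from the live table.
    cur.execute("INSERT INTO orders_archive SELECT * FROM orders WHERE created_at < %s", (cutoff,))
    cur.execute("DELETE FROM orders WHERE created_at < %s", (cutoff,))
    conn.commit()

If the archive lives on a separate server, you would export/import the old rows instead of using a single INSERT ... SELECT.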
I am looking for a non-SQL database.
My requirements are as follows:
Should be able to store >10 billion records
Should consume at most 1 GB of memory
A user request should take less than 10 ms (including processing time)
Java-based would be great (I need to access it from Java, and also in case I ever need to modify the database code)
The database will hold e-commerce search records such as number of searches, sales, product buckets, product filters, and many more. The data is currently in a flat file, and I show some specific data to users. The data to be shown is configured in advance, and according to that configuration users can send HTTP requests to view it. I want to make things more dynamic so that people can view data without prior configuration.
In other words, I want to build a fast analyzer that can show users whatever they request.
The best place to find names of non-relational databases is the NoSQL site. Their home page has a pretty comprehensive list, split into various categories - Wide Column Store, Key-Value Pair, Object, XML, etc.
You don't really give enough information about your requirements, but it sounds like kdb+ meets all of the requirements you've stated - though only if you want to get to grips with the rather exotic (and very powerful) Q language.