Performance issue, flat file vs database storage - sql-server

I am working on a project that requires me to develop a system that provides JSON output. The flow is as follows:
1a) Some tables will be updated via the administration panel (my company side)
1b) Some related tables will be updated via the administration panel (partner side)
-- Let's say SuperHeroes & Males are updated in 1a), and Studios & Years are updated in 1b)
2) A client browses our site and requests an information set which:
Has an enabled and not deleted row in SuperHeroes (Ant-man)
Has an enabled and not deleted row in Males linked to SuperHeroes (Scott Lang)
Joins the above records, then checks whether they are linked to an existing row in Studios (Marvel)
Linked to an existing row in Years (2015)
3) A very small amount of data will be output as a JSON string like the following: { id:1,type:marvel },{ id:1,type:dc }
All rows in the above 4 tables can be updated/deleted at any time without notification [and there are no foreign keys either].
I am thinking of updating the information in a flat file every time 1a is performed (we can change the system on my company's side but not on the partner's side, and they refuse to save any extra information into a flat file, so we have no easy way to know whether the Studios or Years tables have been modified).
The JSON request would then first load the information from the flat file (all the output data will be stored in this file), and then use a simple SQL statement to check whether a linked record exists in Studios & Years.
I have done my research and am getting confused. I concluded that a flat file is fine when the amount of data is small, but that I should beware of the file growing larger and larger (the flat file we are talking about will never hold more than 50 rows at one time, and it should not be modified frequently).
Some answers said a database is good at querying data (I think so too, and the requirement will perform an SQL check anyway).
So I don't know whether a flat file is a good idea when my data is small but I still need some communication with the database.
I appreciate your time and your help; all ideas & hints are welcome, thanks!

Your conclusion regarding the amount of data is absolutely correct, and a file should handle those 50 rows, but...
Using a database as storage should give you more options in the future, e.g.:
you'll be able to produce any output thanks to the separation of data from its representation (today it's JSON, but what if you have to produce XML at some point? Will you add files that store XML next to the JSON files? And then CSV or any other format?)
transactions will give you ACID guarantees
scalability and performance - if your dataset gets bigger (you never know :)), many DBs offer you possibilities like table partitioning, partitioning-based clustering or replication
Wrong choices regarding architecture and technology made at the beginning of the project will always backfire.
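To make the database option concrete, here is a minimal sketch of the lookup described in the question, done entirely in SQL and then serialized to JSON. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and every column name (hero_id, studio_id, year_id, enabled, deleted) is an assumption, since the question does not show the schema. Because there are no foreign keys, the joins themselves act as the existence checks.

```python
import json
import sqlite3


def build_payload(conn):
    """Return a JSON string for heroes that pass every existence check."""
    rows = conn.execute(
        """
        SELECT h.id, s.name
        FROM SuperHeroes AS h
        JOIN Males   AS m ON m.hero_id = h.id AND m.enabled = 1 AND m.deleted = 0
        JOIN Studios AS s ON s.id = h.studio_id   -- drops heroes whose studio row is gone
        JOIN Years   AS y ON y.id = h.year_id     -- drops heroes whose year row is gone
        WHERE h.enabled = 1 AND h.deleted = 0
        ORDER BY h.id
        """
    ).fetchall()
    return json.dumps([{"id": r[0], "type": r[1]} for r in rows])


def demo():
    """Build a tiny in-memory database (hypothetical schema) and query it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE SuperHeroes (id INTEGER, name TEXT, studio_id INTEGER,
                                  year_id INTEGER, enabled INTEGER, deleted INTEGER);
        CREATE TABLE Males (id INTEGER, hero_id INTEGER, name TEXT,
                            enabled INTEGER, deleted INTEGER);
        CREATE TABLE Studios (id INTEGER, name TEXT);
        CREATE TABLE Years (id INTEGER, year INTEGER);
        INSERT INTO SuperHeroes VALUES (1, 'Ant-man', 10, 20, 1, 0);
        INSERT INTO SuperHeroes VALUES (2, 'Orphan', 99, 20, 1, 0);
        INSERT INTO Males VALUES (1, 1, 'Scott Lang', 1, 0);
        INSERT INTO Males VALUES (2, 2, 'Nobody', 1, 0);
        INSERT INTO Studios VALUES (10, 'marvel');
        INSERT INTO Years VALUES (20, 2015);
        """
    )
    return build_payload(conn)
```

In this sketch the second hero points at a studio id that does not exist, so the join silently excludes it. With about 50 rows, a query like this is unlikely to be a bottleneck, and it always reflects the partner's latest changes to Studios and Years.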

Related

SQL to Excel - Max each sheet

I have a SQL Table with close to 2 Million rows and I am trying to export this data into an Excel file so the stakeholders can manipulate data, see charts, so on...
The issue is, when I hit refresh, it fails after getting all the data, saying the number of rows exceeds the max rows limitation in Excel. This table is going to keep growing every day.
What I am looking for here is a way to refresh data, then add rows to Sheet 1 until max rows limitation is reached. Once maxed out, I want the rows to start getting inserted into Sheet 2. Once maxed out, move to 3rd sheet, all from the single SQL table, from a single refresh.
This does not have to happen in Excel (Data -> Refresh option), I can have this as a part of the SSIS package that I am already using to populate rows in the SQL table.
I am also open to any alternate ways to export SQL table into a different format that can be used by said stakeholders to create charts, analyze data, and whatever else pleases them.
Without sounding too facetious, you are suggesting a very inefficient method.
The best way of approaching this is not to use .xlsx files at all for the data storage.
Assuming your destination stakeholders don't have read access to the SQL server, export the data to .csv and then use Power Query in some sort of 'Dashboard.xlsx' type file to load the .csv to the data model, which can handle hundreds of millions of rows instead of just 1.05m.
This will allow for the use of Power Pivot and DAX for analysis and the data will also be visible in the data model table view if users do want raw rows (or they can refer to the csv file..).
If they do have SQL read access then you can query the server directly so you don't need to store any rows whatsoever as it will read directly.
Failing all that and you decide to do it your way, I would suggest the following.
Read your table into a pandas DataFrame and iterate over each row and cell of the DataFrame, writing to your xlsx[sheet1] using openpyxl; then, once the row number reaches 1,048,576, simply move on to xlsx[sheet2].
In short: openpyxl allows you to create workbooks, worksheets, and write to cells directly.
But depending on how many columns you have it could take incredibly long.
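The splitting step can be sketched without any third-party library. The function below only decides which rows go on which sheet; the names split_into_sheets and EXCEL_MAX_ROWS are made up for this illustration, and it assumes you want the header repeated at the top of every sheet (which costs one row per sheet).

```python
from itertools import islice

EXCEL_MAX_ROWS = 1_048_576  # per-sheet row limit in Excel 2007+


def split_into_sheets(rows, header, max_rows=EXCEL_MAX_ROWS):
    """Yield (sheet_name, sheet_rows) pairs; every sheet starts with the header."""
    if max_rows < 2:
        raise ValueError("max_rows must leave room for the header and one data row")
    data_per_sheet = max_rows - 1  # one row is taken by the header
    it = iter(rows)
    sheet_no = 1
    while True:
        chunk = list(islice(it, data_per_sheet))
        if not chunk:
            break  # source exhausted
        yield f"Sheet{sheet_no}", [header] + chunk
        sheet_no += 1
```

Each yielded chunk could then be written with openpyxl, for example by creating a sheet with Workbook.create_sheet and calling Worksheet.append once per row. For a table of this size, openpyxl's write-only mode is probably worth a look, since it avoids keeping the whole workbook in memory.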
Product limitations: Excel 2007+ supports 1,048,576 rows by 16,384 columns per worksheet.
A challenge with your suggestion of filling a worksheet with the max number of rows and then splitting is "How are they going to work with that data?" and "Did you split data that should have been together to make an informed choice?"
If Excel is the tool the users want to use and they must have access to all the data, then you're going to need to put the data into a Power Pivot data model (and yes, that's going to impact the availability of some data visualizations). A Power Pivot model is an in-memory tabular data set. What that means is that the data engine, xVelocity, is going to use a bunch of memory but can get past the 1 million row limitation. Depending on how much memory is required, you might need to switch from the default 32-bit Office install to a 64-bit install (and I've seen clients have to max out the RAM on old, low-end desktops because they went cheap for business users).
Power Pivot will have a connection to your SQL Server (or other provider). When it refreshes data, it's going to fire off queries and determine the unique values in columns and then create a dictionary of unique values. This allows it to compress the data with low cardinality really well - sales dates are likely going to be repeated heavily within your set so the compression is good. Assuming your customers are typically not-repeat customers, a customer surrogate key would have high cardinality and thus not compress well since there's little to no repeat. The refresh is going to be dependent on your use case and environment. Maybe the user has to manually kick it off, maybe you have SharePoint with Excel services installed and then you can have it refresh the data on various intervals.
If they're good analysts, you might try turning them on to Power BI. Same-ish engine behind the scenes but built from the ground up to be a responsive reporting tool. If they're just wading through tables of data, they're not ready for PBI. If they are making visuals out of the data, PBI is likely a better fit.

Importing CSV to database (duplicate entries)

My job requires that I look up information on a long spreadsheet that's updated and sent to me once or twice a week. Sometimes the newest spreadsheet leaves off information that was in the last spreadsheet causing me to have to look through several different spreadsheets to find the info I need. I recently discovered that I could convert the spreadsheet to a CSV file and then upload it to a database table. With a few lines of script all I have to do is type in what I'm looking for and Voila! Now I just got the newest spreadsheet and I'm wondering if I can just Import it on top of the old one. There is a unique number for each row that I have set to primary in the database. If I try to import it on top of the current info will it just skip the rows where the primary would be duplicated or would it just mess up my database?
Thought I'd ask the experts before I tried it. Thanks for your input!
Details:
The spreadsheet consists of clients of ours. Each row contains the client's name, a unique ID number, their address and contact info. I can set the column containing the unique ID as the primary key, then upload it. My concern is that there is nothing to signify a new row in a CSV file (I think). When I upload it, it gives me the option to skip duplicates, but will it skip the entire row, or just that cell, causing my data to be placed in the wrong rows? It's an Apache server; I don't know which version of MySQL. I'm using 000webhost for this.
Higgs,
This issue in database/ETL terminology is called deduplication strategy.
There is not a template answer for this, but I suggest these helpful readings:
Academic paper - Joint Deduplication of Multiple Record Types in Relational Data
Deduplication article
Some open source tools:
Duke tool
Data cleaner
There's a little checkbox near the bottom when you click on import that says 'ignore duplicates' or something like that. Simpler than I thought.
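On the worry about misaligned cells: a CSV file does mark each new row with a line break, and a duplicate-skipping import works on whole rows, not on single cells. The sketch below shows that behaviour using Python's built-in sqlite3 module and its INSERT OR IGNORE statement as a stand-in; MySQL's counterpart is INSERT IGNORE. The table name clients and its three columns are assumptions made for this example.

```python
import csv
import io
import sqlite3


def import_csv(conn, csv_text):
    """Insert CSV rows, skipping any whole row whose primary key already exists.

    Returns the number of rows actually added.
    """
    before = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
    reader = csv.reader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT OR IGNORE INTO clients (id, name, address) VALUES (?, ?, ?)",
        ((int(r[0]), r[1], r[2]) for r in reader if r),  # skip blank lines
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
    return after - before


def demo():
    """Import an old and a new spreadsheet export into a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
    first = import_csv(conn, "1,Alice,1 Main St\n2,Bob,2 Oak Ave\n")
    second = import_csv(conn, "2,Bob,CHANGED\n3,Carol,3 Elm Rd\n")
    rows = conn.execute("SELECT id, name, address FROM clients ORDER BY id").fetchall()
    return first, second, rows
```

Note the trade-off: with ignore semantics the existing row for id 2 is kept as it was, so changes in the newer spreadsheet are not picked up. If the newest data should win, an update-on-duplicate strategy is needed instead.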

Better way to store updatable scientific data?

I am using a file consisting of published scientific data. I'm using this file with a program that reads in the first 5 space delimited data fields, and everything after that is considered a comment by the program.
2 example lines (of thousands):
FeII 1608.4511 0.521 55.36 -1300 M03 Journal of Physics
FeII 1611.23045 0.0321 55.36 1100 01J AJ
The program reads it as:
FeII 1608.4511 0.521 55.36 -1300
FeII 1611.23045 0.0321 55.36 1100
These numbers are each measurements and most (don't get me started) have associated errors that are not listed in this file. I would like to store this information in a useful and updatable way. That is, say the first entry FeII 1608.4511 has an error of plus/minus 0.002. Consider when a new measurement is made and changes it to: FeII 1608.45034 plus/minus 0.0005. I would like to update the value, the error, and record some information about the publication that it came from.
The program that uses this file is legacy code and is both crucial and inflexible: and it needs the file to look like the above output when it's read in. I would really like for there to be a way to update the input file to include things like errors on the values and publication hyperlinks in comments. I would also like a kind of version control ability to return the state of this large file today; or in 5 months after 20 more lines are updated with new values.
Any suggestions on how best to accomplish this? Should I store everything in some kind of database?
Databases are deeply tied to identity. If a database can't identify a row by the data that's in it, a database isn't going to help you.
If I were you, I'd start by storing the base file in a version control system, not a database. At 20 changes per 5 months, I'd probably make those changes manually and commit each batch of changes. (I don't know what might constitute a batch for you. Could be a single change every time.)
Since the format of the existing file is both crucial and brittle, I'm not sure whether modifying it is a good idea. I think I'd feel better about storing error ranges and publication hyperlinks in a separate file, and using a script to put the pieces together for applications that can use error ranges and hyperlinks.
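A rough sketch of such a script follows. The base file is left in exactly the format the legacy program expects, and a second file holds the error and a link for each line. The sidecar layout (species, value, error, URL separated by whitespace) and the example.org address are invented for this illustration, and so is the choice of the first two fields as the key that identifies a line.

```python
def parse_base(text):
    """Split each line into the 5 fields the legacy program reads, plus the comment."""
    records = []
    for line in text.splitlines():
        parts = line.split(None, 5)  # at most 5 splits: 5 fields + rest of line
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        comment = parts[5] if len(parts) > 5 else ""
        records.append({"fields": parts[:5], "comment": comment})
    return records


def parse_sidecar(text):
    """Map (species, value) -> (error, url) from the hypothetical sidecar file."""
    extra = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4:
            species, value, error, url = parts
            extra[(species, value)] = (error, url)
    return extra


def merge(base_text, sidecar_text):
    """Attach error and url to each base record; None where nothing is recorded."""
    extra = parse_sidecar(sidecar_text)
    merged = []
    for rec in parse_base(base_text):
        key = (rec["fields"][0], rec["fields"][1])
        error, url = extra.get(key, (None, None))
        merged.append({**rec, "error": error, "url": url})
    return merged
```

Because the key includes the measured value, a new measurement means changing both files together; committing them in the same version control commit keeps the pair consistent.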
A database sounds sensible, SQL Server Express is free and widely used.
You can read in the text file including all comments and output the edited data in the same format. You can use a number of front ends including Access, for rapid development, or something you create yourself in VB.Net, or even Excel, at a pinch.
You will need to consider the structure of the table(s) but it should not be too difficult, and you can get help here.
To update the file so that it includes errors and links, you don't need a database; just open the file, iterate through the lines and update each one.
If you want to be able to restore the state of a line, you definitely need some kind of database. You can create a database in SQL Server or Firebird, for example, and store in it a row for each historical state of a line (with its creation date, of course); the file itself would be the repository for current values, and you would be able to restore the file as of a given date with some simple fetching of the database information.
If you can't use a database like Firebird or SQL Server, you can store the historical data in a simple text file; it's up to you. Just remember that you will necessarily need, as @CatCall commented, a way to identify each line in order to relate the line in the file to the historical data stored in your repository.

Missing data in first record in MS Access (front end) and using SQL Server (back end)

I have a database that I just converted the back end to SQL Server using SSMA. I left the front end in MS Access. I only converted the tables and not the queries. It already had some data in it and that moved over just fine.
All was going well until just recently. On opening the database and loading the main form, Event Interest, it started having problems with the first record of the subform, called Names. The first field in the first record sometimes has data and sometimes does not. This is a text field. When data is in the field, it shows random numbers. I believe they may be related to SQL somehow. When the data is missing, you can select the field and hit the backspace button and the data will appear, minus the one character you just erased. I have no idea what is going on.
Any help you can supply I would greatly appreciate it. Thank you in advance.
I am new to SQL Server and I have used older versions of MS Access for a few years.
I am not certain what the problem might be, but these are some considerations that come to mind:
try deleting and recreating your linked tables. Perhaps an update to the table structure (or view, if you're linked to a view) has invalidated some of the metadata stored in the table link in your front end.
does your table have a primary key? If not, you really need one. There really is no such thing as a properly-designed data table in a relational database that is PK-less.
does your table have a timestamp? If not, add one, as this helps Access keep track of whether or not the data has changed on the server.
However, let me add that none of these issues manifest themselves exactly with the symptoms you've described, so they may not help.

What FoxPro data tools can I use to find corrupted data?

I have some SQL Server DTS packages that import data from a FoxPro database. This was working fine until recently. Now the script that imports data from one of the FoxPro tables bombs out about 470,000 records into the import. I'm just pulling the data into a table with nullable varchar fields so I'm thinking it must be a weird/corrupt data problem.
What tools would you use to track down a problem like this?
FYI, this is the error I'm getting:
Data for source column 1 ('field1') is not available. Your provider may require that all Blob columns be rightmost in the source result set.
There should not be any blob columns in this table.
Thanks for the suggestions. I don't know for sure if it is a corruption problem. I just started downloading FoxPro from my MSDN Subscription, so I'll see if I can open the table. SSRS opens the table, it just chokes before running through all the records. I'm just trying to figure out which record it's having a problem with.
Cmrepair is an excellent freeware utility to repair corrupted .DBF files.
Have you tried writing a small program that just copies the existing data to a new table?
Also,
http://fox.wikis.com/wc.dll?Wiki~TableCorruptionRepairTools~VFP
My company uses Foxpro to store quite a bit of data... In my experience, data corruption is very obvious, with the table failing to open in the first place. Do you have a copy of foxpro to open the table with?
At 470,000 records you might want to check to see if you're approaching the 2 gigabyte limit on FoxPro table size. As I understand it, the records can still be there, but become inaccessible after the 2 gig point.
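One way to check this without FoxPro installed is to read the DBF header, which stores the record count, header length and record length at fixed offsets. The sketch below estimates the table size from those three values; the function name dbf_size_report is made up for this example, and the estimate covers only the .dbf file itself, not any memo (.fpt) file.

```python
import struct

TABLE_SIZE_LIMIT = 2 * 1024**3  # the 2 GB table size limit mentioned above


def dbf_size_report(header_bytes):
    """Estimate a DBF table's size from the first 12 bytes of the file.

    Layout used: bytes 4-7 record count (uint32 LE), bytes 8-9 header
    length (uint16 LE), bytes 10-11 record length (uint16 LE).
    """
    if len(header_bytes) < 12:
        raise ValueError("need at least the first 12 bytes of the .dbf file")
    record_count, header_len, record_len = struct.unpack_from("<IHH", header_bytes, 4)
    expected_size = header_len + record_count * record_len
    return {
        "record_count": record_count,
        "record_len": record_len,
        "expected_size": expected_size,
        "fraction_of_limit": expected_size / TABLE_SIZE_LIMIT,
    }
```

To use it on a real table, pass in the first bytes of the file, for example the result of open(path, "rb").read(32), where path points at your .dbf. If the record count in the header is much larger than the roughly 470,000 rows the import manages to read, or the estimated size is close to the limit, that supports the size-limit theory.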
@Lance:
if you have access to Visual FoxPro command line window, type:
SET TABLEVALIDATE 11
USE "YourTable" EXCLUSIVE && If the table is damaged, VFP should display an error here
PACK && Permanently removes records marked for deletion and rebuilds the open indexes
PACK MEMO && If you have memo fields
After doing that, the structure of the table should be valid. If you want to see fields with invalid data, you can try:
SELECT * FROM YourTable WHERE EMPTY(YourField) && All records with YourField empty
SELECT * FROM YourTable WHERE LEN(YourMemoField) > 200 && All records with a long memo field, there can be corrupted data
etc.
Use Repair Databases from my site (www.shershahsoft.com) for FREE (and it will always be FREE).
I have designed this program to repair damaged FoxPro/FoxBase/dBase files. The program is very quick. It will repair a 1 GB table in less than a minute.
You can assign files and folders to the program. When you start the program it will mark all the corrupted files, and when you click the Repair or Check and Repair button, it will repair all the corrupted files. Moreover, it will create a "CorruptData" folder in each folder where the actual data exists, and will keep copies of the corrupt files there.
One thing to keep in mind: always run Windows CHKDSK on the drives where you store the files. When records are being copied to a table and a power failure occurs, lost clusters are left behind, which Windows converts to files during CHKDSK. After that, Repair Databases will do the job for you.
I have used many paid and free programs that repair tables, but all such programs leave extra records in the tables with ambiguous characters (and they are time consuming too). The programmer needs to find and delete such records manually. But Repair Databases actually recovers the original records, so you need no further action. The only action you need is reindexing your files.
During the repair process, a File Open dialog sometimes appears which asks you to locate the compact index file for a table with indexes. You may cancel the dialog at that point; the table will be repaired, but you will need to reindex the file later. (This dialog may appear several times, depending on the number of corrupted indexes.)

Resources