I am writing a project which needs to scrape data from a website, I am using pyspider, and it runs automatically every 24 hours(scraping the data every 24hours). Problem is , before writing the new data entry into the dB, I want to compare the new data with the existing data in the dB.
Is there a tool/lib I can use?
I am running my project on aws, what’s the best tool I can use to work with aws?
My idea is to set up some rule for the data to update/insert into the dB, but when the new data is somehow conflict with rule then I will be able to view the data/scrape log(where the tool will label it as pending)and waiting for admin to do further operation.
Thanks in advance.
[List of data compare, synchronization and migration tools.]https://dbmstools.com/categories/data-compare-tools
visit there. it might be helpful
Related
There is a requirement in my project where we need to design a system which can collect data through Web API and then use the data to compare and copy the received data to an existing SQL Server DB. I want to know if anyone has already worked on such requirement and if yes then what is the best way to design it? I am currently thinking of below two options. Please let me know which one is better and if there is any other option.
My algorithm will be as - fetch the data through web api -> compare the data -> save mismatched data to a particular table -> copy new data to the existing tables.
The two options I am currently thinking of are-
1) Use a windows service which will run once in a day and execute above algo.
2) Use SSIS package which will run once in a day and execute above algo.
If anyone has used either of this solution, please guide me to an article or blog which can be helpful to me.
I have similar project requirement before. What I achieved is in SSIS.
Brief steps:
Using C# script to get the return data (http://json2csharp.com/ is an easy way to return C# class based on your JSON components)
using third party dll, install Newtonsoft.Json to deserialize the JSON
Assign the results in C# script to each pre defined variable (be careful with the data type)
Compare the result with the existing table in data flow task.
Let me know if you have any questions
To give you the question first: I want to know if it is possible to create a stored procedure or something in SQL Server that intercepts and translates SELECT, INSERT, and UPDATE commands. Now for the explanation:
I am writing a web application to replace an old desktop app. Its a business app which is basically a database interface with reports and searches and all the good ol' CRUD. The new and old apps need to live in harmony together since some customers may be using the old and new together to access the same DB.
My problem is that the original database format stores most data in a single blob of text (1 nvarchar(MAX) field). I want to add functionality to search on fields stored in the blob, but it will be cumbersome and slow. I would like to update the database format without changing the desktop app at all, hence the question above.
It occurs to me that I could do this on the client by writing a wrapper class for the data access object and then do a bulk replace in the client code to reference the wrapper, but I want to know what my options are on the server as well.
In case anyone wants to know, the old app is in VB6 and the new in C#.
EDIT
Alright, so it looks like if I do anything on the server side we are looking at adding stored procedures and then updating the client VB6 code to reference the stored procs. Do something like a bulk replace of SELECT with sp_oldselect ... To return the data in a different format. I'm guessing a client-side wrapper would be the best solution for the time-being. Old apps die hard.
You can create a bunch of views for the old client and let it to query those views. It will be slow as hell in most cases, but it can 'replace' the select query. For updates and insert.. well.. instead of triggers on the views could help is some cases, but it will require lots of processing.
However my suggestion is to provide exactly the same functionality in the web app and deprecate the desktop app. When the desktop app's share is low enough, stop supporting it. From this point, you are (mostly) free to add new functions, upgrade the database schema, etc.
I agree with JonH, that alot can go wrong here, but you can try and read up on the INSTEAD OF Triggers in MS SQL server here: https://technet.microsoft.com/en-us/library/ms179288(v=sql.105).aspx
Im currently developing a servlet homepage (spring + hibernate + mysql).
Im at the moment using the Hibernate property hibernate.hbm2ddl.auto set to update.
This is working fine and Hibernate creates and updates my tables.
However, Ive have read on multiple places that this is not recommended in production and that it is unsafe.
But if I dont put this option my tables is not created, and I really don't want to create my tabels manually on the server. I got limited time working on this alone.
How is this usually done? It's seems like it is quite much work to add all tables manually imo.
In production, you typically have already existing tables with a large amount of data that you don't want to lose, and that you want to migrate to the new schema. Hibernate can't do that automagically for you. It doesn't know that the data that was previously in column A must now be in the new column B.
So you'll need to create a migration script. Of course, you can use Hibernate to generate the new schema for you in development, see what the differences with the old schema are, and create your script thanks to that. But yes, having an app in production and migrate it needs some work to be done.
Does anyone have an elegant suggestion for how to get the contents of an Excel spreadsheet into SQL Server via a web form? I need to allow our clients to upload modest amounts of structured data, and I need that data to ultimately reside in a sql table. I really can't expect the clientele to produce anything but an Excel file, but I could require that it be an xlsx.
The web app is written in Coldfusion; it doesn't need to be able to handle huge numbers of simultaneous requests, but I don't want to consider some sort of server-side batch job processing or shunt the user to an asp.net page (which is what we are doing now).
Any recommendations (or examples of how others are successfully doing this) would be appreciated. Due to the sensitivity of the data, we really can't do anything to compromise the security of the web or sql servers.
If you are using CF9, then you could easily use the cfspreadsheet tag too. I mention this one specifically because Shawn's link did not (presumably due to its being relatively new on the CF scene). Here's the livedoc link: http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec17cba-7f87.html
For full use, I would create a web form with a standard file upload field. On the backend handling the form submission, get a copy of the file with
<cffile action="upload" destination="uploaded.xls".....>
Then use:
<cfspreadsheet action="read" query="myExcelData" src="uploaded.xls" ...>
At which point, your spreadsheet content will be available as a query object. You can then loop over this query, running insert queries into your sql server each time you loop. That should do it.
Here are the most notable options to help point you in the right direction; choose what you are most comfortable with (Source: Charlie Arehart).
CFXL
JXLS
CFX_Excel
My personal recommendation is to go the CFX_Excel route. Although a commercial product, it will grant you the most functionality/flexibility of the options listed.
I'd like to know your approach/experiences when it's time to initially populate the Grails DB that will hold your app data. Assuming you have CSVs with data, is is "safer" to create a script (with whatever tool fits you) that:
1.-Generates the Bootstrap commands with the domain classes, run it in test or dev environment and then use the native db commands to export it to prod?
2.-Create the DB's insert script assuming GORM's version = 0 and incrementing manually the soon-to-be autogenerated IDs ?
My fear is that the second approach may lead to inconsistencies for hibernate will have the responsability for the IDs generation and there may be something else I'm missing.
Thanks in advance.
Take a look at this link. This allows you to run groovy scripts in the normal grails context giving you access to all grails features including GORM. I'm currently importing data from a legacy database and have found that writing a Groovy script using the Groovy SQL interface to pull out the data then putting that data in domain objects appears to be the easiest thing to do. Once you have the data imported you just use the commands specific to your database system to move that data to the production database.
Update:
Apparently the updated entry referenced from the blog entry I link to no longer exists. I was able to get this working using code at the following link which is also referenced in the comments.
http://pastie.org/180868
Finally it seems that the simplest solution is to consider that GORM as of the current release (1.2) uses a single sequence for all auto-generated ids. So considering this when creating whatever scripts you need (in the language of your preference) should suffice. I understand it's planned for 1.3 release that every table has its own sequence.