I believe the topic speaks for itself.
I would love to read your opinions and possible solutions for this.
Let's say we have the first, simplest scenario:
There are 1,000 new products that need to be imported into the system via the server.
Products live in a single database table; no relations are needed.
Imagine that we try to upload this data and it fails in the middle, but there is no rollback, so the data is saved only partially.
Let's say 200 of the 1,000 products were saved into the DB.
Since you got a message that the data was saved only partially because of a server failure, you want to import the same data again, but checking whether every single product already exists in the database is not very efficient. How do we restore the state of such a batch process so it starts from product 201 instead of from the beginning? A minimal sketch of the kind of checkpointing I have in mind is shown below.
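Just to illustrate what I mean, assuming Python with a generic DB-API connection and a hypothetical `import_batch` progress table (the table and column names are made up):

```python
def import_products(conn, batch_id, products, chunk_size=100):
    """products: list of (sku, name, price) tuples; conn: any DB-API connection."""
    cur = conn.cursor()
    # Resume from the last committed checkpoint for this batch, if there is one.
    cur.execute("SELECT last_index FROM import_batch WHERE batch_id = ?", (batch_id,))
    row = cur.fetchone()
    start = row[0] if row else 0

    for i in range(start, len(products), chunk_size):
        chunk = products[i:i + chunk_size]
        cur.executemany(
            "INSERT INTO products (sku, name, price) VALUES (?, ?, ?)", chunk
        )
        # Update the checkpoint in the SAME transaction as the chunk, so a
        # crash can never leave rows saved without the matching progress marker.
        cur.execute("DELETE FROM import_batch WHERE batch_id = ?", (batch_id,))
        cur.execute(
            "INSERT INTO import_batch (batch_id, last_index) VALUES (?, ?)",
            (batch_id, i + len(chunk)),
        )
        conn.commit()
```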
Thanks!
We have a long-running DB query that populates a temporary table (we are not supposed to change this behavior), which results in 6 to 10 million records, around 4 to 6 GB of data.
I need to use a .NET Web API for fetching data from the SQL database, and the API is hosted on IIS. When a request comes from the client to the API, the query runs for at least 5 minutes, depending on the amount of data in the joined tables, and populates the temp table. Then the API has to read the data from the temp table and send it to the client.
How can we achieve this without blocking the client, without losing the DB temp table, and without blocking IIS?
Just thinking: if I use an async API, will I be able to achieve this?
There are things you need to consider and things you can do.
If you kick off the query execution as the result of an API call, what happens if you get 10 calls to that endpoint at the same time? A dead API, that's what's going to happen.
You might be able to find a different trigger for the execution of the query, so you run this query once per day, for example, or once every 4 hours, and then store the result in a permanent table. The API's job then only becomes looking at this table and returning some data, without waiting for anything.
The second thing you can do is to return only the data you need for the screen you are displaying. You are not going to show 4-6 GB worth of data in one go; I suspect you have some pagination there, and you can rejig the code a little to only return one page of data at a time, as in the sketch below.
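A minimal sketch of page-at-a-time reads, assuming Python with a DB-API cursor over the permanent results table; the table and column names are made up, and the `?` placeholder style depends on the driver:

```python
PAGE_SIZE = 500

def fetch_page(cursor, page_number):
    # SQL Server style paging; requires a deterministic ORDER BY.
    cursor.execute(
        """
        SELECT report_id, customer, amount
        FROM dbo.ReportResults
        ORDER BY report_id
        OFFSET ? ROWS FETCH NEXT ? ROWS ONLY
        """,
        (page_number * PAGE_SIZE, PAGE_SIZE),
    )
    return cursor.fetchall()
```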
You don't say what kind of data you have, but if it's something which doesn't require you to run that query very often, then you can definitely make some improvements.
<---- edited after report clarification ---->
OK, since it's a report, here's another idea.
The aim is to make sure that the pressure is not on the API itself, which needs to be responsive and quick. Let the API receive the request with the required parameters and offload the actual report generation to another service.
Keep track of what this service is doing so you can report on the status of the activity: has it started, is it finished, whatever else you need. You can use a queue for that, or simply keep track of jobs in the database.
Generate the report file and store it somewhere.
Email the user with the file attached, or email a link so the user can download it. Another option is to provide a link to the report somewhere in the UI. A rough sketch of the flow is below.
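A rough sketch of the request/offload/status flow, assuming Python with a thread-based worker and an in-memory job registry purely for illustration (in the .NET/IIS case this would be a separate service or a hosted background worker):

```python
import threading
import uuid

jobs = {}  # job_id -> status ("running", "done", "failed")

def generate_report(job_id, params):
    try:
        # ... run the long query with `params` and write the report file here ...
        jobs[job_id] = "done"
    except Exception:
        jobs[job_id] = "failed"

def start_report(params):
    """Called by the API endpoint: returns immediately with a job id."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = "running"
    threading.Thread(target=generate_report, args=(job_id, params), daemon=True).start()
    return job_id

def report_status(job_id):
    """Called by a status endpoint so the client can poll for completion."""
    return jobs.get(job_id, "unknown")
```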
Well, I am going to query 4 GB of data using a cfquery. It's going to be a pain to query the whole database, as it's going to take a very long time to get the data back.
I tried a stored procedure when the data was 2 GB and it wasn't really fast at that time either.
The data will be pulled based on the date range the user selects on an HTML page.
It has been suggested that I use data archiving in order to speed up querying the database.
Do you think I'll have to create a separate table with only the fields that are required and then query this newly created table?
Well, the size of the current table is 4 GB, but it is increasing day by day; basically, it's a response database (it stores information coming from somewhere else). After doing some research, I am wondering if writing a trigger could be one option. If I do this, then as soon as a new entry (row) is added to the current 4 GB table, the trigger will run a SQL query that copies the required fields into the newly created table. This would keep happening as long as new values keep arriving in my original 4 GB database. A rough sketch of what I mean is below.
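Just to illustrate the idea, a rough sketch assuming Python with pyodbc creating a SQL Server AFTER INSERT trigger; the connection string, table names and columns are all made up:

```python
import pyodbc

TRIGGER_DDL = """
CREATE TRIGGER trg_CopyToReporting
ON dbo.Responses
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy just the fields needed for reporting into the smaller table.
    INSERT INTO dbo.ResponsesReporting (ResponseId, ResponseDate, Value)
    SELECT ResponseId, ResponseDate, Value
    FROM inserted;
END
"""

# Hypothetical connection string; adjust to the real environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute(TRIGGER_DDL)
conn.commit()
```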
Does the above approach sound good enough to tackle my problem? I have one more concern: even though I am copying only the required fields into a new table, at some point the size of my new table will also grow, and that could also slow down querying the new table.
Please correct me if I am wrong somewhere.
Thanks
More Information:
I am using SQL Server. Indexing is currently done but it's not effective.
Archiving the data will be like farting against thunder. The data still has to travel from your database to your application, and then your application has to process it to build the chart. The more data you have, the longer that will take.
If it is really necessary to chart that much data, you might want to simply acknowledge that your app will be slow and do things to deal with it. This includes code to prevent duplicate page requests, progress displays for the user, and such.
I'm using Delphi XE4 Architect (Delphi XE3 is OK as well).
I need to find a smart solution to the following problem,
and I would like to use one of these frameworks: kbmMW, RemObjects SDK / DataAbstract, or RealThinClient.
Currently I have an application using a very simple MSSQL database at site A that is used by users at site B through Remote Desktop.
The application sometimes needs to show some pictures and also display some PDFs, but it is mostly text data entry.
There is no particular reason for me to use MSSQL;
it is simply a database that I found already active and populated, and I did not build it myself.
At this point, it would be complicated to change it.
(The database itself is not important; I am not using specific features, stored procedures, or triggers.)
Users at site B are connected to site A via a very slow network connection,
and occasionally the connection is unavailable for a few hours, up to one day (this is the major problem).
Unfortunately, the connection situation cannot be improved, for various reasons.
The database is quite simple and has many tables that hardly ever change;
about ten, however, undergo daily updates and may potentially be subject to concurrent changes.
Typically, a record in these tables is locked for update by a single user,
who edits some fields and then saves, releasing the lock.
I would like to set up something very different to optimize performance.
Users at site A have higher priority; they are more important, because site A is the headquarters.
I would like to have a copy of the site A database at site B,
so that users at site B can work locally, much faster, without using Remote Desktop to connect to site A.
The RDP protocol is not very efficient, and in any case, if the connection is down, users cannot work.
Synchronizing updates and record locks between the two databases may not be a big problem.
Basically, when a user at site B acquires a record for editing in database B,
a user at site A should obviously not be able to modify the same record in the site A database.
This should also work in the opposite direction, of course.
My big problem is figuring out how best to handle the situation that occurs
when the connection between B and A is unavailable for some hours (while transactions/events keep accumulating at site B).
Events at site A generally take priority (on collision) over events at site B.
Users at site B must be able to continue working.
When the connection comes back, the changes should be sent to the database at site A.
Obviously this can result in conflicts, but the changes made by site B users can either be discarded
or be handled by a site B user through a supervised, selective, record-by-record merge and approval.
A rough sketch of the change journal I imagine keeping at site B while offline is below.
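Something like this, assuming Python with a generic DB-API connection; the `pending_changes` table and its columns are made up, just to show the kind of bookkeeping I mean:

```python
import json
from datetime import datetime, timezone

def journal_change(conn, table_name, record_guid, changed_fields):
    """Record a local site B edit so it can be replayed against site A later."""
    cur = conn.cursor()
    cur.execute(
        """
        INSERT INTO pending_changes (table_name, record_guid, payload, changed_at, synced)
        VALUES (?, ?, ?, ?, 0)
        """,
        (table_name, record_guid, json.dumps(changed_fields),
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

def unsynced_changes(conn):
    """When the connection comes back, these rows are sent to site A in order."""
    cur = conn.cursor()
    cur.execute(
        "SELECT id, table_name, record_guid, payload FROM pending_changes "
        "WHERE synced = 0 ORDER BY id"
    )
    return cur.fetchall()
```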
Well, I hope the scenario is explained clearly enough.
Additional info:
The DB schema is very simple: only tables, no triggers, no stored procedures. I can build one as an example, but imagine 10 tables that can be updated concurrently.
The DB is used by a desktop app of the sales department, so it contains highly confidential data.
The remote connection is typically 512 Kbit at most, but the main problem here is that the connection may sometimes be down,
and users at the remote site must be able to work anyway. This is the main focus.
The total daily update data would be at most 10 MB, compressed, for the DB traffic alone. There is some other data synchronized
over the same connection, but it is not part of this job.
I don't want to use MSSQL-specific tools or services (replication and so on), because the DB could change in the future.
Thanks
We do almost exactly this using a Delphi client app, a kbmMW-based Delphi server app, and an MSSQL database (though it used to work quite happily on a DBISAM database too).
We have some tables that only the head office site users are allowed to modify. The smaller tables are transferred in their entirety each time there is a "merge". The larger tables and the transaction-type tables all have a date-added and/or a date-modified field, and only those records that have been changed or added in the last 3 weeks or so (configurable) are transferred. This means sites can still update to the latest data even if they have been disconnected for quite some time - we used to have clients in remote places on dubious dial-up lines!
We only run the merge routines once or twice a day but it would work equally well on an hourly basis or other time schedule.
At given times of day each site (including head office) "exports" its changed/new records to files (e.g. client dataset tables or similar). These are then zipped up by the application and placed in an "outgoing" folder. The zip file is named based on the location id, date, time etc. The files are transferred by some external means, e.g. via FTP, file share, email and so on. Each branch office sends its data files to head office, and head office transfers its data to each branch; whichever transport is used, the files end up in an "incoming" folder at the destination. A small sketch of the export step is shown after this paragraph.
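A small sketch of the export step, in Python purely for illustration (our implementation is Delphi); the table, columns and file layout are made up, but the "changed in the last N days" query and the location/date/time file naming follow the description above:

```python
import csv
import io
import zipfile
from datetime import datetime, timedelta

def export_changes(conn, location_id, outgoing_dir, window_days=21):
    since = datetime.now() - timedelta(days=window_days)
    cur = conn.cursor()
    cur.execute(
        "SELECT guid, customer, amount, date_modified FROM orders WHERE date_modified >= ?",
        (since,),
    )
    rows = cur.fetchall()

    # Write the changed rows to an in-memory CSV, then zip it into the
    # outgoing folder with a name based on location, date and time.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["guid", "customer", "amount", "date_modified"])
    writer.writerows(rows)

    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    zip_path = f"{outgoing_dir}/{location_id}_{stamp}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("orders.csv", buf.getvalue())
    return zip_path
```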
On a regular basis (e.g. hourly) each location checks the incoming folder to see if there is anything new for it to import. If so, it adds all the new records, branch locations overwrite their head-office data tables with the new ones, and edited records are merged in "somehow". This is the tricky bit. The easiest policy is "head office wins", so all edits are accepted unless there is a conflict, in which case the head office version wins. Alternatively you could use "last edited wins" - but then you need to make sure clocks are in sync across locations. The other option is to put conflicting records into some form of "suspense" status and let an end user decide at some point in the future. We do this on one data set. Whichever conflict method you choose, you need to record each decision in some form of log table and prompt an administrative-level user to check occasionally. A minimal sketch of these policies follows.
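A minimal sketch of the two simplest policies, again in Python just to show the shape of the decision; records are plain dicts and the field names are made up:

```python
def resolve_conflict(head_office_record, branch_record, policy="head_office_wins"):
    """Return the record that should survive when both sites edited the same row."""
    if policy == "head_office_wins":
        return head_office_record
    if policy == "last_edited_wins":
        # Requires clocks to be reasonably in sync across locations.
        if branch_record["date_modified"] > head_office_record["date_modified"]:
            return branch_record
        return head_office_record
    # Anything else goes into a "suspense" status for a human to decide later;
    # whichever way it goes, the decision should be written to a log table.
    return None
```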
When the head office imports data or when data is added at the head office then a field is set to indicate the data is part of the master data. When branches add data this field is empty to indicate it has yet to reach the master set. This helps when branches export their data as they can include all data that doesn't have this field set.
We have found that you can't run the merge interactively as you'll end up never getting any work done and you won't be able to run the merge at night etc. It needs to be fully automated with the ability for an admin user to make adjustments at some point after the fact.
We've been running this approach for several years now on multi-site operations and, once it settled down, it has worked pretty much flawlessly. With 2 export/import schedules per day we have found the branch offices run perfectly well and are only ever missing less than a day's worth of transactions. It works well in our scenario, where we don't often have conflicts. Exported data is in the region of 5-10 MB, which zips down plenty small enough.
Primary keys are vital! We use a GUID and it hasn't let us down yet.
The choice of database server and n-tier framework is, actually, irrelevant. It's the process that matters here.
Basically, when a user at site B acquires a record for editing in database B, a user at site A should obviously not be able to modify the same record in the site A database. This should also work in the opposite direction, of course.
I can't see how you're ever going to make this bit work reliably if both sites have their own copy of the database and you're allowing for dropped/non-existent inter-site connections on occasion.
I am designing an application and I have two ideas in mind (below). I have a process that collects approximately 30 KB of data every 5 minutes, and this data needs to be pushed to the clients (web side, about 100 users at any given time). The collected information does not need to be stored for future use.
Options:
1. Get the data and insert it into the database every 5 minutes; the client then calls the DB, retrieves the data, and updates the UI.
2. Collect the data and put it into a topic or queue; multiple clients (consumers) can then go to the queue and obtain the data.
I am leaning towards option 2 as the better solution because it is faster (no DB calls) and there is no redundant storage.
Can anyone suggest which would be the ideal solution, and why?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right?
But all users can see it, not just the first person to get there, right? So a queue is not really an appropriate data structure, from my interpretation of your system.
Whether the data is written to something persistent like a database or something less persistent like part of the web server or application server may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web clients are getting updates in real time without some kind of push/long-poll or similar.
Seems to me that you need to use a queue and the publisher/subscriber pattern.
This is an article about RabbitMQ and the Publish/Subscribe pattern; a minimal sketch of the idea follows.
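A minimal publish/subscribe sketch using Python and the pika client, following RabbitMQ's standard fanout-exchange pattern; the exchange name and payload are made up. Because each subscriber binds its own queue to the fanout exchange, every client sees every 5-minute update rather than competing for a single message:

```python
import pika

# Publisher: called every 5 minutes with the freshly collected ~30 KB payload.
def publish_update(payload: bytes):
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange="metrics", exchange_type="fanout")
    channel.basic_publish(exchange="metrics", routing_key="", body=payload)
    connection.close()

# Subscriber: each consumer declares its own temporary queue and binds it.
def consume_updates(on_update):
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange="metrics", exchange_type="fanout")
    result = channel.queue_declare(queue="", exclusive=True)
    channel.queue_bind(exchange="metrics", queue=result.method.queue)
    channel.basic_consume(
        queue=result.method.queue,
        on_message_callback=lambda ch, method, props, body: on_update(body),
        auto_ack=True,
    )
    channel.start_consuming()
```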
Get the data and insert it into the database every 5 minutes; the client then calls the DB, retrieves the data, and updates the UI.
You can program your application to be event-oriented. For instance, raise domain events and publish your messages to your subscribers.
When you use a queue, each subscriber dequeues the messages addressed to it, of course respecting the order (FIFO). In addition, there is a delivery guarantee, unlike a database, where a record can be deleted even though not every 'subscriber' has received the message.
The pitfalls of using the database to accomplish this are:
Creating indexes makes querying faster, but inserts slower;
You will have to track the delivery guarantee for every subscriber yourself;
You'll need a TTL (time-to-live) strategy to purge old records (while still honoring the delivery guarantee).
I would like to know how to retrieve data from aggregated logs. This is what I have:
- about 30 GB of uncompressed log data loaded into HDFS daily (and this will soon grow to about 100 GB)
This is my idea:
- each night this data is processed with Pig
- the logs are read and split, and a custom UDF extracts fields like timestamp, url, and user_id (let's say this is all I need) from each log entry
- the result is loaded into HBase (the log data will be stored indefinitely)
Then, if I want to know which users saw a particular page within a given time range, I can quickly query HBase without scanning the whole log data on every query (and I want fast answers; minutes are acceptable). There will also be multiple queries taking place simultaneously. A small sketch of the kind of HBase access I have in mind follows.
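Something like this, using Python with the happybase client just to illustrate; the table name, column family and row-key layout (url, then zero-padded timestamp, then user_id) are assumptions:

```python
import happybase

connection = happybase.Connection("hbase-host")  # hypothetical host
table = connection.table("page_views")

def record_view(url, timestamp, user_id):
    # The row key groups all views of a URL together and sorts them by time,
    # so a time-range query becomes a simple scan over a key range.
    row_key = f"{url}|{timestamp:013d}|{user_id}".encode()
    table.put(row_key, {b"v:user_id": user_id.encode()})

def users_for_page(url, start_ts, end_ts):
    start = f"{url}|{start_ts:013d}".encode()
    stop = f"{url}|{end_ts:013d}~".encode()
    return {
        data[b"v:user_id"].decode()
        for _key, data in table.scan(row_start=start, row_stop=stop)
    }
```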
What do you think about this workflow? Do you think that loading this information into HBase would make sense? What are the other options, and how do they compare to my solution?
I appreciate all comments/questions and answers. Thank you in advance.
With Hadoop you are always doing one of two things (either processing or querying).
For what you are looking to do, I would suggest using Hive (http://hadoop.apache.org/hive/). You can take your data and create an M/R job to process it and push it into Hive tables however you like. From there you can even partition the data where appropriate, for speed, so that queries don't have to look at data they don't need, as you say. Then you can query out your results as you like. Here is a very good online tutorial: http://www.cloudera.com/videos/hive_tutorial. A rough sketch of what a partitioned table and query might look like is below.
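A rough sketch of a date-partitioned Hive table and a query against it, run here from Python via the hive CLI purely for illustration; the table, columns, partition column and dates are all made up:

```python
import subprocess

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS page_views (
    view_time STRING,
    url       STRING,
    user_id   STRING
)
PARTITIONED BY (dt STRING)
STORED AS SEQUENCEFILE;
"""

# Only the partitions covering the requested date range are read,
# so the query does not scan the whole log history.
QUERY = """
SELECT DISTINCT user_id
FROM page_views
WHERE dt BETWEEN '2011-01-01' AND '2011-01-07'
  AND url = '/some/page';
"""

def run_hive(statement: str) -> str:
    result = subprocess.run(["hive", "-e", statement],
                            capture_output=True, text=True, check=True)
    return result.stdout

run_hive(CREATE_TABLE)
print(run_hive(QUERY))
```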
There are lots of ways to solve this, but it sounds like HBase is a bit of overkill unless you want to set up all the servers required for it to run as an exercise to learn it. HBase would be good if you had thousands of people simultaneously trying to get at the information.
You might also want to look into Flume, which is a new data-import tool from Cloudera. It will get your files from wherever they are straight into HDFS: http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/