We are using Weblogic JMS as the JMS provider for our application. We use file store as the persistent store. Is there any mechanism to condfigure the file store size so that after the file has reached the specified size, a new file is generated. Right now I have seen that all the messages till today are persisted into one single file. I think it will affect the IO operation to read/write messages as the size increase. Has anyone come across this situation and what is the solution?
Here are the details about WebLogic JMS file store tuning.
Maybe this info about unit-of-order processing will be helpful as well.
I can't see anything about file size, but perhaps the message paging info will be helpful.
When I've used JMS with persistent messaging, I haven't had a problem with file size. Once a message is transacted, what happens to its entry in the file? Shouldn't it be removed or marked as acknowledged? Surely the file size doesn't grow without bound.
Related
I am building a web application where the users can create reports and then upload some images for the created reports. Those images will be rendered in the browser when the user clicks a button on the report page. The images are confidential and only authorized users will be able to access them.
I am aware of the pros and cons of storing images in database, in filesystem or a service like amazon S3. For my application, I am inclined to keep the images in the filesystem and paths of the images in the database. That means I have to deal with the problems arising around distributed transaction management. I need some advice on how to deal with these problems.
1- I believe one of the proper solutions is to use technologies like JTA and XADisk. I am not very knowledgeable about these technologies but I believe 2 phase commit is how automicity is achieved. I am using MySQL as the database, and it seems like 2 phase commit is supported by MySQL. Problem with this approach is XADisk does not seem to be an active project and there is not much documentation about it and there is the fact that I am not very knowlegable about the ins and outs of this approach. I am not sure if I should invest in this approach.
2- I believe I can get away with some of the problems arising from the violation of ACID properties for my application. While uploading images, I can first write the files to disk, if this operation succeeds I can update the paths in the database. If database transaction fails, I can delete the files from the disk. I know that is still not bulletproof; an electricity shortage might occur just after the db transaction or the disk might not be responsive for a while etc...I know there are also concurrency issues, for instance if one user tries to modify the uploaded image and another tries to delete it at the same time, there will be some problems. Still the chances for concurrent updates in my application will be relatively low.
I believe I can live with orphan files on the disk or orphan image paths on the db if such exceptional cases occur. If a file path exists in db and not in the file system, I can show a notification to the user on report page and he might try to reupload the image. Orphan files in the file system would not be too much problem, I might run a process to detect such files time to time. Still, I am not very comfortable with this approach.
3- The last option might be to not store file paths in the db at all. I can structure the filesystem such that I can infer the file path in code and load all images at once. For instance, I can create a folder with the name of report id for each report. When a request has been made to load images of the report, I can load the images at once since I know the report id. That might end up with huge number of folders in the filesystem and I am not sure if such a design is acceptable. Concurrency issues will still exist in this scheme.
I would appreciate some advice on which approach I should follow.
I believe you are trying to be ultra-correct, and maybe not that much is needed, but I also faced some similar situation some time ago and explored also different possibilities. I disliked options aligned to your option 1, but about the 2 and 3, I had different successful approaches.
Let's sum up first the list of concerns:
You want the file to be saved
You want the file path to be linked to the corresponding entity (i.e the report)
You don't want a file path to be linked to a file that doesn't exist
You don't want files in the filesystem not linked to any report
And the different approaches:
1. Using DB
You can assure transactions in the DB pretty much with any relational database, and with S3 you can ensure read-after-write consistency for both new objects and upload of new objects. If you PUT an object and you get a 200 OK, it will be readable. Now, how to put all this together? You need to keep track of the process. I can figure 2 ways:
1.1 With a progress table
The upload request is saved to a table with anything need to identify this file, report id, temp uploaded file path, destination path, and a status column
You save the file
If the file safe fails you can update the record in the table, or delete it
If saving the file is successful, in a transaction:
update the progress table with successful status
update the table where you actually save the relationship report-image
Have a cron, but not checking the filesystem, but checking the process table. If there is any file in the filesystem that is orphan, definitely it had been added to the table (it was point 1). Here you can decide if you will delete the file, or if you have enough info, you can continue with the aborted process triggering the point 4.
The same report-image relationship table with some extra status columns.
1.2 With a queue system
Like RabbitMQ, SQS, AMQ, etc
A very similar approach could be done with any queue system instead of a db table. I wont give much details because it depends more on your real infrastructure, but just the general idea.
The upload request goes to a queue, you send a message with anything you may need to identify this file, report id, and if you want a tentative final path.
You upload the file
A worker reads pending messages in the queue and does the work. The message is marked as consumed only when everything goes well.
If something fails, naturally the message will come back to the queue
In the next time a message is read, the worker can have enough info to see if there is work to resume, or even a file to delete if resuming is not possible
In both cases, concurrency problems wont be straightforward to manage, but can be managed (relying on DB locks in fist case, and FIFO queues in second cases) but always with some application logic
2. Without DB
To some extent a system without a database would be perfectly acceptable, if we can defend it as a proper convention over configuration design.
You have to deal with 3 things:
Save files
Read files
Make sure that the structure of the filesystem is manageable
Lets start with 3:
Folder structure
In general, something like one folder for report id will be too simple, and maybe hard to maintain, and also ultimately too plain. This will cause issues, because if we have a folder images with one folder per report, and tomorrow you have less say 200k reports, the images folder will have 200k elements, and even an ls will take too much time, same for any programing language trying to access. That will kill you
You can think about something more sophisticated. Personally like a way that I learnt from Magento 1 more than 10 years ago and I used a lot since then: Using a folder structure following first outside rules, but extended with rules derived extended with the file name itself.
We want to save a product image. The image name is: myproduct.jpg
first rule is: for product images i use /media/catalog/product
then, to avoid many images in the same one, i create one folder per every letter of the image name, up to some number of letters. Lets say 3. So my final folder will be something like /media/catalog/product/m/y/p/myproduct.jpg
like this, it is clear where to save any new image. You can do something similar using your reports id, categories, or anything that makes sense for you. The final objective is to avoid too flat structure, and to create a tree that makes sense to you, and also that can be automatized easily.
And that takes us to the next part:
Read and write.
I implemented a similar system before quite successfully. It allowed me to save files easy, and to retrieve them easily, with locations that were purely dynamic. The parts here were:
S3 (but you can do with any filesystem)
A small microservice acting as a proxy for both read and write.
Some namespace system and attached logic.
The logic is quite simple. The namespace lets me know where the file will be saved. For example, the namespace can be companyname/reports/images.
Lets say a develop a microservice for read and write:
For saving a file, it receives:
namespace
entity id (ie you report)
file to upload
And it will do:
based on the rules I have for that namespace, and the id and file name will save the file in this folder
it doesn't return the physical location. That remains unknown to the client.
Then, for reading, clients will use a URL that uses also convention. For example you can have something like
https://myservice.com/{NAMESPACE}/{entity_id}
And based on the logic, the microservice will know where to find that in the storage and return the image.
If you have more than one image per report, you can do different things, such as:
- you may want to have a third slug in the path such as https://myservice.com/{NAMESPACE}/{entity_id}/1 https://myservice.com/{NAMESPACE}/{entity_id}/2 etc...
- if it is for your internal application usage, you can have one endpoint that returns the list of all eligible images, lets say https://myservice.com/{NAMESPACE}/{entity_id} returns an array with all image urls
How I implemented this was with quite simple yml config to define the logic, and very simple code reading that config. That allowed me to have a lot of flexibility. For example save reports in total different paths or servers or s3 buckets if they belong to different companies or are different report types
I am working on an application where I am using the paid services of Google app engine. In the application I am parsing a large xml file and trying to extracting data to the datastore. But while performing this task GAE is throwing me an error as below.
I also tried to change the performance setting by increasing frontend instance class from F1 to F2.
ERROR:
Exceeded soft private memory limit of 128 MB with 133 MB after servicing 14 requests total.
After handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
Thank you in advance.
When you face the Exceeded soft private memory limit error you have two alternatives to follow:
To upgrade your instance to a more powerful one, which gives you more memory.
To reduce the chunks of data you process in each request. You could split the XML file to smaller pieces and keep the smaller instance doing the work.
I agree with Mario's answer. Your options are indeed to either upgrade to an Instance class with more memory such as F2 or F3 or process these XML files in smaller chunks.
To help you decide what would be the best path for this task, you would need to know if these XML files to be processed will grow in size. If the XML file(s) will remain approximately this size, you can likely just upgrade the instance class for a quick fix.
If the files can grow in size, then augmenting the instance memory may only buy you more time before encountering this limit again. In this case, your ideal option would be to use a stream to parse the XML file(s) in smaller units, consuming less memory. In Python, xml.sax can be used to accomplish just that as the parse method can accept streams. You would need to implement your own ContentHandler methods.
In your case, the file is coming from the POST request but if the file were coming from Cloud Storage, you should be able to use the client library to stream the content through to the parser.
I had a similar problem, almost sure it's my usage of /tmp directory was causing it, this directory is mounted in memory which was causing it. So, if you are writing any files into /tmp don't forget to remove them!
Another option is that you actully have a memory leak! It says after servicing 14 requests - this means the getting more powerful instance will only delay the error. I would recommend cleaning memory, now I don't know what your code looks like, I'm trying following with my code:
import gc
# ...
#app.route('/fetch_data')
def fetch_data():
data_object = fetch_data_from_db()
uploader = AnotherHeavyObject()
# ...
response = extract_data(data_object)
del data_object
del uploader
gc.collect()
return response
After trying things above, now it seems that issue was with FuturesSession - related to this https://github.com/ross/requests-futures/issues/20. So perhaps it's another library you're using - but just be warned that some of those libraries leak memory - and AppEngine preserves state - so whatever is not cleaned out - stays in memory, and affects following requests on that same instance.
I'm trying to process large (~50mb) sized xml files to store in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even straight up uploading the file within my source code, but again keep running into limits (i.e. the 32 mb limit).
So, I'm really confused (and a little angry/frustrated). Does appengine really have no real way to process a large file? There does seem to be one potential work around, which would involve remote_apis, amazon (or google compute I guess) and a security/setup nightmare...
Http ranges was another thing I considered, but it'll be painful to somehow connect the different splitted parts together (unless I can manage to split the file at exact points)
This seems crazy so I thought I'd ask stackover flow... am I missing something?
update
Tried using range requests and it looks like the server I'm trying to stream from doesn't use it. So right now I'm thinking either downloading the file, hosting it on another server, then use appengine to access that via range http requests on backends AND then automate the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in the cloud storage and reading it incrementally, as you can access it line by line (in Python anyway) so it wont' consume all resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write
files to buckets in Google Cloud Storage (GCS). This library supports
reading and writing large amounts of data to GCS, with internal error
handling and retries, so you don't have to write your own code to do
this. Moreover, it provides read buffering with prefetch so your app
can be more efficient.
The GCS client library provides the following functionality:
An open method that returns a file-like buffer on which you can invoke
standard Python file operations for reading and writing. A listbucket
method for listing the contents of a GCS bucket. A stat method for
obtaining metadata about a specific file. A delete method for deleting
files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
def read_file(self, filename):
self.response.write('Truncated file content:\n')
gcs_file = gcs.open(filename)
self.response.write(gcs_file.readline())
gcs_file.seek(-1024, os.SEEK_END)
self.response.write(gcs_file.read())
gcs_file.close()
Incremental reading with standard python!
I want to be able to copy the file I have which comes in as XML into a new folder location on the server. Essentially I want to hold a back up of the input files in a new folder.
What I have done so far is try to follow what has been said on this forum post - link text
At first I tried the last method which didn't do anything (file renaming while reading). So I tried one of the other options and altered the orchestration and put a Send shape just after the Receive shape. So the same message that comes in is sent out to the logical port. I export the MSI, and I have created a Send Port in the Admin console which has been set to point to my copy location. It copies the file but it continues to create one every second. The Event Viewer also reports warnings saying "The file exists". I have set the Copy Mode of the port to 'overwrite' and 'Create New', both are not working.
I have looked on Google but nothing helps - BTW I support BizTalk but I have no idea how pipelines, ports work. So any help would be appreciated.
thanks for the quick responses.
As David has suggested I want to be able to track the message off the wire before BizTalk does any processing with it.
I have tried to the CodePlex link that Ben supplied and its points to 'Atomic-Scope's BizTalk Message Archiving Pipeline Component' which looks like my client will have to pay for. I have downloaded the trial and will see if I have any luck.
David - I agree that the orchestration should represent the business flow and making a copy of a file isn't part of the business process. I just assumed when I started tinkering around I could do it myself in the orchestration as suggested on the link I posted.
I'd also rather not rely on the BizTalk tracking within the message box database as I suppose the tracked messages will need to be pruned on a regular basis. Is that correct or am I talking nonsense?
However is there a way I can do what Atomic-Scope have done which may be cheaper?
**Hi again, I have figured it out from David's original post as indicated I also created a Send port which just has a "Filter" expression like - BTS.ReceivePortName == ReceivePortName
Thanks all**
As the post you linked to suggests there are several ways of achieving this sort of result.
The first question is: What do you need to track?
It sounds like there are two possible answers to that question in your case, which I'll address seperately.
You need to track the message as received off the wire before BizTalk touches it
This scenario often arises where you need to be able to prove that your BizTalk solution is not the source of any message corruption or degradation being seen in messages.
There are two common approaches to this:
Use a pipeline component such as the one as Ben Runchey suggests
There is another example of a pipeline component for archiving here on codebetter.com. It looks good - just be careful if you use other components, and where you place this component, that you are still following BizTalk streaming model proper practices. BizTalk pipelines are all forwardonly streaming, meaning that your stream is readonly once, and all the work on them the happens in an eventing manner.
This is a good approach, but with the following caveats:
You need to be careful about the streaming employed within the pipeline component
You are not actually tracking the on the wire message - what your pipeline actually sees is the message after it has gone through the BizTalk adapter (e.g. HTTP adapter, File etc...)
Rely upon BizTalk's out of the box tracking
BizTalk automatically persists all messages to the message box database and if you turn on BizTalk tracking you can make BizTalk keep these messages around.
The main downside here is that enabling this tracking will result in some performance degradation on your server - depending on the exact scenario, this may not be a huge hit, but it can be signifigant.
You can track the message after it has gone through the initial receive pipeline
With this approach there are two main options, to use a pure messaging send port subscribing to the receive port, to use an orchestration send port.
I personally do not like the idea of using an orchestration send port. Orchestrations are generally best used to model the business flow needed. Unless this archiving is part of the business flow as understood by standard users, it could simply confuse what does what in your solution.
The approach I tend to use is to create a messaging send port in the BizTalk admin console that subscribes to your receive port. The send port will then just use a standard BizTalk file adapter, with a pass through pipeline.
I think you should look at the Biztalk Message Archiving pipeline component. You can find it on Codeplex (http://www.codeplex.com/btsmsgarchcomp).
You will have to create a new pipeline and deploy it to your biztalk group. Then update your receive pipeline to archive the file to a location that the host this receive location is running under has access to.
I have a Silverlight application that needs to upload large files to the server. I've looked at uploading using both WebClient as well a HttpWebRequest, however I don't see an obvious way stream the upload with either option. Do to the size of the files, loading the entire contents into memory before uplaoding is not reasonable. Is this possible in Silverlight?
You could go with a "chunking" approach. The Silverlight File Uploader on Codeplex uses this technique:
http://www.codeplex.com/SilverlightFileUpld
Given a chunk size (e.g. 10k, 20k, 100k, etc), you can split up the file and send each chunk to the server using an HTTP request. The server will need to handle each chunk and re-assemble the file as each chunk arrives. In a web farm scenario when there are multiple web servers - be careful to not use the local file system on the web server for this approach.
It does seem extraordinary that the WebClient in Silverlight fails to provide a means to pump a Stream to the server with progress events. Its especially amazing since this is offered for a string upload!
It is possible to code what would be appear to be doing what you want with a HttpWebRequest.
In the call back for BeginGetRequestStream you can get the Stream for the outgoing request and then read chunks from your file's Stream and write them to the output stream. Unfortunately Silverlight does not start sending the output to the server until the output stream has been closed. Where all this data ends up being stored in the meantime I don't know, its possible that if it gets large enough SL might use a temporary file so as not to stress the machines memory but then again it might just store it all in memory anyway.
The only solution to this that might be possible is to write the HTTP protocol via sockets.