Where does Jackrabbit store its tree information?

I am a newbie to Jackrabbit, and I wonder where Jackrabbit stores its tree information. I want to be able to access the tree information even after I restart my program; it should behave as if the tree were stored permanently in the file system. But right now, if I stop my program, all the Jackrabbit tree information is lost. How can I solve this problem?
Thanks!

Jackrabbit can use different PersistenceManagers to store the content. If you're using a TransientRepository, for example, the content isn't persisted, as that repository type is meant for testing only.
See http://wiki.apache.org/jackrabbit/PersistenceManagerFAQ for more info.
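For orientation, the persistence manager is configured per workspace in repository.xml. Below is a minimal sketch using Jackrabbit 2.x's embedded-Derby persistence manager; treat the exact class name and parameters as something to verify against the FAQ for your version.

    <!-- repository.xml, workspace section (sketch): persist content in an
         embedded Derby database so it survives restarts. ${wsp.home} and
         ${wsp.name} are expanded by Jackrabbit itself. -->
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
      <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
      <param name="schemaObjectPrefix" value="${wsp.name}_"/>
    </PersistenceManager>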

Related

How to pipe the complete graph to Giraph through the TinkerPop 3 stack?

I have a graph with different types of nodes and relationships, and each type of node has 3-4 properties. For testing purposes on HDFS, I'm using a GraphSON file to store this graph. Now I want to analyse this graph using Giraph. I've explored Giraph's IO classes and also found that Gremlin can load GraphSON directly. So could you please explain how to load the graph into Giraph using the TinkerPop stack?
See the Giraph sample in the docs; it does almost exactly what you're looking for. Instead of hadoop-gryo.properties, use hadoop-graphson.properties (and, of course, adjust the input location setting, gremlin.hadoop.inputLocation, in the configuration file).
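For reference, a hadoop-graphson.properties along these lines should work (the input path is a placeholder, and the reader/writer property names vary slightly across TinkerPop 3.x releases, so compare with the sample files shipped with your Gremlin distribution):

    # hadoop-graphson.properties (sketch): read the graph as GraphSON.
    # Older TinkerPop 3.x releases use gremlin.hadoop.graphInputFormat /
    # graphOutputFormat instead of graphReader / graphWriter.
    gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
    gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
    gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
    gremlin.hadoop.inputLocation=data/my-graph.json
    gremlin.hadoop.outputLocation=output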

Modifying the EXT2 Filesystem

For a project I'm working on, I need to be able to modify the EXT2 filesystem. I have done extensive research, but as this is not a commonly required task, there seems to be very little helpful information available online. Unfortunately, due to confidentiality, I cannot go into specifics of what I am working on, but I can break the situation down to several key problems that I would be very grateful for assistance on. I should note that I have very little experience with Linux kernel/OS development.
I realise that this is not an easy problem to tackle, and it is definitely going to be a big challenge seeing as this is only a small part of the project. Any assistance with these problems (or any other warnings/advice) is greatly appreciated.
If this turns out to be impossible, then the entire project will have to be rethought, hence why I am starting at this point.
Problem 1: From source code to implementation
This is not really the first problem, but I need to answer it first in order to make sure I'm on the right track at all! Remember that I am a novice, so this may be a silly question.
From what I can tell, all EXT2 source code is contained within the kernel source code at linux/fs/ext2/. If I were to modify this source code to make the changes I require, then successfully rebuild the kernel, what would it take to make a drive or partition use this newly modified filesystem? Would I just be able to reformat a drive as EXT2, using the new kernel, and the modified filesystem would be applied to it due to the modified source code? Or am I oversimplifying this?
Problem 2: Extending the metadata
A vital part of my project requires me to extend the metadata that each file on the drive contains. In this case, metadata refers to details such as owner, created timestamp, modified timestamp etc. What I would like to be able to do is add and be able to update an additional metadata field. I would think the best approach would be to modify the inode, but after looking through the source code for a long time, I can't see anywhere obvious to start.
Problem 3: Consequences?
Assume I am successful in adding and using this field on the modified filesystem. If a file was then moved to another drive, with an unmodified EXT2 filesystem, would the data contained in this extra metadata field be lost? Obviously files are moved between different filesystems all the time with no problem, but I am unsure as to how they handle this. I should note that it is not required to be able to access this metadata on any other system, I only require that the data not be lost.
Bonus question
So far I've been using CentOS for my prototype. If there is any advice as to whether this is good/bad/doesn't matter, I would appreciate it.
EDIT/UPDATE TO CLARIFY PROBLEM
As I said previously, confidentiality prevents me discussing exactly what the end-game is. Here is a very basic problem, which should hopefully explain more of the kind of things I need to be able to do.
Traditional filesystems keep track of three main timestamps: created, last accessed, last modified. Let's say I wanted to add a new type of metadata, called "modify history", which stores a list of timestamps for each and every modification. The only way I can think to do this would be to add another attribute pointing to a datablock, to which timestamps are appended every time a write is made to that file/inode.
Again, the actual problem is far more complex, but this hopefully gives a better idea of the kind of thing I need to be able to do.

Convert Plone database to CSV or SQL

I am helping out an organization that is planning to change its membership system. Right now their system is developed in Plone and all their data is in a Data.fs file.
Their system is down for the moment and it would take some time and effort to get it up and running.
Is there a way to get the data out of the database into a standard format such as CSV files or SQL? Or do they need to get the system up and running first and export the files from "within" Plone?
Thanks for your help and ideas!
Kind regards,
Samuel
The Data.fs file is an object-oriented database file, written by a framework called ZODB. The data within it represents Python instances, laid out in a tree structure.
You could open this database from a python script, but in order for you to make sense of the contained structures, you'll need access to the original class definitions that make up the stored instances. Without those class definitions all you'll get is placeholder objects (Broken objects) that are of no use at all.
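For what it's worth, opening the file itself is only a few lines; it's making sense of the objects that requires the original classes. A minimal sketch, assuming the ZODB package is installed and the Data.fs is not in use by a running instance:

    # Peek inside a Data.fs file (run with the ZODB package installed).
    from ZODB.FileStorage import FileStorage
    from ZODB.DB import DB

    storage = FileStorage('Data.fs', read_only=True)  # never touch a live file
    db = DB(storage)
    connection = db.open()
    root = connection.root()

    # With Plone's class definitions importable you get real objects here;
    # without them, ZODB substitutes Broken placeholder objects.
    print(list(root.keys()))

    connection.close()
    db.close()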
As such, it's probably easier to just get the Plone instance back up and running: it'll be easier to export exactly the data you want if you have things like the catalog (basically a specialized database index) available to build your export.
It could be that this site is down because of something trivial, something we can help you with here on Stack Overflow, on the Plone users mailing lists, or in the #plone IRC channel. If you do get it up and running and have some details on what you are trying to export, we can certainly help.
You'll need to get the system up and running to export data. Data in the Data.fs file is stored as Python pickles and is not intelligible to "outside" systems.
As others have pointed out, your best course is to get Plone running again. After doing so, try csvreplicata to export existing content to CSV format, and for user accounts, try atreal.usersinout.
If you need professional help, you can search for available providers at http://plone.org/support/providers
For free support, post specific problems here.
Recently I managed to export a Plone 4 site to SQLite using SQLExporter: http://plone.org/products/proteon.sqlexporter. But you need to get your Plone instance working first to use it.

Need ideas on retrieving data from a website

I'm stumped and need some ideas on how to do this or even whether it can be done at all.
I have a client who would like to build a website tailored to English-speaking travelers in a specific country (Thailand, in this case). The different modes of transportation (bus and train) have good websites providing their respective information, and both are very static in terms of the data they present (the schedules rarely change). Here's one of the sites I would need to get info from: train schedules. The client wants to give users the ability to search for a beginning and end location and determine, using the external websites' information, how they can best get there, receiving a route with schedule times for the chosen modes of transport.
Now, in my limited experience, I would think the way to do that would be to retrieve the original schedule info from the external site's server (via an API or some other means) and keep it in a database that can be queried as needed. Our first thought was to contact the respective authorities to determine how/if this can be done, but this has proven problematic, mainly due to the language barrier.
My client suggested what is basically "screen scraping": downloading the web page(s) and filtering through the HTML for the relevant data to put into the database, which sounds complicated at best. My worry is that the info on these mostly static sites is so static that the data isn't even kept in a database to build the pages; the web pages themselves may simply be updated (hard-coded) whenever something changes.
I could really use some help and suggestions here. Thanks!
Screen scraping is always problematic IMO, as you are at the mercy of the person who wrote the page. If the content is static, then I think it would be easier to copy the data manually to your database. If you wanted to keep up to date with changes, you could snapshot the page when you transcribe the info and run a job that periodically checks whether the page has changed from the snapshot; when it does, it sends you an email so you can update the data.
The above method could also be used in conjunction with some sort of screen scraper, which could fall back to a manual process if the page changes too drastically.
Ultimately, it is a case of how much effort (cost) your client is willing to bear for accuracy.
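A minimal sketch of that change-detection job (the URL and email addresses are placeholders, and hashing the whole page is just one approach; in practice you'd hash only the relevant table so ads or timestamps don't trigger false alarms):

    # Compare a page against a stored snapshot and email when it changes.
    import hashlib
    import smtplib
    from email.message import EmailMessage
    from urllib.request import urlopen

    URL = 'http://example.com/train-schedule'   # placeholder URL
    SNAPSHOT_FILE = 'schedule.sha256'

    page = urlopen(URL).read()
    digest = hashlib.sha256(page).hexdigest()

    try:
        with open(SNAPSHOT_FILE) as f:
            old_digest = f.read().strip()
    except FileNotFoundError:
        old_digest = None                       # first run: no snapshot yet

    if digest != old_digest:
        msg = EmailMessage()
        msg['Subject'] = 'Schedule page changed, please re-check the data'
        msg['From'] = 'monitor@example.com'     # placeholder addresses
        msg['To'] = 'you@example.com'
        msg.set_content('The page at %s no longer matches the snapshot.' % URL)
        with smtplib.SMTP('localhost') as smtp: # assumes a local mail relay
            smtp.send_message(msg)
        with open(SNAPSHOT_FILE, 'w') as f:     # record the new snapshot
            f.write(digest)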
I have done this for the following site: http://www.buscatchers.com/, so it's definitely more than doable! A key feature of a web scraping solution for travel sites is that it must send you emails if anything goes wrong during the scraping process. On the site, I use a two-day window, so I have two days to fix the code if the design changes. Only once or twice have I had to change my code, and it's very easy to do.
As for examples, there is some simplified source code here: http://www.buscatchers.com/about/guide, and the full source code for the project is here: https://github.com/nicodjimenez/bus_catchers. This should give you some ideas on how to get started.
I can tell that the data is dynamic; it's too well structured to be hand-coded. It's not hard for someone who is familiar with XPath to scrape this site.
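As a rough illustration of the XPath approach (a sketch assuming the requests and lxml packages; the URL and the XPath expressions are hypothetical and would need adapting to the actual page structure):

    # Pull rows out of a schedule table with lxml + XPath.
    import requests
    from lxml import html

    page = requests.get('http://example.com/train-schedule')  # placeholder URL
    tree = html.fromstring(page.content)

    # Hypothetical expression: every row of the first table on the page.
    for row in tree.xpath('//table[1]//tr'):
        cells = [cell.text_content().strip() for cell in row.xpath('./td')]
        if cells:  # skip header/empty rows
            print(cells)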

storing database values in source control

We have a table in our database that stores XSLs and XSDs that are applied to XML documents created in our application. This table is versioned in the sense that each time a change is made, a new row is created.
I'm trying to propose that we store the XSLs and XSDs as files in our source control system instead of relying on the database to track the history. Each time a file is updated, we would deploy the new version to the database.
I don't seem to be getting much agreement on the issue. Can anyone help me out with the pros and cons of this approach? Perhaps I'm missing something.
XSL and XSD files are part of the application and so ought to be kept under source control. That's just obvious. Even if somebody wanted to categorise them as data, they would be reference data and so, in my book at least, would need to be kept under source control. This is because reference data is part of the application and so part of its configuration. For instance, applications which use the database to store values for drop-downs or to implement business rules need to be certain that it holds the right version of the data.
The only argument for keeping multiple versions of the files in the database would be if you might need to process older versions of the XML files. This depends on the nature of your application. Certainly I have worked on systems where XML files/messages came from external (third-party) systems, where we really had no control over the format of the messages sent. So for a variety of reasons we needed to be able to handle incoming XML regardless of whether its structure was current or historical. But that is in addition to storing the files in a source control repository, not instead of it.
