Technology for long-term archiving (LTA) of digitally signed documents - archive

Imagine that you have thousands or millions of documents signed in CAdES, XAdES or PAdES format. The signing certificate for an end user is typically issued for 1-3 years. After a few years, the certificate will expire, the revocation data (CRLs) required for verification will no longer be available, and the original crypto algorithms will not guarantee anything after 10-20 years.
I am curious whether there is a mature, ready-to-use solution for this. I know that this can be handled by archive timestamps, but I need a real product which will automatically maintain the data required for long-term validation, add timestamps automatically, etc.
Could you recommend an application or library? Is it a standalone solution or something that can be integrated with FileNet or a similar system?

The EU is currently promoting advanced electronic signatures based on the CAdES, XAdES and PAdES standards. These were specifically designed to make long-term archiving and validation possible.
CAdES is based on CMS, XAdES on XML-DSig, and PAdES on the signatures defined in ISO 32000-1, which are themselves based on CMS.
One open source solution for XAdES is the Belgian eID project; you could have a look at that.
These are all standards for signatures. They do not, however, go into detail on how you would actually implement an archiving solution; that would still be up to you.

"These are all standards for signatures. They do not, however, go into detail on how you would actually implement an archiving solution; that would still be up to you."
However, this is exactly what I am looking for. It seems that the Belgian eID project mentioned above does not address it at all. (I added some clarification to my original question.)

You may find this web site helpful. It's an official site even though it's pointing to an IP address. The site discusses your problem in detail and offers a great deal of advice on dealing with long-term electronic record storage through a standards-based approach.
The VERS standard is quite extensive and fully supports digital signatures and how best to deal with expired signatures.
The standard is also being adopted by leading EDMS/ECM providers.

If I understood your question correctly: our SecureBlackbox components support the XAdES, PAdES and CAdES standards, pull the necessary revocation information (and timestamps), and embed them into the signature automatically.
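The maintenance job the question asks for (adding a fresh archive timestamp before the previous one stops being trustworthy) can be sketched as a periodic check. This is only a minimal sketch, not a product: the `ArchivedDoc` record, the `WEAK_ALGORITHMS` set and the 180-day `SAFETY_MARGIN` are assumptions made up for illustration. A real system would read these values from the signatures themselves and from its own archiving policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical policy values; a real archive would define its own.
WEAK_ALGORITHMS = {"md5", "sha1"}
SAFETY_MARGIN = timedelta(days=180)


@dataclass
class ArchivedDoc:
    doc_id: str
    last_timestamp_digest: str   # digest algorithm of the newest archive timestamp
    tsa_cert_expires: date       # expiry date of the timestamping certificate


def needs_new_archive_timestamp(doc: ArchivedDoc, today: date) -> bool:
    """Return True if the newest archive timestamp will soon stop protecting the document."""
    if doc.last_timestamp_digest.lower() in WEAK_ALGORITHMS:
        return True
    return doc.tsa_cert_expires - today <= SAFETY_MARGIN


def select_for_renewal(docs, today):
    """Return the ids of the documents to re-timestamp in this maintenance run."""
    return [d.doc_id for d in docs if needs_new_archive_timestamp(d, today)]
```

The check only decides which documents are due. Fetching revocation data and requesting the new timestamp would be done by whatever signature library or product is chosen.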

Related

Database of scientific paper abstracts

I am trying to find a database with scientific papers which will allow me to:
1. Get metadata of papers by doi (including abstracts);
2. Do this regularly (e.g. with daily updates);
3. Download the whole existing database.
I know about the Crossref API; however, only 3% of the publications it lists have an abstract (and none of the biggest publishers, like Springer or Elsevier, provide them). On the other hand, I see projects like Dimensions or Researcher which have already implemented this functionality. So the question is: does anybody know of such services (possibly not free) and have experience working with them?
Have you looked at Semantic Scholar (https://www.semanticscholar.org/)? They have an API that supports the first of your requirements (http://api.semanticscholar.org/) and also provide the "Open Research Corpus" (http://labs.semanticscholar.org/corpus/) which should satisfy your third requirement. It is a smaller database than what is provided by Scopus or Web of Science, but both of those require subscriptions to fully use their APIs and don't (as far as I know) have a real way for you to purchase a full download of the database.
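For comparison, the Crossref lookup mentioned in the question can be sketched as follows. This is a minimal sketch that assumes the public Crossref REST endpoint `https://api.crossref.org/works/{doi}` and its `message.abstract` field, which holds a JATS XML fragment when an abstract is present. The helper names and the sample DOI used with them are made up for illustration.

```python
import json
import re
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def work_url(doi: str) -> str:
    """Build the Crossref lookup URL for one DOI."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def extract_abstract(response: dict):
    """Return the plain-text abstract from a Crossref response, or None if absent."""
    raw = response.get("message", {}).get("abstract")
    if raw is None:
        return None
    # Abstracts arrive as JATS XML fragments; strip the tags to get plain text.
    return re.sub(r"<[^>]+>", "", raw).strip()


def fetch_abstract(doi: str):
    """Fetch one record over the network and return its abstract, if any."""
    with urllib.request.urlopen(work_url(doi)) as resp:
        return extract_abstract(json.load(resp))
```

Because most records carry no abstract, as the question points out, a caller should expect `None` most of the time and fall back to another source.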

Genetic Algorithm vs Expert System

I'm having some doubts about which system I should use for a new piece of software.
No code has been written yet; I'm just breaking down all the requirements and will only then start coding.
This will be implemented in a computer company that provides services for other companies, onsite and remotely.
These are my variables:
Number of technicians
Location of customer
Type of problem
Services already scheduled for the technician
Expertise of the technician about the situation
Customer priority
Maybe some are missing, but these are the most important ones.
This job is currently done manually and, as humans, we sometimes fail to see the best route to take.
Let's say that a customer calls with a printer problem.
First, check which tech knows about printers.
Then: is the tech available? Are they far from the customer? Can it be done remotely (software issues)?
Can it be done by another tech who is closer to the customer's location?
Does this customer have higher priority than the other customer the same tech should be going to?
Is the technician's schedule full? If yes, pass it to another printer/hardware tech.
I know my English is not perfect (it is not my native language), but I'll try to provide more details or correct the text as needed.
So, my question is this: what kind of approach would you take? A genetic algorithm seems nice for this kind of job, and I also have some experience with GAF and WatchMaker (a Java GA framework). However, when reading the text above, an expert system also seems appropriate.
Has anyone done something like this? I have searched for this kind of software and couldn't find anything like it.
Would another approach be better than the two mentioned?
Also, I'm building a table with all the techs' capabilities and expertise, using simple ratings such as 1 to 5 for each area of expertise. This is also a decision factor.
Thanks.
Why not do both? Use an expert system (a rule engine) to define your constraints and use a metaheuristic (such as Local Search or Genetic Algorithms) to solve it. The planning engine OptaPlanner (Java, open source) does exactly that (by using the rule engine Drools). The architecture looks like this:
Here's a video demonstrating the constraint flexibility on the vehicle routing problem (VRP). Your problem seems to be an advanced variant on VRP (which is a variant on TSP).
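The "rules plus metaheuristic" split can be illustrated with a toy example. This is a minimal sketch in plain Python, not OptaPlanner or Drools code. The `Tech` and `Job` records, the rule weights and the simple hill-climbing loop are all assumptions chosen for illustration. The `score` function plays the role of the rule engine (hard and soft constraints) and `local_search` plays the role of the solver.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Tech:
    name: str
    location: tuple
    skills: dict      # problem type -> expertise rating, 1 to 5
    max_jobs: int


@dataclass
class Job:
    customer: str
    location: tuple
    problem: str
    priority: int     # higher means more important


def score(assignment, techs, jobs):
    """Rule part: return (hard, soft) penalties; lower is better, hard == 0 is feasible."""
    hard, soft = 0, 0.0
    load = {t.name: 0 for t in techs}
    for job, tech in zip(jobs, assignment):
        load[tech.name] += 1
        rating = tech.skills.get(job.problem, 0)
        if rating == 0:
            hard += 1                                   # tech must know the problem type
        soft += math.dist(tech.location, job.location)  # prefer nearby techs
        soft -= job.priority * rating                   # best experts to top customers
    hard += sum(max(0, load[t.name] - t.max_jobs) for t in techs)  # schedule not overfull
    return hard, soft


def local_search(techs, jobs, steps=2000, seed=0):
    """Search part: hill climbing that reassigns one job at a time."""
    rng = random.Random(seed)
    best = [rng.choice(techs) for _ in jobs]
    best_score = score(best, techs, jobs)
    for _ in range(steps):
        candidate = list(best)
        candidate[rng.randrange(len(jobs))] = rng.choice(techs)
        cand_score = score(candidate, techs, jobs)
        if cand_score <= best_score:    # tuples compare the hard penalty first
            best, best_score = candidate, cand_score
    return best, best_score
```

Keeping the rules in one scoring function means a new constraint (for example, remote-only jobs) changes only `score`, while the search loop stays untouched. A genetic algorithm could replace `local_search` without touching the rules.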
Maybe you can start off with TSP, described here: http://en.m.wikipedia.org/wiki/Travelling_salesman_problem
I guess it only deals with the distance, though.

Are any Health Information Exchanges' APIs documented?

I was uncertain which Stack Exchange site was the correct one to ask this on, but since it's about APIs I just went with Stack Overflow.
In the US, more and more states and companies are currently setting up Health Information Exchanges to electronically exchange records between different hospitals, practices, etc. What I'm wondering is: are any of these protocols, APIs, etc. documented anywhere? Off and on over the last few weeks I've tried to find anything, from any state, detailing how these work specifically, but I cannot find anything. I do find vague references to "documentation" and "standards," with no detail on the protocols, encoding, etc.
It may be a case of just not searching with the correct terminology, though part of me is beginning to suspect that none are documented anywhere.
Time for an acronym stew.
I'm not aware of any specific products/platforms provided by specific HIE vendors that expose public APIs. But there are a variety of standards in the HIT community that are commonly used by HIEs:
The HL7 standards define a large number of data exchange and message formats for all sorts of patient health information. HL7 v2 is a custom delimited format. HL7 v3 is an XML format. Both have similar semantics. This is commonly used to exchange health information with an HIE. Note that this is a very broad standard and HL7 messages are highly subject to interpretation or customization in terms of which individual elements are required or utilized by each vendor.
CCD and CCR are also commonly used for exchange of health data, especially in conjunction with PHR (Personal Health Record) systems such as HealthVault.
LOINC and SNOMED are sets of standard names and identifiers used, among other places, in HL7 messages.
I've often seen SAML used in SOAP messages to provide additional security.
SAML only provides authentication/authorization support. HL7 is not encrypted, so for HIPAA compliance when communicating between enterprises you need to either encrypt the connection via SSL or a VPN, or use an application-layer encryption solution such as CloudPrime.
Disclosure: I am an advisor to CloudPrime.
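To make the "custom delimited format" of HL7 v2 mentioned above concrete, here is a minimal sketch of splitting a message into segments, fields and components. It assumes the usual layout (segments separated by carriage returns, with the field and component separators declared at the start of the MSH segment) and ignores repetitions, sub-components and escape sequences. The sample message is invented for illustration and does not come from any real system.

```python
# Invented sample message: an MSH header segment and one PID (patient) segment.
SAMPLE = (
    "MSH|^~\\&|SENDAPP|SENDFAC|RECVAPP|RECVFAC|20240101120000||ADT^A01|MSG0001|P|2.5\r"
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19800101|F"
)


def parse_hl7_v2(message: str) -> dict:
    """Return {segment name: [segments]}, each segment a list of fields split into components."""
    field_sep = message[3]        # MSH-1: the character right after "MSH"
    component_sep = message[4]    # first of the encoding characters in MSH-2
    segments = {}
    for line in filter(None, message.split("\r")):
        fields = line.split(field_sep)
        name = fields[0]
        parsed = [
            # Keep the encoding-characters field (MSH-2) intact instead of splitting it.
            [f] if (name == "MSH" and i == 1) else f.split(component_sep)
            for i, f in enumerate(fields)
        ]
        segments.setdefault(name, []).append(parsed)
    return segments


parsed = parse_hl7_v2(SAMPLE)
message_type = parsed["MSH"][0][8]    # MSH-9; MSH indexes are shifted by one
patient_name = parsed["PID"][0][5]    # PID-5
```

As the answer notes, which fields are actually populated varies by vendor, so real integrations usually rely on an HL7 library plus vendor-specific mapping rather than a hand-written splitter like this.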

How to get book metadata?

My application needs to retrieve information about any published book based on a provided ISBN, title, or author. This is hardly a unique requirement---sites like Amazon.com, Chegg.com, and even software like Book Collector seem to be able to do this easily. But I have not been able to replicate it.
To clarify, I do not need to search the entire database of books---only a limited subset which have been inputted, as in a book collection. The database would simply allow me to tag the inputted books with the necessary metadata to enable search on that subset of books. So scale is not the issue here---getting the metadata is.
The options I have tried are:
Scrape Amazon. Scraping the regular Amazon pages was not very robust to things like missing authors, and while scraping the smaller mobile pages was faster, they shared the same issues with robustness of extraction. Plus, building this into an application is a clear violation of Amazon's Terms of Service.
Scrape the Library of Congress. While this seems to have fewer legal ramifications, ease and robustness were again issues.
ISBNdb.com API. While the service is free up to a point, and does a good job of returning the necessary metadata, I need to do this for over 500 books on a daily basis, at which point this service costs money proportional to use. I'd prefer a free or one-time payment solution that allows me to do the same.
Google Book Data API. While this seems to provide the information I need, I cannot display the book preview as their terms of service require.
Buy a license to a database of books. For example, companies like Ingram or Baker & Taylor provide these catalogs to retailers and libraries. This solution is obviously expensive, so I'm hoping that there's a more elegant solution I've missed. But if not, and someone on SO has had a good experience with a particular database, I'm willing to go with that.
I've tried to describe my approach in detail so others with fewer books can take advantage of the above solutions. But given my requirements, I'm at my wits' end for retrieving book metadata.
Since it is unlikely that you have to retrieve the same 500 books every day: store the data retrieved from isbndb.com in a database and fill it up book by book.
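A minimal sketch of that caching idea, using SQLite from the Python standard library. The `fetch` callable stands in for whatever remote lookup is used (ISBNdb or another service). It is a hypothetical placeholder, not a real client, and the table layout is likewise just an assumption.

```python
import json
import sqlite3


class BookCache:
    """Store metadata per ISBN so each book is requested from the remote service only once."""

    def __init__(self, path=":memory:", fetch=None):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS books (isbn TEXT PRIMARY KEY, data TEXT NOT NULL)"
        )
        self.fetch = fetch    # hypothetical callable: isbn -> dict of metadata

    def get(self, isbn: str) -> dict:
        key = isbn.replace("-", "").strip()    # normalise so hyphenated forms share one entry
        row = self.db.execute("SELECT data FROM books WHERE isbn = ?", (key,)).fetchone()
        if row:
            return json.loads(row[0])
        data = self.fetch(key)                 # only reached on a cache miss
        self.db.execute(
            "INSERT INTO books (isbn, data) VALUES (?, ?)", (key, json.dumps(data))
        )
        self.db.commit()
        return data
```

With a file path instead of `":memory:"`, the cache survives restarts, so only books that have never been seen before count against the daily request quota.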
Instead of scraping Amazon, you can use the API they expose for their affiliate program: https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
It allows about 3k requests per hour and returns well-formed XML. It requires you to link to the book whose information you show, and you must state that you are an affiliate partner.
This might be what you're looking for. They even offer a complete download!
https://openlibrary.org/data
It seems that a lot of libraries and other organisations make information such as the ISBN available through MAchine-Readable Cataloging, aka MARC. You can find more information about it here as well.
Now that I knew the right term to search for, I discovered WorldCat.org.
Maybe this whole MARC thing gives you a new kind of an idea :)

Looking for an example of when screen scraping might be worthwhile

Screen scraping seems like a useful tool - you can go onto someone else's site and steal their data - how wonderful!
But I'm having a hard time with how useful this could be.
Most application data is pretty specific to that application even on the web. For example, let's say I scrape all of the questions and answers off of StackOverflow or all of the results off of Google (assuming this were possible) - I'm left with data that is not very useful unless I either have a competing question and answer site (in which case the stolen data will be immediately obvious) or a competing search engine (in which case, unless I have an algorithm of my own, my data is going to be stale pretty quickly).
So my question is, under what circumstances could the data from one app be useful to some external app? I'm looking for a practical example to illustrate the point.
It's useful when a site publicly provides data that is (still) not available as an XML service. I had a client who used scraping to pull flight tracking data into one of his company's intranet applications.
The technique is also used for research. I had a client who wanted to compare the contents of several online dictionaries by part of speech, and all of these sites had to be scraped.
It is not a technique for "stealing" data. All ordinary usage restrictions apply. Many sites implement CAPTCHA mechanisms to prevent scraping, and it is inappropriate to work around these.
A good example is StackOverflow - no need to scrape data as they've released it under a CC license. Already the community is crunching statistics and creating interesting graphs.
There's a whole bunch of popular mashup examples on ProgrammableWeb. You can even meet up with fellow mashupers (O_o) at events like BarCamps and Hack Days (take a sleeping bag). Have a look at the wealth of information available from Yahoo APIs (particularly Pipes) and see what developers are doing with it.
Don't steal and republish, build something even better with the data - new ways of understanding, searching or exploring it. Always cite your data sources and thank those who helped you. Use it to learn a new language or understand data or help promote the semantic web. Remember it's for fun not profit!
Hope that helps :)
If the site has data that would benefit from being accessible through an API (and it would be free and legal to do so), but they just haven't implemented one yet, screen scraping is a way of essentially creating that functionality for yourself.
Practical example -- screen scraping would allow you to create some sort of mashup that combines information from the entire SO family of sites, since there's currently no API.
Well, to collect data from a mainframe. That's one reason why some people use screen scraping. Mainframes are still in use in the financial world, and they often run software that was written in the previous century. The people who wrote it might already be retired, and since this software is very critical for these organizations, they really hate it when some new code needs to be added. So screen scraping offers an easy interface for collecting information from the mainframe and sending it on to any process that needs it.
Rewrite the mainframe application, you say? Well, software on mainframes can be very old. I've seen software on mainframes that was over 30 years old, written in COBOL. Often, those applications work just fine and companies don't want to risk rewriting parts because it might break some code that had been working for over 30 years! Don't fix things if they're not broken, please. Of course, additional code could be written but it takes a long time for mainframe code to be used in a production environment. And experienced mainframe developers are hard to find.
I myself had to use screen scraping too in a software project. This was a scheduling application which had to capture the output to the console of every child process it started. It's the simplest form of screen scraping, actually, and many people don't even realize that if you redirect the output of one application to the input of another, that it's still a kind of screen scraping. :)
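Capturing a child process's console output, as in that scheduling application, can be sketched with the Python standard library. This is only an illustration of the idea, not the original application's code. The child here is a one-line Python program standing in for a real job.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a child process and return (exit code, stdout text, stderr text)."""
    done = subprocess.run(args, capture_output=True, text=True)
    return done.returncode, done.stdout, done.stderr


# Example child: a one-line Python program standing in for a scheduled job.
code, out, err = run_and_capture([sys.executable, "-c", "print('job finished')"])
```

The scheduler would then parse `out` the same way a web scraper parses a page, for example by searching for known status lines.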
Basically, screen scraping allows you to connect one (web) application with another one. It's often a quick solution, used when other solutions would cost too much time. Everyone hates it, but the amount of time it saves still makes it very efficient.
Let's say you wanted to get scores from a popular sports site that did not make the information available through an XML feed or API.
For one project we found a (cheap) commercial vendor that offered translation services for a specific file format. The vendor didn't offer an API (it was, after all, a cheap vendor) and instead had a web form to upload and download from.
With hundreds of files a day the only way to do this was to use WWW::Mechanize in Perl, screen scrape the way through the login and upload boxes, submit the file, and save the returned file. It's ugly and definitely fragile (if the vendor changes the site in the least it could break the app) but it works. It's been working now for over a year.
One example from my experience.
I needed a list of major cities throughout the world with their latitude and longitude for an iPhone app I was building. The app would use that data along with the geolocation feature on the iPhone to show which major city each user of the app was closest to (so as not to show exact location), and plot them on a 3D globe of the earth.
I couldn't find an appropriate list in XML/Excel/CSV type format anywhere easily, but I did find this Wikipedia page with (roughly) the info I needed. So I wrote up a quick script to scrape that page and load the data into a database.
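A quick script of that kind might look like the sketch below. It is not the original script: the sample HTML table and its column order (name, latitude, longitude) are assumptions, the coordinates are approximate, and a real Wikipedia page would need its own row and cell handling. The nearest-city lookup uses the haversine great-circle distance.

```python
import math
from html.parser import HTMLParser

# Made-up sample table; real pages have more columns and markup.
SAMPLE = """<table><tr><th>City</th><th>Lat</th><th>Lon</th></tr>
<tr><td>Paris</td><td>48.8566</td><td>2.3522</td></tr>
<tr><td>Tokyo</td><td>35.6762</td><td>139.6503</td></tr></table>"""


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_cities(html):
    """Return (name, latitude, longitude) tuples, skipping header or malformed rows."""
    parser = TableRows()
    parser.feed(html)
    cities = []
    for row in parser.rows:
        try:
            cities.append((row[0], float(row[1]), float(row[2])))
        except (IndexError, ValueError):
            continue
    return cities


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_city(cities, lat, lon):
    """Return the name of the city closest to the given position."""
    return min(cities, key=lambda c: haversine_km(lat, lon, c[1], c[2]))[0]
```

Reporting only the nearest major city, rather than the raw coordinates, is what keeps each user's exact location private in the app described above.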
Any time you need a computer to read the data on a website. Screen scraping is useful in exactly the same instances that any website API is useful. Some websites, however, don't have the resources to create an API themselves; screen scraping is the developer's way around that.
For instance, in the earlier days of Stack Overflow, someone built a tool to track changes to your reputation over time, before Stack Overflow itself provided that feature. The only way to do that, since Stack Overflow has no API, was to screen scrape.
The obvious case is when a web service doesn't offer reverse search. You can implement that reverse search over the same data set, but it requires scraping the entire dataset.
This may be fair use if the reverse search also requires significant pre-processing, e.g. because you need to support partial matching. The data source may not have the technical skills or computing resources to provide the reverse search option.
I use screen scraping daily. I run some eCommerce sites and have screen-scraping scripts running every day to gather product lists automatically from my suppliers' wholesale sites. This gives me up-to-date information on all the products available to me from several suppliers and lets me flag uneconomical margins caused by price changes.

Resources