Database of scientific paper abstracts - doi

I am trying to find a database with scientific papers which will allow me to:
1. Get metadata of papers by doi (including abstracts);
2. Do this stuff regularly (e.g. daily updated);
3. Ability to download whole existing database.
I know about Crossref API, however, only 3% of all publications presented have abstract (and none of biggest publishers like Springer or Elsevier provide them). On the other side I see some projects like Dimensions or Researcher which already implemented mentioned functionality. So the question is: does somebody know such services (possibly not free) and had experience working with them?

Have you looked at Semantic Scholar (https://www.semanticscholar.org/)? They have an API that supports the first of your requirements (http://api.semanticscholar.org/) and also provide the "Open Research Corpus" (http://labs.semanticscholar.org/corpus/) which should satisfy your third requirement. It is a smaller database than what is provided by Scopus or Web of Science, but both of those require subscriptions to fully use their APIs and don't (as far as I know) have a real way for you to purchase a full download of the database.

Related

JitterBit vs Dell Boomi vs Celigo

We've narrowed our selection for an ipaas down to the above 3.
Initially we're looking to pass data from a cloud based HR system to Netsuite, and from Netsuite to Salesforce, and sometimes JIRA.
i've come from a Mulesoft background which I think would be too complex for this. On the other hand it seems that Celigo is VERY drag and drop, and there's not much room for modification/customisation.
Of the three, do you have any experience/recommendations? We aren't looking for any code heavy custom APIs, most will just be simple scheduled data transfers but there may be some complexity within the field mapping, and we want to set ourselves up for the future.
I spent a few years removing Celigo from NetSuite and Salesforce. The best way I can describe Celigo is that it is like the old school anti-virus programs which were often worse than the viruses... lol... It digs itself into the end system, making removing it a nightmare.
Boomi does the job, but is very counter-intuitive, and overly complex. You can't do everything from one screen, you can't easily bounce back and forth between tasks/operations/etc. And, sometimes it is very difficult to find where endpoints are used, as they are not always shown in their "where is this used" feature. Boomi has a ton of endpoint connectors pre-built (the most, I believe), but I have not seen an easy way to just create your own. Boomi also has much more functionality than just the integrations, if that is something that may be needed.
Jitterbit, my favorite, is ridiculously simple to use. You can access everything from one main screen, you can connect to anything (as long as it can reach out to the network, or you can reach it via the network - internal or external). Jitterbit has a lot of pre-built endpoint connectors. It is also extremely easy to just create a connection to anything you want. The win with Jitterbit is that it is super easy to use, super easy to learn, it always works, they have amazing support (if you need it). I have worked with Jitterbit the most (about 6 years), and I have never been unable to complete an integration task in less that a couple of day, max.
I have extensive experience with Dell Boomi platform but none with JitterBit or Celigo. Dell Boomi offers very versatile and well supported iPaaS solution. The technical challenges of Boomi are some UI\usability issues (#W3BGUY mentioned the main ones) and the lack of out-of-the-box support for CI/CD and DevOps processes (code management, versioning, deployments etc.)
One more important component to consider here is the pricing of the platform. Boomi does charge their clients yearly connection prices. Connection is defined as a unique combination of URL, username and password. The yearly license costs vary and can range anywhere between ($1,000 - $12,000) per license per year. The price depends greatly on your integration landscape and the discounts provided so I would advise on engaging with vendor early to understand your costs. Would be great to hear from others on pricing for JitterBit and Celigo.
Boomi is also more than just an iPaaS platform. They offer other modules of their platform to customers: API Management, Boomi Flow (workflow and automation module), Master Data Hub (master data management). Some of these modules are well developed and some are in their infancy (API Management).
From my limited experience with MuleSoft platform, I share the OP's sentiments about it being too complex for simple integrations. They do provide great CI/CD and DevOps functionality though if that is something that is needed.
There is not a simple answer to a question like this. One needs to look at multiple aspects of the platform and make a decision based on multitude of factors. I would advise looking at Gartner and Forrester reports for a general guidelines and working out the pricing (initial and recurring) with the vendor.
I have only used Jitterbit, so can only comment on that. It works fine. It is pretty intuitive and easy to use, but does have some flexibility with writing your own queries, defining and mapping file formats, and choosing different transfer protocols.
I've only used the free version (which you need to host somewhere and also is not supported) and it was good enough for production tasks. If you have the luxury of time, I'd say download it and try it out. If it works for you, throw it on a server or upgrade to the cloud version.
One note: Jitterbit uses background services. If you run it locally and then decide to migrate your account to a server, you need to stop those services on your local. Otherwise, it will try to run jobs from both locations and that doesn't turn out well.
Consider checking out Choreo as well. It has a novel simultaneous code + low-code approach for integration development. And provides rich AI support for performance monitoring, debugging, and data mapping.
Disclaimer: I'm a member of the project.

Are RDF / Triple Stores suited for storing application data? (as opposed to the graph Metadata)

I'm trying to create a small web application for a "personal information manager" / wiki kind of tool where I can take notes in the form of HTML snippets (or maybe Markdown), annotate them with some https://schema.org/ microdata and store both the snippet and the metadata somewhere for querying.
My understanding so far is that most semantic data stores (triple/quad stores, or databases supporting RDF) are better suited for storing and querying mainly the metadata. So I'll probably also want some traditional store of some sort (relational, document store, key-value, or even a non-rdf graph db) where I can store the full text of each note and maybe some other bits like time of last access, user-id that owns the note, etc, and also perform traditional (non-semantic) fulltext queries.
I started looking for stores that would allow me to store both data and metadata in a single place. I found a few: Ontotext GraphDB, Stardog, MarkLogic, etc. All of these seem to do exactly what I want, but have some pretty limiting free license terms that really discourage me from studying them in depth: I prefer to study open technologies that I could potentially use on a real product.
Before digging deeper, I was wondering:
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
if there's any setup involving free/open source software that developers with experience in RDF/Sparql can recommend, given the problem I describe.
Right now I'm just leaning towards using Apache Jena for the RDF store and SPARQL queries, and something completely independent for the rest of the data (PostgreSQL most likely).
Before digging deeper, I was wondering:
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
Not necessarily, no, though there certainly are some cases in which that distinction may be useful. But most RDF databases offer scalable storage for both data and metadata. The only requirement is that your (meta)data is represented as RDF. If you're worried about performance of things like text queries, most of them offer support for full-text indexing through Lucene, Solr, or Elasticsearch.
if there's any setup involving free/open source software that developers with experience in RDF/Sparql can recommend, given the problem I describe.
This is really not the right place to ask this question. Tool recommendations are considered off-topic on StackOverflow since they attract biased answers. But as said, there's plenty of tools, both open-source/free and commercial, that you can look into. I suggest you pick one you like the look of, experiment a bit, and perhaps talk to that particular tool's community to explain what you're trying to do. Apache Jena and Eclipse Rdf4j are two popular open-source projects, but there's plenty of others.

How to get book metadata?

My application needs to retrieve information about any published book based on a provided ISBN, title, or author. This is hardly a unique requirement---sites like Amazon.com, Chegg.com, and even software like Book Collector seem to be able to do this easily. But I have not been able to replicate it.
To clarify, I do not need to search the entire database of books---only a limited subset which have been inputted, as in a book collection. The database would simply allow me to tag the inputted books with the necessary metadata to enable search on that subset of books. So scale is not the issue here---getting the metadata is.
The options I have tried are:
Scrape Amazon. Scraping the regular Amazon pages was not very robust to things like missing authors, and while scraping the smaller mobile pages was faster, they shared the same issues with robustness of extraction. Plus, building this into an application is a clear violation of Amazon's Terms of Service.
Scrape the Library of Congress. While this seems to have fewer legal ramifications, ease and robustness were again issues.
ISBNdb.com API. While the service is free up to a point, and does a good job of returning the necessary metadata, I need to do this for over 500 books on a daily basis, at which point this service costs money proportional to use. I'd prefer a free or one-time payment solution that allows me to do the same.
Google Book Data API. While this seems to provide the information I need, I cannot display the book preview as their terms of service requires.
Buy a license to a database of books. For example, companies like Ingram or Baker & Taylor provide these catalogs to retailers and libraries. This solution is obviously expensive, so I'm hoping that there's a more elegant solution I've missed. But if not, and someone on SO has had a good experience with a particular database, I'm willing to go with that.
I've tried to describe my approach in detail so others with fewer books can take advantage of the above solutions. But given my requirements, I'm at my wits' end for retrieving book metadata.
Since it is unlikely that you have to retrieve the same 500 books every day: store the data retrieved from isbndb.com in a database and fill it up book by book.
Instead of scraping Amazon, you can use the API they expose for their affiliate program: https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
It allows about 3k requests per hour and returns well-formed XML. It requires you to set a link to the book that you show the information about, and you must state that you are an affiliate partner.
This might be what you're looking for. They even offer a complete download!
https://openlibrary.org/data
As it seems, a lot of libraries and other organisations make information such as "ISBN" available through MAchine-Readable Cataloging aka MARC, you can find more information about it here as well.
Now knowing the "right" term to search for I discovered WorldCat.org.
Maybe this whole MARC thing gives you a new kind of an idea :)

Recommendations for Webbased Archive

Requirements for archival type software
1. Data/Image/possibly video.... upload/search/retrevial/edit from web.
2. Easily implemented user defined Custom Fields
3. Easy backup.
4. Low cost ... either opensource or very low cost
I am a very novice programmer. My primary goal is to manage a collection and publish it to the web.
Options
A. Open source software such as collective access
Problems: Custom fields not supported. Continued support? Portablity of
database?
B. Use Microsoft Access and then use MVC or other development platforms to eventually
publish to the web.
Problems:Difficult to integrate to web?
C. Design my own MVC database application.
Problems:Difficult for novice programmer? Custom Fields and Upload of various data
formats difficult to implement?
Sounds like you are looking for a Digital Assets Management system. I found ResourceSpace (http://www.resourcespace.org/) and Razuna (http://www.razuna.org/) very useful for similar projects - both fall into your A category.
Requirements for archival type
software 1. Data/Image/possibly
video.... upload/search/retrevial/edit
from web. 2. Easily implemented user
defined Custom Fields 3. Easy backup.
4. Low cost ... either opensource or very low cost
Hi there,
As mentioned here before, but Razuna will satisfy your requirements quite well.
It can manage images, documents, videos and audios. It will share folderd and collections on the web with access permissions and will allow you to search among the different kind of assets as well.
Moreover, it can handle metadata of all this asset. It will not only read metadata, but also WRITE metadata, also. Furthermore, you can set the custom fields for each asset type and users will have a web interface to work with.
Razuna supports different databases (H2, MySQL, MS SQL and Oracle (soon DB2)) and let's you migrate from one db to another with ease (backup / restore option).
Best of it all: It is available under a open source license for you to deploy and enjoy today. You can get it at http://razuna.org.
Kind Regards,
Nitai
PS: I'm the main developer and founder of Razuna.

What is the production ready NonSQL database?

With the rising of non-sql database usage in high traffic website, I'm interested to use it for my project. Now I've heard several names like Voldermort, MongoDB and CouchDB. But which are among these NonSQL database that is production ready? I've seen the download pages and it seems that none of them is production ready because is not version 1.0 yet. Is there any other names other than these 3 that is recommendable to be used in production?
What do you mean by production ready? As far as I know, all of them are being used on live systems.
You should make your choice based on how the features they provide fit your needs.
You can also add Tokyo Cabinet to the list as well as the mnesia database provided by the Erlang VM.
I think you need to start out from your project requirements to see what kind of database you really need. There are many non-relational DBMS:s out there and they differ a lot in what kind of problems they are good at solving. I think the article Should you go Beyond Relational Databases? by Martin Kleppmann is a good starting point for finding out what you need. There's also a lot of stackoverflow threads on similar topics, these are my favorites:
The Next-gen Databases
Non-Relational Database Design
When shouldn’t you use a relational
database?
Good reasons NOT to use a relational
database?
When you have narrowed down what you actually need you can take a deeper look into the alternatives to see which DBMS are production ready for your use case. Production readiness isn't a yes/no thing: people may successfully deploy some solution that for example lacks in tool support - in another project this could be a no-go.
As for version numbers different projects have a different take on this, so you can't just compare the version numbers. I'm involved in the graph database project Neo4j and even if it has been in production use for 5+ years by now we still haven't released a version 1.0 final yet.
I'm tempted to answer "use SIRA_PRISE".
It's definitely non-SQL.
And its current version is 1.2, meaning that someone like you must definitely assume it's "production-ready".
But perhaps I shouldn't be answering at all.
Nice article comparing rdbms with 'next gen' and listing some providers:
Is the Relational Database Doomed?
http://readwrite.com/2009/02/12/is-the-relational-database-doomed
I will suggest you to use Arangodb.
ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.
ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
Another unique feature is ArangoDB’s query language AQL — it makes querying powerful and convenient. AQL enables you to describe complex filter conditions and joins in a readable format, much in the same way as SQL.
You can model your data in several ways:
in key/value pairs
as collections of documents
as graphs with nodes, edges, and properties for both
You can access data in ArangoDB:
using the general HTTP REST API via curl/wget, or your browser
via the ArangoDB shell (“arangosh”)
using a programming language specific client library
Server requirements for ArangoDB:
ArangoDB runs on Linux, OS X and Microsoft Windows.
It runs on 32bit and 64bit systems, though using a 32bit system will limit you to using only approximately 2 to 3 GB of data with ArangoDB.

Resources