How to query across multiple remote Accumulo instances - distributed

I'm new to Accumulo and have a newbie question.
I have several independent, remote Accumulo instances. I would like to run a single query across all of the instances simultaneously and aggregate the results. Is there a library or a standard method/best practice for doing this?
Thanks.

There is no Accumulo-recommended way to do this. Just as you consider them independent clusters, Accumulo treats them as independent and relies on you, the user, to aggregate the data you query from each. Given that Scanners and BatchScanners expose an Iterator, it is very straightforward to merge the results from each instance (Guava's Iterators class might be helpful).
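As a minimal sketch of that client-side merge, assuming one ZooKeeper quorum per instance and a common table name (the instance names, hosts, credentials, table name, and row range below are placeholders, not details from the original question):

```java
// Hedged sketch: merging Scanner results from several independent Accumulo
// instances on the client. All connection details are placeholders.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

import com.google.common.collect.Iterators;

public class MultiInstanceQuery {

    public static void main(String[] args) throws Exception {
        // One (instanceName, zookeepers) pair per remote cluster -- placeholders.
        String[][] clusters = {
            {"instanceA", "zk-a1:2181,zk-a2:2181"},
            {"instanceB", "zk-b1:2181,zk-b2:2181"}
        };

        List<Iterator<Map.Entry<Key, Value>>> perClusterResults = new ArrayList<>();
        for (String[] cluster : clusters) {
            Connector conn = new ZooKeeperInstance(cluster[0], cluster[1])
                .getConnector("user", new PasswordToken("secret"));
            Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);
            scanner.setRange(new Range("row_start", "row_end")); // same query on each instance
            perClusterResults.add(scanner.iterator());
        }

        // Simple aggregation: concatenate the per-instance iterators.
        // Iterators.mergeSorted(...) could be used instead if a globally
        // Key-sorted view is required.
        Iterator<Map.Entry<Key, Value>> merged = Iterators.concat(perClusterResults.iterator());
        while (merged.hasNext()) {
            Map.Entry<Key, Value> entry = merged.next();
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```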

Related

Solr - Efficient way to search across multiple cores?

I am building a user-facing search engine for movies, music and art where users perform free-text queries (like Google) and get the desired results. Right now, I have movies, music and art data indexed separately in different cores, and they do not share a similar schema. For ease of maintenance, I would prefer to keep them in separate cores as they are now.
To date, I have been performing my queries individually on each core, but I want to expand this capability to perform a single query that runs across multiple cores/indexes. Say I run a query for the name of an artist and the search engine returns all the relevant movies, music and artwork they have done. Things get tricky here.
Based on my research, I see two options in this case.
1. Create a fourth core, add a shards attribute that points to my other three cores, and redirect all my queries to this core to return the required results.
2. Create a hybrid index merging all three schemas and perform queries on this index.
With the first option, the downside I see is that the keys need to be unique across my schemas for this to work. Since I am going to have the key artistName across all my cores, this isn't going to help me.
I really prefer keeping my schemas separate, so I do not want to delve into the second option. Is there a middle ground here? What would be considered best practice in this case?
Linking other SO questions that I referred to here:
Best way to search multiple solr core
Solr Search Across Multiple Cores
Search multiple SOLR core's and return one result set
I am of the opinion that you should not be doing searches across multiple cores.
Solr and NoSQL databases are not meant for that. These databases are preferred when we want to achieve a faster response, which is not possible with an RDBMS because of the joins involved.
Joins in an RDBMS slow down your query as the data grows in size.
To achieve a faster response we try to convert the data into flat documents and store them in a NoSQL database like MongoDB, Solr, etc.
You should convert your data in such a way that it is part of a single document.
If the above option is not possible, then create individual cores and retrieve the specific data from the specific core with multiple calls, as sketched below.
You can also look at creating parent-child relationship documents in Solr.
Another option is Solr Cloud with Solr streaming expressions.
Every option has its pros and cons. It all depends on your requirements and what you can compromise on.
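A minimal SolrJ sketch of the "multiple calls" approach, assuming three cores named movies, music and art on a local Solr instance and a shared artistName field (the URLs, core names, and field name are placeholders for illustration):

```java
// Hedged sketch: querying several independently-schemed Solr cores and merging
// the results client-side. Core names, URLs, and the artistName field are
// placeholders.
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class MultiCoreSearch {

    public static List<SolrDocument> searchAllCores(String artist) throws Exception {
        String[] coreUrls = {
            "http://localhost:8983/solr/movies",
            "http://localhost:8983/solr/music",
            "http://localhost:8983/solr/art"
        };

        List<SolrDocument> merged = new ArrayList<>();
        for (String url : coreUrls) {
            try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
                SolrQuery query = new SolrQuery("artistName:\"" + artist + "\"");
                query.setRows(20);                       // per-core page size
                QueryResponse response = client.query(query);
                merged.addAll(response.getResults());    // naive merge; re-order as needed
            }
        }
        return merged;
    }
}
```

Note that relevance scores from different cores are not directly comparable, so any cross-core ranking has to be done by the application after the merge.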

What's the Best Way to Query Multiple REST APIs and Deliver Relevant Search Results?

My organization has multiple databases that we need to provide search results for. Right now you have to search each database individually. I'm trying to create a web interface that will query all the databases at once and sort the results based upon relevance.
Some of the databases I have direct access to. Others I can only access via a REST API.
My challenge isn't knowing how to query each individual database. I understand how to make API calls. It's how to sort the results by relevance.
On the surface it looks like Elasticsearch would be a good option. Its inverted-index approach seems like a good way to figure out which results are going to be the most relevant to our users. It's also super fast.
The problem is that I don't see a way (so far) to include results from an external API in Elasticsearch so it can do its magic.
Is there a better option that I'm not aware of? Or is it possible to have Elasticsearch evaluate the relevance of results from an external API while also including data from its own internal indices?
I did find an answer, although nobody replied. :\
The answer is to use the http_poller input plugin with Logstash. This will make an API call and ingest the results into Elasticsearch.
Another option could be some form of microservice orchestration that makes the various API calls and then merges them into a final result set.
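A minimal sketch of ingesting external API results into Elasticsearch from application code, assuming a hypothetical external endpoint and the Elasticsearch low-level Java REST client (the API URL, index name, and single-document handling are placeholders):

```java
// Hedged sketch: pulling a document from an external REST API and indexing it
// into Elasticsearch so it can be searched and scored alongside local data.
// The API URL, index name, and JSON handling below are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ExternalApiIngest {

    public static void main(String[] args) throws Exception {
        // 1. Fetch a result from the external API (placeholder endpoint).
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest apiRequest = HttpRequest.newBuilder(
                URI.create("https://api.example.org/records/42")).build();
        String json = http.send(apiRequest, HttpResponse.BodyHandlers.ofString()).body();

        // 2. Index the raw JSON into a local Elasticsearch index so that
        //    subsequent searches score it together with everything else.
        try (RestClient es = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request indexRequest = new Request("POST", "/external-records/_doc");
            indexRequest.setJsonEntity(json);
            es.performRequest(indexRequest);
        }
    }
}
```

The http_poller input mentioned above does essentially the same thing on a schedule inside the Logstash pipeline, without custom code.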

Database Queries in One Spark Worker Thread

Recently I've started learning Spark to accelerate the processing. In my situation the input RDD of the Spark application does not contain all the data required for the batch processing. As a result, I have to do some SQL queries in each worker thread.
Preprocessing of all the input data is possible, but it takes too long.
I know the following questions may be too "general", but any experience will help.
1. Is it possible to do SQL queries in worker threads?
2. Will the scheduling on the database server be the bottleneck if a single query is complicated?
3. Which database suits this situation (maybe one with good concurrency)? MongoDB? *SQL?
It is hard to answer some of your questions without a specific use case, but the following generic answers might be of some help.
1. Yes. You can access external data sources (RDBMS, Mongo, etc.) from workers. You can use mapPartitions to improve performance by creating the connection once per partition; see the sketch below this list.
2. I can't answer this without looking at a specific example.
3. Database selection depends on the use case.
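A minimal sketch of the mapPartitions pattern with plain JDBC, assuming the Spark 2.x+ Java API (where mapPartitions returns an Iterator) and a placeholder JDBC URL, table, and input RDD:

```java
// Hedged sketch: issuing SQL lookups from inside Spark workers with
// mapPartitions, so one JDBC connection is opened per partition rather than
// per record. The JDBC URL, credentials, and table are placeholders, and the
// JDBC driver must be available on the executor classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

public class PerPartitionLookup {

    public static JavaRDD<String> enrich(JavaRDD<Integer> ids) {
        return ids.mapPartitions((Iterator<Integer> partition) -> {
            // One connection per partition, reused for every record in it.
            List<String> results = new ArrayList<>();
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://db-host/mydb", "user", "secret");
                 PreparedStatement stmt = conn.prepareStatement(
                     "SELECT name FROM items WHERE id = ?")) {
                while (partition.hasNext()) {
                    stmt.setInt(1, partition.next());
                    try (ResultSet rs = stmt.executeQuery()) {
                        if (rs.next()) {
                            results.add(rs.getString("name"));
                        }
                    }
                }
            }
            return results.iterator();
        });
    }
}
```

Whether the database becomes the bottleneck then depends mostly on how many partitions hit it concurrently and how expensive each query is.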

Using DTO Pattern to synchronize two schemas?

I need to synchronize two databases.
Those databases store the same semantic objects, but the objects are physically represented differently in each database.
I plan to use the DTO pattern to unify the object representation:
DB ----> DTO ----> MAPPING (Getters / Setters) ----> DTO ----> DB
I think it's a better idea than physically synchronizing with SQL queries on each side; I use Hibernate to add abstraction and synchronize the objects.
Do you think it's a good idea?
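As a concrete illustration of that DB -> DTO -> mapping -> DTO -> DB flow, here is a minimal sketch using hypothetical Customer entities on each side (the entity classes, field names, and repository interfaces are invented for this example; in practice the entities would be Hibernate-mapped classes and the repositories thin wrappers around their sessions):

```java
// Hedged sketch of the DB -> DTO -> mapping -> DTO -> DB flow. All classes,
// fields, and repository interfaces here are hypothetical.
import java.util.List;

public class CustomerSync {

    // Source-side and target-side persistent entities (physically different).
    public static class SourceCustomer { public String fullName; public String phone; }
    public static class TargetCustomer { public String firstName; public String lastName; public String phoneNumber; }

    // The neutral representation both sides are mapped to.
    public static class CustomerDto {
        public String firstName;
        public String lastName;
        public String phone;
    }

    // Hypothetical persistence boundaries (e.g. backed by Hibernate sessions).
    public interface SourceRepository { List<SourceCustomer> findAll(); }
    public interface TargetRepository { void saveOrUpdate(TargetCustomer c); }

    public static void synchronize(SourceRepository source, TargetRepository target) {
        for (SourceCustomer s : source.findAll()) {
            // DB -> DTO
            CustomerDto dto = new CustomerDto();
            String[] parts = s.fullName.split(" ", 2);
            dto.firstName = parts[0];
            dto.lastName = parts.length > 1 ? parts[1] : "";
            dto.phone = s.phone;

            // DTO -> DB (other schema)
            TargetCustomer t = new TargetCustomer();
            t.firstName = dto.firstName;
            t.lastName = dto.lastName;
            t.phoneNumber = dto.phone;
            target.saveOrUpdate(t);
        }
    }
}
```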
My two cents: you need to consider using the right tool for the job. While it is tempting to write custom code to solve this problem, there are numerous tools out there that already do this for you: they map source to target, do custom transformations from attribute to attribute, and will more than likely deliver a faster time to market.
Look to ETL tools. I'm unfamiliar with the tools available in the open source community, but if you lean in that direction, I'm sure you'll find some. Other tools you might look at are Informatica, Data Integrator, SQL Server Integration Services and, if you're dealing with spatial data, another called Alteryx.
Tim
Doing that with an ORM might be slower by an order of magnitude than a well-crafted SQL script. It depends on the size of the DB.
EDIT
I would add that the decision should depend on the amount of difference between the two schemas, not on your expertise with SQL. SQL is so common that developers should be able to write a simple script cleanly.
SQL also has the advantage that everybody knows how to run a script, but not everybody will know how to run your custom tool (this is a problem I have encountered in practice when the migration is actually operated by somebody else).
For schemas which differ only slightly (e.g. in names, or with simple transformations of column values), I would go for a SQL script. This is probably more compact and more straightforward to use and communicate.
For schemas with major differences, with data organized in different tables or complex logic to map some values from one schema to the other, a dedicated tool may make sense. Chances are that the initial effort to write the tool is greater, but it can be an asset once created.
You should also consider non-functional aspects, such as exception handling, logging of errors, splitting the work into smaller transactions (because there is too much data), etc.
A SQL script can indeed become "messy" under such conditions. If you have such constraints, SQL will require advanced skills and tend to be hard to use and maintain.
The custom tool can evolve into a mini-ETL with the ability to chunk the work into small transactions, manage and log errors nicely, etc. This is more work, and can turn into a dedicated project.
The decision is yours.
I have done that before, and I thought it was a pretty solid and straightforward way to map between two DBs. The only downside is that any time either database changes, I had to update the mapping logic, but that's usually pretty simple to do.

Why would a database developer use LINQ?

I am curious to know why a database person should learn LINQ. How can it be useful?
LINQ has one big advantage over most other data access methods: it centers around the 'query expression' metaphor, meaning that queries can be passed around like objects and be altered and modified, all before they are executed (iterated). In practice this means that code can be modular and better isolated. The data access repository will return the 'orders' query, then an intermediate filter in the request processing pipeline will decorate this query with a filter, then it gets passed on to a display module that adds sorting and pagination, etc. In the end, when it is iterated, the expression has been transformed into a very specific SELECT ... WHERE ... ORDER BY ... LIMIT ... (or the back-end-specific pagination equivalent, like ROW_NUMBER). For application developers this is priceless, and there simply isn't any viable alternative. This is why I believe LINQ is here to stay and won't vanish in 2 years. It is definitely more than just a fad. And I'm specifically referring to LINQ as a database access method.
The advantage of manipulating query expression objects alone is enough to make LINQ a winning bid. Add to this the multiple container types it can work over (XML, arrays and collections, objects, SQL) and the uniform interface it exposes over all these disparate technologies, consider the parallel processing changes coming in .NET 4.0 that I'm sure will be integrated transparently into LINQ processing of arrays and collections, and there is really no way LINQ will go away. Sure, today it sometimes produces unreadable, poorly performing and undebuggable SQL, and it is every dedicated DBA's nightmare. It will get better.
LINQ provides "first-class" access to your data, meaning that your queries are now part of the programming language.
LINQ provides a common way to access data objects of all kinds. For example, the same syntax that is used to access your database queries can also be used to access lists, arrays, and XML files.
Learning LINQ will give you a deeper understanding of the programming language. LINQ was the driver for all sorts of language innovations, such as extension methods, anonymous types, lambda expressions, expression trees, type inference, and object initializers.
I think a better question is: why would one replace SQL with LINQ when querying a database?
"Know your enemy"?
OTOH, out of interest: I'm learning more about XML, XSLT, XSDs, etc. because I can see a use for them as a database developer.
Is this a real question? It can't be answered with a code snippet. Most of these discussion-type questions get closed quickly.
I think one of the many reasons to consider LINQ is that it is a big productivity boost and a huge time saver.
This question also prompts me to ask: if LINQ becomes immensely popular, will server-side developers know SQL as well as their predecessors did?
