Recently I've started learning Spark to accelerate our batch processing. In my situation, the input RDD of the Spark application does not contain all the data required for the job, so I have to run some SQL queries from each worker task.
Preprocessing of all the input data is possible, but it takes too long.
I know the following questions may be too "general", but any experience will help.
1. Is it possible to run SQL queries from the worker tasks?
2. Will the database server become the bottleneck if a single query is complicated?
3. Which database suits this situation (ideally one with good concurrency)? MongoDB? *SQL?
It is hard to answer some of your questions without a specific use case, but the following generic answers might be of some help:
1. Yes. You can access external data sources (RDBMS, Mongo, etc.) from the workers. You can use mapPartitions to improve performance by creating the connection once per partition instead of once per record. See an example here, and the sketch below.
2. Can't answer this without looking at a specific example.
3. Database selection depends on the use case.
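On the first point, here is a minimal PySpark sketch of the mapPartitions idea; the ODBC connection string, table, and column names are hypothetical and only illustrate the pattern:

```python
# A sketch only: db-host, RefData, dbo.Lookup, and the credentials are hypothetical.
import pyodbc
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enrich-with-sql-lookups").getOrCreate()

def enrich_partition(rows):
    # One connection per partition instead of one per record.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=db-host;DATABASE=RefData;UID=spark;PWD=secret"
    )
    cursor = conn.cursor()
    try:
        for row in rows:
            # The per-record SQL query the question asks about.
            cursor.execute("SELECT Extra FROM dbo.Lookup WHERE Id = ?", row["key"])
            hit = cursor.fetchone()
            yield {**row, "extra": hit.Extra if hit else None}
    finally:
        conn.close()

rdd = spark.sparkContext.parallelize([{"key": 1}, {"key": 2}])
print(rdd.mapPartitions(enrich_partition).collect())
```

The driver has to be installed on every worker node, and note that the database will see roughly as many concurrent queries as there are running tasks, which is where the bottleneck in your second question can appear.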
We're heading into a POC and need to determine whether Snowflake tasks and streams are useful for CDC and data transformation. I have read the Snowflake documentation, and the more I read, the more it seems like it will be a complex mess to handle. Thinking about thousands of tables and complex transformations, how will tasks and streams scale? Consider a table that gets loaded from 5 other feeds: what will that process look like? On top of that, Snowflake doesn't offer any visualization to work with tasks. Can those of you who have worked with Snowflake streams/tasks comment and share your opinion on using them? If you went with an alternative after trying them out, was it a commercial ETL tool or Databricks? If we're already using Qlik to bring data into AWS S3 (our data lake), would it make sense to use streams to ingest from the data lake into Snowflake?
TIA
This question seems too wide for the typical Stack Overflow process (so the community might choose to close it).
In the meantime, I'll reply here to one of the stated questions: "On top of that, snowflake doesn't offer any visualization to work with tasks"
There is a tool to visualize tasks, created by a Snowflake SE:
https://medium.com/snowflake/visualizing-task-hierarchies-and-dependencies-in-snowflake-snowsight-d28298d0f0ed
For the larger picture: Snowflake streams and tasks are basic building blocks for more complex solutions. As your use case grows more complex, you'll need to find ways to manage that complexity, either with your own tooling, Snowflake's, or a third party's.
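To make those building blocks concrete, here is a minimal sketch of one stream and one task created through the Snowflake Python connector; the table, warehouse, schedule, and credentials are hypothetical and the transformation is deliberately trivial:

```python
# A sketch only: raw_orders, orders, etl_wh, and the credentials are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    user="etl_user", password="secret", account="my_account",
    warehouse="etl_wh", database="analytics", schema="public",
)
cur = conn.cursor()

# A stream records the changes (CDC) made to the source table.
cur.execute("CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders")

# A task periodically drains the stream whenever it has data.
cur.execute("""
    CREATE OR REPLACE TASK load_orders_task
      WAREHOUSE = etl_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
    AS
      INSERT INTO orders
      SELECT order_id, amount FROM raw_orders_stream
      WHERE METADATA$ACTION = 'INSERT'
""")
cur.execute("ALTER TASK load_orders_task RESUME")  # tasks are created suspended
```

A table fed from 5 feeds would typically mean one stream per feed plus either one task per stream or a task tree, which is exactly where the management overhead the question worries about comes from.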
Since you are running a POC: Make sure to ask your Snowflake sales contact. Engineers like Dave are ready and eager to find a solution that fits your needs.
Currently, I'm working on a MERN web application that will need to communicate with a Microsoft SQL Server database on a different server, but on the same network.
Data will only be "transferred" from the Mongo database to the MSSQL one based on a user action. I think I can accomplish this by simply transforming the data into the appropriate format on my Express server and writing it to MSSQL via the matching driver/API.
On the flip side, data will be transferred from the MSSQL database to the Mongo one when a certain field is updated in a record. I think I can accomplish this with a Trigger, but I'm not exactly sure how.
Do either of these solutions sound reasonable, or are there better / more industry-standard methods that I should be employing? Any and all help is much appreciated!
There are (in general) two ways of doing this.
If the data transfer needs to happen immediately, you may be able to use triggers to accomplish this, although be aware of your error handling.
The other option is to develop some form of worker process in your favourite scripting language and run this on a schedule. (This would be my preferred option, as my personal familiarity with triggers is fairly limited). If option 1 isn't viable, you could set your schedule to be very frequent, say once per minute or every x seconds, as long as a new task doesn't spawn before the previous is completed.
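Here is a minimal sketch of what such a scheduled worker might look like in Python with pyodbc and pymongo; the table, the LastUpdated column, and the connection strings are hypothetical:

```python
# A sketch only: dbo.Records, LastUpdated, and the hosts/credentials are hypothetical.
import pyodbc
from pymongo import MongoClient

def sync_changed_rows(since):
    sql_conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=mssql-host;DATABASE=AppDb;UID=sync_user;PWD=secret"
    )
    records = MongoClient("mongodb://mongo-host:27017/")["app"]["records"]

    cursor = sql_conn.cursor()
    cursor.execute(
        "SELECT Id, Status, LastUpdated FROM dbo.Records WHERE LastUpdated > ?", since
    )
    for row in cursor.fetchall():
        # Upsert so re-running the job after a failure is safe (idempotent).
        records.update_one(
            {"_id": row.Id},
            {"$set": {"status": row.Status, "lastUpdated": row.LastUpdated}},
            upsert=True,
        )
    sql_conn.close()
```

You would then run it from cron or Windows Task Scheduler, passing in the high-water mark recorded by the previous run.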
The broader question, though, is whether you need to have the data duplicated across two different sources at all. The obvious pitfall with this approach is consistency: should anything fail, you can end up with two data sources wildly out of sync with each other, and your approach will have to account for this.
I am writing a web application with Node.js that other applications can use to store logs, which are accessed later through a web interface or by the applications themselves via an API. It is similar to Graylog2, but schema-free.
I've already tried CouchDB, in which each document would be a log entry, but since I'm not really using revisions it seems to me I'm not using all of its features. Besides that, I think once the logs exceed a certain size they would be pretty hard to manage in CouchDB.
What I'm really looking for is a big array of logs that can be sorted, filtered, searched, and capped, with the latest events easily accessible. It should be schema-free, and writing to it should be non-blocking.
I'm considering Cassandra (I'm not really familiar with it) because of the points made here. MongoDB seems good here too, since Graylog2 uses MongoDB, and here there are some good points about it.
I have already seen this question, but I'm not satisfied with the answers.
Edit:
For some reasons I can't use Cassandra in production, so now I'm trying MongoDB.
One more reason to use MongoDB:
http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
More edits:
It is similar to Graylog2, but the difference I want is that instead of having a single message field, the fields are defined by the client, which is why I want it to be schema-free; because of that, I may need to query on the user-defined fields. We could build this on SQL, but querying on user-defined fields would be reinventing the wheel. The same goes for flat files.
Technically, what I'm looking for in the end is rich statistical data, easy debugging, and a lot of other things that we currently can't get out of the logs.
Where shall it be stored and how shall it be retrieved?
I guess it depends on how much data you are dealing with. If you have a huge amount of logs (terabytes or petabytes per day), then Apache Kafka, which is designed to allow data to be pulled by HDFS in parallel, is an interesting solution (still in the incubation stage). I believe that if you want to consume Kafka messages with MongoDB, you'd need to develop your own adapter to ingest them as a consumer of a particular Kafka topic. Although MongoDB data (e.g. shards and replicas) is distributed, ingesting each message may be a sequential process, so there may be a bottleneck or even race conditions depending on the rate and size of message traffic. Kafka is optimized to pump and append that data to HDFS nodes fast using message brokers. Once it is in HDFS, you can map/reduce to analyze your information in a variety of ways.
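If you go the Kafka route, that "adapter" would essentially be a consumer that drains a topic into MongoDB. A minimal sketch with kafka-python and pymongo, assuming a hypothetical "app-logs" topic and "logging.events" collection:

```python
# A sketch only: the broker address, topic, and collection names are hypothetical.
import json
from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers=["broker:9092"],
    value_deserializer=lambda b: json.loads(b),
)
events = MongoClient("mongodb://mongo-host:27017/")["logging"]["events"]

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:
        events.insert_many(batch)  # batch inserts reduce per-document overhead
        batch = []
# Flushing a final partial batch on shutdown is omitted for brevity.
```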
If MongoDB can handle the ingestion load, then it is an excellent, scalable, real-time solution for finding information, particularly documents. Otherwise, if you have more time to process data (i.e. batch processes that take hours and sometimes days), then Hadoop or some other MapReduce database is warranted. Finally, Kafka can distribute that load of messages and hook up that fire hose to a variety of consumers. Overall, these new technologies spread the load and huge amounts of data across cheap hardware, using software to manage failure and recover with a very low probability of losing data.
Even with a small amount of data, MongoDB is a nice alternative to traditional relational database solutions, which require more developer resources to design, build, and maintain.
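Since the question asks for a log store that can be "capped", here is a minimal pymongo sketch of a capped collection holding schema-free, client-defined fields; the database name, collection name, and sizes are illustrative only:

```python
# A sketch only: "logging", "events", and the size limits are hypothetical choices.
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

db = MongoClient("mongodb://mongo-host:27017/")["logging"]

# A capped collection keeps only the newest data up to a fixed size,
# which matches the "big array of logs that can be capped" requirement.
if "events" not in db.list_collection_names():
    db.create_collection("events", capped=True, size=512 * 1024 * 1024, max=1_000_000)

events = db["events"]
events.insert_one({
    "ts": datetime.now(timezone.utc),
    "level": "error",
    "service": "api",
    "client_defined_field": "anything",  # schema-free, client-defined fields
})

# Last events first: capped collections preserve insertion order.
latest = list(events.find().sort("$natural", DESCENDING).limit(50))
```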
General Approach
You have a lot of work ahead of you. Whichever database you use, you have many features which you must build on top of the DB foundation. You have done good research about all of your options. It sounds like you suspect that all have pros and cons but all are imperfect. Your suspicion is correct. At this point it is probably time to start writing code.
You could just choose one arbitrarily and start building your application. If your guess was correct that the pros and cons balance out and it's all about the same, then why not simply start building immediately? When you hit difficulty X on your database, remember that it gave you convenience Y and Z and that's just life.
You could also establish the fundamental core of your application and implement various prototypes on each of the databases. That might give you true insight to help discriminate between the databases for your specific application. For example, besides the interface, indexing, and querying questions, what about deployment? What about backups? What about maintenance and security? Maybe "wasting" time to build the same prototype on each platform will make the answer very clear for you.
Notes about CouchDB
I suppose CouchDB is "NoSQL" if you say so. Other things which are "no SQL" include bananas, poems, and cricket. It is not a very meaningful word. We have general-purpose languages and domain-specific languages; similarly CouchDB is a domain-specific database. It can save you time if you need the following features:
Built-in web API: clients may query directly (see the sketch after this list)
Incremental map-reduce: CouchDB runs the job once, but you can query repeatedly at no cost. Updates to the data set are immediately reflected in the map/reduce result without full re-processing
Easy to start small but expand to large clusters without changing application code.
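As an illustration of the first two points, here is a minimal sketch of querying a CouchDB map/reduce view over its built-in HTTP API with the requests library; the "logs" database, the "stats/by_level" view, and the credentials are hypothetical and would be defined in a design document beforehand:

```python
# A sketch only: database, design document, view, and credentials are hypothetical.
import requests

COUCH = "http://127.0.0.1:5984"

# Incremental map/reduce: CouchDB keeps this view up to date as documents change,
# so repeated queries do not re-process the whole data set.
resp = requests.get(
    f"{COUCH}/logs/_design/stats/_view/by_level",
    params={"group": "true"},   # one reduced row per distinct key (e.g. log level)
    auth=("admin", "secret"),
)
resp.raise_for_status()
for row in resp.json()["rows"]:
    print(row["key"], row["value"])
```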
Have you considered Apache Kafka?
Kafka is a distributed messaging system developed at LinkedIn for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption.
I'm trying to populate a table with user information in a MS SQL database with information from multiple data sources (i.e. LDAP and some other MS SQL databases). The process needs to run as a daily scheduled task to ensure that the user information table is updated frequently.
The initial attempt at this query/update script was written in VBScript and would query each data source and then update the user information table. Unfortunately, it takes a very long time to run and update the table.
I'm curious whether anyone has written anything similar, and whether you would recommend, or have noticed, a performance improvement from writing the script in another language. Some have recommended Perl because of its multi-threading support, but if anyone has other suggestions on ways to improve the process, or other approaches, please share your tips or lessons learned.
It's good practice to use Data Transformation Services (DTS), or SSIS as it has become known, for repetitive DB tasks. Although this won't solve your problem by itself, it may give some pointers to what is going on, as you can log each stage of the process, wrap it in transactions, etc. It is especially well suited to bulk loading and updates, and it understands VBScript natively, so there should be no problem there.
Other than that, I have to agree with Brian: find out what's making it slow and fix that. Changing languages is unlikely to fix it on its own, especially if there is an underlying issue. As a general point, my experience with LDAP, which is admittedly pretty limited, was that it could be incredibly slow at reading bulk user details.
I can't tell you how to solve your particular problem, but whenever you run into this situation you want to find out why it is slow before you try to solve it. Where is the slow down? Some major things to consider and investigate include:
getting the data
interacting with the network
querying the database
updating indices in the database
Get some timing and profiling information to figure out where to concentrate your efforts.
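For example, a coarse-grained timing harness along these lines would show which phase dominates; the fetch_from_ldap and update_user_table steps are hypothetical placeholders for the real job:

```python
# A sketch only: the two job phases below are hypothetical placeholders.
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f}s")

def fetch_from_ldap():
    # Placeholder for pulling user records from LDAP and the other sources.
    return [{"user": "jdoe"}]

def update_user_table(users):
    # Placeholder for writing the records to the MS SQL user information table.
    pass

with timed("fetch source data"):
    users = fetch_from_ldap()

with timed("update user table"):
    update_user_table(users)
```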
Hmmm. Seems like you could cron a script that uses dump utilities from the various sources, then seds the output into a good form for the target database's load utility. The script could be in bash or Perl, whatever.
Edit: In terms of performance, I think the first thing you want to try is to make sure that you disable any autocommit at the beginning of the load process, then issue the commit after writing all the records. This can make a HUGE performance difference.
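As an illustration of that advice, here is a minimal pyodbc sketch that turns off autocommit, batches the inserts, and commits once at the end; the table, columns, and connection string are hypothetical:

```python
# A sketch only: dbo.UserInfo, its columns, and the credentials are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=mssql-host;DATABASE=AppDb;"
    "UID=loader;PWD=secret",
    autocommit=False,              # no per-row commits during the load
)
cursor = conn.cursor()
cursor.fast_executemany = True     # send parameter batches instead of row-by-row

rows = [("jdoe", "John Doe"), ("asmith", "Anna Smith")]   # illustrative data
cursor.executemany(
    "INSERT INTO dbo.UserInfo (Username, DisplayName) VALUES (?, ?)", rows
)
conn.commit()                      # single commit after all records are written
conn.close()
```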
As MrTelly said, use SSIS or DTS, then schedule the package to run. Just converting to this alone will probably fix your speed issue, as they have tasks that are optimized for bulk inserting. I would never do this in a scripting language rather than T-SQL anyway. Likely your script works row by row instead of on sets of data, but that is just a guess.
When I attended a presentation on SQL Server 2008 at Microsoft, they did a quick poll to see what features we were using. It turned out that in the entire lecture hall, my company was the only one using the Service Broker. This surprised me a lot, as I thought more people would be using it.
My experience with SB is that it does its job well, but it is pretty tough to administer and it's hard to get an overview of.
So, have you considered using the Service Broker? If not, why not? Did you go for MSMQ instead? Is there anything in SQL Server 2008 that would make you consider using the Service Broker?
I've been using SQL Service Broker since a couple of months after SQL 2005 was released. We use it non-stop here sending hundreds of thousands of messages through it per day.
We use it to load data from staging tables to production tables so that the service that loads the staging table doesn't have to wait for the data to actually process, it can go back and get more data to load.
We use it to queue the deletion of files from the file system. (When the row is deleted the file needs to be deleted as well.)
At prior companies I've used it to print loan documents and the checks that were sent out to the customers.
I even used Service Broker to do ETL from an OLTP database to an OLAP database for real time reporting.
Most people (especially DBAs) don't like Service Broker because there isn't any UI for it. If you want to use Service Broker or see what it's doing, you have to actually write and run some T-SQL.
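To give a flavour of that T-SQL, here is a minimal sketch that sends a message onto a Service Broker queue from Python via pyodbc; the services, contract, and message type are hypothetical and would have to be created beforehand:

```python
# A sketch only: the service, contract, and message type names are hypothetical
# and must already exist, along with their queues.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=mssql-host;DATABASE=AppDb;"
    "UID=app;PWD=secret",
    autocommit=True,
)
conn.execute("""
    DECLARE @handle UNIQUEIDENTIFIER;
    -- Open a conversation between the two services.
    BEGIN DIALOG CONVERSATION @handle
        FROM SERVICE [StagingLoadedService]
        TO SERVICE 'ProcessStagingService'
        ON CONTRACT [StagingContract]
        WITH ENCRYPTION = OFF;
    -- Queue the message; a reader RECEIVEs it from the target queue later.
    SEND ON CONVERSATION @handle
        MESSAGE TYPE [StagingRowsReady]
        (N'{"staging_table": "dbo.Staging_Orders"}');
""")
```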
I have been using SB in SQL 2005 for about two years now, with one implementation handling several hundred thousand messages a day. I would say the biggest challenge has been not so much the architecture as understanding all the nuances involved. The documentation from Microsoft is poor, with very few practical examples. Remus Rusanu's blogs have been really helpful for things like dialog reuse and activation stored procedure tuning. I have found it's REALLY important to reuse dialogs as much as possible (and to work through all the associated locking involved in that), as well as to handle multiple received messages as a set rather than one at a time.
Monitoring SB can be a pain. You basically depend on a bunch of system views to tell you what's going on. Orphaned messages are a pain. There's just a lot of little gotchas that can, well, getcha.
Aside from the problems, and there aren't THAT many, I think it has really worked out better than I expected it to. Since SB is integrated into the database, there are no separate message queues to back up outside the database, and it's all transactionally consistent. Performance is good. It's a great solution.
I would use it again and will continue to use it.
At my current company, our usage of SB is somewhat different to that of the other posters. We use SB in SQL2005 mainly as a management tool. For example, we use it to manage updates to a small set of mutable tables that are present in a large number of otherwise immutable databases. All the messages are between services running on the same instance and the message volume is very low.
My experience with SB has been that it can be somewhat 'fiddly' to setup correctly and, as you mentioned in your question, it is hard to get an overview of the state of SB because there is not a single monitoring tool.
Nevertheless, we have found it hugely valuable as a way to automate a lot of database management tasks in a traceable and reliable way.
I have recently considered using Service Broker for a project, but yes, decided to go for MSMQ instead.
Our architecture consisted of a number of (clustered) servers, each needing to write information into a single instance of SQL reliably.
As I understand it, SB only works for SQL-to-SQL communication, so we would have needed an instance of SQL Server on each clustered box. We felt this was a bit unnecessary, hence using MSMQ.
To be honest, I can't think of a scenario where I would use SB. I'm interested in knowing a bit more about your scenario, to see if I'm missing something vital.
Service Broker can be used in various cases where automation is required in a distributed architecture: for example, applications that receive events from devices or sensors and need to process them reliably, that drive automation logic from those detection events, or that need to exchange data between multiple databases or applications.
I'd hope the implementation can be made more secure and reliable with SB.