NebulaGraph Database: How can I speed it up if a task result is saved slowly or data is transferred slowly between tasks? - graph-databases

When I use NebulaGraph and NebulaGraph Explorer for graph algorithms, it takes a long time. How can I speed it up?
I used the workflow feature in NebulaGraph Explorer, which allows me to design my own workflow, but once I did, the execution was slow.

If you are still exploring how you want to eventually implement your production graph-based pipeline/application with graph algorithms, quick execution and experimentation will be extremely helpful (instead of running the algorithms on the whole graph for every iteration of your study).
This is doable in Explorer in several ways:
1. Create a smaller dataset for the study.
2. Use a query (with LIMIT-based sampling of the data) instead of scanning the full data in the workflow pipeline.
3. Explore and query a subgraph of the data on the canvas and leverage Explorer's in-browser graph algorithms; everything runs on a small piece of data, visually, in your browser.
I would personally go with option 3 to do fast back-and-forth graph algorithm experiments first, and then move to the workflow before the final implementation.
As for option 1, it is not easy to cut out a smaller piece of the graph while keeping the expected graph structure/connectivity unchanged; and option 2 has a similar cost to option 3, but with option 3 we can do everything with drag-drop-click :).
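For option 2, here is a rough sketch of what LIMIT-based sampling could look like with the nebula3-python client; the addresses, credentials, space/tag names, and the query itself are placeholders, and in a workflow you would presumably put the equivalent nGQL into a query node instead:

```python
# Hypothetical sampling query: pull only a bounded slice of the graph
# instead of scanning the whole space. Space and tag names are made up.
from nebula3.Config import Config
from nebula3.gclient.net import ConnectionPool

config = Config()
pool = ConnectionPool()
pool.init([("127.0.0.1", 9669)], config)

with pool.session_context("root", "nebula") as session:
    session.execute("USE my_space")
    # Sample at most 1000 paths as the working set for quick experiments.
    result = session.execute(
        "MATCH (v:player)-[e]->(n) RETURN v, e, n LIMIT 1000"
    )
    print(result)

pool.close()
```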
ref:
https://docs.nebula-graph.io/3.3.0/nebula-explorer/graph-explorer/graph-algorithm/ (the in-browser graph algorithm function)

Related

Is SageMaker multi-node Spot-enabled GPU training an anti-pattern?

Is it an anti-pattern to do multi-node Spot-enabled distributed GPU training on SageMaker?
I'm afraid that several issues will slow things down or even make them infeasible:
the interruption detection lag
the increased probability of interruption (N instances)
the need to re-download data at every interruption
the need to start/stop whole clusters instead of just replacing interrupted nodes
the fact that SageMaker doesn't support variable-size clusters
Additionally, the EC2 Spot documentation deters users from using Spot in multi-node workflows where nodes are tightly coupled (which is the case in data-parallel and model-parallel training): "Spot Instances are not suitable for workloads that are inflexible, stateful, fault-intolerant, or tightly coupled between instance nodes."
Anybody here have experience doing Spot-enabled distributed GPU training on SageMaker happily?
The short answer is that Spot training works well when the instance type you need, in the region you need, has enough free capacity at a particular time. Otherwise you won't be able to start the job, or you'll get interruptions too frequently.
Why not just try it for yourself? Once you have a working on-demand training job, you can enable Spot training by adding the 3 relevant parameters to the job's Estimator definition and implementing checkpoint save/load (good to have anyway). Then if it works well, great! If not, switch back.
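For reference, a minimal sketch of what that could look like with the SageMaker Python SDK; the image URI, role, instance settings, and S3 paths are placeholders, and the timeout values are arbitrary:

```python
# Hypothetical example: enabling managed Spot on an existing training job.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=2,                     # multi-node data-parallel training
    instance_type="ml.p3.16xlarge",
    use_spot_instances=True,              # request Spot capacity
    max_run=8 * 3600,                     # max training time, in seconds
    max_wait=12 * 3600,                   # training time plus time you'll wait for capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # synced to /opt/ml/checkpoints
)

estimator.fit({"train": "s3://my-bucket/train/"})
```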

Database and large Timeseries - Downsampling - OpenTSDB InfluxDB Google DataFlow

I have a project where we sample a "large" amount of data on a per-second basis. Some operations, such as filtering, are performed on it, and it then needs to be accessible at second, minute, hour, or day intervals.
We currently do this with an SQL-based system and software that updates different tables (daily averages, hourly averages, etc.).
We are now looking into whether other solutions could fit our needs, and I came across several, such as OpenTSDB, Google Cloud Dataflow, and InfluxDB.
All of them seem to address time-series needs, but it is difficult to get information about the internals. OpenTSDB does offer downsampling, but it is not clearly specified how.
The concern is that since we may query a vast amount of data, for instance a year, if the DB downsamples at query time rather than pre-computing the result, it may take a very long time.
Also, the downsampled results need to be updated whenever "delayed" data points are added.
On top of that, upon data arrival we perform some processing (outlier filtering, calibration) whose intermediate results should not be written to disk. A RAM-based DB could be used for this, but perhaps a more elegant solution exists that works together with the previous requirements.
I believe this application is not "extravagant" and that tools must exist to do this; I'm thinking of stock tickers, monitoring, and so forth.
Perhaps you have some good suggestions as to which technologies / DBs I should look into.
Thanks.
You can accomplish such use cases pretty easily with Google Cloud Dataflow. Data preprocessing and optimizing queries is one of the major scenarios for Cloud Dataflow.
We don't provide a built-in "downsample" primitive, but you can write such a data transformation easily. If you simply want to drop unnecessary data, you can just use a ParDo. For really simple cases, the Filter.byPredicate primitive can be even simpler.
Alternatively, if you want to merge many data points into one, a common pattern is to window your PCollection to subdivide it according to the timestamps. Then you can use a Combine to merge the elements per window.
The additional processing you mention can easily be tacked onto the same data processing pipeline.
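As a sketch of that windowing + Combine pattern, here is roughly what a per-minute average could look like with the Apache Beam Python SDK (the successor to the Dataflow SDK); the in-memory source, window size, and (sensor_id, value) layout are assumptions for illustration:

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# Toy per-second samples: (sensor_id, value, event_time_in_seconds).
samples = [("sensor-1", 10.0, 0), ("sensor-1", 14.0, 30), ("sensor-1", 20.0, 70)]

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(samples)
        | "Stamp" >> beam.Map(lambda s: TimestampedValue((s[0], s[1]), s[2]))
        | "Window1min" >> beam.WindowInto(FixedWindows(60))  # one window per minute
        | "MeanPerSensor" >> beam.combiners.Mean.PerKey()    # minute average per sensor
        | "Print" >> beam.Map(print)                         # swap for a real sink
    )
```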
In terms of comparison, Cloud Dataflow is not really comparable to databases. Databases are primarily storage solutions with processing capabilities. Cloud Dataflow is primarily a data processing solution, which connects to other products for its storage needs. You should expect your Cloud Dataflow-based solution to be much more scalable and flexible, but that also comes with higher overall cost.
Dataflow is for inline processing as the data comes in. If you are only interested in summaries and calculations, Dataflow is your best bet.
If you want to later take that data and access it by time (time series) for things such as graphs, then InfluxDB is a good solution, though it has a limitation on how much data it can contain.
If you're OK with a 2-25 second delay on large data sets, then you can just use BigQuery along with Dataflow. Dataflow will receive, summarize, and process your numbers; then you submit the result into BigQuery. HINT: divide your tables by DAYS to reduce costs and make re-calculations much easier.
We process 187 GB of data each night. That equals 478,439,634 individual data points (each with about 15 metrics and an average of 43,000 rows per device) for about 11,512 devices.
Secrets to BigQuery:
LIMIT your column selection. Don't ever do a select * if you can help it.
;)
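Purely as an illustration of the day-sharding and column-pruning advice above, here is a sketch using the google-cloud-bigquery Python client; the project, dataset, table, and column names are made up:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query only the columns you need from a single day shard (metrics_20240101),
# instead of SELECT * over the whole history.
sql = """
    SELECT device_id, AVG(value) AS avg_value
    FROM `my_project.telemetry.metrics_20240101`
    GROUP BY device_id
"""
for row in client.query(sql).result():
    print(row.device_id, row.avg_value)
```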

Rule of thumb to begin optimizing a page's database queries

I'm implementing database profiling on a website that will definitely see a measured increase in growth over the next year. I'm implementing query profiling on each page (using Zend) and am going to log issues when a page gets too slow. At that point, I'll see what I can do to optimize the queries. The problem is that, without any experience scaling a website, I'm not sure what "too slow" would be for the queries on a given page. Is there any accepted time limit for the queries on a given page before one should look for ways to optimize them?
Thanks,
Eric
There's no global "too slow". Everything depends on what the queries do and what your traffic is like. Invest some time in writing scenarios for a traffic generator and just load-test your website. Check which parts break first, fix them, and repeat. Even simple queries can hit pathological cases.
Don't forget to load more fake data into the database too - more users are likely to generate more data for you, and some problems may only start once the dataset is larger than your database caches/buffers. Make sure you're blaming the right queries as well - if you have something locking tables for updates, other transactions may need retries or get delayed - so look at the top N queries instead of fixating on one single query.
Make sure you look at the queries from both sides too - from the client and the server. If you're using MySQL, for example, you can easily log all queries which don't use indexes for joins/searches. You can also use Percona Toolkit (previously Maatkit) to grab the traffic off the network and analyse that instead. You can use MySQLTuner to see how many cache misses you experience. For other databases, you can find similar tools elsewhere.
If there is any general rule, I'd say: if your queries start taking 10x the time they took without any other load, you've got a problem. Also, it's not really about queries - it's about page load time. Find an answer to "how long should page generation take?" and go from there (probably less than a second, unless you do heavy data processing under the covers).
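Since the answer boils down to "measure against your own page budget", here is a small sketch (in Python rather than the asker's Zend/PHP) of logging the offending queries when a per-page budget is exceeded; the 200 ms budget is an arbitrary placeholder:

```python
import logging
import time
from contextlib import contextmanager

PAGE_QUERY_BUDGET_S = 0.2  # arbitrary placeholder; derive a real budget from load tests

@contextmanager
def profiled(label, timings):
    """Time one query and record its duration under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

def report(timings):
    """Log the slowest queries when the page's total query time blows the budget."""
    total = sum(timings.values())
    if total > PAGE_QUERY_BUDGET_S:
        worst = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:5]
        logging.warning("page spent %.0f ms in queries; top offenders: %s",
                        total * 1000, worst)

# Usage per page request:
# timings = {}
# with profiled("load_user", timings):
#     ...run the query...
# report(timings)
```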

Query Profiler for Couchbase

I have recently started experimenting with Couchbase 2.0. What would be a good way to measure the time taken by operations like inserts, aggregations, etc. in Couchbase? Is there a profiler available?
One basic way would be to measure the time taken to execute the REST API commands, but that would include network latency. What I'm looking for is the time the server actually spends processing the commands.
Thanks in advance.
You can see high-level ops/sec, broken down into various operations (reads, writes, etc.), from the UI console when monitoring a data bucket. The 'cbstats' tool has a 'timings' histogram option which reports the times for operations and breaks them down into the important parts of each operation, such as how long it took to read items off of disk if necessary, or how long it took to complete a delete operation, etc.
The documentation for this command can be found here.
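If you want to collect those server-side timings programmatically, one rough option is to shell out to cbstats; the host, port, bucket name, and exact flags below are assumptions, so check cbstats' help output for your Couchbase version:

```python
# Rough sketch: fetch the 'timings' histogram text from one data node.
import subprocess

def fetch_timings(host="127.0.0.1", port=11210, bucket="default"):
    """Run cbstats and return its raw 'timings' output for the given bucket."""
    out = subprocess.run(
        ["cbstats", f"{host}:{port}", "timings", "-b", bucket],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

if __name__ == "__main__":
    print(fetch_timings())
```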

Google App Engine - stats on datastore reads for a complete application

In order to decrease the cost of an existing application which over-consumes on Datastore reads, I am trying to get stats on the application as a whole.
What I'd like to get for the overall application is stats about the queries that return the biggest number of rows during a complete day of production. The cost of retrieving data being $0.70 per million, there is a big incentive to optimise / cache some queries, but first I have to understand which queries retrieve too much data.
Appstats apparently does not provide this information, as the tool's primary purpose is to optimise individual RPC calls.
Does anyone have a magic solution for this one? One alternative I thought about was to build a tool myself to log the number of rows returned after each query, but that looks like overkill and would require opening up the code.
Thanks a lot for your help !
Hugues
See this related post: https://stackoverflow.com/questions/11282567/calculating-datastore-api-usage-per-request/
What you can do to measure and optimize is to look at the cost field provided by the LogService (it's called cpm_usd in the admin panel).
Using this information you can find the most expensive URLs and thus optimize their queries.
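As a sketch of that idea on the legacy Python App Engine runtime, something along these lines could aggregate per-request cost by URL; the logservice attribute names are believed correct but should be verified against the RequestLog documentation:

```python
# Rough sketch: sum estimated request cost per URL over the last N hours.
import time
from collections import defaultdict

from google.appengine.api.logservice import logservice

def most_expensive_urls(hours=24, top_n=20):
    """Return the top_n URLs by estimated dollar cost over the last `hours`."""
    end = time.time()
    start = end - hours * 3600
    cost_per_url = defaultdict(float)
    for req in logservice.fetch(start_time=start, end_time=end):
        if req.cost:                        # estimated cost of the request, in dollars
            cost_per_url[req.resource] += req.cost
    return sorted(cost_per_url.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```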
