How to export data from Milvus?

We are using Milvus 2.0 for vector search, and Milvus is positioned as a database, so it should provide an efficient way to export data.
After reading the Milvus website, I only found an API named "query" to obtain vectors from Milvus. I tested this API, but found it too slow to be usable.
How do I quickly export large quantities of data from Milvus?

Query is not designed to pull massive amounts of data from Milvus, for the following reasons:
There is a 2 GB gRPC transfer limit per call, and pagination is not implemented yet.
Performance is slower than reading the data directly from disk.
There is a tool called Milvus Data Migration (MilvusDM); however, it does not support Milvus 2.x yet. The community is working on this and it will be available in a future release. This is the MEP document: https://wiki.lfaidata.foundation/display/MIL/MEP+24+--+Support+bulk+load
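In the meantime, one workaround is to page through the data yourself by filtering on the primary key, so each call stays well under the RPC limit. Below is a minimal sketch with pymilvus; the collection name, field names, and id range are illustrative assumptions, not part of the original question.

```python
# Sketch only: batched export from Milvus 2.x via the query API, assuming an
# integer primary key field "id" whose values fall in a known range.
# Collection and field names here are illustrative, not from the question.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("my_collection")   # hypothetical collection name
collection.load()                          # query requires the collection loaded

MAX_ID = 1_000_000        # assumed upper bound on primary keys
WINDOW = 10_000           # keep each response well under the 2 GB gRPC cap

rows = []
for lo in range(0, MAX_ID, WINDOW):
    # Each call only touches one id window, so responses stay small.
    batch = collection.query(
        expr=f"id >= {lo} and id < {lo + WINDOW}",
        output_fields=["id", "embedding"],
    )
    rows.extend(batch)

print(f"exported {len(rows)} rows")
```

From here the rows can be written to CSV/Parquet or whatever bulk format the downstream system expects.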

Related

What is the best way to run a report in notebooks when connected via the Snowflake connector?

My last couple of questions have been about how to connect to Snowflake and add and read data with the Python connector in an IPython notebook. However, I am having trouble with the next best step to create a report with the data I want to visualize.
I would like to upload all of the data, store it, then analyze it, kind of like a homemade dashboard.
So what I have done so far is a small version:
Stage my data from a local file, and add new data each time I open the notebook
Then use the Python connector to call any data from storage
Create visualizations with NumPy objects in the local notebook.
My data will start out very small, but over time I would imagine I would have to move computation to the cloud to minimize the memory used locally for the small dashboard.
My question is: my data comes from an API that returns JSON files; new data is no bigger than 75 MB a day, across 8 columns, with two aggregate calls on the data done in the SQL call. If I run these visualizations monthly, is it better to aggregate the information in Snowflake, or locally?
Put the raw data into Snowflake. Use tasks and procedures to aggregate it and store the result. Or better yet, don't do any aggregations except for when you want the data - let Snowflake do the aggregations in real-time off the raw data.
I think what you might be asking is whether you should ETL your data or ELT your data:
ETL: Extract, Transform, Load (in that order) - Extract data from your API. Transform it locally on your computer. Load it into Snowflake.
ELT: Extract, Load, Transform (in that order) - Extract data from your API. Load it into Snowflake. Transform it after it's in Snowflake.
Both ETL and ELT are valid. Many companies use both approaches w/ Snowflake interchangeably. But Snowflake was built to be, in a sense, your data lake - the idea being, "Just throw all your data up here and then use our awesome compute and storage resources to transform it quickly and easily."
Do a Google search on "Snowflake ELT" or "ELT vs ETL" for more information.
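To make the ELT idea concrete, here is a minimal sketch with the Snowflake Python connector: land the raw JSON first, then let Snowflake do the aggregation at query time. The connection details, stage, table, and JSON field names are all illustrative assumptions.

```python
# Sketch of the ELT path with snowflake-connector-python: load the raw JSON
# into Snowflake as-is, then aggregate there. All object names are made up.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Load: stage the daily JSON file into the table stage and copy it into a VARIANT column.
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")
cur.execute("PUT file:///tmp/daily_extract.json @%raw_events")
cur.execute("COPY INTO raw_events FROM @%raw_events FILE_FORMAT = (TYPE = JSON)")

# Transform: let Snowflake aggregate the raw data in real time, off the raw table.
cur.execute("""
    SELECT payload:sensor_id::STRING AS sensor_id,
           AVG(payload:value::FLOAT)  AS avg_value
    FROM raw_events
    GROUP BY 1
""")
for sensor_id, avg_value in cur:
    print(sensor_id, avg_value)

cur.close()
conn.close()
```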
Here are some considerations either way off the top of my head:
Tools you're using: Some tools like SSIS were built w/ ETL in mind - transformation of the data before you store it in your warehouse. That's not to say you can't ELT, but it wasn't built w/ ELT in mind. More modern tools - like Fivetran or even Snowpipe - assume you're going to aggregate all your data into Snowflake and then transform it once it's up there. I really like the ELT paradigm - i.e. just get your data into the cloud and transform it quickly once it's up there.
Size and growth of your data: If your data is growing, it becomes harder and harder to manage it on local resources. It might not matter when your data is in gigabytes or millions of rows. But as you get into billions of rows or terabytes of data, the scalability of the cloud can't be matched. If you feel like this might happen and you think putting it into the cloud isn't a premature optimization, I'd load your raw data into Snowflake and transform it after it's up there.
Compute and Storage Capacity: Maybe you have a massive amount of storage and compute at your fingertips. Maybe you have an on-prem cluster you can provision resources from at the drop of a hat. Most people don't have that.
Short-Term Compute and Storage Cost: Maybe you have some modest resources you can use today and you'd rather not pay Snowflake while your modest resources can do the job. Having said that, it sounds like the compute to transform this data will be pretty minimal, and you'll only be doing it once a day or once a month. If that's the case, the compute cost will be very minimal.
Data Security or Privacy: Maybe you have a need to anonymize data before moving it to the public cloud. If this is important to you you should look into Snowflake's security features, but if you're in an organization where it's super difficult to get a security review and you need to move forward with something, transforming it on-prem while waiting for security review is a good alternative.
Data Structure: Do you have duplicates in your data? Do you need access to other data in Snowflake to join on in order to perform your transformations? As you start putting more and more data into Snowflake, it makes sense to transform it after it's in Snowflake - that's where all your data is and you will find it easier to join, query and transform in the cloud where all your other data is.
My question is: my data comes from an API that returns JSON files; new data is no bigger than 75 MB a day, across 8 columns, with two aggregate calls on the data done in the SQL call. If I run these visualizations monthly, is it better to aggregate the information in Snowflake, or locally?
I would flatten your data in Python or Snowflake - depending on which you feel more comfortable using or how complex the data is. You can just do everything on the raw JSON, although I would rarely design something that way myself (it's going to be the slowest to query).
As far as aggregating the data, I'd always do that in Snowflake. If you would like to slice and dice the data various ways, you may look to design a data mart data model and have your dashboard simply aggregate data on the fly via queries. Snowflake should be pretty good with that, but for additional speed, aggregating up to months may be a good idea too.
You can probably mature your process from being driven by a local Python script to something like a serverless Lambda, event-driven with a scheduler, as well.
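For the Snowflake-side option, a rough sketch of flattening plus a monthly roll-up might look like the following; the table, VARIANT column, and JSON field names are illustrative assumptions.

```python
# Sketch: flattening and monthly aggregation pushed down to Snowflake.
# Assumes a raw table with a VARIANT column "payload" holding a JSON array
# under "records"; all names are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()
cur.execute("""
    SELECT DATE_TRUNC('MONTH', f.value:ts::TIMESTAMP) AS month,
           f.value:sensor_id::STRING                  AS sensor_id,
           AVG(f.value:reading::FLOAT)                AS avg_reading
    FROM raw_events,
         LATERAL FLATTEN(input => payload:records)    AS f
    GROUP BY 1, 2
    ORDER BY 1, 2
""")
monthly = cur.fetchall()   # small result set, cheap to pull into the notebook
```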

Send sensor data to hazelcast database and link it to development tools to design dashboards

Can we use the Hazelcast database to link and visualize the data in a tracker with a bar graph? Below are the points I need to confirm in order to build the application for the hardware:
- I am using a temperature sensor interfaced with an Arduino Yun and want to upload the sensor's readings to a Hazelcast server.
- Read the data uploaded to the Hazelcast server back out through an Arduino MKR1000.
- Link the data to different development tools to design different types of dashboards, such as pie charts, bar charts, line charts, etc.
Please suggest the best way to create and link the database in the data grid.
How you want to use data on your dashboard will basically depend on how you have modelled your data - one map or multiple maps, etc. Then you can retrieve data through single key-based lookups or by running queries and use that for your dashboard. You can define the lifetime of the data - be it a few minutes, hours, or days. See eviction: http://docs.hazelcast.org/docs/3.10.1/manual/html-single/index.html#map-eviction
If you decide to use a visualisation tool for the dashboard that can use JMX, then you can latch on to the JMX beans Hazelcast exposes, which would give you information about data stored in the cluster and a lot more. Check out this: http://docs.hazelcast.org/docs/3.10.1/manual/html-single/index.html#monitoring-with-jmx
You can configure Hazelcast to use a MapLoader/MapStore to persist the cached data to any back-end persistence mechanism - relational or NoSQL databases may be good choices.
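As a concrete starting point on the client side, here is a minimal sketch with hazelcast-python-client that stores temperature readings as JSON entries in a map and queries the recent ones for a chart. The map name, JSON fields, and cluster address are illustrative assumptions, and the Arduino itself would typically publish through a gateway rather than run this code.

```python
# Sketch with hazelcast-python-client: store sensor readings as JSON entries
# in a map and query them for a dashboard. All names are illustrative.
import json
import time
import hazelcast
from hazelcast.core import HazelcastJsonValue
from hazelcast.predicate import greater_or_equal

client = hazelcast.HazelcastClient(cluster_members=["127.0.0.1:5701"])
readings = client.get_map("temperature-readings").blocking()

# Write one reading keyed by timestamp (what a gateway in front of the Yun might push).
now = int(time.time())
reading = {"sensor": "yun-1", "celsius": 23.5, "ts": now}
readings.put(now, HazelcastJsonValue(json.dumps(reading)))

# Read the last 10 minutes back for a chart.
recent = readings.values(greater_or_equal("ts", now - 600))
for value in recent:
    print(value.to_string())

client.shutdown()
```

Eviction/expiry and any MapStore persistence are configured on the cluster members, not in this client snippet.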
On your first point, I wouldn’t expect anything running on the Arduino to update the database directly, but the MKR1000 is going to get you connectivity, so you can use Kafka/MQTT/… - take a look at https://blog.hazelcast.com/hazelcast-backbone-iot-internet-things/.
If you choose this route, you’d set up a database that is accessible to all cluster members, create the MapLoader/MapStore class (see the example code for help) and configure the cluster to read/write.
Once the data is in the cluster, access is easy and you can use a dashboard tool of your choice to present the data.
(edit) - to your question about presenting historical data on your dashboard:
Rahul’s blog post describes a very cool implementation of near/real-time data management in a Hazelcast RingBuffer. In that post, I think he mentioned collecting data every second and buffering two minutes worth.
The ring buffer has a configured capacity, but note that he is over-writing on add - this is kind of a given for real-time systems, since the choice is between losing older data and crashing.
For a generalized query-tool approach, I think you’d augment this. Off the top of my head, I could see using the ring buffer in conjunction with a distributed map. You could (but wouldn’t need to) populate the map, using a map-event interceptor to populate the ring buffer. That should leave the existing functionality intact. The map, though, would allow you to configure a map-store/map-loader, so that your data is saved in a backing store. The map would support queries - but keep in mind that IMDG queries do not read through to the backing store.
This would give you flexibility, at the cost of some complexity. The real-time data in the ring buffer would always be available, quickly and easily. Data returned from querying the map would be very quick, too. For ‘historical’ data, you can query your backing store - which is slower, but will probably have relatively great storage capacity. The trick here is to know when to query each. The most recent data is a given, with its fixed capacity. You need to know how much is in the cluster - i.e. how far back your in-memory history goes. I think it best to configure the expiry to a useful limit and provision the storage so that data leaves the map by expiration - not eviction. In this way, you can know what the beginning of the in-memory history is. Monitoring eviction events would tell you that your cluster has a complete view of data back to a known time.
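If you wanted to prototype that combination from a client, a rough sketch with hazelcast-python-client might look like this. The ring buffer and map names, the JSON shape, and the two-minute window are illustrative assumptions; the MapStore and eviction configuration described above live on the server side, not in this snippet.

```python
# Sketch of the ring-buffer-plus-map idea: the ring buffer keeps a fixed
# window of recent readings for the live view, while the map keeps history
# (optionally persisted via a server-side MapStore). Names are illustrative.
import json
import time
import hazelcast

client = hazelcast.HazelcastClient(cluster_members=["127.0.0.1:5701"])
live = client.get_ringbuffer("recent-readings").blocking()   # capacity configured server-side
history = client.get_map("reading-history").blocking()

def record(sensor_id, celsius):
    reading = {"sensor": sensor_id, "celsius": celsius, "ts": int(time.time())}
    payload = json.dumps(reading)
    live.add(payload)                      # newest data for the real-time chart
    history.put(reading["ts"], payload)    # durable copy for historical queries

record("yun-1", 22.8)

# Dashboard side: read whatever is currently buffered for the live chart.
head = live.head_sequence()
tail = live.tail_sequence()
count = min(tail - head + 1, 120)          # assumed ~2-minute window at 1 reading/second
for item in live.read_many(head, 0, count):
    print(item)

client.shutdown()
```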

Fast JSON/flat data server for mostly reads

I ask this question apprehensively because it is not a pure programming question, and because I am seeking a (well informed) suggestion.
I have an analytic front end, written in JavaScript, with lots of aggregations and charting happening in the browser (dimple.js, even stats.js, ...)
I want to feed this application with JSON or delimited data from some high-performance data structure server. No writes except for loading. Data will be maybe 1-5 GB in size and there could be dozens, if not hundreds, of concurrent readers, but only in peak hours. This data is collected from and fed by Apache Hive.
Now my question is about the selection of a database/datastore server choices for this.
(I have pretty good command of SQL/NoSQL choices, so I am really seeking advice for the very specific requirements)
Requirements and specifications for this datastore are:
Mostly if not all queries will be reads, initiated by the web, JS-based front end.
Data can be served as JSON or flat tabular csv, psv, tsv.
Total data size on this store will be 1-5 GB, with possible future growth, but nothing imminent (6-12 months)
Data on this datastore will be refreshed/loaded into this store daily. Probably never in real time.
Data will/can be accessed via some RESTful web services, Socket IO, etc.
The faster the read access, the better. Speed matters.
There has to be a security/authentication method for sensitive data protection.
It needs to be reasonably stable, not a patching-requiring bleeding edge.
Liberal, open source license.
So far, my initial candidates for examination were Postgres (optimized for large cache) and Mongo. Just because I know them pretty well.
I am also familiar with Redis, Couch.
I did not run benchmarks myself, but I have seen benchmarks where Postgres was faster than Mongo (while offering a JSON format). Mongo is web-friendlier.
I am considering in-memory stores with persistence such as Redis, Aerospike, Memcached. Redis 3.0 is my favorite so far.
So, I ask you here if you have any recommendations for the production quality datastore that would fit well what I need.
Any civil and informed suggestions are welcome.
What exactly does your data look like? Since you mentioned CSV-like exports, I'm assuming this is tabular, structured data that would usually be found in a relational database?
Some options:
1. Don't use a database
Given the small dataset, just serve it out of memory. You can probably spend a few hours to write a quick app with any decent web framework that just loads up the data into memory (for example, from a flat file) and then searches and returns this data in whatever format and way you need.
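For illustration, a minimal sketch of this approach with Flask might look like the following; the file name, column names, and filter parameter are assumptions, not requirements from the question.

```python
# Sketch of option 1: load the flat file into memory once and serve it as JSON.
# File name, columns, and the "category" filter are illustrative assumptions.
import csv
from flask import Flask, jsonify, request

app = Flask(__name__)

# Loaded once at startup; refresh by reloading the process after the daily feed.
with open("daily_extract.csv", newline="") as f:
    ROWS = list(csv.DictReader(f))

@app.route("/data")
def data():
    category = request.args.get("category")
    if category:
        return jsonify([r for r in ROWS if r.get("category") == category])
    return jsonify(ROWS)

if __name__ == "__main__":
    app.run(port=8000)
```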
2. Use an embedded database
You can also try an embedded database like SQLite which gives you in-memory performance but with a reliable SQL interface. Since it's just a single-file database, you can have another process generate a new DB file, then swap it out when you update the data for the app.
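A rough sketch of that swap pattern with Python's built-in sqlite3 module is below; the paths and table name are illustrative.

```python
# Sketch of option 2: query a read-only SQLite file and atomically swap in a
# freshly built copy on each daily refresh. Paths/table name are made up.
import os
import sqlite3

DB_PATH = "analytics.db"

def refresh(new_db_path):
    # Build the new file elsewhere, then atomically replace the live one.
    os.replace(new_db_path, DB_PATH)

def query_totals():
    # Read-only URI connection, so the web process never writes to the file.
    conn = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
    try:
        return conn.execute(
            "SELECT category, SUM(amount) FROM facts GROUP BY category"
        ).fetchall()
    finally:
        conn.close()
```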
3. Use a full database system
Use a regular relational database. MySQL, PostgreSQL, and SQL Server (Express Edition) are all free, can handle that dataset easily, and will just cache it all in RAM. If it's mostly read queries, I don't see any issues with a few hundred concurrent users. You can also use MemSQL community edition if you need more performance. They all support security, are very reliable, and you can't beat SQL for data access.
Use a key/value system if your data isn't relational or tabular and is more of a fit as simple values or documents. However, remember that KV stores aren't great at scans or aggregations and don't have joins. Memcached is just a distributed cache; don't use it for real data. Redis and Aerospike are both great key/value systems, with Redis giving you lots of nice data structures to use. Mongo is good for data flexibility. Elasticsearch is a good option for advanced search-like queries.
If you're going to these database systems though, you will still need a thin app layer somewhere to interface with the database and then return the data in the proper format for your frontend.
If you want to skip that part, then just use CouchDB or Riak instead. Both are document oriented and have a native HTTP interface with JSON responses, so you can consume them directly from your frontend, although this might cause security issues since anyone can see the JavaScript calls.

Recommended Setup for BigData Application

I am currently working on a long term project that will need to support:
Lots of fast Read/Write operations via RESTful Services
An Analytics Engine continually reading and making sense of data
It is vital that the performance of the Analytics Engine not be affected by the volume of Reads/Writes coming from the API calls.
Because of that, I'm thinking that I may have to use a "front-end" database and some sort of "back-end" data warehouse. I would also need to have something like Elastic Search or Solr indexing the data stored in the data warehouse.
The Questions:
Is this a Recommended Setup? What would the alternative be?
If so...
I'm considering either Hive or Pig for the data-warehousing, and Elastic Search or Solr as a Search Engine. Which combination is known to work better together?
And finally...
I'm seriously considering Cassandra as the "front-end" database. What is the relation between Cassandra and Hadoop, and when/why should they be put to work together instead of having just Cassandra?
Please note, my intention is NOT to start a debate about which of these is better, but to understand how they can be put to work together more efficiently. If it makes any difference, the main code is being written in Scala and Java.
I truly appreciate your help. I'm basically learning as I go and all comments will be very helpful.
Thank you.
First, let's talk about Cassandra.
This is a NoSQL database with eventual consistency, which basically means that different nodes in a Cassandra cluster may have different 'snapshots' of the data if there is an inter-cluster communication/availability problem. The data will eventually become consistent, however.
Since you are considering it as a 'front-end' database, what you need to understand is how you will model your data. Cassandra can take advantage of indexes; however, you still need to define your access patterns up front.
Normally there is no relation between Cassandra and Hadoop (except that both are written in Java); however, the DataStax distribution (enterprise version) has Hadoop support directly from Cassandra.
As a general workflow, you would read/write the most current data (let's say the last 24 hours) from your 'small' database, which has plenty of performance for that (Cassandra has excellent support for it), and you would move anything older than X (say, older than 24 hours) to 'long term storage' such as Hadoop, where you can run all sorts of MapReduce jobs, etc.
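To make that workflow a bit more concrete, here is a rough sketch of a "hot last 24 hours" table using the DataStax Python driver; the keyspace, table layout, and 24-hour TTL are illustrative assumptions (your actual code would be in Scala/Java, so treat this only as an outline of the data model).

```python
# Sketch: time-bucketed 'hot' table in Cassandra where rows expire after 24h;
# older data is assumed to have been shipped to long-term storage (e.g. Hadoop).
import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics")   # hypothetical keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_day (
        day date, ts timestamp, id uuid, payload text,
        PRIMARY KEY ((day), ts, id)
    ) WITH CLUSTERING ORDER BY (ts DESC, id ASC)
""")

now = datetime.now(timezone.utc)
# The TTL expires rows after 24 hours; the partition key keeps today's reads local.
session.execute(
    "INSERT INTO events_by_day (day, ts, id, payload) VALUES (%s, %s, %s, %s) USING TTL 86400",
    (now.date(), now, uuid.uuid4(), '{"metric": "requests", "value": 42}'),
)

# Front-end (REST) reads stay on today's partition, newest first.
rows = session.execute(
    "SELECT ts, payload FROM events_by_day WHERE day = %s LIMIT 100", (now.date(),)
)
for row in rows:
    print(row.ts, row.payload)

cluster.shutdown()
```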
Regarding text search, it really depends on what you need - Elasticsearch and Solr are direct competitors. You can see for yourself how they compare here: http://solr-vs-elasticsearch.com/
As for your third question,
I think Cassandra is more like a database to store data, while Hadoop provides a computation model that lets you analyze the large amount of data held in Cassandra. So it is very helpful to combine Cassandra with Hadoop.
There are other combinations you can consider, such as MongoDB with Hadoop, since Mongo provides the mongo-connector between Hadoop and its data. If you also have search requirements, you can use Solr and generate indexes directly from Mongo.

How are databases used to implement document collaboration?

How are document collaboration tools such as Google Docs and SharePoint implemented in the backend? What kind of database architecture is used in the backend to implement features such as multiple people editing the document simultaneously? How is this done efficiently for large documents without having each edit update an entire database entry?
And how do they maintain the complete version history of every single edit while not using up tons of disk space?
Do Google Docs and SharePoint have degrading performance for very large documents?
