Suppose I have a long list of URLs. Now, I need to write a script to do the following -
Go to each of the URLs
Get the data returned
And store it in a database
I know two ways of doing this -
Pop one URL from the list, download the data and save it in the database. Pop the next URL, download data and save it in the db, and repeat...
This will require way too many disk writes so, other way is to
Download the data from each of the URLs and save it to the memory. And finally, save it all to the database in one disk write.
But this will require carrying a huge chunk of data in the memory. So there's a possibility that program may just terminate because of OOM Error.
Is there any other way, which is some kind of intermediate between these methods?
(In particular, I am writing this script in Julia and using MongoDB)
We can extend #Trifon's solution a little bit with concurrency. You could simultaneously run two threads:
A thread which fetches the data from the URLs, and stores them in a channel in the memory.
A thread which reads from the channel and writes the data to the disk.
Make sure that the channel has some bounded capacity, so that Thread 1 is blocked in case there are too many consecutive channel writes without Thread 2 consuming them.
Julia is supposed to have good support for parallel computing
Write results to the database in batches, say every 1000 URLs.
This solution is something between 1 & 2 of the two ways you are describing above.
Related
I'm in the design stage of my App DB, running on Firebase.
I wonder what is the best approach to store my Data.
I know that Firebase only returns the first layer of the document unless you explicitly query for sub-collection And I want to reduce extra reads as much as possible to save $.
In short, my DB holds devices and each device can have messages (call-backs) that the server needs to send back to him, one at a time, in FIFO order.
The user sees the waiting call-backs when watching a device in the App.
My first thought was to hold a collection of messages for each device, but this approach will require two reads for each call-back in each device when the server wants to know what is the next call-back to send.
My second approach is to hold the call-back in array in each device document but I'm not sure if it is the right approach, sounds messy to me for some reason.
My third option is to hold a collection of call-backs with the deviceID field in each call-back. the drawback I see in this approach is that I need to perform some kind of "join" when viewing a device and his call-backs in the App, And the searching time for the next call-back is increasing, although still reasonable (log(n) time).
If we talk numbers theoretically the system can store tens of thousands of devices and each device can have 10-100 call-backs in line, and each call-back can be called every 15 seconds or more.
My first thought was to hold a collection of messages for each device, but this approach will require two reads for each call-back in each device when the server wants to know what is the next call-back to send.
If you need to query the messages that correspond to a device, then storing them into a collection is a good option to go ahead with. This is because you can perform simple and compound qureis.
My second approach is to hold the call-back in array in each device document but I'm not sure if it is the right approach, sounds messy to me for some reason.
If you only need to list them, and not query, then that's a really good idea in my opinion, but only if the size of the document can stay below the maximum limit. So there are some limits when it comes to how much data you can put into a document. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB total of data in a single document. When we are talking about storing text, you can store pretty much but as your array gets bigger, be careful about this limitation.
My third option is to hold a collection of call-backs with deviceID field in each call-back. the drawback I see in this approach is that I need to perform some kind of "join" when viewing a device and his call-backs in the App, And the searching time for the next call-back is increasing, although still reasonable (log(n) time).
When talking again about collections, we are getting back to the first question. Remember, we are usually structuring a Firestore database according to the queries that we want to perform. So it's up to you to decide if this approach it's best for you.
Always keep in mind that you are billed according to the number of documents a query returns.
We are running large query jobs where we hit the 128M response size and BigQuery raises the "Response too large to return. Consider setting allowLargeResults to true in your job configuration" error.
We are opting for the allowLargeResults approach to keep the already complex SQL unchanged (instead of chunking things at this level). The question is what is the best way to process the results written to the intermediate table:
Export the table to GCS, then queue tasks that process chunks of the response file using offsets into the GCS file. This introduces latency from GCS, GCS file maintenance (e.g. cleaning up files), and another point of failure (http errors/timeouts etc).
Query chunks from the intermediate tables also using queued tasks. The question here is what is the best way to chunk the rows (is there an efficient way to do this, e.g. is there an internal row number we can refer to?). We probably end up scanning the entire table for each chunk so this seems more costly than the export to GCS option.
Any experience in this area and or recommendations?
Note that we are running in the Google App Engine (Python)
Thanks!
I understand that https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list willl let you read chunks of a table without performing a query (incurring data processing charges).
This lets you read the results of a query in parallel as well as all queries are written to a temporary table id which you can pass to this function and supply different ranges (with startIndex,maxResults).
Note that BigQuery is able to export data in chunks - and you can request as many chunks as workers you have.
From https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple:
If you ask to export to:
['gs://my-bucket/file-name.json']
you will get an export in one GCS file, as long as it's less than 1GB.
If you ask to export to:
['gs://my-bucket/file-name-*.json']
you will get several files with each having a chunk of the total export. Useful when exporting more than 1GB.
If you ask to export to:
['gs://my-bucket/file-name-1-*.json',
'gs://my-bucket/file-name-2-*.json',
'gs://my-bucket/file-name-3-*.json']
you will get exports optimized for 3 workers. Each of these patterns will receive a series of exported chunks, so each worker can focus on its own chunks.
It is not clear what is exactly the reason for "chunk processing". If you have some complex SQL logic that needs to be run against your data and result happend to be bigger than current 128MB limit - just still do it with allowLargeResults and than consume result the way you need to consume it.
Of course you most likely have reason for chunking but it is not understood and thus making answering problematic
Another suggestion is to not to ask many questions in one question - this make answering very problematic, so you have great chance to not to get it answered
Finally, my answer for the only question that is relatively clear (for me at least)
The question here is what is the best way to chunk the rows (is there
an efficient way to do this, e.g. is there an internal row number we
can refer to?). We probably end up scanning the entire table for each
chunk so this seems more costly than the export to GCS option
It depends on how and when your table was created!
If your table was loaded as a one big load - i dont see way of avoiding table scan again and again.
If table was loaded in increments and recently - you have chance to enjoy so called Table Decoration (specifically you should look to Range Decorators)
In the very early era of BigQuery there was an expectation of having Partitioned Decorators - this would address a lot of users needs - but it is still not available and i dont know what are the plans on them
Ok everyone, I have an excellent challenge for you. Here is the format of my data :
ID-1 COL-11 COL-12 ... COL-1P
...
ID-N COL-N1 COL-N2 ... COL-NP
ID is my primary key and index. I just use ID to query my database. The datamodel is very simple.
My problem is as follow:
I have 64Go+ of data as defined above and in a real-time application, I need to query my database and retrieve the data instantly. I was thinking about 2 solutions but impossible to set up.
First use sqlite or mysql. One table is needed with one index on ID column. The problem is that the database will be too large to have good performance, especially for sqlite.
Second is to store everything in memory into a huge hashtable. RAM is the limit.
Do you have another suggestion? How about to serialize everything on the filesystem and then, at each query, store queried data into a cache system?
When I say real-time, I mean about 100-200 query/second.
A thorough answer would take into account data access patterns. Since we don't have these, we just have to assume equal probably distribution that a row will be accessed next.
I would first try using a real RDBMS, either embedded or local server, and measure the performance. If this this gives 100-200 queries/sec then you're done.
Otherwise, if the format is simple, then you could create a memory mapped file and handle the reading yourself using a binary search on the sorted ID column. The OS will manage pulling pages from disk into memory, and so you get free use of caching for frequently accessed pages.
Cache use can be optimized more by creating a separate index, and grouping the rows by access pattern, such that rows that are often read are grouped together (e.g. placed first), and rows that are often read in succession are placed close to each other (e.g. in succession.) This will ensure that you get the most back for a cache miss.
Given the way the data is used, you should do the following:
Create a record structure (fixed size) that is large enough to contain one full row of data
Export the original data to a flat file that follows the format defined in step 1, ordering the data by ID (incremental)
Do a direct access on the file and leave caching to the OS. To get record number N (0-based), you multiply N by the size of a record (in byte) and read the record directly from that offset in the file.
Since you're in read-only mode and assuming you're storing your file in a random access media, this scales very well and it doesn't dependent on the size of the data: each fetch is a single read in the file. You could try some fancy caching system but I doubt this would gain you much in terms of performance unless you have a lot of requests for the same data row (and the OS you're using is doing poor caching). make sure you open the file in read-only mode, though, as this should help the OS figure out the optimal caching mechanism.
When developing software that records input signals (numbers) in real time, how can this data be best stored and compressed? Would an SQL engine be good for this, permitting fast data mining in the future, or are there other data formats that would be suitable or compressed enough for upto 1000 data samples per second?
I don't mind building in VC++ but ideas applicable to C# would be ideal.
It is hard to say without more info, such as, what is the source, will you be needing to query the stored data, and so on.
But for 1000 samples/sec, you should propably look at holding a few seconds of data in memory, and then writing them out in bulk to persistent storage on another thread. (Multi-processor machine recommended).
If you decide to do it via a managed language, keep the same data structure around for keeping the samples - so that the GC does not need to collect memory too often. You can get marginally better performance by using pointers and the unsafe keyword (provides direct access to the memory structure and eliminates bounds checking code for arrays).
I don't know how much CPU time is needed for you to collect each sample; and how time-critical it is to read each sample at a specified time (will they be buffered in the device you are reading from ?). If the sampling is time-critical, you have 1 ms per sample; and then you probably cannot afford the risk of the garbage collector kicking in, as it will block your thread for some time. In this case, I would go for an unmanaged approach.
SQL Server would easily be able to hold your data, or you could write them to a file. It mostly depends on what you need to do with the data at a later time. I don't know how much data each sample is, but let's assume it is 8 bytes. Then you have 8000 bytes per second to write of raw data - perhaps you have some overhead, so it could be 10 kB/s. Most storage mechanisms I can think of will be able to write data at this speed. Just make sure to write on another thread than the one that are doing the sampling.
You may want to look at time-series databases, rather than relational. These will be optimised to deal with the sort of data and usage you're considering.
Kx is a popular choice, as is Fame.
I'm designing an application that receives information from roughly 100k sensors that measure time-series data. Each sensor measures a single integer data point once every 15 minutes, saves a log of these values, and sends that log to my application once every 4 hours. My application should maintain about 5 years of historical data. The packet I receive once every 4 hours is of the following structure:
Data and time of the sequence start
Number of samples to arrive (assume this is fixed for the sake of simplicity, although in practice there may be partials)
The sequence of samples, each of exactly 4 bytes
My application's main usage scenario is showing graphs of composite signals at certain dates. When I say "composite" signals I mean that for example I need to show the result of adding Sensor A's signal to Sensor B's signal and subtracting Sensor C's signal.
My dilemma is how to store this time-series data in my database. I see two options, assuming I use a relational database:
Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
Store every 4-hour signal as a separate row with its starting time. In this case, whenever a signal arrives, I just add it as a BLOB to the database.
There are obvious pros and cons for each of the options, including storage size, performance, and complexity of the code "above" the database.
I wondered if there are best practices for such cases.
Many thanks.
Storing each sample in it's own row sounds simple and logical to me. Don't be too hasty to optimize unless there is actually a good reason for it. Maybe you should do some tests with dummy data to see if any optimization is really necessary.
I think storing the data in the form that makes it easiest to carry out your main goal is likely the least painful overall. In this case, it's likely the more efficient as well.
Since your main goal appears to be to display the information in interesting and flexible ways I'd go with separate rows for each data point. I presume most of the effort required to write this program well is likely on the display side, you should minimize the complexity on that side as much as possible.
Storing data in BLOBs is good if the content isn't relevent and you would never want to run queries against it. In this case, your data will be the contents of the database, and therefore, very relevent.
I think you should:
1.Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
I see two database operations here: the first is to store the data as it comes in, and the second is to retrieve the data in a (potentially large) number of ways.
As Kieveli says, since you'll be using discrete parts of the data (as opposed to all of the data all at once), storing it as a blob won't help you when it comes time to read it. So for the first task, storing the data line by line would be optimal.
This might also be "good enough" when querying the data. However, if performance is an issue, and/or if you get massive amounts of volume [100,000 sensors x 1 per 15 minutes x 4 hours = 9,600,000 rows per day, x 5 years = 17,529,600,000 or so rows in five years]. To my mind, if you want to write flexible queries against that kind of data, you'll want some form of star schema structure (as gets used in data warehouses).
Whether you load the data directly into the warehouse, or let it build up "row by row" to be added to the warehouse ever day/week/month/whatever, depends on time, effort, available resources, and so on.
A final suggestion: when you set up a test environment for your new code, load it with several years of (dummy) data, to see how it will perform.