Let's say I've got 3 tables (LiveDataTable, ReducedDataTable, ScheduleTable).
Basically I've got a stream of events: whenever I receive an event, I write the data extracted from it to LiveDataTable.
The problem is that there's a huge number of events, which is why LiveDataTable may become really huge. So I've got another table, ReducedDataTable, where I combine rows from LiveDataTable (think of selecting 100 rows from LiveDataTable, reducing them to 1 row, writing it to ReducedDataTable, and then deleting those 100 rows from LiveDataTable).
In order to determine the right time to perform these reducing operations, there's ScheduleTable. You can think of 1 row in ScheduleTable as corresponding to 1 reducing operation.
I want to be able to support a List<Data> getData() method from an interface. There are 2 cases: I should either read from ReducedDataTable only, or merge the results from ReducedDataTable and LiveDataTable.
Here's how my caching works step-by-step:
1. Read 1 row from ScheduleTable
2. Read from LiveDataTable
3. Write to ReducedDataTable (at least 4 rows)
4. Remove (<= INT_MAX) rows from LiveDataTable
5. Remove 1 row from ScheduleTable
The problem is that I want to determine programmatically, when a getData() request comes in, whether I should read from LiveDataTable as well as ReducedDataTable. For any request arriving before step 3, I want to read from LiveDataTable and then from ReducedDataTable. How do I determine which step I'm currently at when receiving a getData() request?
The reason I'm asking is that I believe this is a common problem in databases when handling concurrency.
(Assuming that your compaction process is fast enough)
You can first optimistically read from the small table and, if the data is missing, read from the non-compacted one.
In most cases there will be only one request, not two.
Otherwise, you can maintain a timestamp marking the data that has already been compacted.
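A minimal sketch of the optimistic read with a fallback, assuming hypothetical accessors for the two tables and a hypothetical completeness check:

def get_data():
    # Optimistic path: the compacted table usually already covers the request.
    reduced = read_reduced_table()            # hypothetical ReducedDataTable accessor
    if is_complete(reduced):                  # hypothetical completeness check
        return reduced
    # Fallback: merge in the rows that have not been compacted yet.
    return merge(reduced, read_live_table())  # hypothetical LiveDataTable accessor

With the timestamp variant, is_complete() would simply compare the requested range against the compaction watermark.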
I have a text file (call it the grandparent file) which contains 1 million lines. Each of these lines contains the absolute path of some other file (call them parent files), as shown below. The paths of the parent files are unique.
%: cat input.txt -- grandparent file
/root/a/b/c/1.txt -- parent file 1
/root/a/b/c/2.txt -- parent file 2
...
/root/a/b/d/3.txt
...
up to 1 million files.
Again, each of the above parent files contains the absolute paths of different files (call them child files) and their line numbers, as shown below. The same child file may be present in multiple parent files with the same or different line numbers.
%: cat /root/a/b/c/1.txt -- parent file
s1.c,1,2,3,4,5 -- child file and its line numbers
s2.c,1,2,3,4,5....
...
up to thousands of files
%: cat /root/a/b/c/2.txt
s1.c,3,4,5
s2.c,1,2,3,4,5....
...
up to thousands of files
Now my requirement is that, given a child file and a line number, I need to return all the parent files that contain that child file and line number, within a minute. The insertion needs to be completed within a day.
I created a relational database with the following schema:
ParentChildMapping - Contains the required relation
ID AUTOINCREMENT PRIMARY KEY
ParentFileName TEXT
ChildFileName TEXT
LNumber INT
For a given file name and line number:
SELECT ParentFileName FROM ParentChildMapping WHERE ChildFileName='s1.c' AND LNumber=1;
I divided the grandparent file into 1000 separate sets, each containing 1000 records. Then I have a Python program which parses each set, reads the contents of each parent file, and inserts the rows into the database. I can create a thousand processes running in parallel to insert all the records, but I am not sure what the impact on the relational database will be, since I will be inserting millions of records in parallel. Also, I am not sure if a relational database is the right approach here. Could you please let me know if there is a tool or technology that better suits this problem? I started with SQLite, but it did not support concurrent inserts and failed with a database-lock error. Now I want to try MySQL or any other solution that suits the situation.
Sample code that runs as a thousand parallel processes to insert into MySQL:
import MySQLdb

connection = MySQLdb.connect(host, username, ...)  # connection details elided
cursor = connection.cursor()
with open(some_set) as fd:
    for each_parent_file in fd:
        each_parent_file = each_parent_file.strip()  # drop the trailing newline
        with open(each_parent_file) as parent_fd:
            for each_line in parent_fd:
                child_file_name, *line_numbers = each_line.strip().split(",")
                insert_items = [(each_parent_file, child_file_name, line_num)
                                for line_num in line_numbers]
                # Parameterized query: let the driver handle the quoting.
                cursor.executemany(
                    "INSERT INTO ParentChildMapping"
                    " (ParentFileName, ChildFileName, LNumber)"
                    " VALUES (%s, %s, %s)",
                    insert_items)
connection.commit()  # commit() is on the connection, not the cursor
cursor.close()
connection.close()
Let's start with a naïve idea of what a database would need to do to organize your data.
You have a million parent files.
Each one contains thousands of child files. Let's say 10,000.
Each one contains a list of line numbers. You didn't say how many. Let's say 100.
This is 10^6 * 10^4 * 10^2 = 10^12 records. Suppose that each is 50 bytes. That's 50 terabytes of data. We need it organized somehow, so we sort it, which requires on the order of log_2(10^12) ≈ 40 passes. This naïve approach therefore needs to move about 40 * 50 TB = 2 * 10^15 bytes of data. If we want to do this in a day of 86,400 seconds, we need to process roughly 23 GB of data per second.
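As a quick sanity check on those numbers (the 50-byte record and the pass count are the assumptions just stated):

import math

records = 10**6 * 10**4 * 10**2          # parents * child files * line numbers
record_bytes = 50                        # assumed size of one record
data = records * record_bytes            # 5e13 bytes = 50 TB
passes = math.ceil(math.log2(records))   # ~40 sort passes
throughput = data * passes / 86_400      # bytes/second to finish in a day
print(f"{throughput / 1e9:.0f} GB/s")    # ~23 GB/s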
Your hard drive probably doesn't have 50 terabytes of space. Even if it did, it probably doesn't stream data faster than about 500 MB/second, which is 50 times too slow.
Can we improve on this? Well, of course. Probably half the passes can happen strictly in memory. You can replace records with 12-byte tuples. There are various ways to compress this data. But the usual "bulk insert data, create index" is NOT going to give you the desired performance on a standard relational database.
Congratulations. When people talk about #bigdata, they usually have small data. But you actually have enough that it matters.
So...what can you do?
First, what can you do with out-of-the-box tools?
If one computer doesn't have the horsepower, we need something distributed. We need a distributed key/value store like Cassandra, and something like Hadoop or Spark to process the data.
If we have those, all we need to do is process the files and load them into Cassandra as records of line numbers keyed by parent+child file. We then do a map-reduce to find, by child file and line number, which parent files contain it, and store that back into Cassandra. We then get answers by querying Cassandra.
BUT keep in mind the back-of-the-envelope figures for the amount of data and processing required. This approach allows us, with some overhead, to do all of that in a distributed way: that much work done and that much data stored in a fixed amount of time. However, you will also need that many machines to do it on, which you can easily rent from AWS, but you'll wind up paying for them as well.
OK, suppose you're willing to build a custom solution. Can you do something more efficient, and maybe run it on one machine? After all, your original data set fits on one machine, right?
Yes, but it will also take some development.
First, let's make the data more efficient. An obvious step is to create lookup tables from file names to integer IDs. You already have the parent files in a list; this just requires inserting a million records into something like RocksDB for the forward lookup, and the same for the reverse. You can also generate a list of all child filenames (with repetition), then use Unix sort -u to get the canonical ones, and build a similar child-file lookup the same way.
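A sketch of building the forward and reverse parent lookups, with plain dicts standing in for RocksDB (input.txt is the grandparent file from the question):

parent_to_id = {}   # forward lookup: path -> dense integer ID
id_to_parent = []   # reverse lookup: ID -> path

with open("input.txt") as fd:
    for line in fd:
        path = line.strip()
        if path and path not in parent_to_id:
            parent_to_id[path] = len(id_to_parent)
            id_to_parent.append(path)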
Next, the reason why we were generating so much data before is that we were taking a line like:
s1.c,1,2,3,4,5
and were turning it into:
s1.c,1,/root/a/b/c/1.txt
s1.c,2,/root/a/b/c/1.txt
s1.c,3,/root/a/b/c/1.txt
s1.c,4,/root/a/b/c/1.txt
s1.c,5,/root/a/b/c/1.txt
But if we turn s1.c into a number like 42, and /root/a/b/c/1.txt into 1, then we can turn this into something like this:
42,1,1,5
Meaning that child file 42, in parent file 1, starts on line 1 and ends on line 5. If we use, say, 4 bytes for each field, then this is a 16-byte block, and we generate just a few per line. Let's say an average of 2 (many lines will produce one, others multiple such blocks). So our whole data set is 20 billion 16-byte rows, for 320 GB of data. Sorting this takes about 34 passes, most of which don't need to be written to disk, which can easily fit inside a day on a single computer. (What you do is sort 1.6 GB blocks in memory, write them back to disk, then get the final result in 8 merge passes.)
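A sketch of that fixed-width encoding, using the 4-bytes-per-field assumption from above:

import struct

# One block: child file ID, parent file ID, first line, last line.
RECORD = struct.Struct("<IIII")   # four unsigned 32-bit ints = 16 bytes
assert RECORD.size == 16

def encode(child_id, parent_id, first_line, last_line):
    return RECORD.pack(child_id, parent_id, first_line, last_line)

# "42,1,1,5": child file 42 in parent file 1, lines 1 through 5.
block = encode(42, 1, 1, 5)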
And once you have that sorted file, you can NOW just write out offsets to where each child file's records start.
If each child file is in thousands of parent files, then decoding this is a question of doing a lookup from filename to child-file ID, then a lookup from child-file ID to the range of sorted records that list that child file. Go through those thousands of records and build a list of the parent files whose line-number range contains the requested line. Now look up their names and return the result. This lookup should run in seconds, and (since everything is read-only) can run in parallel with other lookups.
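Roughly what the query side looks like, assuming the sorted records are available as in-memory tuples (in practice you would binary-search the sorted file through the offset index):

import bisect

# records: list of (child_id, parent_id, first_line, last_line), sorted by child_id.
def parents_containing(records, child_id, line_number):
    lo = bisect.bisect_left(records, (child_id,))       # first record for this child
    hi = bisect.bisect_left(records, (child_id + 1,))   # one past the last
    return [parent for (_, parent, first, last) in records[lo:hi]
            if first <= line_number <= last]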
BUT this is a substantial amount of software to write. It is how I would go. But if the system only needs to be used a few times, or if you have additional needs, the naïve distributed solution may well be cost-effective.
Background
We have 2 streams, let's call them A and B.
They produce elements a and b respectively.
Stream A produces elements at a slow rate (one every minute).
Stream B receives a single element once every 2 weeks. It uses a flatMap function which receives this element and generates ~2 million b elements in a loop:
(Java)
for (BElement value : valuesList) {
    out.collect(updatedTileMapVersion);
}
The valuesList here contains ~2 million b elements.
We connect those streams (A and B) using connect, key by some key and perform another flatMap on the connected stream:
streamA.connect(streamB).keyBy(AClass::someKey, BClass::someKey).flatMap(processConnectedStreams)
Each of the b elements has a different key, meaning there are ~2 million keys coming from the B stream.
The Problem
What we see is starvation: even though there are a elements ready to be processed, they are not processed by processConnectedStreams.
Our attempts to solve the issue
We tried to throttle stream B to 10 elements per second by performing a Thread.sleep() every 10 elements:
long totalSent = 0;
for (BElement value : valuesList) {
    totalSent++;
    out.collect(updatedTileMapVersion);
    if (totalSent % 10 == 0) {
        Thread.sleep(1000);
    }
}
The processConnectedStreams is simulated to take 1 second with another Thread.sleep() and we have tried it with:
* Setting a parallelism of 10 for the whole pipeline - didn't work
* Setting a parallelism of 15 for the whole pipeline - did work
The question
We don't want to use all these resources, since stream B is activated very rarely, and high parallelism is overkill for the stream A elements.
Is it possible to solve this without setting the parallelism higher than the number of b elements we send every second?
It would be useful if you shared the complete workflow topology. For example, you don't mention doing any keying or random partitioning of the data. If that's really the case, then Flink is going to pipeline multiple operations in one task, which can (depending on the topology) lead to the problem you're seeing.
If that's the case, then forcing partitioning prior to the processConnectedStreams can help, as then that operation will be reading from network buffers.
I have an FDQuery on a form, and an action that enables and disables components according to its RecordCount (enabled when > 0).
Most of the time the RecordCount property returns the actual count of records in the query. Sometimes RecordCount returns a negative value, even though I can see records in a grid associated with the query.
So far, RecordCount has returned values between -5 and -1.
How can I solve this? Why does it return negative values?
why does it return negative values?
That is not FireDAC-specific; that is normal behavior for all Delphi libraries providing the TDataSet interface.
TDataSet.RecordCount was never guaranteed to work; the documentation even says so. It is a bonus functionality: sometimes it may work, other times it will not. It is only expected to work reliably for non-SQL table data like CSV, DBF, Paradox tables and other similar ISAM engines.
How can I solve this?
Relying on this function in the modern world is always a risky endeavor.
So you had better design your program so that this function is never used, or used only in a few very specific scenarios.
Instead, you should understand what question your program is really asking of the library, and then find a "language" for that question by calling other functions better tailored to your case.
Now tell me: when you search in Google, how often do you read through to the 10th page of results, or the 100th? Almost never, right? Your program's users would likewise almost never scroll a data grid really far down. Keep this in mind.
You always need to show users the first rows of data, and to do it fast. But rarely the last ones.
Now, three examples.
1) You read data from some remote server over a slow connection. You can only read 100 rows per second. Your grid has room to show the first 20 rows; the rest the user has to scroll to. In total, the query can filter 10,000 rows for you.
If you just show those 20 rows to the user, it works almost instantly: only 0.2 seconds pass between starting to read the data and presenting the filled grid to the user. The rest of the data would only be fetched if the user requested it by scrolling (I am simplifying a bit for clarity; I know about pre-caching and TDataSet.Buffers).
So if you call RecordCount, what does your program do? It downloads ALL the records into local memory and counts them there. At that speed it would take 10,000 / 100 = 100 seconds, more than a minute and a half.
Just by calling RecordCount you have effectively called FetchAll and made your program respond to the user not in 0.2 seconds but in 1m40s instead.
The user would get very nervous waiting for that to finish.
2) Imagine you are fetching the data from some stored procedure, or from a table into which another application is inserting rows. In other words, this is not static read-only data; this is live data being generated while you are downloading it.
So, how many rows are there? At this moment it is, for example, 1000 rows; in a second it will be 1010 rows; in two seconds maybe 1050, and so forth.
What is the One True Value when the value keeps changing every now and then?
Okay, you called RecordCount, you SetLength-ed your array to 1000, and now you read all the data from your query. But it takes time to download the data; it is usually fast, but never instantaneous. Say it took you one second to download those 1000 rows into your array (or grid). But while you were doing that, 10 more rows were generated/inserted, your query has not hit EOF yet, and you keep fetching rows #1001, #1002, ... #1010, putting them into array/grid rows that just do not exist!
Would that be good?
Or would you cancel your query when you ran past the array/grid boundaries?
That way you would not get an Access Violation.
But you would have those most recent 10 rows ignored and missed.
Is that good?
3) Your query, when you debug it, returns 10,000 rows. You download them all into your program's local memory by calling RecordCount, it works like a charm, and you deploy it.
Your client uses your program, the data grows, and one day your query returns not 10,000 rows but 10,000,000.
Your program calls RecordCount to download all those rows; it fetches, for example, 9,000,000 of them...
...and then it crashes with an Out Of Memory error.
enable and disable components according to its recordCount > 0
That is the wrong approach: fetching data you never actually need (the exact number of rows), then discarding it. The examples above show how that makes your program fragile and slow.
All you really want to know is whether there are any rows at all.
You do not need to count all the rows and learn their exact number; you only wonder whether the query is empty or not.
And that is exactly what you should ask, by calling the TDataSet.IsEmpty function instead of RecordCount.
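The same principle at the SQL level, sketched in Python with SQLite (the table and column are made up for illustration; on the Delphi side the fix is simply TDataSet.IsEmpty):

import sqlite3

conn = sqlite3.connect(":memory:")                   # throwaway database for the sketch
conn.execute("CREATE TABLE orders (status TEXT)")    # hypothetical table

# Wasteful: forces the engine to locate every matching row just to count them.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'open'").fetchone()

# Better: stops at the first matching row, which is all we wanted to know.
(has_rows,) = conn.execute(
    "SELECT EXISTS (SELECT 1 FROM orders WHERE status = 'open')").fetchone()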
I've tried to find a solution to this problem twice before, but unfortunately those answers didn't provide a permanent fix, so here I am, giving it another try.
I have a SQL Server stored procedure that returns a list of 1.5 million integer IDs. I am calling this SP from ASP.NET/VB.NET code and executing a SqlDataReader:
m_dbSel.CommandType = CommandType.StoredProcedure
m_dbSel.CommandText = CstSearch.SQL.SP_RS_SEARCH_EX
oResult = m_dbSel.ExecuteReader(CommandBehavior.CloseConnection)
Then I am passing that reader to a class constructor to build a Generic.List(Of Integer). The code is very basic:
Public Sub New(i_oDataReader As Data.SqlClient.SqlDataReader)
    m_aFullIDList = New Generic.List(Of Integer)
    While i_oDataReader.Read
        m_aFullIDList.Add(i_oDataReader.GetInt32(0))
    End While
    m_iTotalNumberOfRecords = m_aFullIDList.Count
End Sub
The problem is that this doesn't read all 1.5 million records; the final count is inconsistent and could be 500K, 1 million, etc. (most often the "magic" number of 524289 records is returned). I've tried using the CommandBehavior.SequentialAccess setting when executing the command, but the results turned out to be inconsistent as well.
When I run the SP in SSMS, it returns a certain number of records almost right away and displays them, but then continues to run for a few more seconds until all 1.5 million records are done. Does that have anything to do with this?
UPDATE
After a while I found that on very rare occasions the loop code above does throw an exception:
System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Data.SqlClient.SqlDataReader.ReadColumnHeader(Int32 i)
So some internal glitch does happen. Also it looks like if I replace
While i_oDataReader.Read
    m_aFullIDList.Add(i_oDataReader.GetInt32(0))
End While
that deals in Integers with
While i_oDataReader.Read
    m_aFullIDList.Add(i_oDataReader(0))
End While
that deals in Objects, the code seems to run without a glitch and returns all records.
Go figure.
Basically, as we've flogged out in the comments(*), the problem isn't with the SqlDataReader, the stored procedure, or SQL at all. Rather, your List.Add is failing because it cannot allocate the additional memory for 2^(n+1) items to extend the List and copy your existing 2^n items into. Most of the time your n = 19 (so 524289 items), but sometimes it could be higher.
There are three basic things that you could do about this:
Pre-Allocate: As you've discovered, by pre-allocating you should be able to get anywhere from 1.5 to 3 times as many items. This works best if you know ahead of time how many items you'll have, so I'd recommend either executing a SELECT COUNT(*) ahead of time, or adding a COUNT(*) OVER(PARTITION BY 1) column and picking it out of the first row returned to pre-allocate the List. The problem with this approach is that you're still pretty close to your limit and could easily run out of memory in the near future...
Re-Configure: Right now you are only getting at most 2^22 bytes of memory for this, when in theory you should be able to get around 2^29-2^30. That means that something on your machine is preventing you from extending your writable Virtual Memory limit that high. Likely causes include the size of your pagefile and competition from other processes (but there are other possibilities). Fix that and you should have more than enough headroom for this.
Streaming: Do you really need all 1.5 million items in memory at the same time? If not and you can determine which you don't need (or extract the info that you do need) on the fly, then you can solve this problem the same way that SqlDataReader does, with streaming. Just read a row, use it, then lose it and go on to the next row.
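The streaming idea, sketched in Python with a generic DB-API cursor standing in for the SqlDataReader (the names here are illustrative, not your actual API):

def process_ids(cursor, handle):
    """Consume each ID as it arrives instead of buffering 1.5M of them."""
    total = 0
    for (record_id,) in cursor:   # the driver fetches rows incrementally
        handle(record_id)         # do the per-row work, then forget the row
        total += 1
    return total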
Hopefully this helps.
(* -- Thanks, obviously, to @granadaCoder and @MartinSmith)
If you really think that the problem rests solely with the List data structure (and not that you are simply running out of memory), then there are some other ways to work around the List structure's allocation behavior. One way would be to implement an alternative list class (as IList(Of Integer)).
Through the interface it would appear the same as a List, but internally it would have a different allocation scheme: storing the data in a nested List(Of List(Of Integer)). Every 1000 items, it would create a new List(Of Integer), add it to the parent nested list, and then use it to add the next 1000 items.
The reason I didn't suggest this before is that, like pre-allocation, it may let you get closer to your memory limit, but, if that's the problem, you are still going to run out eventually (just as with pre-allocating) because this limit is too close to the actual number of items that you need (1.5 million).
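The chunked-allocation idea, sketched in Python for brevity (the real class would implement IList(Of Integer)):

class ChunkedList:
    """Grows in fixed-size chunks, so no single allocation ever has to double."""
    CHUNK = 1000

    def __init__(self):
        self._chunks = [[]]

    def add(self, value):
        if len(self._chunks[-1]) >= self.CHUNK:
            self._chunks.append([])   # small new allocation, no big copy
        self._chunks[-1].append(value)

    def __getitem__(self, i):
        return self._chunks[i // self.CHUNK][i % self.CHUNK]

    def __len__(self):
        return (len(self._chunks) - 1) * self.CHUNK + len(self._chunks[-1])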
Basically, you read all the records from the SqlDataReader produced by your SELECT query. I suggest you add an ORDER BY to that query, so that all the records are sorted in ascending order and arrive in ascending order in the SqlDataReader.
I also faced this problem in my last project. I had to read more than 2 million records from the database, each with a unique serialNo, but the records did not arrive in sequence: after 1000 records it jumped to a much later record (the 21,00,263th), and the records came back in the wrong order.
Then I added ORDER BY serialNo to the query and my problem was solved. You don't need to do anything extra; just put an ORDER BY in your SELECT query and it will work for you.
I hope this helps.
I need to store the number of plays for every second of a podcast / audio file. This will result in a simple timeline graph (like the "hits" graph in Google Analytics) with seconds on the x-axis and plays on the y-axis.
However, these podcasts could potentially go on for up to 3 hours, and 100,000 plays for each second is not unrealistic. That's 10,800 seconds with up to 100,000 plays each. Obviously, storing each individual play in its own row is unrealistic (it would result in 1+ billion rows), as I want to be able to fetch this raw data fast.
So my question is: how do I best go about storing these massive amounts of timeline data?
One idea I had was to use a text/blob column and then comma-separate the plays, each comma representing a new second (in sequence) and then the number for the amount of times that second has been played. So if there's 100,000 plays in second 1 and 90,000 plays in second 2 and 95,000 plays in second 3, then I would store it like this: "100000,90000,95000,[...]" in the text/blob column.
Is this a feasible way to store such data? Is there a better way?
Thanks!
Edit: the data is being tracked from another source, and I only need to update the raw graph data every 15 minutes or so. Hence, fast reads are the main concern.
Note: due to nature of this project, each played second will have to be tracked individually (in other words, I can't just track 'start' and 'end' of each play).
The problem with blob storage is that you need to update the entire blob for every change. This is not necessarily a bad thing. Using your format, each second costs about 7 bytes (up to 6 digits plus a comma), so 7 * 3600 * 3 = ~75K bytes. But it means you're rewriting that 75K blob for every play of every second.
And, of course, the blob is opaque to SQL, so "which second of which song has the most plays" will be an impossible query at the SQL level (answering it basically means a table scan of all the data).
And there's a lot of parsing overhead marshalling that data in and out.
On the other hand: podcast ID (4 bytes), second offset (2 bytes unsigned, allowing podcasts up to 18 hours long), play count (4 bytes) = 10 bytes per second. So, minus any blocking overhead, a 3-hour podcast is 3600 * 3 * 10 = 108K bytes per podcast.
If you stored it as a binary blob rather than text (a block of longs), it's 4 * 3600 * 3 = 43K.
So the second-per-row structure is "only" about twice the size (in a perfect world; consult your DB server for details) of a binary blob. Considering the extra benefits this grants you in terms of being able to query things, it's probably worth doing.
The only downside of second-per-row is that if you need to do a lot of updates (several seconds at once for one song), that's a lot of UPDATE traffic to the DB, whereas with the blob method that's likely a single update.
Your traffic patterns will influence that more than anything.
Would it be problematic to use one row per second, with how many plays on a per-second basis?
That means ~10K rows, which isn't bad, and you just have to INSERT a row every second with the current data.
EDIT: I would say that this solution is better than doing something comma-separated in a TEXT column... especially since getting at and manipulating the data (which you say you want to do) would be very messy.
I would view it as a key-value problem.
for each second played
    Song[second] += 1
end
As a relational database -
song
----
name | second | plays
And a hack of pseudo-SQL to start a second:
insert into song(name, second, plays) values("xyz", "abc", 0)
and another to update the second
update song set plays = plays + 1 where name = xyz and second = abc
A 3-hour podcast would have 11K rows.
It really depends on what is generating the data...
As I understand it, you want to implement a map with the key being the second mark and the value being the number of plays.
What are the pieces in the event, unit of work, or transaction you are loading?
Can I assume you have a play event with the podcast name and the start and stop times,
and you want to load that into the map for analysis and presentation?
If that's the case, you can have a table:
podcastId
secondOffset
playCount
Each event would do an update of the rows between the start and ending position:
update t
set playCount = playCount +1
where podCastId = x
and secondOffset between y and z
and then follow with an insert to add the rows between start and stop that don't exist yet, with a playCount of 1, unless you preload the table with zeros.
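A sketch of that update-then-insert step in Python with SQLite (the table name plays is made up, and a UNIQUE(podCastId, secondOffset) constraint is assumed; this variant preloads missing seconds with zero and then does a single range update):

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway database for the sketch
conn.execute("CREATE TABLE plays (podCastId INTEGER, secondOffset INTEGER, "
             "playCount INTEGER, UNIQUE (podCastId, secondOffset))")

def record_play(conn, podcast_id, start, stop):
    # Create any missing seconds with a zero count (no-op where a row exists)...
    conn.executemany(
        "INSERT OR IGNORE INTO plays (podCastId, secondOffset, playCount) "
        "VALUES (?, ?, 0)",
        [(podcast_id, s) for s in range(start, stop + 1)])
    # ...then a single range UPDATE bumps every second in the played interval.
    conn.execute(
        "UPDATE plays SET playCount = playCount + 1 "
        "WHERE podCastId = ? AND secondOffset BETWEEN ? AND ?",
        (podcast_id, start, stop))
    conn.commit()

record_play(conn, podcast_id=1, start=10, stop=14)   # one play of seconds 10-14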
Depending on the DB, you may be able to set up a sparse table where empty columns are not stored, making it more efficient.