My question is not on the query language but on the physical distribution of data in a graph database.
Let's assume a simple user/friendship model. In RDBs you would create a table storing IDUserA/IDUserB for a representation of a friendship.
If we assume a bunch of IT-Girls, each with, for example, the Facebook limit of 5k friends, we quickly get to huge amounts of data. If GirlA (ID 1) simply likes GirlB (ID 2), that would be an entry with [1][2] in the table.
With this model it is not possible to avoid data redundancy in friendships, because we either have to do two queries (is there an entry in IDUserA or an entry in IDUserB with ID = 1, which means physically searching both columns) or store both [1][2] and [2][1], which ends up as data redundancy. For a heavy user this means checks against 5000/10000 entries in an indexed column, which is astronomically big.
So OK, use graph DBs. We treat the girls as nodes. GirlA is the first one ever entered into the DB, so her ID is simply 0. The entry starts with a one-byte isUsed flag for the data chunk, which is 1 if the chunk is in use. The next 4 bytes identify the file in which her node is stored (which allows nearly 4.3 billion possible files), and if we assume a file size of 16.7 MB we can use 3 more bytes to declare the offset inside the file.
Let's assume we define the username datatype as a chunk of 256 bytes (and are, for the example, that rigid).
For GirlA it is [1]0.0.0.0-0.0.0
= Her User ID 0 times 256 = 0
For GirlB it is [1]0.0.0.0-0.1.0
= Her User ID 1 times 256 = 256,
so her username data starts in file 0_0_0_0.dat at offset 256 from the start. We don't have to search for her data; we can simply calculate where it is. User 100 would be stored in the same file at offset 25600, and so forth. A 16.7 MB file holds 65536 records of 256 bytes, so user 65536 would be the first one stored in file 0_0_0_1.dat, at offset 0. Loaded in RAM this is only a pointer and pretty fast.
So with this method we could store more nodes than humans have ever lived.
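To make the address calculation concrete, here is a minimal Python sketch. The function name locate and the constants are my own; it only assumes the layout described above (256-byte records, 16.7 MB files addressed by a 3-byte offset, file names built from the 4 bytes of the file number).

```python
RECORD_SIZE = 256                              # bytes per node record
FILE_SIZE = 1 << 24                            # 16,777,216 bytes, addressable with a 3-byte offset
RECORDS_PER_FILE = FILE_SIZE // RECORD_SIZE    # 65,536 records fit in one file


def locate(user_id):
    """Return (file_name, byte_offset) for a user ID, purely by arithmetic."""
    file_no, slot = divmod(user_id, RECORDS_PER_FILE)
    # Split the 4-byte file number into its bytes to build the a_b_c_d.dat name.
    parts = file_no.to_bytes(4, "big")
    file_name = "_".join(str(b) for b in parts) + ".dat"
    return file_name, slot * RECORD_SIZE


print(locate(1))     # GirlB
print(locate(100))
```

No lookup structure is involved: the ID alone determines the file and the offset, which is why this part is fast. It says nothing yet about how the edges are stored.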
BUT: how do we find relationships? OK, with edges. But how do we store them? All in one "column" is stupid, because then we are back at the relational model. In a hash table? OK, we could store 0_0_0_0.frds as a hash table containing all friends of User0, create a new instance of a User class object, add the friends to a binary list or tree reachable through the pointer cUser.pFriendlist, and we would be done. But I think I am making a mistake.
Shouldn't graph databases be something different from mathematical nodes connected by hash tables filled with edges?
The use of nodes and edges is clear, because it allows connecting anything to anything through relationships. But what about the queries and their speed?
Keeping different edges in different types of files seems somehow wrong, even if access is really fast on SSDs.
Sure, I could use a simple relational table to store edge type/file extension pairs, but please help me: where am I getting it wrong?
I have a text file (call it the grand parent file) which contains 1 million lines. Each of these lines contains the absolute path of some other file (call them parents), as shown below. The paths of the parent files are unique.
%: cat input.txt - grand parent file
/root/a/b/c/1.txt -- parent file1
/root/a/b/c/2.txt -- parent file2 ......
...
/root/a/b/d/3.txt
......
.....
upto 1 million files.
Each of the above parent files in turn contains absolute paths of different files (call them children) and their line numbers, as shown below. The same child file may be present in multiple parent files with the same or different line numbers.
%: cat /root/a/b/c/1.txt -- parent file
s1.c,1,2,3,4,5 -- child file and its line numbers
s2.c,1,2,3,4,5....
...
upto thousands of files
%: cat /root/a/b/c/2.txt
s1.c,3,4,5
s2.c,1,2,3,4,5....
...
upto thousands of files
Now my requirement is that, given a child file and a line number, I need to return all the parent files that contain the given child file and line number, within a minute. The insertion needs to be completed within a day.
I created a relational database with following schema:
ParentChildMapping - Contains the required relation
ID AUTOINCREMENT PRIMARY KEY
ParentFileName TEXT
ChildFileName TEXT
LNumber INT
For a given file name and line number:
SELECT ParentFileName FROM ParentChildMapping WHERE ChildFileName = 's1.c' AND LNumber = 1;
I divided the grand parent file into 1000 separate sets, each containing 1000 records. Then I have a Python program which parses each set, reads the content of each parent file and inserts it into the database. I can create a thousand processes and insert all the records in parallel, but I am not sure what the impact on the relational database will be, as I will be inserting millions of records in parallel. I am also not sure whether a relational database is the right approach to choose here. Could you please let me know if there is any tool or technology that better suits this problem? I started with SQLite, but it did not support concurrent inserts and failed with a database lock error. Now I want to try MySQL or any alternative solution that suits the situation.
Sample Code that runs as thousand processes in parallel to insert into MySQL:
import MySQLdb
connection = MySQLdb.connect(host, username,...)
cursor = connection.cursor()
with open(some_set) as fd:
    for each_parent_file in fd:
        each_parent_file = each_parent_file.strip()  # drop the trailing newline from the path
        with open(each_parent_file) as parent_fd:
            for each_line in parent_fd:
                child_file_name, *line_numbers = each_line.strip().split(",")
                insert_items = [(each_parent_file, child_file_name, int(line_num)) for line_num in line_numbers]
                # one parameterized INSERT, executed once per (parent, child, line) tuple
                cursor.executemany("INSERT INTO ParentChildMapping (ParentFileName, ChildFileName, LNumber) VALUES (%s, %s, %s)", insert_items)
connection.commit()  # commit lives on the connection, not on the cursor
cursor.close()
connection.close()
Let's start with a naïve idea of what a database would need to do to organize your data.
You have a million parent files.
Each one contains thousands of child files. Let's say 10,000.
Each one contains a list of line numbers. You didn't say how many. Let's say 100.
This is 10^6 * 10^4 * 10^2 = 10^12 records. Suppose that each is 50 bytes. This is 50 terabytes of data. We need it organized somehow, so we sort it. This requires on the order of log_2(10^12) passes, which is around 40. So this naïve approach needs to move about 40 * 50 TB = 2 * 10^15 bytes of data. If we do this in a day with 86400 seconds, we need to process 23 GB of data per second.
Your hard drive probably doesn't have 50 terabytes of space. Even if it did, it probably doesn't stream data faster than about 500 MB/second, which is 50 times too slow.
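The back-of-the-envelope numbers above are easy to reproduce. This is only a sketch of the same estimate; the 10,000 children, 100 line numbers and 50 bytes per record are the guesses made above, not measured values.

```python
import math

parents = 10**6               # lines in the grand parent file
children_per_parent = 10**4   # guess from above
lines_per_child = 10**2       # guess from above
bytes_per_record = 50         # guess from above

records = parents * children_per_parent * lines_per_child
data_bytes = records * bytes_per_record
passes = math.ceil(math.log2(records))     # sort passes, about log2(n)
total_bytes = data_bytes * passes          # bytes moved by the naive sort
required_rate = total_bytes / 86_400       # bytes per second to finish in a day
disk_rate = 500 * 10**6                    # ~500 MB/s streaming

print("records:       %.1e" % records)
print("data size:     %.0f TB" % (data_bytes / 1e12))
print("sort passes:   %d" % passes)
print("required rate: %.0f GB/s" % (required_rate / 1e9))
print("disk is about %.0fx too slow" % (required_rate / disk_rate))
```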
Can we improve this? Well, of course. Probably half the passes can happen strictly in memory. You can replace records with 12 byte tuples. There are various ways to compress this data. But the usual "bulk insert data, create index" is NOT going to give you the desired performance on a standard relational database approach.
Congratulations. When people talk about #bigdata, they usually have small data. But you actually have enough that it matters.
So...what can you do?
First what can you do with out of the box tools?
If one computer doesn't have horsepower, we need something distributed. We need a distributed key/value store like Cassandra. We'll need something like Hadoop or Spark to process data.
If we have those, all we need to do is process the files and load them into Cassandra as records of line numbers, keyed by parent + child file. We then do a map-reduce to find, for each child + line number, which parent files have it, and store that back into Cassandra. We then get answers by querying Cassandra.
BUT keep in mind the back of the envelope about the amount of data and processing required. This approach allows us, with some overhead, to do all of that in a distributed way. This allows us to do that much work and store that much data in a fixed amount of time. However you will also need that many machines to do it on. Which you can easily rent from AWS, but you'll wind up paying for them as well.
OK, suppose you're willing to build a custom solution, can you do something more efficient? And maybe run it on one machine? After all your original data set fits on one machine, right?
Yes, but it will also take some development.
First, let's make the data more efficient. An obvious step is to create lookup tables for file names to indexes. You already have the parent files in a list, this just requires inserting a million records into something like RocksDB for the forward lookup, and the same for the reverse. You can also generate a list of all child filenames (with repetition) then use Unix commands to do a sort -u to get canonical ones. Do the same and you get a similar child file lookup.
Next, the reason why we were generating so much data before is that we were taking a line like:
s1.c,1,2,3,4,5
and were turning it into:
s1.c,1,/root/a/b/c/1.txt
s1.c,2,/root/a/b/c/1.txt
s1.c,3,/root/a/b/c/1.txt
s1.c,4,/root/a/b/c/1.txt
s1.c,5,/root/a/b/c/1.txt
But if we turn s1.c into a number like 42, and /root/a/b/c/1.txt into 1, then we can turn this into something like this:
42,1,1,5
Meaning that child file 42, parent file 1 starts on line 1 and ends on line 5. If we use, say, 4 bytes for each field then this is a 16 byte block. And we generate just a few per line. Let's say an average of 2. (A lot of lines will have one, others may have multiple such blocks.) So our whole data is 20 billion 16 byte rows for 320 GB of data. Sorting this takes 34 passes, most of which don't need to be written to disk, which can easily be inside of a day on a single computer. (What you do is sort 1.6 GB blocks in memory, then write them back to disk. Then you can get the final result in 8 merge passes.)
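As a rough illustration of that encoding, the sketch below collapses a list of line numbers into inclusive ranges and packs each range as a 16-byte record. The field order (child, parent, first line, last line) follows the 42,1,1,5 example; the function names and the little-endian unsigned layout are my own assumptions.

```python
import struct

# child_id, parent_id, first_line, last_line: four unsigned 4-byte integers
RECORD = struct.Struct("<4I")


def to_ranges(line_numbers):
    """Collapse line numbers into inclusive (first, last) runs."""
    ranges = []
    for n in sorted(set(line_numbers)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n          # extend the current run
        else:
            ranges.append([n, n])      # start a new run
    return [tuple(r) for r in ranges]


def encode_line(child_id, parent_id, line_numbers):
    """Turn one 's1.c,1,2,3,4,5' style line into packed 16-byte records."""
    return [RECORD.pack(child_id, parent_id, first, last)
            for first, last in to_ranges(line_numbers)]


blocks = encode_line(42, 1, [1, 2, 3, 4, 5])
print(len(blocks), RECORD.unpack(blocks[0]))
```

Because the child ID comes first in each record, sorting the packed tuples groups all records of one child file together, which is what the later lookup relies on.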
And once you have that sorted file, you can NOW just write out the offsets at which each child file's records begin.
If each child file is in thousands of parent files, then decoding this is a question of doing a lookup from filename to child file ID, then a lookup from child file ID to the range of the sorted file that lists that child file. Go through those thousands of records, and form a list of the thousands of parent files that had the line number in their range. Now do the lookup of their names, and return the result. This lookup should run in seconds, and (since everything is read-only) can be done in parallel with other lookups.
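The lookup step could look roughly like this. It is a sketch over an in-memory sorted list of (child_id, parent_id, first_line, last_line) tuples; a real version would read the sorted file through the offset table instead, and the name lookups are left out. parents_for is a hypothetical name.

```python
from bisect import bisect_left


def parents_for(records, child_id, line):
    """Return the parent IDs whose range for child_id contains line.

    records must be sorted tuples (child_id, parent_id, first_line, last_line).
    """
    # (child_id,) sorts before every record of that child, so this is the
    # index of the child's first record.
    i = bisect_left(records, (child_id,))
    found = set()
    while i < len(records) and records[i][0] == child_id:
        _, parent_id, first, last = records[i]
        if first <= line <= last:
            found.add(parent_id)
        i += 1
    return sorted(found)


records = sorted([(42, 1, 1, 5), (42, 2, 3, 5), (43, 1, 1, 5), (41, 9, 1, 1)])
print(parents_for(records, 42, 4))
```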
BUT this is a substantial amount of software to write. It is how I would go. But if the system only needs to be used a few times, or if you have additional needs, the naïve distributed solution may well be cost effective.
How can I do a search based on combinations of around 50 parameters used as filters?
These filters can be price, color, size, brand, etc.
So we can get different pages based on these params.
So one link can have price, brand and size, another one size, brand and color, and so on.
My question is: what is the best practice for querying the database based on these params?
I have one idea: to "encrypt" them into a sequence of 1s and 0s like 101101101 and search by that.
I have more than 2 million possible combinations, and I want to reduce the query time.
I have heard about B-trees but I don't know how to use them. I have given my table columns the proper indexes, but from this point I don't know in which direction I should go, or what my query is going to look like.
I think that it is a good idea to "encrypt" the params, but don't do it like "10100010", because then you'll have to store these values as strings.
Rather, encode the bits as an ordinary integer. That means 100101 = 1*32+0*16+0*8+1*4+0*2+1*1 = 37.
Of course, 50 flags won't fit in a 32-bit INT. They do fit in a 64-bit BIGINT, or you can logically group the parameters and use 2-3 smaller fields for them.
The problem with this approach is querying the data: you would have to write a function that extracts a flag from the number to be able to query by only one parameter rather than all of them.
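A minimal sketch of the encoding and of the "extract one flag" function, in Python. The filter names and the helper names encode and has_flag are hypothetical; only four flags are listed to keep it short.

```python
FLAGS = ["price", "color", "size", "brand"]          # hypothetical filter names
BIT = {name: 1 << i for i, name in enumerate(FLAGS)}  # one bit per filter


def encode(active):
    """Combine the active filter names into a single integer."""
    mask = 0
    for name in active:
        mask |= BIT[name]
    return mask


def has_flag(mask, name):
    """True if the given filter is set in the stored integer."""
    return mask & BIT[name] != 0


mask = encode(["price", "brand"])
print(mask, has_flag(mask, "brand"), has_flag(mask, "color"))
print(0b100101)   # the worked example above
```

Matching all flags at once is a plain equality on the integer and can use a normal index. Testing a single bit with a bitwise AND is the awkward case: as far as I know, such a predicate generally cannot use a B-tree index, so it tends to become a scan.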
I'm trying to find a way to store my data with fast access (better than O(n)).
My database consists of data (4096 byte strings) that represents some information about some items.
The problem is, that the query is never exact. I get one Item, and then need to find the closest match using a function F(a,b).
just an example:
1234
3456
6466
F(a,b) = return % of similar digits
GetClosest(1233,F) = 1234
The problem is that F(a,b) is a complicated algorithm, (not a proper metric).
What I have now is just go over the whole database to search for the best match.
Is there a kind of tree or other cluster database type that can give me faster finding complexity ?
More information:
F gives back a similarity value in %percentage. where 100% is a perfect match.
Sorry, the answer is "probably not" unless there is some more structure to your problem that you haven't described. With 4096 byte strings you're suffering from the curse of dimensionality.
If you had shorter strings and enough data that there was a high likelihood of the nearest match being identical over a large chunk of the string, then you could store your data with multiple tree-like structures indexed over different chunks of the string. With high likelihood the nearest would be close enough that you could prove it was nearest based only on close elements in those trees. However with the size of your strings and the limited data that can be stored in a computer, there is no way this is possibly going to work.
That said, do you need the exact closest, or only a somewhat close one? If only a likely close one, then you could index the data by several random sparse samples of bits. In your search you then only check elements that match the query exactly on at least one of those samples. This will greatly reduce the search space, while rejecting fewer of the close neighbors, and may produce reasonable (even though frequently wrong) answers.
Is there some way you could assign a 'score' to each datum?
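Here is a small sketch of the random-sample idea. It samples character positions rather than individual bits, and every name and parameter (build_index, num_samples, sample_size) is an assumption for illustration. The result is approximate by design: a close neighbor that differs on every sampled set of positions is simply never examined.

```python
import random
from collections import defaultdict


def build_index(items, num_samples=8, sample_size=3, seed=0):
    """Index equal-length strings by several random sets of positions."""
    rng = random.Random(seed)
    length = len(items[0])
    samples = [tuple(sorted(rng.sample(range(length), sample_size)))
               for _ in range(num_samples)]
    tables = [defaultdict(list) for _ in samples]
    for item in items:
        for positions, table in zip(samples, tables):
            table[tuple(item[p] for p in positions)].append(item)
    return samples, tables


def candidates(query, samples, tables):
    """Items that agree with the query on all positions of at least one sample."""
    found = set()
    for positions, table in zip(samples, tables):
        found.update(table.get(tuple(query[p] for p in positions), ()))
    return found


def get_closest(query, samples, tables, similarity):
    """Run the expensive similarity function on the candidates only."""
    cands = candidates(query, samples, tables)
    return max(cands, key=lambda item: similarity(query, item)) if cands else None
```

For 4096-byte strings the sample size and the number of samples would need tuning against real data: larger samples cut the candidate set but miss more true neighbors.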
You could index/sequence the data by your score.
When you search you assign a score to your search criteria, and look for the item with the closest score.
Depends very much on your data and your definition of "difference" whether this will work.
I need to store the number of plays for every second of a podcast / audio file. This will result in a simple timeline graph (like the "hits" graph in Google Analytics) with seconds on the x-axis and plays on the y-axis.
However, these podcasts could potentially go on for up to 3 hours, and 100,000 plays for each second is not unrealistic. That's 10,800 seconds with up to 100,000 plays each. Obviously, storing each played second in its own row is unrealistic (it would result in 1+ billion rows) as I want to be able to fetch this raw data fast.
So my question is: how do I best go about storing these massive amounts of timeline data?
One idea I had was to use a text/blob column and then comma-separate the plays, each comma representing a new second (in sequence) and then the number for the amount of times that second has been played. So if there's 100,000 plays in second 1 and 90,000 plays in second 2 and 95,000 plays in second 3, then I would store it like this: "100000,90000,95000,[...]" in the text/blob column.
Is this a feasible way to store such data? Is there a better way?
Thanks!
Edit: the data is being tracked to another source and I only need to update the raw graph data every 15min or so. Hence, fast reads is the main concern.
Note: due to nature of this project, each played second will have to be tracked individually (in other words, I can't just track 'start' and 'end' of each play).
Problem with the blob storage is you need to update the entire blob for all of the changes. This is not necessarily a bad thing. Using your format: (100000, 90000,...), 7 * 3600 * 3 = ~75K bytes. But that means you're updating that 75K blob for every play for every second.
And, of course, the blob is opaque to SQL, so "what second of what song has the most plays" will be an impossible query at the SQL level (that's basically a table scan of all the data to learn that).
And there's a lot of parsing overhead marshalling that data in and out.
On the other hand: podcast ID (4 bytes), second offset (2 bytes unsigned allows podcasts up to 18 hours long), play count (4 bytes) = 10 bytes per second. So, ignoring any blocking overhead, a 3-hour song is 3600 * 3 * 10 = 108K bytes per song.
If you stored it as a binary blob rather than text (a block of 4-byte integers), that is 4 * 3600 * 3 = 43K.
So the second-per-row structure is "only" about two and a half times the size (in a perfect world; consult your DB server for details) of a binary blob. Considering the extra benefits this grants you in terms of being able to query things, that's probably worth doing.
The only downside of one row per second is that if you need to do a lot of updates (several seconds at once for one song), that's a lot of UPDATE traffic to the DB, whereas with the blob method that's likely a single update.
Your traffic patterns will influence that more than anything.
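For comparison, here is a sketch of the two blob encodings in Python. It assumes 100,000 plays for every second of a 3-hour podcast and a little-endian 4-byte unsigned integer per second; read_second is a hypothetical helper.

```python
import struct

SECONDS = 3 * 3600              # a 3-hour podcast
counts = [100_000] * SECONDS    # assumed: 100,000 plays for every second

# Comma-separated text, as proposed in the question.
text_blob = ",".join(str(c) for c in counts).encode("ascii")

# Fixed-width binary: one 4-byte unsigned integer per second.
binary_blob = struct.pack("<%dI" % SECONDS, *counts)


def read_second(blob, second):
    """Read one second's play count straight from the binary blob."""
    return struct.unpack_from("<I", blob, second * 4)[0]


print(len(text_blob), len(binary_blob), read_second(binary_blob, 5))
```

The fixed-width layout also removes the parsing overhead: any second can be read at offset second * 4 without splitting the whole string.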
Would it be problematic to use one row per second, storing how many plays that second has?
That means about 10K rows, which isn't bad, and you just have to INSERT a row for every second with the current data.
EDIT: I would say that this solution is better than doing a comma-separated something in a TEXT column... especially since getting and manipulating data (which you say you want to do) would be very messy.
I would view it as a key-value problem.
for each second played
Song[second] += 1
end
As a relational database -
song
----
name | second | plays
And a hack pseudo-SQL to start a second:
insert into song(name, second, plays) values('xyz', abc, 0)
and another to update the second:
update song set plays = plays + 1 where name = 'xyz' and second = abc
A 3-hour podcast would have 11K rows.
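A runnable version of the same idea, sketched with Python's built-in sqlite3 module and an in-memory database. The table layout follows the one above; record_play and timeline are hypothetical helper names, and a production database would need its own upsert syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE song (
        name   TEXT    NOT NULL,
        second INTEGER NOT NULL,
        plays  INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (name, second)
    )
""")


def record_play(conn, name, seconds):
    """Add one play to every second in `seconds`, creating rows as needed."""
    rows = [(name, s) for s in seconds]
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR IGNORE INTO song (name, second, plays) VALUES (?, ?, 0)", rows)
        conn.executemany(
            "UPDATE song SET plays = plays + 1 WHERE name = ? AND second = ?", rows)


def timeline(conn, name):
    """Return [(second, plays), ...] ordered by second, ready for graphing."""
    return conn.execute(
        "SELECT second, plays FROM song WHERE name = ? ORDER BY second",
        (name,)).fetchall()


record_play(conn, "xyz", range(0, 3))   # a listener heard seconds 0-2
record_play(conn, "xyz", range(1, 5))   # another heard seconds 1-4
print(timeline(conn, "xyz"))
```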
It really depends on what is generating the data ..
As I understand you want to implement a map with the key being the second mark and the value being the number of plays.
What are the pieces in the event, unit of work, or transaction you are loading?
Can I assume you have a play event along with the podcast name, start and stop times,
and you want to load it into the map for analysis and presentation?
If that's the case you can have a table
podcastId
secondOffset
playCount
Each event would do an update of the rows between the start and end positions:
update t
set playCount = playCount +1
where podCastId = x
and secondOffset between y and z
and then followed by an insert to add those rows between the start and stop that don't exist, with a playcount of 1, unless you preload the table with zeros.
Depending on the DB you may have the ability to set up a sparse table where empty columns are not stored, making it more efficient.
I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every entity of that kind that I put into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.
This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
QED.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
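To illustrate the "random block" idea without tying it to the datastore API, here is a sketch in which a sorted Python list of (random_value, entity) pairs stands in for the index on the randomised property. In the real thing the slice would be a single query with a limit of 100; random_block_sample is a hypothetical name.

```python
import bisect
import random


def random_block_sample(index, n, block=100, rng=random):
    """Pick n entities from one consecutive block of the randomised index.

    index is a sorted list of (random_value, entity) pairs.
    """
    start_value = rng.random()
    # One range scan: the first `block` rows at or after start_value.
    start = bisect.bisect_left(index, (start_value,))
    chunk = index[start:start + block]
    if len(chunk) < block:
        # Ran off the end of the index: wrap around without repeating rows.
        chunk += index[:min(start, block - len(chunk))]
    return [entity for _, entity in rng.sample(chunk, min(n, len(chunk)))]
```

The wrap-around at the end costs a second range scan in the datastore, so this is "one query" only in the common case.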
What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
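Sketched in Python, with a plain dict standing in for the datastore's batch get by key, the approach looks something like this. The names random_entities and max_rounds are mine; the retry loop handles the "holes" mentioned above, at the cost of an extra round trip whenever one is hit.

```python
import random


def random_entities(store, last_id, n, rng=random, max_rounds=10):
    """Fetch up to n distinct random entities by sequential integer ID.

    store is a dict of id -> entity, standing in for a batch get by key;
    last_id is the highest ID handed out so far (what the IdGenerator knows).
    """
    found = {}
    for _ in range(max_rounds):
        missing = n - len(found)
        if missing == 0:
            break
        ids = [rng.randint(1, last_id) for _ in range(missing)]
        # One batch get per round; IDs that were deleted or never written
        # are simply absent and get retried in the next round.
        for i in ids:
            if i in store:
                found[i] = store[i]
    return list(found.values())


# IDs 1..100 with every tenth entity deleted, to simulate holes.
store = {i: "entity-%d" % i for i in range(1, 101) if i % 10}
print(random_entities(store, 100, 5, rng=random.Random(0)))
```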
Looks like the only method is by storing the random integer value in each entity's special property and querying on that. This can be done quite automatically if you just add an automatically initialized property.
Unfortunately this will require processing of all entities once if your datastore is already filled in.
It's weird, I know.
I agree with the answer from Steve: there is no way to retrieve N random rows in one query.
However, even the method of retrieving one single entity does not usually work in such a way that the probability of the returned results is evenly distributed. With the query from the question (the first record > rand), the probability of returning a given entity depends on the gap between its randomly assigned number and the next lower one. E.g. if random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "10" 8 times more often than "2".
I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share.
I just had the same problem. I decided not to assign IDs to my already existing entries in datastore and did this, as I already had the totalcount from a sharded counter.
This selects "count" entries from "totalcount" entries, sorted by key.
import logging
import random
from google.appengine.ext import db
# select $count from the complete set
numberlist = random.sample(range(0, totalcount), count)
numberlist.sort()
pagesize = 1000
# init buckets: one list of in-page offsets per page of `pagesize` keys
buckets = [[] for i in xrange(int(max(numberlist) / pagesize) + 1)]
for k in numberlist:
    thisb = int(k / pagesize)
    buckets[thisb].append(k - (thisb * pagesize))
logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)
# page through results.
result = []
baseq = db.Query(MyEntries, keys_only=True).order("__key__")
wq = baseq  # working query: start paging from the first key
for b, l in enumerate(buckets):
    if len(l) > 0:
        result += [wq.fetch(limit=1, offset=e)[0] for e in l]
    if b < len(buckets) - 1:  # not the last bucket
        # the last key of this page becomes the lower bound of the next page
        lastkey = wq.fetch(1, pagesize - 1)[0]
        wq = baseq.filter("__key__ >", lastkey)
Beware that this is somewhat complex to me, and I'm still not convinced that I don't have off-by-one or off-by-x errors.
And beware that if count is close to totalcount this can be very expensive.
And beware that on millions of rows it might not be possible to do within appengine time boundaries.
If I understand correctly, you need to retrieve N random instances.
It's easy. Just do a keys-only query, then call random.choice N times on the resulting list of keys. Then get the results by fetching on those keys.
import random
keys = MyModel.all(keys_only=True)
n = 5  # 5 random instances
all_keys = list(keys)
result_keys = []
for _ in range(0, n):
    key = random.choice(all_keys)
    all_keys.remove(key)  # remove it so the same key is not picked twice
    result_keys.append(key)
# result_keys now contains 5 random keys.