I'm trying to learn how to divide a file stored in HDFS into splits and read each split from a different process (on different machines).
What I expect is that if I have a SequenceFile containing 1200 records and 12 processes, I would see around 100 records per process. My way of dividing the file is to get the length of the data, divide it by the number of processes to derive the chunk/beg/end for each split, and then pass that split to e.g. SequenceFileRecordReader and retrieve records in a simple while loop. The code is as below.
private InputSplit getSplit(int id) throws IOException {
    ...
    for (FileStatus file : status) {
        long len = file.getLen();
        BlockLocation[] locations =
            fs.getFileBlockLocations(file, 0, len);
        if (0 < len) {
            long chunk = len / n;
            long beg = (id * chunk) + (long) 1;
            long end = (id) * chunk;
            if (n == (id + 1)) end = len;
            return new FileSplit(file, beg, end, locations[locations.length - 1].getHosts());
        }
    }
    ...
}
However, the result shows that the sum of the records counted by each process is different from the number of records stored in the file. What is the right way to divide a SequenceFile into even chunks and distribute them to different hosts?
Thanks.
I can't help but wonder why you are trying to do such a thing. Hadoop automatically splits your files, and 1200 records split into groups of 100 doesn't sound like a lot of data. If you elaborate on what your problem is, someone might be able to help you more directly.
Here are my two ideas:
Option 1: Use Hadoop's automatic splitting behavior
Hadoop automatically splits your files. The number of blocks a file is split up into is the total size of the file divided by the block size. By default, one map task will be assigned to each block (not each file).
In your conf/hdfs-site.xml configuration file, there is a dfs.block.size parameter. Most people set this to 64 or 128 MB. However, if you are trying to do something tiny, like 100 sequence-file records per block, you could set this really low... to, say, 1000 bytes. I've never heard of anyone wanting to do this, but it is an option.
Option 2: Use a MapReduce job to split your data.
Have your job use an "identity mapper" (basically implement Mapper and don't override map). Also, have your job use an "identity reducer" (basically implement Reducer and don't override reduce). Set the number of reducers to the number of splits you want to have. Say you have three sequence files that you want split into a total of 25 files: you would load up those 3 files and set the number of reducers to 25. Records will get randomly sent to each reducer, and what you will end up with is close to 25 equal splits.
This works because the identity mappers and reducers effectively don't do anything, so your records will stay the same. The records get sent to random reducers, and then they get written out, one file per reducer, into part-r-xxxx files. Each of those files will contain your sequence file(s) split into somewhat even chunks.
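To make the effect concrete, here is a minimal plain-Python sketch (not Hadoop API code; the record source and split count are made up) of the redistribution that the identity map / identity reduce job ends up performing:

# Not Hadoop code -- just an illustration of what the identity-map/identity-reduce
# job accomplishes: records get routed to N reducers more or less at random, and
# each reducer then writes one roughly equal output file.
import random

def redistribute(records, n_splits):
    buckets = [[] for _ in range(n_splits)]
    for record in records:
        # The shuffle decides which reducer gets each record; picking a bucket
        # at random gives the same "roughly even, not exact" distribution.
        buckets[random.randrange(n_splits)].append(record)
    return buckets

# e.g. 1200 records into 12 buckets of ~100 each
print([len(b) for b in redistribute(range(1200), 12)])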
Related
I have a text file (call it the grandparent file) which contains 1 million lines. Each of these lines contains the absolute path of some other file (call them parent files), as shown below. The paths of the parent files are unique.
%: cat input.txt - grand parent file
/root/a/b/c/1.txt -- parent file1
/root/a/b/c/2.txt -- parent file2 ......
...
/root/a/b/d/3.txt
......
.....
upto 1 million files.
Again, each of the above parent files contains the absolute paths of different files (call them child files) and their line numbers, as shown below. The same child file may be present in multiple parent files with the same or different line numbers.
%: cat /root/a/b/c/1.txt -- parent file
s1.c,1,2,3,4,5 -- child file and its line numbers
s2.c,1,2,3,4,5....
...
upto thousands of files
%: cat /root/a/b/c/2.txt
s1.c,3,4,5
s2.c,1,2,3,4,5....
...
upto thousands of files
Now my requirement is that, given a child file and line number, I need to return all the parent files that contain that child file and line number, within a minute. The insertion needs to be completed within a day.
I created a relational database with following schema:
ParentChildMapping - Contains the required relation
ID AUTOINCREMENT PRIMARY KEY
ParentFileName TEXT
ChildFileName TEXT
LNumber INT
For a given file name and line number:
SELECT ParentFileName from ParentChildMapping where ChildFileName="s1.c" and LNumber=1;
I divided the grandparent file into 1000 separate sets, each containing 1000 records. Then I have a Python program which parses each set, reads the content of the parent files, and inserts the rows into the database. I can create a thousand processes running in parallel and insert all the records in parallel, but I am not sure what the impact on the relational database will be, as I will be inserting millions of records in parallel. Also, I am not sure if a relational database is the right approach to choose here. Could you please let me know if there is any tool or technology that better suits this problem? I started with SQLite but it did not support concurrent inserts and failed with a database-lock error. Now I want to try MySQL or any other alternative solution that suits the situation.
Sample code that runs as a thousand processes in parallel to insert into MySQL:
import MySQLdb

connection = MySQLdb.connect(host, username, ...)
cursor = connection.cursor()
with open(some_set) as fd:
    for each_parent_file in fd:
        parent_path = each_parent_file.strip()
        with open(parent_path) as parent_fd:
            for each_line in parent_fd:
                child_file_name, *line_numbers = each_line.strip().split(",")
                insert_items = [(parent_path, child_file_name, line_num)
                                for line_num in line_numbers]
                # executemany takes a parameterized query plus a sequence of tuples
                cursor.executemany(
                    "INSERT INTO ParentChildMapping (ParentFileName, ChildFileName, LNumber) "
                    "VALUES (%s, %s, %s)", insert_items)
connection.commit()
cursor.close()
connection.close()
Let's start with a naïve idea of what a database would need to do to organize your data.
You have a million parent files.
Each one contains thousands of child files. Let's say 10,000.
Each one contains a list of line numbers. You didn't say how many. Let's say 100.
This is 10^6 * 10^4 * 10^2 = 10^12 records. Suppose that each is 50 bytes. This is 50 terabytes of data. We need it organized somehow, so we sort it. This requires on the order of log_2(10^12), which is around 40 passes. The naïve approach therefore moves about 2 * 10^15 bytes of data. If we do this in a day with 86,400 seconds, we need to process around 23 GB of data per second.
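Spelled out as a quick sanity check, using the same assumed figures:

# Back-of-the-envelope, with the assumed figures above.
records = 10**6 * 10**4 * 10**2      # parents * children * line numbers = 1e12
bytes_total = records * 50           # ~50 TB of raw records
passes = 40                          # ~log2(1e12) sort passes
bytes_moved = bytes_total * passes   # ~2e15 bytes touched while sorting
per_second = bytes_moved / 86400     # spread over one day
print(bytes_total, bytes_moved, per_second)  # 5e13, 2e15, ~2.3e10 (23 GB/s)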
Your hard drive probably doesn't have 50 terabytes of space. Even if it did, it probably doesn't stream data faster than about 500 MB/second, which is 50 times too slow.
Can we improve this? Well, of course. Probably half the passes can happen strictly in memory. You can replace records with 12 byte tuples. There are various ways to compress this data. But the usual "bulk insert data, create index" is NOT going to give you the desired performance on a standard relational database approach.
Congratulations. When people talk about #bigdata, they usually have small data. But you actually have enough that it matters.
So...what can you do?
First, what can you do with out-of-the-box tools?
If one computer doesn't have the horsepower, we need something distributed. We need a distributed key/value store like Cassandra. We'll need something like Hadoop or Spark to process the data.
If we have those, all we need to do is process the files and load them into Cassandra as records, keyed by parent + child file, of line numbers. We then do a MapReduce to find, by child + line number, which parent files have it, and store that back into Cassandra. We then get answers by querying Cassandra.
BUT keep in mind the back-of-the-envelope estimate of the amount of data and processing required. This approach allows us, with some overhead, to do all of that in a distributed way: that much work and that much data in a fixed amount of time. However, you will also need that many machines to do it on. You can easily rent them from AWS, but you'll wind up paying for them as well.
OK, suppose you're willing to build a custom solution. Can you do something more efficient, and maybe run it on one machine? After all, your original data set fits on one machine, right?
Yes, but it will also take some development.
First, let's make the data more efficient. An obvious step is to create lookup tables from file names to indexes. You already have the parent files in a list; this just requires inserting a million records into something like RocksDB for the forward lookup, and the same for the reverse. You can also generate a list of all child filenames (with repetition), then use Unix commands to do a sort -u to get the canonical ones. Load those the same way and you get a similar child-file lookup.
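A minimal sketch of that lookup-table step, assuming the grandparent file format from the question and plain in-memory dicts (swap in RocksDB, SQLite, or similar if the tables need to live on disk):

# Build forward (name -> id) and reverse (id -> name) lookups for parent files.
parent_id = {}
parent_name = []

with open("input.txt") as grandparent:          # the grandparent file
    for line in grandparent:
        name = line.strip()
        if name and name not in parent_id:
            parent_id[name] = len(parent_name)  # next sequential integer ID
            parent_name.append(name)

# Forward lookup: path -> small integer; reverse lookup: integer -> path.
# Do the same with the deduplicated (sort -u) list of child file names.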
Next, the reason why we were generating so much data before is that we were taking a line like:
s1.c,1,2,3,4,5
and were turning it into:
s1.c,1,/root/a/b/c/1.txt
s1.c,2,/root/a/b/c/1.txt
s1.c,3,/root/a/b/c/1.txt
s1.c,4,/root/a/b/c/1.txt
s1.c,5,/root/a/b/c/1.txt
But if we turn s1.c into a number like 42, and /root/a/b/c/1.txt into 1, then we can turn this into something like this:
42,1,1,5
Meaning that child file 42 appears in parent file 1 starting on line 1 and ending on line 5. If we use, say, 4 bytes for each field, then this is a 16-byte block. And we generate just a few per line. Let's say an average of 2. (A lot of lines will have one; others may have multiple such blocks.) So our whole data set is 20 billion 16-byte rows, for 320 GB of data. Sorting this takes 34 passes, most of which don't need to be written to disk, which can easily be done inside of a day on a single computer. (What you do is sort 1.6 GB blocks in memory, then write them back to disk. Then you can get the final result in 8 merge passes.)
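Here is a sketch of that encoding, assuming the parent-file line format from the question; the exact struct layout and the run-length grouping are illustrative choices, not a prescribed format:

import struct

RECORD = struct.Struct("<IIII")  # child_id, parent_id, start_line, end_line = 16 bytes

def encode_parent_line(line, parent_id, child_id_of):
    """Turn 's1.c,1,2,3,4,5' into packed (child, parent, start, end) blocks."""
    name, *nums = line.strip().split(",")
    nums = [int(n) for n in nums]
    child = child_id_of[name]
    blocks = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:          # still inside a consecutive run of line numbers
            prev = n
            continue
        blocks.append(RECORD.pack(child, parent_id, start, prev))
        start = prev = n
    blocks.append(RECORD.pack(child, parent_id, start, prev))
    return blocks

# encode_parent_line("s1.c,1,2,3,4,5", 1, {"s1.c": 42})
# -> one 16-byte record meaning "child 42, parent 1, lines 1..5"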
And once you have that sorted file, you can NOW just write out offsets to where every file happens.
If each child file is in thousands of parent files, then decoding this is a question of doing a lookup from filename to child file ID, then a lookup from child file ID to the range which has that child file listed. Go through those thousands of records and form a list of the thousands of parent files that had the line number in their range. Now do the lookup of their names, and return the result. This lookup should run in seconds, and (since everything is read-only) can be done in parallel with other lookups.
BUT this is a substantial amount of software to write. It is the way I would go. But if the system only needs to be used a few times, or if you have additional needs, the naïve distributed solution may well be cost-effective.
I'd like to batch process two files with Apache Flink, one after the other.
For a concrete example: suppose I want to assign an index to each line, such that lines from the second file follow the first. Instead of doing so, the following code interleaves lines in the two files:
val env = ExecutionEnvironment.getExecutionEnvironment
val text1 = env.readTextFile("/path/to/file1")
val text2 = env.readTextFile("/path/to/file2")
val union = text1.union(text2).flatMap { ... }
I want to make sure all of text1 is sent through the flatMap operator first, and then all of text2. What is the recommended way to do so?
Thanks in advance for the help.
DataSet.union() does not provide any order guarantees across inputs. Records from the same input partition will remain in order but will be merged with records from the other input.
But there is a more fundamental problem. Flink is a parallel data processor. When processing data in parallel, a global order cannot be preserved. For example, when Flink reads files in parallel, it tries to split these files and process each split independently. The splits are handed out without any particular order. Hence, the records of a single file are already shuffled. You would need to set the parallelism of the whole job to 1 and implement a custom InputFormat to make this work.
You can make that work, but it won't run in parallel, and you would need to tweak many things. I don't think that Flink is the best tool for such a task.
Have you considered using simple unix commandline tools to concatenate your files?
My question is not on the query language but on the physical distribution of data in a graph database.
Let's assume a simple user/friendship model. In RDBs you would create a table storing IDUserA/IDUserB for a representation of a friendship.
If we assume, for example, a bunch of IT girls with the Facebook limit of 5k friends, we quickly get to huge amounts of data. If GirlA (ID 1) simply likes GirlB (ID 2), it would be an entry with [1][2] in the table.
With this model it is not possible to avoid data redundancy for friendships, because we either have to do two queries (is there an entry with ID = 1 in IDUserA or in IDUserB, which means physically searching both columns), or we store both [1][2] and [2][1], which ends up in data redundancy. For a heavy user this means checks against 5,000/10,000 entries in an indexed column, which is astronomically big.
So OK, use graph DBs. We model the girls as nodes. GirlA is the first one ever entered into the DB, so her ID is simply 0. The entry contains a one-byte isUsed flag for the data chunk, which is 1 if the chunk is in use. The next 4 bytes identify the file her node is stored in (which allows nearly 4.3 billion possible files), and if we assume a file size of 16.7 MB we can use 3 more bytes to declare the offset inside that file.
Let's assume we define the username datatype as a chunk of 256 bytes (and be, for the example, that rigid).
For GirlA it is [1]0.0.0.0-0.0.0
= Her User ID 0 times 256 = 0
For GirlB it is [1]0.0.0.0-0.1.0
= Her User ID 1 times 256 = 256,
so her username data starts in file 0_0_0_0.dat at offset 256 from the start. We don't have to search for her data; we can simply calculate where it is. User 100 would be stored in the same file at offset 25,600, and so forth. User 65,536 would be stored in file 0_0_0_1.dat at offset 0. Loaded into RAM this is only a pointer and pretty fast.
So with this method we could store more nodes than humans have ever lived.
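A small sketch of the addressing arithmetic described above, assuming the 256-byte records and roughly 16.7 MB (2^24-byte) files from the example:

RECORD_SIZE = 256                              # one username chunk
FILE_SIZE = 2 ** 24                            # ~16.7 MB per data file
RECORDS_PER_FILE = FILE_SIZE // RECORD_SIZE    # 65,536 nodes per file

def node_location(user_id):
    """Compute (file name, byte offset) for a node ID; no search needed."""
    file_index = user_id // RECORDS_PER_FILE
    offset = (user_id % RECORDS_PER_FILE) * RECORD_SIZE
    # 4-byte file index rendered as the dotted file name from the example
    b = file_index.to_bytes(4, "big")
    return f"{b[0]}_{b[1]}_{b[2]}_{b[3]}.dat", offset

print(node_location(1))       # ('0_0_0_0.dat', 256)
print(node_location(100))     # ('0_0_0_0.dat', 25600)
print(node_location(65536))   # ('0_0_0_1.dat', 0)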
BUT: how do we find relationships? OK, with edges. But how do we store them? All in one "column" is stupid, because then we are back at the relational model. In a hash table? OK, we could store 0_0_0_0.frds as a hash table containing all friends of User 0, kick off a new instance of a User class object, add the friends to a binary list or tree reachable through the pointer cUser.pFriendlist, and we would be done. But I think I am making a mistake somewhere.
Shouldn't graph databases be something different from mathematical nodes connected via hash tables filled with edges?
The use of nodes and edges is clear, because it allows connecting anything to anything with relationships. But what about the queries and their speed?
Keeping different edge types in different kinds of files seems somehow wrong, even if access is really fast on SSDs.
Sure, I could use a simple relational table to store an edge-type/data-ending pair, but please help me: where am I getting this wrong?
I want to use GUIDs (UUIDs) for naming folders in a huge file store. Each storage item gets its own folder and GUID.
The easiest way would be "x:\items\uuid\{uuid}..."
example: "x:\items\uuid\F3B16318-4236-4E45-92B3-3C2C3F31D44F..."
I see one problem here. What if you expect to get at least 10,000 items, probably a few 100,000, or more than 1 million? I don't want to put that many items (sub-folders) in one folder.
I thought to solve this by splitting up the GUID: taking the first 2 chars to create sub-folders at the first level, then taking the next 2 chars to create sub-folders at the second level.
The above example would be --> "x:\items\uuid\F3\B1\6318-4236-4E45-92B3-3C2C3F31D44F..."
If the first 4 chars of the GUIDs are really as random as expected, then after a while I get 256 folders within 256 folders, and I always end up with a reasonable number of items within each of these folders.
For example if you have 1 million items then you get --> 1 000 000 / 256 /256 = 15.25 items per folder
In the past I've already tested the randomness of the first chars (via a VB.NET app). Result: the items were spread quite evenly over the folders.
Somebody else also came to the same conclusion; see "How evenly spread are the first four bytes of a Guid created in .NET?"
Possible splits I thought of (1 million items as example)
C1 = character 1 of GUID, C2 = character 2, etc
C1\C2\Rest of GUID --> 16 * 16 * 3906 (almost 4000 items is still a lot per folder)
C1\C2\C3\C4\Rest of GUID --> 16 * 16 * 16 * 16 * 15 (unnecessary splitting up of folders)
C1C2\C3C4\Rest of GUID --> 256 * 256 * 15 (for me the best option?)
C1C2C3\Rest of GUID --> 4096 * 244 (too many folders at the first level??)
C1C2C3C4\Rest of GUID --> 65536 * 15 (too many folders at the first level!)
My questions are:
Does anyone see drawbacks to this kind of implementation (scheme: C1C2\C3C4\Rest of GUID)?
Is there some standard for splitting up GUIDs, or a general way of doing this?
What happens if you put a few hundred thousand sub-folders in one folder? (I would still prefer not to use any splitting, if possible.)
Thanks, Mumblic
This is fairly similar to the method git uses for sharding its object database (although with SHA-1 hashes instead of GUIDs...). As with any algorithm, there are pros and cons, but I don't think there are any significant cons in this case that would outweigh the definite pros. There's a little extra CPU overhead to calculate the directory structure, but in the long term, that overhead is probably significantly less than what is needed to repeatedly search through a single directory of a million files.
Regarding how to do it, it depends a bit on what library you are using to generate the GUIDs: do you get them in a byte-array (or even a struct) format that then needs to be converted to a character representation in order to display it, or do you get them in an already formatted ASCII array? In the first case, you need to extract the appropriate bytes and format them yourself; in the second, you just need to extract a substring.
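For the already-formatted case, the substring approach is a few lines; here is a minimal sketch (folder layout taken from the question, everything else hypothetical):

import os
import uuid

def item_folder(root, guid):
    """Shard a GUID into root\C1C2\C3C4\rest-of-guid."""
    s = str(guid).upper()   # e.g. 'F3B16318-4236-4E45-92B3-3C2C3F31D44F'
    return os.path.join(root, s[0:2], s[2:4], s[4:])

print(item_folder(r"x:\items\uuid",
                  uuid.UUID("F3B16318-4236-4E45-92B3-3C2C3F31D44F")))
# on Windows: x:\items\uuid\F3\B1\6318-4236-4E45-92B3-3C2C3F31D44F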
As far as putting an extreme number of sub-folders (or even files) in one folder, the exact performance characteristics are highly dependent on the actual file system in use. Some perform better than others, but almost all will show significant performance degradation the more entries each directory has.
I need to store the number of plays for every second of a podcast / audio file. This will result in a simple timeline graph (like the "hits" graph in Google Analytics) with seconds on the x-axis and plays on the y-axis.
However, these podcasts could potentially go on for up to 3 hours, and 100,000 plays for each second is not unrealistic. That's 10,800 seconds with up to 100,000 plays each. Obviously, storing each played second in its own row is unrealistic (it would result in 1+ billion rows) as I want to be able to fetch this raw data fast.
So my question is: how do I best go about storing these massive amounts of timeline data?
One idea I had was to use a text/blob column and then comma-separate the plays, each comma representing a new second (in sequence) and then the number for the amount of times that second has been played. So if there's 100,000 plays in second 1 and 90,000 plays in second 2 and 95,000 plays in second 3, then I would store it like this: "100000,90000,95000,[...]" in the text/blob column.
Is this a feasible way to store such data? Is there a better way?
Thanks!
Edit: the data is being tracked by another source and I only need to update the raw graph data every 15 minutes or so. Hence, fast reads are the main concern.
Note: due to nature of this project, each played second will have to be tracked individually (in other words, I can't just track 'start' and 'end' of each play).
The problem with the blob storage is that you need to update the entire blob for any change. This is not necessarily a bad thing. Using your format, each second takes roughly 7 characters ("100000,"), so 7 * 3600 * 3 = ~75K bytes. But that means you're updating that 75K blob for every play of every second.
And, of course, the blob is opaque to SQL, so "what second of what song has the most plays" will be an impossible query at the SQL level (it basically takes a table scan of all the data to learn that).
And there's a lot of parsing overhead marshalling that data in and out.
On the other hand: podcast ID (4 bytes), second offset (2 bytes unsigned, which allows podcasts up to 18 hours long), play count (4 bytes) = 10 bytes per second. So, minus any blocking overhead, a 3-hour song is 3600 * 3 * 10 = 108K bytes per song.
If you stored it as a binary blob instead of text (a block of 4-byte integers), it's 4 * 3600 * 3 = ~43K.
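For illustration, a sketch of what such a binary blob could look like before it goes into the column, assuming one unsigned count per second (the 'I' type code is 4 bytes on most platforms):

from array import array

seconds = 3 * 3600                      # a 3-hour podcast
plays = array("I", [0] * seconds)       # one 4-byte unsigned count per second

plays[0] = 100000                       # second 1
plays[1] = 90000                        # second 2
plays[2] = 95000                        # second 3

blob = plays.tobytes()                  # what you'd store in the BLOB column
print(len(blob))                        # 43200 bytes, vs ~75K for the text form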
So, the second/row structure is "only" twice the size (in a perfect world, consult your DB server for details) of a binary blob. Considering the extra benefits this grants you in terms of being able to query things, that's probably worth doing.
The only downside of second-per-row is if you need to do a lot of updates (several seconds at once for one song): that's a lot of UPDATE traffic to the DB, whereas with the blob method, that's likely a single update.
Your traffic patterns will influence that more than anything.
Would it be problematic to use one row per second, storing how many plays there were on a per-second basis?
That means 10K rows, which isn't bad, and you just have to INSERT a row every second with the current data.
EDIT: I would say that that solution is better than doing a comma-separated something in a TEXT column... especially since getting and manipulating the data (which you say you want to do) would be very messy.
I would view it as a key-value problem.
for each second played
    Song[second] += 1
end
As a relational database -
song
----
name | second | plays
And a hacky pseudo-SQL to start a second:
insert into song(name, second, plays) values("xyz", "abc", 0)
and another to update the second
update song set plays = plays + 1 where name = "xyz" and second = "abc"
A 3-hour podcast would have 11K rows.
It really depends on what is generating the data...
As I understand it, you want to implement a map with the key being the second mark and the value being the number of plays.
What are the pieces in the event, unit of work, or transaction you are loading?
Can I assume you have a play event with the podcast name, start and stop times,
and that you want to load it into the map for analysis and presentation?
If that's the case you can have a table
podcastId
secondOffset
playCount
Each event would do an update of the rows between the start and ending positions:
update t
set playCount = playCount +1
where podCastId = x
and secondOffset between y and z
followed by an insert to add those rows between start and stop that don't exist yet, with a playCount of 1, unless you preload the table with zeros.
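A sketch of that update-then-insert step, using sqlite3 from the standard library as a stand-in and assuming a unique key on (podcastId, secondOffset); the column names follow the answer above, the table name is hypothetical:

import sqlite3

conn = sqlite3.connect("plays.db")
conn.execute("""CREATE TABLE IF NOT EXISTS plays (
                  podcastId INTEGER,
                  secondOffset INTEGER,
                  playCount INTEGER,
                  PRIMARY KEY (podcastId, secondOffset))""")

def record_play(podcast_id, start, stop):
    # Make sure every second in the range has a row (preloaded with 0)...
    conn.executemany(
        "INSERT OR IGNORE INTO plays (podcastId, secondOffset, playCount) VALUES (?, ?, 0)",
        [(podcast_id, sec) for sec in range(start, stop + 1)])
    # ...then bump the counters for the whole range in one UPDATE.
    conn.execute(
        "UPDATE plays SET playCount = playCount + 1 "
        "WHERE podcastId = ? AND secondOffset BETWEEN ? AND ?",
        (podcast_id, start, stop))
    conn.commit()

record_play(1, 120, 180)   # one listener played seconds 120..180 of podcast 1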
Depending on the DB, you may be able to set up a sparse table where empty columns are not stored, making it more efficient.