Process two data sources successively in Apache Flink

I'd like to batch process two files with Apache Flink, one after the other.
For a concrete example: suppose I want to assign an index to each line, such that lines from the second file follow the first. Instead, the following code interleaves lines from the two files:
val env = ExecutionEnvironment.getExecutionEnvironment
val text1 = env.readTextFile("/path/to/file1")
val text2 = env.readTextFile("/path/to/file2")
val union = text1.union(text2).flatMap { ... }
I want to make sure all of text1 is sent through the flatMap operator first, and then all of text2. What is the recommended way to do so?
Thanks in advance for the help.

DataSet.union() does not provide any order guarantees across inputs. Records from the same input partition will remain in order but will be merged with records from the other input.
But there is a more fundamental problem. Flink is a parallel data processor. When processing data in parallel, a global order cannot be preserved. For example, when Flink reads files in parallel, it tries to split these files and process each split independently. The splits are handed out without any particular order. Hence, the records of a single file are already shuffled. You would need to set the parallelism of the whole job to 1 and implement a custom InputFormat to make this work.
You can make that work, but it won't run in parallel and you will need to tweak many things. I don't think Flink is the best tool for such a task.
Have you considered using simple Unix command-line tools to concatenate your files?
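If strict ordering matters more than parallelism, the indexing itself is easy to do outside Flink. A minimal sketch in plain Python (not Flink API; the paths are the placeholders from the question):
def indexed_lines(paths):
    # Yield (index, line) pairs, numbering continuously across the files
    # in the order given, so file2's lines follow file1's.
    index = 0
    for path in paths:
        with open(path) as f:
            for line in f:
                yield index, line.rstrip("\n")
                index += 1

for idx, line in indexed_lines(["/path/to/file1", "/path/to/file2"]):
    pass  # process (idx, line) sequentially here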

Related

Solution to handle billions of records for faster insertion and instant retrieval

I have a text file (call it the grandparent file) which contains 1 million lines. Each of these lines contains the absolute path of another file (call them parent files), as shown below. The paths of the parent files are unique.
%: cat input.txt - grand parent file
/root/a/b/c/1.txt -- parent file1
/root/a/b/c/2.txt -- parent file2 ......
...
/root/a/b/d/3.txt
......
.....
up to 1 million files.
Again, each of the above parent files contains absolute paths of different files (call them child files) and their line numbers, as shown below. The same child file may be present in multiple parent files with the same or different line numbers.
%: cat /root/a/b/c/1.txt -- parent file
s1.c,1,2,3,4,5 -- child file and its line numbers
s2.c,1,2,3,4,5....
...
up to thousands of files
%: cat /root/a/b/c/2.txt
s1.c,3,4,5
s2.c,1,2,3,4,5....
...
up to thousands of files
Now my requirement is: given a child file and a line number, I need to return, within a minute, all the parent files that contain that child file and line number. The insertion needs to be completed within a day.
I created a relational database with the following schema:
ParentChildMapping - Contains the required relation
ID AUTOINCREMENT PRIMARY KEY
ParentFileName TEXT
ChildFileName TEXT
LNumber INT
For a given file name and line number:
SELECT ParentFileName FROM ParentChildMapping WHERE ChildFileName='s1.c' AND LNumber=1;
I divided the grandparent file into 1000 separate sets, each containing 1000 records. Then I have a Python program which parses each set, reads the contents of the parent files, and inserts them into the database. I can create a thousand processes running in parallel and insert all the records in parallel, but I am not sure what the impact on the relational database will be, since I would be inserting millions of records in parallel. I am also not sure if a relational database is the right approach here. Could you please let me know if there is any tool or technology that better suits this problem? I started with SQLite, but it does not support concurrent inserts and failed with a database-lock error. Now I want to try MySQL or any other solution that suits the situation.
Sample code that runs as a thousand parallel processes to insert into MySQL:
import MySQLdb

connection = MySQLdb.connect(host, username, ...)
cursor = connection.cursor()
with open(some_set) as fd:
    for each_parent_file in fd:
        each_parent_file = each_parent_file.strip()
        with open(each_parent_file) as parent_fd:
            for each_line in parent_fd:
                child_file_name, *line_numbers = each_line.strip().split(",")
                insert_items = [(each_parent_file, child_file_name, line_num)
                                for line_num in line_numbers]
                cursor.executemany(
                    "INSERT INTO ParentChildMapping (ParentFileName, ChildFileName, LNumber) "
                    "VALUES (%s, %s, %s)", insert_items)
connection.commit()
cursor.close()
connection.close()
Let's start with a naïve idea of what a database would need to do to organize your data.
You have a million parent files.
Each one contains thousands of child files. Let's say 10,000.
Each one contains a list of line numbers. You didn't say how many. Let's say 100.
This is 10^6 * 10^4 * 10^2 = 10^12 records. Suppose that each is 50 bytes. That is 50 terabytes of data. We need it organized somehow, so we sort it, which requires on the order of log_2(10^12), or around 40, passes over the data. The naïve approach therefore has to move about 2 * 10^15 bytes. To do that in a day of 86,400 seconds, we would need to process about 23 GB of data per second.
Your hard drive probably doesn't have 50 terabytes of space. Even if it did, it probably doesn't stream data faster than about 500 MB/second, which is 50 times too slow.
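For reference, a quick back-of-the-envelope check of those numbers (every input here is one of the rough assumptions stated above, not a measurement):
import math

records = 10**6 * 10**4 * 10**2      # parents x child files x line numbers = 10^12
bytes_total = records * 50           # ~50 bytes per record -> 5e13 bytes = 50 TB
passes = math.log2(records)          # ~40 passes for an external sort
traffic = bytes_total * passes       # ~2e15 bytes moved in total
per_second = traffic / 86400         # ~23 GB/s to finish within one day
print(bytes_total, passes, traffic, per_second)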
Can we improve this? Well, of course. Probably half the passes can happen strictly in memory. You can replace records with 12 byte tuples. There are various ways to compress this data. But the usual "bulk insert data, create index" is NOT going to give you the desired performance on a standard relational database approach.
Congratulations. When people talk about #bigdata, they usually have small data. But you actually have enough that it matters.
So...what can you do?
First, what can you do with out-of-the-box tools?
If one computer doesn't have the horsepower, we need something distributed: a distributed key/value store like Cassandra, and something like Hadoop or Spark to process the data.
If we have those, all we need to do is process the files and load them into Cassandra as records of line numbers, keyed by parent+child file. We then run a MapReduce job to find, by child file + line number, which parent files contain it, and store that back into Cassandra. We then get answers by querying Cassandra.
BUT keep in mind the back-of-the-envelope estimate of the amount of data and processing required. This approach allows us, with some overhead, to do all of that in a distributed way, so we can do that much work and store that much data in a fixed amount of time. However, you will also need that many machines to do it on, which you can easily rent from AWS, but you'll wind up paying for them as well.
OK, suppose you're willing to build a custom solution. Can you do something more efficient, and maybe run it on one machine? After all, your original data set fits on one machine, right?
Yes, but it will also take some development.
First, let's make the data more efficient. An obvious step is to create lookup tables from file names to integer indexes. You already have the parent files in a list; this just requires inserting a million records into something like RocksDB for the forward lookup, and the same for the reverse. You can also generate a list of all child filenames (with repetition), run it through Unix sort -u to get a canonical list, and build a similar child-file lookup the same way.
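A sketch of that lookup-table step, using in-memory Python dicts rather than RocksDB just to show the shape of the mapping (the durable store is the same idea with a different backend; input.txt is the grandparent file from the question):
def build_lookup(names):
    # Assign a small integer id to every distinct name, and keep the reverse map.
    forward = {}
    reverse = []
    for name in names:
        if name not in forward:
            forward[name] = len(reverse)
            reverse.append(name)
    return forward, reverse

with open("input.txt") as f:                 # the grandparent file
    parent_ids, parent_names = build_lookup(line.strip() for line in f)
# Build the child-file lookup the same way from the deduplicated child list.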
Next, the reason why we were generating so much data before is that we were taking a line like:
s1.c,1,2,3,4,5
and were turning it into:
s1.c,1,/root/a/b/c/1.txt
s1.c,2,/root/a/b/c/1.txt
s1.c,3,/root/a/b/c/1.txt
s1.c,4,/root/a/b/c/1.txt
s1.c,5,/root/a/b/c/1.txt
But if we turn s1.c into a number like 42, and /root/a/b/c/1.txt into 1, then we can turn this into something like this:
42,1,1,5
Meaning that child file 42, parent file 1 starts on line 1 and ends on line 5. If we use, say, 4 bytes for each field then this is a 16 byte block. And we generate just a few per line. Let's say an average of 2. (A lot of lines will have one, others may have multiple such blocks.) So our whole data is 20 billion 16 byte rows for 320 GB of data. Sorting this takes 34 passes, most of which don't need to be written to disk, which can easily be inside of a day on a single computer. (What you do is sort 1.6 GB blocks in memory, then write them back to disk. Then you can get the final result in 8 merge passes.)
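A rough sketch of that encoding, assuming the integer ids from the lookup tables above already exist; the run compression into (child, parent, start, end) blocks is the part that shrinks the data:
import struct

def encode_parent_line(line, child_id, parent_id):
    # Turn "s1.c,1,2,3,4,5" into packed 16-byte (child, parent, start, end) runs.
    numbers = sorted(int(n) for n in line.strip().split(",")[1:])
    records = []
    start = prev = numbers[0]
    for n in numbers[1:]:
        if n == prev + 1:                 # still inside a consecutive run
            prev = n
            continue
        records.append(struct.pack("<4I", child_id, parent_id, start, prev))
        start = prev = n
    records.append(struct.pack("<4I", child_id, parent_id, start, prev))
    return records

blocks = encode_parent_line("s1.c,1,2,3,4,5", child_id=42, parent_id=1)
# -> one 16-byte block meaning: child 42, parent 1, lines 1 through 5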
And once you have that sorted file, you can then just write out an index of offsets marking where each child file's records start.
If each child file is in thousands of parent files, then decoding this is a question of doing a lookup from filename to child-file ID, then a lookup from child-file ID to the range of records that list that child file. Go through those thousands of records and form a list of the thousands of parent files that had the line number in their range. Now look up their names and return the result. This lookup should run in seconds and (since everything is read-only) can be done in parallel with other lookups.
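A toy sketch of that read path, assuming the sorted records are available as (child_id, parent_id, start, end) tuples and the name/id lookups exist; in the real thing the range scan would read from the sorted file via the offset index rather than an in-memory list:
import bisect

def find_parents(child_name, line_no, sorted_records, child_ids, parent_names):
    cid = child_ids[child_name]
    # All records for this child id form one contiguous range in the sorted data.
    lo = bisect.bisect_left(sorted_records, (cid,))
    hi = bisect.bisect_left(sorted_records, (cid + 1,))
    return {parent_names[pid]
            for _, pid, start, end in sorted_records[lo:hi]
            if start <= line_no <= end}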
BUT this is a substantial amount of software to write. It is how I would go. But if the system only needs to be used a few times, or if you have additional needs, the naïve distributed solution may well be cost effective.

Column Wise Reading a Matrix from a text file in C and storing them separately

So I have 2 text files which each contain a matrix whose size I don't know. I have another program running in parallel which computes the matrix multiplication of these two. Both programs (p1 and p2) will be running in a round-robin fashion for some time quantum t. I will be using threads to read the files in parallel in p1 and have to pass the data to p2 as it arrives. So I was thinking that I will read file 1 row-wise and file 2 column-wise and pass these to p2, so that whenever p1 gets preempted by p2, p2 has something to work on rather than waiting for p1's next turn until it has read the whole matrix, since the multiplication needs the rows from the first matrix and the columns from the second one.
While searching for ways to read a file column-wise, all the solutions I found read the whole file at once and then parse it into columns, or something like that.
What I want to know is how to read the columns of the second file without reading the rest, so that p2 gets the data it needs to start the multiplication without waiting for the whole matrix.
Any other way to do this without reading columns is also welcome.
I know what you are trying to do with this question. You are besmirching the name of your institution and I am incredibly heartbroken and disappointed in the actions you are taking.
I wish you had read our honorable Professor's words on the difference between discussion and plagiarism. Will you be mentioning the entire internet as your collaborator in the report?
Indeed, does your group even know of this reckless, shameful, inhuman, unfair, unprincipled and disrespectful action you are attempting?
I would personally, with a heavy heart, recommend you immediately look inside yourself, and find a way to answer this question that satisfies your inner soul, with honesty, decency, integrity, and honor.

Creating New Matching Logic in Informatica (Ratcliffe-Obershelp)

I am conducting a matching project in Informatica 10.2.1 wherein I need to identify matching strings within product descriptions. Ratcliffe-Obershelp is the matching strategy I need to implement.
I've heard Ratcliffe-Obershelp yields better results than Jaro-Winkler, but I am not sure how to code this into a transformation in Informatica since it is not built in.
No code to show as I don't even know where to start.
I'd expect this to be a transformation/group of transformations that would reproduce the matching score that Ratcliffe-Obershelp creates on a per-line basis.
If I understand correctly, the matching logic performs operations in a loop iterating over the input strings. It is not possible to implement such a "loop over string" in an Expression Transformation using built-in functions. I see two options:
create a DECODE function with multiple conditions for each possible length - This will be ugly, and it is only feasible if we assume comparisons start at the beginning of each string; implementing full substring comparison will be... so ugly I can't imagine it :)
use a Java Transformation - as much as I hate putting Java into mappings, there are some cases where it's justified. This looks like one of the few.
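For reference, the score itself is easy to reproduce outside Informatica, which is handy for validating whatever the Java Transformation ends up producing. Python's difflib.SequenceMatcher implements the Ratcliffe-Obershelp ("gestalt pattern matching") measure:
from difflib import SequenceMatcher

def ratcliff_obershelp(a, b):
    # Returns a similarity score between 0.0 and 1.0.
    return SequenceMatcher(None, a, b).ratio()

print(ratcliff_obershelp("WIDGET 10MM BLUE", "WIDGET 10 MM BLU"))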

Improving performance when looping in a big data set

I am doing some spatio-temporal analysis (in MATLAB) on a fairly big data set, and I am not sure which strategy is best in terms of performance for my script.
Currently, the data set is split into 10 yearly arrays of dimension (latitude, longitude, time) = (50, 60, 8760).
The general structure of my analysis is:
for iterations=1:Big Number
1. Select a specific site of spatial reference (i,j).
2. Do some calculation on the whole time series of site (i,j).
3. Store the result in archive array.
end
My question is:
Is it better (in terms of general performance) to have
1) all data in big yearly (50,60,8760) arrays, loaded once as global variables. At each iteration the script will have to extract one particular "site" (i,j,:) from those arrays for processing.
2) 50*60 distinct files stored in a folder, each containing a particular site's time series (a vector of dimension (total time range, 1)). At each iteration the script will then have to open, process, and close a specific file from the folder.
Because your computations operate on the entire time series, I would suggest storing the data as a 3000x8760 matrix and doing the computations that way.
Your accesses then will be more cache-friendly.
You can reformat your data using the reshape function:
newdata = reshape(olddata,50*60,8760);
Now, instead of accessing olddata(i,j,:), you need to access newdata(sub2ind([50 60],i,j),:).
After doing some experiments, it is clear that the second option with 3000 distinct files is much slower than manipulating big arrays loaded in the workspace. But I didn't try loading all 3000 files into the workspace before computing (a tad too much).
It looks like reshaping the data helps a little bit.
Thanks to all contributors for your suggestions.

hadoop split file into equal sizes

I'm trying to learn how to divide a file stored in HDFS into splits and read them with different processes (on different machines).
What I expect is that if I have a SequenceFile containing 1200 records and 12 processes, I would see around 100 records per process. The way I divide the file is by getting the length of the data, dividing it by the number of processes, deriving chunk/beg/end offsets for each split, and then passing that split to e.g. a SequenceFileRecordReader and retrieving records in a simple while loop. The code is below.
private InputSplit getSplit(int id) throws IOException {
    ...
    for (FileStatus file : status) {
        long len = file.getLen();
        BlockLocation[] locations = fs.getFileBlockLocations(file, 0, len);
        if (0 < len) {
            long chunk = len / n;
            long beg = (id * chunk) + (long) 1;
            long end = (id) * chunk;
            if (n == (id + 1)) end = len;
            return new FileSplit(file, beg, end, locations[locations.length - 1].getHosts());
        }
    }
    ...
}
However, the result shows that the sum of the records counted by each process differs from the number of records stored in the file. What is the right way to divide the SequenceFile into even chunks and distribute them to different hosts?
Thanks.
I can't help but wonder why you are trying to do such a thing. Hadoop automatically splits your files, and 1200 records split into batches of 100 doesn't sound like a lot of data. If you elaborate on what your problem is, someone might be able to help you more directly.
Here are my two ideas:
Option 1: Use Hadoop's automatic splitting behavior
Hadoop automatically splits your files. The number of blocks a file is split up into is the total size of the file divided by the block size. By default, one map task will be assigned to each block (not each file).
In your conf/hdfs-site.xml configuration file, there is a dfs.block.size parameter. Most people set this to 64 or 128 MB. However, if you are trying to do something tiny, like 100 sequence file records per block, you could set this really low... say, to 1000 bytes. I've never heard of anyone wanting to do this, but it is an option.
Option 2: Use a MapReduce job to split your data.
Have your job use an "identity mapper" (basically implement Mapper and don't override map). Also, have your job use an "identity reducer" (basically implement Reducer and don't override reduce). Set the number of reducers to the number of splits you want. Say you have three sequence files that you want split into a total of 25 files: load up those 3 files and set the number of reducers to 25. Records will get randomly sent to each reducer, and what you end up with is close to 25 equal splits.
This works because the identity mappers and reducers effectively don't do anything, so your records will stay the same. The records get sent to random reducers, and then they will get written out, one file per reducer, into part-r-xxxx files. Each of those files will contain your sequence file(s) split into somewhat even chunks.
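As a rough illustration of option 2, here is what the identity step looks like if you go via Hadoop Streaming with a Python script used as both mapper and reducer. This is a different route than subclassing the Java Mapper/Reducer described above, and since streaming works on text, the Java identity classes remain the more faithful choice for actual SequenceFile input/output:
#!/usr/bin/env python
# identity.py -- copies stdin to stdout unchanged, so records simply get
# repartitioned across however many reducers the job is configured with.
import sys

for line in sys.stdin:
    sys.stdout.write(line)
Submitting it with the streaming jar, this script as both -mapper and -reducer, and the reducer count set to the desired number of splits (e.g. -D mapreduce.job.reduces=25) produces the part-r-* files described above.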
