I've been tasked with processing multiple terabytes' worth of SCM data for my company. I set up a Hadoop cluster and have a script to pull data from our SCM servers.
Since I'm processing the data in batches through the streaming interface, I came across an issue with block sizes that O'Reilly's Hadoop book doesn't seem to address: what happens to data that straddles two blocks? How does the word-count example get around this? To work around the issue so far, we've resorted to making each input file smaller than 64 MB.
The issue came up again when thinking about the reducer script: how is aggregated data from the maps stored, and would the same issue come up when reducing?
This should not be an issue provided that each block can cleanly break apart the data for the splits (for example, by line break). If your data is not a line-by-line data set, then yes, this could be a problem. You can also increase the block size on your cluster (dfs.block.size).
You can also customize how inputs are split into key/value pairs and fed to your mapper in streaming:
http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+the+Way+to+Split+Lines+into+Key%2FValue+Pairs
Data from the map step gets sorted based on a partitioner class applied to the map's key.
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29
The data is then shuffled so that all values for the same key end up together, and transferred to the reducer. Optionally, a combiner can run before the reducer step.
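To make the map → sort/shuffle → reduce flow concrete, here is a minimal local simulation of a streaming-style word count (plain Python, not tied to any real cluster; the function names are mine):

```python
from itertools import groupby

def mapper(lines):
    # Emit one (word, 1) pair per token; the framework sorts and shuffles by key.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(sorted_pairs):
    # Streaming hands the reducer key-sorted pairs, so one pass suffices:
    # all values for the same key arrive consecutively.
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Locally simulate the map -> sort/shuffle -> reduce pipeline:
pairs = sorted(mapper(["the quick fox", "the lazy dog"]))
counts = dict(reducer(pairs))
# counts == {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The `sorted()` call stands in for what Hadoop's shuffle does between the map and reduce phases.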
Most likely you can create your own custom -inputreader (here is an example of how to stream XML documents: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html)
If you have multiple terabytes of input, you should consider setting the block size even higher than 128 MB.
If a file is bigger than one block, it can either be split so that each block of the file goes to a different mapper, or the whole file can go to one mapper (for example, if the file is gzipped and therefore not splittable). You can influence this behaviour with configuration options.
Splits are taken care of automatically and you should not have to worry about them. Intermediate map output is written to local disk on the mapper nodes (not HDFS) before being shuffled to the reducers.
Your question about "data straddling two blocks" is exactly what the RecordReader handles. The purpose of a RecordReader is three-fold:
Ensure each k,v pair is processed
Ensure each k,v pair is only processed once
Handle k,v pairs which are split across blocks
What actually happens in (3) is that the RecordReader goes back to the NameNode, gets the handle of a DataNode where the next block lives, and then reaches out via RPC to pull in that block and read the remainder of the first record up to the record delimiter.
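As an illustration only (the real RecordReader works against HDFS blocks over RPC, not a local file), the split-boundary rule can be sketched like this: a reader discards the partial record at its start unless it owns byte 0, and reads past its end to finish the record that straddles the boundary:

```python
def read_records(f, split_start, split_end, delim=b"\n"):
    """Yield the records a single split is responsible for.

    Rule of thumb: a split owns every record that *starts* inside it.
    Seeking to split_start - 1 and discarding one line lands us exactly
    on the first record the split owns, even when the split boundary
    falls exactly on a record boundary.
    """
    if split_start == 0:
        f.seek(0)
    else:
        f.seek(split_start - 1)
        f.readline()           # discard through the end of the previous record
    while f.tell() < split_end:
        record = f.readline()  # may read past split_end to finish a record
        if not record:
            break
        yield record.rstrip(delim)
```

Running two readers over adjacent splits of the same data yields every record exactly once, regardless of where the boundary falls.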
I have a large (Azure) blob file with 10k JSON objects in a single array. This does not perform well because of its size. As I look to re-architect it, I can either create multiple files, each containing a single array of 500-1000 objects, or I could keep the one file but burst the single array into an array of arrays; maybe 10 arrays of 1,000 objects each.
For simplicity, I'd rather break into multiple files. However, I thought this was worth asking the question and seeing if there was something to be learned in the answers.
I would think this depends strongly on your use-case. The multiple files or multiple arrays you create will partition your data somehow: will the partitions be used mostly together or mostly separate? I.e. will there be a lot of cases in which you only read one or a small number of the partitions?
If the answer is "yes, I will usually only care about a small number of partitions" then creating multiple files will save you having to deal with most of your data on most of your calls. If the answer is "no, I will usually need either 1.) all/most of my data or 2.) data from all/most of my partitions" then you probably want to keep one file just to avoid having to open many files every time.
I'll add: in this latter case, it may well turn out that the file structure (one array vs. an array of arrays) doesn't change things very much, since a full scan is a full scan either way. If that's the case, then you may need to think about how to move to the former case, where you partition your data so that your calls fall neatly within a few partitions, or how to move to a different data format.
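If the multiple-files route is chosen, the mechanics are simple; here is a sketch (file-name scheme and function names are mine) of bursting one big array into fixed-size partition files:

```python
import json

def chunk_array(objects, chunk_size=1000):
    """Yield successive slices of at most chunk_size objects."""
    for i in range(0, len(objects), chunk_size):
        yield objects[i:i + chunk_size]

def write_partitions(objects, prefix="part", chunk_size=1000):
    """Write each slice as its own single-array JSON file; return the names."""
    names = []
    for n, chunk in enumerate(chunk_array(objects, chunk_size)):
        name = "%s-%04d.json" % (prefix, n)
        with open(name, "w") as f:
            json.dump(chunk, f)
        names.append(name)
    return names
```

The point of the exercise is that a reader who only needs one partition now opens one small file instead of scanning the whole 10k-object array.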
I have correlated the values from my script and captured them into an array of parameters using Ord=all. Now I want to pick the values at random and write them to a file, in a certain format.
Can someone help me understand how the random function is used in LoadRunner?
script:
web_reg_save_param("param", "rb=\\", "lb=\\", "Ord=all", LAST);
values:
param_1 = blah-blah
param_2 = blah-blah
and so on n on....
... pass it to a file, ...
Greater than 99% of the time, the reason people want to do this is that they intend to take a value generated as output by one virtual user type and pass it as input to another virtual user type. In general this does not work, for the following reasons:
All parameter files are loaded into RAM at the beginning of the test, so a new value written to the tail end of a file will only show up in the next test, not the current one
In a properly designed test, virtual user types are distributed across different load generators. This means you would need to write the file to a common location accessible to all of the virtual users, such as a shared network drive. You would now be adding two extra finite-resource calls to your virtual users: a network request and a disk write. This will slow your virtual users down, possibly introducing a bottleneck into your entire test design
Let's be blunt: very few LoadRunner users have the skills to manage tens, hundreds or thousands of users all reading, writing (and potentially deleting) records from the same file. This is a non-trivial programming operation. By asking how to write the information to a file, you have placed yourself in a skills arena where you don't have the programming maturity to be up to this task. In all likelihood you will introduce all sorts of delays due to locking as all of the users try to access the same file at the same time
HP includes a service to allow users to pass data from one user to another via a broker. This is the Virtual Table Server (VTS). VTS manages the locks and all of the reads, writes and deletes to its internal data files, which simplifies the act of passing data from one user to another immensely. VTS is a "use once" queue for passing data, so there is no reason why you could not also use a queue solution such as RabbitMQ, or a queue table in your database provider, to accomplish the same task. Just be sure not to run the queuing solution on the same infrastructure as your application under test
I am going to convert a text file into SQLite DB form; I am concerned about these points before putting any effort into writing code for it:
Will the text file and its corresponding SQLite DB be the same size?
Will SQLite take less space than the text file?
Or is the text file DB the one with the lowest space requirement?
"Hardware is cheap" - I'd strongly recommend not worrying about size differences, which will likely be insignificant anyway, and instead pick whichever solution best meets the rest of your needs. A text file can work just fine for simple projects, but a database has many more features that can help you organize, backup, and query your data much more efficiently and robustly.
For a more in-depth look at the pros and cons of both options, check out: database vs. flat files
Some things to keep in mind:
(NOTE about this answer: "Files" here refers to internal/external storage, not SharedPrefs)
SQL:
Databases have overhead, which takes up size
If the database or a table becomes corrupt, all data is lost (how bad this is depends on your app; losing several thousand pictures: bad, losing a deletion log: not very bad)
Databases can be compressed (see this)
You can split data into different tables if you have issues with IDs (or whatever way you identify row X), meaning one database can have several tables for objects where object X has identification conflicts with object Y. That basically means you can keep everything in one file and still avoid naming conflicts. (Read more at the bottom of the answer.)
Files:
Every piece of data has to be stored as its own separate file, which takes up space (the name of the file alone costs something)
You cannot store all attributes in one file without setting up an advanced reader that determines the different types of data. If you don't do that, and instead have one file per attribute, you will use a lot of space.
Reading thousands of lines can be slow, especially if you have several (say 100+) very big files
The OS uses space for each file on top of its content; the name of the file, for instance, takes up space. But something to keep in mind is that you can keep all the data of an app in a single file. If you have an app where objects of two different types may have naming issues, you create a new database.
Naming conflicts
Say you have two objects, object X and Y.
Scenario 1
Object X stores two variables. The file names are(x and y are in this case coordinates):
x.txt
y.txt
But in a later version, object Y comes in with the same two files.
So you have to assign an ID to objects X and Y:
0-x.txt
0-y.txt
Every file uses 3 chars(7 total, including extension) on the name alone. This grows bigger the more complex the setup is. See scenario 2
But saving in the database, you get the row with ID 0 and find column X or Y.
You do not have to worry about the file name.
Further, if every object saves a lot of files, the references needed to load or save each file will take up a lot of space. And that affects your APK file, slowly pushing you toward the 50 MB limit (the Google Play limit).
You can create universal methods, but you can do the same with SQL and save space in the APK file. But compared to text files, SQL does save some space in terms of name.
Note, though, that if you save 2-3 files (just to take a number), those few bytes going to names aren't going to matter
It is when you start saving hundreds of files, with long names to avoid naming conflicts, that SQL saves you space. And if the table gets too big, you can compress it. You can zip text files to maybe save some space, but with one-liner files there is not much to save.
Scenario 2
Objects X and Y have three children each.
Every child has 3 variables it saves to the file system. If there were only one object with 3 children, it could save them like this:
[id][variable name].txt
But because there is another parent with 3 children (of the same type, saving the same files), the children saved last are the ones that stay saved; the first 3 get overwritten.
So you have to add the parent ID:
[parent ID][child ID][variable name].txt
Now, if you create a table, you can store your main objects(X and Y in this case). Then, you can either create the first table in a way that makes it recognisable whether the object is the parent or child, or you can create a second table. The second table have two ID values; One to identify the parent and one to identify the child. So if you want to find all the children of object 436, you simply write this query:
SELECT * FROM childrentable WHERE `parent_id`='436'
And that will give you all the attributes for all the children with object 436 as its parent.
And everything is stored in the Cursor when returned.
If you were to do the same with a file, you would need a line like this (where Saver is the file saving and loading class):
Saver.load("0-436-file_name", context);
It is, of course, possible to use a for-loop to cycle through the children's IDs (the 0 at the start), but you would also have to save how many children there are: you cannot enumerate the files as easily, so you have to store values about the number of objects and each object's children.
This means you have to save more values in more files just to be able to find the files you saved in the first place. And this is a really hard way to do things. A database spares you from writing extra files to keep track of how many files you saved: the database returns [x] rows on each query. So if object 436 has no children, SQL returns 0 rows, whereas with files you would have to store 0 as the number of children somewhere. Guessing file names leads to I/O exceptions.
I would expect the text file to be smaller as it has no overhead: all the things a Database gives you have a cost in terms of space.
It sounds like space is the only thing that matters to you, and that you expect to change the contents of the text file often (you call it a 'text file db'). Please note that there is no such thing as a 'text file db'. Reading and writing it will be very slow compared to a proper DB (such as SQLite). Adding different record types (tables in a DB) will complicate your life, and I wouldn't want to try to maintain any sort of referential links between record types in a text file.
Hope that helps -
The brief: I need access to a simple table with only one column, a million rows, no relationships, with just simple 6-character entries: postal codes. I will use it to check user-entered postal codes to find out if they are valid. This will be a temporary solution for a few months until I can remove this validation and leave it to web services. So right now I am looking for a solution to this.
What I have:
Web portal build on top of Adobe CQ5 (Java, OSGi, Apache Sling, CRX)
Linux environment where it is all situated
a plain text file (9 MB) with these million rows
What I want:
to have fast access to this data (read-only, no writes) for only one purpose: to find a row with a specific value (six characters long, containing only Latin letters and digits)
to create this solution as simply as possible, i.e. to use Linux preinstalled software, or software that can be quickly installed and started without long setup and configuration
Currently I have the following options: use a database, or use something like a HashSet to keep these million records. The first solution requires additional steps for installing and configuring a database; the second drives me crazy when I think about a whole million records in a HashSet. So right now I am considering trying SQLite, but I want to hear some suggestions on this problem.
Thanks a lot.
Storing in the content repository
You could store it in the CQ5 repository to eliminate the external dependency on SQLite. If you do, I would recommend structuring the storage hierarchically to limit the number of peer nodes. For example, the postcode EC4M 7RF would be stored at:
/content/postcodes/e/c/4/m/ec4m7rf
This is similar to the approach that you will see to users and groups under /home.
This kind of data structure might also help with autocomplete if you needed it. If you typed ec, then you could return all of the possible subsequent characters for postcodes in your set by requesting something like:
/content/postcodes/e/c.1.json
This will show you the 4 (and the next character for any other postcode in EC).
You can control the depth using a numeric selector:
/content/postcodes/e/c.2.json
This will go down two levels showing you the 4 and the M and any postcodes in those 'zones'.
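Mapping a postcode onto that node path is a one-liner in any language. A sketch (the helper name is mine, and it assumes the four-level layout shown in the example above):

```python
def postcode_path(postcode, root="/content/postcodes"):
    """Build the hierarchical node path described above: one level per
    character of the code's first four characters, with the leaf node
    named by the full normalised code."""
    compact = postcode.replace(" ", "").lower()
    levels = "/".join(compact[:4])
    return "%s/%s/%s" % (root, levels, compact)

# postcode_path("EC4M 7RF") == "/content/postcodes/e/c/4/m/ec4m7rf"
```

Both the validity check and the autocomplete lookup then reduce to resolving a path, which repositories like CRX do cheaply.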
Checking for non-existence using a Bloom Filter
Also, have you considered using a Bloom filter? A Bloom filter is a space-efficient probabilistic data structure that can quickly tell you whether an item is definitely not in a set. There is a chance of false positives, but you can control the probability-vs-size trade-off when creating the filter. There is no chance of false negatives.
There is a tutorial that demonstrates the concept here.
Guava provides an implementation of the Bloom filter that is easily used. It will work like the HashSet, but you may not need to hold the whole dataset in memory.
BloomFilter<Person> friends = BloomFilter.create(personFunnel, 500, 0.01);
for (Person friend : friendsList) {
    friends.put(friend);
}

// much later
if (friends.mightContain(dude)) {
    // the probability that dude reached this place if he isn't a friend is 1%
    // we might, for example, start asynchronously loading things for dude
    // while we do a more expensive exact check
}
Essentially, the Bloom filter could sit in front of the check and obviate the need to perform it for items that are definitely not in the set. For items that may be in the set (~99% accurate, depending on setup), the exact check is then made to rule out a false positive.
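To illustrate the mechanics (a toy pure-Python version, not how Guava implements it; parameters are arbitrary): each item sets k bit positions derived from hashes, and a lookup answers "maybe" only if all k positions are set, which is why false negatives are impossible.

```python
import hashlib

class ToyBloomFilter:
    """Toy Bloom filter: k hash positions per item in an m-slot array.

    might_contain never returns False for an item that was added
    (no false negatives); it may return True for absent items.
    """
    def __init__(self, m_bits=8192, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)   # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            h = hashlib.sha256(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.m

    def put(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))
```

Tuning m (bits) and k (hashes) against the expected item count is exactly the probability-vs-size trade-off mentioned above; Guava's `create(funnel, expectedInsertions, fpp)` does that calculation for you.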
I would try Redis, an in-memory database which can handle millions of key/value pairs and is blazing fast for loading and reading. Connectors exist for all languages, and an Apache module also exists (mod_redis).
You said that this is a temporary solution/requirement, so do you really need a database?
You already have this as a text file. Why not just load it into memory as part of your program, since it's only 9 MB (assuming your process is persistent and always resident), and reference it as an array or just a table of values?
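The in-memory approach really is this small (a sketch; file path and function names are mine, and it assumes one code per line in the text file):

```python
def load_codes(path):
    """Read one postal code per line into a set for O(1) membership checks."""
    with open(path, encoding="ascii") as f:
        return {line.strip().upper() for line in f if line.strip()}

def is_valid(code, codes):
    """Normalise the user's input the same way before checking."""
    return code.strip().upper() in codes
```

A million 6-character strings in a set costs tens of megabytes of heap at most, which is trivial for a persistent server process, and lookups are constant-time.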
My function parses texts and removes short words, such as "a", "the", "in", "on", "at", etc.
The list of these words might be modified in the future. Also, switching between different lists (i.e., for different languages) might also be an option.
So, where should I store such a list?
About 50-200 words
Many reads every minute
Almost no writes (modifications) - for example, once in a few months
I have these options in my mind:
A list inside the code (fastest, but it doesn't sound like good practice)
A separate file "stop_words.txt" (how fast is reading from a file? Should I re-read the same data from the same file every few seconds, each time I call the function?)
A database table. Would it really be efficient when the list of words is supposed to be almost static?
I am using Ruby on Rails (if that makes any difference).
If it's only about 50-200 words, I'd store it in memory in a data structure that supports fast lookup, such as a hash map (I don't know what such a structure is called in Ruby).
You could use option 2 or 3 (persist the data in a file or database table, depending on what's easier for you), then read the data into memory at the start of your application. Store the time at which the data was read and re-read it from the persistent storage if a request comes in and the data hasn't been updated for X minutes.
That's basically a cache. It might be possible that Ruby on Rails already provides such a mechanism, but I know too little about it to answer that.
Since look-up of the stop-words needs to be fast, I'd store the stop-words in a hash table. That way, verifying if a word is a stop-word has amortized O(1) complexity.
Now, since the list of stop-words may change, it makes sense to persist the list in a text file, and read that file upon program start (or every few minutes / upon file modification if your program runs continuously).
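The two answers above combine naturally: a hash set for O(1) lookups, reloaded from the file only when it changes. A sketch of that cache (in Python for illustration, though the same shape works in Ruby; class and file names are mine):

```python
import os

class StopWords:
    """Stop-word lookup backed by a text file (one word per line).

    The file's mtime is checked on each lookup and the set is reloaded
    only when the file has actually changed, so lookups stay O(1)
    and edits to the file are picked up without a restart.
    """
    def __init__(self, path):
        self.path = path
        self.mtime = None
        self.words = set()

    def _refresh(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self.mtime:
            with open(self.path) as f:
                self.words = {line.strip().lower() for line in f if line.strip()}
            self.mtime = mtime

    def is_stop_word(self, word):
        self._refresh()
        return word.lower() in self.words
```

For a list that changes once every few months, checking the mtime (or only re-checking every X minutes, as suggested above) makes the persistence cost effectively zero.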