Selenium Webdriver - How to avoid data duplication

Suppose I have a normal "Add Users" module that I want to automate using Java. How can I avoid data duplication, so that I don't get an error message such as "User already exists"?

There are numerous ways in which this can be handled. You are receiving 'User already exists' due to the fact that you're (probably) running your 'Add Users' test cases with static variables.
Note: For the following examples I will consider a basic registration flow/scenario of a new user: name, email, password being the required fields.
Note-002: My language of choice will be JavaScript. You should be able to reproduce the concept with Java with ease.
1.) Prepend/append a unique identifier to the information you're submitting (e.g. Number(new Date()) returns the number of milliseconds that have elapsed since January 1, 1970 => it will effectively be unique for each test run)
var timestamp = Number(new Date());
var email = 'test.e2e' + timestamp + '@<yourMainDomainHere>';
Note: Usually, the name & password don't need to be unique, so you can actually use hardcoded values without any issues.
2.) The same thing can also be achieved with Math.random() (in JS), which returns a pseudo-random floating-point number in the range [0, 1), e.g. 0.8018194703223693.
var almostUnique = Math.random();
// Keep only the decimal digits
almostUnique = almostUnique.toString().split('.')[1];
var email = 'test.e2e' + almostUnique + '@<yourMainDomainHere>';
!!! Warning: While Math.random() does not guarantee uniqueness, in hundreds of regression runs of ~200 functional test cases I have never seen a duplicate.
3.) (Not so elegant | considerably harder to implement) If you have access to your web app's backend API and through it you can execute different actions against the DB, then you can write some scripts that run after your registration test cases, like a cleanup suite.
These scripts will have to remove the previously added user from your database.
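For illustration, such a cleanup step could look roughly like the sketch below. It's written in Python for brevity (the same idea applies in Java or JavaScript), and the base URL, /admin/users endpoint, and token are hypothetical placeholders for whatever your backend actually exposes:

import requests  # assumed available; any HTTP client works

API_BASE = "https://staging.example.com/api"   # placeholder base URL
ADMIN_TOKEN = "replace-me"                     # placeholder credential

def delete_test_user(email):
    """Remove the user created by a registration test so reruns stay clean."""
    resp = requests.delete(
        "%s/admin/users" % API_BASE,              # hypothetical admin endpoint
        params={"email": email},
        headers={"Authorization": "Bearer %s" % ADMIN_TOKEN},
        timeout=10,
    )
    # 404 is acceptable: the user may already have been cleaned up.
    if resp.status_code not in (200, 204, 404):
        raise RuntimeError("Cleanup failed for %s: HTTP %s" % (email, resp.status_code))

Wire this into your test framework's teardown hook so it runs even when the registration test fails halfway.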
Hope this helps!

You don't mention what language you are coding in, but use your language's random function to generate random numbers and/or text for the user id. It won't guarantee that there will not be a duplicate, but the nature of testing is such that you should be able to handle both situations anyway. If this is not clear or I don't understand your question correctly, you'll need to provide a lot more information: what you've tried, what language you use, etc.
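As a minimal sketch of that suggestion (in Python here; Java's java.util.UUID or Random gives you the same thing), append a random suffix to the user id and still keep a branch in the test for the unlikely duplicate:

import uuid

def unique_username(prefix="testuser"):
    """Append a random suffix so repeated runs don't collide on the same user id."""
    return "%s_%s" % (prefix, uuid.uuid4().hex[:8])

print(unique_username())   # e.g. 'testuser_3f9c1a7b'

Collisions are extremely unlikely, but the test should still treat a possible "User already exists" message as a handled outcome rather than a hard failure.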

Related

Laravel assertDatabaseHas in phpunit test is not working

I have the following code:
/** @test */
public function it_updates_customer_status_to_deactivated_for_admin_users()
{
    $this->hubAdminUser = factory(User::class)->state('admin')->create();

    $this->customer = Customer::first();
    $this->customer->status_id = 2; // active
    $this->customer->save();

    // this will update status_id to 3
    $this->actingAs($this->hubAdminUser)
        ->patch(route('hub.customer.updateStatus', $this->customer))
        ->assertRedirect();

    $this->assertDatabaseHas('tenants', [
        'id' => $this->customer->id,
        'status_id' => 3, // deactivated
    ]);
}
The ->patch(route('hub.customer.updateStatus', $this->customer)) line changes the value of status_id from 2 to 3, which it definitely does: I have even tried $this->customer->refresh()->status_id after the ->assertRedirect() line, and that gives me 3. Yet the assertion fails, saying the customer's status_id is still 2 in the database. Any ideas how I can fix this?
I would recommend changing assertDatabaseHas to assertEquals.
Here is why:
"When you’re working with Eloquent, you specify a table name - or it automatically figures it out. Then, you forget about it. So, it’s ideally designed for you not to have to know the name of the table."
"Laravel is architected, then, so that we don’t have to know the names of our database tables. With assertDatabaseHas you have to know the name of the table every time you use it."
"But, I think it’s best to stop asserting directly against a database when you don’t need to. Especially since your code is not generally architected in Laravel to deal directly with the database, why would your tests? Stay in your domain and test the input and output values, not the implementation."
From the article: https://www.aaronsaray.com/2020/stop-assert-database-has-laravel
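The question is PHP/Laravel; purely to illustrate the shape of that advice in one language for this page's sketches, here is the analogous "assert on the refreshed model, not on the table" pattern in a hypothetical Python/Django test (the Customer model and route name are made up):

# Hypothetical Django analogue: reload the model and assert on it directly,
# instead of querying a table by name (mirrors $this->customer->refresh()).
from django.test import TestCase
from django.urls import reverse

from myapp.models import Customer   # placeholder app and model


class CustomerStatusTest(TestCase):
    def test_updates_status_to_deactivated(self):
        customer = Customer.objects.create(status_id=2)   # active
        self.client.patch(reverse("customer-update-status", args=[customer.pk]))
        customer.refresh_from_db()                         # reload state from the DB
        self.assertEqual(customer.status_id, 3)            # deactivated

The point is the same as in the quoted article: assert against the domain object your code actually works with, not against a table name.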

How to Prevent Hotlinking by Using AWS WAF

There is an AWS document that explains how to do it for oneself, i.e. how to allow only one's pages to hotlink and reject all others: https://aws.amazon.com/blogs/security/how-to-prevent-hotlinking-by-using-aws-waf-amazon-cloudfront-and-referer-checking/
I'd like to know if WAF is the right choice for my use case, which is a bit different from the one above.
At the company I work for, we intend to sell data through a JS widget.
We'd like to restrict access to those data so that only authorized REFERERs are able to show our data to their users, while rejecting all other REFERERs.
The possibility of spoofing the REFERER is not an important threat for us.
We expect to grow our customer base to some hundreds.
The reason I'm asking this question is that I noticed WAF has some strict limits (https://docs.aws.amazon.com/waf/latest/developerguide/limits.html), which make me think that it wouldn't scale nicely for our use case.
WAF is not the right tool for that job.
First, even though you get a max of 10 rules, each with a max of 10 conditions, each with a max of 10 filters, there is a hard limit of 100 string match conditions per AWS account.
Second, conditions and filters do not compose nicely for our use case. Conditions within a rule are combined with AND, while filters within a condition are combined with OR. For example (writing + for OR and * for AND), a rule like r(x) := (x=a + x=b + x=c) * (x=d + x=e) gives r(d) = false, without the second condition's x=d filter ever being reached.
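A tiny sketch of that composition logic (not WAF configuration, just the boolean semantics) shows why a single Referer value can never satisfy two allow-list conditions at once:

def rule_matches(value, conditions):
    """conditions is a list of filter lists: filters OR together, conditions AND together."""
    return all(any(value == f for f in filters) for filters in conditions)

# r(x) := (x=a + x=b + x=c) * (x=d + x=e)
rule = [["a", "b", "c"], ["d", "e"]]
print(rule_matches("d", rule))   # False: the first condition already fails, so x=d is never tested
print(rule_matches("a", rule))   # also False: the second condition fails

A long allow-list therefore cannot be spread across the conditions of one rule; it has to go into the filters of a single condition or into many separate rules, which is exactly where the limits above start to bite.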

How much trust should I put in the validity of retrieved data from database?

Another way to ask my question is: "Should I keep the data types coming from the database as simple and raw as the ones I would ask for from my REST endpoint?"
Imagine this case class that I want to store in the database as a row:
case class Product(id: UUID, name: String, price: BigInt)
It clearly isn't, and shouldn't be, what it says it is, because the type signatures of name and price are a lie.
So what we do is create custom data types that better represent what things are, such as (for the sake of simplicity, imagine our only concern is the price data type):
case class Price(value: BigInt) {
  require(value > BigInt(0))
}

object Price {
  def validate(amount: BigInt): Either[String, Price] =
    Try(Price(amount)).toOption.toRight("invalid.price")
}

// As a result my Product class is now:
case class Product(id: UUID, name: String, price: Price)
So now the process of taking user input for product data would look like this:
// this class would be parsed from e.g. a form:
case class ProductInputData(name: String, price: BigInt)

def create(input: ProductInputData) = {
  for {
    validPrice <- Price.validate(input.price)
  } yield productsRepo.insert(
    Product(id = UUID.randomUUID, name = input.name, price = ???)
  )
}
Look at the triple question marks (???). This is my main point of concern from an overall application-architecture perspective: if I had the ability to store a column as Price in the database (for example, Slick supports such custom data types), then I have the option to store the price either as price: BigInt = validPrice.value or as price: Price = validPrice.
I see so many pros and cons in both of these decisions and I can't decide.
Here are the arguments that I see supporting each choice:
Store data as simple database types (i.e. BigInt) because:
Performance: a simple assertion of x > 0 on creation of a Price is trivial, but imagine you want to validate a custom Email type with a complex regex; doing that on retrieval of large collections would be detrimental.
Tolerance against corruption: if a BigInt is inserted as a negative value, it wouldn't explode in your face every time your application simply reads the column and displays it on the user interface. It would, however, cause problems if it got retrieved and then involved in some domain-layer processing such as a purchase.
Store data as its domain-rich type (i.e. Price) because:
No implicit reasoning and trust: other methods somewhere else in the system need the price to be valid. For example:
// two terrible variations of a calculateDiscount method:

// this version simply trusts that price is already valid and came from the db:
def calculateDiscount(price: BigInt): BigInt = {
  // apply some positive coefficient to price and hopefully get a positive
  // number from it; if it's not positive, because price is not positive,
  // then it'll explode in your face.
}

// this version is even worse. It does retain function totality and purity,
// but the unforgivable culture it encourages is the kind of defensive and
// paranoid programming that causes every developer to write guard
// expressions performing duplicated validation all over!
def calculateDiscount(price: BigInt): Option[BigInt] = {
  if (price <= BigInt(0))
    None
  else
    Some {
      // Do safe processing
    }
}

// ideally you want it to look like this:
def calculateDiscount(price: Price): Price
No constant conversion of domain types to simple types and vice versa for representation, storage, the domain layer and such; you simply have one representation in the system to rule them all.
The source of all this mess that I see is the database. If the data were coming from the user, it'd be easy: you simply never trust it to be valid; you ask for simple data types, cast them to domain types with validation, and then proceed. But not the DB. Does modern layered architecture address this issue in some definitive, or at least mitigating, way?
Protect the integrity of the database. Just as you would protect the integrity of the internal state of an object.
Trust the database. It doesn't make sense to check and re-check what has already been checked going in.
Use domain objects for as long as you can. Wait till the very last moment to give them up (raw JDBC code or right before the data is rendered).
Don't tolerate corrupt data. If the data is corrupt, the application should crash. Otherwise it's likely to produce more corrupt data.
The overhead of the require call when retrieving from the DB is negligible. If you really think it's an issue, provide two constructors: one for data coming from the user (performs validation) and one that assumes the data is good (meant to be used by the database code); see the sketch below.
I love exceptions when they point to a bug (data corruption because of insufficient validation on the way in).
That said, I regularly leave requires in code to help catch bugs in more complex validation (maybe data coming from multiple tables combined in some invalid way). The system still crashes (as it should), but I get a better error message.
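The examples in the question are Scala; to keep this page's added sketches in one language, here is the two-constructor suggestion expressed as a hypothetical Python class, with a validating factory for untrusted input and a trusting path for rows that were already validated on their way into the database:

class Price(object):
    """Domain type: a strictly positive amount."""

    def __init__(self, value):
        # Trusting constructor: intended for data read back from the database,
        # which was already validated before it was written.
        self.value = value

    @classmethod
    def from_user_input(cls, value):
        # Validating factory: the only entry point for untrusted input.
        if value <= 0:
            raise ValueError("invalid.price")
        return cls(value)

    @classmethod
    def from_db(cls, value):
        # Optional cheap guard: if the stored data is corrupt, crashing here
        # with a clear message beats propagating bad values into the domain.
        if value <= 0:
            raise AssertionError("corrupt price read from database: %r" % value)
        return cls(value)

The same split works in Scala with a private constructor plus apply/fromDb methods on the companion object; as noted above, the cost of the check on the read path is negligible.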

How to make datastore keys mapreduce-friendly(-er)?

Edit: See my answer. The problem was in our code. MR works fine; it may have a status-reporting problem, but at least the input readers work fine.
I ran an experiment several times now and I am now sure that mapreduce (or DatastoreInputReader) has odd behavior. I suspect this might have something to do with key ranges and splitting them, but that is just my guess.
Anyway, here's the setup we have:
we have an NDB model called "AdGroup"; when creating new entities of this model, we use the same id returned from AdWords (it's an integer), but we use it as a string: AdGroup(id=str(adgroupId))
we have 1,163,871 of these entities in our datastore (that's what the "Datastore Admin" page tells us - I know it's not an entirely accurate number, but we don't create/delete adgroups very often, so we can say for sure that the number is 1.1 million or more).
mapreduce is started (from another pipeline) like this:
yield mapreduce_pipeline.MapreducePipeline(
    job_name='AdGroup-process',
    mapper_spec='process.adgroup_mapper',
    reducer_spec='process.adgroup_reducer',
    input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
    mapper_params={
        'entity_kind': 'model.AdGroup',
        'shard_count': 120,
        'processing_rate': 500,
        'batch_size': 20,
    },
)
So, I've tried to run this mapreduce several times today without changing anything in the code and without making changes to the datastore. Every time I ran it, the mapper-calls counter had a different value, ranging from 450,000 to 550,000.
Correct me if I'm wrong, but considering that I use the very basic DatastoreInputReader, the number of mapper calls should be equal to the number of entities. So it should be 1.1 million or more.
Note: the reason why I noticed this issue in the first place is because our marketing guys started complaining that "it's been 4 days after we added new adgroups and they still don't show up in your app!".
Right now, I can think of only one workaround - write all keys of all adgroups into a blobstore file (one per line) and then use BlobstoreLineInputReader. The writing to blob part would have to be written in a way that does not utilize DatastoreInputReader, of course. Should I go with this for now, or can you suggest something better?
Note: I have also tried using DatastoreKeyInputReader with the same code - the results were similar - mapper-calls were between 450,000 and 550,000.
So, finally, the questions. Is it important how you generate ids for your entities? Is it better to use int ids instead of str ids? In general, what can I do to make it easier for mapreduce to find and map all of my entities?
PS: I'm still in the process of experimenting with this, I might add more details later.
After further investigation we have found that the error was actually in our code. So, mapreduce actually works as expected (mapper is called for every single datastore entity).
Our code was calling some Google services functions that were sometimes failing (the wonderfully cryptic ApplicationError messages). Due to these failures, MR tasks were being retried. However, we had set a limit on taskqueue retries. MR did not detect or report this in any way - it was still showing "success" on the status page for all shards. That is why we thought everything was fine with our code and that there was something wrong with the input reader.

How to generate large files (PDF and CSV) using AppEngine and Datastore?

When I first started developing this project, there was no requirement for generating large files, however it is now a deliverable.
Long story short, GAE just doesn't play nice with any large scale data manipulation or content generation. The lack of file storage aside, even something as simple as generating a pdf with ReportLab with 1500 records seems to hit a DeadlineExceededError. This is just a simple pdf comprised of a table.
I am using the following code:
self.response.headers['Content-Type'] = 'application/pdf'
self.response.headers['Content-Disposition'] = 'attachment; filename=output.pdf'
doc = SimpleDocTemplate(self.response.out, pagesize=landscape(letter))
elements = []
dataset = Voter.all().order('addr_str')
data = [['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE']]
i = 0
r = 1
s = 100
while (i < 1500):
    voters = dataset.fetch(s, offset=i)
    for voter in voters:
        data.append([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone,
                     voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname])
        r = r + 1
    i = i + s
t = Table(data, '', r * [0.4 * inch], repeatRows=1)
t.setStyle(TableStyle([('ALIGN', (0, 0), (-1, -1), 'CENTER'),
                       ('INNERGRID', (0, 0), (-1, -1), 0.15, colors.black),
                       ('BOX', (0, 0), (-1, -1), .15, colors.black),
                       ('FONTSIZE', (0, 0), (-1, -1), 8)
                       ]))
elements.append(t)
doc.build(elements)
Nothing particularly fancy, but it chokes. Is there a better way to do this? If I could write to some kind of file system and generate the file in bits, and then rejoin them that might work, but I think the system precludes this.
I need to do the same thing for a CSV file, however the limit is obviously a bit higher since it's just raw output.
self.response.headers['Content-Type'] = 'application/csv'
self.response.headers['Content-Disposition'] = 'attachment; filename=output.csv'
dataset = Voter.all().order('addr_str')
writer = csv.writer(self.response.out, dialect='excel')
writer.writerow(['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE'])
i = 0
s = 100
while (i < 2000):
    last_cursor = memcache.get('db_cursor')
    if last_cursor:
        dataset.with_cursor(last_cursor)
    voters = dataset.fetch(s)
    for voter in voters:
        writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone,
                         voter.firstname + ' ' + voter.middlename + ' ' + voter.lastname])
    memcache.set('db_cursor', dataset.cursor())
    i = i + s
memcache.delete('db_cursor')
Any suggestions would be very much appreciated.
Edit:
Below I have documented three possible solutions based on my research, plus suggestions, etc.
They aren't necessarily mutually exclusive, and the end result could be a slight variation or combination of any of the three, but the gist of the solutions is there. Let me know which one you think makes the most sense and might perform the best.
Solution A: Using mapreduce (or tasks), serialize each record, and create a memcache entry for each individual record keyed with the keyname. Then process these items individually into the pdf/xls file. (use get_multi and set_multi)
Solution B: Using tasks, serialize groups of records, and load them into the db as a blob. Then trigger a task once all records are processed that will load each blob, deserialize them and then load the data into the final file.
Solution C: Using mapreduce, retrieve the keynames and store them as a list, or serialized blob. Then load the records by key, which would be faster than the current loading method. If I were to do this, which would be better, storing them as a list (and what would the limitations be...I presume a list of 100,000 would be beyond the capabilities of the datastore) or as a serialized blob (or small chunks which I then concatenate or process)
Thanks in advance for any advice.
Here is one quick thought, assuming it is crapping out while fetching from the datastore: you could use tasks and cursors to fetch the data in smaller chunks, then do the generation at the end.
Start a task which does the initial query and fetches 300 (an arbitrary number of) records, then enqueues a named (important!) task that you pass the cursor to. That one in turn queries [your arbitrary number of] records and then passes its cursor on to a new named task as well. Continue that until you have enough records (a rough end-to-end sketch follows at the end of this answer).
Within each task process the entities, then store the serialized result in a text or blob property on a 'processing' model. I would make the model's key_name the same as the task that created it. Keep in mind the serialized data will need to be under the API call size limit.
To serialize your table pretty fast you could use:
serialized_data = "\x1e".join("\x1f".join(voter) for voter in data)
Have the last task (when you have enough records) kick off the PDF or CSV generation. If you use key_names for your models, you should be able to grab all of the entities with encoded data by key. Fetches by key are pretty fast, and you'll know the models' keys since you know the task names. Again, you'll want to be mindful of the size of your fetches from the datastore!
To deserialize:
list(voter.split('\x1f') for voter in serialized_data.split('\x1e'))
Now run your PDF / CSV generation on the data. If splitting up the datastore fetches alone does not help you'll have to look into doing more of the processing in each task.
Don't forget in the 'build' task you'll want to raise an exception if any of the interim models are not yet present. Your final task will automatically retry.
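Putting the steps of this answer together, here is a rough sketch of one link in the chain of named fetch tasks, using the same old-style GAE Python APIs as the question. The /tasks/... URLs, the VoterChunk model, and the queue wiring are assumptions for illustration, not existing code:

# Sketch of one chained, named fetch task as described above.
from google.appengine.api import taskqueue
from google.appengine.ext import db, webapp


class VoterChunk(db.Model):
    """One serialized chunk of rows; key_name equals the task that wrote it."""
    data = db.TextProperty()


class FetchVotersTask(webapp.RequestHandler):
    CHUNK = 300      # the arbitrary batch size from the answer
    TOTAL = 1500     # stop condition from the original question

    def post(self):
        step = int(self.request.get('step', '0'))
        cursor = self.request.get('cursor', None)

        query = Voter.all().order('addr_str')   # Voter is the question's model
        if cursor:
            query.with_cursor(cursor)
        voters = query.fetch(self.CHUNK)

        rows = [[v.addr_num, v.addr_str, v.addr_unit_num, '', v.phone,
                 v.firstname + ' ' + v.middlename + ' ' + v.lastname]
                for v in voters]
        # Serialize with the record/field separators suggested above; each
        # chunk must stay under the datastore entity / API call size limits.
        serialized = "\x1e".join("\x1f".join(unicode(f) for f in row) for row in rows)

        task_name = 'fetch-voters-%d' % step
        VoterChunk(key_name=task_name, data=serialized).put()

        if not voters or (step + 1) * self.CHUNK >= self.TOTAL:
            # Enough records: kick off the build step (assumed handler).
            taskqueue.add(url='/tasks/build_report', name='build-report')
        else:
            # Named tasks make an accidentally duplicated chain fail loudly.
            taskqueue.add(url='/tasks/fetch_voters',
                          name='fetch-voters-%d' % (step + 1),
                          params={'step': step + 1, 'cursor': query.cursor()})

The build task would then fetch the VoterChunk entities by key_name ('fetch-voters-0', 'fetch-voters-1', ...), deserialize them as shown above, and raise if any chunk is missing so that it gets retried.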
Some time ago I faced the same problem with GAE. After many attempts I simply moved to other web hosting, since I was able to. Nevertheless, before moving I had two ideas for how to resolve it. I haven't implemented them, but you may want to try.
The first idea is to use a SOA/RESTful service on another server, if possible. You could even create another application on GAE in Java, do all the work there (I guess with Java's PDFBox it would take much less time to generate the PDF), and return the result to Python. But this option requires you to know Java and also forces you to split your app into several parts with terrible modularity.
So, there's another approach: you can play a "ping-pong" game with the user's browser. The idea is that if you cannot do everything in a single request, you force the browser to send you several. During the first request, do only the part of the work that fits within the 30-second limit, then save the state and generate a 'ticket' - a unique identifier of the 'job'. Finally, send the user a response that is a simple page with a redirect back to your app, parametrized by the job ticket. When you receive it, just restore the state and proceed with the next part of the job.
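A rough sketch of that ping-pong flow, using the same legacy GAE APIs as the rest of this question (the Job model, the /report URL, and the serialize/render_csv helpers are assumptions):

# Each request does one chunk of work, saves progress under a 'ticket',
# and answers with a page that redirects back carrying that ticket.
import uuid
from google.appengine.ext import db, webapp


class Job(db.Model):
    """Saved progress for one report; key_name is the job ticket."""
    cursor = db.TextProperty()
    partial = db.TextProperty()   # accumulated serialized rows (mind the 1 MB entity limit)
    done = db.BooleanProperty(default=False)


class ReportHandler(webapp.RequestHandler):
    CHUNK = 300

    def get(self):
        ticket = self.request.get('ticket')
        job = Job.get_by_key_name(ticket) if ticket else None
        if job is None:
            ticket = uuid.uuid4().hex
            job = Job(key_name=ticket)

        query = Voter.all().order('addr_str')   # Voter is the question's model
        if job.cursor:
            query.with_cursor(job.cursor)
        voters = query.fetch(self.CHUNK)

        job.partial = (job.partial or '') + serialize(voters)   # serialize(): assumed helper
        job.cursor = query.cursor()
        job.done = not voters
        job.put()

        if job.done:
            self.response.headers['Content-Type'] = 'application/csv'
            self.response.out.write(render_csv(job.partial))    # render_csv(): assumed helper
        else:
            # A minimal page that immediately bounces the browser back with the
            # ticket, so each round trip gets a fresh request deadline.
            self.response.out.write(
                '<html><head><meta http-equiv="refresh" '
                'content="0; url=/report?ticket=%s"></head>'
                '<body>Working...</body></html>' % ticket)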
