This article was originally posted on Kongregate's Developer Blog.
[subheading]Overview[/subheading]With more and more developers using cloud services for hosting, it is critical to understand the performance metrics and limits of your application. First and foremost, this knowledge helps you keep your infrastructure up and running. Just as important, it ensures you are not allocating more resources than you need and wasting money in the process.
This post describes how Kongregate uses Locust.io for internal load testing of our infrastructure on AWS, and gives you an idea of how you can do similar instrumentation within your own organization to gain a deeper understanding of how your system will hold up under various loads.
[subheading]What Is Locust.io?[/subheading]Locust is a code-driven, distributed load testing suite built in Python. Locust makes it very simple to create customizable clients, and gives you plenty of options to allow them to emulate real users and traffic.
Because Locust is distributed, it is easy to test your system with hundreds of thousands of concurrent users, and the intuitive web-based UI makes it trivial to start and stop tests.
For more information on Locust, see their documentation and GitHub repository.
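To give a sense of what a test looks like, here is a minimal locustfile sketch written against the older locustio API we were using at the time; the endpoint path, wait times, and class names are placeholders rather than anything from our actual suite:
from locust import HttpLocust, TaskSet, task

class BrowseBehavior(TaskSet):
    # Tasks are picked at random, weighted by the number passed to @task
    @task(1)
    def front_page(self):
        self.client.get("/")

class WebsiteUser(HttpLocust):
    task_set = BrowseBehavior
    min_wait = 1000   # milliseconds to wait between tasks
    max_wait = 5000

# Run it locally:      locust -f locustfile.py --host=https://staging.example.com
# Run it distributed:  locust -f locustfile.py --master
#                      locust -f locustfile.py --slave --master-host=<master ip>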
[subheading]How We Defined Our Tests[/subheading]We wanted to create a test that would allow us to determine how many more web and mobile users we could support on our current production stack before we needed to either add more web servers or upgrade to a larger database instance.
In order to create a useful test suite for this purpose, we looked at both our highest-throughput and our slowest requests in New Relic. We split those out into web and mobile requests, and kept adding Locust tasks for each one until we were satisfied that we had an acceptable test suite.
We configured the weights for each task so that the relative frequencies were correct, and tweaked the rate at which tasks were performed until the number of requests per second for a given number of concurrent users was similar to what we see in production. We also set weights for web vs. mobile traffic so that we could predict what might happen if a mobile game goes viral and starts generating a lot of load on those specific endpoints.
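As a rough sketch of how that weighting can be expressed (the class names, endpoints, and numbers below are illustrative, not our production values), the older locustio API lets you weight individual tasks with @task(n) and weight entire client classes with a weight attribute:
from locust import HttpLocust, TaskSet, task

class WebBehavior(TaskSet):
    @task(10)   # relative frequency within this TaskSet
    def homepage(self):
        self.client.get("/")

    @task(2)
    def game_page(self):
        self.client.get("/games/example")

class MobileBehavior(TaskSet):
    @task(5)
    def api_feed(self):
        self.client.get("/api/feed")

class WebUser(HttpLocust):
    task_set = WebBehavior
    weight = 3   # spawn roughly three web users for every mobile user
    min_wait = 2000
    max_wait = 10000

class MobileUser(HttpLocust):
    task_set = MobileBehavior
    weight = 1
    min_wait = 2000
    max_wait = 10000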
Our application also has vastly different performance metrics for authenticated users (they are more expensive), so we added a configurable random chance for users to create an authenticated session in our on_start function. The Locust HTTP client persists cookies across requests, so maintaining a session is quite simple.
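A rough sketch of that on_start logic (the session endpoint, credential scheme, host, and 25% rate below are placeholders for illustration):
import random
from locust import HttpLocust, TaskSet, task

AUTH_CHANCE = 0.25   # configurable fraction of simulated users that log in

class MixedBehavior(TaskSet):
    def on_start(self):
        # Some users authenticate up front; the HTTP client keeps the
        # session cookie for all of their subsequent requests.
        self.authenticated = random.random() < AUTH_CHANCE
        if self.authenticated:
            self.client.post("/session", {
                "username": "loadtest_user_%d" % random.randint(1, 5000),
                "password": "known_test_password",
            })

    @task
    def landing(self):
        self.client.get("/account" if self.authenticated else "/")

class SiteUser(HttpLocust):
    task_set = MixedBehavior
    host = "https://staging.example.com"   # placeholder test stack URL
    min_wait = 2000
    max_wait = 10000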
[subheading]How We Ran Our Tests[/subheading]We have all of our infrastructure represented as code, mostly via CloudFormation. With this methodology we were able to bring up a mirror of our production stack with a recent database snapshot to test against. Once we had this stack running we created several thousand test users with known usernames and passwords so that we could initiate authenticated sessions as needed.
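One simple way to do that kind of seeding is a throwaway script along these lines; the endpoint, count, and naming scheme here are placeholders, and in practice this could just as easily be a database seed task:
import requests

STACK = "https://staging.example.com"   # placeholder test stack URL

# Create a pool of test accounts with predictable credentials
for i in range(1, 5001):
    requests.post(STACK + "/users", data={
        "username": "loadtest_user_%d" % i,
        "password": "known_test_password",
    })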
Initially, we just ran Locust locally with several hundred clients to ensure that we had the correct behavior. After we were convinced that our tests were working properly, we created a pool of EC2 Spot Instances for running Locust on Amazon Linux. We knew we would need a fairly large pool of machines to run the test suite, and we wanted to use more powerful instance types for their robust networking capabilities. There is a real cost associated with load testing that many users, and using Spot Instances helped us mitigate it.
In order to ensure we had everything we needed on the nodes, we simply used the following user data script for instance creation on our stock Amazon Linux AMI:
#!/bin/bash
sudo yum -y update
sudo yum -y install python
sudo yum -y groupinstall 'Development Tools'
sudo pip install pyzmq
sudo pip install locustio
To start the Locust instances on these nodes we used Capistrano along with cap-ec2 to orchestrate starting the master node and getting the slaves to attach to it. Capistrano also allowed us to easily upload our test scripts on every run so we could rapidly iterate.
Note: If you use EC2 for your test instances, you'll need to ensure your security group is set up properly to allow traffic between the master and slave nodes. By default, Locust needs to communicate on ports 5557 and 5558.
[subheading]Test Iteration[/subheading]While attempting to hit our target number of concurrent users, we ran into a few snags. Here are some of the problems we encountered, along with some potential solutions:
- Locust Web UI locking up when requests had dynamic URLs:
- Group these requests under a common name to keep the UI manageable (see the example just after this list).
- Ramping up too aggressively:
- Don't open the floodgates full force, unless that's what you're testing.
- If you're on AWS this can overload your ELB and push requests into the surge queue, where they are in danger of being discarded.
- This can overload your backend instances with high overhead setup tasks (session, authentication, etc.).
- High memory usage (~1MB per active client):
- We did not find a solution for this, but instead just brute-forced the problem by adding nodes with more memory.
- Errors due to too many open files:
- Increase the open files limit for the user running Locust.
- Alternatively, you can run your Locust processes as root, and increase the limit directly from your script:
import resource

try:
    resource.setrlimit(resource.RLIMIT_NOFILE, (1000000, 1000000))
except Exception:
    print("Couldn't raise resource limit")
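For the dynamic URL issue in the first item above, grouping is done by passing the name parameter to the request call, which folds every variant into a single row of Locust statistics. For example, inside a task (the path here is a placeholder, not one of our endpoints):
# Without name=, each user_id would get its own row in the Locust web UI
self.client.get("/users/%d/profile" % user_id, name="/users/[id]/profile")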
Share a session cookie between HTTP/HTTPS requests:
r = self.client.post('https://k.io/session', {'username': u, 'password': p})
self.cookies = { '_session': r.cookies['_session'] }
self.client.get('http://k.io/account', cookies=self.cookies)
Debug request output to verify correctness locally before ramping up:
r = self.client.get('/some_request')
print(r.text)
[subheading]Outcome[/subheading]After all was said and done, we ended up running a test with roughly 450,000 concurrent users. This allowed us to discover some Linux kernel settings that were improperly tuned and causing 502 Bad Gateway errors, and it also revealed the breaking points of both our web servers and our database. The test helped us confirm that we had made the right choices regarding the number of web server processes per instance and the instance types themselves.
We now have a better idea of how our system will respond to viral game launches and other events, we can perform regression tests to ensure that large features don't slow things down unexpectedly, and we can use the information gathered to further optimize our architecture and reduce overall costs.