Closed
Bug 994961
Opened 11 years ago
Closed 11 years ago
Start load testing the staging environment
Categories
(Hello (Loop) :: Server, defect, P3)
Tracking
(Not tracked)
VERIFIED
FIXED
mozilla32
backlog | mlp+ |
People
(Reporter: alexis+bugs, Assigned: jbonacci)
References
Details
(Whiteboard: [est:?][p=.25][s=mlpnightly3][qa!])
We need to send some load to our staging environment in order to see how many concurrent connections it is able to handle. Creating the bug mainly for james to track it :)
This is awaiting some contractual work with our partners (we don't want to send too much load to their servers without noticing them beforehand), and blocking on actually having a stage environment deployed.
Updated•11 years ago
|
Whiteboard: [est:?]
Reporter | ||
Comment 1•11 years ago
|
||
James, can you put an estimation of the time it will take to do? Thanks!
Flags: needinfo?(jbonacci)
Updated•11 years ago
|
backlog: --- → mlp+
Assignee | ||
Comment 2•11 years ago
|
||
On a repeating scale - load test time various...
1. For 1-per-day deployments to Stage, load tests should scale up as follows: 1 hour, 2 hour, 4 hour, 8 hour
Per-minimum should be two hours.
Once Stage stabilizes, then an overnight load test seems reasonable.
2. For 1-per-week deployments to Stage, load tests follow a similar pattern as above, with emphasis on debug work and longer-term load testing.
Day of deployment: 1 hour, 2 hour, 4 hour, 8 hour
During the current Stage deployment: longer load tests as needed for debug, functional checks, error, performance, and so on.
Hope that helps...
Flags: needinfo?(jbonacci)
Comment 3•11 years ago
|
||
To answer mreavy I understood that we want to mock OpenTok calls for the load test.
We will work on it this week in order to be able to run the load test the week after (if that works for James.)
Assignee | ||
Comment 4•11 years ago
|
||
:natim - so far, next week is looking good. After that we have the Fx29 launch...
Comment 5•11 years ago
|
||
The code is ready, we are now waiting for mostlygeek to deploy the staging environment so jbonacci can start load testing.
Updated•11 years ago
|
Status: NEW → ASSIGNED
Comment 6•11 years ago
|
||
Maire to ask Alexis about 987086, 994360, 994961 -- are they really needed for the first nightly of MLP? 978086 is the real question; the other two are unlikely to block and could come a week.
Priority: -- → P2
Target Milestone: --- → mozilla32
Updated•11 years ago
|
Priority: P2 → P3
Comment 7•11 years ago
|
||
marking bugs with no estimate as p=.25, if you own - please change to p=# estimating described here https://wiki.mozilla.org/Media/WebRTC#Bugzilla_Mark-up_.28under_final_review.29
Whiteboard: [est:?] → [est:?][p=.25]
Updated•11 years ago
|
Whiteboard: [est:?][p=.25] → [est:?][p=.25][s=mlpnightly3]
Assignee | ||
Comment 8•11 years ago
|
||
Stage environment is up. Once I verify it, we can start some preliminary (read "functional) load tests to work out the bugs, logging, reporting issues that we find.
Reporter | ||
Comment 9•11 years ago
|
||
Thanks for the heads up, don't hesitate to ping myself once verified so we can start playing with it.
Comment 10•11 years ago
|
||
OK The stage environment is ready for a light load test. So this is what is currently set up:
- https://loop.stage.mozaws.net, 3x m3.medium instances (similar to a prod deploy)
- Elasticache redis - an m1.small instance
- https://loop-delayed-response.stage.mozaws.net, 1x m1.small w/ a random latency between 100:500ms
I would hit it w/ a light load test first. Then crank it up.
It'll be interesting to see:
- https CPU overhead for loop-server calling https://loop-delayed-response.s.m.n
> SSL is offloaded to ELBs so you should be able to hammer this pretty hard
> server has 1024 file handlers, so what happens when > 1024 concurrent connections
- what it takes for the whole cluster to fall over
> what happens when we have 1K, 2K, 4K concurrent users
Let me know if you want me to scale anything up / tweak settings.
Assignee | ||
Comment 11•11 years ago
|
||
:mostlygeek thanks for all this work.
I will try to get to the load test today to check your configs/changes/etc.
I will be out tomorrow, so if not today, then certainly Monday!
Assignee | ||
Comment 12•11 years ago
|
||
Temporarily blocked by bug 1014224
Reporter | ||
Comment 13•11 years ago
|
||
Master is now updated so that you can load test it.
Mostlygeek, can you update stage so that James can load test it?
Flags: needinfo?(bwong)
Assignee | ||
Comment 14•11 years ago
|
||
I opened a Stage deploy ticket here: bug 1019053
Assignee | ||
Comment 15•11 years ago
|
||
Now I am blocked by bug 1019929
Depends on: 1019929
Flags: needinfo?(bwong)
Assignee | ||
Comment 16•11 years ago
|
||
We are on our way!
Quick testing of Stage was successful from my Mac laptop, and also from 3 linux AWS instances.
Keeping this bug open until I can get at least 30min load tests running. Then we can evaluate closing this bug or using it for tracking the first official round of load tests against Stage...
Example:
$ make test
./venv/bin/loads-runner --config=./config/test.ini --server-url=https://loop.stage.mozaws.net:443 loadtest.TestLoop.test_all
[============================================================================================] 100%
Duration: 1.59 seconds
Hits: 5
Started: 2014-06-04 20:35:19.486795
Approximate Average RPS: 3
Average request time: 0.27s
Opened web sockets: 0
Bytes received via web sockets : 0
Success: 1
Errors: 0
Failures: 0
Slowest URL: https://loop.stage.mozaws.net:443/registration
Average Request Time: 0.888493
Stats by URLs:
- https://loop.stage.mozaws.net:443/registration
Average request time: 0.888493 Hits success rate: 1.0
- https://loop.stage.mozaws.net:443/calls/GCbSXurMWshF3ayVDwyH051g03fbotPAThB_3mXbPLm4ywHRehzIgkMPWGLz8o8QLpJeQuW9XNW1VVUW2xwdXOWrH9DKn3LKyoX567kod7sVwn9g3oVC98qfhoqwyt-cQq4XhYER_93Z5KLUbd7inbmWLZRiQuf_lzXCzWzhhpBmTNrrOYRApctri_szuCJyKbLyP4OimiuudK2JDGPTFVhUhj8
Average request time: 0.151973 Hits success rate: 1.0
- https://loop.stage.mozaws.net:443/calls?version=200
Average request time: 0.108467 Hits success rate: 1.0
- https://loop.stage.mozaws.net:443/call-url
Average request time: 0.106362 Hits success rate: 1.0
- https://loop.stage.mozaws.net:443/calls/id/3af69d76b65a32fdb63fae3a2dc65ce6
Average request time: 0.093307 Hits success rate: 1.0
Assignee | ||
Comment 17•11 years ago
|
||
With some very light testing, I am seeing 404s in the nginx access.log file. Here are some typical examples:
1401925447.398 "-" "PUT /push HTTP/1.1" 404 28 "-" "-" 0.001 0.001
1401925447.423 "54.188.254.148" "GET /calls/id/f095c71c33713d938bb156d2b06740ef HTTP/1.1" 404 60 "-" "python-requests/2.3.0 CPython/2.6.6 Linux/2.6.32-431.11.2.el6.x86_64" 0.003 0.003
Are these expected in Stage?
Assignee | ||
Comment 18•11 years ago
|
||
In the nginx access.log I also see a significant number of 204s:
1401924889.127 "24.7.94.153" "DELETE /calls/id/632e434a9351982c43b81045bd733c0d HTTP/1.1" 204 0 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.2.0" 0.003 0.00
3
1401925191.052 "54.189.190.206" "DELETE /calls/id/b84099e88d92ff0a2ce65d201c91580c HTTP/1.1" 204 0 "-" "python-requests/2.3.0 CPython/2.7.6 Linux/3.13.0-24-generi
c" 0.039 0.039
1401925191.063 "54.188.254.148" "DELETE /calls/id/78f44bc968b0c78b9103ff60562a92a0 HTTP/1.1" 204 0 "-" "python-requests/2.3.0 CPython/2.6.6 Linux/2.6.32-431.11.2.
el6.x86_64" 0.018 0.018
1401925191.089 "54.189.190.206" "DELETE /calls/id/f5cf86ac7f25672b59ff4e07803f48d9 HTTP/1.1" 204 0 "-" "python-requests/2.3.0 CPython/2.7.6 Linux/3.13.0-24-generi
c" 0.022 0.022
(varying Python versions)
I assume these are correct given the type of load we are sending, but just wanted to make sure...
If correct, do we really want to be logging these along with the standard 200s?
Assignee | ||
Comment 19•11 years ago
|
||
The longer, more test client intensive 'make bench' works as well. Here is sample console output from rhel linux AWS:
make bench SERVER_URL=https://loop.stage.mozaws.net
./venv/bin/loads-runner --config=./config/bench.ini --server-url=https://loop.stage.mozaws.net loadtest.TestLoop.test_all
[================================================================================================] 100%
Duration: 300.41 seconds
Hits: 29905
Started: 2014-06-04 23:39:00.242098
Approximate Average RPS: 99
Average request time: 0.20s
Opened web sockets: 0
Bytes received via web sockets : 0
Success: 5625
Errors: 0
Failures: 0
Slowest URL: https://loop.stage.mozaws.net/calls/BLAH1
Average Request Time: 0.638324
Stats by URLs (10 slowests):
- https://loop.stage.mozaws.net/calls/BLAH1
Average request time: 0.638324 Hits success rate: 1.0
- https://loop.stage.mozaws.net/calls/BLAH2
Average request time: 0.631144 Hits success rate: 1.0
- https://loop.stage.mozaws.net/calls/BLAH3
Average request time: 0.577574 Hits success rate: 1.0
- https://loop.stage.mozaws.net/calls/BLAH4
Average request time: 0.574913 Hits success rate: 1.0
- https://loop.stage.mozaws.net/calls/BLAH5
Average request time: 0.57401 Hits success rate: 1.0
- https://loop.stage.mozaws.net/calls/BLAH6
Average request time: 0.537536 Hits success rate: 1.0
- https://loop.stage.mozaws.net/calls/BLAH7
Average request time: 0.533552 Hits success rate: 1.0
- https://loop.stage.mozaws.net/calls/BLAH8
Average request time: 0.522361 Hits success rate: 1.0
- https://loop.stage.mozaws.net/calls/BLAH9
Average request time: 0.516951 Hits success rate: 1.0
- https://loop.stage.mozaws.net/calls/BLAH10
Average request time: 0.512179 Hits success rate: 1.0
Assignee | ||
Comment 20•11 years ago
|
||
Moving on to a 30min standard load test, today.
Will continue with more testing tomorrow...
Assignee | ||
Comment 21•11 years ago
|
||
OK. First attempt at a 'make megabench' (that is, a full-on load test) was a failure. The test ended almost immediately :
Testing loadtest.TestLoop.test_all
Test was launched by jbonacci
Run Id 940659b4-26a7-4bab-b8f4-b86d4e4c69b2
Duration 9 sec
Started 2014-06-05 00:19:04 UTC
Ended 2014-06-05 00:19:13 UTC
State Ended
Users [1]
Hits None
Agents 1
Duration 300
Server URL https://loop.stage.mozaws.net
Tests over 0
Successes 0
Failures 0
Errors 1
TCP Hits 0
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 0
and got the following error (as seen on the loads cluster dashboard):
1 occurrences:
loadtest
File "/home/ubuntu/loads/loads/runners/local.py", line 167, in execute
self._execute()
File "/home/ubuntu/loads/loads/runners/local.py", line 247, in _execute
self._run_python_tests()
File "/home/ubuntu/loads/loads/runners/local.py", line 252, in _run_python_tests
self._resolve_name()
File "/home/ubuntu/loads/loads/runners/local.py", line 89, in _resolve_name
self.test = resolve_name(self.fqn)
File "/home/ubuntu/loads/loads/util.py", line 162, in resolve_name
raise ImportError(parts[0])
Nothing of significance in the logs on the server...
I will not file a new bug yet, until we can debug this error, but consider this a blocker...
Comment 22•11 years ago
|
||
This is likely to be a loads-runner CLI argument missing. The include_file directive that doesn't load the right file or something like so.
Reporter | ||
Comment 23•11 years ago
|
||
https://github.com/mozilla-services/loop-server/commit/61cf5852f4b9de3a8ea783ab7df7d26ae233df83 should fix it.
I've just ran a megabench, see it here: https://loads.services.mozilla.com/run/f45740dc-85b1-48ba-b76b-5f9e8e0afbc7
Comment 24•11 years ago
|
||
great job alexis thanks \o/
Assignee | ||
Comment 25•11 years ago
|
||
Yes. Nice catch! Redeploying now - will test shortly...
Assignee | ||
Comment 26•11 years ago
|
||
My first 5min load test was successful.
The second 5min load test had one error:
Errors
1 occurrences:
No JSON object could be decoded
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
testMethod()
File "loadtest.py", line 19, in test_all
call_id, session, caller_token, api_key = self.initiate_call(token)
File "loadtest.py", line 50, in initiate_call
res = self.session.post(self.server_url + '/calls/%s' % token).json()
File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
return json.loads(self.text, **kwargs)
File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
raise ValueError("No JSON object could be decoded")
Running a 10min load test now...
Assignee | ||
Comment 27•11 years ago
|
||
10min load test showed more errors.
I need the JSON errors investigated.
I am not too surprised by the HTTPSConnectionPool error, we have seen this on other AWS stacks during load testing...
I will write up some new bugs after I look at the server logs...
Errors
2 occurrences:
No JSON object could be decoded
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
testMethod()
File "loadtest.py", line 19, in test_all
call_id, session, caller_token, api_key = self.initiate_call(token)
File "loadtest.py", line 50, in initiate_call
res = self.session.post(self.server_url + '/calls/%s' % token).json()
File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
return json.loads(self.text, **kwargs)
File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
raise ValueError("No JSON object could be decoded")
1 occurrences:
HTTPSConnectionPool(host='loop.stage.mozaws.net', port=443): Max retries exceeded with url: /call-url (Caused by <class 'httplib.BadStatusLine'>: '')
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
testMethod()
File "loadtest.py", line 18, in test_all
token = self.generate_token()
File "loadtest.py", line 43, in generate_token
auth=self.hawk_auth
File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 425, in post
return self.request('POST', url, data=data, **kwargs)
File "/home/ubuntu/loads/loads/measure.py", line 79, in request
method, url, headers=headers, **kwargs)
File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 383, in request
resp = self.send(prep, **send_kwargs)
File "/home/ubuntu/loads/loads/measure.py", line 87, in send
res = _Session.send(self, request, **kwargs)
File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 486, in send
r = adapter.send(request, **kwargs)
File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/adapters.py", line 378, in send
raise ConnectionError(e)
Reporter | ||
Comment 28•11 years ago
|
||
I think we want to keep track of all the access logs at the same location. 204 and 200 are normal behavior here.
Interesting to see SSL starting fail ;)
Assignee | ||
Comment 29•11 years ago
|
||
On the server side, the /var/log/nginx/access.log is a mix of 200s, 204s, and 404s, as I previously reported.
Nothing else of interest.
The /var/log/loop_out.log shows the following (recent):
...etc...
Server listening on http://127.0.0.1:8000
apiKey undefined
Calls to TokBox are now mocked.
Server listening on http://127.0.0.1:8000
{ [Error: connect ETIMEDOUT] code: 'ETIMEDOUT', errno: 'ETIMEDOUT', syscall: 'connect' }
{ [Error: socket hang up] code: 'ECONNRESET' }
{ [Error: socket hang up] code: 'ECONNRESET' }
{ [Error: socket hang up] code: 'ECONNRESET' }
That's all I can find right now.
Assignee | ||
Comment 30•11 years ago
|
||
For the next test, I lowered the number of users from 20 to 10, but kept the agents at 5.
This is still a 10 minute load test.
The error counts is up from 3 to 6:
3 occurrences: No JSON object could be decoded
1 occurrences: HTTPSConnectionPool(host='loop.stage.mozaws.net', port=443): Max retries exceeded with url
and something new this round:
2 occurrences:
string indices must be integers
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
testMethod()
File "loadtest.py", line 19, in test_all
call_id, session, caller_token, api_key = self.initiate_call(token)
File "loadtest.py", line 51, in initiate_call
return (res['callId'], res['sessionId'], res['sessionToken'],
Assignee | ||
Comment 31•11 years ago
|
||
Several load tests in a row were successful, then this happened:
https://loads.services.mozilla.com/run/b6b41a0e-2af6-460b-b6bf-2e08f985f687
If you can't see that, here are the details:
Test was launched by jbonacci
Run Id b6b41a0e-2af6-460b-b6bf-2e08f985f687
Duration 1 h 6 min and 53 sec.
Started 2014-06-05 23:22:37 UTC
Ended 2014-06-06 00:29:30 UTC
State Ended
Users [20]
Hits None
Agents 10
Duration 3600
Server URL https://loop.stage.mozaws.net
Tests over 283179
Successes 192230
Failures 0
Errors 90936
TCP Hits 1111685
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 276
addFailure 13
The dashboard is showing the following errors:
18 occurrences:
No JSON object could be decoded
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
testMethod()
File "loadtest.py", line 18, in test_all
token = self.generate_token()
File "loadtest.py", line 45, in generate_token
call_url = resp.json()['call_url']
File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
return json.loads(self.text, **kwargs)
File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
raise ValueError("No JSON object could be decoded")
82 occurrences:
'hawk-session-token'
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
testMethod()
File "loadtest.py", line 17, in test_all
self.register()
File "loadtest.py", line 36, in register
resp.headers['hawk-session-token'])
File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/structures.py", line 77, in __getitem__
return self._store[key.lower()][1]
The 'hawk-session-token' errors are new for the load test.
Assignee | ||
Comment 32•11 years ago
|
||
Moving all error info over to this bug: 1021119
Assignee | ||
Comment 33•11 years ago
|
||
This bug is getting too long and we have proved that we can load test Stage.
Closing this out in favor of more specific load test bugs, like bug 1021119
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•11 years ago
|
Status: RESOLVED → VERIFIED
QA Contact: jbonacci
Whiteboard: [est:?][p=.25][s=mlpnightly3] → [est:?][p=.25][s=mlpnightly3][qa!]
You need to log in
before you can comment on or make changes to this bug.
Description
•