Appscale behaving strangely
We are facing a strange problem today. Our code has an endpoint http://lifeprint2.mcongroup/v2/search that actually calls one of the servers where we have CraftAR setup.
Now when we try this through lifeprint2.mcongroup.cn it doesn't work and exits without returning any information. But if we access through this ip address which is I believe is for the load balancer.
it works fine. Was there anything changed that could have introduced this ?
Secondly we have been experiencing the lifeprint2.mcongroup.cn site return 404 errors on our other endpoints as well.
I have checked the logs recently and it shows that our server becomes available one minute and becomes unavailable the next. Our client app is not able to access it consistently.
Please look at the logs below
http://lifeprint2.mcongroup.cn/config was called multiple times within seconds. It has returned both 200 and 404 stasuses. This is affecting our application in China.> app___lpappscale_default_v1_1540812885692-20000.log <
INFO 2018-10-29 15:22:31,970 recording.py:673] Saved; key: appstats:051900, part: 12 bytes, full: 1379 bytes, overhead: 0.000 + 0.004; link: http://lifeprint2.mcongroup.cn/_ah/stats/details?time=1540826551964
INFO 2018-10-29 15:22:31,976 module.py:638] default: "GET / HTTP/1.0" 200 9
INFO 2018-10-29 15:22:40,673 recording.py:673] Saved; key: appstats:060400, part: 58 bytes, full: 13330 bytes, overhead: 0.001 + 0.007; link: http://172.31.16.247/_ah/stats/details?time=1540826560486
INFO 2018-10-29 15:22:40,679 module.py:638] default: "GET /config HTTP/1.0" 200 4881 > app___lpappscale_default_v1_1540812885692-20001.log <
INFO 2018-10-29 15:23:10,492 recording.py:673] Saved; key: appstats:090400, part: 24 bytes, full: 1258 bytes, overhead: 0.000 + 0.003; link: http://172.31.23.105/_ah/stats/details?time=1540826590486
INFO 2018-10-29 15:23:10,497 module.py:638] default: "GET /config HTTP/1.0" 404 233
INFO 2018-10-29 15:23:11,442 recording.py:673] Saved; key: appstats:091400, part: 24 bytes, full: 1210 bytes, overhead: 0.000 + 0.004; link: http://172.31.16.247/_ah/stats/details?time=1540826591434
INFO 2018-10-29 15:23:11,448 module.py:638] default: "GET /config HTTP/1.0" 404 233
#1 Updated by Usamah Ateeq 16 days ago
I tried running the command multiple times.
It fails on first 2-3 attempts and returns 404 errors and then it succeeds.
Please look into this urgently.
#2 Updated by Chris Donati 16 days ago
- Assignee set to Chris Donati
- Priority changed from Normal to High
I'm looking into this right now. At first glance, it appears there are two failure modes. Some requests don't go through to the app (empty request). Of the requests that do go through, some return with 404 errors. Those 404s are returned by the application itself.
The second issue is easier for me to reproduce, so I'm looking at that first.
#4 Updated by Chris Donati 16 days ago
- Project changed from LifePrint to AppScale
- Status changed from New to Feedback
- Deployment type set to AppScale - self deployed
It looks like the service is back up and running. We found a couple of issues, and I will follow up here shortly to describe all of them.
#5 Updated by Chris Donati 15 days ago
~2018-10-29 16:20: Anton and I started investigating the issue. We found that the load balancer sometimes gave an empty response, and it sometimes forwarded the request to the application.
When the application did receive the request, it responded sometimes with a 200, sometimes with a 404, and occasionally with a 500 due to a datastore error.
While investigating the load balancer, we found that all 5 instances were added to the ELB. 3 of them were marked "down", which is expected because they aren't serving traffic over :80.
Minor action item: We should remove the 3 AppScale machines that do not have the load_balancer role from the ELB.
The two machines with the load_balancer role were alternating between being down and up according to the ELB. We discovered that this was because the ELB's health check was marking the machine as "down" after 2 consecutive failures (the previously mentioned 404s and 500s) and “up” only after 10 consecutive successes.
~2018-10-29 15:10: I temporarily relaxed the ELB's HealthyThreshold to the minimum permissible value (2) and the UnhealthyThreshold to the maximum permissible value (10). I did this just in case the application was making URLFetch calls to itself. I wanted to make sure that while debugging the application, the ELB was always forwarding requests to the app.
After making this change, the ELB quickly marked the relevant machines as “up”, and requests were consistently forwarded to the app. The app still occasionally responded with a 404 or a 500.
~2018-10-29 15:40: We found that 2 of the 4 running AppServer instances consistently responded to /config requests with 404s. The other 2 instances usually responded with a 200, but they would occasionally respond with a 500 due to a datastore error.
We isolated the two problematic instances by keeping the processes running but removing routing for them so that we could debug them without affecting the behavior for users.
We found that the two problematic instances encountered a datastore error while they were starting (app.lp.initialize_lp). Because of this, flask did not finish setting up all of the defined routes, and the app responded to all future requests to /config with a 404.
With the problematic instances unrouted, the app no longer responded with 404s. However, it still occasionally encountered errors during datastore operations. We tracked this problem down to a single datastore server that had lost its connection to ZooKeeper. Typically when this happens, kazoo (the ZooKeeper client library we use) will continually try to reconnect. However, a bug in the kazoo library prevented that from happening in this case. I submitted a pull request to address the kazoo bug here: https://github.com/python-zk/kazoo/pull/534
Timeline of relevant events on the deployment (all timestamps are UTC):
2018-10-28 06:32:06: The datastore server at 172.31.16.247:4003 lost connection to ZooKeeper
2018-10-28 06:38:28: The same datastore server encountered the kazoo bug, preventing it from reconnecting.
2018-10-29 11:35:27: The AppServer instance at 172.31.16.65:20001 encountered an error while making a call to the problematic datastore during initialize_lp. All future calls calls to /config on this instance were handled with a 404.
2018-10-29 11:35:28: The AppSerever instance at 172.31.25.57:20004 encountered the same problem as the other instance.
2018-10-29 17:44:51: The problematic instances were stopped, fixing the 404s.
2018-10-29 18:16:35: The problematic datastore server was restarted, fixing the occasional datastore API errors.