Memcached failover in non-sticky mode

GoogleCodeExporter opened this issue 9 years ago · comments

Google Code Exporter commented 9 years ago

A bit about the setup I'm using:
1 haproxy as a load balancer;
2 tomcat6 nodes;
2 memcached nodes;
Non-sticky mode. Kryo serialization strategy;
Operation and sessionBackup timeouts are default;
Locking strategy: auto.

What steps will reproduce the problem?
1. Start all the nodes.
2. Log in to the application.
3. Check that the session backup is saved in the secondary memcached node.
4. Shut down the primary memcached node.
5. Navigate to some other page in the application (sometimes I get dropped to
the login screen with a new session identifier; I'm not sure why this happens,
but it's possibly caused by timeouts).
6. Restore the memcached node (it takes a while for tomcat to detect that the
node is back up and to store the session backup in it; I'm looking for
options to change this timeout).
7. As the session backup process is triggered by user requests, in this step
I interact with the application until the session is stored as a backup.
8. Kill the other node (which is now primary).
9. The next interaction with the application gets me to the login screen
(session information lost), but if I change the session id back to the session
that should have been restored, I am able to use the application with that
session identifier.

Basically it's quite an interesting situation, and currently I'm not sure what
is causing this behaviour as I can't reliably reproduce the issue. Any
suggestions will be appreciated.

Original issue reported on code.google.com by d3da...@gmail.com on 24 Jun 2015 at 1:46

Google Code Exporter · Answer 1 · Thu Jul 30 2015 06:23:20 GMT+0800 (China Standard Time)

I've investigated this issue a bit more. So the session is lost when there are
concurrent requests and one of them is matched by requestIgnorePattern. As far
as I understand there's a race condition over which request will get served
first. In case it is the ignored request, the session will be lost as backup
retrieval will not be triggered.
When this parameter is omitted in context.xml, failover works as expected,
but in my case we have a lot of heavy js pages, and each request to such a page
generates 30-50 requests to each memcached node to update the metadata of the
session stored there. So disabling it is not an option.
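
To make the role of the ignore pattern concrete, here is a minimal sketch of
how a URI-based ignore pattern could split the requests of one page load into
"ignored" and "goes through session handling". The regular expression, the
example URIs and the IgnorePatternSketch class are assumptions for
illustration only; the actual pattern from context.xml is not shown in this
thread, and this is not msm code.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not msm code: classifies request URIs the way
// a URI-based ignore pattern would.
class IgnorePatternSketch {
    // Assumed example pattern; the real one from context.xml is not in this thread.
    static final Pattern IGNORE = Pattern.compile(".*\\.(png|gif|jpg|css|js|ico)$");

    // True if the whole URI matches the ignore pattern.
    static boolean isIgnored(String requestUri) {
        return IGNORE.matcher(requestUri).matches();
    }

    public static void main(String[] args) {
        // Made-up URIs standing in for one page load.
        String[] uris = {"/app/img/menu-bg.png", "/app/js/grid.js", "/app/orders/list"};
        for (String uri : uris) {
            System.out.println(uri + " -> " + (isIgnored(uri) ? "ignored" : "session handling"));
        }
    }
}
```

With a pattern like this, the static resources of a js-heavy page never touch
memcached, which is why dropping the pattern multiplies the memcached traffic
per page.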

Original comment by d3da...@gmail.com on 1 Jul 2015 at 3:42

Google Code Exporter · Answer 2 · Thu Jul 30 2015 06:23:20 GMT+0800 (China Standard Time)

Great that you investigated this more!

> So the session is lost when there are concurrent requests and one of them is 
matched by requestIgnorePattern.

Are you referring to a request that should *not* match the 
requestIgnorePattern? Is the pattern incorrect / too broad then?

> As far as I understand there's a race condition which request will get 
served first

If the browser sends parallel requests (e.g. via ajax), then there's indeed no
guarantee which one hits the server first. This would have to be handled on the
client side; the server can do nothing about this.

> In case it will be the ignored request the session will be lost as backup 
retrieval will not be triggered.

But a request after the ignored one should then trigger backup retrieval;
doesn't this happen?

Are the "heavy js pages" somehow related to the session, or are these just
"stateless" resources?

Original comment by martin.grotzke on 2 Jul 2015 at 7:37

Google Code Exporter · Answer 3 · Thu Jul 30 2015 06:23:20 GMT+0800 (China Standard Time)

About the requestIgnorePattern:
the pattern matches the png file in my case. Basically I'm trying the following
scenario:
1. Log in; both memcached nodes are up and the session is backed up correctly.
2. Kill the primary node.
3. When I select the menus, a png request is sent to the backend (css
background). Right after that I click the link and call the controller.

If the png request is served first, the request tracking host valve does not
check the primary node status, and the session is not recovered from the
backup. After that I get a new session id which is not contained in any of the
memcached nodes, and the following request (controller) is served with this new
session id, so the application redirects to the login screen. Currently I'm not
sure how this happens, but disabling requestIgnorePattern fixes the issue.
It may have something to do with the spring security session fixation
protection or other similar stuff.


If the controller request is served first, failover works as expected.
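
The two orderings can be sketched as a toy model. This only restates the
hypothesis from this comment; it is not msm or Tomcat code, and every name in
it (FailoverRaceSketch, backup, visible, handle) is made up for illustration.
In particular, the step that creates a fresh session when none is found is an
assumption, since it is not yet known which component does that.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy model of the suspected race; all names are hypothetical and none of
// this is msm or Tomcat code.
class FailoverRaceSketch {
    // Sessions stored on the surviving (secondary) memcached node.
    final Map<String, String> backup = new HashMap<>();
    // Sessions this Tomcat node can currently see.
    final Map<String, String> visible = new HashMap<>();

    // Returns the session id the response ends up carrying.
    String handle(String sessionId, boolean matchesIgnorePattern) {
        if (!matchesIgnorePattern && !visible.containsKey(sessionId)
                && backup.containsKey(sessionId)) {
            // Tracked request: the session is recovered from the backup node.
            visible.put(sessionId, backup.get(sessionId));
        }
        if (visible.containsKey(sessionId)) {
            return sessionId;
        }
        // Assumption: with no session found, something downstream creates a new one.
        String fresh = UUID.randomUUID().toString();
        visible.put(fresh, "anonymous");
        return fresh;
    }
}
```

In this model, handle("s1", true) for the png request returns a new id, so the
controller request that follows arrives with an id that no memcached node
knows; handle("s1", false) first returns "s1" and the session survives.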


By heavy js pages I mean pages that request a lot of js files while they are
loading. These requests don't change session information in any way.

Original comment by d3da...@gmail.com on 2 Jul 2015 at 2:05

Google Code Exporter · Answer 4 · Thu Jul 30 2015 06:23:20 GMT+0800 (China Standard Time)

[deleted comment]

Google Code Exporter · Answer 5 · Thu Jul 30 2015 06:23:21 GMT+0800 (China Standard Time)

I've tried to reproduce this issue on the msm sample app that is hosted on 
github. The fail-over is working as expected there with the same configuration 
and same tomcat instance. As there were no resources like png, ico etc. I've 
added one but it was still working as expected. 

Also I've tried to make a fix for this behavior by adding the primary memcached 
node availability check in RequestTrackingHostValve where ignorePattern is 
evaluated. As far as I can tell this fix works and failover is working as 
expected in my application.
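
The idea behind the change, roughly: a request is only skipped when it matches
the ignore pattern and the primary memcached node is reachable. The sketch
below is a simplified stand-in and not the code from the actual change; the
class name, the method name and the boolean primaryNodeAvailable parameter are
hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the workaround idea; not the real valve code.
class IgnoreDecisionSketch {
    private final Pattern ignorePattern;

    IgnoreDecisionSketch(String regex) {
        this.ignorePattern = Pattern.compile(regex);
    }

    // A request bypasses session handling only if it matches the pattern
    // AND the primary memcached node is available. If the primary is down,
    // even a matching request goes through normal session handling.
    boolean shouldIgnore(String requestUri, boolean primaryNodeAvailable) {
        return primaryNodeAvailable && ignorePattern.matcher(requestUri).matches();
    }
}
```

The trade-off is that during a primary node outage every static resource
request hits session handling again, so the memcached load the pattern was
meant to avoid comes back for the duration of the outage.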

Original comment by d3da...@gmail.com on 3 Jul 2015 at 2:26

Google Code Exporter · Answer 6 · Thu Jul 30 2015 06:23:21 GMT+0800 (China Standard Time)

Can you submit a pull request with your change?

Original comment by martin.grotzke on 3 Jul 2015 at 9:41

Google Code Exporter · Answer 7 · Thu Jul 30 2015 06:23:21 GMT+0800 (China Standard Time)

Submitted the pull request with a possible fix:
https://github.com/magro/memcached-session-manager/pull/44

Original comment by d3da...@gmail.com on 6 Jul 2015 at 1:22

Google Code Exporter · Answer 8 · Thu Jul 30 2015 06:23:21 GMT+0800 (China Standard Time)

Did you have time to look into it by chance?

Original comment by d3da...@gmail.com on 14 Jul 2015 at 3:59

Google Code Exporter · Answer 9 · Thu Jul 30 2015 06:23:21 GMT+0800 (China Standard Time)

Sorry for the late response, business work took all the time... I had a look at
your PR - AFAICS, if the primary node is unavailable, requests that otherwise
should be ignored are then NOT ignored but go through standard session
handling.

I tend to think that while this may solve the specific issue, it's still just a 
workaround and there is a different root cause.

I'd say that requests that should be ignored should completely bypass session 
handling, so they should not depend on memcached availability at all. If such 
requests cause issues this is probably not the case. I'd prefer to find and fix 
this issue.

What do you think?

Original comment by martin.grotzke on 17 Jul 2015 at 4:22

Google Code Exporter · Answer 10 · Thu Jul 30 2015 06:23:21 GMT+0800 (China Standard Time)

This "fix" was made just to show what I mean and is more a treatment of the
symptom than the cause. It's definitely not a solution for the problem. Also I
was not able to reproduce this issue with the test app (wicket).
So I guess I'll invest a bit more time into investigating this issue until
it is clear what is causing it. Just had a little hope that you'd
"magically" find the problem =).

Original comment by d3day.an...@gmail.com on 19 Jul 2015 at 1:32

Google Code Exporter · Answer 11 · Thu Jul 30 2015 06:23:21 GMT+0800 (China Standard Time)

Yeah, ok :-) Great that you're investigating this!

Original comment by martin.grotzke on 21 Jul 2015 at 8:13