[p4] Perforce broker - processes killed with signal 11 (SIGSEGV)

Discussion:

jamiep

2012-10-21 17:55:01 UTC

Posted on behalf of forum user 'jamiep'.

Good evening all,

We had a problem with P4Broker (v2012.1) last week: it started having
problems spawning new processes, and they were all killed with signal 11 (SIGSEGV):

[...]
Perforce Broker info: 2012/10/18 22:41:11 pid 4972 ***@tpdfn_fnet-tp-dfn2 192.168.0.1 [p4/2011.1/SOLARIS10X86_64/393975] 'user-info' Action: [PASS] [perforce1:9020]
Perforce Broker info: 2012/10/18 22:41:11 pid 4970 completed.
Perforce Broker info: 2012/10/18 22:41:11 pid 4972 completed.
Perforce Broker error: Process 4974 exited on a signal 11!
Perforce Broker error: Process 4975 exited on a signal 11!
Perforce Broker error: Process 4976 exited on a signal 11!
Perforce Broker error: Process 4977 exited on a signal 11!
Perforce Broker error: Process 4978 exited on a signal 11!
Perforce Broker error: Process 4979 exited on a signal 11!
Perforce Broker error: Process 4980 exited on a signal 11!
Perforce Broker error: Process 4981 exited on a signal 11!
Perforce Broker error: Process 4982 exited on a signal 11!
Perforce Broker error: Process 4983 exited on a signal 11!

Unfortunately, these are the only logs which were generated, which isn't much
to go on. Would increasing the log level help? I'm slightly concerned that
increasing the verbosity would also decrease throughput.
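
Even at the current log level, the crash lines can be tallied to see how many children died and over which pid range. The following is only a rough sketch: it assumes the crash lines keep the exact wording shown in the excerpt above, and the function name summarize_crashes is made up for illustration.

```python
import re
from collections import Counter

# Matches crash lines such as
#   "Perforce Broker error: Process 4974 exited on a signal 11!"
# \s+ tolerates messages that were wrapped across lines when pasted.
CRASH_RE = re.compile(r"Process\s+(\d+)\s+exited\s+on\s+a\s+signal\s+(\d+)")


def summarize_crashes(log_text):
    """Return (Counter mapping signal -> count, list of crashed pids)."""
    by_signal = Counter()
    pids = []
    for pid, signal in CRASH_RE.findall(log_text):
        by_signal[int(signal)] += 1
        pids.append(int(pid))
    return by_signal, pids


# Small inline sample taken from the excerpt above.
sample = (
    "Perforce Broker error: Process 4974 exited on a\n"
    "signal 11! Perforce Broker error: Process 4975 exited on a signal 11!"
)
counts, pids = summarize_crashes(sample)
print(counts, pids)  # Counter({11: 2}) [4974, 4975]
```

Feeding it the full contents of p4broker.log and comparing the first crashed pid with the last "completed" pid gives a rough time window for when the failures began.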

The p4broker and p4d versions are as follows:

Server version: P4D/SOLARIS10X86_64/2011.1/409024 (2012/01/25)
Broker version: P4BROKER/SOLARIS10X86_64/2012.1/473528

At this point p4broker was receiving tens of requests per second, but we had
a similar level of load the previous night and it was able to cope.

It is running on a Solaris (x86) system alongside p4d, and there were no problems
with p4d. The box was neither CPU bound nor disk bound, and it had sufficient spare
swap and memory. There was no spike in CPU usage; in fact it dropped off, since
p4broker was no longer redirecting to p4d.

Has anyone noticed anything similar with p4broker?

I can post our entire sanitised p4broker.cfg if necessary but here are the
basics:

target = perforce1:9020;
listen = 9019;
zeroconf = false;
server-name = "Perforce";
server-desc = "Perforce depot";
directory = /perforce/p4broker;
logfile = /perforce/extensions/logs/p4broker.log;
debug-level = server=1;
admin-name = "Perforce administrators";
admin-phone = 1000;
admin-email = ***@example.com;
redirection = selective;

altserver: replica1
{
    target = perforce2:9020;
}

Thanks,
Jamie

--
Please click here to see the post in its original format:
http://forums.perforce.com/index.php?/topic/2122-perforce-broker-processes-killed-with-signal-11-sigsegv

Michael Mirman

2012-10-22 00:00:10 UTC

I remember once having this problem, and it turned out that I had installed the wrong version (64-bit instead of 32-bit, or something like that).
First, I would run "file p4broker" to see whether its output matches my expectation.
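
Where "file" is not available, the same check can be approximated by reading the ELF header directly. This is only a sketch: it relies on the ELF convention that the fifth byte of the file (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects, and the helper names are made up for illustration.

```python
def elf_class(header):
    """Return 32 or 64 for an ELF header, or None if the bytes are not ELF."""
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64; anything else is unknown.
    return {1: 32, 2: 64}.get(header[4])


def binary_bits(path):
    """Read the first five bytes of a file and classify it."""
    with open(path, "rb") as fh:
        return elf_class(fh.read(5))


# Hypothetical usage: binary_bits("./p4broker") should agree with what
# "file p4broker" reports for the same binary.
```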

--
Michael Mirman
The MathWorks, Inc. Tel: (508) 647-7555
3 Apple Hill Drive, Natick, MA 01760-2098

jamiep

2012-10-22 07:15:01 UTC

Hi Michael,

Originally posted to the perforce-user mailing list by: Michael Mirman
I remember once having this problem, and it turned out that I had installed the wrong version (64-bit instead of 32-bit, or something like that).
First, I would run "file p4broker" to see whether its output matches my expectation.

Thanks for replying, I've checked and it looks like it's 64-bit (as I
would expect):

perforce1{perforce}502: file p4broker
p4broker: ELF 64-bit LSB executable AMD64 Version 1, dynamically linked, stripped
perforce1{perforce}503: file p4d
p4d: ELF 64-bit LSB executable AMD64 Version 1, dynamically linked, stripped

The broker had been working fine up to that point. There was a spike of requests
at the time it stopped; however, there was a similar spike the night before and it
didn't fail at that stage.

Do you routinely restart the broker every week? Is this something that we should
be doing?

Any further suggestions would be appreciated.

Thanks,
Jamie

Michael Mirman

2012-10-22 13:49:31 UTC

Jamie -

Post by jamiep
Do you routinely restart the broker every week? Is this something that we
should be doing?

No, we restart the broker extremely rarely. Mostly, only when we upgrade it.
We generate the broker config frequently, which changes the settings and how the broker redistributes the load, but we don't restart the broker.
Our currently running broker is 2012.1/473528.

Coincidentally, some of our processes crashed with the same segv code, but it was on the p4d side, and I finally tracked it down. It was related to the replication and a missing journal that replication tried to read.
I don't think it's your case, but I wanted to mention it just in case.

Does your broker work most of the time and only occasionally segfault, or does nothing go through the broker at all?
Is it possible that the segfault comes from a filter program the broker tries to start?
In other words, try to connect the segfault to a specific operation the broker performs.

I searched our broker logs and did not see any segfault.

jamiep

2012-10-22 16:35:01 UTC

Originally posted to the perforce-user mailing list by: Michael Mirman
No, we restart the broker extremely rarely. Mostly, only when we upgrade it.
We generate the broker config frequently, which changes the settings and how the broker redistributes the load, but we don't restart the broker.
Our currently running broker is 2012.1/473528.

OK, I thought it might be related to the length of time it's been running.

Quote

Coincidentally, some of our processes crashed with the same segv code, but it was on the p4d side, and I finally tracked it down. It was related to the replication and a missing journal that replication tried to read.
I don't think it's your case, but I wanted to mention it just in case.

We are using replication; however, our p4d process did not fail. In fact, there
were no error messages at all in the p4d log.

Quote

Does your broker work most of the time and only occasionally segfault, or does nothing go through the broker at all?
Is it possible that the segfault comes from a filter program the broker tries to start?
In other words, try to connect the segfault to a specific operation the broker performs.

This is the first time it's happened since we first enabled the broker in
August. Unfortunately, when it does go down it takes down all access to our
depot, so when it happened it was quickly noticed.

We recently changed the broker configuration to rule out some issues, so at the
moment all it does is PASS. There are very few rules in place and no filters
enabled.

I think we may just have to wait and see if it happens again, which is
unfortunate. I have checked that there was no core dump generated.

Quote

I searched our broker logs and did not see any segfault.

Slightly off this subject - do you have any problems with redirecting read-only
commands to a replica for automated processes? We had some issues where a script
would create a label or client, then run a subsequent command against the
replica which depended on the label or client existing - but there was not
enough time for the replica to receive the update, so the command fails.
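
One possible workaround on the script side is to poll the replica until the new object is visible before running the dependent command. The following is a minimal sketch, not something we have in production: wait_until and label_on_server are hypothetical helper names, and the label check assumes that "p4 labels -e <name>" prints one line per matching label starting with "Label <name>".

```python
import subprocess
import time


def wait_until(check, timeout=30.0, interval=1.0):
    """Poll check() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def label_on_server(port, label):
    """True if `p4 -p <port> labels -e <label>` lists the label.

    Assumption: output lines look like "Label <name> <date> '<description>'".
    """
    out = subprocess.run(
        ["p4", "-p", port, "labels", "-e", label],
        capture_output=True, text=True, check=False,
    ).stdout
    return any(line.split()[:2] == ["Label", label] for line in out.splitlines())


# Hypothetical usage, right after creating the label "nightly-build" on the master:
# if not wait_until(lambda: label_on_server("perforce2:9020", "nightly-build")):
#     raise RuntimeError("replica has not caught up yet")
```

This only hides the replication delay from the script; it does nothing to shorten it, and the timeout has to be chosen to match how far behind the replica is allowed to fall.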

Thanks,
Jamie

Michael Mirman

2012-10-22 16:53:21 UTC

Post by jamiep
Slightly off this subject - do you have any problems with redirecting read-only
commands to a replica for automated processes? We had some issues where a script
would create a label or client, then run a subsequent command against the
replica which depended on the label or client existing - but there was not
enough time for the replica to receive the update, so the command fails.

Yes!
This is actually currently the biggest problem with replication that we need to solve.
Upgrading hardware and making the replication faster helps only to a certain extent.

When it's a human being who submitted a change and now expects to see it on a certain screen, I believe we can explain that there might be a "small" delay. (Of course, "small" is in the eye of the beholder.)
It's harder to deal with the automation.

One example: we mirror certain information from submitted changes into a different database, which is used for the automatic builds. Once in a while, a build starts against a change that cannot yet be found on the replica the build is using.
In this case, we plan to move the mirroring mechanism from the master to a replica. So, changes are mirrored only after they show up on the replica.
This approach changes certain requirements and expectations for the replica, but at least all queries will have consistent results.

jamiep

2012-10-24 09:05:01 UTC

Originally posted to the perforce-user mailing list by: Michael Mirman
One example: we mirror certain information from submitted changes into a different database, which is used for the automatic builds. Once in a while, a build starts against a change that cannot yet be found on the replica the build is using.
In this case, we plan to move the mirroring mechanism from the master to a replica. So, changes are mirrored only after they show up on the replica.
This approach changes certain requirements and expectations for the replica, but at least all queries will have consistent results.

Do you think the new forwarding replica would help with the scenario where you
have an automated process making changes which it then relies upon later?

<http://kb.perforce.com/article/1600/forwarding-replicas> - is this what
you mean?

Thanks,
Jamie

Michael Mirman

2012-10-24 10:20:11 UTC

Post by jamiep
Do you think the new forwarding replica would help with the scenario where
you have an automated process making changes which it then relies upon later?

I have to admit: I actually didn't think about the forwarding replica, and it certainly makes sense to try it.
My guess is that this replica must be based on "p4 pull" rather than the (now) old "p4 replicate", which is what we still use in most replicas. (The main reason we're still using the old "p4 replicate" is that we have never integrated the new replication into our infrastructure.)

Len Boyle

2012-10-22 14:08:56 UTC

Have you checked whether a core file was written?
If so, did you look at the core file?

jamiep

2012-10-22 16:40:01 UTC

Originally posted to the perforce-user mailing list by: Len Boyle
Have you checked whether a core file was written?
If so, did you look at the core file?

Thanks for the suggestion, that was one of the things I checked and core files
were disabled in the OS (Solaris).

I'll enable this so if/when it happens again we'll have something to
look at.
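
One way to make sure the broker and the children it forks are allowed to write cores is to raise the core size limit in a small launcher before starting p4broker. The following is only a sketch, assuming the restriction is the per-process core size limit (RLIMIT_CORE); on Solaris the coreadm settings may also need changing, and the paths in the usage comment are placeholders rather than our real layout.

```python
import os
import resource


def raise_core_limit():
    """Raise the soft core-size limit to the hard limit; return the new soft limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Raising the soft limit up to the hard limit needs no extra privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)[0]


def exec_with_cores(argv):
    """Replace the current process with argv after raising the core limit.

    Resource limits are inherited, so children forked by the command get
    the same limit.
    """
    raise_core_limit()
    os.execv(argv[0], argv)


# Hypothetical usage (placeholder paths):
# exec_with_cores(["/perforce/bin/p4broker", "-c", "/perforce/p4broker/p4broker.cfg"])
```

If the hard limit itself is 0, this changes nothing and the limit has to be raised by whatever starts the launcher.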

Thanks,
Jamie
