Jan Friesse [Wed, 18 Nov 2020 16:52:12 +0000 (17:52 +0100)]
qnetd: Move client schedule disconnect handling
Client disconnect used to be handled per client fd in
qnetd_client_net_socket_poll_loop_set_events_cb. The problem is that
disconnect calls the algorithm, which may send a message to another client
whose fd was already processed in the pr-poll-loop, so POLLOUT is
not set until a new loop exec is called (and that usually happens
because the old one times out). To reproduce this problem, use
ffsplit and make qnetd disconnect one of the clients: ffsplit
needs to send ack/nack votes, but it doesn't send them during
the first iteration and waits for the dpd timeout.
Jan Friesse [Wed, 18 Nov 2020 13:31:01 +0000 (14:31 +0100)]
qnetd: Improve dead peer detection
Previously the dead peer detection timer was scheduled every dpd_interval.
It added dpd_interval to the timestamp of every client, and if the
timestamp was larger than client heartbeat interval * 1.2 it checked
whether the client had sent some message. If so, the flag was reset.
This method was the source of a number of problems, so a different method
is now used.
Every client has its own timer with a timeout equal to the
(configurable) dpd_interval_coefficient multiplied by the
client heartbeat timeout. When a message is received from the client, the
timer is rescheduled. When the timer callback is called (= the client
didn't send a message during the timeout), the client is disconnected.
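The per-client scheme described above can be sketched roughly as
follows. This is a Python illustration only, not the qnetd C code; the
class and method names are hypothetical, and deadlines are kept in a
plain dict rather than in a real timer list.

```python
class DpdTimers:
    """Per-client dead peer detection deadlines (illustrative only)."""

    def __init__(self, dpd_interval_coefficient):
        self.coefficient = dpd_interval_coefficient
        # client id -> (absolute deadline, heartbeat timeout)
        self.clients = {}

    def client_connected(self, client_id, heartbeat_timeout, now):
        deadline = now + self.coefficient * heartbeat_timeout
        self.clients[client_id] = (deadline, heartbeat_timeout)

    def message_received(self, client_id, now):
        # Any message from the client reschedules its own timer.
        _, heartbeat_timeout = self.clients[client_id]
        deadline = now + self.coefficient * heartbeat_timeout
        self.clients[client_id] = (deadline, heartbeat_timeout)

    def expire(self, now):
        # Clients whose timer fired sent nothing during the timeout,
        # so they are dropped (disconnected) and reported.
        dead = sorted(c for c, (deadline, _) in self.clients.items()
                      if deadline <= now)
        for client_id in dead:
            del self.clients[client_id]
        return dead
```

A caller would invoke message_received from its receive path and
expire from its main loop.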
Jan Friesse [Thu, 5 Nov 2020 16:25:59 +0000 (17:25 +0100)]
timer-list: Improve efficiency of delete operation
The position in the entries array, heap_pos, is added to the entry.
It has to be kept in sync on every move, so new internal set/get
functions are added too.
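A minimal sketch of the idea, in Python rather than the project's C:
each entry remembers its index in the heap array, so delete can go
straight to the right slot instead of searching for it. Only heap_pos
and entries come from the message above; every other name here is
hypothetical.

```python
class Entry:
    def __init__(self, expire_time):
        self.expire_time = expire_time
        self.heap_pos = -1  # index in TimerHeap.entries, -1 when not queued


class TimerHeap:
    def __init__(self):
        self.entries = []

    def _set(self, pos, entry):
        # The only place an entry is moved, so heap_pos cannot go stale.
        self.entries[pos] = entry
        entry.heap_pos = pos

    def _sift_up(self, pos):
        entry = self.entries[pos]
        while pos > 0:
            parent = (pos - 1) // 2
            if self.entries[parent].expire_time <= entry.expire_time:
                break
            self._set(pos, self.entries[parent])
            pos = parent
        self._set(pos, entry)

    def _sift_down(self, pos):
        entry = self.entries[pos]
        size = len(self.entries)
        while True:
            child = 2 * pos + 1
            if child >= size:
                break
            right = child + 1
            if (right < size and self.entries[right].expire_time
                    < self.entries[child].expire_time):
                child = right
            if entry.expire_time <= self.entries[child].expire_time:
                break
            self._set(pos, self.entries[child])
            pos = child
        self._set(pos, entry)

    def insert(self, entry):
        self.entries.append(entry)
        entry.heap_pos = len(self.entries) - 1
        self._sift_up(entry.heap_pos)

    def delete(self, entry):
        # No search needed: heap_pos says where the entry lives.
        pos = entry.heap_pos
        last = self.entries.pop()
        entry.heap_pos = -1
        if last is entry:
            return
        self._set(pos, last)
        self._sift_up(pos)
        self._sift_down(last.heap_pos)
```

With the index stored in the entry, delete costs O(log n) instead of
an O(n) scan followed by a re-heapify.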
Jan Friesse [Wed, 4 Nov 2020 16:26:54 +0000 (17:26 +0100)]
timer-list: Implement heap based timer-list
The previous timer-list was a naive implementation of a priority queue
and very slow when the number of timers increased. This was not a problem
because only a few timers were used. But with the removal of the dpd timer
and its replacement with a per-connection timer this may become
problematic.
The solution is to use a binary heap based priority queue, which is much
faster.
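For illustration, a heap based timer list can be sketched in Python
with the standard heapq module. This is not the timer-list API of the
project; the class and method names are assumptions made for the
example.

```python
import heapq
import itertools


class TimerList:
    def __init__(self):
        self._heap = []
        # Sequence number breaks ties between equal expire times so that
        # callbacks themselves are never compared.
        self._seq = itertools.count()

    def add(self, expire_time, callback):
        heapq.heappush(self._heap, (expire_time, next(self._seq), callback))

    def time_to_expire(self, now):
        # The earliest timer is always at index 0 of the heap.
        if not self._heap:
            return None
        return max(0, self._heap[0][0] - now)

    def expire(self, now):
        fired = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, callback = heapq.heappop(self._heap)
            callback()
            fired += 1
        return fired
```

Insert and pop are O(log n) and finding the next timer is O(1), which
is what matters once every connection owns a timer.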
Jan Friesse [Tue, 27 Oct 2020 15:55:23 +0000 (16:55 +0100)]
qnet: Add support for keep active partition vote
This patch adds the qdevice-net part of the keep active partition tie
breaker functionality. It's enabled by default.
When a tie happens, prefer the partition with members of the
previously active (quorate) partition. This is hard-coded
behavior of the LMS algorithm, so this setting affects only
the FFSplit algorithm. By default it is disabled for backwards
compatibility.
This solves a problem with FFSplit when node A (with the lowest id) is
killed, node B gets the vote, and then node A starts up, creates a single
node membership and gets the vote.
Jan Friesse [Tue, 22 Sep 2020 11:31:24 +0000 (13:31 +0200)]
qnetd: Add support for keep active partition vote
When a tie happens, prefer the partition with members of the
previously active (quorate) partition. This is hard-coded
behavior of the LMS algorithm, so this setting affects only
the FFSplit algorithm. By default it is disabled for backwards
compatibility.
This solves a problem with FFSplit when node A (with the lowest id) is
killed, node B gets the vote, and then node A starts up, creates a single
node membership and gets the vote.
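The tie breaker can be pictured with the following Python sketch. It
is not the FFSplit implementation: the function name, the set based
representation of partitions, and the lowest node id fallback are
assumptions chosen to mirror the node A / node B scenario above.

```python
def pick_partition_on_tie(partitions, previously_active,
                          keep_active_partition):
    """Choose one of several equally sized partitions (sets of node ids)."""
    if keep_active_partition:
        # Prefer the partition that contains members of the partition
        # which was active (quorate) before the tie.
        preferred = [p for p in partitions if p & previously_active]
        if len(preferred) == 1:
            return preferred[0]
    # Assumed fallback: the partition holding the lowest node id wins.
    return min(partitions, key=min)
```

With node A as id 1 and node B as id 2, and B holding the vote, the
option keeps the vote with B when A comes back alone; without it, the
assumed fallback hands the vote to A.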
Jan Friesse [Tue, 15 Sep 2020 13:10:53 +0000 (15:10 +0200)]
qnetd: Fix dpd timer
With the default config of running the dpd timer every 10 seconds, waiting
for 2 * client_timeout to clear the message received flag and then waiting
another 2 * client_timeout without a message received, it was possible
that a client was marked as dead only after more than 40 seconds, causing
qdevice to stop sending the votequorum heartbeat for too long, so corosync
lost votes from qdevice.
This patch is a simpler solution which just changes the default dpd timer
to 1 second and the timeout to 1.2 * client_timeout.
Jan Friesse [Mon, 31 Aug 2020 08:08:45 +0000 (10:08 +0200)]
pr-poll-loop: Fix set_events_cb return code
When events was set to 0 and set_events returned -2, the return code was
changed to -1. The solution is to change the return code to -1 for events
equal to 0 only if the return code was 0.
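The fixed rule amounts to the small function below, written in Python
for illustration only. The function and parameter names are
hypothetical, and the meaning of the individual return codes is not
restated here.

```python
def effective_return_code(cb_ret, events):
    # Only a successful callback (0) with no requested events is turned
    # into -1; any other return code, such as -2, is passed through.
    if cb_ret == 0 and events == 0:
        return -1
    return cb_ret
```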
Jan Friesse [Thu, 20 Aug 2020 15:33:53 +0000 (17:33 +0200)]
heuristics: Remove qdevice instance pointer
Heuristics is designed to be a component of its own, which doesn't depend
on qdevice_instance. Removing the qdevice_instance pointer was easy once
the exec notifier got two user data pointers.
Jan Friesse [Tue, 14 Jul 2020 13:38:12 +0000 (15:38 +0200)]
build: Use git-version-gen during specfile build
Instead of copying parts of git-version-gen for the spec target, use
git-version-gen directly, parse the final version into components
(rpmver, alphatag, numcomm) and use them.
The main reason is to simplify the code a bit (the sed scripts are a bit
repetitive though), reuse the code and also allow building an RPM from a
dist tarball generated from a non-tagged commit or a dirty git tree (not
very useful).
The code relies on the fact that a hyphen is never used in a tagged
release name.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Tue, 19 Mar 2019 14:16:11 +0000 (15:16 +0100)]
qnetd: Check existence of NSS DB dir before fork
Previously, when a user tried to start corosync-qnetd without an
initialized NSS database, a generic (not very helpful
and misleading) NSS error was logged:
"NSS error (-8015): The certificate/key database is in an old,
unsupported format.".
The solution is to check if it's possible to open the NSS DB directory and
display the (usually much more informative) result of the strerror
function. This check is called before fork, so the init system can return
an error code during start.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
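The check described in this commit can be sketched as follows, in
Python rather than the daemon's C. The function name and the message
wording are hypothetical; the point is only that the directory is
opened early and the strerror text is reported.

```python
import os


def nss_db_dir_error(path):
    """Return None if the directory can be opened, else a readable error."""
    try:
        os.listdir(path)
    except OSError as e:
        # strerror gives e.g. "No such file or directory", which is far
        # more useful than a generic NSS error code.
        return f"Can't open NSS DB directory {path}: {os.strerror(e.errno)}"
    return None
```

Running such a check before daemonizing lets the start command itself
fail with a non-zero code instead of the error surfacing only in the
log.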
Jan Friesse [Wed, 14 Nov 2018 16:52:11 +0000 (17:52 +0100)]
init: Fix init scripts to work with containers
Previously the init scripts were not using a pid file, so pidof was used.
This is usually not a problem, but when containers are used it may result
in killing the wrong instance when issued on the host.
The solution is to always use the pidfile.
Also try to use LSB compliant status codes.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
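As an illustration of a pidfile based status check with LSB status
codes (0 running, 1 dead but pid file exists, 3 not running,
4 unknown), here is a Python sketch. The init scripts themselves are
shell; the function and constant names below are hypothetical.

```python
import os

LSB_RUNNING = 0
LSB_DEAD_PIDFILE_EXISTS = 1
LSB_NOT_RUNNING = 3
LSB_UNKNOWN = 4


def status_from_pidfile(pidfile):
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return LSB_NOT_RUNNING
    except (OSError, ValueError):
        return LSB_UNKNOWN
    if pid <= 0:
        return LSB_UNKNOWN
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return LSB_DEAD_PIDFILE_EXISTS
    except PermissionError:
        return LSB_RUNNING  # process exists but belongs to another user
    return LSB_RUNNING
```

Because the pid comes from the instance's own pid file, a daemon with
the same name running inside a container is never picked up by
mistake, unlike with a lookup by process name.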
Jan Friesse [Wed, 7 Nov 2018 14:16:13 +0000 (09:16 -0500)]
configure: move to AC_COMPILE_IFELSE
from AC_PREPROC_IFELSE, which is strongly discouraged.
Our detection system was very weak, and recent versions of clang
showed that AC_PREPROC_IFELSE (cpp) would enable warning options that
the compiler (clang) does not support.
Use a full compilation test to detect what works and what doesn't.
Jan Friesse [Mon, 3 Sep 2018 15:01:21 +0000 (17:01 +0200)]
build: Support for git archive stored tags
Attempt to solve the problem with git archive generated tarballs
(used for example by github when a release is downloaded), which are no
longer a git tree and (in contrast to officially released tarballs) also
don't contain a .tarball-version file, so the git-version-gen script
simply cannot obtain valid version info.
The solution is based on using gitattributes, which instructs git to
replace a string in the .gitarchivever file with known ref names.
git-version-gen is enhanced to support this file and tries to parse
any string which looks like "tag: v[0-9]+.[0-9]+.[0-9]". If such a string
is found, it's used as the version. This file is used as a last attempt;
other methods (.tarball-version, git abbrev) have precedence.
Based on an idea stated by Jan Pokorný <jpokorny@redhat.com>.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
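The tag lookup can be illustrated with a short Python sketch.
git-version-gen itself is a shell script, so this is not its code. The
function name is hypothetical, the sample input assumes the file holds
a substituted ref names list, and the regular expression here escapes
the dots and accepts a multi-digit last component, which is slightly
stricter than the pattern quoted above. Returning the version without
the leading "v" is also an assumption.

```python
import re

TAG_RE = re.compile(r"tag: v([0-9]+\.[0-9]+\.[0-9]+)")


def version_from_ref_names(text):
    """Return the first version found in a ref names string, or None."""
    match = TAG_RE.search(text)
    return match.group(1) if match else None
```

When the file was never substituted (for example in a plain git
checkout) no tag is found and the caller falls through to the other
methods.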
Jan Friesse [Mon, 27 Aug 2018 15:07:53 +0000 (17:07 +0200)]
qdevice: Propagate error to exit code
The net model never returned an error when qdevice_model_run was called.
This was incorrect because, with the exception of a local ipc close, all
other disconnect reasons are errors.
The solution is to return a proper error code.
Also, instead of exiting right after qdevice_model_run, it's better to
store the result value, try to clean up resources and use the stored value
to return the correct exit code.
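The store, clean up, then exit pattern looks roughly like this in
Python. The callables stand in for qdevice_model_run and the cleanup
steps; the function name and the mapping of any non-zero result to
exit code 1 are assumptions made for the example.

```python
def run_and_exit_code(model_run, cleanup):
    # Keep the result instead of exiting immediately.
    result = model_run()
    # Resources are released on both the success and the error path.
    cleanup()
    # Only now is the stored result turned into the process exit code.
    return 0 if result == 0 else 1
```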
Jan Friesse [Tue, 21 Aug 2018 08:01:26 +0000 (10:01 +0200)]
init: disable stderr output in systemd unit file
Usually both syslog and stderr are enabled, so log entries are duplicated
in the journal. The solution is to use a patch similar to corosync commit
c34208ad402b45f52b5d3ee8d2a08df0779ec9aa and disable stderr.
Jan Friesse [Wed, 8 Aug 2018 13:20:02 +0000 (09:20 -0400)]
qdevice-net-certutil: Implement scp wrapper
Standard scp doesn't handle copying a file from one remote machine to
another very well when agent forwarding is used and no key exists
between the machines.