[stats] allow knet_handle_get_stats to operate in a readlock context
- add global stat mutex lock to protect stats updates
- use global stat mutex lock across all the threads
- fix up some minor bugs:
- update RX crypto stats only when crypto is enabled
- update compress and crypto stats in a consistent fashion
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Tue, 25 Feb 2020 14:09:19 +0000 (15:09 +0100)]
[man] Enhance prio description of POLICY_PASSIVE
Some users found description of POLICY_PASSIVE priority confusing
(probably because "priority" word is too overloaded) so add
some redundancy to make description unambiguous.
This fixes most of the remaining covscan errors in doxyxml.c.
The ones that remain are caused by malloced structures being
stored in qb_hashtable_maps.
These still cause unfreed memory, because the contents of the maps
are never explictly freed, but as they are used until the very end of
the program (when the OS will free everything) I'm dubious as to
whether it's worth doing it in the code - or whether covscan will
work out what's going on anyway.
[sctp] major surgery to use only SCTP events to determine socket status
- drop concept of on_connected_epoll to determine if socket is ready or not
- provide much better debugging output at all levels
- incorporate fix from Xin Long <lxin@redhat.com> to gather socket status
at the right time
- deal with a recent kernel change on SCTP socket that broke knet (from rhel7):
[net] sctp: allow delivering notifications after receiving SHUTDOWN
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
When sending a message to multiple links, if one of those links
is not connection-oriented then msg_name & msg_namelen would be cleared,
thus breaking the send to any subsequent non-connection-oriented links.
So now, if we need to clear out msg_name & msg_namelen, we take a copy of the
msghdr and edit that instead,
If recvmsg() returns 0 for EOF then it's going to do so
until the error is rectified or read with getsockopt(). But
the _recvmmsg() wrapper keeps reading until the vector is full
thus returning a block of 512 EOF messages all of which the caller
has to plough through.
This patch causes _recvmmsg() to return as soon as it has got
the first EOF so the the caller can deal with it in good time
and not spin looking at the same thing over and over again.
I've also fixed a couple of typos in related comments
This fix for the ENETUNREACH problem works better than the last one
in that it also works with Linux kernels > 5.0.0 (which return
-ENETUNREACH) if an interfaces is brought down, and also on FreeBSD
which returns ENETDOWN.
[udp] don't make socket spin if a network I/F is down
UDP treats ENETUNREACH as a temporary error and just retries,
but this causes the TX thread to spin just doing sendto() therefore
blocking all other traffic.
(To reproduce this try starting corosync with 2 links configured in
corosync.conf but only one of them configured to the 'right' address
- it will spin in a tight loop and need to be killed with -9)
[RX] handle short write to the application properly
this change affects only applications that are not using knet
generated socketpairs to deliver/receive data to/from knet.
If an application uses a fd that is not SOCK_SEQPACKET (basically
streaming), we have to handle short writes accordingly, and knet
will continue delivering as long as there is progress.
The application is responsible to verify that the data packet
is complete as the delivery is not guaranteed to be complete.
The application can either embed the size of the packet in their
data structure or use the socket error notification callback
that will be invoked in case of errors or 0 data delivery.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
[PMTUd] invalidate MTU for a link if the value is lower than minimum
Under heavy network load and packet loss, calculated MTU can be
too small. In that case we need to invalidate the link mtu,
that would remove the link from the rotation (and traffic) and
would give PMTUd time to get the right MTU in the next round.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
- let's assume a 2 nodes (A and B) cluster setup
- node A sends fragmented packets to node B and there is
packet loss on the network.
- node B receives all those fragments and attempts to
reassemble them.
- node A sends packet seq_num X in Y fragments.
- node B receives only part of the fragments and stores
them in a defrag buf.
- packet loss stops.
- node A continues to send packets and a seq_num
roll-over takes place.
- node A sends a new packet seq_num X in Y fragments.
- node B gets confused here because the parts of the old
packet seq_num X are still stored and the buffer
has not been reclaimed.
- node B continues to rebuild packet seq_num X with
old stale data and new data from after the roll-over.
- node B completes reassembling the packet and delivers
junk to the application.
The solution:
Add a much stronger buffer reclaim logic that will apply
on each received packet and not only when defrag buffers
are needed, as there might be a mix of fragmented and not
fragmented packets in-flight.
The new logic creates a window of N packets that can be
handled at the same time (based on the number of buffers)
and clear everything else.
doxyxml: print_param: fix heap-buffer-overflow on read
in read_struct we can get the pi->paramtype assigned with:
> pi->paramtype = type?strdup(type):strdup("");
And in print_param we then always check the last character by getting
the strlen and subtracting one. But in the case where either type was
NULL and we assigned an empty string, or type wasn't null but
pointing to an empty string we ran into an read-heap-buffer-overflow
as here strlen is zero, and so we the first if branch evaluated to
> if (pi->paramtype[-1] == '*') {
which isn't valid. Depending on the OS, protection of surrounding
area due to said OS or the compiler, this can crash the program.
Similar issue was the case for the next check for double pointers,
here for all strings with strlen < 2.
To solve this get the strlen early and check if we cannot underflow
before doing the real read.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
the index would overflow the buffer and overwrite data in the link
structure. Depending on what was written the cluster could fall
apart in many ways, from crashing, to hung.
Fixes: https://github.com/kronosnet/kronosnet/issues/255
thanks to the proxmox developers and community for reporting the issue
and for all the help reproducing / debugging the problem.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
[links] stabilize latency calculation when nodes are not responsive
The following scenario is more of a corner case than normal, but
this change allows to better deal with this situation:
1) 2 nodes cluster (corosync) (node A and node B)
2) kill -stop $(pidof corosync) on node A
3) node B will continue to send ping packets to node A
4) node A is accumulating those ping packets in the kernel network socket
5) wait some seconds and unpause node A
6) node A will start processing the ping packets in the queue
and send pong replies to node B
7) node B will see an extreme increase of latency due
those "obsoleted" ping/pong packets
8) node B, as latency increases, will take longer and longer
to notice that node A is down due to the pong_timeout adjustment
for latency (required for initial cluster spike).
the solution:
1) Use average latency to calculate pong_timeout_adj vs latency_max.
Averate latency will go down again in time, while latency_max is never
reset.
2) RX thread will filter out all pong packets that have higher latency
than currently configure pong_timeout. This barrier should have
been in place even before.
this solution reduces the latency spike on node B to a perfectly
reasonable level and it will all eventually stabilize over time
as latency samples increase and latency will reduce.
Please be aware that using a pong_timeout smaller than latency will
simply mark the link down now.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Jan Friesse [Mon, 2 Sep 2019 11:56:34 +0000 (13:56 +0200)]
[handle] Set thread stack size on create
Musl libc has small stack size for threads. Knet needs ~300KiB (tested
at the time when this patch was created). Glibc seems to use ~8MiB. As a
compromise, 1MiB is used.
Jan Friesse [Mon, 2 Sep 2019 09:11:27 +0000 (11:11 +0200)]
[common] Conditionalize RTLD_DI_ORIGIN
RTLD_DI_ORIGIN is used to get absolute path of plugin. It is used only
for logging useful info and not strictly needed, so use it only when it
is defined (only musl is known to author of the patch)
[PMTUd] add dynamic pong timeout when using crypto
problem originally reported by proxmox community, users
observed that under pressure the MTU would flap back and forth
between 2 values due to other node response timeout.
implement a dynamic timeout multiplier when using crypto that
should solve the problem in a more flexible fashion.
When a timeout hits, those new logs will show:
[knet]: [info] host: host: 1 (passive) best link: 0 (pri: 0)
[knet]: [debug] pmtud: Starting PMTUD for host: 1 link: 0
[knet]: [debug] pmtud: Increasing PMTUd response timeout multiplier to (4) for host 1 link: 0
[knet]: [info] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 65429
[knet]: [debug] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 65429
[knet]: [info] pmtud: Global data MTU changed to: 65429
[knet]: [debug] pmtud: Starting PMTUD for host: 1 link: 0
[knet]: [debug] pmtud: Increasing PMTUd response timeout multiplier to (8) for host 1 link: 0
[knet]: [debug] pmtud: Increasing PMTUd response timeout multiplier to (16) for host 1 link: 0
[knet]: [debug] pmtud: Increasing PMTUd response timeout multiplier to (32) for host 1 link: 0
[knet]: [debug] pmtud: Increasing PMTUd response timeout multiplier to (64) for host 1 link: 0
[knet]: [debug] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 65429
[knet]: [debug] pmtud: Starting PMTUD for host: 1 link: 0
[knet]: [debug] pmtud: Increasing PMTUd response timeout multiplier to (128) for host 1 link: 0
[knet]: [debug] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 65429
and when the latency reduces and it is safe to be more responsive again:
[knet]: [debug] pmtud: Starting PMTUD for host: 1 link: 0
[knet]: [debug] pmtud: Decreasing PMTUd response timeout multiplier to (64) for host 1 link: 0
[knet]: [debug] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 65429
....
testing this patch on normal hosts is a bit challenging tho.
Patch was tested by hardcoding a super low timeout here:
internal changes:
- drop the concept of sec_header_size that was completely wrong
and unnecessary
- bump crypto API to version 3 due to the above change
- clarify the difference between link->proto_overhead and
link->status->proto_overhead. We cannot rename the status
one as it would also change ABI.
- add onwire.c with documentation on the packet format
and what various len(s) mean in context.
- add 3 new functions to calculate MTUs back and forth
and use them around, hopefully with enough clarification
on why things are done in a given way.
- heavily change thread_pmtud.c to use those new facilities.
- fix major calculation issues when using crypto (non-crypto
was not affected by the problem).
- fix checks around to make sure they match the new math.
- fix padding calculation.
- add functional PMTUd crypto test
this test can take several hours (12+) and should be executed
on a controlled environment since it automatically changes
loopback MTU to run tests.
- fix way the lowest MTU is calculated during a PMTUd run
to avoid spurious double notifications.
- drop redundant checks.
user visible changes:
- Global MTU is now calculated properly when using crypto
and values will be in general bigger than before due
to incorrect padding calculation in the previous implementation.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
allocate on stack only once and make sure strings are null terminated
drop useless read loop since log msg are always smaller than PAGE_SIZE
and read are atomic at that level
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>