Numerous DMA timeouts when downloading data from external sources

After solving a previous networking issue on the control board, I’m now running into another kind of networking trouble: When I try to download a large amount of data (here it’s a container image more than 435 MB in size, which I need urgently), everything seems to work fine for some ten minutes, but then the network connection suddenly dies and I get a bunch of DMA timeouts on the control board console:

[ 1532.020301] miop 0000:06:00.0: DMA timeout, restart DMA controller.
[ 1533.059083] miop 0000:06:00.0: DMA timeout, restart DMA controller.
[ 1534.099126] miop 0000:06:00.0: DMA timeout, restart DMA controller.
[ 1537.620021] miop 0000:06:00.0: DMA timeout, restart DMA controller.
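
For extra context, these are the generic checks I can run around the timeouts (nothing miop-specific; eth0 is just an assumption for the board’s uplink interface):

dmesg | grep -iE 'miop|dma|pcie' | tail -n 50
# Interface error counters; the exact counter names depend on the driver
ethtool -S eth0 | grep -iE 'err|drop|timeout'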

And this is what I get on the console of the node the container is for (abridged, irrelevant messages removed):

mixtile@blade3n1:~$ sudo cephadm bootstrap --mon-ip $MYIP
[…]
Host looks OK
Cluster fsid: 89e84929-c54d-11f0-9c36-0e281c59af8c
Verifying IP 10.20.0.13 port 3300 ...
Verifying IP 10.20.0.13 port 6789 ...
Mon IP `10.20.0.13` is in CIDR network `10.20.0.0/24`
Mon IP `10.20.0.13` is in CIDR network `10.20.0.0/24`
Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network
Pulling container image quay.io/ceph/ceph:v19...
Non-zero exit code 1 from /usr/bin/docker pull quay.io/ceph/ceph:v19
/usr/bin/docker: stdout v19: Pulling from ceph/ceph
/usr/bin/docker: stdout 4d1c9f96c84f: Pulling fs layer
/usr/bin/docker: stdout d0eb5eb08af7: Pulling fs layer
/usr/bin/docker: stdout d0eb5eb08af7: Waiting
[…]
/usr/bin/docker: stdout 4d1c9f96c84f: Retrying in 5 seconds
/usr/bin/docker: stdout 4d1c9f96c84f: Retrying in 4 seconds
/usr/bin/docker: stdout 4d1c9f96c84f: Retrying in 3 seconds
/usr/bin/docker: stdout 4d1c9f96c84f: Retrying in 2 seconds
/usr/bin/docker: stdout 4d1c9f96c84f: Retrying in 1 second
/usr/bin/docker: stderr read tcp 10.20.0.13:40398->212.6.82.160:443: read: connection reset by peer
Error: Failed command: /usr/bin/docker pull quay.io/ceph/ceph:v19

        ***************
        Cephadm hit an issue during cluster installation. Current cluster files will be deleted automatically.
        […]
        ***************

Deleting cluster with fsid: 89e84929-c54d-11f0-9c36-0e281c59af8c
ERROR: Failed command: /usr/bin/docker pull quay.io/ceph/ceph:v19
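
To narrow it down, the pull can also be repeated outside cephadm, and the proxy path checked directly with curl (10.20.0.1:3128 being the Squid proxy on the control board, as configured below):

sudo docker pull quay.io/ceph/ceph:v19
# Rough check of the proxy path itself; should print an HTTP status code
curl -x http://10.20.0.1:3128 -sS -o /dev/null -w '%{http_code}\n' https://quay.io/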

I have already improved the config of the Squid proxy on the control board by adding extra cache space:

# Uncomment and adjust the following to add a disk cache directory.
cache_dir aufs /var/cache/squid 8000 16 256
cache_mem 180 MB

# Maximum object size
maximum_object_size 1024 MB
maximum_object_size_in_memory 16 MB

#  How objects stored in cache are replaced
cache_replacement_policy heap LFUDA
memory_replacement_policy heap GDSF
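
In case it helps: after editing squid.conf like this, a quick way to check the changed config and apply it without a full restart (standard Squid switches, nothing board-specific) is:

# Check the edited config for syntax errors
sudo squid -k parse
# Tell the running Squid to re-read squid.conf
sudo squid -k reconfigure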

Apparently, the control board somehow gets overwhelmed by the network traffic, but I’m not sure whether the problem lies on the internal network link or the external one. At least the board has more than enough RAM. Here is what netdata says:

What’s wrong here? Any idea how to solve it, or at least work around it?

UPDATE:
After another try, not only did the cluster’s network die, but apparently the control board crashed and took the nodes down with it! See the output of uptime just a minute after the crash:

mixtile@ClusterBox:~$ client_loop: send disconnect: Broken pipe
╭─jacek@epica ~  
╰─➤  ssh -l mixtile ClusterBox                                            255 ↵
Enter passphrase for key '/home/jacek/.ssh/id_ed25519': 
mixtile@ClusterBox:~$ uptime
 16:06:32 up 1 min,  load average: 1.42, 0.55, 0.19
mixtile@ClusterBox:~$ 

Why does a crash of the control board also take down the nodes? This must not happen!

OK, it turned out to be two distinct errors:

Error #1: For whatever reason, opkg does not create the directories Squid stores its disk cache and logs in. You need to create them yourself and then transfer ownership to the squid user and group:

sudo mkdir -p /var/cache/squid
sudo mkdir -p /var/log/squid
sudo mkdir -p /var/run/squid
sudo chown -R squid /var/cache/squid
sudo chown -R squid /var/log/squid
sudo chown -R squid /var/run/squid
sudo chgrp -R squid /var/cache/squid
sudo chgrp -R squid /var/log/squid
sudo chgrp -R squid /var/run/squid
sudo /etc/init.d/squid restart

After doing this, Squid creates its disk cache, which will accommodate everything the board itself and the nodes download via HTTP(S).
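
In my case the init script took care of creating the cache layout; if it doesn’t on your build, Squid can initialize the cache_dir structure by hand with the standard -z switch (run it while Squid is stopped):

sudo /etc/init.d/squid stop
# Create the swap directory layout for every cache_dir in squid.conf
sudo squid -z
sudo /etc/init.d/squid start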

Error #2: I had already added the necessary proxy entries to /etc/environment and the apt config, but Docker relies on its own proxy settings! So I had to add a dedicated proxy config file for the Docker daemon:

mixtile@blade3n1:/usr/lib/systemd/system$ sudo mkdir -p /etc/systemd/system/docker.service.d
mixtile@blade3n1:/usr/lib/systemd/system$ sudo nano /etc/systemd/system/docker.service.d/http-proxy.conf

[Service]
Environment="HTTP_PROXY=http://10.20.0.1:3128"
Environment="HTTPS_PROXY=http://10.20.0.1:3128"

After that, pulling large containers worked flawlessly. :slight_smile: