operations-guide/doc/openstack-ops/ch_ops_network_troubleshooting.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter [
<!-- Some useful entities borrowed from HTML -->
<!ENTITY ndash  "&#x2013;">
<!ENTITY mdash  "&#x2014;">
<!ENTITY hellip "&#x2026;">
<!ENTITY plusmn "&#xB1;">
]>
<chapter xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
    xml:id="network_troubleshooting">
    <?dbhtml stop-chunking?>
    <title>Network Troubleshooting</title>
    <para>Network troubleshooting can unfortunately be a very
        difficult and confusing procedure. A network issue can cause a
        problem at several points in the cloud. Using a logical
        troubleshooting procedure can help mitigate the confusion and
        more quickly isolate where exactly the network issue is. This
        chapter aims to give you the information you need to make
        yours.</para>
    <section xml:id="check_interface_states">
        <title>Using "ip a" to Check Interface States</title>
        <para>On compute nodes and nodes running nova-network, use the
            following command to see information about interfaces,
            including information about IPs, VLANs, and whether your
            interfaces are up.</para>
        <programlisting># ip a</programlisting>
        <para>If you're encountering any sort of networking
            difficulty, one good initial sanity check is to make sure
            that your interfaces are up. For example:</para>
        <programlisting>$ ip a | grep state
1: lo: &lt;LOOPBACK,UP,LOWER_UP&gt; mtu 16436 qdisc noqueue state UNKNOWN
2: eth0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc pfifo_fast state UP qlen 1000
3: eth1: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc pfifo_fast master br100 state UP qlen 1000
4: virbr0: &lt;NO-CARRIER,BROADCAST,MULTICAST,UP&gt; mtu 1500 qdisc noqueue state DOWN
6: br100: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP</programlisting>
        <para>You can safely ignore the state of virbr0, which is a
            default bridge created by libvirt and not used by
            OpenStack.</para>
    </section>
    <section xml:id="network_traffic_in_cloud">
        <title>Network Traffic in the Cloud</title>
        <para>If you are logged in to an instance and ping an external
            host, for example google.com, the ping packet takes the
            following route:</para>
        <informalfigure>
            <mediaobject>
                <imageobject>
                    <imagedata width="5in"
                        fileref="figures/network_packet_ping.png"/>
                </imageobject>
            </mediaobject>
        </informalfigure>
        <orderedlist>
            <listitem>
                <para>The instance generates a packet and places it on
                    the virtual NIC inside the instance, such as,
                    eth0.</para>
            </listitem>
            <listitem>
                <para>The packet transfers to the virtual NIC of the
                    compute host, such as, vnet1. You can find out
                    what vent NIC is being used by looking at the
                    /etc/libvirt/qemu/instance-xxxxxxxx.xml file.
                </para>
            </listitem>
            <listitem>
                <para>From the vnet NIC, the packet transfers to a
                    bridge on the compute node, such as,
                        <code>br100.</code>
                </para>
                <para>If you run FlatDHCPManager, one bridge is on
                    the compute node. If you run VlanManager, one
                    bridge exists for each VLAN.</para>
                <para>To see which bridge the packet will use, run the
                    command:
                    <programlisting><prompt>$</prompt> brctl show</programlisting>
                </para>
                <para>Look for the vnet NIC. You can also reference
                    nova.conf and look for the flat_interface_bridge
                    option.</para>
            </listitem>
            <listitem>

                <para>The packet transfers to the main NIC of the
                    compute node. You can also see this NIC in the
                    brctl output, or you can find it by referencing
                    the flat_interface option in nova.conf.</para>

            </listitem>
            <listitem>

                <para>After the packet is on this NIC, it transfers to
                    the compute node's default gateway. The packet is
                    now most likely out of your control at this point.
                    The diagram depicts an external gateway. However,
                    in the default configuration with multi-host, the
                    compute host is the gateway.</para>

            </listitem>
        </orderedlist>
        <para>Reverse the direction to see the path of a ping
            reply.</para>
        <para>From this path, you can see that a single packet travels
            across four different NICs. If a problem occurs with any
            of these NICs, a network issue occurs.</para>
    </section>
    <section xml:id="failure_in_path">
        <title>Finding a Failure in the Path</title>
        <para>Use ping to quickly find where a failure exists in the
            network path. In an instance, first see if you can ping an
            external host, such as google.com. If you can, then there
            shouldn't be a network problem at all.</para>
        <para>If you can't, try pinging the IP address of the compute
            node where the instance is hosted. If you can ping this
            IP, then the problem is somewhere between the compute node
            and that compute node's gateway.</para>
        <para>If you can't ping the IP address of the compute node,
            the problem is between the instance and the compute node.
            This includes the bridge connecting the compute node's
            main NIC with the vnet NIC of the instance.</para>
        <para>One last test is to launch a second instance and see if
            the two instances can ping each other. If they can, the
            issue might be related to the firewall on the compute
            node.</para>
    </section>
    <section xml:id="tcpdump">
        <title>tcpdump</title>
        <para>One great, although very in-depth, way of
            troubleshooting network issues is to use tcpdump. It's
            recommended to use tcpdump at several points along the
            network path to correlate where a problem might be. If you
            prefer working with a GUI, either live or by using a
            tcpdump capture do also check out <link
                xlink:title="Wireshark"
                xlink:href="http://www.wireshark.org/"
                >Wireshark</link> (http://www.wireshark.org/).</para>
        <para>For example, run the following command:</para>
        <para>
            <code>tcpdump -i any -n -v 'icmp[icmptype] =
                icmp-echoreply or icmp[icmptype] = icmp-echo'</code>
        </para>
        <para>Run this on the command line of the following
            areas:</para>
        <orderedlist>
            <listitem>
                <para>An external server outside of the cloud.</para>
            </listitem>
            <listitem>
                <para>A compute node.</para>
            </listitem>
            <listitem>
                <para>An instance running on that compute node.</para>
            </listitem>
        </orderedlist>
        <para>In this example, these locations have the following IP
            addresses:</para>
        <remark>DWC: Check formatting of the following:</remark>
        <programlisting>
                          Instance
                          10.0.2.24
                          203.0.113.30
                          Compute Node
                          10.0.0.42
                          203.0.113.34
                          External Server
                          1.2.3.4
                        </programlisting>
        <para>Next, open a new shell to the instance and then ping the
            external host where tcpdump is running. If the network
            path to the external server and back is fully functional,
            you see something like the following:</para>
        <para>On the external server:</para>
        <programlisting>12:51:42.020227 IP (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    203.0.113.30 &gt; 1.2.3.4: ICMP echo request, id 24895, seq 1, length 64
12:51:42.020255 IP (tos 0x0, ttl 64, id 8137, offset 0, flags [none], proto ICMP (1), length 84)
    1.2.3.4 &gt; 203.0.113.30: ICMP echo reply, id 24895, seq 1, length 64</programlisting>
        <para>On the Compute Node:</para>
        <programlisting>12:51:42.019519 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.2.24 &gt; 1.2.3.4: ICMP echo request, id 24895, seq 1, length 64
12:51:42.019519 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.2.24 &gt; 1.2.3.4: ICMP echo request, id 24895, seq 1, length 64
12:51:42.019545 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    203.0.113.30 &gt; 1.2.3.4: ICMP echo request, id 24895, seq 1, length 64
12:51:42.019780 IP (tos 0x0, ttl 62, id 8137, offset 0, flags [none], proto ICMP (1), length 84)
    1.2.3.4 &gt; 203.0.113.30: ICMP echo reply, id 24895, seq 1, length 64
12:51:42.019801 IP (tos 0x0, ttl 61, id 8137, offset 0, flags [none], proto ICMP (1), length 84)
    1.2.3.4 &gt; 10.0.2.24: ICMP echo reply, id 24895, seq 1, length 64
12:51:42.019807 IP (tos 0x0, ttl 61, id 8137, offset 0, flags [none], proto ICMP (1), length 84)
    1.2.3.4 &gt; 10.0.2.24: ICMP echo reply, id 24895, seq 1, length 64</programlisting>
        <para>On the Instance:</para>
        <programlisting>12:51:42.020974 IP (tos 0x0, ttl 61, id 8137, offset 0, flags [none], proto ICMP (1), length 84)
 1.2.3.4 &gt; 10.0.2.24: ICMP echo reply, id 24895, seq 1, length 64</programlisting>
        <para>Here, the external server received the ping request and
            sent a ping reply. On the compute node, you can see that
            both the ping and ping reply successfully passed through.
            You might also see duplicate packets on the compute node,
            as seen above, because tcpdump captured the packet on both
            the bridge and outgoing interface.</para>
    </section>
    <section xml:id="iptables">
        <title>iptables</title>
        <para>Nova automatically manages iptables, including
            forwarding packets to and from instances on a compute
            node, forwarding floating IP traffic, and managing
            security group rules.</para>
        <para>Run the following command to view the current iptables
            configuration:</para>
        <programlisting># iptables-save</programlisting>
        <note><para>If you modify the
            configuration, it reverts the next time you restart
            nova-network. You must use OpenStack to manage
            iptables.</para></note>
    </section>
    <section xml:id="network_config_database">
        <title>Network Configuration in the Database</title>
        <para>The nova database table contains a few tables with
            networking information:</para>
        <itemizedlist>
            <listitem>
                <para>fixed_ips: contains each possible IP address
                    for the subnet(s) added to Nova. This table is
                    related to the instances table by way of the
                    fixed_ips.instance_uuid column.</para>
            </listitem>
            <listitem>
                <para>floating_ips: contains each floating IP address
                    that was added to nova. This table is related to
                    the fixed_ips table by way of the
                    floating_ips.fixed_ip_id column.</para>
            </listitem>
            <listitem>
                <para>instances: not entirely network specific, but
                    it contains information about the instance that is
                    utilizing the fixed_ip and optional
                    floating_ip.</para>
            </listitem>
        </itemizedlist>
        <para>From these tables, you can see that a Floating IP is
            technically never directly related to an instance, it must
            always go through a Fixed IP.</para>
        <section xml:id="deassociate_floating_ip">
            <title>Manually De-Associating a Floating IP</title>
            <para>Sometimes an instance is terminated but the Floating
                IP was not correctly de-associated from that instance.
                Because the database is in an inconsistent state, the
                usual tools to de-associate the IP no longer work. To
                fix this, you must manually update the
                database.</para>
            <para>First, find the UUID of the instance in
                question:</para>
            <programlisting>mysql&gt; select uuid from instances where hostname = 'hostname';</programlisting>
            <para>Next, find the Fixed IP entry for that UUID:</para>
            <programlisting>mysql&gt; select * from fixed_ips where instance_uuid = '&lt;uuid&gt;';</programlisting>
            <para>You can now get the related Floating IP
                entry:</para>
            <programlisting>mysql&gt; select * from floating_ips where fixed_ip_id = '&lt;fixed_ip_id&gt;';</programlisting>
            <para>And finally, you can de-associate the Floating
                IP:</para>
            <programlisting>mysql&gt; update floating_ips set fixed_ip_id = NULL, host = NULL where fixed_ip_id = '&lt;fixed_ip_id&gt;';</programlisting>
            <para>You can optionally also de-allocate the IP from the
                user's pool:</para>
            <programlisting>mysql&gt; update floating_ips set project_id = NULL where fixed_ip_id = '&lt;fixed_ip_id&gt;';</programlisting>
        </section>
    </section>
    <section xml:id="debug_dhcp_issues">
        <title>Debugging DHCP Issues</title>
        <para>One common networking problem is that an instance boots
            successfully but is not reachable because it failed to
            obtain an IP address from dnsmasq, which is the DHCP
            server that is launched by the nova-network
            service.</para>
        <para>The simplest way to identify that this the problem with
            your instance is to look at the console output of your
            instance. If DHCP failed, you can retrieve the console log
            by doing:</para>
        <programlisting>$ nova console-log &lt;instance name or uuid&gt;</programlisting>
        <para>If your instance failed to obtain an IP through DHCP,
            some messages should appear in the console. For example,
            for the Cirros image, you see output that looks
            like:</para>
        <programlisting>udhcpc (v1.17.2) started
Sending discover...
Sending discover...
Sending discover...
No lease, forking to background
starting DHCP forEthernet interface eth0 [ [1;32mOK[0;39m ]
cloud-setup: checking http://169.254.169.254/2009-04-04/meta-data/instance-id
wget: can't connect to remote host (169.254.169.254): Network is unreachable</programlisting>
        <para>After you establish that the instance booted properly,
            the task is to figure out where the failure is.</para>
        <para>A DHCP problem might be caused by a misbehaving dnsmasq
            process. First, debug by checking logs and then
            restart the dnsmasq processes only for that project
            (tenant). In VLAN mode there is a dnsmasq process for each
            tenant. Once you have restarted targeted dnsmasq
            processes, the simplest way to rule out dnsmasq causes is
            to kill all of the dnsmasq processes on the machine, and
            restart nova-network. As a last resort, do this as
            root:</para>
        <programlisting># killall dnsmasq
# restart nova-network</programlisting>
        <note><para>It's openstack-nova-network on RHEL/CentOS/Fedora but nova-network on Ubuntu/Debian.</para></note>
        <para>Several minutes after nova-network is restarted, you
            should see new dnsmasq processes running:</para>
        <programlisting># ps aux | grep dnsmasq
nobody 3735 0.0 0.0 27540 1044 ? S 15:40 0:00 /usr/sbin/dnsmasq --strict-order --bind-interfaces --conf-file=
    --domain=novalocal --pid-file=/var/lib/nova/networks/nova-br100.pid --listen-address=192.168.100.1
    --except-interface=lo --dhcp-range=set:'novanetwork',192.168.100.2,static,120s --dhcp-lease-max=256
    --dhcp-hostsfile=/var/lib/nova/networks/nova-br100.conf --dhcp-script=/usr/bin/nova-dhcpbridge --leasefile-ro
root 3736 0.0 0.0 27512 444 ? S 15:40 0:00 /usr/sbin/dnsmasq --strict-order --bind-interfaces --conf-file=
     --domain=novalocal --pid-file=/var/lib/nova/networks/nova-br100.pid --listen-address=192.168.100.1
     --except-interface=lo --dhcp-range=set:'novanetwork',192.168.100.2,static,120s --dhcp-lease-max=256
     --dhcp-hostsfile=/var/lib/nova/networks/nova-br100.conf --dhcp-script=/usr/bin/nova-dhcpbridge --leasefile-ro</programlisting>
        <para>If your instances are still not able to obtain IP
            addresses, the next thing to check is if dnsmasq is seeing
            the DHCP requests from the instance. On the machine that
            is running the dnsmasq process, which is the compute host
            if running in multi-host mode, look at /var/log/syslog to
            see the dnsmasq output. If dnsmasq is seeing the request
            properly and handing out an IP, the output looks
            like:</para>
        <programlisting>Feb 27 22:01:36 mynode dnsmasq-dhcp[2438]: DHCPDISCOVER(br100) fa:16:3e:56:0b:6f
Feb 27 22:01:36 mynode dnsmasq-dhcp[2438]: DHCPOFFER(br100) 192.168.100.3 fa:16:3e:56:0b:6f
Feb 27 22:01:36 mynode dnsmasq-dhcp[2438]: DHCPREQUEST(br100) 192.168.100.3 fa:16:3e:56:0b:6f
Feb 27 22:01:36 mynode dnsmasq-dhcp[2438]: DHCPACK(br100) 192.168.100.3 fa:16:3e:56:0b:6f test</programlisting>
        <para>If you do not see the DHCPDISCOVER, a problem exists
            with the packet getting from the instance to the machine
            running dnsmasq. If you see all of above output and your
            instances are still not able to obtain IP addresses then
            the packet is able to get from the instance to the host
            running dnsmasq, but it is not able to make the return
            trip.</para>
        <para>If you see any other message, such as:</para>
        <programlisting>Feb 27 22:01:36 mynode dnsmasq-dhcp[25435]: DHCPDISCOVER(br100) fa:16:3e:78:44:84 no address available</programlisting>
        <para>Then this may be a dnsmasq and/or nova-network related
            issue. (For the example above, the problem happened to be
            that dnsmasq did not have any more IP addresses to give
            away because there were no more Fixed IPs available in the
            OpenStack Compute database).</para>
        <para>If there's a suspicious-looking dnsmasq log message,
            take a look at the command-line arguments to the dnsmasq
            processes to see if they look correct.</para>
        <programlisting>$ ps aux | grep dnsmasq</programlisting>
        <para>The output looks something like:</para>
        <programlisting>108 1695 0.0 0.0 25972 1000 ? S Feb26 0:00 /usr/sbin/dnsmasq -u libvirt-dnsmasq --strict-order --bind-interfaces
 --pid-file=/var/run/libvirt/network/default.pid --conf-file= --except-interface lo --listen-address 192.168.122.1
 --dhcp-range 192.168.122.2,192.168.122.254 --dhcp-leasefile=/var/lib/libvirt/dnsmasq/default.leases
 --dhcp-lease-max=253 --dhcp-no-override
nobody 2438 0.0 0.0 27540 1096 ? S Feb26 0:00 /usr/sbin/dnsmasq --strict-order --bind-interfaces --conf-file=
 --domain=novalocal --pid-file=/var/lib/nova/networks/nova-br100.pid --listen-address=192.168.100.1
 --except-interface=lo --dhcp-range=set:'novanetwork',192.168.100.2,static,120s --dhcp-lease-max=256
 --dhcp-hostsfile=/var/lib/nova/networks/nova-br100.conf --dhcp-script=/usr/bin/nova-dhcpbridge --leasefile-ro
root 2439 0.0 0.0 27512 472 ? S Feb26 0:00 /usr/sbin/dnsmasq --strict-order --bind-interfaces --conf-file=
 --domain=novalocal --pid-file=/var/lib/nova/networks/nova-br100.pid --listen-address=192.168.100.1
 --except-interface=lo --dhcp-range=set:'novanetwork',192.168.100.2,static,120s --dhcp-lease-max=256
 --dhcp-hostsfile=/var/lib/nova/networks/nova-br100.conf --dhcp-script=/usr/bin/nova-dhcpbridge --leasefile-ro</programlisting>
        <para>If the problem does not seem to be related to dnsmasq
            itself, at this point, use tcpdump on the interfaces to
            determine where the packets are getting lost.</para>
        <para>DHCP traffic uses UDP. The client sends from port 68 to
            port 67 on the server. Try to boot a new instance and then
            systematically listen on the NICs until you identify the
            one that isn't seeing the traffic. To use tcpdump to
            listen to ports 67 and 68 on br100, you would do:</para>
        <programlisting># tcpdump -i br100 -n port 67 or port 68</programlisting>
        <para>You should be doing sanity checks on the interfaces
            using command such as "<code>ip a</code>" and "<code>brctl
                show</code>" to ensure that the interfaces are
            actually up and configured the way that you think that
            they are.</para>
    </section>
    <section xml:id="debugging_dns_issues">
        <title>Debugging DNS Issues</title>
        <para>If you are able to ssh into an instance, but it takes a
            very long time (on the order of a minute) to get a prompt,
            then you might have a DNS issue. The reason a DNS issue
            can cause this problem is that the ssh server does a
            reverse DNS lookup on the IP address that you are
            connecting from. If DNS lookup isn't working on your
            instances, then you must wait for the DNS reverse lookup
            timeout to occur for the ssh login process to
            complete.</para>
        <para>When debugging DNS issues, start by making sure the host
            where the dnsmasq process for that instance runs is able
            to correctly resolve. If the host cannot resolve, then the
            instances won't be able either.</para>
        <para>A quick way to check if DNS is working is to
            resolve a hostname inside your instance using the
                <code>host</code> command. If DNS is working, you
            should see:</para>
        <programlisting>$ host openstack.org
openstack.org has address 174.143.194.225
openstack.org mail is handled by 10 mx1.emailsrvr.com.
openstack.org mail is handled by 20 mx2.emailsrvr.com.</programlisting>
        <para>If you're running the Cirros image, it doesn't have the
            "host" program installed, in which case you can use ping
            to try to access a machine by hostname to see if it
            resolves. If DNS is working, the first line of ping would
            be:</para>
        <programlisting>$ ping openstack.org
PING openstack.org (174.143.194.225): 56 data bytes</programlisting>
        <para>If the instance fails to resolve the hostname, you have
            a DNS problem. For example:</para>
        <programlisting>$ ping openstack.org
ping: bad address 'openstack.org'</programlisting>
        <para>In an OpenStack cloud, the dnsmasq process acts as the
            DNS server for the instances in addition to acting as the
            DHCP server. A misbehaving dnsmasq process may be the
            source of DNS-related issues inside the instance. As
            mentioned in the previous section, the simplest way to
            rule out a misbehaving dnsmasq process is to kill all of
            the dnsmasq processes on the machine, and restart
            nova-network. However, be aware that this command affects
            everyone running instances on this node, including tenants
            that have not seen the issue. As a last resort, as
            root:</para>
        <programlisting># killall dnsmasq
# restart nova-network</programlisting>
        <para>After the dnsmasq processes start again, check if DNS is
            working.</para>
        <para>If restarting the dnsmasq process doesn't fix the issue,
            you might need to use tcpdump to look at the packets to
            trace where the failure is. The DNS server listens on UDP
            port 53. You should see the DNS request on the bridge
            (such as, br100) of your compute node. If you start
            listening with tcpdump on the compute node:</para>
        <programlisting># tcpdump -i br100 -n -v udp port 53
tcpdump: listening on br100, link-type EN10MB (Ethernet), capture size 65535 bytes</programlisting>
        <para>Then, if you ssh into your instance and try to
                <code>ping openstack.org</code>, you should see
            something like:</para>
        <programlisting>16:36:18.807518 IP (tos 0x0, ttl 64, id 56057, offset 0, flags [DF], proto UDP (17), length 59)
 192.168.100.4.54244 &gt; 192.168.100.1.53: 2+ A? openstack.org. (31)
16:36:18.808285 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 75)
 192.168.100.1.53 &gt; 192.168.100.4.54244: 2 1/0/0 openstack.org. A 174.143.194.225 (47)</programlisting>
    </section>


</chapter>