Juniper vMX - Intel X520/X710 vFPC Crashing

UPDATE: Juniper has (unofficially) provided a new copy of the i40e driver. You can try it at your own risk; see here for details.

When a Juniper vMX is used with an Intel X520 or Intel X710 based card and SR-IOV, the vFPC may crash under load. This seems to be a common problem across all versions available at the time of writing; the latest version, 18.3R1, still includes the driver that has the issue. When the issue occurs the vMX is not accessible (except through the vCP management port) because all interfaces on the vFPC are down. Manually restarting the vFPC (or the whole vMX) brings the interfaces back up until it happens again.

When the issue occurs, check dmesg on the host. The last lines show why the vFPC crashed:

[6612327.272322] i40e 0000:01:00.0: VF 0 successfully set multicast promiscuous mode
[6612327.272437] i40e 0000:01:00.0: VF 0 successfully set unicast promiscuous mode
[6612327.343639] i40e 0000:01:00.1: VF 0 successfully set multicast promiscuous mode
[6612327.343748] i40e 0000:01:00.1: VF 0 successfully set unicast promiscuous mode
[6612327.414933] i40e 0000:01:00.2: VF 0 successfully set multicast promiscuous mode
[6612327.415048] i40e 0000:01:00.2: VF 0 successfully set unicast promiscuous mode
[6612327.486171] i40e 0000:01:00.3: VF 0 successfully set multicast promiscuous mode
[6612327.486283] i40e 0000:01:00.3: VF 0 successfully set unicast promiscuous mode
[6612327.557505] i40e 0000:05:00.2: VF 0 successfully set multicast promiscuous mode
[6612327.557616] i40e 0000:05:00.2: VF 0 successfully set unicast promiscuous mode
[6612327.628490] i40e 0000:05:00.3: VF 0 successfully set multicast promiscuous mode
[6612327.628649] i40e 0000:05:00.3: VF 0 successfully set unicast promiscuous mode
[6612335.986778] i40e 0000:01:00.0: VF 0 successfully set multicast promiscuous mode
[6612335.986886] i40e 0000:01:00.0: VF 0 successfully set unicast promiscuous mode
[6612336.029362] i40e 0000:01:00.1: VF 0 successfully set multicast promiscuous mode
[6612336.029469] i40e 0000:01:00.1: VF 0 successfully set unicast promiscuous mode
[6612336.054603] i40e 0000:01:00.2: VF 0 successfully set multicast promiscuous mode
[6612336.054710] i40e 0000:01:00.2: VF 0 successfully set unicast promiscuous mode
[6612336.080682] i40e 0000:01:00.3: VF 0 successfully set multicast promiscuous mode
[6612336.080789] i40e 0000:01:00.3: VF 0 successfully set unicast promiscuous mode
[6612336.106680] i40e 0000:05:00.2: VF 0 successfully set multicast promiscuous mode
[6612336.106787] i40e 0000:05:00.2: VF 0 successfully set unicast promiscuous mode
[6612336.132026] i40e 0000:05:00.3: VF 0 successfully set multicast promiscuous mode
[6612336.132133] i40e 0000:05:00.3: VF 0 successfully set unicast promiscuous mode
[8248098.894633] i40e 0000:01:00.0: TX driver issue detected, PF reset issued
[8248098.894640] i40e 0000:01:00.0: TX driver issue detected on VF 0
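The two lines that matter in the output above are the final ones; the promiscuous mode messages are normal. As a rough sketch, the following Python picks those lines out of kernel log text and extracts the PCI address and VF number. The regular expression is based only on the two messages shown above, so other driver versions may word them differently, and the names `PfResetEvent` and `find_tx_driver_issues` are made up for this example.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class PfResetEvent(NamedTuple):
    timestamp: float      # seconds since boot, as printed by dmesg
    pci_address: str      # PCI address of the physical function
    vf: Optional[int]     # VF number if the line names one, else None


# Assumption: message wording matches the two lines in the log above, e.g.
#   "[8248098.894633] i40e 0000:01:00.0: TX driver issue detected, PF reset issued"
#   "[8248098.894640] i40e 0000:01:00.0: TX driver issue detected on VF 0"
_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+i40e\s+(?P<pci>[0-9a-fA-F:.]+):\s+"
    r"TX driver issue detected(?:, PF reset issued| on VF (?P<vf>\d+))"
)


def find_tx_driver_issues(lines: Iterable[str]) -> List[PfResetEvent]:
    """Return one event per log line that reports the i40e TX driver issue."""
    events = []
    for line in lines:
        match = _LINE.match(line.strip())
        if match is None:
            continue  # e.g. the harmless promiscuous mode messages
        vf = int(match.group("vf")) if match.group("vf") is not None else None
        events.append(
            PfResetEvent(float(match.group("ts")), match.group("pci"), vf)
        )
    return events


if __name__ == "__main__":
    import subprocess

    # Reading the kernel log usually requires root on the KVM host.
    output = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    for event in find_tx_driver_issues(output.splitlines()):
        print(event)
```

Run on the host after a crash, it prints one entry per matching line, which makes it easier to see which physical port and VF were involved.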

The PF reset problem is well known for cards that use the i40e driver; I have personally experienced this exact issue (when not using SR-IOV) on ESXi, Windows Server, CentOS, Debian and Ubuntu. In each case upgrading to a newer driver resolved it. With the vMX this is not an option: during startup of the vMX on KVM, the i40e driver is built from the source included with the vMX. That driver has been modified by Juniper and is not a stock Intel driver.

I have had a support case open for this issue since late June 2018. As of writing (October 2018), there is still no fix available. I have spoken to multiple people who are also encountering the issue on their own installations of the vMX. According to Juniper this is a known issue tracked in PR1382787, which is an internal PR; no details for it can be found in the Juniper Knowledge Base.

Juniper has provided a workaround, which is available from them on request. It involves installing a software package from the router (using the standard request system software add ....). With the package installed, the vFPC is restarted automatically when the PF reset happens. The workaround reduces the downtime from however long the manual intervention takes to around 1-2 minutes.
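To illustrate the detect-then-restart idea, here is a minimal host-side sketch in Python. It is not Juniper's package, which runs on the router and whose internals are not described here. It only watches the host kernel log for the PF reset message shown earlier and calls a callback. The function name `watch_for_pf_reset` and the `on_reset` callback are hypothetical, and the restart action itself is deliberately left as a placeholder.

```python
from typing import Callable, Iterable

# Assumption: wording taken from the dmesg output earlier in this post.
PF_RESET_MARKER = "TX driver issue detected, PF reset issued"


def watch_for_pf_reset(
    lines: Iterable[str], on_reset: Callable[[str], None]
) -> int:
    """Call on_reset(line) for every i40e PF reset line; return the count."""
    count = 0
    for line in lines:
        if "i40e" in line and PF_RESET_MARKER in line:
            on_reset(line.rstrip("\n"))
            count += 1
    return count


if __name__ == "__main__":
    import subprocess

    def placeholder_action(line: str) -> None:
        # Placeholder only: a real setup would alert or trigger a restart.
        print("PF reset seen:", line)

    # "dmesg --follow" keeps printing new kernel messages as they arrive;
    # it usually needs root.
    proc = subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    )
    assert proc.stdout is not None
    watch_for_pf_reset(proc.stdout, placeholder_action)
```

Even without any restart logic, hooking an alert into the callback gives a timestamp for each crash, which is useful evidence for a support case.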