Wednesday, April 06, 2011

An example of when to use VLANs, and the danger of closet monkeys

I wrote a couple days ago about abusing VLANs.  Just yesterday I had an occasion to use VLANs for a client, so I thought I'd write about that.  There's also a "closet monkey" anecdote in here, as a cautionary tale as to why you shouldn't let outside techs into your network closets or server rooms unsupervised.

This client recently entered into an arrangement with a hosted provider for voice-over-IP telephony.  The arrangement this provider offers installs Polycom SIP phones at the business location, along with a gateway device that is installed on the network to aggregate the SIP devices and trunk calls back to the service provider's facility.  (As far as I can tell, call control is handled at the provider's facility, but that's not important right now.)  This gateway device, in addition to its VoIP functionality, is capable of acting as a fairly generic NAT appliance.  This particular provider's installer apparently uses a playbook for installing these devices that involves removing any gateway device that customer already has and replacing it by their device.  This device also provides DHCP with a variety of specialized options preloaded for the benefit of the phones, including, apparently, their own DNS servers, which their system makes some use of in some way that wasn't clearly explained to me.

However, in my client's case this didn't quite work out.  My client has Windows 2008 Small Business Server running at that location, with Active Directory in use.  The SBS server provides both DNS and DHCP for the network; DHCP was not being provided by the existing gateway devices (a Watchguard firewall).  So when they ripped the Watchguard out of the network and installed their gateway device, the DHCP server in their device conflicted with the DHCP server in the SBS server; fortunately, the gateway detected this and shut off its DHCP server.  This resulted in the phones not getting all the extra DHCP options that they needed for optimal operation, nor did they have access to the provider's DNS servers. 

It was about this point that they called me, to ask if there was some way to change the DNS for the network to point to their servers instead of the local Windows server.  Of course, that's not acceptable; this client is using Active Directory, and in an AD environment it is absolutely nonnegotiable that all AD clients must use the Active Directory DNS servers, at least for all zones that describe the AD forest.  I was, however, willing to configure the Windows server to use the provider's DNS servers as first-level forwarders, which would mean that any query not answerable by the zones defined in that server would be forwarded to the provider for resolution.  (It is fairly rare for people to understand how DNS works; perhaps I'll blog about this in the future.)

So, while on the conference call, I went to remote into the client's site, in order to make the necessary changes to the DNS service.  And here's where I ran into more problems.  The VPN would not connect, for the fairly simple reason that they had disconnected the Watchguard firewall that was being used as the VPN endpoint.  (It was at this point that I and my client discovered that they had done this.)  Further discussion and inspection determined that they had disconnected the Watchguard from the WAN side, and I suspected also from the LAN side, although that wasn't confirmed until the next day when I went on site.  This was clearly unacceptable.  Remote access via that device is essential to this client's business operations as well as to my ability to provide them remote support; also, this client runs a FTP server at this location which is used for communications with a couple of business partners, which was obviously also made unavailable as a result of this change.  It's possible that I might have been able to configure this new gateway device to provide comparable services; however, my main complaint is that this provider removed a gateway device without discussion or even notification as part of their install routine.  There's a reason more experienced network engineers like myself refer to such people as "closet monkeys".  When I was a full-time systems person I generally refused to let anyone outside the organization into my server room or network closets without direct supervision; it's incidents like this that explain why.

Anyhow, during the 45-minute conference call two nights ago, after it became apparent that this installers had rather significantly broken my client's network and that I would have to go in to fix it, we then discussed how to make all this work in harmony.  Apparently their device doesn't like operating behind another firewall, and I suspect it will also not play well in router-on-a-stick mode.  We could have arranged that using the Watchguard's "optional" network, but that would have required them to break from their playbook and negotiate with me, and getting a closet monkey to negotiate with the customer is usually impossible.  However, they had actually done me a favor in disguise.  This client has Comcast business cable modem service using an SMC cable modem.  This modem supports transparent bridging but cannot be configured to do it by the customer; turning that on can only be done from the provider end.  When I migrated the client from DSL to cable modem, about a month ago, I would have preferred transparent bridging but didn't want to deal with calling Comcast to set it up, so I set up a double NAT solution instead by configuring the modem to map each public IP to a RFC 1918 IP, and then using those mapped IPs at the firewall's WAN interface.  This solution was less than ideal, in my opinion, but was working, so I left it alone.  However, the installers for this system had apparently contacted Comcast and had the modem switched to transparent bridging to better support their device.  A blessing in disguise.  Anyhow, this meant that the cable modem was now presenting five public IP addresses (five of the six usable addresses of a /29 network, the sixth having been allocated to the cable modem itself) on its LAN ports, but their gateway device only needed one; I could use one of the others for the client's firewall and restore remote access, and another for the FTP site; only minor reconfiguration of the firewall would be needed, once it had been reconnected, of course.  The only question remaining was how to run both devices in parallel, without conflict.

Here's where VLANs come in.  The strategy here is to have one VLAN for the PCs (and printers and servers and other devices) and another, entirely separate VLAN, for the phones.  This not only allows my client to continue using their firewall device, which has been set up for their specific business needs, but also allows the provider's edge device to serve all the special DHCP options to the phones that are required to make the phones work correctly, and allow the phones to get the DNS servers that the provider wants them to use, without interfering with the needs of the active directory environment.  It's truly as if there were two entirely separate LANs.  (There isn't even any routing between the two VLANs; while I could have set that up, there was no benefit to doing so.)

The only remaining issue was in how to get the phones to live on their VLAN without having to run additional cabling.  The phones in question, as I mentioned, are Polycom SIP devices.  Like most VoIP phones, they have passthrough Ethernet ports so that you don't have to install additional cabling to install them; you just plug the PC into the phone and the phone into the wall where the PC was plugged in before.  In addition, like most VoIP phones, they support 802.1q tagging for VoIP traffic, which means the phone's traffic is tagged with 802.1q tags that allow a suitably capable switch to segregate the traffic for the phones from the traffic from the passthrough port (which is sent untagged).  The provider wasn't able to advise me on how to set the phones up to do this, but I was able to figure it out anyway, having set up VoIP telephony systems before.  Furthermore, Polycom has fairly decent documentation for its phones available on the net; all that was required was the addition of a special DHCP option to the Windows DHCP server, and I was able to fairly quickly find out what option was needed and what syntax these phones needed for that.  This allowed the phones to operate on the voice VLAN while still using the same cable for passthrough data traffic to any device connected to the phone's passthrough port.

So, I defined a second VLAN on the client's switches, and set up all but three ports on the switches as "untagged 1 tagged 101" (1 being the data VLAN and 101 being the voice VLAN).  The phones, when they boot, execute a DHCP discover on the data (untagged) VLAN.  The Windows server responds with an IP address offer that includes the DHCP tag that tells the phone to switch to VLAN 101.  The phone then rejects the DHCP offer and switches its VoIP interface to the tagged VLAN and sends another DHCP discover on the voice VLAN, which is responded to by the VoIP gateway device with all of the settings that are particular to the voice network.  Other devices on the data network (such as workstations) just ignore this DHCP option and proceed as usual.  Two of the ports that were not set up this way were set up as "untagged 101"; one of these was connected to the edge device (so that the edge device would not get 802.1q tags that it wasn't set up to deal with) and the other I used for configuration and troubleshooting access during the process.  The final remaining port goes to an unmanaged gigabit switch that interconnects the client's servers; that switch is not 802.1q aware and thus should also not received tagged packets, and in any case no device on that switch needs to see voice traffic.

In this case, VLANs were key to solving this client's problem.  The traffic segregation and quality of service wasn't really an issue; this client's network is small enough that it's unlikely that there'd be capacity issues.  In this case segregation was mandated by the need to have distinct DHCP environments.  In theory I could have used a DHCP server that used the requesting client's client ID or MAC address to serve different DHCP options, but such features are not standard in most DHCP servers.  The VLAN solution was simpler. 

One of the problems small businesses often face (often without knowing it) is that there's a bevy of solution providers out there that are offering what amount to turnkey solutions, and in most cases the solution is being deployed by people who are only trained to deal with a small subset of the possible environments they'll run into.  Sometimes that'll work out OK, but really if you want a good result you need someone involved who is looking out for your needs, concerns, and interests.  You just can't count on someone else's technician to do that.  The solution they provide has been optimized for their needs, not necessarily for yours.