TinyNets
TinyNets presents a networking strategy for distributed robotic control systems.
Networked Control Systems (NCS)
The field of Networked Control Systems (NCS) differs from many other networking fields. NCS refers to any application in which many devices are linked together to control a physical system. They are common in robotics and avionics, where many sensors and actuators work together toward a common goal (e.g. walking, stabilization, etc.), and in manufacturing, where machine degrees of freedom are linked to close positioning control loops, and where multiple machines are linked to coordinate material handling and production scheduling.
Critically,
- In NCS, total throughput is valued but not a key metric.
- Rather, message sizes are typically very small (between three and fifty bytes) and message delay time is the critical metric. Often, messages are only one packet in length.
- Determinism in message delivery time is critical - systems must guarantee that certain control loops 'close' within a defined amount of time, lest they become unstable.
- Robustness is critical - NCS should not contain any Single Points of Failure.
- Statelessness is critical - NCS should not pause operation under any circumstances to re-converge on routing solutions, as this adds fatal indeterminism to message delivery.
The State of the Art in NCS
State-of-the-art Networked Control Systems employ simple Switched Ethernet, or proprietary versions thereof, to route traffic. Hardware endpoints are fitted with an Ethernet PHY and are connected in a hierarchy of switches. Ethernet MAC addresses are used, and all routing takes place on Layer 2.
Switched Ethernet has become the industry standard because of its relative interoperability and high speeds. Critically, the last 10 years have seen Switched Ethernet take up large portions of market share because it solves many problems associated with Fieldbusses. Most importantly, adding devices to a Fieldbus always caused a linear increase in message delivery time, which is not the case with Switched Ethernet.
Dissatisfaction with Switched Ethernet
However, Switched Ethernet was not originally developed for Networked Control Systems, and many in industry have pointed out that it will not fulfill customer needs in the near- and long-term future.
In Switched Ethernet, because a Minimum Spanning Tree is created, nodes in a particular layer compete for link-time on the layer above. Message delay time increases linearly with the probability that peers are transmitting at the same time, and with the number of peers on that layer.
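As a rough back-of-the-envelope model of that scaling (our own simplification, not a figure from the Switched Ethernet literature): if a node shares its uplink with $n - 1$ peers and each peer independently transmits during a given interval with probability $p$, the expected number of competing frames ahead of a new message is roughly

$$\mathbb{E}[W] \approx (n - 1)\,p$$

so the added delay grows linearly both in the number of peers on that layer and in their offered load.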
In addition, Switched Ethernet contains Single Points of Failure: a broken link or switch means that the network must re-run the Spanning Tree Protocol algorithm - a process that often takes seconds. Because Switched Ethernet graphs are highly hierarchical, failure of a single link can often cause entire sections of the network to fail or become unreachable.
Device endpoints in NCS are scaling down in size and up in number. Requiring that each endpoint carry an RJ45 magnetic jack and an Ethernet PHY is dubious, and sets a lower bound on the size and complexity that sensors and actuators in an NCS must possess.
Switched Ethernet is non-programmable: switches are black-box ICs and do not allow system designers to arbitrarily add functions to a system on the networking layer. For example, many NCS designers would like to implement message priorities and load balancing, but this is not possible on Layer 2.
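As an illustration of the kind of function a programmable forwarding layer would enable - a hypothetical C fragment of ours, not a feature of any existing switch IC - a router running open code could, for instance, select the next frame to forward by message priority rather than strict arrival order:

```c
#include <stdint.h>
#include <stddef.h>

#define TN_NUM_PRIORITIES 4   /* one FIFO per priority level; sizes are illustrative */
#define TN_FIFO_DEPTH     16

typedef struct {
    uint8_t *frames[TN_FIFO_DEPTH];
    size_t   head;
    size_t   count;
} tn_fifo_t;

static tn_fifo_t queues[TN_NUM_PRIORITIES];

/* Strict-priority scheduling: always drain the highest-priority
 * non-empty queue first.  A black-box switch IC cannot be extended
 * with arbitrary policies like this; a router running open C code can. */
static uint8_t *next_frame_to_forward(void)
{
    for (int p = TN_NUM_PRIORITIES - 1; p >= 0; p--) {
        tn_fifo_t *q = &queues[p];
        if (q->count > 0) {
            uint8_t *frame = q->frames[q->head];
            q->head = (q->head + 1) % TN_FIFO_DEPTH;
            q->count--;
            return frame;
        }
    }
    return NULL;  /* nothing queued on any priority level */
}
```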
Constraints and Cost Functions for TinyNets
Constraints
In the design of TinyNets, we operate under the following constraints:
- TinyNets should be trivially integrated on device endpoints, i.e. an endpoint should not require any additional hardware circuitry. This allows the network to scale down into micro-robotics applications.
- TinyNets should run entirely in C or C++ on the processors used on endpoints and routers, meaning that network protocols can be openly modified within an Autonomous System to perform application-specific tasks (a header sketch follows this list). TinyNets is Open Source Software.
- TinyNets should run with no global state. It should not have to re-converge on routing solutions in the face of broken or modified links, additions to the network, or changes in traffic patterns.
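As a sketch of what "trivially integrated" could mean in practice, consider a compact, fixed-size message header that a small microcontroller can build and parse with no extra hardware. The field names, widths, and the tn_ prefix below are our assumptions for illustration, not a finalized TinyNets wire format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical TinyNets message header -- field names and widths are
 * illustrative assumptions, not the actual wire format.  The point is
 * that the whole header fits in a handful of bytes, so even a 3-50 byte
 * control message carries little overhead. */
typedef struct __attribute__((packed)) {
    uint16_t dest;      /* destination endpoint address          */
    uint16_t src;       /* source endpoint address               */
    uint8_t  hops;      /* hop count, incremented at each router */
    uint8_t  priority;  /* application-defined message priority  */
    uint8_t  length;    /* payload length in bytes               */
} tn_header_t;

/* Build a frame (header + small payload) into a caller-provided buffer.
 * Returns the total frame length, or -1 if the buffer is too small. */
static int tn_build_frame(uint8_t *buf, size_t buflen,
                          uint16_t dest, uint16_t src, uint8_t priority,
                          const uint8_t *payload, uint8_t payload_len)
{
    if (buflen < sizeof(tn_header_t) + payload_len)
        return -1;

    tn_header_t hdr = { dest, src, 0, priority, payload_len };
    memcpy(buf, &hdr, sizeof hdr);
    memcpy(buf + sizeof hdr, payload, payload_len);
    return (int)(sizeof hdr + payload_len);
}
```

With a header this small, even the 3-50 byte control messages typical of NCS carry little framing overhead.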
Direct Comparisons
It will be difficult to perform one-to-one comparisons between our network and the state of the art, as we are proposing a completely new solution in response to problems in NCS that we believe cannot be addressed with incremental modifications to existing technologies.
Proving our Merit
However, we can offer analysis as to why we believe our approach is substantially better than current offerings - or has a better problem-solution fit than other technologies.
Realtime / Convergence-Free Multipath Routing in a Distance-Vector Routing Protocol
- Existing multipath routing technologies offer multipath routing (which eliminates the switching-bottleneck issues associated with Switched Ethernet); however, they do so using link-state routing, which requires each router to share common knowledge of the complete network graph. In the face of link outages or router failures, these networks must re-converge - a process that interrupts flows and causes massive increases in, or complete failures of, message delivery. For example (a contrasting hop-count sketch follows this list):
- ECMP (Equal-Cost Multipath Routing): more of a tool than an actual strategy; it simply considers multiple paths when there are multiple best paths, i.e. it is a load-balancing mechanism.
- OSPF (Open Shortest Path First): computes a shortest-path tree using Dijkstra's algorithm and so must know the entire graph; Wikipedia states convergence time is on the order of seconds (and links to Cisco default parameters that set timeouts to multiple seconds); specifically for Ethernet, and offers a multipath version.
- SPB (Shortest Path Bridging): allows multiple equal-cost paths and claims the network is unaffected when a node fails, except for the path(s) affected by the node failure; i.e. it still cannot find another path if there is a unique shortest path containing a broken link.
- TRILL (TRansparent Interconnection of Lots of Links): must know the entire graph, but is otherwise extremely similar to our protocol (it uses hop counts and has a similar flooding procedure); operates in Layer 2 and uses Fabric Shortest Path First (FSPF) to calculate alternate routes in node failure scenarios.
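To make the contrast with link-state routing concrete, the sketch below shows the kind of purely local, hop-count-based route update a distance-vector approach can use: a router only compares advertisements arriving from its direct neighbors, so it never needs the full graph and there is no global re-convergence step. The table layout and function names are illustrative assumptions, not the TinyNets implementation itself.

```c
#include <stdint.h>
#include <string.h>

#define TN_MAX_DESTS  64
#define TN_INF_HOPS   0xFF

/* One routing table entry: best known hop count to a destination and
 * the local port that advertised it.  Layout and names are illustrative. */
typedef struct {
    uint8_t hops;   /* TN_INF_HOPS means "no route known"     */
    uint8_t port;   /* neighbor port providing that hop count */
} tn_route_t;

static tn_route_t table[TN_MAX_DESTS];

static void tn_routes_init(void)
{
    memset(table, TN_INF_HOPS, sizeof table);   /* start with no routes */
}

/* Called whenever a neighbor on `port` advertises that it can reach
 * `dest` in `adv_hops` hops.  The update is purely local: it compares
 * the advertisement against the current entry only, so no router ever
 * needs the full graph and there is no global re-convergence step.  If
 * the current next hop degrades (e.g. a link behind it breaks), the
 * higher advertised cost overwrites the stale entry and another
 * neighbor's cheaper advertisement can immediately win it back. */
static void tn_route_update(uint8_t dest, uint8_t port, uint8_t adv_hops)
{
    if (dest >= TN_MAX_DESTS || adv_hops >= TN_INF_HOPS)
        return;

    uint8_t cost = (uint8_t)(adv_hops + 1);   /* one extra hop via that port */

    if (cost < table[dest].hops || table[dest].port == port) {
        table[dest].hops = cost;
        table[dest].port = port;
    }
}
```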
We seek to demonstrate that these re-convergence times would cause operational failure in NCS, thus eliminating ECMP and OSPF as possible solutions to the NCS problem.
The three protocols in question (OSPF, SPB, TRILL) require knowing the entire graph to perform a global shortest-path calculation. All three of them allow multipath consideration when there are multiple best paths. 200 ms seems to be the lower bound on convergence times, as FSPF is quoted as achieving convergence times as good as 200 ms in the book "IBM SAN Solution Design Best Practices for VMWare". The vanilla OSPF protocol that Cisco offers indicates convergence times on the order of seconds, while optimized versions offer timing similar to FSPF (see paper).
200 ms sounds like a reasonable convergence time (and is quoted as being extremely fast), so to prove our merit we need to demonstrate systems that do not have multiple shortest paths under the protocols above. This should highlight the main benefit of our protocol: the capability to perform real-time alternate path calculations in a reasonable amount of time.
We propose designing multiple experiments to showcase the benefit of our protocol:
- Grid structure network - test the latency of corner-to-corner communication and network traffic during node failure. This test will serve as a control, since there will be many shortest paths (if hop count is used as the metric); a quick count of these paths is sketched after this list.
- Ring structure network - we already have graphs for this from the other protocol and it will demonstrate the speed at which each protocol finds the only other path in the event of a node failure.
- Mesh network - a fully connected network that will test our network utilization (ensuring that ringing doesn't happen, or is at least tightly bounded). Test the latency of cross-network communication in the event of a node failure. We should expect to see minimal decision time in our protocol and minimal flooding.
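As a quick sanity check of the "many shortest paths" claim for the grid control case (an illustrative helper of ours, not part of the protocol), the fragment below counts the distinct corner-to-corner shortest paths in an n x n grid when hop count is the metric; a 4 x 4 grid already has C(6, 3) = 20 of them.

```c
#include <stdio.h>
#include <stdint.h>

/* Number of distinct corner-to-corner shortest paths in an n x n grid
 * when hop count is the metric: every shortest path is some ordering of
 * (n-1) "right" steps and (n-1) "down" steps, so the count is the
 * binomial coefficient C(2(n-1), n-1).  Helper names are illustrative. */
static uint64_t grid_shortest_paths(unsigned n)
{
    uint64_t paths = 1;
    unsigned k = n - 1;
    for (unsigned i = 1; i <= k; i++)          /* build C(2k, k) iteratively */
        paths = paths * (k + i) / i;
    return paths;
}

int main(void)
{
    for (unsigned n = 2; n <= 8; n++)
        printf("%u x %u grid: %llu corner-to-corner shortest paths\n",
               n, n, (unsigned long long)grid_shortest_paths(n));
    return 0;
}
```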