IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 23, NO. 6, JUNE 2012


Programming Time-Multiplexed Reconfigurable Hardware Using a Scalable Neuromorphic Compiler

Kirill Minkovich, Narayan Srinivasa, Senior Member, IEEE, Jose M. Cruz-Albrecht, Member, IEEE, Youngkwan Cho, and Aleksey Nogin

Abstract— Scalability and connectivity are two key challenges in designing neuromorphic hardware that can match biological levels. In this paper, we describe a neuromorphic system architecture design that addresses these challenges using traditional complementary metal–oxide–semiconductor (CMOS) hardware. A key requirement in realizing such neural architectures in hardware is the ability to automatically configure the hardware to emulate any neural architecture or model. The focus of this paper is to describe the details of such a programmable front-end. This programmable front-end is composed of a neuromorphic compiler and a digital memory, and is designed based on the concept of synaptic time-multiplexing (STM). The neuromorphic compiler automatically translates any given neural architecture to hardware switch states, and these states are stored in digital memory to enable the desired neural architectures. STM enables our proposed architecture to address scalability and connectivity using traditional CMOS hardware. We describe the details of the proposed design and the programmable front-end, and provide examples to illustrate its capabilities. We also provide perspectives for future extensions and potential applications.

Index Terms— Neuromorphic systems, neurons, routing, scalable architecture, synapses.

I. INTRODUCTION

THERE are two challenging aspects of building neuromorphic circuits in mature complementary metal–oxide–semiconductor (CMOS) technology to match biological brain-like architectures: scalability and connectivity. Scalability means that the circuits have to be expandable to match biological brains in terms of synaptic and neuronal densities. The challenge here is to implement 10^6 neurons and 10^10 synapses with an average fanout of 10^4 in a square centimeter of CMOS [1]. Connectivity means that the circuit must be capable of having both short- and long-range (by physical distance) connections between neurons.

Manuscript received May 20, 2011; revised March 11, 2012; accepted March 11, 2012. Date of publication April 11, 2012; date of current version May 10, 2012. This work was supported in part by the Defense Advanced Research Projects Agency SyNAPSE under Grant HRL0011-09-C-001. This work is approved for public release and distribution is unlimited. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of DARPA or the Department of Defense. K. Minkovich, N. Srinivasa, Y. Cho, and A. Nogin are with the Center for Neural and Emergent Systems, Department of Information and System Sciences, HRL Laboratories LLC, Malibu, CA 90265 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). J. M. Cruz-Albrecht is with the Microelectronics Laboratory, HRL Laboratories, LLC, Malibu, CA 90265 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2012.2191795

A large part of this

challenge is how to implement a connectivity of 10^4 synapses per neuron [2]. Unfortunately, even the exponential growth in transistor density being experienced today is not sufficient to realize such massive connectivity and synaptic densities in a traditional CMOS process. Recent approaches to address these challenges have been to integrate CMOS with nanotechnology [3], [4] in order to achieve the required synaptic densities. These solutions predominantly use crossbar architectures, but the connectivity challenge remains a daunting task for such solutions [2], [5], [6]. To meet these challenges, a novel synaptic time-multiplexing (STM) concept is developed along with a neural fabric design [7]. This combination has the advantages of greater flexibility and long-range connectivity. It also provides a method to overcome the limitations of conventional CMOS technology to match the synaptic density and connectivity requirements found in mammalian brains while maintaining non-linear synapses and learning. The proposed neuromorphic system architecture to support large-scale brain models is composed of three main components (see Fig. 1): the reconfigurable front-end that will enable programming of neural architectures into hardware, the analog core that houses the neurons and synapses in a neural fabric and performs the neural and synaptic computations, and the back-end that will enable the storage and retrieval of synaptic conductances during the operation of the chip. This paper aims to describe the details of the programmable front-end. The programmable front-end is designed based on the concept of STM and is essential in programming any arbitrary neural architecture into CMOS neuromorphic hardware in a scalable fashion. The outline of this paper is as follows. The STM concept is introduced in Section II and the details of the associated neural fabric design are provided in Section III.
In Section IV, we compare our proposed design with other well-known neuromorphic designs. In Section V, the details of a scalable neuromorphic compiler are described. Section VI discusses issues related to the hardware implementation of our proposed neuromorphic architecture, including scaling the system to multi-chip implementations. This section also highlights the potential extensions planned for the neuromorphic compiler and the possible applications of the proposed concepts. Concluding remarks are provided in Section VII.

II. STM

Typical neurons have firing rates that range from 0.5 Hz to 100 Hz, with momentary excursions to higher or lower frequencies [8]. In contrast, modern electronics have grown



Fig. 1. Our proposed neuromorphic hardware components for supporting large-scale neural architectures (10^6 neurons and 10^10 synapses in a square centimeter) are shown here. The focus of this paper will be on the programmable front-end.

Fig. 2. STM concept to model large-scale neuromorphic architectures illustrates the breakdown of an STM cycle into separate STM timeslots.

at the rate predicted by Moore's law [9] by exponentially increasing the clock speed (in the gigahertz range) and by increasing transistor density. The key idea in STM is to exploit this difference in operating speed between electronics and mammalian brains and trade off space for speed of processing in order to address the scalability and connectivity challenges. To enable this, the physical connections between neurons are time-multiplexed. As illustrated by a simple example in Fig. 2, the set of three decoupled networks on the left, in a three-timeslot sequence, provides the same set of connections as the neural network with high synaptic connection density shown on the right. By integrating all the synaptic inputs for a given neuron in a sequence rather than in parallel, the STM concept reproduces the fully connected network while reducing the hardware requirements to only a few physical synapses per neuron plus storage of the other synapse states. During this process, the sequential steps are operated at a much higher frequency than the maximum brain operating speed desired. This operating frequency is referred to hereinafter as the STM frequency. The set of STM timeslots needed to describe all the synapses is referred to as the STM cycle, as shown in Fig. 3(b), and its cycle time determines the STM frequency. This feature enables decoupled networks (Fig. 2) to be processed sequentially at each STM timeslot until all of them are covered. The sum of the durations of all these STM timeslots makes up the total system time, or STM cycle, and all the connections of the complete network are realized by the end of the STM cycle. The STM approach is applicable to neurons with spike outputs, and time multiplexing applies only to the synapses


Fig. 3. (a) Time-domain diagram of a spike signal and an STM cycle. The STM cycle is about 10 times shorter than a typical spike period. (b) Detail of an STM cycle composed of multiple timeslots. The synapse states and the routing fabric are updated on each timeslot.

and the connections. In our hardware implementations, we have used leaky integrate and fire (LIF) neurons [10], [11]. The LIF neurons, however, are not time-multiplexed and perform continuous integration of input current to produce a spike whenever the membrane voltage exceeds a threshold. Fig. 3(a) shows a diagrammatic representation of a spike produced by a neuron. In the STM approach, spike rates can be in the biological range (∼50 Hz) [10]. Higher rates on the order of kHz [11] are also compatible with the STM approach. Fig. 3(b) shows the STM timing diagram. The cycle period is about 10 times shorter than the spike period. This ensures that the time-multiplexing does not introduce a significant change in the timing of the spikes. Each timeslot is smaller than the STM cycle. In a typical 90-nm CMOS technology implementation, an example of timeslot duration is 1 μs. This is based on transistor-level simulations. For example, if we select 100 timeslots per STM cycle, the cycle time is 100 μs. The spike pulse width is typically equal to the cycle time. The average spike period is 1 ms, which is equivalent to a rate of 1 kHz. For operation at biological rates, more timeslots or longer timeslots (longer than 1 μs) can also be used. As sparse connectivity is achievable at each STM timeslot, thereby requiring a minimal number of physical synapses, it becomes necessary to account for the synaptic states after each STM timeslot. This is because each physical synapse is reused to perform the role of a different synapse in each STM timeslot of the STM cycle. This is accomplished by using a synaptic analog memory (as shown in Fig. 1) such that the appropriate synaptic states can be stored and retrieved at the appropriate STM timeslot during the actual functioning of the hardware.
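The example timing figures above can be tied together with a few lines of arithmetic; a minimal sketch using only the values quoted in the text (1 μs timeslots, 100 timeslots per cycle):

```python
# STM timing arithmetic using the example values quoted in the text.
TIMESLOT_US = 1.0        # one STM timeslot (90-nm CMOS example)
SLOTS_PER_CYCLE = 100    # chosen number of timeslots per STM cycle

cycle_us = TIMESLOT_US * SLOTS_PER_CYCLE   # STM cycle: 100 us
spike_pulse_width_us = cycle_us            # pulse width ~= cycle time
spike_period_us = 10 * cycle_us            # cycle is ~10x shorter than period
spike_rate_khz = 1000.0 / spike_period_us  # 1 ms period -> 1 kHz
```

Operating at biological rates instead simply means growing `SLOTS_PER_CYCLE` or `TIMESLOT_US` until `spike_period_us` reaches the desired range.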
In each STM timeslot, the synapse conductance (weight) is: 1) read from memory; 2) updated if required, as dictated by the spike timing-dependent plasticity (STDP) rule [10], [12]–[15], whereby the synaptic strengths are changed in proportion to the timing difference between the input spike transmitted to the neuron and the output spike generated by the neuron; and 3) written to memory. The weight is still an analog signal, but it only changes during the update portion of a timeslot. In a typical 90-nm CMOS technology implementation, with timeslot durations of 1 μs, an example of timing within each timeslot is as follows: read time of 100 ns, weight update time of 800 ns, and write time of 100 ns. These values can be achieved when


Fig. 4. Example of (a) routing spikes into a nodal element and (b) routing spikes out of a nodal element.

using capacitors for analog memory. Using an analog memory and analog update circuits obviates the need to use data conversion circuits, such as digital-to-analog converters and analog-to-digital converters. Analog circuits are more compact than digital circuits. However, the STM process can also work with digital circuits. The key requirement for these updates based on STDP is that a neuron spike remains available for at least an entire STM cycle from the timeslot at which the spike occurs. This ensures that all synapses connected to the neuron receive the spike and, thus, the weight updates based on STDP are accurate. We are developing integrated circuits based on the STM concept to implement neuromorphic systems [11]. As mentioned earlier, the focus of this paper is to describe how to automate the process of programming any given neural architecture onto the reconfigurable hardware based on the STM concept. The neuronal computation and its associated adaptations (e.g., STDP) that are required within the STM-based hardware will be described in another article.

III. NEURAL FABRIC FOR STM-BASED ARCHITECTURE

In order to realize the concept of STM in CMOS, we have developed a neural fabric design that is part of the analog core (see Fig. 1). This design is amenable to spiking neurons [16]. An example of this fabric design is shown in Fig. 4. It consists of a network of nodal elements and fabric switches. The nodal elements house the neurons and synapses with learning modules. Each node contains the following components: 1) one neuron [10], [17]; 2) four physical synapses with STDP [18]; 3) wires and switches [11]; 4) local analog memory [19]; and 5) local digital memory. In the simplified diagram, only the neuron, synapses, and some switches are shown; the memories are not shown. As mentioned earlier, since the neuron is not time-multiplexed, it operates in continuous time. Each physical synapse is time-multiplexed to implement multiple virtual synapses.
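The per-timeslot servicing of a physical synapse described earlier (read, STDP update if required, write back) can be sketched in software terms; the function and update rule here are hypothetical illustrations of the sequencing only, not the analog circuits themselves:

```python
# Hypothetical sketch of servicing one physical synapse across an STM cycle.
# The synapse is reused once per timeslot, acting as a different virtual
# synapse each time; its analog weight lives in per-timeslot memory.

def run_stm_cycle(weights, pre_spikes, post_spikes, stdp_update):
    """weights: one stored conductance per timeslot (one virtual synapse each).
    pre_spikes/post_spikes: per-timeslot spike flags seen by this synapse.
    stdp_update: rule mapping (weight, pre, post) -> new weight."""
    for slot in range(len(weights)):
        w = weights[slot]                  # 1) read from analog memory
        w = stdp_update(w, pre_spikes[slot],
                        post_spikes[slot])  # 2) update if required (STDP)
        weights[slot] = w                  # 3) write back to memory
    return weights
```

For instance, a toy rule such as `lambda w, pre, post: w + 0.1 if (pre and post) else w` potentiates only the virtual synapses whose timeslot saw coincident spikes, leaving the others untouched.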
The number of virtual synapses implemented by each physical synapse is equal to the number of STM timeslots in one STM cycle. For example, to achieve 100 virtual synapses per physical synapse, there must be 100 STM timeslots in one STM cycle. In a CMOS implementation, this can be achieved by using an STM timeslot duration of


Fig. 5. Abstracted neural fabric design with (a) abstraction of a nodal element (circle in the center) and its associated switching fabric and (b) simple 3 × 3 grid of abstracted neural hardware showing the neural fabric with nine nodal elements.

1 μs and an STM cycle duration of 100 μs. Each physical synapse has access to a set of analog registers. The number of analog registers per physical synapse is equal to the number of timeslots per cycle. For example, if there are 100 STM timeslots in one STM cycle, then there must be 100 analog registers per physical synapse. These registers are used to store the synaptic conductance state. We have designed integrated circuits that implement the analog registers in several forms, including capacitors and memristors. These analog memories are implemented close to the physical synapses to reduce the space for local wires connecting synapses to memory. The typical access time for capacitor-based analog memories is about 100 ns and for memristor-based analog memories is about 10 μs. The memristor-based memory can be implemented on top of CMOS. We have implemented some initial memristor-based memory chip prototypes [11], [20]. The core cell size of a memristor-based memory can be as small as 0.01 μm². The switches allow implementing multiple virtual axons in the fabric. Each switch has access to a set of digital registers. These registers are used to store the switch state. The number of memory bits required per switch is equal to the number of STM timeslots per STM cycle. The memory can be implemented using SRAM. In 90-nm technology, each SRAM cell requires on the order of ∼1 μm². A smaller area can be achieved by using DRAM. High-density memristors with cell sizes as small as 0.01 μm² could also be used for the digital memory. The digital memory is local to each node to reduce the space of wires connecting switches to memory. The neuron in each nodal element receives spike signals that are transmitted through channels (combination of blue and gray lines in Fig. 4) and routed using fabric switches at each STM timeslot, as shown in Fig. 4. When a neuron emits a spike, it is transmitted via its axon to the other neurons (magenta lines shown in Fig. 4).
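The per-node memory bookkeeping implied by these figures can be checked with simple arithmetic; a sketch where the per-node switch count is an assumption and the remaining values are the examples quoted in the text:

```python
# Back-of-envelope memory sizing per nodal element (assumed example values).
SLOTS_PER_CYCLE = 100    # STM timeslots per STM cycle
SYNAPSES_PER_NODE = 4    # physical synapses with STDP per node
SWITCHES_PER_NODE = 64   # hypothetical number of fabric switches per node
SRAM_BIT_UM2 = 1.0       # ~1 um^2 per SRAM cell in 90-nm technology

# One analog register per virtual synapse (i.e., per timeslot).
analog_regs = SYNAPSES_PER_NODE * SLOTS_PER_CYCLE   # 400 analog registers
# One switch-state bit per switch per timeslot.
switch_bits = SWITCHES_PER_NODE * SLOTS_PER_CYCLE   # 6400 digital bits
sram_area_um2 = switch_bits * SRAM_BIT_UM2          # ~6400 um^2 of SRAM
```

Under these assumptions, a node carries 400 analog registers and roughly 6400 SRAM bits of switch-state memory, which is why denser DRAM or memristor cells become attractive as the timeslot count grows.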
The synaptic elements house the learning modules to perform STDP learning. An example of two input spikes routed into the nodal element is shown in Fig. 4(a). Here, the two fabric switches have to be closed, or set to the ON state, as shown. Similarly, an example of a neuron spike routed through the axons to the grid lines is shown in Fig. 4(b). Here, a different set of two fabric switches has to be set to the ON state as shown. In order to route spikes between neurons within different nodal elements, the neural fabric consists of axonal grid lines


Fig. 6. Example of an axon path from a presynaptic neuron to four postsynaptic neurons.

with additional diagonal fabric switches as shown in Fig. 5(a). A variety of different hardware configurations are possible for the neural fabric design, such as more output axon wires and various arrangements of diagonal switches; however, there are trade-offs for each configuration that may require more resources and/or increased complexity of hardware implementation. In this paper, for simplicity, we will assume that the neural fabric design is a grid of nodal elements and fabric switches [a 3 × 3 grid is shown in Fig. 5(b)] and that each nodal element consists of a bank of horizontal, vertical, and diagonal switches as shown in Fig. 5(a). In the nodal circuit shown in Fig. 5(a), if we assume that there is a set of K routing channels on each of its four sides (blue and green lines in the figure) and each channel is implemented by a physical wire segment, then the length of each wire segment corresponds to the pitch between nodal elements (or the nodal pitch). Each nodal element contains 4K wire segments. A network with N nodal elements has 4KN wire segments. The number of wire segments affects the average length of axonal connections that can be achieved. In each timeslot, each neuron output is distributed by an axon path to four other postsynaptic neurons. Fig. 6 shows an example of an axon path (shown in red). There are up to N axon paths that need to be synthesized in each timeslot for the entire network. In each timeslot, there are 4KN wire segments available to synthesize the N axon paths. Therefore, the average length of an axon path is 4K, assuming that all segments are utilized. If we assume a utilization ratio p of the wire segments, then the average length of an axon path is 4Kp. All lengths are given in units of nodal pitch. Each axon path of a neuron is used to connect to four postsynaptic neurons. If we assume that the four postsynaptic neurons selected during a given timeslot are located in approximately the same area (Fig. 6), then the average Manhattan distance from the presynaptic neuron to the postsynaptic neurons can be as large as ∼4Kp. For example, if K = 32 and p = 0.5, then the average Manhattan distance from the presynaptic neuron to the postsynaptic neurons can be 64. This is an average distance, not a maximum distance. An average Manhattan distance of 64 in a square grid of nodes is enough to build time-multiplexed networks in which each neuron can access over 10^4 other different neurons. In an integrated circuit implementation, the maximum axon length will be below 10 mm. The typical RC delay of a wire

itself without switches is on the order of ∼10 ns. Each axon will have from 1 to about 100 series switches. The typical propagation time in an axon with switches is less than 1 μs. For the STM concept to work, the slot time should be at least 10 times this propagation time.

IV. COMPARISON TO EXISTING WORK

When examining the large field of neuromorphic hardware [21]–[26], a common communication scheme starts to emerge. The address event representation (AER) [27]–[30] communication scheme is used to transmit spiking information from one set of neurons to another. AER uses time-multiplexing to encode spiking data for efficient communication. By exploiting the neural firing patterns, AER is able to pack the spiking data from several groups of neurons into a single communication bus. The popularity of AER has grown to a point where most neuromorphic chip designs use it to connect groups of spiking neurons together. In these designs, transceivers encode and decode spikes over a small set of high-speed wires by encoding each axon with a unique binary representation, an address-event. To save hardware real estate, neurons are grouped together to share a common encoder and decoder. This leads to several questions: How should the neurons be grouped together? How should they communicate with each other? How many neurons should share AER hardware? The AER work answers these questions by presenting an efficient and scalable architecture that allows a set of neurons, which share common AER hardware, to communicate with each other. The common feature between AER and STM is that both use time-division-multiple-access links to save on routing hardware. Beyond that, these schemes are very different. AER can be thought of as a bus architecture where processing units (neurons) use encoders and decoders to communicate with each other using a fixed communication path.
STM, on the other hand, is more like a set of direct point-to-point connections that needs to be modified several times before each processing unit has all of its inputs. Unlike AER, in the STM approach the routing fabric is updated after each timeslot, regardless of the presence or absence of a neuron spike. Furthermore, in the STM approach, an address to encode the destination of spikes is not required, and routing is not triggered by spike events but is based on a global STM clock. In STM, point-to-point connections are made to send parallel signals, while in AER a common bus is connected to all the nodes to send a common signal. Both neuromorphic communication schemes can coexist but have different uses, just like their digital counterparts. For example, field-programmable gate array (FPGA) routing fabric is very useful for sending data from simple devices over shorter distances with tight bounds on arrival time, while bus architectures are useful for sending data from groups of simple devices over longer distances. We envision the STM technique being used on chip to route the spikes, while AER would be used for off-chip communication. While a direct comparison of power consumption between AER and STM is tedious due to the multiple variants of AER and possible


Fig. 7. Two possible approaches for modeling synaptic pathways by pairwise springs as part of the neuron placement algorithm are shown here.

variants of STM, it is expected that the STM routing fabric will use more power when the firing rate is low (due to STM's constant reconfiguration of the routing fabric) and less power when the firing rate is high (due to AER's encoding and decoding overhead). Furthermore, we will show that, even beyond the communication scheme, the existing neuromorphic hardware is quite different from our proposed architecture. To evaluate the state of neuromorphic hardware that is currently used to simulate large-scale neural networks, the following popular neuromorphic chips will be discussed: Neurogrid [21], SPINNAKER [22], and FACETS [24]. Neurogrid is an architecture developed at Stanford consisting of a 4 × 4 array of neurocores, where each neurocore has a 256 × 256 array of neuron circuits with up to 6000 synapse connections. Each group of neurons is connected using a grid of AER connections instead of a bus [28], [31]. If the receiver of an AER message is not the final recipient, the message is forwarded to its neighbors. On bus architectures, adding additional groups of neurons requires increasing the length of the bus, while on the grid architecture, additional groups of neurons can be added without additional hardware. What makes this architecture unique is that it uses broadcasting instead of routing to distribute the spiking data. The broadcast-based routing also makes this architecture sensitive to extreme firing rates in a correlated area of the chip, which could overflow the buffers. SPINNAKER was designed as a network of fast custom neural simulators. Each neural simulator chip has 20 ARM processing cores, which are sufficient to simulate 1000 spiking neurons. Each chip is connected with six two-way links to six other neighboring chips to form an Ethernet-based ring-like network. The packets are routed using the chip ID and not the neuron ID to reduce the size of the routing table.
The digital neuron simulation used by SPINNAKER is quite different from the analog emulation of our hardware. The FACETS program is designing a wafer-scale system made of highly connected neuron blocks (up to 16 K inputs) called high-input count analog neural networks (HICANN). Each HICANN can communicate with adjacent HICANNs through a crossbar fabric. The fabric is divided into wire pairs, each of which carries events from 64 presynaptic neurons by serially transmitting 6-bit neuron numbers using AER. This system can also suffer spike-routability problems when a localized group of neurons has a high firing rate and must transmit these spikes across the chip. All three of these systems use a type of AER to transmit spiking information. An important difference between AER


and STM is that in STM the firing rate does not affect the routability of spikes. In AER, when the firing rate reaches a certain level, the buffers will start to fill up, and either spikes will have to be dropped or the chip will have to slow down the computation. This is because AER schemes are designed for an average expected firing rate. The firing rate directly affects power consumption. In STM, if the network can be routed in the allotted number of timeslots, then it is guaranteed that all the spikes will be routed. This is because, in STM, the routing fabric can handle the situation where all the neurons are firing at every time step. The downside is the constant switching of the routing fabric, which results in unnecessary power consumption. However, this is the very mechanism that allows spike transmission to be insensitive to the firing rate.

V. SCALABLE NEUROMORPHIC COMPILER

In order to program our neuromorphic hardware [10] for any desired neural network, the topology first has to be converted into a connectivity matrix or a graph representation. This matrix, along with statistics on the number of neurons and synapses, is provided as input to a neuromorphic compiler (Fig. 1). The neuromorphic compiler compiles the neural network structure description into: 1) an assignment of the network's neurons and synapses to hardware neurons and virtual (multiplexed) synapses and 2) an STM-compatible routing schedule with switch states for the neural fabric at each STM timeslot. For each neuron, the exact location on the neural fabric of the chip must be determined. This is the placement problem. The quality of neuron placement can affect the ability of the routing algorithm to efficiently find the needed synaptic pathways to cover all the synapses within an STM duty cycle.
For each synaptic pathway, a set of required grid lines from an output axon of the presynaptic neuron to an input dendrite of the postsynaptic neuron must be determined, and the switches along the way must be set to the ON state. This is the routing problem. Each synapse uniquely connects a presynaptic neuron and the corresponding postsynaptic neuron, so the synaptic pathway cannot share any intermediate segments with other synapses. However, when some synapses have the same common presynaptic neuron, they can share the same grid-line segments from the common presynaptic neuron. Hence, from this point of view, routing is a one-to-many problem. The goal of the neuromorphic compiler is to determine the location of each neuron and to assign synapses to each STM timeslot within an STM cycle. The assignment of synapses determines a set of ON switches for the fabric in each STM timeslot to ensure that the resulting circuits do not share any grid lines if they transmit signals from different presynaptic neurons. The quality of the solution is determined by the total number of STM timeslots needed to implement the neural architecture. The last component of the compilation process is to compress the switch states obtained from the routing algorithms in order to minimize the amount of digital memory needed to store them (Fig. 1). The complete neuromorphic compiler flow can be split into placement, routing, and configuration compression.
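The compiler's two outputs and the final compression step can be pictured with simple stand-in data structures. Everything here (the switch labels, and run-length encoding as the compressor) is an illustrative assumption, not the actual on-chip format:

```python
# Hypothetical sketch of the compiler's outputs: a neuron placement, a
# per-timeslot switch schedule, and a simple run-length compression of
# one switch's ON/OFF sequence across the STM cycle.

placement = {"n0": (0, 0), "n1": (0, 1), "n2": (1, 0)}  # neuron -> grid site

# schedule[t] is the set of fabric switches closed (ON) during timeslot t;
# the switch names are made-up labels for illustration.
schedule = [
    {"sw_0_0_E", "sw_0_1_W"},   # timeslot 0: e.g., route n0 -> n1
    {"sw_0_0_S", "sw_1_0_N"},   # timeslot 1: e.g., route n0 -> n2
]
num_timeslots = len(schedule)   # the solution-quality metric

def rle(bits):
    """Run-length encode a switch's per-timeslot bits as (bit, run) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

print(rle([1, 1, 1, 0]))  # [(1, 3), (0, 1)]
```

Since most switches stay in one state for long stretches of the cycle, even this naive compressor shrinks the digital memory footprint; the paper's actual compression scheme is not assumed here.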


The problem of routing and placement is closely related to problems in other programmable hardware, such as the FPGA. There are some interesting differences between the neuromorphic solution proposed in this paper and those designed for other programmable hardware, such as the FPGA. In such applications, most current algorithms [32]–[34] for placing and routing expect a single timeslot and, therefore, do not have to address the immense routing demands posed by the problem described in this paper. Unlike FPGA circuits, the neuromorphic hardware is expected to use every neuron device during routing. However, a study of FPGA architecture [35] shows that reconfigurable hardware with 100% device utilization results in almost a 200% routing-area increase due to congestion problems. A single neuromorphic chip is expected to house 10^6 neurons, while the largest FPGA in 2011, the Xilinx Virtex-6 LX760, contains less than 800 000 logic cells [36] and only utilizes 60% of them at any given time. The connectivity on typical neuromorphic chips is two orders of magnitude higher than that of current FPGAs. This prompted the development of a completely novel neuromorphic compiler that can scale to support large-scale network architectures in hardware. To achieve this, three key concepts were developed. First, an efficient and automatic method for generating neural architectures with small-world network-like topological properties [37] was developed to evaluate our algorithms. Second, a placement algorithm was developed that could deal with the extremely large connectivity. Finally, a parallel router was developed to enable routing of billions of synapses. The details of these concepts are presented in the following subsections.

A. Efficient Small-World Network Generation

A generic method was developed to create networks with small-world network topology to evaluate our neuromorphic compiler.
The motivation for this network topology comes from studies of neural architectures, which have shown that they are organized to exhibit small-world network-like topologies [8], [37]. A small-world network is a network defined by C (clustering coefficient), L (average minimum path length), and N (number of neurons), where the average minimum distance between two nodes grows logarithmically with the total number of nodes (i.e., L ∝ log(N)). By using this biologically plausible network topology, the improvements from the placement and routing algorithms can be clearly evaluated. It should be noted that, while only small-world networks are being used to evaluate the neuromorphic compiler, the compiler will work with any network and with any connectivity. This method will, given C, L, and N, produce a network with exactly N neurons and with both C and L within 0.01 of their targets when N is larger than 1 K. These networks are created in three steps: cluster creation, ring arrangement, and random edge addition. During cluster creation, 4L clusters are created with N/(4L) neurons each. Inside each cluster of n neurons, exactly C · (n(n − 1))/2 random edges are inserted, avoiding any duplicates. The number of edges was derived from the fact that a clique contains (n(n − 1))/2 edges, so the maximum number of edges occurs when the cluster is a clique (in this case, C would equal one).
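The cluster-creation step can be sketched as follows; this is a minimal sketch in which the function name and rounding choices are ours, and the subsequent ring-arrangement and random-edge-addition steps are omitted:

```python
import random

def make_clusters(n_neurons, C, L):
    """Cluster-creation step of the generator: 4*L clusters with
    n_neurons/(4*L) neurons each, where every cluster of n neurons
    receives C * n*(n-1)/2 random internal edges (no duplicates)."""
    num_clusters = round(4 * L)
    size = n_neurons // num_clusters
    clusters = []
    for c in range(num_clusters):
        nodes = list(range(c * size, (c + 1) * size))
        target = round(C * size * (size - 1) / 2)  # clique has n(n-1)/2 edges
        edges = set()
        while len(edges) < target:
            u, v = random.sample(nodes, 2)
            edges.add((min(u, v), max(u, v)))      # undirected, duplicate-free
        clusters.append(edges)
    return clusters
```

With C = 0.31 and L = 1.95, for example, this produces round(4 × 1.95) = 8 equally sized clusters, each with its prescribed number of random internal edges.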

During ring arrangement, the clusters are arranged in a ring and a single edge is used to connect each pair of neighboring clusters. At this point, the N and C values are correct, but L is larger than it should be. During random edge addition, random edges are inserted between the clusters until the desired L is achieved. As long as the number of edges added is much smaller than the number of initial edges in each cluster, a property we observed in all the small-world network examples, the added edges do not have a significant impact on C.

To evaluate this concept, small-world networks of 10^4 and 10^5 neurons were created to represent the Precuneus brain region (C = 0.31 and L = 1.95) [37]. The 10^4- and 10^5-neuron networks contain 2 million and 200 million synapses, respectively, and were targeted to be mapped onto a 100 × 100 and a 317 × 317 grid of neurons, respectively. In terms of density, the 10^5-neuron network is 20 times denser than a mouse-scale brain. We believe that this is the first efficient method for generating small-world networks with over a million neurons. Using these two small-world networks, both the placement and routing algorithms were evaluated as follows.

B. Neuron Placement

The goal of neuron placement is to find a unique location for each neuron such that the number of STM timeslots needed during the routing phase is minimized. The difficulty lies in the fact that, during the placement phase, the routing information is not known and is very difficult to calculate. Thus, evaluating a placement solution has to rely on heuristics. There are several methods to evaluate a placed solution, and each one has a specific use.
The length of the longest path determines the maximum operating frequency; the grid wirelength of all the synaptic pathways corresponds to the number of switches needed for routing; the amount of congestion corresponds to the number of STM timeslots needed to route the design; and hot-spot minimization corresponds to an even distribution of synapses across the STM timeslots. We chose to minimize the grid wirelength, which also resulted in a reduction of STM timeslots, since shorter synaptic pathways are easier to route.

These very same challenges are faced by electronic design automation (EDA) tools when performing placement for very-large-scale integration circuits. Using this knowledge, several established EDA algorithms were adapted for the placement phase of the neuromorphic compiler. The placement phase can be broken down into five steps: finding an optimization metric, analytic placement, diffusion-based smoothing, legalization, and simulated annealing. These are sorted not only by order of execution but also by amount of impact. The placement algorithm is summarized in Algorithm 1 and described in detail in the following subsections.

1) Placement Using Quadratic Wirelength Minimization: Analytic placement [33] (see steps 1–8 in Algorithm 1) generates the initial placement solution by converting the problem to a quadratic programming problem in which axonal projections, or synaptic pathways, are interpreted as springs and the neurons as connection points. The total potential energy of the springs, which is a quadratic function of


Algorithm 1 Pseudo-code for the placement algorithm; each step is further expanded upon in the text below

 1  placement(neural network N, hot spot map H) return placement P
 2    N′ = convert_synapses_to_neurological_star_model(N)
 3    P′ = solve_conjugate_gradient(N′, H)
 4    do P = P′
 5      P′ = perform_IO_optimization(P′)
 6      P′ = perform_cluster_reduction(P′)
 7      P′ = solve_conjugate_gradient(N′, H)
 8    while (wirelength(P′) < wirelength(P))
 9    P = perform_diffusion-based_smoothing(N′, P)
10    P = perform_quad-tree_legalization(N′, P)
11    P = perform_simulated_annealing(N′, P)
12    return P

their length, is minimized to produce a placement solution. If each synaptic pathway k connected exactly two neurons, n_i and n_j, then the total force would be represented by (1), where neuron n_i has position (x_i, y_i) and neuron n_j has position (x_j, y_j):

    Σ_{neurons n_i and n_j connected by synapse k}  w_k · [ (x_i − x_j)² + (y_i − y_j)² ].    (1)

A key feature of the brain is that each synaptic pathway fans out to more than two neurons. Fig. 7 provides an example of two possible solutions for converting a synaptic pathway into pair-wise springs given this one-to-many fanout requirement. The star model was chosen over the clique model for two reasons: 1) it more accurately represents the axonal structure of the brain, and 2) for biological networks, the clique model would have introduced too many edges. The overhead of introducing an extra node for each neuron is overshadowed by the reduction in edges from not using the clique model. Furthermore, this representation can be made more biologically accurate by reducing the weight on the axon and increasing the weights of the axonal terminals. Since this formulation is convex, it can be efficiently solved using the preconditioned conjugate gradient (CG) method [38].

When the standard formulation is applied to networks with brain-like connectivity, the resulting highly clustered solutions, as demonstrated in iteration 1 of Fig. 8, are not very useful. The extreme connectivity prevents the simple solution of evenly spreading the input/output (IO) neurons along the border of the chip, thus preventing a placement solution with good density. In this paper, the single-step analytic placement algorithm is transformed into an iterative one that alternates between four steps: IO assignment, spring force adjustment, CG solving, and synaptic pathway length analysis.
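The quadratic (spring) formulation can be sketched by assembling the weighted graph Laplacian of the spring network and solving two independent linear systems for the x- and y-coordinates of the movable neurons, with fixed IO neurons entering the right-hand side. This is a minimal illustration with made-up names, using a dense solver in place of the paper's preconditioned CG:

```python
import numpy as np

def quadratic_placement(n_free, springs, fixed_pos):
    """Minimize sum_k w_k * ||p_i - p_j||^2 over movable neurons
    0..n_free-1; springs are (i, j, w) where j may be a fixed IO
    neuron whose position is given in fixed_pos (illustrative API)."""
    A = np.zeros((n_free, n_free))
    bx = np.zeros(n_free)
    by = np.zeros(n_free)
    for i, j, w in springs:
        A[i, i] += w
        if j in fixed_pos:               # spring anchored at a fixed IO neuron
            bx[i] += w * fixed_pos[j][0]
            by[i] += w * fixed_pos[j][1]
        else:                            # spring between two movable neurons
            A[j, j] += w
            A[i, j] -= w
            A[j, i] -= w
    # The x and y systems decouple; at scale, preconditioned CG would
    # be used here instead of a dense solve.
    return np.linalg.solve(A, bx), np.linalg.solve(A, by)
```

For instance, a single movable neuron tied with equal weights to fixed IO neurons at (0, 0) and (10, 10) settles at the energy minimum (5, 5). The system is positive definite as long as every connected component touches at least one fixed neuron.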
In the IO assignment stage, after an initial placement with the IO neurons evenly spread out, the center of mass of all the neurons, p_all-neurons, is calculated. Then, for each IO neuron d, the center of mass of its connected neurons, p_d-neurons, is calculated, and the vector <p_all-neurons, p_d-neurons> projects the new position of the IO neuron. This is repeated to determine a new location for each IO neuron. For example, consider Fig. 9, where d is an IO neuron on a 10 × 10 chip; the vector <p_all-neurons, p_d-neurons> projects d to the new border location shown in the figure. This helps the IO neurons maximize the pulling force they exert on the neurons, helping spread the neurons away from the center of mass, but sometimes the density of the synapses (the clustering coefficient) is too large. In such cases, no matter how the IO neurons are arranged, there would be a ball of neurons in the middle of the chip. To further enable the spread of neurons, the spring force is scaled up by α (w′ = w·α) for synaptic pathways that are longer than average and scaled down by β (w′ = w/β) for synaptic pathways that are shorter than average. After these two adjustments (IO placement and spring force adjustment) are made, the CG problem is solved and the result is analyzed. If there is an improvement, the process is continued. Fig. 8 shows how the iterative steps help spread the neurons out on a million-neuron network. With these optimizations, the average synaptic path length is reduced by a factor of 3 compared to a random placement solution. This is a significant improvement because the average synaptic path length has a direct correlation with the number of STM timeslots required.

Fig. 8. Million-neuron cluster that is iteratively spread out during the analytic placement phase. This shows how a small number of additional iterations can greatly improve the placement solution.

Fig. 9. IO optimization example, where the red neuron is d, p_all-neurons is the center of mass of all the neurons, and p_d-neurons is the center of mass of the neurons connected to d. This example shows how the new location for d helps spread the center of mass further.

Fig. 10(a) (and iteration 8 in Fig. 8) shows that while analytic placement can provide a very good initial starting placement, the neurons are not completely spread out. The


Fig. 10. Small example shown as it is transformed through four stages of placement: (a) analytic, (b) diffusion-based smoothing, (c) legalization, and (d) simulated annealing.

Fig. 11. Placement results for small-world networks with 10^4, 10^5, and 10^6 neurons show an approximately 40% reduction in the average Manhattan length of the synaptic pathways.

subsequent placement steps help to correct this. After the network is placed using analytic placement, a common problem is the large amount of clustering that can occur in densely connected regions. In Fig. 10(a), the top left corner has a larger density than the rest of the graph. To reduce the density of these clusters, diffusion-based smoothing is applied.

2) Diffusion-Based Smoothing: The diffusion-based smoothing [39] (step 9 in Algorithm 1) may be viewed as preprocessing before the legalization stage. The algorithm starts with an overlapping global placement. It then adds forces based on the density of the layout in an iterative fashion to spread out the placement. A neuron migrates from an initial location to its final equilibrium location via a non-direct route. This route can be captured by a velocity function (2) that estimates the horizontal and vertical velocity of a neuron at every location in the circuit for a given time t:

    v^H_(x,y)(t) = − (∂d_(x,y)(t)/∂x) / d_(x,y)(t),
    v^V_(x,y)(t) = − (∂d_(x,y)(t)/∂y) / d_(x,y)(t),    (2)

where d_(x,y)(t) is the local placement density at (x, y).
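A discrete analogue of (2) can be sketched with finite differences on a density grid. This is an illustrative fragment with assumed names; the paper computes its gradients following [38]:

```python
import numpy as np

def velocity_field(density, eps=1e-9):
    """Discrete analogue of (2): v = -grad(d) / d on a 2-D density
    grid (illustrative sketch, not the method of [39])."""
    # np.gradient returns the axis-0 (rows, i.e., y) gradient first.
    dy, dx = np.gradient(density)
    vx = -dx / (density + eps)   # horizontal component v^H
    vy = -dy / (density + eps)   # vertical component v^V
    return vx, vy
```

As expected from (2), cells on either side of a density peak acquire velocities pointing away from the peak, and a sharper gradient yields a larger velocity.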

This velocity at a certain position and time is determined by the local density gradient and the density itself. Intuitively, a sharp density gradient causes cells to move faster. The method for calculating the gradients is presented in [38]. Fig. 10(b) shows the result when diffusion-based smoothing is applied to the result from Fig. 10(a). While the neurons are more spread out, they still do not align to the final grid.

3) Legalization: After the network is smoothed using diffusion-based smoothing, the neurons need to be assigned to their final locations on the grid. Legalization (step 10 in Algorithm 1) can be expressed as an online bipartite matching problem [34] in which each neuron has an edge to the possible locations on the chip.

Algorithm 2 Pseudo-code for the legalization algorithm; each step is further expanded upon in the text below

    legalize(neuron n, quad-tree QT)
 1    Q = closest_location_subquadrant(QT, n)
 2    if (Q.is_leaf)
 3      Q.fill = 0
 4      n.placement = Q
 5      return
 6    end-if
 7    legalize(n, Q)
 8    // Qs = the subquadrants of Q
 9    Q.fill = (Σs Qs.fill) / 4
10    Q.location = (Σs Qs.fill · Qs.location) / (4 · Q.fill)
11    return

By sorting the neurons by number of outgoing synapses, we can solve this problem online to find the best location for each neuron. We define the best location as one that minimizes the bounding box around the neurons connected to the axon terminals. To find the best location, the legalization algorithm, described in Algorithm 2, uses a quad-tree structure. For every level of the quad-tree, four evaluations are made to determine the best quadrant (line 1) using the quadrants' estimated locations. These locations are updated recursively to represent the center of mass of the free locations available inside each quadrant. When a neuron is allocated to a location, the estimated locations of all the corresponding quadrants are updated (line 10). Because grid wirelength evaluations are very expensive for highly connected neurons, the quad-tree structure can calculate the best legal placement for each neuron using only O(log4(|neurons|)) grid wirelength evaluations and the same number of quad-tree estimated-location updates. The results in Fig. 10(c) clearly show each neuron occupying a legal location.

4) Simulated Annealing: After the network is legalized, there is still an opportunity to improve the grid wirelength using simulated annealing. While the ideal location of each neuron is defined by its synapses, the problem is that this ideal location is usually occupied by another neuron. In order to minimize the total grid wirelength, simulated annealing (step 11 in Algorithm 1) attempts to move each neuron to its own ideal location, which creates a chain of moves. Once this chain intersects itself, a series of moves is generated that is guaranteed to reduce the grid wirelength as long as the neurons being moved are only connected to each other through the chain. Fig.
10(d) shows the amount of refinement that can be produced by simulated annealing compared to the results in Fig. 10(c). To evaluate the quality of our placement algorithm, we measured the average Manhattan length of each synaptic pathway, because the length of the minimum path between any two neurons on the chip is the Manhattan distance. Although this does not take congestion into account, the metric has a strong correlation with the number of timeslots. When evaluating our placement algorithm,


Fig. 11 shows that the average Manhattan length was reduced by 41%, 46%, and 38% on the 10^4-, 10^5-, and 10^6-neuron networks, respectively. The output of the placement algorithm, a placement file, is provided as input to the routing stage. The steps described in Algorithm 1 enable a large reduction in synaptic path length, which in turn gives the router a large reduction in runtime and in the number of timeslots required to implement any given neural architecture.
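The quad-tree bookkeeping behind the legalization step (Algorithm 2) can be sketched as follows. Class, method, and field names are our own; the sketch tracks a fill fraction and a free-cell center-of-mass estimate per quadrant, as the text describes:

```python
class Quad:
    """Minimal quad-tree node for legalization bookkeeping
    (illustrative structure, not the paper's implementation)."""
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size
        self.children = []
        if size > 1:
            h = size // 2
            self.children = [Quad(x, y, h), Quad(x + h, y, h),
                             Quad(x, y + h, h), Quad(x + h, y + h, h)]
        self.fill = 1.0                               # fraction of free cells
        self.location = (x + size / 2.0, y + size / 2.0)  # their center of mass

    def is_leaf(self):
        return not self.children

    def legalize(self, nx, ny):
        """Place a neuron whose ideal position is (nx, ny) in a
        nearby free cell, updating the estimates on the way up."""
        if self.is_leaf():
            self.fill = 0.0                           # cell now occupied
            return (self.x, self.y)
        # line 1 of Algorithm 2: pick the subquadrant whose estimated
        # free-cell center of mass is closest to the ideal position
        q = min((c for c in self.children if c.fill > 0),
                key=lambda c: (c.location[0] - nx) ** 2 +
                              (c.location[1] - ny) ** 2)
        placed = q.legalize(nx, ny)
        # lines 9-10: recompute this node's estimates from its children
        total = sum(c.fill for c in self.children)
        self.fill = total / 4.0
        if total > 0:
            self.location = (
                sum(c.fill * c.location[0] for c in self.children) / total,
                sum(c.fill * c.location[1] for c in self.children) / total)
        return placed
```

Each placement touches one node per tree level, giving the O(log4(|neurons|)) update cost noted in the text.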

Algorithm 3 Pseudo-code for the routing algorithm shows how the synaptic pathways are distributed among the timeslots

    route(Synapses S, Neurons N, Architecture A)
 1    while S ≠ ∅
 2      // minimum number of timeslots based on fan-in and fan-out
 3      fanin_restriction = max(count(s == (∗, n_i)) | s ∈ S and i = [1, N.size]) / A.fanin
 4      fanout_restriction = max(count(s == (n_i, ∗)) | s ∈ S and i = [1, N.size]) / A.fanout
 5      minTimeslot ← max(fanin_restriction, fanout_restriction)
 6      S = sort_increasing_length(S) // break ties w/ intersecting segment count
 7      ST = round_robin_assign_synapses_to_STM_timeslots(S)
 8      F = {}
 9      foreach STi ∈ ST
10        STi = sort_increasing_length(STi)
11        foreach s_j ∈ STi // s_j = {synapses w/ common presynaptic neuron j}
12          find shortest paths with available lanes for s_j using A*
13          if found
14            assign available lanes to s_j
15          else
16            F = F ∪ s_j
17          end-if
18        end-foreach
19      end-foreach
20      S = F
21    end-while

C. Scalable Synapse Routing

The goal of synapse routing is both to divide the synapses into the STM timeslots and to route the synapses within each timeslot. The approach is outlined in Algorithm 3. The first step is to estimate the number of STM timeslots (lines 3–5); the synapses are then assigned to the timeslots (lines 6–7), the synaptic paths are routed (line 12), and the process repeats for any paths that could not be routed (line 20). One of the most critical parts of the router is the assignment of synapses to timeslots, because a bad assignment can result in a large increase in the number of paths that cannot be routed. To avoid assigning synapses that would be difficult to route together, an efficient congestion prevention algorithm was created, which reduces the number of intersecting segments [12] assigned to each timeslot. The other critical part of the routing algorithm is the A* router [40], which is responsible for routing each synaptic pathway. A* has been used extensively for FPGA routing [18] and allows the cost function to be adjusted to avoid congested areas. The example in Fig. 12 shows the result of the routing algorithm. In this example, minTimeslot (Algorithm 3) was estimated to be four, but not all the synaptic pathways could be routed in four timeslots, resulting in the allocation of two additional timeslots.

In order for the router to scale beyond 10^8 synapses, the algorithm was parallelized by distributing the routing of the synapses inside each STM timeslot across a 736-CPU-core cluster. A 47X speedup was observed when routing the 106 timeslots of the 10^4-neuron network, and a 276X speedup was observed when routing the 1645 timeslots of the 10^5-neuron network. This speedup enabled routing of the 10^5-neuron, 200-million-synapse network in 1.26 days instead of a year. This could be further reduced to less than a few hours using a CPU/GPU cluster.
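A congestion-aware A* search over the switch grid can be sketched as follows. The grid model, the congestion weight w, and the function names are illustrative assumptions, not the paper's implementation:

```python
import heapq

def astar_route(occupied, size, src, dst, congestion, w=2.0):
    """Shortest Manhattan path from src to dst on a size x size grid,
    skipping cells in `occupied` (lanes already taken this timeslot)
    and penalizing congested cells (illustrative sketch)."""
    def h(p):                       # admissible Manhattan heuristic
        return abs(p[0] - dst[0]) + abs(p[1] - dst[1])

    frontier = [(h(src), 0.0, src, [src])]   # (f, g, cell, path)
    seen = set()
    while frontier:
        _, g, cur, path = heapq.heappop(frontier)
        if cur == dst:
            return path
        if cur in seen:
            continue
        seen.add(cur)
        x, y = cur
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < size and 0 <= ny < size \
                    and (nx, ny) not in occupied and (nx, ny) not in seen:
                # unit step cost plus a penalty for congested cells
                ng = g + 1 + w * congestion.get((nx, ny), 0)
                heapq.heappush(frontier,
                               (ng + h((nx, ny)), ng, (nx, ny),
                                path + [(nx, ny)]))
    return None   # unroutable this timeslot; defer to the next one
```

Returning `None` corresponds to adding the synapse to the failed set F in Algorithm 3, to be retried in a later timeslot.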
It is important to note that the compilation is required only once, on an off-line basis, to instantiate the structure of the neural architecture; synaptic learning occurs during normal chip operation. If a single-threaded version of the router is considered, the improvements can be summarized through Fig. 13. In this figure, "# TS" represents the number of STM timeslots, "Runtime" describes the total runtime of the routing phase, and "Initial Unrouted" represents the number of synapses that could not be routed after the first estimate of the number of STM timeslots. With these modifications, the first truly scalable router of neural networks onto an STM-compatible neural fabric has been realized.
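The parallelization strategy, routing each STM timeslot's synapses independently, can be sketched with a worker pool. The worker body and data shapes are our illustrative assumptions; the paper distributes this work across a 736-core cluster, while a thread pool here merely illustrates the map structure:

```python
from concurrent.futures import ThreadPoolExecutor

def route_timeslot(timeslot):
    """Illustrative worker: route the synapses of one STM timeslot.
    Timeslots are independent, which is what makes distribution
    across cluster nodes possible."""
    slot_id, synapses = timeslot
    routed = [(src, dst) for src, dst in synapses]  # stand-in for the A* router
    return slot_id, routed

def route_all(timeslots, workers=4):
    # Map timeslots onto workers and collect per-timeslot results.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return dict(ex.map(route_timeslot, timeslots))
```

Because no state is shared between timeslots, the speedup scales with the number of timeslots until per-slot routing time dominates, consistent with the 47X and 276X figures reported above.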



Fig. 12. Routing solution for a 25-neuron network on a 5 × 5 STM fabric requires six timeslots to route all the synapses. The first two timeslots are shown. Yellow nodes are input neurons, red are inhibitory, and green are excitatory. The black lines are the routed synaptic pathways, and the red dots are the switches that are closed.

D. Compression of Configuration Data

The large number of neural elements combined with the large number of timeslots requires an equally large amount of memory to store the switch states, or configuration data. For example, a 1-million-neuron chip with 1000 STM timeslots will require over 20 GB of storage. To reduce this


Fig. 13. Reductions in the number of timeslots, runtime, and number of initially unrouted synapses for three routing solutions: baseline without placement, with placement, and with placement and congestion sorting.

requirement, two popular compression techniques were evaluated: Huffman coding [41] and Lempel–Ziv–Welch (LZW) [42]. Each STM timeslot of the neuromorphic fabric is represented by a bitstream of 1s and 0s: if a switch is closed, the corresponding bit in the bitstream is a "1"; otherwise it is a "0." As shown in Fig. 14, with about 20% switch utilization, most of the bitstream is 0s, and the 1s are grouped in heavily congested areas. The Huffman coding algorithm breaks the input into fixed-length pieces and determines the optimal variable-length code for each piece. The encoding is used to transform the original bitstream into the compressed bitstream. This algorithm was able to compress the bitstream for the whole STM design by 5.1X (Fig. 14). The LZW algorithm works similarly to Huffman coding but can split the input into variable-length pieces while determining the best encoding for them. By using variable-length splitting, a much higher compression ratio can be achieved; specifically, it compresses the bitstream for the whole STM design by 10X. For both algorithms, the encoding step is performed by the neuromorphic compiler, which generates a dictionary and an encoded bitstream. The decoding, which is performed on chip, would employ hardware to first read in the dictionary and then use the dictionary to decode the bitstream.
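The LZW scheme can be sketched as a textbook encoder/decoder pair over a binary alphabet. This illustrates the dictionary-based variable-length splitting evaluated in the paper, not its hardware implementation:

```python
def lzw_encode(bits):
    """Textbook LZW over the alphabet {"0", "1"}; returns dictionary
    codes (a sketch of the evaluated scheme, names are ours)."""
    table = {"0": 0, "1": 1}
    out, cur = [], ""
    for b in bits:
        if cur + b in table:
            cur += b                       # extend the current phrase
        else:
            out.append(table[cur])
            table[cur + b] = len(table)    # grow the dictionary
            cur = b
    if cur:
        out.append(table[cur])
    return out

def lzw_decode(codes):
    """Rebuild the dictionary on the fly, as the on-chip decoder would."""
    table = {0: "0", 1: "1"}
    prev = table[codes[0]]
    out = [prev]
    for c in codes[1:]:
        # c may reference the phrase being defined (the classic edge case)
        entry = table[c] if c in table else prev + prev[0]
        out.append(entry)
        table[len(table)] = prev + entry[0]
        prev = entry
    return "".join(out)
```

Sparse, clustered bitstreams like those in Fig. 14 compress well because long runs of 0s quickly become single dictionary entries.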

E. Complete Neuromorphic Compiler

In terms of all of these metrics, it is clear from Fig. 13 that placement provides the largest improvement. This is even more evident in Fig. 15, which shows the congestion maps of the placed and unplaced 10^4-neuron network. The congestion maps are calculated by summing the activated switches over all the timeslots. The hotter regions represent areas through which many synapses are routed. In the unplaced version, as expected, most synapses are routed through the center, while in the placed solution the congestion of each cluster is more evenly distributed. This congestion information is then fed through the whole flow to minimize the hot spots and thereby reduce the number of STM timeslots required to implement all the synaptic pathways in a given neural architecture. In this paper, we present the first complete and scalable method for mapping and routing neural networks onto an STM neural fabric.
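The congestion-map calculation described above is a simple reduction over the per-timeslot switch bitmaps; a minimal sketch (array shapes are illustrative):

```python
import numpy as np

def congestion_map(switch_states):
    """Sum the closed-switch bitmaps over all STM timeslots to get a
    per-location congestion count, as used for Fig. 15.
    switch_states: (timeslots, rows, cols) array of 0/1 switch bits."""
    return np.asarray(switch_states).sum(axis=0)
```

The resulting 2-D array is what gets color-coded: larger values are the "hotter" regions through which more synaptic pathways are routed.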

Fig. 14. Switch settings for a specific STM timeslot. Red dots represent switches that are closed.

Fig. 15. Effect of placement on routing congestion for the 104 neuron network is shown here. The color coding reflects congestion or the number of switches that are utilized in the neural fabric during routing for the entire STM duty cycle.

Fig. 16. Modified nodal element design in the analog core to accommodate axonal delays in our architecture.

VI. DISCUSSION

The neural architecture described thus far assumes the presence of spiking neurons and synapses that can be modified by STDP. To realize more realistic biological network architectures, it is necessary to model other neuronal and synaptic mechanisms, such as more sophisticated neuronal dynamics [43]–[47], homeostatic plasticity [48], [49], short-term plasticity [50], [51], and receptor kinetics [52], [53]. The interesting aspect of our proposed programming front-end is that it is completely impervious to these details of the architecture. The analog core (see Fig. 1) is primarily affected by these inclusions, where circuitry needs to be designed to perform additional neuronal and synaptic computations. Similarly, the analog memory (see Fig. 1) may also need to store more states, such as the short-term plasticity state or the receptor kinetic state for various receptor types (such as AMPA, GABA, and NMDA found in biology [52]).

Given a desired neural model architecture to emulate, the neuromorphic compiler derives the total number of STM timeslots, the STM cycle, and the quality of the routing and placement solution for each STM timeslot. Increasing the number of steps


per STM cycle implies that the system will have to operate at much faster clock speeds, but this can strain the circuit design in terms of the time to configure the hardware, the time to perform neuronal and synaptic computations, and the time to read from and write to analog memory. It also has a negative effect on the amount of power consumed. Therefore, there is always a trade-off to be made between the duration of the STM cycle and the number of steps per STM cycle to ensure that the desired scalability and connectivity can be realized without exceeding the total power budget for the architecture.

Similarly, the design of the neural fabric has implications for the feasibility of emulating any neural architecture. For example, the total number of grid lines (and hence switches) that can be realized depends on the available space on the chip. An upper limit on the available number of grid lines has an impact on the connectivity that can be realized within a given STM cycle. This leads back to the trade-off outlined above between the STM cycle, the number of STM timeslots, and the quality of the routing and placement solution.

There are several improvements that can still be made to the neuromorphic compiler. These improvements fall into five categories: axonal delays, placement, routing, compression, and creating a feedback mechanism with a neural simulator. The neural fabric is being adapted to handle axonal delays within the chip. Axonal delays are an important feature that seems to play a key role in the formation of neuronal groups and memory [14], [15]. In particular, the nodal element is modified to delay the spike output from the neuron by discrete amounts of delay, as shown in Fig. 16. The neuromorphic compiler can be programmed at each STM timeslot to switch ON the appropriate delay channel.
Currently, the placement algorithm is designed to handle a single chip, but once the chip is scaled up to a multi-chip architecture, the issue of communication delays introduced by routing spikes between chips will have to be addressed. To handle this, the placement algorithm will have to not only optimize inter-chip communication but also maintain correct axonal delay functionality (e.g., only routing synapses between chips that have a large axonal delay).

The routing implementation presented in Section V will not scale, in terms of performance, to the amount of routing needed for multi-chip designs. In our current implementation, for a network with 10^6 neurons and 10^8 synapses, routing accounts for 95% of the total runtime. We plan to reduce the runtime by first switching over to a multi-GPU-based routing algorithm and second by reducing the amount of congestion in each STM timeslot. One way to reduce the congestion would be to perform several iterations of the Pathfinder [54] algorithm with all the synapses assigned to the same STM timeslot. This would quickly highlight which synaptic pathways could not be routed in a single timeslot. If we create a graph G, where each synapse is a node and an edge exists between two synapses that were routed through the same hardware resource, then the routing problem reduces to the NP-hard graph coloring problem of finding the minimum number of colors to color the nodes of G such that connected nodes have different colors. In this formulation, the number


of STM timeslots in an STM cycle is equal to the number of colors needed to color G. Using this formulation, several approximation algorithms and heuristics can be evaluated to reduce congestion.

While improved routing will reduce our reliance on compression, it will not completely eliminate the huge amount of configuration data that is generated. Currently, the compression schemes described earlier optimize each STM timeslot independently, but routing hot spots remain fixed throughout many consecutive timeslots. In a future version of the neuromorphic compiler, the configuration data will be viewed as a 3-D array instead of a list (1-D array). This allows the evaluation of video compression techniques that can better handle the switch correlations between consecutive STM timeslots.

The final improvement, adding a neural simulator feedback loop, is one that has never been considered in previous studies. By feeding the placement solution to the neural simulator of the desired neural model or architecture [55], the simulator can greatly improve the simulation runtime by making a better partition of the network for improved parallelization of the simulation. Conversely, by feeding the neural simulator data to the neural compiler, the synaptic conductance data can be used in the analytic placement phase to give synapses with larger conductances shorter path lengths, thus ensuring more reliable transmission. The simulator can also provide firing-rate information for each neuron, which the compiler can use to favor splitting synapses fed by slowly firing neurons across chips. This would result in not only a faster simulation but also one that uses much less power. These five improvements will help the neuromorphic compiler further optimize the single-chip solution presented in this paper and also address the challenges of multi-chip designs of the future.
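The timeslot-minimization view above, coloring a conflict graph G of synapses that share hardware resources, can be sketched with a standard greedy heuristic. The heuristic choice (highest-degree first) is our assumption; the paper only proposes evaluating such heuristics:

```python
def greedy_timeslots(conflicts):
    """Greedy coloring of the synapse conflict graph G.
    conflicts: dict mapping each synapse id to the set of synapses it
    shares a hardware resource with (the edges of G). Returns a
    synapse -> timeslot (color) map; conflicting synapses always land
    in different timeslots (illustrative sketch)."""
    # Color high-degree (hard-to-place) synapses first, a common heuristic.
    order = sorted(conflicts, key=lambda s: len(conflicts[s]), reverse=True)
    slot = {}
    for s in order:
        used = {slot[t] for t in conflicts[s] if t in slot}
        c = 0
        while c in used:          # smallest timeslot not used by a neighbor
            c += 1
        slot[s] = c
    return slot
```

The number of distinct colors used is an upper bound on the number of STM timeslots needed in this formulation; since graph coloring is NP-hard, such heuristics trade optimality for runtime.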
Overall, this paper has shown that most standard techniques [32]–[34] cannot be used out of the box, due either to the quality of the solutions produced or to the runtime required to process graphs that have millions of vertices connected by hyperedges with degrees in the thousands. These techniques can be enhanced, as we describe in this paper, to provide efficient and effective solutions. Although most current circuit designs and the related tools are quite different from neural networks, with the increase in FPGA routing density [36] and the development of commercial time-multiplexed FPGAs [56], the work in this paper also has significant potential benefit to the FPGA community.

VII. CONCLUSION

We have presented a neuromorphic compiler that provides an automated approach to translating a neural model into a neuromorphic system implementation. In particular, the solution offered by the compiler leverages STM to enable large-scale neural architectures in traditional CMOS. We provided the details of the algorithms and examples of the various features of the compiler for the design of large-scale neural architectures. Future extensions to address larger scales and applications were discussed. This approach offers


a novel method to address the challenges of scalability and connectivity and, thus, paves the way for programming large-scale neural architectures in hardware.

ACKNOWLEDGMENT

The authors would like to thank P. Petre for contributions to the architecture concepts.

REFERENCES

[1] DARPA SyNAPSE Broad Agency Announcement (BAA) [Online]. Available: http://www.fbo.gov/spg/ODA/DARPA/CMO/BAA0828/listing.html
[2] J. Bailey and D. Hammerstrom, "Why VLSI implementation of associative VLCNs require connection multiplexing," in Proc. IEEE Int. Conf. Neural Netw., San Diego, CA, Jul. 1988, pp. 173–180.
[3] D. B. Strukov and K. K. Likharev, "Prospects for terabit-scale nanoelectronic memories," Nanotechnology, vol. 16, no. 1, pp. 137–148, 2005.
[4] S. Jo, T. Chang, I. Ebong, B. Bhavitavya, P. Mazumder, and W. Lu, "Nanoscale memristor device as synapse in neuromorphic systems," Nano Lett., vol. 10, no. 4, pp. 1297–1301, 2010.
[5] C. Gao and D. Hammerstrom, "Cortical models onto CMOL and CMOS architectures and performance/price," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 11, pp. 2502–2515, Nov. 2007.
[6] J. Partzsch and R. Schuffny, "Analyzing the scaling of connectivity in neuromorphic hardware and in models of neural networks," IEEE Trans. Neural Netw., vol. 22, no. 6, pp. 919–935, Jun. 2011.
[7] G. Indiveri, B. Linares-Barranco, T. J. Hamilton, A. V. Schaik, R. Etienne-Cummings, T. Delbruck, S.-C. Liu, P. Dudek, P. Häfliger, S. Renaud, J. Schemmel, G. Cauwenberghs, J. Arthur, K. Hynna, F. Folowosele, S. Saighi, T. Serrano-Gotarredona, J. Wijekoon, Y. Wang, and K. Boahen, "Neuromorphic silicon neuron circuits," Frontiers Neurosci., vol. 5, no. 73, pp. 1–23, 2011.
[8] G. Buzsaki, Rhythms of the Brain. New York: Oxford Univ. Press, 2006.
[9] H. Veendrick, Nanometer CMOS ICs: From Basics to ASICs. New York: Springer-Verlag, 2008.
[10] J. M. Cruz-Albrecht, M. Yung, and N. Srinivasa, "Energy-efficient neuron, synapse and STDP integrated circuits," IEEE Trans. Biomed. Circuits Syst., vol. PP, no. 99, p. 1, DOI: 10.1109/TBCAS.2011.2174152, 2012, to be published.
[11] N. Srinivasa and J. M. Cruz-Albrecht, "Analog learning systems of neuromorphic adaptive plastic scalable electronics," IEEE Pulse, vol. 3, no. 1, pp. 51–56, Jan.–Feb. 2012.
[12] H. Markram, J. Lubke, M. Frotscher, and B. Sakmann, "Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs," Science, vol. 275, pp. 213–215, 1997.
[13] G. Bi and M. Poo, "Activity-induced synaptic modifications in hippocampal culture: Dependence on spike timing, synaptic strength and cell type," J. Neurosci., vol. 18, no. 24, pp. 10464–10472, 1998.
[14] E. Izhikevich, "Polychronization: Computation with spikes," Neural Comput., vol. 18, no. 2, pp. 245–282, 2006.
[15] E. Izhikevich, J. Gally, and G. Edelman, "Spike-timing dynamics of neuronal groups," Cerebral Cortex, vol. 14, no. 8, pp. 933–944, 2004.
[16] T. P. Vogels, K. Rajan, and L. F. Abbott, "Neural network dynamics," Annu. Rev. Neurosci., vol. 28, pp. 357–376, Jul. 2005.
[17] H. Chen, S. Saighi, L. Buhry, and S. Renaud, "Real-time simulation of biologically realistic stochastic neurons in VLSI," IEEE Trans. Neural Netw., vol. 21, no. 9, pp. 1511–1517, Sep. 2010.
[18] A. Sharma and S. Hauck, "Accelerating FPGA routing using architecture-adaptive A* techniques," in Proc. IEEE Int. Field-Program. Technol. Conf., Dec. 2005, pp. 225–232.
[19] K. H. Kim, S. Gaba, D. Wheeler, J. Cruz-Albrecht, T. Hussain, N. Srinivasa, and W. Lu, "A functional hybrid memristor crossbar-array/CMOS system for data storage and neuromorphic applications," Nano Lett., vol. 12, no. 1, pp. 389–395, 2012.
[20] D. Wheeler, K.-H. Kim, S. Gaba, E. Wang, S. Kim, I. Valles, J. Li, Y. Royter, J. Cruz-Albrecht, T. Hussain, W. Lu, and N. Srinivasa, "CMOS-integrated memristors for neuromorphic architectures," in Proc. Int. Semicond. Device Res. Symp., Dec. 2011, pp. 1–2.
[21] P. Merolla, J. Arthur, B. E. Shi, and K. Boahen, "Expandable networks for neuromorphic chips," IEEE Trans. Circuits Syst. I, vol. 54, no. 2, pp. 301–311, Feb. 2007.

[22] M. Khan, D. Lester, L. Plana, A. Rast, X. Jin, E. Painkras, and S. Furber, “SpiNNaker: Mapping neural networks onto a massively-parallel chip multiprocessor,” in Proc. IEEE Int. Joint Conf. Neural Netw. World Congr. Comput. Intell., Jun. 2008, pp. 2849–2856. [23] J. Schemmel, A. Grubl, K. Meier, and E. Muller, “Implementing synaptic plasticity in a VLSI spiking neural network model,” in Proc. Int. Joint Conf. Neural Netw., 2006, pp. 1–6. [24] J. Schemmel, D. Bruderle, A. Grubl, M. Hock, K. Meier, and S. Millner, “A wafer-scale neuromorphic hardware system for large-scale neural modeling,” in Proc. IEEE Int. Symp. Circuits Syst., May–Jun. 2010, pp. 1947–1950. [25] R. Serrano-Gotarredona, M. Oster, P. Lichtsteiner, A. Linares-Barranco, R. Paz-Vicente, F. Gomez-Rodriguez, L. Camunas-Mesa, R. Berner, M. Rivas-Perez, T. Delbruck, S.-C. Liu, R. Douglas, P. Hafliger, G. JimenezMoreno, A. C. Ballcels, T. Serrano-Gotarredona, A. J. Acosta-Jimenez, and B. Linares-Barranco, “CAVIAR: A 45k neuron, 5M synapse, 12G connects/s AER hardware sensory-processing-learning-actuating system for high-speed visual object recognition and tracking,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1417–1438, Sep. 2009. [26] G. Indiveri and T. K. Horiuchi, “Frontiers in neuromorphic engineering,” Frontiers Neurosci., vol. 5, no. 118, pp. 1–2, 2011. [27] M. Mahowald, “VLSI analogs of neuronal visual processing: A synthesis of form and function,” Ph.D. dissertation, Dept. Comput. Neural Syst., California Inst. Technol., Pasadena, 1992. [28] K. Boahen, “A burst-mode word-serial address-event link-I: Transmitter design,” IEEE Trans. Circuits Syst. I, vol. 51, no. 7, pp. 1269–80, Jul. 2004. [29] K. Boahen, “A burst-mode word-serial address-event link-II: Receiver design,” IEEE Trans. Circuits Syst. I, vol. 51, no. 7, pp. 1281–1291, Jul. 2004. [30] K. Boahen, “Point-to-point connectivity between neuromorphic chips using address events,” IEEE Trans. Circuits Syst. II, vol. 47, no. 5, pp. 
416–34, May 2000. [31] G. Indiveri, A. M. Whatley, and J. Kramer, “A reconfigurable neuromorphic VLSI multichip system applied to visual motion computation,” in Proc. 7th Int. Conf. Microelectron. Neural, Fuzzy Bio-Inspired Syst., 1999, pp. 37–44. [32] J. Luu, I. Kuon, P. Jamieson, T. Campbell, A. Ye, W. Fang, and J. Rose, “VPR 5.0: FPGA cad and architecture exploration tools with singledriver routing, heterogeneity and process scaling,” in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, 2009, pp. 1–10. [33] N. Viswanathan, M. Pan, and C. Chu, “FastPlace 3.0: A fast multilevel quadratic placement algorithm with placement congestion control,” in Proc. Asia South Pacific Design Autom. Conf., Jan. 2007, pp. 135–140. [34] G. Padmini, X. Li, and L. Pileggi, “Architecture-aware FPGA placement using metric embedding,” in Proc. 43rd Annu. Design Autom. Conf., 2006, pp. 1–6. [35] A. DeHon, “Balancing interconnect and computation in a reconfigurable computing array,” in Proc. ACM/SIGDA 7th Int. Symp. Field Program. Gate Arrays, 1999, pp. 1–10. [36] Virtex6 Datasheet [Online]. Available: http://www.xilinx.com/ publications/prod_mktg/Virtex6_Product_Table.pdf [37] S. Achard, R. Salvador, B. Whitcher, J. Suckling, and E. Bullmore, “A resilient, low-frequency, small-world human brain functional network with highly connected association cortical hubs,” J. Neurosci., vol. 26, pp. 63–72, Jan. 2006. [38] J. Dongarra, A. Lumsdaine, R. Pozo, and K. Remington, “A sparse matrix library in C++ for high performance architectures,” in Proc. 2nd Object Oriented Numerics Conf., 1994, pp. 214–218. [39] H. Ren, D. Z. Pan, C. J. Alpert, and P. Villarrubia, “Diffusion-based placement migration,” in Proc. 42nd Annu. Design Autom. Conf., 2005, pp. 1–6. [40] P. Hart, N. Nilsson, and B. Raphael, “A formal basis for the heuristic determination of minimum cost paths,” IEEE Trans. Syst. Sci. Cybern., vol. 4, no. 2, pp. 100–107, Jul. 1968. [41] D. A. 
Huffman, “A method for the construction of minimum-redundancy codes,” Proc. IRE, vol. 40, no. 9, pp. 1098–1102, Sep. 1952. [42] T. Welch, “A technique for high-performance data compression,” IEEE Comput., vol. 17, no. 6, pp. 8–19, Jun. 1984. [43] A. L. Hodgkin and A. F. Huxley, “A quantitative description of membrane current and application to conduction and excitation in nerve,” J. Phys., vol. 117, no. 4, pp. 500–544, 1952. [44] E. Izhikevich, “Which model to use for cortical spiking neurons?” IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1063–1070, Sep. 2004. [45] R. FitzHugh, “Impulses and physiological states in theoretical models of nerve membrane,” Biophys. J., vol. 1, no. 6, pp. 445–466, 1961.


[46] R. Brette and W. Gerstner, “Adaptive exponential integrate-and-fire model as an effective description of neuronal activity,” J. Neurophysiol., vol. 94, no. 5, pp. 3637–3642, 2005.
[47] C. Bartolozzi and G. Indiveri, “Synaptic dynamics in analog VLSI,” Neural Comput., vol. 19, no. 10, pp. 2581–2603, 2007.
[48] G. Turrigiano and S. Nelson, “Homeostatic plasticity in the developing nervous system,” Nature Rev. Neurosci., vol. 5, no. 2, pp. 97–107, 2004.
[49] C. Bartolozzi and G. Indiveri, “Global scaling of synaptic efficacy: Homeostasis in silicon synapses,” Neurocomputing, vol. 72, nos. 4–6, pp. 726–731, 2009.
[50] G. Fuhrmann, H. Markram, and M. Tsodyks, “Spike frequency adaptation and neocortical rhythms,” J. Neurophysiol., vol. 88, no. 2, pp. 761–770, 2002.
[51] M. Boegerhausen, P. Suter, and S.-C. Liu, “Modeling short-term synaptic depression in silicon,” Neural Comput., vol. 15, no. 2, pp. 331–348, 2003.
[52] S. H. Wu, C. L. Ma, and J. B. Kelly, “Contribution of AMPA, NMDA, and GABA(A) receptors to temporal pattern of postsynaptic responses in the inferior colliculus of the rat,” J. Neurosci., vol. 24, no. 19, pp. 4625–4634, 2004.
[53] K. M. Hynna and K. Boahen, “Neuronal ion-channel dynamics in silicon,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2006, pp. 3614–3617.
[54] L. McMurchie and C. Ebeling, “PathFinder: A negotiation-based performance-driven router for FPGAs,” in Proc. 3rd Int. ACM Symp. Field Program. Arrays, 1995, pp. 111–117.
[55] M. Hines and N. Carnevale, “NEURON: A tool for neuroscientists,” Neuroscientist, vol. 7, no. 2, pp. 123–135, 2001.
[56] T. R. Halfhill (2010, Mar.). Tabula’s Time Machine. Reed Electronics Group, Beijing, China [Online]. Available: http://www.tabula.com/news/M11_Tabula_Reprint.pdf

Kirill Minkovich received the B.A. degree from the University of California, Berkeley, in 2003, and the M.S. and Ph.D. degrees from the University of California, Los Angeles, in 2006 and 2010, respectively, all in computer science. He joined HRL Laboratories LLC, Malibu, CA, as a Post-Doctoral Fellow in 2010. He is currently a Research Staff Scientist with the Information and Systems Sciences Laboratories. His current research interests include synthesis for nanoscale architectures, hardware-based acceleration of synthesis algorithms, evaluation of novel architectures, and large-scale simulation of neural networks.

Narayan Srinivasa (M’00–SM’12) received the Ph.D. degree in mechanical engineering from the University of Florida, Gainesville, in 1994. He was a Beckman Fellow with the Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, from 1994 to 1997. At the Beckman Institute, he was a member of the Human Computer Intelligent Interaction Group and worked in the areas of computer vision, robotics, and manufacturing. In 1998, he joined the Information and System Sciences Department, HRL Laboratories LLC, Malibu, CA, where he is currently a Principal Research Scientist and a Manager of the Center for Neural and Emergent Systems. He is currently the Program Manager and the Principal Investigator for two large multidisciplinary projects funded by DARPA, SyNAPSE and Physical Intelligence, which attempt to develop a theoretical foundation inspired by brain science and physics to engineer electronic systems that exhibit intelligence. He has published over 80 technical publications and holds 28 issued U.S. patents with several patents pending. His current research interests include learning, perception, adaptive controls, and evolutionary dynamics.


Dr. Srinivasa is a member of the International Neural Networks Society and the American Association for the Advancement of Science and is on the editorial boards of Neural Networks and Bio-Inspired Cognitive Architecture journals. He has received numerous awards, including the HRL New Inventor Award, the GM Most Valuable Colleague Award, the HRL Distinguished Inventor Awards, the HRL Outstanding Team Award, and the HRL Chairman Award.

Jose M. Cruz-Albrecht (S’88–M’97) received the B.S. and M.S. degrees in physics from the University of Seville, Seville, Spain, in 1987 and 1989, respectively, the M.Eng. degree (with high distinction) from the University of Leuven, Leuven, Belgium, in 1990, and the M.S. and Ph.D. degrees in electrical engineering and computer sciences from the University of California, Berkeley, in 1992 and 1996, respectively. He was with the National Center of Microelectronics, Seville, the ESAT-MICAS Laboratory, K. U. Leuven, Heverlee, Belgium, the Electronics Research Laboratory, University of California, Berkeley, and with industry research laboratories. Since 2002, he has been with HRL Laboratories, Malibu, CA, where he is currently a Senior Research Staff Engineer. He has been involved in the architecture, design, modeling, and testing of analog chips, including cellular neural networks, chaotic circuits, high-speed ADCs and DACs, time-encoding circuits, amplifier circuits, and biologically inspired neural circuits, in CMOS, HBT, and GaN technologies. He is currently the Lead of microelectronics hardware tasks for two DARPA programs, SyNAPSE and Physical Intelligence. He has managed multiple projects at HRL in the areas of high-speed electronics, neural circuits, nonlinear dynamics circuits, and power digital-to-analog converters. He has co-authored articles in two edited technical books and has co-authored about 20 papers in journals and conference proceedings. He holds over 30 issued patents. His current research interests include the integrated circuit implementation of neural circuits and the design and analysis of nonlinear analog circuits. Dr. Cruz-Albrecht has been on the Program Committee of the IEEE Custom Integrated Circuit Conference and the Biologically Inspired Cognitive Architectures Conference. He has won several awards, including the HRL New Inventor Award, HRL Distinguished Inventor Awards, and HRL Outstanding Team Award.

Youngkwan Cho received the B.S. and M.S. degrees in computer science from Seoul National University, Seoul, South Korea, and the Ph.D. degree in computer science from the University of Southern California, Los Angeles. He is a Research Staff Member with the Information and System Sciences Department, HRL Laboratories LLC, Malibu, CA. His current research interests include neuroscience, virtual and augmented reality, advanced 3-D object rendering, object modeling, visualization, computer vision, and image processing. He has served on the industry advisory boards of universities.

Aleksey Nogin received the Ph.D. degree in computer science from Cornell University, Ithaca, NY, in 2002. He joined the Information and System Sciences Department, HRL Laboratories LLC, Malibu, CA, in 2006, where he is currently a Research Staff Computer Scientist. His current research interests include computer architectures, cyber security, formal models, and automated reasoning.
