

#### ENABLING ULTRA-LOW LATENCY EDGE PROCESSING WITH SILICON PHOTONICS

Johannes Feldmann – Salience Labs

Ultra-low latency edge processing with silicon photonics




# WHO, WHAT & WHY?



# WHO WE ARE

- Photonic computing and signal processing
- Founded in 2021
- Research from Oxford and Münster University
- Team of 18
- Based in Oxford, UK
- Ultra-low latency AI inference





#### THE TEAM



- Vaysh, CEO (McKinsey & Company)
- Johannes, CTO
- Chris, VP of Eng (Graphcore)
- Enzo, Chief Architect (Huawei)
- Andy, SW Architect
- Joanna, CFO
- Mark, Senior SW Eng
- Jaganath, Principal Analog Eng (Intel)
- Nat, ML Eng (University of Cambridge)
- Andres, Principal Hardware Eng
- Nick, ML Eng (Samsung)
- Rob, Ops Manager
- Yi-Ling, ML Eng (Imperial College)
- Vasileios, Photonics Eng (Southampton)
- Gary, Photonics Eng (Optalysys)
- Lakshmi, Lead Verification Eng (Cadence)
- Olufemi, Lead RTL Eng (Xilinx)
- Javaid, FPGA Eng (Qualcomm)




### WHAT WE DO

- AI inference at low latency and low power
- Signal processing: Fourier transformation, matrix inversion
- Pattern recognition
- Optical data transfer / interconnects

#### **Ultra-low latency optical compute**





# WHY PHOTONIC COMPUTING

- SPEED: runs at 10–100 GHz, full matrix-vector multiplication in a single clock step
- PARALLEL PROCESSING: many vectors in a single shot on different wavelengths of light
- SIMPLICITY: 1-1 mapping of a matrix, no bandwidth-limiting components



Efficient, low-latency compute





# PHOTONICS VS ELECTRONICS

| Parameter       | Photonics | Electronics |
|-----------------|-----------|-------------|
| Speed           | Up to 100 GHz | Up to a few GHz |
| Power           | Linear operations "for free"<br>(but: electro-optic conversions) | Cost of switching capacitors and leakage currents |
| Parallelization | Wavelength multiplexing<br>Mode multiplexing<br>Polarization multiplexing<br>Duplication of cores | Via duplication of cores |
| Footprint       | µm scales<br>High compute densities via speed and<br>parallelization | nm scales |
| Scalability     | Modulation speed<br>More wavelengths<br>MAC unit size<br>No need for 3 nm technology | Going to smaller technology nodes<br>Expensive<br>Reaching physical limits |



### WHY NOW?

- Standard CMOS processes
- Volume manufacturing available now
- Full integration of light source, modulators and detectors
- Application specific, not general compute





#### LOW LATENCY, HIGH THROUGHPUT & EFFICIENCY

Estimated performance of the Salience chip for ResNet-50 on the ImageNet dataset





### **APPLICATION AREAS**

- Ultra-low latency image recognition
- Error correction / signal cleaning
- Ultra-low latency signal processing
- Nanosecond pattern recognition & correlation detection
- Low precision matrix math



#### **EXAMPLE: DETECTION SYSTEM**



(Figure: collision)

#### **Pattern recognition**

- Detect trigger signal
- Low latency (<1 ns)
- Start processing
- Start storing data

#### AI Inference cluster

- Multiple compute cores
- Optical interconnect for high bandwidth
- Low latency analysis
- Noise reduction
- Feedback loop



#### HOW DOES IT WORK?



# AMPLITUDE VS PHASE

#### **Amplitude modulation**

(Figure: optical waveguide with modulator)

- Multiplication via attenuation
- Not phase sensitive
- No need for coherent light
- Robust

#### **Phase modulation**



- Interferometer performs rotations
- Exploits optical interference
- Needs coherent light
- Sensitive to variations






### DIFFERENT CONCEPTS AND APPROACHES

| Parameter                | Salience, Amplitude                                                    | Phase (MZI arrays)                                                | Free space                                                         | Ring resonators                                                                                  |
|--------------------------|------------------------------------------------------------------------|-------------------------------------------------------------------|--------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| Speed                    | Up to 100 GHz                                                          | Few GHz                                                           | Typically kHz–MHz                                                  | Up to 100 GHz                                                                                    |
| Power                    | Low                                                                    | Medium                                                            | Very low                                                           | Medium (tuning)                                                                                  |
| Parallelization<br>(WDM) | Easy!                                                                  | Possible                                                          | Unlikely                                                           | Unlikely                                                                                         |
| Footprint                | Small                                                                  | Large                                                             | Very large                                                         | Small                                                                                            |
| Scalability              | Very good: speed,<br>parallelization and<br>MAC unit size              | Limited, due to difficult<br>phase control and<br>parallelization | Limited                                                            | Limited                                                                                          |
| Stability                | Every component is<br>broadband: high<br>tolerance to<br>variations    | Optical phase is very temperature sensitive                       | Difficult due to<br>alignment and stability<br>of free space parts | Every resonator needs<br>thermal tuning of the<br>sharp resonances to the<br>correct wavelengths |



**Multiplication:** $c = a \cdot b$, where $a$ is the amplitude of the input light, $b$ the state of the modulator and $c$ the amplitude of the output light.

**Addition:** the power from two waveguides combines into one, $P_{1} + P_{2}$.

- Multiplication in (passive) transmission measurement
- Amplitude of light weighted by modulating element
- Combine power from two waveguides into one
- Avoid interference by use of different wavelengths



 Combined MAC units calculate dot product: ab+cd+ef+...
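The multiply-then-combine principle can be checked numerically with a toy model. This is a sketch under stated assumptions, not Salience's implementation: it assumes an ideal lossless modulator with transmission in [0, 1], non-negative input values, and perfectly incoherent summation at the detector because each product travels on its own wavelength. The function names are hypothetical.

```python
# Toy numerical model of an amplitude-based photonic MAC (hypothetical names).


def mac_unit(light_in: float, transmission: float) -> float:
    """Multiplication: the modulator passes a fraction `transmission` of the light."""
    if not 0.0 <= transmission <= 1.0:
        raise ValueError("an ideal passive modulator has transmission in [0, 1]")
    if light_in < 0.0:
        raise ValueError("amplitude encoding carries non-negative values only")
    return light_in * transmission


def detector_sum(channels: list[float]) -> float:
    """Addition: channels on different wavelengths do not interfere,
    so their contributions simply add (P1 + P2 + ...)."""
    return sum(channels)


def photonic_dot(inputs: list[float], weights: list[float]) -> float:
    """Dot product ab + cd + ef + ... with one MAC unit per element pair."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return detector_sum([mac_unit(a, b) for a, b in zip(inputs, weights)])


if __name__ == "__main__":
    print(photonic_dot([1.0, 0.5, 0.2], [0.5, 1.0, 0.5]))  # 0.5 + 0.5 + 0.1
```

A single transmission value cannot represent a negative weight; how signed values are handled on the chip is not covered by these slides.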





# PHOTONIC MATRIX MULTIPLICATION

- Combined MAC units calculate dot product: ab + cd + ef + ...
- Multiple columns
- Full matrix vector multiplication
- Single time step (time of flight of the light)
- Increased compute density via wavelength division multiplexing
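Numerically, the matrix slide amounts to one weight matrix applied to several input vectors at once. The sketch below assumes each vector rides on its own set of wavelengths so that all products belong to a single time step; Python of course computes them in a loop. `StepResult` and `matvec_one_step` are hypothetical names, and the MAC count simply tallies the multiply-accumulates covered by that one step.

```python
from dataclasses import dataclass

Matrix = list[list[float]]


@dataclass
class StepResult:
    outputs: list[list[float]]  # one output vector per input vector
    macs: int                   # multiply-accumulates covered by the single step


def matvec_one_step(weights: Matrix, vectors: list[list[float]]) -> StepResult:
    """Multiply one weight matrix by several input vectors.

    In the photonic picture every input vector uses its own wavelengths, so
    all these products happen in one time of flight; here they are looped.
    """
    cols = len(weights[0])
    if any(len(row) != cols for row in weights):
        raise ValueError("weight matrix rows must all have the same length")
    outputs = []
    for x in vectors:
        if len(x) != cols:
            raise ValueError("each vector must match the number of matrix columns")
        outputs.append([sum(w * xi for w, xi in zip(row, x)) for row in weights])
    return StepResult(outputs, len(weights) * cols * len(vectors))


if __name__ == "__main__":
    result = matvec_one_step([[1.0, 0.0], [0.5, 0.5]], [[1.0, 2.0], [3.0, 4.0]])
    print(result.outputs, result.macs)
```

Adding a vector (wavelength) multiplies the MAC count per step without adding matrix elements, which is the compute-density argument in the last bullet.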





### IMPACT OF PARALLELISATION

- Data given for ResNet-50
- 45 μs is already the fastest (GPU: ca. 600 μs)
- Latency reduction by using multiple input vectors
- Unique to photonics
- Extra boost from extra colours!
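Why extra colours cut latency can be seen in a deliberately crude model: if a layer needs many input vectors pushed through one core, and each clock step carries one vector per wavelength group, the step count drops with the number of colours. This is an illustrative assumption, not the model behind the 45 μs figure; it ignores electro-optic conversion, memory access and electronic post-processing, and the 3136-vector layer is hypothetical.

```python
import math


def layer_latency_s(num_vectors: int, vectors_per_step: int, clock_hz: float) -> float:
    """Time to push `num_vectors` input vectors through one photonic core.

    Assumes one matrix-vector step per clock cycle, with `vectors_per_step`
    vectors (one per wavelength group) sharing each step.
    """
    if num_vectors < 0 or vectors_per_step < 1 or clock_hz <= 0:
        raise ValueError("invalid arguments")
    steps = math.ceil(num_vectors / vectors_per_step)
    return steps / clock_hz


if __name__ == "__main__":
    # Hypothetical layer needing 3136 input vectors on a 10 GHz core.
    for colours in (1, 2, 4, 8):
        print(colours, layer_latency_s(3136, colours, 10e9))
```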





# THE FULL SYSTEM

- Photonics: Fast matrix math
- Electronics: Control and nonlinearities
- Chiplet approach
- Standard interfaces: Digital electronic in & out
- Photonic interfaces possible
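The split between photonics (matrix math) and electronics (control and nonlinearities) can be sketched as one neural-network layer. The 8-bit ADC, the full-scale value of 2.0 and the electronic bias are assumptions chosen for illustration; the slides do not specify converter resolution or where the bias is applied, and all names are hypothetical.

```python
def photonic_matvec(weights: list[list[float]], x: list[float]) -> list[float]:
    """Linear part, done optically in the real system (here: plain Python)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def adc(value: float, full_scale: float, bits: int) -> float:
    """Idealised detector + ADC: clip to [0, full_scale], quantise to 2**bits levels."""
    levels = 2**bits - 1
    clipped = min(max(value, 0.0), full_scale)
    return round(clipped / full_scale * levels) * full_scale / levels


def hybrid_layer(weights, bias, x, full_scale=2.0, bits=8):
    """Photonics for the matrix math, electronics for bias and ReLU."""
    analog = photonic_matvec(weights, x)
    digital = [adc(v, full_scale, bits) for v in analog]
    return [max(0.0, d + b) for d, b in zip(digital, bias)]


if __name__ == "__main__":
    print(hybrid_layer([[0.5, 0.5], [1.0, 0.0]], [-0.5, -2.0], [1.0, 1.0]))
```

Keeping the nonlinearity in electronics means the optical path only has to be linear, which matches the "linear operations for free" argument earlier in the deck.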





# PRIME – SOFTWARE MODEL

- Software model of the hardware
- Tool to evaluate larger processors before availability of silicon
- Benchmarks for different workloads
- 8 models implemented: ResNet-50, 3D U-Net, RetinaNet, BEiT-L, ...
- Software interface: TensorFlow
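A software model has to map layers larger than the physical MAC array onto it. The sketch below shows one generic way to do that bookkeeping: cut the matrix into core-sized tiles, count one core pass per tile, and accumulate partial sums electronically. It is a hypothetical illustration, not PRIME's actual algorithm or API.

```python
import math


def tiled_matvec(weights, x, core_rows, core_cols):
    """Evaluate a large matrix-vector product on a small fixed-size core.

    Each core_rows x core_cols tile is one pass through the (hypothetical)
    photonic core; partial sums are accumulated electronically.
    Returns the result vector and the number of core passes.
    """
    rows, cols = len(weights), len(weights[0])
    y = [0.0] * rows
    passes = 0
    for r0 in range(0, rows, core_rows):
        for c0 in range(0, cols, core_cols):
            passes += 1
            c1 = min(c0 + core_cols, cols)
            for r in range(r0, min(r0 + core_rows, rows)):
                y[r] += sum(weights[r][c] * x[c] for c in range(c0, c1))
    # Sanity check: passes equals the number of tiles covering the matrix.
    assert passes == math.ceil(rows / core_rows) * math.ceil(cols / core_cols)
    return y, passes


if __name__ == "__main__":
    w = [[r * 7 + c for c in range(7)] for r in range(5)]
    print(tiled_matvec(w, [1] * 7, core_rows=2, core_cols=3))
```

Multiplying the pass count by the core's clock period gives a first latency estimate for a processor that does not exist in silicon yet.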



#### EXPERIMENTAL RESULTS



# CONVOLUTIONS

- Scan filter across image
- Image filtering: edge detection, sharpening, smoothing
- Multiple filter kernels can be applied at the same time
- Convolutional neural networks: vision applications, image classification, pattern recognition
- High speed and high efficiency



Feldmann, Youngblood, Karpov et al., Nature 2021, 589, 52-58.
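A convolution fits a matrix-vector engine once each image patch is flattened into a vector and each kernel into one matrix row: applying all kernels to a patch is then a single matrix-vector product, which is how several filters can run at the same time. The sketch below uses the standard flattening ("im2col") with valid padding, stride 1 and the cross-correlation convention used in CNNs; these are assumptions, not necessarily the exact scheme of the cited work.

```python
def conv2d_multi_kernel(image, kernels):
    """Apply several k x k kernels to a 2D image (valid padding, stride 1).

    Each kernel becomes one row of a kernel matrix and each image patch one
    input vector, so one matrix-vector product per patch yields that patch's
    output for every kernel at once.
    """
    k = len(kernels[0])
    h, w = len(image), len(image[0])
    kernel_matrix = [[v for row in ker for v in row] for ker in kernels]
    out_h, out_w = h - k + 1, w - k + 1
    outputs = [[[0.0] * out_w for _ in range(out_h)] for _ in kernels]
    for i in range(out_h):
        for j in range(out_w):
            patch = [image[i + di][j + dj] for di in range(k) for dj in range(k)]
            for n, row in enumerate(kernel_matrix):
                outputs[n][i][j] = sum(a * b for a, b in zip(row, patch))
    return outputs


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(conv2d_multi_kernel(img, [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]))
```

Filters such as edge detectors have negative entries; this sketch is pure arithmetic and does not address how signs are encoded optically.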



#### NEURAL NETWORKS

- Handwritten digit recognition
- Simple CNN tested on MNIST database
- Experimental accuracy: 95.3%
- Theoretical accuracy: 96.1%



Feldmann, Youngblood, Karpov et al., Nature 2021, 589, 52-58.



### SCALABILITY

- Photonics scales in different ways compared to electronics: MAC unit size, modulator speed, parallelisation
- No need for newest technology node!





Feldmann et al., Nature, Vol 589, 7 January 2021.








#### SCALING PERFORMANCE

#### PROTOTYPE CHIPS

- 9x4 photonic matrix with FPGA
- Multiplexing 4 vectors
- Up to 14 GHz
- Up to 32x32
- Foundry compatibility

#### SCALING

- 0.5 TOPS
- 64x64 matrix, 10 GHz, 10 vectors: 1000 TOPS
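A simple throughput formula helps sanity-check such numbers: one MAC per matrix element, per clock cycle, per multiplexed vector. Under this assumed model a 9x4 core at 14 GHz with one vector gives 0.504 TMAC/s, and a 64x64 core at 10 GHz with 10 vectors gives 409.6 TMAC/s, or 819.2 TOPS if a MAC counts as two operations. The slides do not state their counting convention, so treat this as an order-of-magnitude check, not a derivation of the quoted figures.

```python
def macs_per_second(rows: int, cols: int, clock_hz: float, vectors: int) -> float:
    """One MAC per matrix element, per clock cycle, per multiplexed vector."""
    return rows * cols * clock_hz * vectors


prototype = macs_per_second(9, 4, 14e9, vectors=1)
scaled = macs_per_second(64, 64, 10e9, vectors=10)

if __name__ == "__main__":
    print(f"9x4 @ 14 GHz, 1 vector:     {prototype / 1e12:.3f} TMAC/s")
    print(f"64x64 @ 10 GHz, 10 vectors: {scaled / 1e12:.1f} TMAC/s "
          f"= {2 * scaled / 1e12:.1f} TOPS at 2 ops per MAC")
```

The formula makes the three scaling levers explicit: MAC unit size (rows x cols), modulation speed (clock) and parallelisation (vectors).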



#### APPLICATION EXAMPLES






### PATTERN RECOGNITION

- Check for multiple patterns simultaneously
- Down to tens of picoseconds evaluation time
- Noise-tolerant evaluation



(Figure: multiple data streams compared against stored patterns; match detected)
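Checking several patterns at once maps naturally onto one matrix-vector product: store each normalised pattern as a matrix row, feed the current data window in as the vector, and compare every output against a threshold. The sketch uses cosine similarity and a hypothetical threshold of 0.9; a threshold below 1.0 is one way to make the evaluation tolerant to noise. It illustrates the principle only and says nothing about picosecond timing.

```python
import math


def _unit(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def match_patterns(window: list[float], patterns: list[list[float]],
                   threshold: float = 0.9) -> tuple[list[int], list[float]]:
    """Compare one data window with all stored patterns at once.

    Normalised patterns form the rows of a matrix; one matrix-vector product
    gives a cosine-similarity score per pattern. Indices of patterns whose
    score reaches the threshold are reported as matches.
    """
    w = _unit(window)
    scores = [sum(a * b for a, b in zip(_unit(p), w)) for p in patterns]
    return [i for i, s in enumerate(scores) if s >= threshold], scores


if __name__ == "__main__":
    noisy = [0.9, 0.1, 1.05, 0.0]
    print(match_patterns(noisy, [[1, 0, 1, 0], [0, 1, 0, 1]]))
```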






### FOURIER TRANSFORMS

- Single clock step (up to 100 GHz)
- If the number of data points matches the MAC size, a Fourier transformation can be carried out in a single clock step
- Larger transforms possible via decomposition
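The discrete Fourier transform is itself a matrix-vector product, which is why an N-point transform fits an N x N MAC array in one step. For larger inputs, the radix-2 Cooley-Tukey split is one standard decomposition: halve the problem until it fits the core, then recombine with twiddle factors. The sketch below is purely mathematical; the DFT matrix is complex, and the slides do not describe how complex or signed values are encoded on the chip. `core_size` is a hypothetical parameter.

```python
import cmath


def dft_matrix(n: int) -> list[list[complex]]:
    """N x N DFT matrix with entries exp(-2*pi*i*k*m/N)."""
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)] for k in range(n)]


def dft_single_step(x: list[complex]) -> list[complex]:
    """N-point DFT as one N x N matrix-vector product."""
    return [sum(f * v for f, v in zip(row, x)) for row in dft_matrix(len(x))]


def dft_decomposed(x: list[complex], core_size: int) -> list[complex]:
    """Radix-2 Cooley-Tukey: split until each sub-transform fits the core."""
    n = len(x)
    if n <= core_size:
        return dft_single_step(x)
    if n % 2:
        raise ValueError("length must reduce to core_size by repeated halving")
    even = dft_decomposed(x[0::2], core_size)
    odd = dft_decomposed(x[1::2], core_size)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


if __name__ == "__main__":
    print(dft_decomposed([1, 2, 3, 4, 0, 0, 0, 0], core_size=2))
```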



# **OPTICAL DATA TRANSFER**



- Optical in, optical out, NxN reconfigurable
- Ultra-low latency
- Signal replication
- Networking in latency critical environments
- High bandwidth, multiple wavelength channels
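One way to picture an NxN reconfigurable optical router with signal replication is as a 0/1 connection matrix applied to the input channels: a row selects which input reaches an output, and a column with several 1s copies one input to several outputs. This is an idealised, hypothetical model; it ignores splitting loss, whereas a real passive splitter shares the optical power between the copies.

```python
def route(connections: list[list[int]], inputs: list[float]) -> list[float]:
    """Ideal N x N optical switch modelled as a 0/1 connection matrix.

    connections[out][inp] == 1 sends input `inp` to output `out`; a column
    with several 1s replicates that input. Splitting loss is ignored.
    """
    for row in connections:
        if any(c not in (0, 1) for c in row):
            raise ValueError("connection entries must be 0 or 1")
    return [sum(c * x for c, x in zip(row, inputs)) for row in connections]


if __name__ == "__main__":
    # Input 0 replicated to outputs 1 and 2, input 1 sent to output 0.
    print(route([[0, 1, 0], [1, 0, 0], [1, 0, 0]], [5.0, 7.0, 9.0]))
```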



#### OUTLOOK



#### DEVELOPMENT TIMELINES









| Date         | Milestone |
|--------------|-----------|
| Today        | Current prototype demonstration: photonic chip<br>driven by FPGA in lab |
| Aug 2023     | Evaluation board: photonic chip with on-chip light source, fabricated at production foundry, with driving electronics |
| Oct/Nov 2023 | Test chip: prototype with photonic chip packaged to a dedicated ASIC |
| 2024         | Commercial prototype: high performance prototype with photonic chip packaged to a dedicated ASIC |



### GET IN TOUCH!

Ask: We are looking for collaboration partners who can benefit from our ultra-low latency processing!

Johannes Feldmann: johannes@saliencelabs.ai

Vaysh Kewada: vaysh@saliencelabs.ai

www.saliencelabs.ai

