Friday, August 23, 2024

Applied Use Case of Physics-Informed Neural Operators: From a Function Approximator to Predicting Failure of ISP Network Equipment

There are many operator-learning methods that show how deep neural networks can approximate operators, not just functions. This distinction matters because operators, like those found in differential equations, map functions to functions. As a result, neural networks can be applied to problems in physics, biology, actuarial science, statistics, and finance: the approach is not limited to differential equations, and many other families of equations used in scientific and numerical analysis can be leveraged. For this discussion, we will focus on Fourier Neural Operators.



Watch a Google NotebookLM-generated podcast on this article below!



How do Fourier Neural Operators work?

The Fourier Neural Operator (FNO) is very useful for image-to-image problems and comparisons. The core idea is to replace convolutional layers with Fourier layers. Each Fourier layer transforms the input data into the frequency domain, applies a learned linear transformation there, and then applies an inverse Fourier transform back to the geometric domain. Because patterns and dependencies are often represented compactly in the frequency domain, the resulting predictions can capture them quite accurately.
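
To make the Fourier-layer mechanics concrete, here is a minimal single-layer sketch in NumPy. It is a toy illustration, not a full FNO (published implementations also add a pointwise linear path and a nonlinearity between layers, and learn the weights by gradient descent): the input is FFT'd, complex weights are applied to the lowest modes, and the result is inverse-FFT'd back.

```python
import numpy as np

def fourier_layer(x, weights, n_modes):
    """One spectral-convolution layer: FFT -> linear map on low modes -> inverse FFT.

    x:       (n_points,) real-valued input function on a uniform grid
    weights: (n_modes,) complex multipliers for the retained Fourier modes
    n_modes: number of low-frequency modes to keep (the rest are truncated)
    """
    x_hat = np.fft.rfft(x)                          # to the frequency domain
    out_hat = np.zeros_like(x_hat)
    out_hat[:n_modes] = x_hat[:n_modes] * weights   # linear transform on kept modes
    return np.fft.irfft(out_hat, n=len(x))          # back to the geometric domain

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 2 * np.pi, 128))          # sample input function
w = rng.standard_normal(16) + 1j * rng.standard_normal(16)  # stand-in "learned" weights
y = fourier_layer(x, w, n_modes=16)
print(y.shape)  # (128,)
```

In a trained FNO, the `weights` array above is what gradient descent optimizes; truncating to the lowest modes is what keeps the layer cheap and smooth.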

Fourier transforms are well suited to representing physical phenomena, so we can use them to capture the underlying physics of the objects or data sources feeding the AI model. A good example is monitoring amplitude spectra and power spectral entropy, or the smoothing effect, on accelerometer and gyroscope data. What's really neat is that you are able to calculate the optimal frequencies and window sizes, reducing overlapping windows and overfitting. One promising application is zero-shot super resolution, where a model trained on low-resolution data is used to generate high-resolution solutions, essentially upscaling the results. Super resolution likely works when the low-resolution data captures enough of the essential features of the physics; pushing the limits of downsampling can lead to inaccurate results.
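
The power spectral entropy mentioned above falls out directly from an FFT: a spectrum concentrated in a few frequencies has low entropy, while a noisy, flat spectrum has high entropy. A minimal sketch (the 50 Hz tone and the sampling setup are made-up illustration values):

```python
import numpy as np

def power_spectral_entropy(signal):
    """Shannon entropy (bits) of the normalized power spectrum of a 1-D signal."""
    psd = np.abs(np.fft.rfft(signal)) ** 2
    psd = psd / psd.sum()      # normalize to a probability distribution
    psd = psd[psd > 0]         # drop zero bins to avoid log(0)
    return -np.sum(psd * np.log2(psd))

t = np.linspace(0, 1, 512, endpoint=False)
pure_tone = np.sin(2 * np.pi * 50 * t)                  # concentrated spectrum -> low entropy
noise = np.random.default_rng(1).standard_normal(512)   # flat spectrum -> high entropy
print(power_spectral_entropy(pure_tone) < power_spectral_entropy(noise))  # True
```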

Generalizing Neural Operators: Customization, Flexibility and Kernels

The FNO is a specific instance of a more general neural operator framework. This framework allows for customization by specifying different kernel functions in the neural operator layers, which lets users tailor neural operators to their specific physics problems. You can also combine the results with classical techniques, for example clustering over a range of K values and visualizing the clusters in a 3D scatter plot or constellation-style map, or feeding extracted features into linear and logistic regression models. Some Fourier kernels are well suited to periodic boundary conditions, like those of fluid flow problems or heat transfer equations, while complex geometries call for a number of different kernels depending on the application. Understanding the underlying physics and boundary conditions at the onset of any FNO project is very important, or you will end up with useless outputs.
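
As a sketch of why kernel choice matters, here are two common kernel functions side by side; the 24-hour period is a hypothetical daily-traffic cycle chosen for illustration, not a recommendation. The same pair of points looks "far apart" to an RBF kernel but identical to a periodic kernel:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential (RBF) kernel: smooth similarity for aperiodic metrics."""
    return np.exp(-((x1 - x2) ** 2) / (2 * length_scale ** 2))

def periodic_kernel(x1, x2, period=24.0, length_scale=1.0):
    """Periodic kernel: encodes repetition, e.g. a daily workload cycle in hours."""
    return np.exp(-2 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2 / length_scale ** 2)

# Two observations exactly one day apart (in hours):
print(rbf_kernel(0.0, 24.0))       # ~0: unrelated under the RBF view
print(periodic_kernel(0.0, 24.0))  # 1.0: same phase of the 24-hour cycle
```

This is the intuition behind the per-subsystem kernel recommendations in the tables below: match the kernel's structural assumption to the physics of the metric.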

Switches and Routers

| Metric Type | Recommended Kernel | Feature Engineering | Prediction Target | Validation Method |
| --- | --- | --- | --- | --- |
| Traffic Patterns | Fourier Neural Operator | Packet rate statistics; queue depth trends; buffer utilization | Port failure probability | Rolling window validation with 30-day segments |
| Hardware Health | RBF Kernel | Temperature deltas; power fluctuations; fan speeds | Component failure risk | Cross-validation with historical failure data |
| Error Logs | String Kernel | Error frequency analysis; pattern matching scores; time between errors | System instability risk | Precision-recall on past incidents |

Load Balancers

| Metric Type | Recommended Kernel | Feature Engineering | Prediction Target | Validation Method |
| --- | --- | --- | --- | --- |
| Connection Stats | Periodic Kernel | Connection rate trends; session duration patterns; SSL handshake times | Service degradation risk | Weekly pattern analysis |
| Resource Usage | Matern Kernel | CPU/memory patterns; thread utilization; queue backlog | Resource exhaustion probability | Resource threshold validation |

Servers

| Metric Type | Recommended Kernel | Feature Engineering | Prediction Target | Validation Method |
| --- | --- | --- | --- | --- |
| CPU Metrics | Composite RBF + Linear | Load averages; context switch rates; cache hit ratios | Processor failure risk | Historical MTBF correlation |
| Memory Systems | RBF Kernel | Page fault rates; memory bandwidth; ECC error counts | Memory failure probability | Error rate trending |
| Storage I/O | Spectral Kernel | IOPS patterns; latency distributions; queue depths | Disk subsystem failure | Performance degradation detection |

Storage Systems

| Metric Type | Recommended Kernel | Feature Engineering | Prediction Target | Validation Method |
| --- | --- | --- | --- | --- |
| Disk Health | Custom SMART Kernel | Reallocated sector count; read error rates; temperature trends | Drive failure probability | SMART attribute correlation |
| Controller Stats | RBF + Periodic | Cache hit rates; write coalescing efficiency; battery health | Controller failure risk | Historical incident matching |

Power and Cooling

Power Distribution

| Metric Type | Recommended Kernel | Feature Engineering | Prediction Target | Validation Method |
| --- | --- | --- | --- | --- |
| UPS Metrics | Matern Kernel | Load percentage; battery health; temperature | UPS failure probability | Battery wear prediction |
| PDU Stats | RBF Kernel | Current draw patterns; power factor; voltage stability | Circuit overload risk | Power envelope analysis |

Cooling Systems

| Metric Type | Recommended Kernel | Feature Engineering | Prediction Target | Validation Method |
| --- | --- | --- | --- | --- |
| CRAC Units | Periodic + RBF | Temperature deltas; humidity levels; airflow rates | Cooling failure risk | Thermal map correlation |
| Heat Exchange | Custom Thermal Kernel | Heat load distribution; coolant pressure; flow rates | Thermal event probability | Temperature gradient analysis |

Implementation Notes

Feature Extraction Parameters

- Sampling Rate: 1-5 minutes for most metrics
- Window Size: 24 hours for pattern analysis
- Aggregation Period: 1 hour for trend calculation

Kernel Optimization Guidelines

1. RBF Kernel Parameters
   - Length scale: adjust based on metric volatility
   - Signal variance: calibrate to metric range
2. Periodic Kernel Settings
   - Period length: match to workload cycles
   - Length scale: tune to noise level
3. Composite Kernel Weights
   - Balance long-term trends against short-term patterns
   - Adjust based on false positive/negative rates

Validation Framework

- Training Period: minimum 6 months of historical data
- Test Split: rolling 30-day windows
- Metrics:
  - Precision: target > 85%
  - Recall: target > 80%
  - Lead Time: minimum 24 hours
  - False Positive Rate: target < 5%
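
The rolling-window test split above could be generated like this. The helper name and the fixed training-start convention (an expanding training set, rolled test window) are my own illustration; some teams slide the training window forward as well:

```python
from datetime import date, timedelta

def rolling_windows(start, end, train_days=180, test_days=30, step_days=30):
    """Yield (train_start, train_end, test_end) splits for rolling validation.

    Mirrors the framework above: at least 6 months (180 days) of training
    history before the first rolling 30-day test window.
    """
    cursor = start + timedelta(days=train_days)
    while cursor + timedelta(days=test_days) <= end:
        yield start, cursor, cursor + timedelta(days=test_days)
        cursor += timedelta(days=step_days)

windows = list(rolling_windows(date(2024, 1, 1), date(2024, 12, 31)))
print(len(windows))  # 6 splits fit in calendar-year 2024
```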

Model Update Strategy

- Retrain Schedule: monthly
- Incremental Updates: daily parameter adjustment
- Validation Frequency: weekly performance check

Integration Points

1. Monitoring Systems
   - Prometheus/Grafana
   - Nagios/Zabbix
   - Custom SNMP collectors
2. Alert Systems
   - Threshold definitions
   - Escalation paths
   - Automated response triggers
3. CMDB Integration
   - Asset correlation
   - Maintenance history
   - Replacement tracking

Mesh Invariance

Mesh invariance enables discretization flexibility: the same operator can be applied at different mesh resolutions. This allows solutions to be refined and intricate details, such as shock waves, to be captured. Neural operators offer several advantages over traditional methods, including mesh invariance, the ability to learn complex relationships, and the potential for zero-shot super resolution. However, it's crucial to evaluate their performance carefully and understand their limitations. You can also use Laplace Neural Operators, which generalize Fourier Neural Operators to handle exponential growth and decay problems. The possibilities from there are endless.
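
A quick way to see discretization invariance: the same truncated spectral weights can be applied to the same underlying function sampled at two different resolutions, and the outputs agree on the shared grid points. This is a toy sketch (a real FNO stacks several such layers with nonlinearities), with randomly generated stand-in weights:

```python
import numpy as np

def spectral_conv(x, weights, n_modes):
    """Apply fixed spectral weights to the lowest n_modes of x, at any resolution."""
    x_hat = np.fft.rfft(x)
    out_hat = np.zeros_like(x_hat)
    out_hat[:n_modes] = x_hat[:n_modes] * weights
    return np.fft.irfft(out_hat, n=len(x))

rng = np.random.default_rng(0)
w = rng.standard_normal(8) + 1j * rng.standard_normal(8)  # stand-in "learned" weights

coarse = np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False))   # training resolution
fine   = np.sin(np.linspace(0, 2 * np.pi, 256, endpoint=False))  # 4x finer mesh

y_coarse = spectral_conv(coarse, w, 8)
y_fine   = spectral_conv(fine, w, 8)   # same weights, finer output grid
print(y_coarse.shape, y_fine.shape)    # (64,) (256,)
```

Because the spectral weights act on frequency modes rather than grid points, `y_fine` evaluated at the coarse grid locations matches `y_coarse`, which is the essence of zero-shot super resolution.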

Use Case: Predicting Network Equipment Failure, Congestion and Packet Loss

Data Collection

Large ISPs like AT&T, Charter/Spectrum, and Verizon collect a continuous stream of data from their massively scaled networks. This data can be processed with Fourier transforms to produce signals describing equipment health, failure rates, and performance, which in turn serve as inputs to an AI/ML logistic regression model. We can use WebRTC clients to capture metrics that give deeper insight into the network. These clients gather real-time metrics from WebRTC sessions, including packet loss, jitter, round-trip time, local hardware details, and bandwidth estimates. By collecting this data, we can start building a complete picture of network performance and achieve unprecedented observability.

Fourier Transform Application

Once this data is collected, the next step is to apply Fourier transforms to the time series. This technique is essential because it converts the data from the time domain (where metrics vary over time) to the frequency domain, where the data instead reveals the strength of different frequencies. This lets us analyze patterns and trends that are not obvious in the raw time series and establish correlations, and potentially causation, with actual unexpected failures. By comparing unexpected failures and their context against the prediction model built from the WebRTC network_test tool data, we can predict with reasonable accuracy which equipment will fail, and when.
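
As a sketch of this time-to-frequency conversion, here is a simulated per-minute packet-loss series with a daily cycle buried in noise; the FFT recovers the 24-hour period as the dominant frequency. All signal parameters are made up for illustration:

```python
import numpy as np

# Simulated packet-loss series: one sample per minute for 24 hours,
# a daily-cycle component plus measurement noise.
rng = np.random.default_rng(0)
minutes = np.arange(24 * 60)
daily_cycle = 0.5 * np.sin(2 * np.pi * minutes / (24 * 60))
loss_pct = 1.0 + daily_cycle + 0.1 * rng.standard_normal(minutes.size)

# Time domain -> frequency domain
spectrum = np.fft.rfft(loss_pct - loss_pct.mean())   # subtract the mean (DC offset)
freqs = np.fft.rfftfreq(minutes.size, d=60.0)        # sample spacing = 60 seconds

dominant = freqs[np.argmax(np.abs(spectrum))]        # Hz
print(f"dominant period: {1 / dominant / 3600:.1f} hours")  # 24.0
```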

Feature Extraction, Historical Metrics, and Logistic Regression Prep

Once the data is transformed into the frequency domain, we can extract the key features needed for predictive analysis. Dominant frequencies, data center temperature, and traffic capacity patterns all highlight periodic patterns in network health markers, which are then correlated with the MOS score of 1-5 (1 being terrible network performance, 5 being excellent). We can then focus on the amplitude of these frequencies and the network segments involved, which indicate how severe the congestion patterns are. Finally, phase information is extracted to help pinpoint shifts in network behavior over time.

We can leverage historical network failure data to analyze periods when packet loss and jitter exceeded certain thresholds, when packet arrival delay became asymmetric, or when network outages occurred. By labeling this historical data, we can prepare the dataset for training machine learning models that identify the pattern correlations and causes that lead to network equipment failures. At this stage, we can even take weather almanacs and news feeds into account to identify hurricanes, thunderstorms, earthquakes, and other natural disasters.
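
The labeling step might look like the following sketch. The 2% loss and 30 ms jitter thresholds are purely illustrative values, not operational recommendations:

```python
import numpy as np

def label_failure_windows(packet_loss_pct, jitter_ms, loss_thresh=2.0, jitter_thresh=30.0):
    """Label each observation window 1 (degraded / failure precursor) or 0 (healthy).

    A window is flagged when either metric exceeds its threshold.
    Thresholds here are illustrative, not vendor guidance.
    """
    packet_loss_pct = np.asarray(packet_loss_pct)
    jitter_ms = np.asarray(jitter_ms)
    return ((packet_loss_pct > loss_thresh) | (jitter_ms > jitter_thresh)).astype(int)

labels = label_failure_windows([0.1, 3.5, 0.4, 1.0], [10, 12, 45, 8])
print(labels)  # [0 1 1 0]
```

These binary labels become the supervised target for the logistic regression training described next.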

Logistic Regression Model Training

Once the model has consumed these training datasets and is grounded with output boundaries, we can implement a logistic regression model using the Fourier features extracted during phase one of model training (dominant frequencies, amplitudes, and phase information), along with other relevant contextual data such as time of day and overall network load. We can then train the model to predict the probability of network equipment failure, or of congestion or packet loss surpassing a predefined threshold.
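
A minimal logistic regression trainer over such Fourier-derived features, written from scratch in NumPy so the gradient step is visible. The toy data ties failure probability to a single spectral amplitude feature; a real feature matrix would include the dominant frequencies, phases, and contextual data described above:

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by gradient descent on a feature matrix X.

    X: (n_samples, n_features), e.g. Fourier-derived features per window
    y: (n_samples,) with 1 = failure/congestion followed, 0 = healthy
    """
    X = np.column_stack([np.ones(len(X)), X])   # prepend a bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)        # cross-entropy gradient step
    return w

def predict_proba(w, X):
    X = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-X @ w))

# Toy data: higher spectral amplitude -> higher failure probability
rng = np.random.default_rng(0)
amplitude = rng.uniform(0, 1, 200)
y = (amplitude + 0.1 * rng.standard_normal(200) > 0.5).astype(float)

w = train_logistic(amplitude[:, None], y)
print(predict_proba(w, np.array([[0.9]])) > 0.5)  # [ True]
```

In production one would more likely reach for a library implementation with regularization; the from-scratch version just makes the mechanics of the probability model explicit.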

Conclusion

Fourier Neural Operators and Neural Operators represent a powerful approach to operator learning and physics-informed machine learning. Their mesh invariance and flexibility make them promising tools for solving complex problems. 

With this trained model, we can continuously refine its real-time predictions by grounding them in, and comparing them against, actual real-world results. Based on the incoming WebRTC MOS scores of any session, the model can assess the likelihood of congestion or packet loss, generate alerts, or trigger proactive adjustments and replacements in the network. This helps mitigate potential problems before they impact the user experience, ensuring a smoother and more reliable network.

This approach leverages AI not only to monitor network health but also to anticipate and address issues before they escalate, leading to a more robust and resilient network.

Check out this YouTube video by Steve Brunton summarizing these concepts here:



