Compare commits
10 Commits: 07a40fdb01...main

bada43da23, cc552a63a8, 90207ab834, cc944d8cc3, 40a2b072b8, f549f9440c, 8eab98586c, 576cb9c490, 6521badd2d, 8c48894c42

README.md (47 lines changed)
@@ -1,15 +1,40 @@
# nn - implementation of neural networks in c

# nn - Neural Networks in C

implements neural networks in c, targets embedded systems (microcontrollers, fpgas)

This repository implements various neural networks in C, focusing mainly on targeting embedded systems or creating hardware accelerators (FPGA-based, ASIC, etc.) \

#### current implementations

`snn.c` - a simple feedforward neural network written in ~150loc. \

`cnn.c` - TODO, implements a convolutional neural network \

`cnn-hls.c` - TODO, has fpga hls specific types/pragmas in order to synthesize to verilog; run on an fpga \

This project was created as part of my independent study course, where I am currently researching the design of hardware accelerators for high-performance workloads.

depends on native c libraries and [gsl](https://www.gnu.org/software/gsl/)

### Current Implementations (project index)

`snn.c` - A simple feedforward neural network written in ~150 LOC. Depends on C native libraries and [GSL](https://www.gnu.org/software/gsl/) \

`cnn.c` - Implements a fully featured CNN library in ~600 LOC. Depends solely on C native libraries \

`cnn-hls.c` - The version of `cnn.c` with HLS-specific optimizations (pragmas, systolic array multiplication, etc.); aims to be synthesized through Vitis HLS to create an FPGA-based CNN accelerator \

`mnist.c` - Driver code for `cnn.c` which trains on the [MNIST](https://yann.lecun.com/exdb/mnist/) dataset

### future goals

cnn w/ pragmas -> successfully compiled to verilog using vivado/vitis \

self-made matrix multiplication library, relying only on native c ones \

code cleanup and optimization

### Usage
`mnist.c` is a great example of how the library is used, but basic usage boils down to a few simple things:

1) Importing `cnn.c` into your code
2) Creating a network and creating layers:

```c
// an example of a lenet-5 inspired 8 layer network
// constructor signatures (from cnn.c):
//   create_conv(input_height, input_width, input_channels, num_filters, filter_size, stride, padding)
//   create_maxpool(input_height, input_width, input_channels, pool_size, stride)
//   create_fc(output_size, input_size, activation)
Network* network = create_network(8);

network->layers[0] = create_input(IMG_HEIGHT, IMG_WIDTH, 1);
network->layers[1] = create_conv(IMG_HEIGHT, IMG_WIDTH, 1, 6, 5, 1, 2);
network->layers[2] = create_maxpool(network->layers[1]->height, network->layers[1]->width, network->layers[1]->channels, 2, 2);
network->layers[3] = create_conv(network->layers[2]->height, network->layers[2]->width, network->layers[2]->channels, 16, 5, 1, 0);
network->layers[4] = create_maxpool(network->layers[3]->height, network->layers[3]->width, network->layers[3]->channels, 2, 2);
network->layers[5] = create_fc(120, network->layers[4]->height * network->layers[4]->width * network->layers[4]->channels, a_sigmoid);
network->layers[6] = create_fc(84, 120, a_sigmoid);
network->layers[7] = create_fc(NUM_CLASSES, 84, a_softmax);
```

3) Forward and backpropagation through the Network! (a minimal sketch follows below)
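A minimal training-step sketch, based on the driver in `mnist.c` (the `image` and `label` names here are illustrative: `image` is a flattened `IMG_HEIGHT * IMG_WIDTH` float array and `label` is a one-hot vector of length `NUM_CLASSES`):

```c
// forward pass: each layer consumes the previous layer's output
network_forward(network, image);

// the final fully connected layer now holds softmax probabilities
float* probs = network->layers[network->num_layers - 1]->output;

// backpropagate the one-hot label and apply an SGD update with learning rate 0.01
network_backward(network, label, 0.01f);
```

Since `cnn.c` depends only on the C standard library and `math.h`, a driver like `mnist.c` should build with something along the lines of `gcc mnist.c -o mnist -lm`.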
## Project Information

### Abstract

For my project, I propose an implementation of a Convolutional Neural Network based handwritten digit classifier using the MNIST dataset on a Field Programmable Gate Array (FPGA). I utilize a High Level Synthesis (HLS) tool called Vitis HLS, developed by [AMD/Xilinx](https://www.xilinx.com/products/boards-and-kits.html), to implement the accelerator in C, eliminating the need to write any code in HDL languages such as Verilog/VHDL. To reduce performance losses, I implement a systolic array based architecture and utilize techniques such as pipelining, loop unrolling, and memory partitioning. Through this project, I aim to highlight the feasibility and viability of FPGAs for low latency, highly energy efficient machine learning workflows, possibly placing them in consideration as a replacement for GPUs for inference-based tasks.

### What is an FPGA?

A Field Programmable Gate Array, or FPGA, is a type of integrated circuit that is made up of a massive collection of unconnected digital logic blocks. When someone designs *gateware* for an FPGA, they are essentially connecting these logic blocks together in a way that creates a new piece of hardware. FPGAs are also "field programmable," meaning that they can be reconfigured on the fly as the designer's needs change. While often used as tools for rapidly prototyping hardware designs, the highly specialized and customizable nature of an FPGA design allows it to achieve very low latency, high throughput, and high energy efficiency.

#### What is High Level Synthesis (HLS)?

High Level Synthesis is a method of designing gateware that allows a programmer to describe the hardware in a higher level language like C, C++, or even [Python](https://fastmachinelearning.org/hls4ml/). A High Level Synthesis tool takes this description of the intended function of the hardware and synthesizes it into RTL-level code (such as Verilog or VHDL). Since writing in languages such as Verilog can be tedious and time consuming, HLS serves as an alternative for designers who want to efficiently build and verify hardware in a language that is much easier to write, and it also invites programmers with no experience in HDL languages to start developing hardware. In this project, I chose HLS not only to work within my time constraint, but also to evaluate how approachable the HLS workflow truly is for an individual with little to no experience in HDL languages.
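For a concrete flavor of HLS-style C, here is a small, hypothetical kernel annotated with the same kinds of directives used in `cnn-hls.c` (`ARRAY_PARTITION`, `PIPELINE`, `UNROLL`); the function name, array size, and factors are made up for illustration:

```c
#include "ap_fixed.h"

typedef ap_fixed<16,8> data_t;      // same fixed-point format as cnn-hls.c
typedef ap_fixed<32,16> acc_t;      // wider accumulator to avoid overflow

// Dot product over 64 fixed-point values. The pragmas ask the synthesizer to
// split the arrays across memory banks, unroll the loop body, and pipeline it
// so a new iteration can start every clock cycle.
acc_t dot64(const data_t a[64], const data_t b[64]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8
    acc_t acc = 0;
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=8
        acc += a[i] * b[i];
    }
    return acc;
}
```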
### Reflection and Next Steps

This project was an amazing way to get involved with both the FPGA and hardware/accelerator design space. I was able to gain a lot of hands-on experience with the workflow for developing gateware for an FPGA, and I also gained insight into performance optimization concepts and methods such as systolic arrays, loop pipelining/unrolling, and code inlining. Furthermore, I was able to work more with the mathematics and theory behind Deep Learning and Neural Networks, which is very good knowledge to have given the development of artificial intelligence. The next steps of this project include cleaning up and optimizing the code, and possibly implementing quantization, batch normalization, and other types of layers such as residual blocks to further improve the performance of the neural network. On the hardware side, next steps include obtaining a physical FPGA development board to actually deploy this design onto, and possibly rewriting the code to not rely on HLS, instead writing the neural network from scratch in an HDL language such as Verilog.
cnn-hls.c (310 lines, new file)
@@ -0,0 +1,310 @@
|
||||
#include "ap_fixed.h"
|
||||
#include "hls_stream.h"
|
||||
#include "hls_math.h"
|
||||
#include <string.h>
|
||||
|
||||
// Fixed point definitions for better hardware efficiency
|
||||
typedef ap_fixed<16,8> data_t; // 16 bits total, 8 integer bits
|
||||
typedef ap_fixed<16,8> weight_t;
|
||||
typedef ap_fixed<32,16> acc_t; // Wider accumulator to prevent overflow
|
||||
|
||||
// Enums remain the same
|
||||
typedef enum {
|
||||
input,
|
||||
conv,
|
||||
max_pool,
|
||||
fully_connected
|
||||
} ltype;
|
||||
|
||||
typedef enum {
|
||||
fc_input,
|
||||
fc_hidden,
|
||||
fc_output,
|
||||
} fcpos;
|
||||
|
||||
typedef enum {
|
||||
a_sigmoid,
|
||||
a_softmax,
|
||||
} activation;
|
||||
|
||||
// Maximum size definitions for static arrays
|
||||
#define MAX_LAYER_SIZE 1024
|
||||
#define MAX_FILTER_SIZE 11
|
||||
#define MAX_CHANNELS 256
|
||||
#define MAX_FILTERS 256
|
||||
|
||||
// Layer struct optimized for HLS
|
||||
struct Layer {
|
||||
ltype type;
|
||||
int height;
|
||||
int width;
|
||||
int channels;
|
||||
|
||||
union {
|
||||
struct {
|
||||
int num_filters;
|
||||
int filter_size;
|
||||
int stride;
|
||||
int zero_padding;
|
||||
int input_height;
|
||||
int input_width;
|
||||
int input_channels;
|
||||
weight_t weights[MAX_FILTERS][MAX_CHANNELS][MAX_FILTER_SIZE][MAX_FILTER_SIZE];
|
||||
data_t biases[MAX_FILTERS];
|
||||
} conv_params;
|
||||
|
||||
struct {
|
||||
int pool_size;
|
||||
int stride;
|
||||
int input_height;
|
||||
int input_width;
|
||||
} pool_params;
|
||||
|
||||
struct {
|
||||
int output_size;
|
||||
weight_t weights[MAX_LAYER_SIZE][MAX_LAYER_SIZE];
|
||||
data_t biases[MAX_LAYER_SIZE];
|
||||
activation type;
|
||||
} fc_params;
|
||||
} params;
|
||||
|
||||
data_t output[MAX_LAYER_SIZE];
|
||||
data_t delta[MAX_LAYER_SIZE];
|
||||
data_t pre_activation[MAX_LAYER_SIZE];
|
||||
};
|
||||
|
||||
// Helper functions
|
||||
data_t sigmoid(data_t x) {
|
||||
#pragma HLS INLINE
|
||||
return 1.0 / (1.0 + hls::exp(-x));
|
||||
}
|
||||
|
||||
data_t relu(data_t x) {
|
||||
#pragma HLS INLINE
|
||||
return (x > 0) ? x : 0;
|
||||
}
|
||||
|
||||
// Systolic array matrix multiplication for fully connected layers
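// Each pe_array entry acts as one processing element: PE i accumulates output[i]
// while the k loop streams the shared input[k] past all PEs; cyclic partitioning
// lets groups of 16 multiply-accumulates run in parallel each cycle.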
|
||||
void systolic_matrix_multiply(
|
||||
const weight_t weights[MAX_LAYER_SIZE][MAX_LAYER_SIZE],
|
||||
const data_t input[MAX_LAYER_SIZE],
|
||||
acc_t output[MAX_LAYER_SIZE],
|
||||
int M, int N) {
|
||||
|
||||
#pragma HLS PIPELINE II=1
|
||||
#pragma HLS ARRAY_PARTITION variable=weights cyclic factor=16 dim=2
|
||||
#pragma HLS ARRAY_PARTITION variable=input cyclic factor=16
|
||||
|
||||
static acc_t pe_array[MAX_LAYER_SIZE];
|
||||
#pragma HLS ARRAY_PARTITION variable=pe_array cyclic factor=16
|
||||
|
||||
// Initialize processing elements
|
||||
for (int i = 0; i < M; i++) {
|
||||
#pragma HLS UNROLL factor=16
|
||||
pe_array[i] = 0;
|
||||
}
|
||||
|
||||
// Systolic computation
|
||||
for (int k = 0; k < N; k++) {
|
||||
for (int i = 0; i < M; i++) {
|
||||
#pragma HLS PIPELINE II=1
|
||||
#pragma HLS UNROLL factor=16
|
||||
pe_array[i] += weights[i][k] * input[k];
|
||||
}
|
||||
}
|
||||
|
||||
// Write results
|
||||
for (int i = 0; i < M; i++) {
|
||||
#pragma HLS UNROLL factor=16
|
||||
output[i] = pe_array[i];
|
||||
}
|
||||
}
|
||||
|
||||
// Optimized convolution forward pass
|
||||
void conv_forward(Layer& layer, const data_t input[MAX_LAYER_SIZE]) {
|
||||
#pragma HLS INLINE off
|
||||
|
||||
const int padding = layer.params.conv_params.zero_padding;
|
||||
const int stride = layer.params.conv_params.stride;
|
||||
const int filter_size = layer.params.conv_params.filter_size;
|
||||
const int num_filters = layer.params.conv_params.num_filters;
|
||||
const int input_height = layer.params.conv_params.input_height;
|
||||
const int input_width = layer.params.conv_params.input_width;
|
||||
const int input_channels = layer.params.conv_params.input_channels;
|
||||
|
||||
// Create padded input buffer
|
||||
data_t padded_input[MAX_CHANNELS][MAX_FILTER_SIZE][MAX_FILTER_SIZE];
|
||||
#pragma HLS ARRAY_PARTITION variable=padded_input complete dim=1
|
||||
|
||||
const int padded_height = input_height + 2 * padding;
|
||||
const int padded_width = input_width + 2 * padding;
|
||||
const int output_height = (padded_height - filter_size) / stride + 1;
|
||||
const int output_width = (padded_width - filter_size) / stride + 1;
|
||||
|
||||
// Main convolution loops
|
||||
CONV_FILTERS: for(int f = 0; f < num_filters; f++) {
|
||||
CONV_OUTPUT_H: for(int oh = 0; oh < output_height; oh++) {
|
||||
CONV_OUTPUT_W: for(int ow = 0; ow < output_width; ow++) {
|
||||
#pragma HLS PIPELINE II=1
|
||||
|
||||
acc_t sum = 0;
|
||||
|
||||
CONV_CHANNELS: for(int c = 0; c < input_channels; c++) {
|
||||
CONV_KERNEL_H: for(int fh = 0; fh < filter_size; fh++) {
|
||||
CONV_KERNEL_W: for(int fw = 0; fw < filter_size; fw++) {
|
||||
#pragma HLS UNROLL factor=3
|
||||
|
||||
int ih = oh * stride + fh;
|
||||
int iw = ow * stride + fw;
|
||||
|
||||
// map back to the unpadded input; taps that fall in the zero-padding border contribute nothing
int ih_in = ih - padding;
int iw_in = iw - padding;
if (ih_in >= 0 && ih_in < input_height && iw_in >= 0 && iw_in < input_width) {
sum += input[c * input_height * input_width + ih_in * input_width + iw_in] *
layer.params.conv_params.weights[f][c][fh][fw];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
sum += layer.params.conv_params.biases[f];
|
||||
int output_idx = f * output_height * output_width + oh * output_width + ow;
|
||||
layer.pre_activation[output_idx] = sum;
|
||||
layer.output[output_idx] = relu(sum);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Optimized max pooling forward pass
|
||||
void maxpool_forward(Layer& layer, const data_t input[MAX_LAYER_SIZE]) {
|
||||
#pragma HLS INLINE off
|
||||
|
||||
const int pool_size = layer.params.pool_params.pool_size;
|
||||
const int stride = layer.params.pool_params.stride;
|
||||
const int input_height = layer.height;
|
||||
const int input_width = layer.width;
|
||||
const int input_channels = layer.channels;
|
||||
|
||||
const int output_height = (input_height - pool_size) / stride + 1;
|
||||
const int output_width = (input_width - pool_size) / stride + 1;
|
||||
|
||||
POOL_CHANNELS: for(int c = 0; c < input_channels; c++) {
|
||||
POOL_OUTPUT_H: for(int oh = 0; oh < output_height; oh++) {
|
||||
POOL_OUTPUT_W: for(int ow = 0; ow < output_width; ow++) {
|
||||
#pragma HLS PIPELINE II=1
|
||||
|
||||
data_t max_val = -INFINITY;
|
||||
|
||||
POOL_WINDOW_H: for(int ph = 0; ph < pool_size; ph++) {
|
||||
POOL_WINDOW_W: for(int pw = 0; pw < pool_size; pw++) {
|
||||
#pragma HLS UNROLL
|
||||
|
||||
int ih = oh * stride + ph;
|
||||
int iw = ow * stride + pw;
|
||||
data_t val = input[c * input_height * input_width + ih * input_width + iw];
|
||||
max_val = (val > max_val) ? val : max_val;
|
||||
}
|
||||
}
|
||||
|
||||
layer.output[c * output_height * output_width + oh * output_width + ow] = max_val;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Optimized fully connected forward pass using systolic array
|
||||
void fc_forward(Layer& layer, const data_t input[MAX_LAYER_SIZE]) {
|
||||
#pragma HLS INLINE off
|
||||
|
||||
const int output_size = layer.params.fc_params.output_size;
|
||||
const int input_size = layer.height * layer.width * layer.channels;
|
||||
|
||||
// Use systolic array for matrix multiplication
|
||||
acc_t temp_output[MAX_LAYER_SIZE];
|
||||
systolic_matrix_multiply(layer.params.fc_params.weights, input, temp_output, output_size, input_size);
|
||||
|
||||
// Add biases and apply activation
|
||||
FC_OUTPUT: for(int o = 0; o < output_size; o++) {
|
||||
#pragma HLS PIPELINE II=1
|
||||
|
||||
acc_t sum = temp_output[o] + layer.params.fc_params.biases[o];
|
||||
|
||||
if(layer.params.fc_params.type == a_sigmoid) {
|
||||
layer.pre_activation[o] = sum;
|
||||
layer.output[o] = sigmoid(sum);
|
||||
} else {
|
||||
layer.output[o] = sum; // For softmax, store raw values
|
||||
}
|
||||
}
|
||||
|
||||
// Apply softmax if needed
|
||||
if(layer.params.fc_params.type == a_softmax) {
|
||||
acc_t max_val = layer.output[0];
|
||||
acc_t sum = 0;
|
||||
|
||||
// Find max value for numerical stability
|
||||
SOFTMAX_MAX: for(int i = 1; i < output_size; i++) {
|
||||
#pragma HLS PIPELINE II=1
|
||||
max_val = (layer.output[i] > max_val) ? layer.output[i] : max_val;
|
||||
}
|
||||
|
||||
// Compute exponentials and sum
|
||||
SOFTMAX_EXP: for(int i = 0; i < output_size; i++) {
|
||||
#pragma HLS PIPELINE II=1
|
||||
layer.output[i] = hls::exp(layer.output[i] - max_val);
|
||||
sum += layer.output[i];
|
||||
}
|
||||
|
||||
// Normalize
|
||||
SOFTMAX_NORM: for(int i = 0; i < output_size; i++) {
|
||||
#pragma HLS PIPELINE II=1
|
||||
layer.output[i] /= sum;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Top-level function for HLS synthesis
|
||||
void cnn_forward(
|
||||
data_t input[MAX_LAYER_SIZE],
|
||||
data_t output[MAX_LAYER_SIZE],
|
||||
Layer layers[],
|
||||
int num_layers) {
|
||||
|
||||
#pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem0
|
||||
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1
|
||||
#pragma HLS INTERFACE m_axi port=layers offset=slave bundle=gmem2
|
||||
#pragma HLS INTERFACE s_axilite port=num_layers bundle=control
|
||||
#pragma HLS INTERFACE s_axilite port=return bundle=control
|
||||
|
||||
data_t layer_input[MAX_LAYER_SIZE];
|
||||
data_t layer_output[MAX_LAYER_SIZE];
|
||||
|
||||
// Copy input to local buffer
|
||||
memcpy(layer_input, input, MAX_LAYER_SIZE * sizeof(data_t));
|
||||
|
||||
// Process each layer
|
||||
LAYER_LOOP: for(int i = 0; i < num_layers; i++) {
|
||||
#pragma HLS LOOP_TRIPCOUNT min=1 max=20
|
||||
|
||||
Layer& current_layer = layers[i];
|
||||
|
||||
switch(current_layer.type) {
|
||||
case conv:
|
||||
conv_forward(current_layer, layer_input);
|
||||
break;
|
||||
case max_pool:
|
||||
maxpool_forward(current_layer, layer_input);
|
||||
break;
|
||||
case fully_connected:
|
||||
fc_forward(current_layer, layer_input);
|
||||
break;
|
||||
default:
|
||||
break;
|
||||
}
|
||||
|
||||
// Copy output to input buffer for next layer
|
||||
memcpy(layer_input, current_layer.output, MAX_LAYER_SIZE * sizeof(data_t));
|
||||
}
|
||||
|
||||
// Copy final output
|
||||
memcpy(output, layer_input, MAX_LAYER_SIZE * sizeof(data_t));
|
||||
}
|
cnn.c (656 lines changed)
@@ -1,85 +1,605 @@
|
||||
#include <stdio.h>
|
||||
// convolutional neural network c header library
|
||||
// inspired by euske's nn1
|
||||
// meant to be synthesized into RTL through Vitis HLS for an FPGA implementation
|
||||
|
||||
#include <stdlib.h>
|
||||
#include <math.h>
|
||||
#include <string.h>
|
||||
|
||||
typedef enum {
|
||||
input,
|
||||
conv,
|
||||
max_pool,
|
||||
fully_connected,
|
||||
output
|
||||
input,
|
||||
conv,
|
||||
max_pool,
|
||||
fully_connected
|
||||
} ltype;
|
||||
|
||||
typedef enum {
|
||||
relu,
|
||||
softmax,
|
||||
sigmoid,
|
||||
tanh
|
||||
fc_input,
|
||||
fc_hidden,
|
||||
fc_output,
|
||||
} fcpos;
|
||||
|
||||
typedef enum {
|
||||
a_sigmoid,
|
||||
a_softmax,
|
||||
} activation;
|
||||
|
||||
typedef struct {
|
||||
int filters;
|
||||
int filter_h;
|
||||
int filter_w;
|
||||
int stride;
|
||||
int zeropadding; // amount of zeropadding (1 = one layer... etc.)
|
||||
} convparams;
|
||||
ltype type;
|
||||
int height;
|
||||
int width;
|
||||
int channels; // in this case, "channels" are the number of filters that are coming in
|
||||
|
||||
typedef struct {
|
||||
int pool_height; // height and width of the pooling window
|
||||
int pool_width;
|
||||
} poolparams;
|
||||
union {
|
||||
struct {
|
||||
int num_filters;
|
||||
int filter_size; // single integer b/c filter will usually be square shaped
|
||||
int stride;
|
||||
int zero_padding; // single integer for how many layers of zero padding
|
||||
int input_height;
|
||||
int input_width;
|
||||
int input_channels;
|
||||
float (*weights);
|
||||
float (*biases);
|
||||
} conv_params;
|
||||
|
||||
typedef struct {
|
||||
ltype type;
|
||||
activation atype;
|
||||
struct {
|
||||
int pool_size; // single integer again
|
||||
int stride;
|
||||
int input_height;
|
||||
int input_width;
|
||||
} pool_params;
|
||||
|
||||
int input_height;
|
||||
int input_width;
|
||||
int input_channels;
|
||||
|
||||
int output_height;
|
||||
int output_width;
|
||||
int output_channels;
|
||||
|
||||
union {
|
||||
convparams layerconv;
|
||||
poolparams layerpool;
|
||||
} params;
|
||||
|
||||
float* weights;
|
||||
float* biases;
|
||||
struct {
|
||||
int output_size;
|
||||
float* weights;
|
||||
float* biases;
|
||||
activation type;
|
||||
} fc_params;
|
||||
} params;
|
||||
float* output;
|
||||
float* delta;
|
||||
float* pre_activation;
|
||||
float (*activation_g)(float);
|
||||
} Layer;
|
||||
|
||||
Layer* createlayer(ltype type, int height, int width, int channels, void* params) {
|
||||
Layer* layer = (Layer*)malloc(sizeof(Layer));
|
||||
layer->type = type;
|
||||
layer->input_height = height;
|
||||
layer->input_width = width;
|
||||
layer->input_channels = channels;
|
||||
typedef struct {
|
||||
Layer** layers;
|
||||
int num_layers;
|
||||
} Network;
|
||||
|
||||
layer->weights = NULL;
|
||||
layer->biases = NULL;
|
||||
|
||||
switch(type) {
|
||||
case input: {
|
||||
layer->output_height = input_height;
|
||||
layer->output_width = input_width;
|
||||
layer->output_channels = input_channels;
|
||||
layer->activation = relu;
|
||||
break;
|
||||
}
|
||||
case conv: {
|
||||
convparams* cparams = (convparams*)params;
|
||||
layer->params.layerconv = *cparams;
|
||||
layer->activation = relu;
|
||||
|
||||
// https://cs231n.github.io/convolutional-networks/#pool - formula to find dimensions
|
||||
layer->output_height = ((input_height + 2*conv_params->zero_padding - conv_params->filter_height) / conv_params->stride_height) + 1;
|
||||
layer->output_width = ((input_width + 2*conv_params->zero_padding - conv_params->filter_width) / conv_params->stride_width) + 1;
|
||||
|
||||
layer->output_channels = convparams->filters;
|
||||
|
||||
|
||||
}
|
||||
Network* create_network(int capacity) {
|
||||
Network* network = (Network*)malloc(sizeof(Network));
|
||||
network->layers = (Layer**)malloc(capacity * sizeof(Layer*));
|
||||
network->num_layers = capacity;
|
||||
return network;
|
||||
}
|
||||
|
||||
// He initialization: uniform in [-1, 1] scaled by sqrt(2 / fan_in), suited to ReLU layers
float he_init(int fan_in) {
|
||||
float scale = sqrt(2.0f / fan_in);
|
||||
float random = (float)rand() / RAND_MAX * 2 - 1;
|
||||
return random * scale;
|
||||
}
|
||||
|
||||
// Glorot/Xavier initialization: uniform in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))
float glorot_init(int fan_in, int fan_out) {
|
||||
float limit = sqrt(6.0f / (fan_in + fan_out));
|
||||
float random = (float)rand() / RAND_MAX;
|
||||
return random * 2 * limit - limit;
|
||||
}
|
||||
|
||||
float relu(float x) {
|
||||
return x > 0 ? x : 0;
|
||||
}
|
||||
|
||||
float sigmoid(float x) {
|
||||
return 1 / (1 + exp(-x));
|
||||
}
|
||||
|
||||
float relu_g(float x) {
|
||||
return x > 0 ? 1 : 0;
|
||||
}
|
||||
|
||||
float sigmoid_g(float x) {
|
||||
float sig = sigmoid(x);
|
||||
return sig * (1 - sig);
|
||||
}
|
||||
|
||||
void softmax(float* input, float* output, int size) {
|
||||
float max = input[0];
|
||||
for(int i = 1; i < size; i++) {
|
||||
if(input[i] > max) {
|
||||
max = input[i];
|
||||
}
|
||||
}
|
||||
float sum = 0;
|
||||
for(int i = 0; i < size; i++) {
|
||||
output[i] = exp(input[i] - max);
|
||||
sum += output[i];
|
||||
}
|
||||
for(int i = 0; i < size; i++) {
|
||||
output[i] /= sum;
|
||||
}
|
||||
}
|
||||
|
||||
Layer* create_input(int height, int width, int channels) {
|
||||
Layer* layer = (Layer*)malloc(sizeof(Layer));
|
||||
layer->type = input;
|
||||
layer->height = height;
|
||||
layer->width = width;
|
||||
layer->channels = channels;
|
||||
layer->output = (float*)calloc(height * width * channels, sizeof(float));
|
||||
return layer;
|
||||
}
|
||||
|
||||
Layer* create_conv(int input_height, int input_width, int input_channels, int num_filters, int filter_size, int stride, int padding) {
|
||||
Layer* layer = (Layer*)malloc(sizeof(Layer));
|
||||
layer->type = conv;
|
||||
layer->params.conv_params.num_filters = num_filters;
|
||||
layer->params.conv_params.filter_size = filter_size;
|
||||
layer->params.conv_params.stride = stride;
|
||||
layer->params.conv_params.zero_padding = padding;
|
||||
layer->params.conv_params.input_height = input_height;
|
||||
layer->params.conv_params.input_width = input_width;
|
||||
layer->params.conv_params.input_channels = input_channels;
|
||||
|
||||
// output dimensions
|
||||
// https://cs231n.github.io/convolutional-networks/
|
||||
int output_h = (input_height + 2 * padding - filter_size) / stride + 1;
|
||||
int output_w = (input_width + 2 * padding - filter_size) / stride + 1;
|
||||
layer->height = output_h;
|
||||
layer->width = output_w;
|
||||
layer->channels = num_filters;
|
||||
layer->activation_g = relu_g;
|
||||
|
||||
// conv layer uses relu, use HE init
|
||||
int weights_size = num_filters * input_channels * filter_size * filter_size;
|
||||
int fan_in = input_channels * filter_size * filter_size;
|
||||
layer->params.conv_params.weights = (float*)calloc(weights_size, sizeof(float));
|
||||
for (int i = 0; i < weights_size; i++) {
|
||||
layer->params.conv_params.weights[i] = he_init(fan_in);
|
||||
}
|
||||
|
||||
layer->params.conv_params.biases = (float*)calloc(num_filters, sizeof(float));
|
||||
|
||||
layer->output = (float*) calloc(output_h * output_w * num_filters, sizeof(float));
|
||||
layer->delta = (float*) calloc(output_h * output_w * num_filters, sizeof(float));
|
||||
layer->pre_activation = (float*)calloc(output_h * output_w * num_filters, sizeof(float));
|
||||
|
||||
return layer;
|
||||
}
|
||||
|
||||
Layer* create_maxpool(int input_height, int input_width, int input_channels, int pool_size, int stride) {
|
||||
Layer* layer = (Layer*)malloc(sizeof(Layer));
|
||||
layer->type = max_pool;
|
||||
layer->params.pool_params.pool_size = pool_size;
|
||||
layer->params.pool_params.stride = stride;
|
||||
layer->params.pool_params.input_height = input_height;
|
||||
layer->params.pool_params.input_width = input_width;
|
||||
|
||||
|
||||
// output dimensions
|
||||
// https://cs231n.github.io/convolutional-networks/
|
||||
int output_h = (input_height - pool_size) / stride + 1;
|
||||
int output_w = (input_width - pool_size) / stride + 1;
|
||||
layer->height = output_h;
|
||||
layer->width = output_w;
|
||||
layer->channels = input_channels;
|
||||
|
||||
layer->output = (float*) calloc(output_h * output_w * input_channels, sizeof(float));
|
||||
layer->delta = (float*) calloc(output_h * output_w * input_channels, sizeof(float));
|
||||
|
||||
return layer;
|
||||
}
|
||||
|
||||
Layer* create_fc(int output_size, int input_size, activation type) {
|
||||
Layer* layer = (Layer*)malloc(sizeof(Layer));
|
||||
layer->type = fully_connected;
|
||||
layer->params.fc_params.output_size = output_size;
|
||||
layer->params.fc_params.type = type; // activation type can either be sigmoid or softmax (output layer)
|
||||
layer->activation_g = (type == a_sigmoid) ? sigmoid_g : NULL; // null is softmax (doesnt have a gradient)
|
||||
|
||||
// use glorot initalization
|
||||
layer->params.fc_params.weights = (float*)calloc(output_size * input_size, sizeof(float));
|
||||
for (int i = 0; i < (output_size * input_size); i++) {
|
||||
layer->params.fc_params.weights[i] = glorot_init(input_size, output_size);
|
||||
}
|
||||
|
||||
layer->params.fc_params.biases = (float*)calloc(output_size, sizeof(float));
|
||||
|
||||
layer->height = 1;
|
||||
layer->width = 1;
|
||||
layer->channels = output_size;
|
||||
layer->output = (float*) calloc(output_size, sizeof(float));
|
||||
layer->delta = (float*) calloc(output_size, sizeof(float));
|
||||
layer->pre_activation = (float*) calloc(output_size, sizeof(float));
|
||||
|
||||
return layer;
|
||||
}
|
||||
|
||||
void free_layer(Layer* layer) {
|
||||
switch (layer->type) {
|
||||
case input:
|
||||
free(layer->output);
|
||||
free(layer);
|
||||
break;
|
||||
case conv:
|
||||
free(layer->params.conv_params.weights);
|
||||
free(layer->params.conv_params.biases);
|
||||
free(layer->output);
|
||||
free(layer->delta);
|
||||
free(layer->pre_activation);
|
||||
free(layer);
|
||||
break;
|
||||
case max_pool:
|
||||
free(layer->output);
|
||||
free(layer->delta);
|
||||
free(layer);
|
||||
break;
|
||||
case fully_connected:
|
||||
free(layer->params.fc_params.weights);
|
||||
free(layer->params.fc_params.biases);
|
||||
free(layer->output);
|
||||
free(layer->delta);
|
||||
free(layer->pre_activation);
|
||||
free(layer);
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
void destroy_network(Network* network) {
|
||||
if (!network) return;
|
||||
for (int i = 0; i < network->num_layers; i++) {
|
||||
if (network->layers[i]) {
|
||||
free_layer(network->layers[i]);
|
||||
}
|
||||
}
|
||||
free(network->layers);
|
||||
free(network);
|
||||
}
|
||||
|
||||
void conv_forward(Layer* layer, float* input) {
|
||||
int padding = layer->params.conv_params.zero_padding;
|
||||
int stride = layer->params.conv_params.stride;
|
||||
int filter_size = layer->params.conv_params.filter_size;
|
||||
int num_filters = layer->params.conv_params.num_filters;
|
||||
int input_height = layer->params.conv_params.input_height;
|
||||
int input_width = layer->params.conv_params.input_width;
|
||||
int input_channels = layer->params.conv_params.input_channels;
|
||||
|
||||
int padded_height = input_height + 2 * padding;
|
||||
int padded_width = input_width + 2 * padding;
|
||||
float* padded_input = (float*) calloc(padded_height * padded_width * input_channels, sizeof(float));
|
||||
|
||||
for (int c = 0; c < input_channels; c++) {
|
||||
for (int h = 0; h < input_height; h++) {
|
||||
for (int w = 0; w < input_width; w++) {
|
||||
padded_input[c * padded_height * padded_width + (h + padding) * padded_width + (w + padding)] = input[c * input_height * input_width + h * input_width + w];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
int output_height = (padded_height - filter_size) / stride + 1;
|
||||
int output_width = (padded_width - filter_size) / stride + 1;
|
||||
int output_size = output_height * output_width * num_filters;
|
||||
|
||||
// for every filter
|
||||
for(int f = 0; f < num_filters; f++) {
|
||||
for(int oh = 0; oh < output_height; oh++) {
|
||||
for(int ow = 0; ow < output_width; ow++) {
|
||||
float sum = 0;
|
||||
for(int c = 0; c < input_channels; c++) {
|
||||
for(int fh = 0; fh < filter_size; fh++) {
|
||||
for(int fw = 0; fw < filter_size; fw++) {
|
||||
int ih = oh * stride + fh;
|
||||
int iw = ow * stride + fw;
|
||||
|
||||
|
||||
if (ih >= 0 && ih < padded_height && iw >= 0 && iw < padded_width) {
|
||||
int input_idx = c * padded_height * padded_width + ih * padded_width + iw;
|
||||
int weight_idx = f * input_channels * filter_size * filter_size +
|
||||
c * filter_size * filter_size +
|
||||
fh * filter_size + fw;
|
||||
|
||||
sum += padded_input[input_idx] * layer->params.conv_params.weights[weight_idx];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
sum += layer->params.conv_params.biases[f];
|
||||
int output_idx = f * output_height * output_width + oh * output_width + ow;
|
||||
layer->pre_activation[output_idx] = sum;
|
||||
layer->output[output_idx] = relu(sum);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
free(padded_input);
|
||||
}
|
||||
|
||||
void maxpool_forward(Layer* layer, float* input) {
|
||||
int pool_size = layer->params.pool_params.pool_size;
|
||||
int stride = layer->params.pool_params.stride;
|
||||
// input dimensions of the previous layer (layer->height/width hold the pooled output size)
int input_height = layer->params.pool_params.input_height;
int input_width = layer->params.pool_params.input_width;
|
||||
int input_channels = layer->channels;
|
||||
|
||||
int output_height = (input_height - pool_size) / stride + 1;
|
||||
int output_width = (input_width - pool_size) / stride + 1;
|
||||
int output_size = output_height * output_width * input_channels;
|
||||
|
||||
for(int c = 0; c < input_channels; c++) {
|
||||
for(int oh = 0; oh < output_height; oh++) {
|
||||
for(int ow = 0; ow < output_width; ow++) {
|
||||
float max_val = -INFINITY;
|
||||
for(int ph = 0; ph < pool_size; ph++) {
|
||||
for(int pw = 0; pw < pool_size; pw++) {
|
||||
int ih = oh * stride + ph;
|
||||
int iw = ow * stride + pw;
|
||||
float val = input[c * input_height * input_width + ih * input_width + iw];
|
||||
if(val > max_val) {
|
||||
max_val = val;
|
||||
}
|
||||
}
|
||||
}
|
||||
layer->output[c * output_height * output_width + oh * output_width + ow] = max_val;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void fc_forward(Layer* layer, float* input) {
|
||||
int output_size = layer->params.fc_params.output_size;
|
||||
int input_size = layer->height * layer->width * layer->channels;
|
||||
|
||||
// flatten
|
||||
float* flattened_input = (float*) calloc(input_size, sizeof(float));
|
||||
for(int i = 0; i < input_size; i++) {
|
||||
flattened_input[i] = input[i];
|
||||
}
|
||||
|
||||
// matmul (output = bias + (input * weight))
|
||||
float* temp_output = (float*) calloc(output_size, sizeof(float));
|
||||
for(int o = 0; o < output_size; o++) {
|
||||
float sum = 0;
|
||||
for(int i = 0; i < input_size; i++) {
|
||||
sum += flattened_input[i] * layer->params.fc_params.weights[o * input_size + i];
|
||||
}
|
||||
sum += layer->params.fc_params.biases[o];
|
||||
temp_output[o] = sum;
|
||||
}
|
||||
|
||||
// apply the correct activation (sigmoid for non output layers, softmax for output)
|
||||
if(layer->params.fc_params.type == a_sigmoid) {
|
||||
for(int o = 0; o < output_size; o++) {
|
||||
layer->pre_activation[o] = temp_output[o];
|
||||
layer->output[o] = sigmoid(temp_output[o]);
|
||||
}
|
||||
} else if(layer->params.fc_params.type == a_softmax) {
|
||||
softmax(temp_output, layer->output, output_size);
|
||||
}
|
||||
|
||||
free(temp_output);
|
||||
free(flattened_input);
|
||||
}
|
||||
|
||||
void forward_propagation(Layer* layer, float* input_fc) {
|
||||
int input_size;
|
||||
switch(layer->type) {
|
||||
case input:
|
||||
// input to layer->output
|
||||
input_size = (layer->height * layer->width * layer->channels);
|
||||
for(int i = 0; i < input_size; i++) {
|
||||
layer->output[i] = input_fc[i];
|
||||
}
|
||||
break;
|
||||
case conv:
|
||||
conv_forward(layer, input_fc);
|
||||
break;
|
||||
case max_pool:
|
||||
maxpool_forward(layer, input_fc);
|
||||
break;
|
||||
case fully_connected:
|
||||
fc_forward(layer, input_fc);
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
void network_forward(Network* network, float* input) {
|
||||
float* current_input = input;
|
||||
for (int i = 0; i < network->num_layers; i++) {
|
||||
forward_propagation(network->layers[i], current_input);
|
||||
current_input = network->layers[i]->output;
|
||||
}
|
||||
}
|
||||
|
||||
void fc_backward(Layer* layer, float* prev_delta, float* input, float learning_rate) {
|
||||
int output_size = layer->params.fc_params.output_size;
|
||||
int input_size = layer->height * layer->width * layer->channels;
|
||||
|
||||
float* gradient;
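// softmax + cross-entropy: the gradient w.r.t. the pre-softmax logits is (softmax output - one-hot label)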
|
||||
if(layer->params.fc_params.type == a_softmax) {
|
||||
gradient = (float*)malloc(output_size * sizeof(float));
|
||||
for(int i = 0; i < output_size; i++) {
|
||||
gradient[i] = layer->output[i];
|
||||
if(prev_delta[i] > 0.5) { // one hot encoded
|
||||
gradient[i] -= 1.0;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
gradient = prev_delta;
|
||||
}
|
||||
|
||||
// update weights and biases
|
||||
for(int o = 0; o < output_size; o++) {
|
||||
for(int i = 0; i < input_size; i++) {
|
||||
layer->params.fc_params.weights[o * input_size + i] -=
|
||||
learning_rate * gradient[o] * input[i];
|
||||
}
|
||||
layer->params.fc_params.biases[o] -= learning_rate * gradient[o];
|
||||
}
|
||||
|
||||
// gradient
|
||||
if(layer->activation_g) {
|
||||
for(int i = 0; i < input_size; i++) {
|
||||
float sum = 0;
|
||||
for(int o = 0; o < output_size; o++) {
|
||||
sum += layer->params.fc_params.weights[o * input_size + i] * gradient[o];
|
||||
}
|
||||
layer->delta[i] = sum * layer->activation_g(layer->pre_activation[i]);
|
||||
}
|
||||
}
|
||||
|
||||
if(layer->params.fc_params.type == a_softmax) {
|
||||
free(gradient);
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
|
||||
void conv_backward(Layer* layer, float* prev_delta, float* input, float learning_rate) {
|
||||
int num_filters = layer->params.conv_params.num_filters;
|
||||
int channels = layer->channels;
|
||||
int filter_size = layer->params.conv_params.filter_size;
|
||||
int input_height = layer->height;
|
||||
int input_width = layer->width;
|
||||
int padding = layer->params.conv_params.zero_padding;
|
||||
int stride = layer->params.conv_params.stride;
|
||||
int output_height = (input_height + 2 * padding - filter_size) / stride + 1;
|
||||
int output_width = (input_width + 2 * padding - filter_size) / stride + 1;
|
||||
|
||||
// gradient w/respect to filters
|
||||
for(int f = 0; f < num_filters; f++) {
|
||||
for(int c = 0; c < channels; c++) {
|
||||
for(int fh = 0; fh < filter_size; fh++) {
|
||||
for(int fw = 0; fw < filter_size; fw++) {
|
||||
float grad = 0;
|
||||
for(int oh = 0; oh < output_height; oh++) {
|
||||
for(int ow = 0; ow < output_width; ow++) {
|
||||
int ih = oh * stride + fh - padding;
|
||||
int iw = ow * stride + fw - padding;
|
||||
if(ih >= 0 && ih < input_height && iw >= 0 && iw < input_width) {
|
||||
grad += input[c * input_height * input_width + ih * input_width + iw] * prev_delta[f * output_height * output_width + oh * output_width + ow];
|
||||
}
|
||||
}
|
||||
}
|
||||
int index = f * channels * filter_size * filter_size + c * filter_size * filter_size + fh * filter_size + fw;
|
||||
layer->params.conv_params.weights[index] -= learning_rate * grad;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// gradient w/respect to biases
|
||||
for(int f = 0; f < num_filters; f++) {
|
||||
float grad = 0;
|
||||
for(int oh = 0; oh < output_height; oh++) {
|
||||
for(int ow = 0; ow < output_width; ow++) {
|
||||
grad += prev_delta[f * output_height * output_width + oh * output_width + ow];
|
||||
}
|
||||
}
|
||||
layer->params.conv_params.biases[f] -= learning_rate * grad;
|
||||
}
|
||||
|
||||
// gradient with respect to inputs
|
||||
for(int c = 0; c < channels; c++) {
|
||||
for(int ih = 0; ih < input_height; ih++) {
|
||||
for(int iw = 0; iw < input_width; iw++) {
|
||||
float grad = 0;
|
||||
for(int f = 0; f < num_filters; f++) {
|
||||
for(int fh = 0; fh < filter_size; fh++) {
|
||||
for(int fw = 0; fw < filter_size; fw++) {
|
||||
int oh = (ih - fh + padding) / stride;
|
||||
int ow = (iw - fw + padding) / stride;
|
||||
if((ih - fh + padding) % stride == 0 && (iw - fw + padding) % stride == 0 && oh < output_height && ow < output_width) {
|
||||
int w_index = f * channels * filter_size * filter_size + c * filter_size * filter_size + fh * filter_size + fw;
|
||||
grad += layer->params.conv_params.weights[w_index] * prev_delta[f * output_height * output_width + oh * output_width + ow];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
layer->delta[c * input_height * input_width + ih * input_width + iw] = grad * layer->activation_g(layer->pre_activation[c * input_height * input_width + ih * input_width + iw]);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void maxpool_backward(Layer* layer, float* prev_delta, float* input, float learning_rate) {
|
||||
int pool_size = layer->params.pool_params.pool_size;
|
||||
int stride = layer->params.pool_params.stride;
|
||||
int input_height = layer->params.pool_params.input_height;
|
||||
int input_width = layer->params.pool_params.input_width;
|
||||
int channels = layer->channels;
|
||||
|
||||
// Zero initialize deltas
|
||||
memset(layer->delta, 0, input_height * input_width * channels * sizeof(float));
|
||||
|
||||
int output_height = layer->height;
|
||||
int output_width = layer->width;
|
||||
|
||||
for(int c = 0; c < channels; c++) {
|
||||
for(int oh = 0; oh < output_height; oh++) {
|
||||
for(int ow = 0; ow < output_width; ow++) {
|
||||
// finds max value
|
||||
int maxI = -1, maxJ = -1;
|
||||
float maxVal = -INFINITY;
|
||||
|
||||
for(int ph = 0; ph < pool_size; ph++) {
|
||||
for(int pw = 0; pw < pool_size; pw++) {
|
||||
int ih = oh * stride + ph;
|
||||
int iw = ow * stride + pw;
|
||||
|
||||
// checks bounds
|
||||
if (ih < input_height && iw < input_width) {
|
||||
float val = input[c * input_height * input_width + ih * input_width + iw];
|
||||
if(val > maxVal) {
|
||||
maxVal = val;
|
||||
maxI = ih;
|
||||
maxJ = iw;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// only propagate gradient if a valid max position is found
|
||||
if(maxI != -1 && maxJ != -1) {
|
||||
int delta_idx = c * output_height * output_width + oh * output_width + ow;
|
||||
layer->delta[c * input_height * input_width + maxI * input_width + maxJ] =
|
||||
prev_delta[delta_idx];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void backward_propagation(Layer* layer, float* prev_delta, float* input_fc, float learning_rate) {
|
||||
switch(layer->type) {
|
||||
case fully_connected:
|
||||
fc_backward(layer, prev_delta, input_fc, learning_rate);
|
||||
break;
|
||||
case conv:
|
||||
conv_backward(layer, prev_delta, input_fc, learning_rate);
|
||||
break;
|
||||
case max_pool:
|
||||
maxpool_backward(layer, prev_delta, input_fc, learning_rate);
|
||||
break;
|
||||
case input:
|
||||
// No backpropagation for input layer
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
void network_backward(Network* network, float* label, float learning_rate) {
|
||||
// output layer
|
||||
Layer* output_layer = network->layers[network->num_layers - 1];
|
||||
// output gradient
|
||||
for(int o = 0; o < output_layer->channels; o++) {
|
||||
output_layer->delta[o] = output_layer->output[o] - label[o];
|
||||
}
|
||||
// backprop
|
||||
for(int i = network->num_layers - 2; i >= 0; i--) {
|
||||
Layer* current_layer = network->layers[i];
|
||||
Layer* next_layer = network->layers[i + 1];
|
||||
backward_propagation(current_layer, next_layer->delta, current_layer->output, learning_rate);
|
||||
}
|
||||
}
|
||||
|
mnist.c (189 lines, new file)
@@ -0,0 +1,189 @@
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include <string.h>
|
||||
#include "cnn.c"
|
||||
|
||||
#define IMG_HEIGHT 28
|
||||
#define IMG_WIDTH 28
|
||||
#define NUM_CLASSES 10
|
||||
#define BATCH_SIZE 32
|
||||
#define LEARNING_RATE 0.01
|
||||
#define NUM_EPOCHS 10
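// MNIST IDX files store their header integers in big-endian order:
// an image file is {magic 2051, image count, rows, cols} followed by raw pixel bytes,
// and a label file is {magic 2049, label count} followed by raw label bytes.
// The readers below byte-swap each header word for little-endian hosts,
// normalize pixels to [0, 1], and one-hot encode the labels.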
|
||||
|
||||
float* read_mnist_images(const char* filename, int* num_images) {
|
||||
FILE* fp = fopen(filename, "rb");
|
||||
if (!fp) {
|
||||
printf("Error opening file %s\n", filename);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
int magic_number = 0;
|
||||
fread(&magic_number, sizeof(int), 1, fp);
|
||||
magic_number = ((magic_number & 0xff000000) >> 24) |
|
||||
((magic_number & 0x00ff0000) >> 8) |
|
||||
((magic_number & 0x0000ff00) << 8) |
|
||||
((magic_number & 0x000000ff) << 24);
|
||||
|
||||
if (magic_number != 2051) {
|
||||
printf("Invalid MNIST image file format\n");
|
||||
fclose(fp);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
fread(num_images, sizeof(int), 1, fp);
|
||||
*num_images = ((*num_images & 0xff000000) >> 24) |
|
||||
((*num_images & 0x00ff0000) >> 8) |
|
||||
((*num_images & 0x0000ff00) << 8) |
|
||||
((*num_images & 0x000000ff) << 24);
|
||||
|
||||
int rows, cols;
|
||||
fread(&rows, sizeof(int), 1, fp);
|
||||
fread(&cols, sizeof(int), 1, fp);
|
||||
rows = ((rows & 0xff000000) >> 24) |
|
||||
((rows & 0x00ff0000) >> 8) |
|
||||
((rows & 0x0000ff00) << 8) |
|
||||
((rows & 0x000000ff) << 24);
|
||||
cols = ((cols & 0xff000000) >> 24) |
|
||||
((cols & 0x00ff0000) >> 8) |
|
||||
((cols & 0x0000ff00) << 8) |
|
||||
((cols & 0x000000ff) << 24);
|
||||
|
||||
if (rows != IMG_HEIGHT || cols != IMG_WIDTH) {
|
||||
printf("Invalid image dimensions\n");
|
||||
fclose(fp);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
float* images = (float*)malloc(*num_images * IMG_HEIGHT * IMG_WIDTH * sizeof(float));
|
||||
unsigned char* temp = (unsigned char*)malloc(IMG_HEIGHT * IMG_WIDTH);
|
||||
|
||||
for (int i = 0; i < *num_images; i++) {
|
||||
fread(temp, 1, IMG_HEIGHT * IMG_WIDTH, fp);
|
||||
for (int j = 0; j < IMG_HEIGHT * IMG_WIDTH; j++) {
|
||||
images[i * IMG_HEIGHT * IMG_WIDTH + j] = temp[j] / 255.0f;
|
||||
}
|
||||
}
|
||||
|
||||
free(temp);
|
||||
fclose(fp);
|
||||
return images;
|
||||
}
|
||||
|
||||
float* read_mnist_labels(const char* filename, int* num_labels) {
|
||||
FILE* fp = fopen(filename, "rb");
|
||||
if (!fp) {
|
||||
printf("Error opening file %s\n", filename);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
int magic_number = 0;
|
||||
fread(&magic_number, sizeof(int), 1, fp);
|
||||
magic_number = ((magic_number & 0xff000000) >> 24) |
|
||||
((magic_number & 0x00ff0000) >> 8) |
|
||||
((magic_number & 0x0000ff00) << 8) |
|
||||
((magic_number & 0x000000ff) << 24);
|
||||
|
||||
if (magic_number != 2049) {
|
||||
printf("Invalid MNIST label file format\n");
|
||||
fclose(fp);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
fread(num_labels, sizeof(int), 1, fp);
|
||||
*num_labels = ((*num_labels & 0xff000000) >> 24) |
|
||||
((*num_labels & 0x00ff0000) >> 8) |
|
||||
((*num_labels & 0x0000ff00) << 8) |
|
||||
((*num_labels & 0x000000ff) << 24);
|
||||
|
||||
float* labels = (float*)calloc(*num_labels * NUM_CLASSES, sizeof(float));
|
||||
unsigned char* temp = (unsigned char*)malloc(*num_labels);
|
||||
|
||||
fread(temp, 1, *num_labels, fp);
|
||||
for (int i = 0; i < *num_labels; i++) {
|
||||
labels[i * NUM_CLASSES + temp[i]] = 1.0f;
|
||||
}
|
||||
|
||||
free(temp);
|
||||
fclose(fp);
|
||||
return labels;
|
||||
}
|
||||
|
||||
int main() {
|
||||
// load mnist
|
||||
int num_train_images, num_train_labels;
|
||||
float* train_images = read_mnist_images("train-images-idx3-ubyte", &num_train_images);
|
||||
float* train_labels = read_mnist_labels("train-labels-idx1-ubyte", &num_train_labels);
|
||||
|
||||
// creating a lenet-5 inspired network
|
||||
Network* network = create_network(8);
|
||||
network->layers[0] = create_input(IMG_HEIGHT, IMG_WIDTH, 1);
|
||||
network->layers[1] = create_conv(IMG_HEIGHT, IMG_WIDTH, 1, 6, 5, 1, 2);
|
||||
network->layers[2] = create_maxpool(network->layers[1]->height, network->layers[1]->width, network->layers[1]->channels, 2, 2);
|
||||
network->layers[3] = create_conv(network->layers[2]->height, network->layers[2]->width, network->layers[2]->channels, 16, 5, 1, 0);
|
||||
network->layers[4] = create_maxpool(network->layers[3]->height, network->layers[3]->width, network->layers[3]->channels, 2, 2);
|
||||
network->layers[5] = create_fc(120, network->layers[4]->height * network->layers[4]->width * network->layers[4]->channels, a_sigmoid);
|
||||
network->layers[6] = create_fc(84, 120, a_sigmoid);
|
||||
network->layers[7] = create_fc(NUM_CLASSES, 84, a_softmax);
|
||||
|
||||
// training loop
|
||||
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
|
||||
float total_loss = 0.0f;
|
||||
int correct = 0;
|
||||
|
||||
for (int i = 0; i < num_train_images; i++) {
|
||||
// forward pass
|
||||
network_forward(network, &train_images[i * IMG_HEIGHT * IMG_WIDTH]);
|
||||
|
||||
// accuracy
|
||||
float* output = network->layers[network->num_layers - 1]->output;
|
||||
int predicted = 0;
|
||||
float max_prob = output[0];
|
||||
for (int j = 1; j < NUM_CLASSES; j++) {
|
||||
if (output[j] > max_prob) {
|
||||
max_prob = output[j];
|
||||
predicted = j;
|
||||
}
|
||||
}
|
||||
|
||||
int true_label = 0;
|
||||
for (int j = 0; j < NUM_CLASSES; j++) {
|
||||
if (train_labels[i * NUM_CLASSES + j] > 0.5f) {
|
||||
true_label = j;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
if (predicted == true_label) correct++;
|
||||
|
||||
// backprop
|
||||
network_backward(network, &train_labels[i * NUM_CLASSES], LEARNING_RATE);
|
||||
|
||||
// cross entropy loss
|
||||
float loss = 0.0f;
|
||||
for (int j = 0; j < NUM_CLASSES; j++) {
|
||||
if (train_labels[i * NUM_CLASSES + j] > 0.5f) {
|
||||
loss -= log(output[j] + 1e-10);
|
||||
}
|
||||
}
|
||||
total_loss += loss;
|
||||
|
||||
// progress
|
||||
if ((i + 1) % 100 == 0) {
|
||||
printf("Epoch %d/%d, Step %d/%d, Loss: %.4f, Accuracy: %.2f%%\n",
|
||||
epoch + 1, NUM_EPOCHS, i + 1, num_train_images,
|
||||
total_loss / (i + 1), 100.0f * correct / (i + 1));
|
||||
}
|
||||
}
|
||||
|
||||
printf("Epoch %d/%d completed, Average Loss: %.4f, Accuracy: %.2f%%\n",
|
||||
epoch + 1, NUM_EPOCHS, total_loss / num_train_images,
|
||||
100.0f * correct / num_train_images);
|
||||
}
|
||||
|
||||
// Clean up
|
||||
free(train_images);
|
||||
free(train_labels);
|
||||
destroy_network(network);
|
||||
|
||||
return 0;
|
||||
}
|
train-images.idx3-ubyte (BIN, new file): Binary file not shown.
train-labels.idx1-ubyte (BIN, new file): Binary file not shown.