Compare commits

...

10 Commits

Author SHA1 Message Date
bada43da23 Update README.md 2025-01-17 08:09:57 +01:00
cc552a63a8 update readme 2025-01-17 01:08:40 -06:00
90207ab834 edit readme 2025-01-17 00:26:17 -06:00
cc944d8cc3 push to git 2025-01-16 22:03:28 -06:00
40a2b072b8 adds backprop (unfinished) 2025-01-14 18:47:35 -06:00
f549f9440c added forward prop 2025-01-12 13:31:31 -06:00
8eab98586c refactor. simplify code - only gonna use mnist 2025-01-12 09:56:33 -06:00
576cb9c490 free layer 2025-01-11 16:04:23 -06:00
6521badd2d create layers functions 2025-01-11 15:56:14 -06:00
8c48894c42 create layer struct 2025-01-11 14:36:46 -06:00
6 changed files with 1123 additions and 79 deletions

README.md

@@ -1,15 +1,40 @@
# nn - Neural Networks in C
This repository implements various neural networks in C, focusing mainly on targeting embedded systems or creating hardware accelerators (FPGA-based, ASIC, etc.). \
This project was created as part of my independent study course, where I am currently researching the design of hardware accelerators for high-performance workloads.

### Current Implementations (project index)
`snn.c` - A simple feedforward neural network written in ~150 LOC. Depends on C native libraries and [GSL](https://www.gnu.org/software/gsl/) \
`cnn.c` - Implements a fully featured CNN library in ~600 LOC. Depends solely on C native libraries \
`cnn-hls.c` - The version of `cnn.c` with HLS-specific optimizations (pragmas, systolic array multiplication, etc.); aims to be synthesized through Vitis HLS to create an FPGA-based CNN accelerator \
`mnist.c` - Driver code for `cnn.c` which trains on the [MNIST](https://yann.lecun.com/exdb/mnist/) dataset

### Usage
`mnist.c` is a great example of how the library is used, but basic usage boils down to a few simple things:

1) Importing `cnn.c` into your code
2) Creating a network and creating layers:
```c
// an example of a lenet-5 inspired 8 layer network
Network* network = create_network(8);
network->layers[0] = create_input(IMG_HEIGHT, IMG_WIDTH, 1);
network->layers[1] = create_conv(IMG_HEIGHT, IMG_WIDTH, 1, 6, 5, 1, 2);
network->layers[2] = create_maxpool(network->layers[1]->height, network->layers[1]->width, network->layers[1]->channels, 2, 2);
network->layers[3] = create_conv(network->layers[2]->height, network->layers[2]->width, network->layers[2]->channels, 16, 5, 1, 0);
network->layers[4] = create_maxpool(network->layers[3]->height, network->layers[3]->width, network->layers[3]->channels, 2, 2);
network->layers[5] = create_fc(120, network->layers[4]->height * network->layers[4]->width * network->layers[4]->channels, a_sigmoid);
network->layers[6] = create_fc(84, 120, a_sigmoid);
network->layers[7] = create_fc(NUM_CLASSES, 84, a_softmax);
```
3) Forward and backpropagation through the network (see the sketch below)!
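
For step 3, a minimal training-step sketch might look like the following. It assumes `images`, `labels`, and `num_images` have already been loaded and normalized the way `mnist.c` does (those names are stand-ins for `train_images`, `train_labels`, and `num_train_images`), and that `IMG_HEIGHT`, `IMG_WIDTH`, `NUM_CLASSES`, and `LEARNING_RATE` are defined as in that driver:
```c
// one pass of SGD over the training set, mirroring the loop in mnist.c
for (int i = 0; i < num_images; i++) {
    // forward pass: each layer's output feeds the next layer
    network_forward(network, &images[i * IMG_HEIGHT * IMG_WIDTH]);

    // backward pass: computes deltas and applies the weight updates in place
    network_backward(network, &labels[i * NUM_CLASSES], LEARNING_RATE);
}

// class probabilities live in the output of the final (softmax) layer
float* probs = network->layers[network->num_layers - 1]->output;
```
Once training is done, call `destroy_network(network)` to release all layer buffers.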
## Project Information
### Abstract
For my project, I propose an implementation of a Convolutional Neural Network based handwritten digit classifier using the MNIST dataset on a Field Programmable Gate Array (FPGA). I utilize a High Level Synthesis (HLS) tool called Vitis HLS, developed by [AMD/Xilinx](https://www.xilinx.com/products/boards-and-kits.html), to implement the accelerator in C, eliminating the need to write any code in HDL languages such as Verilog/VHDL. To reduce any performance losses, I implement a systolic array based architecture and utilize techniques such as pipelining, loop unrolling, and memory partitioning. Through this project, I aim to highlight the feasibility and viability of FPGAs for low latency, highly energy efficient machine learning workloads, possibly placing them in consideration as a replacement for GPUs for inference-based tasks.
### What is an FPGA?
A Field Programmable Gate Array, or FPGA, is a type of integrated circuit made up of a massive collection of unconnected digital logic blocks. When someone designs *gateware* for an FPGA, they are essentially connecting these logic blocks together in a way that creates a new piece of hardware. FPGAs are also "field programmable," meaning that they can be reconfigured on the fly as the designer's needs change. While often used as tools for rapidly prototyping hardware designs, the highly specialized and customizable nature of an FPGA design also allows it to achieve very low latency, high throughput, and very high energy efficiency.
#### What is High Level Synthesis (HLS)?
High Level Synthesis is a method of designing gateware that allows a programmer to describe the hardware in a higher-level language like C, C++, or even [Python](https://fastmachinelearning.org/hls4ml/). A High Level Synthesis tool takes this description of the hardware's intended function and synthesizes it into RTL code (such as Verilog or VHDL). Since writing in languages such as Verilog can be tedious and time consuming, HLS serves as an alternative for designers who want to efficiently build and verify hardware in a language that is much easier to write, and it also invites programmers with no HDL experience to start developing hardware. In this project, I chose HLS not only to work within my time constraint, but also to evaluate how approachable the HLS workflow truly is for an individual with little to no experience in HDL languages. A rough sketch of what HLS C code looks like is shown below.
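As an illustration only (this snippet is not part of the repository, and `dot_product` is a hypothetical function), a small fixed-point dot product written for HLS might look like the following; the pragmas are hints that tell the tool how to partition the arrays and pipeline/unroll the loop when it generates RTL:
```c
// illustrative HLS-style C: a fixed-point dot product with common Vitis HLS pragmas
#include "ap_fixed.h"

typedef ap_fixed<16,8> data_t;   // 16-bit fixed point, 8 integer bits
typedef ap_fixed<32,16> acc_t;   // wider accumulator to avoid overflow

acc_t dot_product(const data_t a[128], const data_t b[128]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8
    acc_t sum = 0;
    DOT: for (int i = 0; i < 128; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=8
        sum += a[i] * b[i];  // after unrolling, 8 multiply-accumulates issue per iteration
    }
    return sum;
}
```
`cnn-hls.c` below applies the same kinds of pragmas (plus a systolic-array formulation) to the full network.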
### Reflection and Next Steps
This project was an amazing way to get involved with both the FPGA and hardware/accelerator design space. I was able to gain a lot of hands-on experience with the workflow for developing gateware for an FPGA, and I also gained insight into performance optimization concepts and methods such as systolic arrays, loop pipelining/unrolling, and code inlining. Furthermore, I was able to work more with the mathematics and theory behind Deep Learning and Neural Networks, which is valuable knowledge to have given the rapid development of artificial intelligence. The next steps of this project include cleaning up and optimizing the code, and possibly implementing quantization, batch normalization, and other types of layers such as residual blocks to further improve the network's performance. On the hardware side, next steps include obtaining a physical FPGA development board to actually deploy this design onto, and possibly rewriting the code without HLS, building the neural network from scratch in an HDL language such as Verilog.

cnn-hls.c (new file, 310 lines)

@@ -0,0 +1,310 @@
#include "ap_fixed.h"
#include "hls_stream.h"
#include "hls_math.h"
#include <string.h>
// Fixed point definitions for better hardware efficiency
typedef ap_fixed<16,8> data_t; // 16 bits total, 8 integer bits
typedef ap_fixed<16,8> weight_t;
typedef ap_fixed<32,16> acc_t; // Wider accumulator to prevent overflow
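// note: ap_fixed<W,I> is a signed fixed-point type with W total bits and I integer bits (sign included),
// so data_t covers roughly [-128, 127.996] in steps of 1/256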
// Enums remain the same
typedef enum {
input,
conv,
max_pool,
fully_connected
} ltype;
typedef enum {
fc_input,
fc_hidden,
fc_output,
} fcpos;
typedef enum {
a_sigmoid,
a_softmax,
} activation;
// Maximum size definitions for static arrays
#define MAX_LAYER_SIZE 1024
#define MAX_FILTER_SIZE 11
#define MAX_CHANNELS 256
#define MAX_FILTERS 256
// Layer struct optimized for HLS
struct Layer {
ltype type;
int height;
int width;
int channels;
union {
struct {
int num_filters;
int filter_size;
int stride;
int zero_padding;
int input_height;
int input_width;
int input_channels;
weight_t weights[MAX_FILTERS][MAX_CHANNELS][MAX_FILTER_SIZE][MAX_FILTER_SIZE];
data_t biases[MAX_FILTERS];
} conv_params;
struct {
int pool_size;
int stride;
int input_height;
int input_width;
} pool_params;
struct {
int output_size;
weight_t weights[MAX_LAYER_SIZE][MAX_LAYER_SIZE];
data_t biases[MAX_LAYER_SIZE];
activation type;
} fc_params;
} params;
data_t output[MAX_LAYER_SIZE];
data_t delta[MAX_LAYER_SIZE];
data_t pre_activation[MAX_LAYER_SIZE];
};
// Helper functions
data_t sigmoid(data_t x) {
#pragma HLS INLINE
return 1.0 / (1.0 + hls::exp(-x));
}
data_t relu(data_t x) {
#pragma HLS INLINE
return (x > 0) ? x : 0;
}
// Systolic array matrix multiplication for fully connected layers
void systolic_matrix_multiply(
const weight_t weights[MAX_LAYER_SIZE][MAX_LAYER_SIZE],
const data_t input[MAX_LAYER_SIZE],
acc_t output[MAX_LAYER_SIZE],
int M, int N) {
#pragma HLS PIPELINE II=1
#pragma HLS ARRAY_PARTITION variable=weights cyclic factor=16 dim=2
#pragma HLS ARRAY_PARTITION variable=input cyclic factor=16
static acc_t pe_array[MAX_LAYER_SIZE];
#pragma HLS ARRAY_PARTITION variable=pe_array cyclic factor=16
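// pe_array models a 1-D row of processing elements: each PE accumulates one output element
// while the current input value is broadcast across the row on every iteration of k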
// Initialize processing elements
for (int i = 0; i < M; i++) {
#pragma HLS UNROLL factor=16
pe_array[i] = 0;
}
// Systolic computation
for (int k = 0; k < N; k++) {
for (int i = 0; i < M; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=16
pe_array[i] += weights[i][k] * input[k];
}
}
// Write results
for (int i = 0; i < M; i++) {
#pragma HLS UNROLL factor=16
output[i] = pe_array[i];
}
}
// Optimized convolution forward pass
void conv_forward(Layer& layer, const data_t input[MAX_LAYER_SIZE]) {
#pragma HLS INLINE off
const int padding = layer.params.conv_params.zero_padding;
const int stride = layer.params.conv_params.stride;
const int filter_size = layer.params.conv_params.filter_size;
const int num_filters = layer.params.conv_params.num_filters;
const int input_height = layer.params.conv_params.input_height;
const int input_width = layer.params.conv_params.input_width;
const int input_channels = layer.params.conv_params.input_channels;
// zero padding is handled implicitly by the bounds check in the inner loop; no padded copy is materialized
const int padded_height = input_height + 2 * padding;
const int padded_width = input_width + 2 * padding;
const int output_height = (padded_height - filter_size) / stride + 1;
const int output_width = (padded_width - filter_size) / stride + 1;
// Main convolution loops
CONV_FILTERS: for(int f = 0; f < num_filters; f++) {
CONV_OUTPUT_H: for(int oh = 0; oh < output_height; oh++) {
CONV_OUTPUT_W: for(int ow = 0; ow < output_width; ow++) {
#pragma HLS PIPELINE II=1
acc_t sum = 0;
CONV_CHANNELS: for(int c = 0; c < input_channels; c++) {
CONV_KERNEL_H: for(int fh = 0; fh < filter_size; fh++) {
CONV_KERNEL_W: for(int fw = 0; fw < filter_size; fw++) {
#pragma HLS UNROLL factor=3
// map the output position back into the unpadded input; taps that fall in the zero padding contribute nothing
int ih = oh * stride + fh - padding;
int iw = ow * stride + fw - padding;
if (ih >= 0 && ih < input_height && iw >= 0 && iw < input_width) {
sum += input[c * input_height * input_width + ih * input_width + iw] *
layer.params.conv_params.weights[f][c][fh][fw];
}
}
}
}
}
sum += layer.params.conv_params.biases[f];
int output_idx = f * output_height * output_width + oh * output_width + ow;
layer.pre_activation[output_idx] = sum;
layer.output[output_idx] = relu(sum);
}
}
}
}
// Optimized max pooling forward pass
void maxpool_forward(Layer& layer, const data_t input[MAX_LAYER_SIZE]) {
#pragma HLS INLINE off
const int pool_size = layer.params.pool_params.pool_size;
const int stride = layer.params.pool_params.stride;
const int input_height = layer.height;
const int input_width = layer.width;
const int input_channels = layer.channels;
const int output_height = (input_height - pool_size) / stride + 1;
const int output_width = (input_width - pool_size) / stride + 1;
POOL_CHANNELS: for(int c = 0; c < input_channels; c++) {
POOL_OUTPUT_H: for(int oh = 0; oh < output_height; oh++) {
POOL_OUTPUT_W: for(int ow = 0; ow < output_width; ow++) {
#pragma HLS PIPELINE II=1
data_t max_val = -128; // minimum representable data_t; safer than converting -INFINITY to fixed point
POOL_WINDOW_H: for(int ph = 0; ph < pool_size; ph++) {
POOL_WINDOW_W: for(int pw = 0; pw < pool_size; pw++) {
#pragma HLS UNROLL
int ih = oh * stride + ph;
int iw = ow * stride + pw;
data_t val = input[c * input_height * input_width + ih * input_width + iw];
max_val = (val > max_val) ? val : max_val;
}
}
layer.output[c * output_height * output_width + oh * output_width + ow] = max_val;
}
}
}
}
// Optimized fully connected forward pass using systolic array
void fc_forward(Layer& layer, const data_t input[MAX_LAYER_SIZE]) {
#pragma HLS INLINE off
const int output_size = layer.params.fc_params.output_size;
const int input_size = layer.height * layer.width * layer.channels;
// Use systolic array for matrix multiplication
acc_t temp_output[MAX_LAYER_SIZE];
systolic_matrix_multiply(layer.params.fc_params.weights, input, temp_output, output_size, input_size);
// Add biases and apply activation
FC_OUTPUT: for(int o = 0; o < output_size; o++) {
#pragma HLS PIPELINE II=1
acc_t sum = temp_output[o] + layer.params.fc_params.biases[o];
if(layer.params.fc_params.type == a_sigmoid) {
layer.pre_activation[o] = sum;
layer.output[o] = sigmoid(sum);
} else {
layer.output[o] = sum; // For softmax, store raw values
}
}
// Apply softmax if needed
if(layer.params.fc_params.type == a_softmax) {
acc_t max_val = layer.output[0];
acc_t sum = 0;
// Find max value for numerical stability
SOFTMAX_MAX: for(int i = 1; i < output_size; i++) {
#pragma HLS PIPELINE II=1
max_val = (layer.output[i] > max_val) ? layer.output[i] : max_val;
}
// Compute exponentials and sum
SOFTMAX_EXP: for(int i = 0; i < output_size; i++) {
#pragma HLS PIPELINE II=1
layer.output[i] = hls::exp(layer.output[i] - max_val);
sum += layer.output[i];
}
// Normalize
SOFTMAX_NORM: for(int i = 0; i < output_size; i++) {
#pragma HLS PIPELINE II=1
layer.output[i] /= sum;
}
}
}
// Top-level function for HLS synthesis
void cnn_forward(
data_t input[MAX_LAYER_SIZE],
data_t output[MAX_LAYER_SIZE],
Layer layers[],
int num_layers) {
#pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=layers offset=slave bundle=gmem2
#pragma HLS INTERFACE s_axilite port=num_layers bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
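// m_axi ports become AXI4 master interfaces for bulk data in external memory;
// s_axilite exposes the scalar arguments and the start/done handshake as control registers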
data_t layer_input[MAX_LAYER_SIZE];
data_t layer_output[MAX_LAYER_SIZE];
// Copy input to local buffer
memcpy(layer_input, input, MAX_LAYER_SIZE * sizeof(data_t));
// Process each layer
LAYER_LOOP: for(int i = 0; i < num_layers; i++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=20
Layer& current_layer = layers[i];
switch(current_layer.type) {
case conv:
conv_forward(current_layer, layer_input);
break;
case max_pool:
maxpool_forward(current_layer, layer_input);
break;
case fully_connected:
fc_forward(current_layer, layer_input);
break;
default:
break;
}
// Copy output to input buffer for next layer
memcpy(layer_input, current_layer.output, MAX_LAYER_SIZE * sizeof(data_t));
}
// Copy final output
memcpy(output, layer_input, MAX_LAYER_SIZE * sizeof(data_t));
}

cnn.c (656 changed lines)

@@ -1,85 +1,605 @@
// convolutional neural network c header library
// inspired by euske's nn1
// meant to be synthesized into RTL through Vitis HLS for an FPGA implementation
#include <stdlib.h>
#include <math.h>
#include <string.h>
typedef enum {
input,
conv,
max_pool,
fully_connected
} ltype;
typedef enum {
fc_input,
fc_hidden,
fc_output,
} fcpos;
typedef enum {
a_sigmoid,
a_softmax,
} activation;
typedef struct {
ltype type;
int height;
int width;
int channels; // in this case, "channels" are the number of filters that are coming in
union {
struct {
int num_filters;
int filter_size; // single integer b/c filter will usually be square shaped
int stride;
int zero_padding; // single integer for how many layers of zero padding
int input_height;
int input_width;
int input_channels;
float* weights;
float* biases;
} conv_params;
struct {
int pool_size; // single integer again
int stride;
int input_height;
int input_width;
} pool_params;
struct {
int output_size;
float* weights;
float* biases;
activation type;
} fc_params;
} params;
float* output;
float* delta;
float* pre_activation;
float (*activation_g)(float);
} Layer;
typedef struct {
Layer** layers;
int num_layers;
} Network;
Network* create_network(int capacity) {
Network* network = (Network*)malloc(sizeof(Network));
network->layers = (Layer**)malloc(capacity * sizeof(Layer*));
network->num_layers = capacity;
return network;
}
float he_init(int fan_in) {
float scale = sqrt(2.0f / fan_in);
float random = (float)rand() / RAND_MAX * 2 - 1;
return random * scale;
}
float glorot_init(int fan_in, int fan_out) {
float limit = sqrt(6.0f / (fan_in + fan_out));
float random = (float)rand() / RAND_MAX;
return random * 2 * limit - limit;
}
float relu(float x) {
return x > 0 ? x : 0;
}
float sigmoid(float x) {
return 1 / (1 + exp(-x));
}
float relu_g(float x) {
return x > 0 ? 1 : 0;
}
float sigmoid_g(float x) {
float sig = sigmoid(x);
return sig * (1 - sig);
}
void softmax(float* input, float* output, int size) {
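// shift by the max before exponentiating so exp() cannot overflow (numerical stability)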
float max = input[0];
for(int i = 1; i < size; i++) {
if(input[i] > max) {
max = input[i];
}
}
float sum = 0;
for(int i = 0; i < size; i++) {
output[i] = exp(input[i] - max);
sum += output[i];
}
for(int i = 0; i < size; i++) {
output[i] /= sum;
}
}
Layer* create_input(int height, int width, int channels) {
Layer* layer = (Layer*)malloc(sizeof(Layer));
layer->type = input;
layer->height = height;
layer->width = width;
layer->channels = channels;
layer->output = (float*)calloc(height * width * channels, sizeof(float));
return layer;
}
Layer* create_conv(int input_height, int input_width, int input_channels, int num_filters, int filter_size, int stride, int padding) {
Layer* layer = (Layer*)malloc(sizeof(Layer));
layer->type = conv;
layer->params.conv_params.num_filters = num_filters;
layer->params.conv_params.filter_size = filter_size;
layer->params.conv_params.stride = stride;
layer->params.conv_params.zero_padding = padding;
layer->params.conv_params.input_height = input_height;
layer->params.conv_params.input_width = input_width;
layer->params.conv_params.input_channels = input_channels;
// output dimensions
// https://cs231n.github.io/convolutional-networks/
int output_h = (input_height + 2 * padding - filter_size) / stride + 1;
int output_w = (input_width + 2 * padding - filter_size) / stride + 1;
layer->height = output_h;
layer->width = output_w;
layer->channels = num_filters;
layer->activation_g = relu_g;
// conv layer uses relu, use HE init
int weights_size = num_filters * input_channels * filter_size * filter_size;
int fan_in = input_channels * filter_size * filter_size;
layer->params.conv_params.weights = (float*)calloc(weights_size, sizeof(float));
for (int i = 0; i < weights_size; i++) {
layer->params.conv_params.weights[i] = he_init(fan_in);
}
layer->params.conv_params.biases = (float*)calloc(num_filters, sizeof(float));
layer->output = (float*) calloc(output_h * output_w * num_filters, sizeof(float));
layer->delta = (float*) calloc(output_h * output_w * num_filters, sizeof(float));
layer->pre_activation = (float*)calloc(output_h * output_w * num_filters, sizeof(float));
return layer;
}
Layer* create_maxpool(int input_height, int input_width, int input_channels, int pool_size, int stride) {
Layer* layer = (Layer*)malloc(sizeof(Layer));
layer->type = max_pool;
layer->params.pool_params.pool_size = pool_size;
layer->params.pool_params.stride = stride;
layer->params.pool_params.input_height = input_height;
layer->params.pool_params.input_width = input_width;
// output dimensions
// https://cs231n.github.io/convolutional-networks/
int output_h = (input_height - pool_size) / stride + 1;
int output_w = (input_width - pool_size) / stride + 1;
layer->height = output_h;
layer->width = output_w;
layer->channels = input_channels;
layer->output = (float*) calloc(output_h * output_w * input_channels, sizeof(float));
layer->delta = (float*) calloc(output_h * output_w * input_channels, sizeof(float));
return layer;
}
Layer* create_fc(int output_size, int input_size, activation type) {
Layer* layer = (Layer*)malloc(sizeof(Layer));
layer->type = fully_connected;
layer->params.fc_params.output_size = output_size;
layer->params.fc_params.type = type; // activation type can either be sigmoid or softmax (output layer)
layer->activation_g = (type == a_sigmoid) ? sigmoid_g : NULL; // NULL is softmax (doesn't have a gradient)
// use glorot initialization
layer->params.fc_params.weights = (float*)calloc(output_size * input_size, sizeof(float));
for (int i = 0; i < (output_size * input_size); i++) {
layer->params.fc_params.weights[i] = glorot_init(input_size, output_size);
}
layer->params.fc_params.biases = (float*)calloc(output_size, sizeof(float));
layer->height = 1;
layer->width = 1;
layer->channels = output_size;
layer->output = (float*) calloc(output_size, sizeof(float));
layer->delta = (float*) calloc(output_size, sizeof(float));
layer->pre_activation = (float*) calloc(output_size, sizeof(float));
return layer;
}
void free_layer(Layer* layer) {
switch (layer->type) {
case input:
free(layer->output);
free(layer);
break;
case conv:
free(layer->params.conv_params.weights);
free(layer->params.conv_params.biases);
free(layer->output);
free(layer->delta);
free(layer->pre_activation);
free(layer);
break;
case max_pool:
free(layer->output);
free(layer->delta);
free(layer);
break;
case fully_connected:
free(layer->params.fc_params.weights);
free(layer->params.fc_params.biases);
free(layer->output);
free(layer->delta);
free(layer->pre_activation);
free(layer);
break;
}
}
void destroy_network(Network* network) {
if (!network) return;
for (int i = 0; i < network->num_layers; i++) {
if (network->layers[i]) {
free_layer(network->layers[i]);
}
}
free(network->layers);
free(network);
}
void conv_forward(Layer* layer, float* input) {
int padding = layer->params.conv_params.zero_padding;
int stride = layer->params.conv_params.stride;
int filter_size = layer->params.conv_params.filter_size;
int num_filters = layer->params.conv_params.num_filters;
int input_height = layer->params.conv_params.input_height;
int input_width = layer->params.conv_params.input_width;
int input_channels = layer->params.conv_params.input_channels;
int padded_height = input_height + 2 * padding;
int padded_width = input_width + 2 * padding;
float* padded_input = (float*) calloc(padded_height * padded_width * input_channels, sizeof(float));
for (int c = 0; c < input_channels; c++) {
for (int h = 0; h < input_height; h++) {
for (int w = 0; w < input_width; w++) {
padded_input[c * padded_height * padded_width + (h + padding) * padded_width + (w + padding)] = input[c * input_height * input_width + h * input_width + w];
}
}
}
int output_height = (padded_height - filter_size) / stride + 1;
int output_width = (padded_width - filter_size) / stride + 1;
int output_size = output_height * output_width * num_filters;
// for every filter
for(int f = 0; f < num_filters; f++) {
for(int oh = 0; oh < output_height; oh++) {
for(int ow = 0; ow < output_width; ow++) {
float sum = 0;
for(int c = 0; c < input_channels; c++) {
for(int fh = 0; fh < filter_size; fh++) {
for(int fw = 0; fw < filter_size; fw++) {
int ih = oh * stride + fh;
int iw = ow * stride + fw;
if (ih >= 0 && ih < padded_height && iw >= 0 && iw < padded_width) {
int input_idx = c * padded_height * padded_width + ih * padded_width + iw;
int weight_idx = f * input_channels * filter_size * filter_size +
c * filter_size * filter_size +
fh * filter_size + fw;
sum += padded_input[input_idx] * layer->params.conv_params.weights[weight_idx];
}
}
}
}
sum += layer->params.conv_params.biases[f];
int output_idx = f * output_height * output_width + oh * output_width + ow;
layer->pre_activation[output_idx] = sum;
layer->output[output_idx] = relu(sum);
}
}
}
free(padded_input);
}
void maxpool_forward(Layer* layer, float* input) {
int pool_size = layer->params.pool_params.pool_size;
int stride = layer->params.pool_params.stride;
// prev layer
int input_height = layer->height;
int input_width = layer->width;
int input_channels = layer->channels;
int output_height = (input_height - pool_size) / stride + 1;
int output_width = (input_width - pool_size) / stride + 1;
int output_size = output_height * output_width * input_channels;
for(int c = 0; c < input_channels; c++) {
for(int oh = 0; oh < output_height; oh++) {
for(int ow = 0; ow < output_width; ow++) {
float max_val = -INFINITY;
for(int ph = 0; ph < pool_size; ph++) {
for(int pw = 0; pw < pool_size; pw++) {
int ih = oh * stride + ph;
int iw = ow * stride + pw;
float val = input[c * input_height * input_width + ih * input_width + iw];
if(val > max_val) {
max_val = val;
}
}
}
layer->output[c * output_height * output_width + oh * output_width + ow] = max_val;
}
}
}
}
void fc_forward(Layer* layer, float* input) {
int output_size = layer->params.fc_params.output_size;
int input_size = layer->height * layer->width * layer->channels;
// flatten
float* flattened_input = (float*) calloc(input_size, sizeof(float));
for(int i = 0; i < input_size; i++) {
flattened_input[i] = input[i];
}
// matmul (output = bias + (input * weight))
float* temp_output = (float*) calloc(output_size, sizeof(float));
for(int o = 0; o < output_size; o++) {
float sum = 0;
for(int i = 0; i < input_size; i++) {
sum += flattened_input[i] * layer->params.fc_params.weights[o * input_size + i];
}
sum += layer->params.fc_params.biases[o];
temp_output[o] = sum;
}
// apply the correct activation (sigmoid for non output layers, softmax for output)
if(layer->params.fc_params.type == a_sigmoid) {
for(int o = 0; o < output_size; o++) {
layer->pre_activation[o] = temp_output[o];
layer->output[o] = sigmoid(temp_output[o]);
}
} else if(layer->params.fc_params.type == a_softmax) {
softmax(temp_output, layer->output, output_size);
}
free(temp_output);
free(flattened_input);
}
void forward_propagation(Layer* layer, float* input_fc) {
int input_size;
switch(layer->type) {
case input:
// input to layer->output
input_size = (layer->height * layer->width * layer->channels);
for(int i = 0; i < input_size; i++) {
layer->output[i] = input_fc[i];
}
break;
case conv:
conv_forward(layer, input_fc);
break;
case max_pool:
maxpool_forward(layer, input_fc);
break;
case fully_connected:
fc_forward(layer, input_fc);
break;
}
}
void network_forward(Network* network, float* input) {
float* current_input = input;
for (int i = 0; i < network->num_layers; i++) {
forward_propagation(network->layers[i], current_input);
current_input = network->layers[i]->output;
}
}
void fc_backward(Layer* layer, float* prev_delta, float* input, float learning_rate) {
int output_size = layer->params.fc_params.output_size;
int input_size = layer->height * layer->width * layer->channels;
float* gradient;
if(layer->params.fc_params.type == a_softmax) {
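// softmax + cross-entropy: the combined gradient simplifies to (predicted probability - one-hot target),
// so prev_delta is interpreted here as the one-hot label vector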
gradient = (float*)malloc(output_size * sizeof(float));
for(int i = 0; i < output_size; i++) {
gradient[i] = layer->output[i];
if(prev_delta[i] > 0.5) { // one hot encoded
gradient[i] -= 1.0;
}
}
} else {
gradient = prev_delta;
}
// update weights and biases
for(int o = 0; o < output_size; o++) {
for(int i = 0; i < input_size; i++) {
layer->params.fc_params.weights[o * input_size + i] -=
learning_rate * gradient[o] * input[i];
}
layer->params.fc_params.biases[o] -= learning_rate * gradient[o];
}
// gradient
if(layer->activation_g) {
for(int i = 0; i < input_size; i++) {
float sum = 0;
for(int o = 0; o < output_size; o++) {
sum += layer->params.fc_params.weights[o * input_size + i] * gradient[o];
}
layer->delta[i] = sum * layer->activation_g(layer->pre_activation[i]);
}
}
if(layer->params.fc_params.type == a_softmax) {
free(gradient);
}
}
void conv_backward(Layer* layer, float* prev_delta, float* input, float learning_rate) {
int num_filters = layer->params.conv_params.num_filters;
int channels = layer->channels;
int filter_size = layer->params.conv_params.filter_size;
int input_height = layer->height;
int input_width = layer->width;
int padding = layer->params.conv_params.zero_padding;
int stride = layer->params.conv_params.stride;
int output_height = (input_height + 2 * padding - filter_size) / stride + 1;
int output_width = (input_width + 2 * padding - filter_size) / stride + 1;
// gradient w/respect to filters
for(int f = 0; f < num_filters; f++) {
for(int c = 0; c < channels; c++) {
for(int fh = 0; fh < filter_size; fh++) {
for(int fw = 0; fw < filter_size; fw++) {
float grad = 0;
for(int oh = 0; oh < output_height; oh++) {
for(int ow = 0; ow < output_width; ow++) {
int ih = oh * stride + fh - padding;
int iw = ow * stride + fw - padding;
if(ih >= 0 && ih < input_height && iw >= 0 && iw < input_width) {
grad += input[c * input_height * input_width + ih * input_width + iw] * prev_delta[f * output_height * output_width + oh * output_width + ow];
}
}
}
int index = f * channels * filter_size * filter_size + c * filter_size * filter_size + fh * filter_size + fw;
layer->params.conv_params.weights[index] -= learning_rate * grad;
}
}
}
}
// gradient w/respect to biases
for(int f = 0; f < num_filters; f++) {
float grad = 0;
for(int oh = 0; oh < output_height; oh++) {
for(int ow = 0; ow < output_width; ow++) {
grad += prev_delta[f * output_height * output_width + oh * output_width + ow];
}
}
layer->params.conv_params.biases[f] -= learning_rate * grad;
}
// gradient with respect to inputs
for(int c = 0; c < channels; c++) {
for(int ih = 0; ih < input_height; ih++) {
for(int iw = 0; iw < input_width; iw++) {
float grad = 0;
for(int f = 0; f < num_filters; f++) {
for(int fh = 0; fh < filter_size; fh++) {
for(int fw = 0; fw < filter_size; fw++) {
int oh = (ih - fh + padding) / stride;
int ow = (iw - fw + padding) / stride;
if((ih - fh + padding) % stride == 0 && (iw - fw + padding) % stride == 0 && oh < output_height && ow < output_width) {
int w_index = f * channels * filter_size * filter_size + c * filter_size * filter_size + fh * filter_size + fw;
grad += layer->params.conv_params.weights[w_index] * prev_delta[f * output_height * output_width + oh * output_width + ow];
}
}
}
}
layer->delta[c * input_height * input_width + ih * input_width + iw] = grad * layer->activation_g(layer->pre_activation[c * input_height * input_width + ih * input_width + iw]);
}
}
}
}
void maxpool_backward(Layer* layer, float* prev_delta, float* input, float learning_rate) {
int pool_size = layer->params.pool_params.pool_size;
int stride = layer->params.pool_params.stride;
int input_height = layer->params.pool_params.input_height;
int input_width = layer->params.pool_params.input_width;
int channels = layer->channels;
// Zero initialize deltas
memset(layer->delta, 0, input_height * input_width * channels * sizeof(float));
int output_height = layer->height;
int output_width = layer->width;
for(int c = 0; c < channels; c++) {
for(int oh = 0; oh < output_height; oh++) {
for(int ow = 0; ow < output_width; ow++) {
// finds max value
int maxI = -1, maxJ = -1;
float maxVal = -INFINITY;
for(int ph = 0; ph < pool_size; ph++) {
for(int pw = 0; pw < pool_size; pw++) {
int ih = oh * stride + ph;
int iw = ow * stride + pw;
// checks bounds
if (ih < input_height && iw < input_width) {
float val = input[c * input_height * input_width + ih * input_width + iw];
if(val > maxVal) {
maxVal = val;
maxI = ih;
maxJ = iw;
}
}
}
}
// only propagate gradient if a valid max position is found
if(maxI != -1 && maxJ != -1) {
int delta_idx = c * output_height * output_width + oh * output_width + ow;
layer->delta[c * input_height * input_width + maxI * input_width + maxJ] =
prev_delta[delta_idx];
}
}
}
}
}
void backward_propagation(Layer* layer, float* prev_delta, float* input_fc, float learning_rate) {
switch(layer->type) {
case fully_connected:
fc_backward(layer, prev_delta, input_fc, learning_rate);
break;
case conv:
conv_backward(layer, prev_delta, input_fc, learning_rate);
break;
case max_pool:
maxpool_backward(layer, prev_delta, input_fc, learning_rate);
break;
case input:
// No backpropagation for input layer
break;
}
}
void network_backward(Network* network, float* label, float learning_rate) {
// output
Layer* output_layer = network->layers[network->num_layers - 1];
// output gradient
for(int o = 0; o < output_layer->channels; o++) {
output_layer->delta[o] = output_layer->output[o] - label[o];
}
// backprop
for(int i = network->num_layers - 2; i >= 0; i--) {
Layer* current_layer = network->layers[i];
Layer* next_layer = network->layers[i + 1];
backward_propagation(current_layer, next_layer->delta, current_layer->output, learning_rate);
}
}

mnist.c (new file, 189 lines)

@@ -0,0 +1,189 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "cnn.c"
#define IMG_HEIGHT 28
#define IMG_WIDTH 28
#define NUM_CLASSES 10
#define BATCH_SIZE 32
#define LEARNING_RATE 0.01
#define NUM_EPOCHS 10
float* read_mnist_images(const char* filename, int* num_images) {
FILE* fp = fopen(filename, "rb");
if (!fp) {
printf("Error opening file %s\n", filename);
return NULL;
}
int magic_number = 0;
fread(&magic_number, sizeof(int), 1, fp);
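// the IDX format stores integers big-endian; swap bytes for little-endian hosts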
magic_number = ((magic_number & 0xff000000) >> 24) |
((magic_number & 0x00ff0000) >> 8) |
((magic_number & 0x0000ff00) << 8) |
((magic_number & 0x000000ff) << 24);
if (magic_number != 2051) {
printf("Invalid MNIST image file format\n");
fclose(fp);
return NULL;
}
fread(num_images, sizeof(int), 1, fp);
*num_images = ((*num_images & 0xff000000) >> 24) |
((*num_images & 0x00ff0000) >> 8) |
((*num_images & 0x0000ff00) << 8) |
((*num_images & 0x000000ff) << 24);
int rows, cols;
fread(&rows, sizeof(int), 1, fp);
fread(&cols, sizeof(int), 1, fp);
rows = ((rows & 0xff000000) >> 24) |
((rows & 0x00ff0000) >> 8) |
((rows & 0x0000ff00) << 8) |
((rows & 0x000000ff) << 24);
cols = ((cols & 0xff000000) >> 24) |
((cols & 0x00ff0000) >> 8) |
((cols & 0x0000ff00) << 8) |
((cols & 0x000000ff) << 24);
if (rows != IMG_HEIGHT || cols != IMG_WIDTH) {
printf("Invalid image dimensions\n");
fclose(fp);
return NULL;
}
float* images = (float*)malloc(*num_images * IMG_HEIGHT * IMG_WIDTH * sizeof(float));
unsigned char* temp = (unsigned char*)malloc(IMG_HEIGHT * IMG_WIDTH);
for (int i = 0; i < *num_images; i++) {
fread(temp, 1, IMG_HEIGHT * IMG_WIDTH, fp);
for (int j = 0; j < IMG_HEIGHT * IMG_WIDTH; j++) {
images[i * IMG_HEIGHT * IMG_WIDTH + j] = temp[j] / 255.0f;
}
}
free(temp);
fclose(fp);
return images;
}
float* read_mnist_labels(const char* filename, int* num_labels) {
FILE* fp = fopen(filename, "rb");
if (!fp) {
printf("Error opening file %s\n", filename);
return NULL;
}
int magic_number = 0;
fread(&magic_number, sizeof(int), 1, fp);
magic_number = ((magic_number & 0xff000000) >> 24) |
((magic_number & 0x00ff0000) >> 8) |
((magic_number & 0x0000ff00) << 8) |
((magic_number & 0x000000ff) << 24);
if (magic_number != 2049) {
printf("Invalid MNIST label file format\n");
fclose(fp);
return NULL;
}
fread(num_labels, sizeof(int), 1, fp);
*num_labels = ((*num_labels & 0xff000000) >> 24) |
((*num_labels & 0x00ff0000) >> 8) |
((*num_labels & 0x0000ff00) << 8) |
((*num_labels & 0x000000ff) << 24);
float* labels = (float*)calloc(*num_labels * NUM_CLASSES, sizeof(float));
unsigned char* temp = (unsigned char*)malloc(*num_labels);
fread(temp, 1, *num_labels, fp);
for (int i = 0; i < *num_labels; i++) {
labels[i * NUM_CLASSES + temp[i]] = 1.0f;
}
free(temp);
fclose(fp);
return labels;
}
int main() {
// load mnist
int num_train_images, num_train_labels;
float* train_images = read_mnist_images("train-images.idx3-ubyte", &num_train_images);
float* train_labels = read_mnist_labels("train-labels.idx1-ubyte", &num_train_labels);
// creating a lenet-5 inspired network
Network* network = create_network(8);
network->layers[0] = create_input(IMG_HEIGHT, IMG_WIDTH, 1);
network->layers[1] = create_conv(IMG_HEIGHT, IMG_WIDTH, 1, 6, 5, 1, 2);
network->layers[2] = create_maxpool(network->layers[1]->height, network->layers[1]->width, network->layers[1]->channels, 2, 2);
network->layers[3] = create_conv(network->layers[2]->height, network->layers[2]->width, network->layers[2]->channels, 16, 5, 1, 0);
network->layers[4] = create_maxpool(network->layers[3]->height, network->layers[3]->width, network->layers[3]->channels, 2, 2);
network->layers[5] = create_fc(120, network->layers[4]->height * network->layers[4]->width * network->layers[4]->channels, a_sigmoid);
network->layers[6] = create_fc(84, 120, a_sigmoid);
network->layers[7] = create_fc(NUM_CLASSES, 84, a_softmax);
// training loop
for (int epoch = 0; epoch < NUM_EPOCHS; epoch++) {
float total_loss = 0.0f;
int correct = 0;
for (int i = 0; i < num_train_images; i++) {
// forward pass
network_forward(network, &train_images[i * IMG_HEIGHT * IMG_WIDTH]);
// accuracy
float* output = network->layers[network->num_layers - 1]->output;
int predicted = 0;
float max_prob = output[0];
for (int j = 1; j < NUM_CLASSES; j++) {
if (output[j] > max_prob) {
max_prob = output[j];
predicted = j;
}
}
int true_label = 0;
for (int j = 0; j < NUM_CLASSES; j++) {
if (train_labels[i * NUM_CLASSES + j] > 0.5f) {
true_label = j;
break;
}
}
if (predicted == true_label) correct++;
// backprop
network_backward(network, &train_labels[i * NUM_CLASSES], LEARNING_RATE);
// cross entropy loss
float loss = 0.0f;
for (int j = 0; j < NUM_CLASSES; j++) {
if (train_labels[i * NUM_CLASSES + j] > 0.5f) {
loss -= log(output[j] + 1e-10);
}
}
total_loss += loss;
// progress
if ((i + 1) % 100 == 0) {
printf("Epoch %d/%d, Step %d/%d, Loss: %.4f, Accuracy: %.2f%%\n",
epoch + 1, NUM_EPOCHS, i + 1, num_train_images,
total_loss / (i + 1), 100.0f * correct / (i + 1));
}
}
printf("Epoch %d/%d completed, Average Loss: %.4f, Accuracy: %.2f%%\n",
epoch + 1, NUM_EPOCHS, total_loss / num_train_images,
100.0f * correct / num_train_images);
}
// Clean up
free(train_images);
free(train_labels);
destroy_network(network);
return 0;
}

train-images.idx3-ubyte (new binary file, not shown)

train-labels.idx1-ubyte (new binary file, not shown)