End-to-End ETL Pipeline for Network Security using Docker & AWS

 

Introduction

In this project, I present a modular, reproducible ETL (Extract, Transform, Load) workflow for network security analytics, leveraging Docker containers, cloud platforms (AWS), and automated pipelines. The aim is seamless, secure data handling from ingestion to machine learning model deployment, addressing real-world security needs.


Objectives

Part 1: Data Ingestion

  • Gather raw data from multiple sources (CSV files, APIs, internal databases).

  • Automatically store processed data in MongoDB Atlas using a dedicated Docker container.

Part 2: Data Validation & Transformation

  • Validate data integrity, schema, and detect drift.

  • Preprocess data: clean, scale, encode, split into features and labels using Docker containers for each phase.

Part 3: Model Training, Evaluation & Deployment

  • Train ML models (e.g., a KNN classifier) on processed features, together with preprocessing components such as RobustScaler and the KNN imputer.

  • Evaluate model performance with detailed reports.

  • Deploy secure model as a containerized app on AWS using CI/CD and Docker.


Containers Involved

  • Data Ingestion Container: Automates raw data extraction and loading to MongoDB Atlas.

  • Data Validation Container: Validates and checks schema integrity.

  • Data Transformation Container: Cleans and preprocesses raw dataset.

  • Model Trainer Container: Trains the chosen ML model.

  • Model Evaluation Container: Evaluates accuracy/performance, exports metrics.

  • Model Pusher Container: Deploys model as an AWS container.


Other Software (Purpose & Usage)

  • MongoDB Atlas: Cloud-hosted database for storing network data.

  • AWS EC2/App Runner: Application hosting/deployment of trained models.

  • AWS ECR: Secure Docker image registry for deployment.

  • GitHub Actions: CI/CD automation (build, test, deploy pipelines).

  • Python stack (pandas, sklearn): Data manipulation, ML model building.


Overall Architecture (Line Diagram & Workflow)

This project is built as a multi-stage ETL pipeline, with each stage encapsulated in its own Docker container for repeatability and security.
Workflow:

Raw Data → Data Ingestion → Validation → Transformation → Model Training → Model Evaluation → Deployment (AWS)
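
Below is a minimal sketch of how this hand-off can be wired in Python: each run writes its artifacts into a timestamped directory that the next stage (container) reads from. The PipelineConfig class and directory names are illustrative placeholders, not the project's exact code.

    # Illustrative artifact hand-off between pipeline stages.
    import os
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class PipelineConfig:
        # One timestamp per pipeline run keeps every stage's output together.
        timestamp: str = field(default_factory=lambda: datetime.now().strftime("%Y%m%d_%H%M%S"))

        @property
        def artifact_dir(self) -> str:
            return os.path.join("artifacts", self.timestamp)

        def stage_dir(self, stage: str) -> str:
            path = os.path.join(self.artifact_dir, stage)
            os.makedirs(path, exist_ok=True)
            return path

    cfg = PipelineConfig()
    cfg.stage_dir("data_ingestion")        # raw.csv, train.csv, test.csv
    cfg.stage_dir("data_validation")       # report.json, drift report
    cfg.stage_dir("data_transformation")   # train.npy, test.npy, preprocessing.pkl
    cfg.stage_dir("model_trainer")         # model.pkl, metrics
    print("artifacts for this run:", cfg.artifact_dir)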

  • ETL Pipeline: Fig:1

  • Data Ingestion: Fig:2

  • Data Validation: Fig:3

  • Data Transformation: Fig:4

  • Model Trainer: Fig:5



Detailed Description of Architecture

1. Data Ingestion Architecture

The data ingestion process forms the initial phase of any machine learning workflow. It represents the procedures by which raw data is collected from a source (commonly a database like MongoDB) and systematically brought into the pipeline for further processing.

  • The process begins with the Data Ingestion Config, which specifies parameters such as file paths for feature storage, training and testing files, train/test split ratios, and collection names.

  • Using the configuration, the Initiate Data Ingestion module reads and exports data from the source database into a feature store. It encapsulates schema verification, data extraction, and initial preprocessing tasks.

  • The raw output from this stage is stored as artifacts, most notably as CSV files (e.g., raw.csv, train.csv, test.csv) in the feature store. The data is further split into training and testing sets as required for subsequent steps.

  • The architecture ensures traceability and reproducibility by generating an ingestion artifact containing metadata and timestamps for all steps as shown in fig:1.
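
As a hedged sketch of this ingestion flow (the connection URI, database and collection names, and file paths below are placeholders, not the project's actual configuration), the export from MongoDB into a feature store with a train/test split might look like this:

    # Illustrative ingestion: export a MongoDB collection to a feature-store
    # CSV and split it into train/test files.
    import os
    import pandas as pd
    from pymongo import MongoClient
    from sklearn.model_selection import train_test_split

    MONGO_URI = "mongodb+srv://<user>:<password>@cluster.example.mongodb.net"  # placeholder
    client = MongoClient(MONGO_URI)
    collection = client["network_security"]["network_data"]  # hypothetical names

    # Pull all documents, drop Mongo's internal _id, persist the raw feature store.
    df = pd.DataFrame(list(collection.find()))
    df = df.drop(columns=["_id"], errors="ignore")

    out_dir = "artifacts/data_ingestion"
    os.makedirs(out_dir, exist_ok=True)
    df.to_csv(os.path.join(out_dir, "raw.csv"), index=False)

    # The split ratio would come from the Data Ingestion Config in the real pipeline.
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
    train_df.to_csv(os.path.join(out_dir, "train.csv"), index=False)
    test_df.to_csv(os.path.join(out_dir, "test.csv"), index=False)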

2. Data Validation Architecture

Data validation is essential for ensuring the integrity and usability of ingested data. This phase incorporates schema definition, type checks, and drift analysis to enable reliable model training as shown in fig:2.

  • The main components include Data Validation Config, which manages paths for various directories and report files.

  • The workflow begins with Initiate Data Validation leveraging the configuration and schema files. It reads in the previously ingested datasets (train.csv, test.csv) and validates key properties (such as column count and existence of required numerical columns).

  • For both training and testing data, checks are performed to ensure columns are not missing and that their numeric features are correctly mapped.

  • If a validation error arises (missing columns or features), the process is halted; otherwise, the validation status and drift metrics are generated.

  • The results are captured in a report.json artifact, summarizing validation status, valid/invalid paths, and drift report locations.
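
A minimal sketch of such a column check is shown below; the expected column list stands in for the project's schema file, so treat the names as hypothetical:

    # Illustrative validation: verify expected columns exist and write report.json.
    import json
    import os
    import pandas as pd

    EXPECTED_COLUMNS = ["having_IP_Address", "URL_Length", "Result"]  # hypothetical schema

    def columns_ok(df: pd.DataFrame) -> bool:
        # True only if every schema column is present in the dataframe.
        return all(col in df.columns for col in EXPECTED_COLUMNS)

    train_df = pd.read_csv("artifacts/data_ingestion/train.csv")
    test_df = pd.read_csv("artifacts/data_ingestion/test.csv")

    report = {
        "validation_status": columns_ok(train_df) and columns_ok(test_df),
        "train_columns": len(train_df.columns),
        "test_columns": len(test_df.columns),
    }
    os.makedirs("artifacts/data_validation", exist_ok=True)
    with open("artifacts/data_validation/report.json", "w") as f:
        json.dump(report, f, indent=4)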

3. Data Transformation Architecture

Data transformation ensures the raw, validated data is converted and engineered into formats optimal for model consumption. It includes feature scaling, encoding, and handling of imbalanced datasets as shown in fig:3.

  • This stage starts from the Data Transformation Config, describing required directories and file paths for transformation.

  • The Initiate Data Transformation module loads the validated data and applies a sequence of transformations such as scaling (RobustScaler), imputation for missing values, and encoding target features.

  • Targets and features are separated and mapped appropriately for both training and testing sets.

  • Advanced processes like SMOTETomek are used for handling class imbalance in training data.

  • Numpy arrays (train.npy, test.npy) are generated, encapsulating the transformed data, and stored as transformation artifacts.

  • The transformation pipeline outputs a serialized preprocessor object (preprocessing.pkl) for reusability and consistency during inference.
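
A condensed sketch of this stage is given below, assuming a single target column named Result; the column name, SMOTETomek settings, and file paths are assumptions, not the project's exact configuration:

    # Illustrative transformation: impute + scale features, rebalance the
    # training set, and persist arrays plus the fitted preprocessor.
    import os
    import pickle
    import numpy as np
    import pandas as pd
    from imblearn.combine import SMOTETomek
    from sklearn.impute import KNNImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import RobustScaler

    TARGET = "Result"  # hypothetical target column
    train_df = pd.read_csv("artifacts/data_ingestion/train.csv")
    test_df = pd.read_csv("artifacts/data_ingestion/test.csv")

    X_train, y_train = train_df.drop(columns=[TARGET]), train_df[TARGET]
    X_test, y_test = test_df.drop(columns=[TARGET]), test_df[TARGET]

    # KNN imputation followed by robust scaling, fitted on training data only.
    preprocessor = Pipeline([("imputer", KNNImputer(n_neighbors=3)),
                             ("scaler", RobustScaler())])
    X_train_t = preprocessor.fit_transform(X_train)
    X_test_t = preprocessor.transform(X_test)

    # Rebalance only the training data.
    X_train_t, y_train = SMOTETomek(random_state=42).fit_resample(X_train_t, y_train)

    out_dir = "artifacts/data_transformation"
    os.makedirs(out_dir, exist_ok=True)
    np.save(os.path.join(out_dir, "train.npy"), np.c_[X_train_t, np.array(y_train)])
    np.save(os.path.join(out_dir, "test.npy"), np.c_[X_test_t, np.array(y_test)])
    with open(os.path.join(out_dir, "preprocessing.pkl"), "wb") as f:
        pickle.dump(preprocessor, f)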

4. Model Training Architecture

The model training section leverages all previously transformed and validated datasets and configurations to build, validate, and select the optimal predictive model as shown in fig:4.

  • Model Trainer Config acts as the entry controller, specifying directories, file paths of trained models, expected accuracy thresholds, and config files.

  • The process begins with Initiate Model Training followed by loading the transformed numpy array data generated during the transformation phase.

  • The training phase involves splitting the data into arrays: X_train, y_train, X_test, y_test.

  • Model Factory constructs a pipeline of candidate models and evaluates them based on provided configuration and expected accuracy metrics.

  • The get_best_model module selects the top-performing model according to test scores. If the model score meets or exceeds a baseline, it is accepted for deployment; otherwise, the process reverts or flags failure.

  • Performance metrics are calculated and stored as artifacts, enabling transparent assessment and reproducibility.

  • The best model and associated metadata (score, parameters) are serialized and saved as a pickled object (model.pkl) for future use, along with the metric artifact summarizing its evaluation.
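
A simplified version of this selection loop is sketched below; the candidate models, the weighted-F1 metric, and the 0.8 baseline are assumptions standing in for the Model Factory configuration:

    # Illustrative model selection: fit candidates, keep the best test score,
    # and persist it only if it clears the expected baseline.
    import json
    import os
    import pickle
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    train = np.load("artifacts/data_transformation/train.npy")
    test = np.load("artifacts/data_transformation/test.npy")
    X_train, y_train = train[:, :-1], train[:, -1]
    X_test, y_test = test[:, :-1], test[:, -1]

    candidates = {
        "knn": KNeighborsClassifier(),
        "decision_tree": DecisionTreeClassifier(),
        "random_forest": RandomForestClassifier(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }

    EXPECTED_SCORE = 0.8  # assumed baseline threshold
    best_name, best_model, best_score = None, None, -1.0
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        score = f1_score(y_test, model.predict(X_test), average="weighted")
        if score > best_score:
            best_name, best_model, best_score = name, model, score

    if best_score < EXPECTED_SCORE:
        raise RuntimeError(f"best model {best_name} is below baseline ({best_score:.3f})")

    out_dir = "artifacts/model_trainer"
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "model.pkl"), "wb") as f:
        pickle.dump(best_model, f)
    with open(os.path.join(out_dir, "metrics.json"), "w") as f:
        json.dump({"model": best_name, "f1_weighted": best_score}, f, indent=4)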

End-to-End Workflow Summary

Together, these architectures constitute a robust, scalable machine learning pipeline:

  • Data is ingested, validated, and transformed step by step with rigorous control on quality and format.

  • Each stage generates structured artifacts (csv, npy, json, pkl) for audit and reuse.

  • The pipeline supports modularity, allowing for updates and enhancements to individual modules (e.g., swapping out validation logic or transformation algorithms).

  • Critical checkpoints ensure only validated and accurately transformed data progresses to the training phase, thus increasing reliability.

  • Model selection leverages configurable factories and metric computation for objective, data-driven choice, ensuring that only the best-performing models are stored and pushed for deployment.

  • The overall workflow is suited for professional, repeatable ML engineering in research and industrial settings, with support for audits, diagnostics, and workflow automation as shown in fig:5.

Step-by-Step Procedure (with Image Insertion Points)

Part 1: Data Ingestion

Steps:

  1. Run Data Ingestion Docker container.

  2. Configure it to read CSV, API, or DB source.

  3. Output stored in MongoDB Atlas; train/test split is created as shown in fig:6.

  4. In total, 22,110 documents were inserted into the MongoDB Atlas collection (a minimal sketch of this push step follows fig:6).

  • Insert Screenshots Here:

  • Fig:6
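
As a rough sketch of steps 1–4 above, the push of raw CSV records into MongoDB Atlas could look like the following; the source file name, connection URI, database, and collection names are placeholders:

    # Illustrative push of raw CSV records into MongoDB Atlas.
    import json
    import pandas as pd
    from pymongo import MongoClient

    df = pd.read_csv("raw_network_data.csv")            # source file name assumed
    records = json.loads(df.to_json(orient="records"))  # list of dicts for insert_many

    client = MongoClient("mongodb+srv://<user>:<password>@cluster.example.mongodb.net")
    collection = client["network_security"]["network_data"]  # hypothetical names
    result = collection.insert_many(records)
    print(f"inserted {len(result.inserted_ids)} documents")  # 22,110 in this project's run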




Part 2: Data Validation & Transformation

Steps:

  1. Run Validation container.

  2. Check schema, fill missing values, generate validation reports.

  3. Execute Transformation container for scaling, encoding, and array generation.

  4. The drift report from the validation stage is shown in fig:7; a code sketch of the drift check follows the figure.

  • Fig:7
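
The kind of drift check summarized in fig:7 can be sketched with a two-sample Kolmogorov–Smirnov test per numeric column; the 0.05 threshold and file paths below are assumptions:

    # Illustrative drift check: compare train vs. test distributions per column.
    import json
    import os
    import pandas as pd
    from scipy.stats import ks_2samp

    train_df = pd.read_csv("artifacts/data_ingestion/train.csv")
    test_df = pd.read_csv("artifacts/data_ingestion/test.csv")

    THRESHOLD = 0.05  # assumed p-value cut-off
    drift_report = {}
    for column in train_df.select_dtypes(include="number").columns:
        if column not in test_df.columns:
            continue
        result = ks_2samp(train_df[column], test_df[column])
        drift_report[column] = {
            "p_value": float(result.pvalue),
            "drift_detected": bool(result.pvalue < THRESHOLD),
        }

    os.makedirs("artifacts/data_validation", exist_ok=True)
    with open("artifacts/data_validation/drift_report.json", "w") as f:
        json.dump(drift_report, f, indent=4)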



Part 3: Model Training, Evaluation & Deployment

Steps:

  1. Start Model Trainer container, train chosen ML model.

  2. Save trained model and preprocessing objects.

  3. Run Evaluation container to generate performance metrics and accuracy graphs.

  4. Deploy the trained model on AWS via the Model Pusher container using CI/CD pipelines, as shown in fig:9.

  5. MLflow was used to track experiment metrics; four models were compared, and their accuracy scores are shown in fig:8. A minimal tracking sketch follows the figures below.

    • Fig:8

  • Fig:9

  • Fig:10
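
A minimal sketch of the MLflow tracking used in step 5 is shown below; the experiment name and metric names are assumptions, and the candidates / X_test / y_test variables refer to the model-selection sketch earlier in this report:

    # Illustrative MLflow tracking of each candidate model's evaluation metrics.
    import mlflow
    from sklearn.metrics import accuracy_score, f1_score

    mlflow.set_experiment("network-security-models")  # assumed experiment name

    def log_run(name, model, X_test, y_test):
        # Log one trained candidate's scores as a separate MLflow run.
        preds = model.predict(X_test)
        with mlflow.start_run(run_name=name):
            mlflow.log_param("model", name)
            mlflow.log_metric("accuracy", accuracy_score(y_test, preds))
            mlflow.log_metric("f1_weighted", f1_score(y_test, preds, average="weighted"))

    # Example usage with the four candidates from the training sketch:
    # for name, model in candidates.items():
    #     log_run(name, model, X_test, y_test)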




Container Modifications: Step-by-Step

  1. Clone the original container source repositories from GitHub.

  2. Edit configuration files for your specific dataset/database/cloud credentials.

  3. Edit Python scripts to add/modify preprocessing/modeling logic.

  4. Update Dockerfile for new dependencies/exposed ports.

  5. Build modified container:
    docker build -t yourname/containername:tag .

  6. Push the updated code to your GitHub repository so the CI/CD workflow can rebuild the image and publish it to AWS ECR.

  7. Test locally and validate cloud deployment.


Outcomes

  • Successfully ingested, cleaned, validated, and transformed network security datasets.

  • ML model trained and deployed in production AWS environment.

  • Modular pipeline ensures repeatable, secure workflow for future expansion.

  • Generated artifacts, validation reports, and deployment logs at every stage.


Conclusion

This project demonstrates a modular, cloud-native solution for network security, delivering repeatable, secure, and scalable workflows using Docker containers and AWS. Automation and validation at each stage ensure high data integrity and easier redeployment for new datasets.


REFERENCES

The work presented in this project is informed by several credible academic and technical resources. The foundational understanding of containerization concepts and practical implementation methodologies was supported by the official Docker documentation and the tutorial material provided by IIT Bombay (DockerTutorial). Additionally, the configuration patterns and deployment workflows are available in my GitHub repositories (GitHub Link), which document the project's implementation. Course materials, lecture notes, and recommended reading texts from the Cloud Computing curriculum, 7th Semester, School of Computer Science and Engineering (SCOPE), VIT Chennai, were also referenced to ensure that theoretical concepts aligned with academic standards and industry practices.

ACKNOWLEDGEMENT

I would like to express my sincere gratitude to VIT Chennai and the School of Computer Science and Engineering (SCOPE) for providing an environment that encourages research, innovation, and practical learning. I extend my appreciation to my faculty members, whose guidance and feedback were instrumental in the successful completion of this work. I also wish to acknowledge my friends and peers, whose collaboration and continuous support greatly contributed to problem-solving throughout the project. My heartfelt thanks go to my parents and family members for their constant motivation, encouragement, and understanding during the course of this academic endeavor. Lastly, I acknowledge the contribution of various educational resources, reference materials, and open-source communities whose work facilitated the development and execution of this project.

Sreekar Kannepalli

