GCG_001 — Data Science & Engineering · Open to work

Gonzalo
Cruz.

Location Madrid, Spain
Seeking Internship / Junior role
Degree Data Science & Engineering
Avail. Immediate
01

About

The most interesting problems in machine learning aren't the models — they're the systems around them. What happens when the data is too big to fit in RAM? How do you keep a pipeline running reliably at scale? How do you make a model actually useful after training? Those are the questions I want to work on.

My background in Telecommunications gave me a mathematical and engineering foundation most data science students don't get — linear algebra, probability theory, signal processing from first principles, not just as tools. I try to bring that same rigour to engineering: understand something well enough to build it from scratch, then know when not to.

02

Education

2024 — Present · Current

B.S. Data Science & Engineering

URJC · Madrid

Completed two full academic years in one (2024–2025). Coursework spans ML, Distributed Systems, Deep Learning, HPC, and Statistical Inference, in parallel with founding and leading Datalab.

2019 — 2023 · Transferred

B.S. Telecommunications Engineering

UC3M · Madrid

120+ credits in Networking, Physics, and Signal Processing. Transferred after four years to specialize in Data Science. The mathematical foundation — linear algebra, probability theory, signal processing — carries directly into modeling and engineering work.

2019 · Certification

Certificate of Advanced English — C2

Cambridge Assessment

Certified at C2, the highest level of the CEFR scale.

03

Experience

2020 — Present · Teaching · 5 years

Math & Physics Tutor

Self-employed · Madrid

Five years tutoring high school students in mathematics and physics — calculus, linear algebra, mechanics, electromagnetism. Prepared students for university entrance exams (Selectividad). Teaching this long forces you to explain things clearly and find multiple ways into a problem, which turns out to be useful in data science too.

04

Stack

Engineering

  • Apache Airflow · DAG orchestration across 2 pipeline projects; TaskFlow API
  • Apache Kafka · Event streaming, acks=all producers, sub-batch delivery
  • PySpark DStreams · Real-time windowed aggregation with inverse reduction
  • Hadoop / HDFS · MapReduce batch jobs on 314MB+ datasets; YARN job submission
  • Docker / Compose · Containerized Hadoop clusters and Kafka brokers from scratch
  • pandas / NumPy · Chunked processing of 1M+ row datasets without OOM

ML & Statistics

  • scikit-learn · Classification pipelines, StandardScaler, OHE, RFECV, PCA
  • PyTorch · CNNs, LLM fine-tuning, custom training loops
  • XGBoost · Gradient boosting for churn and risk prediction
  • Time Series · Forecasting, decomposition, stationarity testing
  • imbalanced-learn · SMOTE for class imbalance on medical datasets
  • R / mgcv · GAMs, GLMs, Gamma/Inverse Gaussian regression, statistical inference
  • From scratch · Naive Bayes, AdaBoost, K-NN implemented by hand and benchmarked

High Performance Computing

  • OpenMP · Shared-memory parallelism, thread management, loop parallelization
  • MPI · Distributed-memory communication, collective operations
  • CUDA · GPU kernel programming, memory hierarchy, parallel reduction

Other

  • C · Systems programming, memory management, HPC coursework
  • Excel · Data analysis, pivot tables, advanced formulas

05

Projects

01 — Data Engineering / MLOps

TripAdvisor Restaurants Pipeline

End-to-end data pipeline processing ~1.08M rows of European restaurant data. Five-stage DAG orchestrated with Apache Airflow: extract, clean, EDA, preprocessing, and streaming to Kafka. Every stage processes data in 50k-row chunks to avoid memory overload. Incremental PCA and StandardScaler fitted batch by batch using partial_fit. All parameters centralized in a single config.toml.

→ Incremental PCA + StandardScaler, acks=all Kafka producer, chunked throughout

Airflow Kafka Docker scikit-learn Incremental PCA Python
View on GitHub ↗
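
A minimal sketch of the chunked partial_fit pattern described above — file name, column selection, topic name, and the kafka-python client are illustrative; the real pipeline reads its parameters from config.toml.

import json

import pandas as pd
from kafka import KafkaProducer
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

CHUNK_ROWS = 50_000
scaler, ipca = StandardScaler(), IncrementalPCA(n_components=10)

def numeric_chunks(path):
    # Stream the CSV in 50k-row chunks so the full dataset never sits in RAM.
    for chunk in pd.read_csv(path, chunksize=CHUNK_ROWS):
        yield chunk.select_dtypes("number").to_numpy()

# Fit the scaler first, then Incremental PCA on scaled chunks (two passes).
for X in numeric_chunks("restaurants_clean.csv"):
    scaler.partial_fit(X)
for X in numeric_chunks("restaurants_clean.csv"):
    ipca.partial_fit(scaler.transform(X))

# Stream transformed rows to Kafka; acks=all waits for every in-sync replica.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for X in numeric_chunks("restaurants_clean.csv"):
    for row in ipca.transform(scaler.transform(X)):
        producer.send("restaurants", row.tolist())
producer.flush()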

02 — Streaming / Distributed

Real-Time Hashtag Trending

Distributed real-time trending analysis using PySpark DStreams. Connects to a TCP tweet feed, filters for US-only content, and uses a 5-minute sliding window with 10-second slide intervals to rank hashtags. Uses inverse reduction functions for incremental window recalculation — only processes new arrivals and expired tweets instead of recomputing the full window each slide.

→ Sliding window with invFunc — O(slide interval), not O(window size)

PySpark DStreams Windowing Python
View on GitHub ↗
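
A minimal sketch of the inverse-reduce windowing described above, assuming a local StreamingContext, a TCP source on localhost:9999, and illustrative JSON field names.

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "HashtagTrending")
ssc = StreamingContext(sc, batchDuration=10)   # 10 s micro-batches
ssc.checkpoint("checkpoint")                   # required for windowed state

lines = ssc.socketTextStream("localhost", 9999)
tweets = lines.map(json.loads).filter(lambda t: t.get("country") == "US")
hashtags = tweets.flatMap(lambda t: t.get("hashtags", [])).map(lambda tag: (tag, 1))

# reduceByKeyAndWindow with an inverse function: each slide only adds the new
# 10 s of counts and subtracts the 10 s that just left the 5-minute window.
counts = hashtags.reduceByKeyAndWindow(
    lambda a, b: a + b,        # fold in newly arrived counts
    lambda a, b: a - b,        # remove counts that expired from the window
    windowDuration=300,        # 5-minute window
    slideDuration=10,          # recompute every 10 seconds
)

counts.transform(lambda rdd: rdd.sortBy(lambda kv: -kv[1])).pprint(10)

ssc.start()
ssc.awaitTermination()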

03 — Data Engineering

Hadoop Sentiment Analysis

AFINN-111 lexicon sentiment scoring on a 314MB Twitter JSON dataset using MapReduce on a fully Dockerized Hadoop cluster. Set up HDFS, YARN, and custom Docker images from scratch. Shell-scripted HDFS file upload and job submission — the kind of deployment automation that matters when working with distributed infrastructure.

→ 314MB dataset, full Hadoop cluster in Docker Compose, automated deploy script

Hadoop MapReduce HDFS / YARN Docker Compose Python
View on GitHub ↗
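
A minimal sketch of the mapper side under Hadoop Streaming, assuming the AFINN-111 file is shipped with the job and each stdin line is one tweet JSON; the reducer then sums scores per key. Field and key choices here are illustrative.

import json
import sys

# Load the AFINN-111 lexicon: "word<TAB>score" per line.
afinn = {}
with open("AFINN-111.txt", encoding="utf-8") as fh:
    for line in fh:
        word, score = line.rsplit("\t", 1)
        afinn[word] = int(score)

# Mapper: one JSON tweet per stdin line -> "key<TAB>sentiment" on stdout.
for line in sys.stdin:
    try:
        tweet = json.loads(line)
    except ValueError:
        continue
    text = tweet.get("text", "").lower()
    sentiment = sum(afinn.get(tok, 0) for tok in text.split())
    print(f"{tweet.get('lang', 'unknown')}\t{sentiment}")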

04 — Applied ML

Stroke Risk Prediction

Binary classification of stroke risk on a heavily imbalanced medical dataset. Made a deliberate decision to optimize for Recall over Accuracy — in clinical settings, a missed positive costs far more than a false alarm. SMOTE for class imbalance, RFECV to reduce features from 20+ down to 12. Compared Logistic Regression, MLP, LinearSVC, and RBF-SVM; correctly diagnosed RBF-SVM's failure mode (overfitting to the majority class despite high accuracy).

→ Logistic Regression: 0.88 Recall, 0.84 AUC on validation set

scikit-learn SMOTE RFECV SVM MLP Python
View on GitHub ↗
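
A minimal sketch of the recall-oriented setup, assuming numeric features after encoding and placeholder file and column names; SMOTE sits inside an imbalanced-learn Pipeline so oversampling only ever touches the training folds.

import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("stroke.csv")
X, y = df.drop(columns=["stroke"]), df["stroke"]

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),   # oversample minority class (training folds only)
    ("rfecv", RFECV(                      # recursive feature elimination, scored on recall
        estimator=LogisticRegression(max_iter=1000),
        scoring="recall",
        cv=StratifiedKFold(5),
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, scoring="recall", cv=StratifiedKFold(5))
print("Recall:", scores.mean().round(3))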

05 — ML from Scratch

Bank Customer Churn

Churn prediction with XGBoost plus a research component: Naive Bayes, AdaBoost, and K-NN implemented from mathematical first principles in base R, then benchmarked against library versions. The hand-rolled Naive Bayes outperformed the library on Recall — 34.55% vs 23.24% — because understanding the math let me tune it directly. XGBoost won overall after PCA for dimensionality reduction and K-Means for customer segmentation.

→ Manual NB: 34.55% Recall vs library 23.24% — the math matters

R XGBoost PCA K-Means AdaBoost
View on GitHub ↗
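
The from-scratch implementations live in base R; purely as a Python illustration of the same idea, a Gaussian Naive Bayes reduces to per-class log-priors plus summed per-feature log-likelihoods.

import numpy as np

class GaussianNB:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        # log P(c | x) is proportional to log P(c) + sum_j log N(x_j; mu_cj, var_cj)
        log_post = np.log(self.prior) - 0.5 * (
            np.log(2 * np.pi * self.var).sum(axis=1)
            + (((X[:, None, :] - self.mu) ** 2) / self.var).sum(axis=2)
        )
        return self.classes[np.argmax(log_post, axis=1)]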

06 — Statistical Modeling / ML

Forest Fire Prediction

Regression study predicting burned area in Montesinho Natural Park. Moved beyond OLS to address heteroscedasticity and zero-inflation with Gamma GLMs, Inverse Gaussian GLMs, and GAMs with smoothing splines via mgcv. The GAMs revealed non-linear threshold effects for temperature and Drought Code that OLS and GLMs missed entirely — a reminder that model choice is a scientific decision, not a default.

→ GAMs uncovered non-linearities invisible to OLS and GLMs

R GAM GLM mgcv tidyverse
View on GitHub ↗
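
The project itself uses R and mgcv; as a rough Python illustration of the Gamma GLM step (statsmodels here, with column names from the UCI forest fires layout used as placeholders):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

fires = pd.read_csv("forestfires.csv")

# Gamma GLM with a log link; the +1 shift keeps the response strictly positive.
gamma_glm = smf.glm(
    "I(area + 1) ~ temp + RH + wind + DC",
    data=fires,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()
print(gamma_glm.summary())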

07 — Embedded Systems / Robotics

Dual-Mode RC Robot

Autonomous obstacle-avoiding vehicle built on an STM32 Nucleo-L152RE. Two operating modes: autonomous navigation using HC-SR04 ultrasonic ranging (three distance thresholds — forward, decelerate, hard stop), and manual RC control via Bluetooth commands from an Android app. Firmware written in C using STM32 HAL; motor speed controlled via PWM, with soft braking to protect the drivetrain.

→ Interrupt-driven: TIM2 capture, TIM3 PWM, USART2 async Bluetooth

C STM32 HAL PWM Bluetooth Embedded
View on GitHub ↗

08 — Statistical Inference

Whitehead Sample Size Analysis

Implementation of Whitehead's (1983) unified theory for optimal sample sizing in clinical trials. Starting from the mathematical derivation of Fisher Information and Score Statistics for binomial distributions, the project determines the sample size needed to detect a deviation from a baseline probability of p₀ = 0.003 with 80% power. Monte Carlo simulations validate theoretical error rates against empirical ones.

→ Score Test vs Exact Binomial vs Wilcoxon — asymptotic limits at p = 0.003

R MLE Score Test Monte Carlo R Markdown
View on GitHub ↗
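
A minimal sketch of the sample-size arithmetic, assuming a one-sided score test at α = 0.05 and an illustrative alternative p₁ = 0.006; the report derives these quantities from Fisher Information, so its exact numbers may differ.

import numpy as np
from scipy.stats import norm

p0, p1 = 0.003, 0.006            # baseline and (assumed) alternative
alpha, power = 0.05, 0.80
z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)

# Per-observation Fisher information for Bernoulli(p) is 1 / (p (1 - p)).
# Normal-approximation sample size for a one-sided test on a proportion:
n = ((z_a * np.sqrt(p0 * (1 - p0)) + z_b * np.sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
n = int(np.ceil(n))
print("n ≈", n)

# Monte Carlo check: empirical power of the score test at this n.
rng = np.random.default_rng(0)
sims = rng.binomial(n, p1, size=20_000)
z = (sims / n - p0) / np.sqrt(p0 * (1 - p0) / n)   # score statistic under H0
print("empirical power ≈", (z > z_a).mean().round(3))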

09 — Time Series / Forecasting

Time Series Analysis with SARIMA

Analysis and forecasting of four different time series using SARIMA models. Each series goes through stationarity testing (ADF, KPSS), decomposition into trend, seasonality, and residuals, ACF/PACF inspection for order identification, and model selection via AIC/BIC. The seasonal component is modeled explicitly through the SARIMA(p,d,q)(P,D,Q)s structure, with diagnostic checks on residuals to validate each fit.

→ Kaggle competition · awarded · grid search + Fourier terms + temporal windowing

Python SARIMA statsmodels pandas Kaggle
View on GitHub ↗
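
A minimal sketch of the per-series workflow, assuming a placeholder CSV and seasonal period: stationarity tests first, then a small (p,d,q)(P,D,Q,s) grid scored by AIC.

import itertools
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = pd.read_csv("series.csv", index_col=0, parse_dates=True).squeeze()

print("ADF p-value:", adfuller(y)[1])             # H0: unit root (non-stationary)
print("KPSS p-value:", kpss(y, nlags="auto")[1])  # H0: stationary

best = None
s = 12                                            # seasonal period (placeholder)
for p, d, q, P, D, Q in itertools.product(range(2), range(2), range(2),
                                          range(2), range(2), range(2)):
    try:
        res = SARIMAX(y, order=(p, d, q), seasonal_order=(P, D, Q, s)).fit(disp=False)
    except Exception:
        continue
    if best is None or res.aic < best[0]:
        best = (res.aic, (p, d, q), (P, D, Q, s), res)

aic, order, seasonal, res = best
print("Best SARIMA", order, seasonal, "AIC:", round(aic, 1))
print(res.forecast(steps=12))
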
06

Now

Building

Datalab — growing the association, planning the first workshop series and collaborative data projects with members at URJC.

Studying

High Performance Computing — OpenMP, MPI, CUDA. Time series analysis. Bayesian methods.

Learning

Chinese — working towards HSK4. LLM fine-tuning workflows with PyTorch and transformer architectures.

Seeking

Internship or junior role in data engineering or ML systems. Based in Madrid, open to remote. Available from August 2026.

— Updated May 2026

Let's
talk.

Looking for internships, junior roles, or just interesting conversations about data engineering and ML systems. Based in Madrid, open to remote.

Send an email