The most interesting problems in machine learning aren't the models — they're the systems around them. What happens when the data is too big to fit in RAM? How do you keep a pipeline running reliably at scale? How do you make a model actually useful after training? Those are the questions I want to work on.
My background in Telecommunications gave me a mathematical and engineering foundation most data science students don't get — linear algebra, probability theory, signal processing as first principles, not just tools. I try to bring that same rigour to engineering: understand something well enough to build it from scratch, then know when not to.
URJC · Madrid
Completed two full academic years in one (2024–2025). Coursework spans ML, Distributed Systems, Deep Learning, HPC, and Statistical Inference. Running in parallel with founding and leading Datalab.
UC3M · Madrid
120+ credits in Networking, Physics, and Signal Processing. Transferred after four years to specialize in Data Science. The mathematical foundation — linear algebra, probability theory, signal processing — carries directly into modeling and engineering work.
Cambridge Assessment
Highest level of the Cambridge English suite.
URJC · Madrid
Founded the first data science student association at URJC. Defined the strategic roadmap, handled legal registration, and built the community from zero. Now organizing technical workshops, coordinating members, and running the first collaborative data projects. The goal: make it easier for students to work on real problems together.
Self-employed · Madrid
Five years tutoring high school students in mathematics and physics — calculus, linear algebra, mechanics, electromagnetism. Prepared students for university entrance exams (Selectividad). Teaching this long forces you to explain things clearly and find multiple ways into a problem, which turns out to be useful in data science too.
Engineering
ML & Statistics
High Performance Computing
Other
End-to-end data pipeline processing ~1.08M rows of European restaurant data.
Five-stage DAG orchestrated with Apache Airflow: extract, clean, EDA, preprocessing,
and streaming to Kafka. Every stage processes data in 50k-row chunks to keep
memory usage bounded. Incremental PCA and StandardScaler fitted batch by batch using
partial_fit. All parameters centralized in a single config.toml.
→ Incremental PCA + StandardScaler, acks=all Kafka producer, chunked throughout
View on GitHub ↗
Distributed real-time trending analysis using PySpark DStreams. Connects to a TCP tweet feed, filters for US-only content, and uses a 5-minute sliding window with 10-second slide intervals to rank hashtags. Uses inverse reduction functions for incremental window recalculation — only processes new arrivals and expired tweets instead of recomputing the full window each slide.
→ Sliding window with invFunc — O(slide interval), not O(window size)
View on GitHub ↗
AFINN-111 lexicon sentiment scoring on a 314MB Twitter JSON dataset using MapReduce on a fully Dockerized Hadoop cluster. Set up HDFS, YARN, and custom Docker images from scratch. Shell-scripted HDFS file upload and job submission — the kind of deployment automation that matters when working with distributed infrastructure.
→ 314MB dataset, full Hadoop cluster in Docker Compose, automated deploy script
View on GitHub ↗
Binary classification of stroke risk on a heavily imbalanced medical dataset. Made a deliberate decision to optimize for Recall over Accuracy — in clinical settings, a missed positive costs far more than a false alarm. SMOTE for class imbalance, RFECV to reduce features from 20+ down to 12. Compared Logistic Regression, MLP, LinearSVC, and RBF-SVM; correctly diagnosed RBF-SVM's failure mode (overfitting to the majority class despite high accuracy).
→ Logistic Regression: 0.88 Recall, 0.84 AUC on validation set
View on GitHub ↗
Churn prediction with XGBoost plus a research component: Naive Bayes, AdaBoost, and K-NN implemented from mathematical first principles in base R, then benchmarked against library versions. The hand-rolled Naive Bayes outperformed the library on Recall — 34.55% vs 23.24% — because understanding the math let me tune it directly. XGBoost won overall after PCA for dimensionality reduction and K-Means for customer segmentation.
→ Manual NB: 34.55% Recall vs library 23.24% — the math matters
View on GitHub ↗
Regression study predicting burned area in Montesinho Natural Park. Moved beyond OLS
to address heteroscedasticity and zero-inflation with Gamma GLMs, Inverse Gaussian
GLMs, and GAMs with smoothing splines via mgcv. The GAMs revealed
non-linear threshold effects for temperature and Drought Code that OLS and GLMs
missed entirely — a reminder that model choice is a scientific decision, not a default.
→ GAMs uncovered non-linearities invisible to OLS and GLMs
View on GitHub ↗
Autonomous obstacle-avoiding vehicle built on an STM32 Nucleo-L152RE. Two operating modes: autonomous navigation using HC-SR04 ultrasonic ranging (three distance thresholds — forward, decelerate, hard stop), and manual RC control via Bluetooth commands from an Android app. Firmware written in C using STM32 HAL; motor speed controlled via PWM, with soft braking to protect the drivetrain.
→ Interrupt-driven: TIM2 capture, TIM3 PWM, USART2 async Bluetooth
View on GitHub ↗
Implementation of Whitehead's (1983) unified theory for optimal sample sizing in clinical trials. Starting from the mathematical derivation of Fisher Information and Score Statistics for binomial distributions, the project determines the sample size needed to detect a deviation from a baseline probability of p₀ = 0.003 with 80% power. Monte Carlo simulations validate theoretical error rates against empirical ones.
→ Score Test vs Exact Binomial vs Wilcoxon — asymptotic limits at p = 0.003
View on GitHub ↗
Analysis and forecasting of four different time series using SARIMA models. Each series goes through stationarity testing (ADF, KPSS), decomposition into trend, seasonality, and residuals, ACF/PACF inspection for order identification, and model selection via AIC/BIC. The seasonal component is modeled explicitly through the SARIMA(p,d,q)(P,D,Q)s structure, with diagnostic checks on residuals to validate each fit.
→ Kaggle competition · awarded · grid search + Fourier terms + temporal windowing
View on GitHub ↗
Datalab — growing the association, planning the first workshop series and collaborative data projects with members at URJC.
High Performance Computing — OpenMP, MPI, CUDA. Time series analysis. Bayesian methods.
Chinese — working towards HSK4. LLM fine-tuning workflows with PyTorch and transformer architectures.
Internship or junior role in data engineering or ML systems. Based in Madrid, open to remote. Available from August 2026.
— Updated May 2026
Looking for internships, junior roles, or just interesting conversations about data engineering and ML systems. Based in Madrid, open to remote.
Send an email
Overview
End-to-end data pipeline processing ~1.08M rows of European TripAdvisor restaurant data. The core constraint was memory: the dataset can't fit in RAM, so every stage processes data in 50k-row chunks from first to last.
Architecture
partial_fit — online learning, no full dataset in memory
acks=all Kafka producer and sub-batch delivery for fault tolerance
config.toml — all pipeline parameters centralized in one file
Key challenge
Making scikit-learn transformers work incrementally. The standard fit/transform API requires the full dataset. Switching to partial_fit required careful state management — the fitted transformers had to persist across chunk iterations and be reused for the transform pass.
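The idea can be sketched in plain NumPy — a hypothetical ChunkedScaler (my name, not the project's code) that accumulates running sums chunk by chunk, mirroring what StandardScaler.partial_fit does, with the fitted state persisting into the chunked transform pass:

```python
import numpy as np

class ChunkedScaler:
    """Standardization fitted chunk by chunk -- the idea behind partial_fit.

    Only running sums live in memory, never the full dataset.
    """
    def __init__(self):
        self.n = 0
        self.sum_ = None
        self.sumsq_ = None

    def partial_fit(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        if self.sum_ is None:
            self.sum_ = np.zeros(chunk.shape[1])
            self.sumsq_ = np.zeros(chunk.shape[1])
        self.sum_ += chunk.sum(axis=0)
        self.sumsq_ += (chunk ** 2).sum(axis=0)
        self.n += chunk.shape[0]
        return self

    def transform(self, chunk):
        # Statistics reflect *all* chunks seen so far, which is why the
        # fitted object must persist across chunk iterations.
        mean = self.sum_ / self.n
        var = self.sumsq_ / self.n - mean ** 2
        return (np.asarray(chunk, dtype=float) - mean) / np.sqrt(var)

# Two 2-row chunks yield the same statistics as one 4-row fit
scaler = ChunkedScaler().partial_fit([[1.0], [2.0]]).partial_fit([[3.0], [4.0]])
```

scikit-learn's StandardScaler and IncrementalPCA expose exactly this partial_fit contract, which is what makes the 50k-row chunking possible end to end.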
Overview
Distributed real-time trending analysis using PySpark DStreams. Connects to a live TCP tweet feed, filters for US-only content, and maintains a 5-minute sliding window with 10-second slide intervals to rank hashtags continuously.
The interesting part
The naive approach — recomputing the full window on each slide — is O(window size). Instead, the pipeline uses inverse reduction functions (invFunc): on each slide, only new arrivals are added and expired tweets are subtracted. This makes each update O(slide interval), not O(window size): a 5-minute window with 10-second slides holds 30 batches, and each incremental update touches only the newest batch and the expired one instead of all 30.
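The add-new/subtract-expired bookkeeping can be sketched in plain Python — a hypothetical class, not the Spark API, assuming one Counter per batch:

```python
from collections import Counter, deque

class SlidingHashtagCounter:
    """Incremental sliding-window counts, the invFunc idea in miniature."""
    def __init__(self, window_batches):
        self.window_batches = window_batches
        self.window = deque()      # batches currently inside the window
        self.counts = Counter()    # running per-hashtag totals

    def slide(self, new_batch):
        self.counts.update(new_batch)           # reduce: add new arrivals
        self.window.append(Counter(new_batch))
        if len(self.window) > self.window_batches:
            expired = self.window.popleft()
            self.counts.subtract(expired)       # invFunc: subtract expired batch
            self.counts += Counter()            # drop zero-count entries
        return self.counts.most_common(3)
```

Each slide does work proportional to two batches regardless of window length, which is the whole point of supplying an inverse function to the windowed reduction.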
Stack decisions
Overview
AFINN-111 lexicon sentiment scoring on a 314MB Twitter JSON dataset using MapReduce on a fully Dockerized Hadoop cluster. The goal was to build and operate the infrastructure from scratch, not just run a managed service.
Infrastructure
MapReduce design
Mapper tokenizes tweet text and scores each token against the AFINN lexicon. Reducer aggregates per-tweet scores. The Dockerized setup means the entire cluster can be reproduced with a single docker compose up.
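The map/reduce logic itself is small; a Python sketch of the two phases, with a four-word stand-in for the AFINN-111 lexicon (illustrative scores, not the real file):

```python
# Tiny stand-in for the AFINN-111 lexicon (the real file maps thousands
# of words to integer valence scores).
AFINN = {"good": 3, "great": 3, "bad": -3, "awful": -3}

def mapper(tweet_id, text):
    """Map phase: tokenize the tweet and emit its summed lexicon score."""
    tokens = text.lower().split()
    yield tweet_id, sum(AFINN.get(tok, 0) for tok in tokens)

def reducer(tweet_id, scores):
    """Reduce phase: aggregate partial scores emitted for one tweet."""
    yield tweet_id, sum(scores)
```

In Hadoop Streaming these would read stdin and write tab-separated key/value lines; the structure is the same.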
Overview
Binary classification of stroke risk on a heavily imbalanced medical dataset (5110 patients, ~5% positive). The core decision was to optimize for Recall over Accuracy — in a clinical setting, a missed stroke costs far more than a false alarm.
Pipeline
Results
Logistic Regression won with 0.88 Recall and 0.84 AUC on the validation set. RBF-SVM achieved high accuracy but failed on Recall — it was predicting the majority class almost exclusively. Diagnosing that failure mode was the most instructive part of the project.
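The failure mode is easy to demonstrate with toy numbers (synthetic, mirroring the dataset's ~5% positive rate):

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual positives caught -- the metric that matters
    when a missed stroke costs more than a false alarm."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

# ~5% positive rate, like the stroke dataset
y_true = [1] * 5 + [0] * 95
y_majority = [0] * 100   # the RBF-SVM failure mode: always predict "no stroke"

accuracy(y_true, y_majority)   # 0.95 -- looks excellent
recall(y_true, y_majority)     # 0.0  -- misses every single stroke
```

A 95%-accurate model that never flags a positive is clinically useless, which is why Recall drove model selection here.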
Overview
Churn prediction on 10,000 bank customers with a research component: Naive Bayes, AdaBoost, and K-NN implemented from mathematical first principles in base R, then benchmarked against library implementations.
The from-scratch experiment
The hand-rolled Naive Bayes outperformed the library version on Recall — 34.55% vs 23.24%. The reason: the library applies Laplace smoothing by default, which dampens probabilities for rare classes. Implementing it from scratch meant I could tune this directly for the imbalanced dataset.
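A minimal sketch of the mechanism in Python (the project's implementation is in base R; this hypothetical class just shows how the smoothing parameter changes rare-class likelihoods):

```python
import math
from collections import defaultdict

class CategoricalNB:
    """Minimal categorical Naive Bayes; alpha=0 turns Laplace smoothing off."""
    def __init__(self, alpha=0.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.n = len(y)
        self.classes = sorted(set(y))
        self.class_n = {c: sum(1 for label in y if label == c) for c in self.classes}
        self.feature_values = [set(col) for col in zip(*X)]
        self.counts = {c: [defaultdict(int) for _ in X[0]] for c in self.classes}
        for row, c in zip(X, y):
            for j, v in enumerate(row):
                self.counts[c][j][v] += 1
        return self

    def predict(self, row):
        def log_posterior(c):
            lp = math.log(self.class_n[c] / self.n)        # log prior
            for j, v in enumerate(row):
                num = self.counts[c][j][v] + self.alpha
                den = self.class_n[c] + self.alpha * len(self.feature_values[j])
                if num == 0:
                    return float("-inf")  # unsmoothed: a zero count vetoes the class
                lp += math.log(num / den)
            return lp
        return max(self.classes, key=log_posterior)
```

With alpha = 0, an unseen feature value rules a class out entirely, which sharpens predictions for the rare class on imbalanced data (the Recall gain) at the cost of brittleness; smoothing trades that edge away for robustness.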
Full pipeline
Overview
Regression study predicting burned area in Montesinho Natural Park from meteorological and fire index covariates. The dataset has severe zero-inflation (many fires burn near-zero area) and extreme right skew — OLS is the wrong tool from the start.
Model progression
OLS — baseline, violated by heteroscedasticity and zero-inflation
Gamma and Inverse Gaussian GLMs — address the skew, but stay parametric
GAMs with smoothing splines via mgcv — final model, best fit
Key finding
GAMs revealed non-linear threshold effects for temperature and Drought Code that were invisible to OLS and GLMs — above certain thresholds, burned area increases sharply. This is a physical phenomenon the parametric models couldn't capture. Model choice here is a scientific decision, not a default.
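Why a single straight line misses a threshold effect can be illustrated with synthetic data (Python for illustration only; the project used R's mgcv, and these numbers are invented, not the Montesinho data):

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def sse(xs, ys, model):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

# Synthetic threshold effect: flat response below x = 20, steep above
xs = list(range(40))
ys = [0.5 if x < 20 else 0.5 + 2.0 * (x - 20) for x in xs]

a, b = linear_fit(xs, ys)
sse_linear = sse(xs, ys, lambda x: a + b * x)

# Segment-wise fit on each side of the threshold -- a crude stand-in
# for what a smoothing spline adapts to automatically
a1, b1 = linear_fit(xs[:20], ys[:20])
a2, b2 = linear_fit(xs[20:], ys[20:])
sse_piece = sse(xs, ys, lambda x: a1 + b1 * x if x < 20 else a2 + b2 * x)
```

The global line averages the two regimes and leaves large residuals; the segment-wise model captures the kink almost exactly, which is the kind of structure the GAM's splines recovered from the fire data.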
Overview
Dual-mode robotic vehicle built on an STM32 Nucleo-L152RE. Operates autonomously using ultrasonic obstacle detection, or manually via Bluetooth commands from an Android app. Firmware written in C using STM32 HAL.
Hardware
Firmware architecture
Fully interrupt-driven: TIM2 captures ultrasonic echo pulses, TIM3 generates PWM for motor speed, USART2 handles async Bluetooth commands, GPIO manages direction. Soft braking — progressive PWM duty cycle reduction rather than abrupt stop — protects the drivetrain from mechanical stress.
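The soft-braking ramp is just a bounded duty-cycle decay; a Python sketch of the logic (the actual firmware is C on STM32 HAL, stepping TIM3's compare register down on a timer tick):

```python
def soft_brake(duty, step=10, floor=0):
    """Yield progressively lower PWM duty-cycle values instead of
    cutting the motor dead -- easier on the drivetrain."""
    while duty > floor:
        duty = max(floor, duty - step)
        yield duty

list(soft_brake(35))   # -> [25, 15, 5, 0]
```

Each yielded value would be written to the PWM compare register on successive timer interrupts until the motor coasts to a stop.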
Overview
Implementation of Whitehead's (1983) unified theory for sample size determination in clinical trials. The goal: find the minimum sample size to detect a deviation from a baseline event probability of p₀ = 0.003 with Type I error α = 0.025 and 80% power.
Methods
Key finding
At p = 0.003, asymptotic approximations (Score Test) are computationally efficient but require sufficiently large n to be valid — small probabilities push the asymptotics hard. The Exact Binomial test provides a ground truth check, and the Wilcoxon confirms robustness when the normality assumption is questionable.
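The asymptotic sample-size calculation is compact; a sketch using the stdlib NormalDist, with α = 0.025 and 80% power from the project and a hypothetical alternative p₁ = 0.006 (the summary above doesn't state the alternative, so this value is assumed for illustration):

```python
import math
from statistics import NormalDist

def sample_size(p0, p1, alpha=0.025, power=0.80):
    """Normal-approximation n to detect p1 against baseline p0 (one-sided).

    Asymptotic, like the Score Test: at baselines as small as p0 = 0.003
    the approximation itself needs a large n to be trustworthy.
    """
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    num = z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1))
    return math.ceil((num / (p1 - p0)) ** 2)

# Even doubling the baseline event rate demands thousands of subjects
n = sample_size(0.003, 0.006)
```

The Monte Carlo step then simulates binomial trials at this n to check that the empirical Type I and Type II error rates match the nominal ones.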
Overview
Analysis and forecasting of four different time series using SARIMA models. The workflow for each series follows the Box-Jenkins methodology: identify, estimate, and diagnose — repeated until the residuals behave like white noise.
Methodology
What makes it interesting
Working with four structurally different series forces you to make different modeling decisions for each one — different differencing orders, different seasonal periods, different levels of heteroscedasticity. The project is as much about diagnosing what's wrong with a model as it is about fitting one.
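The identification step rests on differencing and ACF inspection; a minimal Python sketch on a synthetic monthly series (illustrative only, not the project's code or data):

```python
import math
import random

def difference(series, lag=1):
    """Regular (lag=1) or seasonal (lag=s) differencing: y_t - y_{t-lag}."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def acf(series, max_lag):
    """Sample autocorrelation function up to max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[i] - mean) * (series[i + k] - mean) for i in range(n - k)) / var
        for k in range(max_lag + 1)
    ]

# Synthetic monthly series: linear trend + annual seasonality + noise
random.seed(0)
y = [0.05 * t + math.sin(2 * math.pi * t / 12) + random.gauss(0, 0.1)
     for t in range(120)]

acf_raw = acf(y, 12)                       # strong spike at lag 12: seasonal
acf_deseason = acf(difference(y, 12), 12)  # spike gone after D=1, s=12
```

A persistent ACF spike at the seasonal lag motivates D = 1 with s = 12 in the SARIMA(p,d,q)(P,D,Q)s structure; the same plots on the differenced series then guide p, q, P, and Q.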