atandra2000/README.md

Atandra Bharati

Deep Learning Research Engineer — building frontier AI architectures from scratch in raw PyTorch.

LLMs · Latent Diffusion · Multimodal · Video Understanding · Agentic ML

12 from-scratch projects · 78% memory optimization · 878-test agentic platform · 860M-param UNet trained from random init


🎯 Open To

Deep Learning Research Engineer · LLM Engineer · GenAI / Diffusion Engineer · Agentic ML Engineer

Remote-friendly · Available worldwide


🧭 Now

Shipping the Autonomous ML Research Engineer platform (15 phases, 23 agents) and exploring mixture-of-depths routing for sub-1B parameter LLMs.


🛠️ Stack

Languages & ML core   Python · PyTorch · CUDA

Architectures   Transformers · GQA · MLA · RoPE · SwiGLU · RMSNorm · MoE · Gated Delta Net · MTP · Diffusion UNet · VAE · GAN · CycleGAN · ST-GCN · HRNet · SigLIP

Optimization & numerics   BF16 · FP16 · FP8 · Flash Attention 2 · SDPA · torch.compile · channels_last · Gradient checkpointing · μP scaling · WSD LR · NorMuon · Chunked cross-entropy · Disk-backed token caching · Fused optimizers

Hardware validated   A100 80GB · RTX 5090 (Blackwell) · RTX 6000 Ada · RTX 3090 · P100 · 2× T4

Tooling   HuggingFace · diffusers · tiktoken · W&B · Comet · safetensors · ONNX · TensorRT · FastAPI · pydantic v2 · ChromaDB · Ollama Cloud
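
The WSD (warmup-stable-decay) learning-rate schedule listed under Optimization can be sketched in a few lines. This is a minimal illustration, not the schedule used in these repositories: the function name `wsd_lr`, the linear warmup and linear decay shapes, and every step count below are assumptions.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-stable-decay schedule (illustrative shapes, all assumed).

    Phase 1: linear warmup to peak_lr over `warmup_steps`.
    Phase 2: constant plateau at peak_lr.
    Phase 3: linear decay to min_lr over the final `decay_steps`.
    """
    if step < warmup_steps:
        # Ramp up; step 0 already gets a small non-zero rate.
        return peak_lr * (step + 1) / warmup_steps

    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr

    # Decay phase: interpolate from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 99, 500, 900, 1000):
        print(s, wsd_lr(s, total_steps=1000, peak_lr=3e-4,
                        warmup_steps=100, decay_steps=200))
```

In a PyTorch training loop a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio `wsd_lr(...) / peak_lr`, since `LambdaLR` multiplies the optimizer's base rate by the lambda's output.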


🏆 Highlights

  • 78% peak memory reduction (92 GB → 20 GB) for LLM pretraining via gradient checkpointing, chunked cross-entropy, and disk-backed token caching — enabling 2× batch-size headroom on a single A100 80GB.
  • Training loss 0.0947 at epoch 16 on Stable Diffusion 1.x (860M UNet) trained from random init across a 7-phase curriculum on 2× RTX 5090.
  • ~30 FPS inference on RTX 3090 for skeleton-based action recognition, served via ONNX + TensorRT + FastAPI.
  • 878 passing tests · 15 cooperating phases · 23 agents · 61 tools · 186 models in the Autonomous ML Research Engineer platform — full paper-to-conclusions loop with self-repair and provider-agnostic LLM routing.
  • 415.6M active / 868.6M stored params in FusionLLM — a novel hybrid of MLA + Gated Delta Net + MoE + MTP in a 24-layer decoder.
  • 643-line technical deep-dive on MLA (Multi-Head Latent Attention) covering KV-cache math, low-rank compression, the absorption-trick derivation, and decoupled RoPE mechanics.
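
The KV-cache arithmetic behind MLA, mentioned in the last bullet, can be illustrated with a short calculation. Standard multi-head attention caches a full key and value per head, GQA caches them per KV head, and MLA caches one compressed latent plus a shared decoupled-RoPE key per token. The dimensions used below (128 heads, head dim 128, latent dim 512, RoPE dim 64) are the ones published for DeepSeek-V3 and serve only as illustrative assumptions; they are not the configuration of the 422M model, and the helper names are hypothetical.

```python
def kv_elems_mha(n_heads, head_dim):
    """Cached elements per token per layer for standard MHA: K and V for every head."""
    return 2 * n_heads * head_dim


def kv_elems_gqa(n_kv_heads, head_dim):
    """Cached elements per token per layer for GQA: K and V for each KV head only."""
    return 2 * n_kv_heads * head_dim


def kv_elems_mla(latent_dim, rope_dim):
    """Cached elements per token per layer for MLA: one latent + one shared RoPE key."""
    return latent_dim + rope_dim


def cache_bytes(elems_per_token_layer, n_layers, seq_len, bytes_per_elem=2):
    """Total cache size for one sequence; 2 bytes per element assumes BF16/FP16."""
    return elems_per_token_layer * n_layers * seq_len * bytes_per_elem


if __name__ == "__main__":
    mha = kv_elems_mha(n_heads=128, head_dim=128)
    gqa = kv_elems_gqa(n_kv_heads=8, head_dim=128)
    mla = kv_elems_mla(latent_dim=512, rope_dim=64)
    print(f"MHA: {mha}  GQA: {gqa}  MLA: {mla}  MHA/MLA ratio: {mha / mla:.1f}x")
```

Under these assumed dimensions the per-token cache shrinks from 32,768 elements (MHA) to 576 (MLA), roughly a 57× reduction, which is the motivation for the low-rank compression the write-up derives.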

📂 Projects

| Domain | Project | Highlight | Hardware |
|---|---|---|---|
| LLM | DeepSeek-v3-Lite (422M) | MLA + AuxLossFreeGate MoE + MTP, end-to-end with absorption-trick inference | A100 80GB |
| LLM | LLaMA-3-Lite (515M) | GQA · RoPE θ=500K · SwiGLU · RMSNorm · FA2 · chunked CE · 78% memory cut | A100 80GB |
| LLM | FusionLLM (415.6M / 868.6M) | Novel MLA + Gated Delta Net + MoE + MTP hybrid · NorMuon + CautiousAdamW · WSD | A100 80GB |
| LLM | GPT-From-Scratch | 200-line educational GPT-2 with fused QKV; HF weight loading | MPS / CUDA |
| LLM | TranslationLM (EN→IT) | Encoder–decoder Transformer · loss 6.17 → 2.28 · BLEU/CER/WER | P100 |
| Vision | Stable Diffusion 1.x (860M UNet) | Custom UNet trained from random init · 7 phases · 1.3M+ images · best loss 0.0947 | 2× RTX 5090 |
| Vision | ActionRecognition (120 cls) | HRNet pose + Two-Stream CTR-GCN · ~30 FPS · ONNX + TensorRT | RTX 3090 |
| Vision | FaceAgingCycleGAN (256²) | Per-layer AdaIN conditioning · 3-scale PatchGAN · LSGAN + R1 GP | RTX 6000 Ada |
| Vision | FaceGenerationVAE (β-VAE) | 50 epochs · recon MSE 0.0152 · linear KL annealing · bilinear-upsample decoder | P100 |
| Vision | DCGAN-Face-Generation | 50 epochs · 202K CelebA · D loss → ln 2 ≈ 0.693 equilibrium | 2× T4 |
| Multimodal | VisionLangModel (PaliGemma-style) | SigLIP ViT + Gemma decoder + linear projector · zero pretrained weights | P100 |
| Agentic | Autonomous ML Research Engineer | 15-phase multi-agent platform · paper → plan → patch → train → evaluate → report | Local + Ollama Cloud |
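
The "D loss → ln 2" figure in the DCGAN row matches the theoretical equilibrium of the standard GAN objective: once the discriminator can no longer tell real from generated samples it outputs 0.5 for everything, and each binary cross-entropy term equals -ln 0.5 = ln 2. The sketch below checks that number. It assumes the reported loss is the average of the real and fake terms (summing them instead would give 2 ln 2 ≈ 1.386); that convention and the helper names are assumptions, not taken from the repository.

```python
import math


def bce(pred, target, eps=1e-12):
    """Binary cross-entropy for a single probability; pred is clamped to avoid log(0)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def discriminator_loss(d_real, d_fake):
    """Average of the real-sample and fake-sample BCE terms (assumed convention)."""
    return 0.5 * (bce(d_real, 1.0) + bce(d_fake, 0.0))


if __name__ == "__main__":
    # At equilibrium the discriminator outputs 0.5 on both real and fake inputs.
    print(discriminator_loss(0.5, 0.5), math.log(2))
```

A discriminator that still separates the two distributions (for example 0.9 on real and 0.1 on fake) scores below ln 2, so a loss hovering near 0.693 is consistent with, though not proof of, a balanced generator and discriminator.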

✍️ Writing


🔬 Engineering Themes

  • From-scratch PyTorch — no Trainer, no Lightning, no accelerate; every layer written by hand
  • Single-GPU feasibility — BF16, gradient checkpointing, FA2, channels_last, fused optimizers
  • Faithful reproductions — DeepSeek-V3, LLaMA-3, PaliGemma, DCGAN implemented to the paper
  • Novel hybrids — FusionLLM (MLA + GDN + MoE + MTP), FaceAgingCycleGAN (AdaIN-conditioned CycleGAN)
  • Production hygiene — atomic checkpoints (write to .tmp, then os.rename to .pt), full RNG-state reproducibility, W&B / Comet tracking, CI lint + tests
  • Data pipelines — resumable download → filter → tokenize → shard → streaming loader, with dedup and document packing
  • Post-training & inference — speculative decoding (MTP-as-draft), Min-SNR loss weighting, EMA, classifier-free guidance
  • Hardware breadth — MPS / CPU → Kaggle T4 / P100 → A100 80GB → 2× RTX 5090 → RTX 6000 Ada
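
The atomic-checkpoint pattern named under production hygiene can be sketched with the standard library alone: write the full state to a temporary file next to the target, flush it to disk, then rename it over the final path so a crash never leaves a half-written checkpoint. This is a minimal sketch under stated assumptions. `save_checkpoint_atomic` and `load_checkpoint` are hypothetical names, `pickle` stands in for `torch.save`, and `os.replace` is used instead of `os.rename` because it also overwrites an existing file on Windows.

```python
import os
import pickle


def save_checkpoint_atomic(state, path):
    """Write `state` to `path` so readers see either the old or the new file, never a partial one."""
    tmp_path = path + ".tmp"  # same directory, so the rename stays on one filesystem
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)   # stand-in for torch.save(state, f)
        f.flush()
        os.fsync(f.fileno())    # make sure the bytes reach disk before renaming
    os.replace(tmp_path, path)  # atomic swap; overwrites any previous checkpoint


def load_checkpoint(path):
    """Read back a checkpoint written by save_checkpoint_atomic."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For reproducible resumes, the saved `state` would typically bundle the model and optimizer state with the RNG states (for instance `random.getstate()` and `torch.get_rng_state()`), so that a restarted run continues the same random stream.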

🎓 Background

B.Tech, 2024 · Heritage Institute of Technology, Kolkata. Self-taught in deep learning through two years of from-scratch implementation. The engineering discipline built through infrastructure and constraint-driven work carries over directly to memory budgets, distributed training, and reproducible ML systems.


📫 Connect

Portfolio · LinkedIn · GitHub · W&B · Kaggle · Comet · Email


Last updated 2026-06-27 · Open to remote and on-site roles

Pinned

  1. StableDiffusion (Public)

    A Stable Diffusion 1.x-class latent diffusion model trained from scratch on 2× RTX 5090 (Blackwell) GPUs. Full UNet (~860M params), DDPM/DDIM, LAION pipeline, DDP+BF16.

    Python

  2. DeepSeek-v3-Lite (Public)

    Faithful from-scratch reimplementation of DeepSeek-V3 (MLA + MoE + MTP), scaled for Chinchilla-optimal 422M training on a single A100 80GB.

    Python