Codeaza Technologies
Engineering Internship · 8-Week Roadmap

Train the stack. Build the engine.
Ship a real product.

A gated, deep-technical path — from raw fundamentals to a data engine scraping 50+ job platforms, ending with your code live on rozgar.codeaza.org.

WEEK 1 · Foundations
WEEK 2–4 · The Data Engine
WEEK 5–8 · Ship Rozgar
PHASE 1 · FOUNDATIONS

Stack bootcamp — ship a working artifact every day

One week to become dangerous across the whole stack. No passive tutorials — each day ends with a committed, running artifact. By Friday: a Next.js frontend calling a typed FastAPI backend, persisting to Postgres, populated with data you scraped yourself.

Day-by-day
  • Mon · Python that isn't beginner Python — type hints & mypy, dataclasses vs Pydantic models, async/await and the event loop, generators, context managers, uv for envs, ruff for lint/format. Ship a typed async CLI.
  • Tue · FastAPI internals — path/query/body params, Pydantic v2 validation, dependency injection, routers, background tasks, auto OpenAPI. Ship a CRUD service with tests.
  • Wed · Data layer — relational modelling, SQLModel/SQLAlchemy, Alembic migrations, connection pooling, indexes & the query planner, N+1 traps. Wire the API to Postgres.
  • Thu · Scraping primitives — HTTP semantics, httpx sync/async, HTML parsing with selectolax/BeautifulSoup, CSS/XPath selectors, robots.txt & rate limits, Playwright for JS pages.
  • Fri · Next.js + glue — App Router, server vs client components, data fetching from FastAPI, env config, deploy. Assemble the full loop end-to-end.
Outcomes
  • Stand up a typed FastAPI service with a real DB and migrations
  • Turn messy HTML into clean, validated rows
  • Ship a Next.js frontend that consumes your own API
  • Work in git properly — feature branches, small commits, PRs, review
python 3.12 · uv · ruff · mypy · fastapi · pydantic v2 · postgres · sqlmodel · alembic · httpx · playwright · next.js
✓ Gate — a working full-stack app: Next.js → FastAPI → Postgres, seeded with self-scraped rows.
PHASE 2 · THE DATA ENGINE

Scraper framework — 10 platforms, one canonical shape

You don't write 50 throwaway scripts — you design a framework. A BaseScraper contract, a normalized Job schema every source maps into, a source registry, and a raw→parsed two-layer store. Adding platform #51 becomes a config + adapter, not a rewrite.

Architecture you build
  • BaseScraper ABC: fetch() → parse() → normalize() → validate() lifecycle
  • Canonical Job Pydantic model: title, org, location, salary_min/max, category, employment_type, source, source_url, posted_at, deadline, raw_hash
  • Per-source adapters that only implement selectors + field mapping
  • A source registry (declarative config: base URL, cadence, engine, enabled)
  • Two-layer persistence: immutable raw_pages + typed jobs, so re-parsing never re-fetches
The hard parts
  • Every site names the same field differently → a mapping layer
  • Dates in 6+ formats, salaries as free text → dedicated normalizers (dateparser, regex + heuristics)
  • Pagination, sessions, cookies, and pages that lie about their content
  • Fixture-based tests: saved HTML → asserted parse output, so a site change fails CI
ABCs · pydantic v2 · selectolax · httpx · dateparser · rapidfuzz · pytest · postgres
✓ Gate — 10 live sources feeding one normalized, validated schema.
PHASE 2 · THE DATA ENGINE

Scale to 50+ & beat the sites that fight back

Push past 50 sources, including JS-rendered boards and anti-bot walls. This is where you learn that scraping at scale is a reliability problem, not a parsing one. You also build deduplication, so the same job posted on 5 boards collapses to one clean record.

Resilience & scale
  • A managed Playwright browser pool for JS-heavy sites, with an httpx fast-path fallback
  • Rate limiting, exponential backoff + jitter (tenacity), proxy rotation, polite concurrency caps
  • Per-source health checks: expected row-count bands, schema drift detection, a “source returned 0 / structure changed” alert
  • Structured logging (structlog) with per-run, per-source correlation IDs
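The backoff-with-jitter policy can be illustrated without any dependencies. The sketch below is a hand-rolled stand-in for what tenacity provides (for instance through `wait_random_exponential` and `stop_after_attempt`); the function name, defaults, and the injectable `sleep` parameter are assumptions chosen to keep the example testable.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, sleeping a jittered, exponentially growing delay."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # The ceiling doubles each attempt and is clamped by cap.
            ceiling = min(cap, base * 2 ** attempt)
            # "Full jitter": a random delay in [0, ceiling] spreads retries out,
            # so many workers do not hit a recovering site at the same instant.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

In a real scraper you would likely narrow `except Exception` to network and HTTP errors, so that parsing bugs fail fast instead of being retried.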
Deduplication engine
  • Exact dedup via raw_hash + canonical URL
  • Near-duplicate detection: normalized title+org fuzzy match (rapidfuzz) + content shingling
  • Cross-source clustering → one canonical job with a list of source links
  • Quality bar: >95% of jobs carry title+org+URL; dedup rate reported each run
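A rough sketch of the near-duplicate check described above: a fuzzy match on the normalized title+org key, combined with word shingling over the description. To stay dependency-free, `difflib.SequenceMatcher` stands in for rapidfuzz. The function names, the 3-word shingle size, and both thresholds are assumptions, not tuned values.

```python
import re
from difflib import SequenceMatcher


def norm(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so cosmetic differences vanish."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping k-word windows over the normalized text."""
    words = norm(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap of two shingle sets: |A ∩ B| / |A ∪ B|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(
    a: dict[str, str],
    b: dict[str, str],
    key_threshold: float = 0.9,   # assumed threshold for title+org similarity
    body_threshold: float = 0.6,  # assumed threshold for description overlap
) -> bool:
    key_a = norm(f"{a['title']} {a['org']}")
    key_b = norm(f"{b['title']} {b['org']}")
    key_sim = SequenceMatcher(None, key_a, key_b).ratio()
    body_sim = jaccard(shingles(a["description"]), shingles(b["description"]))
    # Require both signals, so two different roles at the same org do not merge.
    return key_sim >= key_threshold and body_sim >= body_threshold
```

Pairs flagged this way would then feed the cross-source clustering step. Comparing every pair is quadratic, so at 50+ sources some form of blocking (for example, only comparing jobs that share an org) is probably needed first.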
playwright pool · tenacity · rapidfuzz · structlog · proxies · content-hashing
✓ Gate — 50+ sources, dedup measured, per-source health monitored.
BRIDGE · ENGINE → PRODUCT

Orchestrate it, then expose it

The pipeline can't live on a laptop. It becomes infrastructure: scheduled, parallelised, containerised, and fronted by a clean read API that Rozgar's frontend will consume. This is the handoff from data engine to product backend.

Orchestration
  • Celery + Redis: one task per source, isolated failures, per-source cadence via Celery beat
  • Concurrency & retry policy so one dead site can't stall the run
  • Idempotent upserts — re-runs update, never duplicate
  • Dockerised services, deployed to the Codeaza server via Dokploy
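The idempotent-upsert idea can be shown with an in-memory SQLite database, which accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form that Postgres does. The table layout and the choice of `(source, source_url)` as the conflict key are assumptions for this sketch; the real schema may key on something else.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (
    source     TEXT NOT NULL,
    source_url TEXT NOT NULL,
    title      TEXT NOT NULL,
    org        TEXT NOT NULL,
    PRIMARY KEY (source, source_url)
)
"""

# On a key collision, overwrite the mutable columns instead of inserting a new row.
UPSERT = """
INSERT INTO jobs (source, source_url, title, org)
VALUES (:source, :source_url, :title, :org)
ON CONFLICT (source, source_url) DO UPDATE SET
    title = excluded.title,
    org   = excluded.org
"""


def upsert_jobs(conn: sqlite3.Connection, jobs: list[dict[str, str]]) -> None:
    # "with conn" wraps the batch in a transaction: all rows land, or none do.
    with conn:
        conn.executemany(UPSERT, jobs)


def count_jobs(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
```

Running the same batch twice leaves the row count unchanged, and a re-run with an edited title updates the existing row. That property is what lets a scheduled task retry safely after a partial failure.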
Read API
  • GET /jobs with filters (category, location, gov/private, remote, deadline), cursor pagination, sort
  • Full-text + trigram search (pg_trgm, tsvector) over title/org/description
  • Response caching + ETags; typed Pydantic response models
  • By the end of week 4, a live, self-refreshing, deduplicated feed of Pakistani jobs sits behind an API.
celery · redis · docker · dokploy · pg_trgm · tsvector
✓ Gate — scheduled pipeline + read API deployed and self-refreshing on the server.
PHASE 3 · SHIP ROZGAR

The feed — search that actually works

Build the core of rozgar.codeaza.org: the job feed. Fast search, filters a real job-seeker understands, infinite scroll, and SEO-friendly detail pages — all served from your week-4 API.

Build
  • Next.js feed with server-side search & filtering, streamed rendering
  • Debounced search, URL-synced filter state (shareable/back-button-safe)
  • SEO-friendly job detail pages (metadata, structured data, OG tags)
  • Real empty / loading / error states; skeletons, not spinners
  • Mobile-first — most Pakistani users are on phones on slow networks
Product judgement
  • Which filters matter, what a job card must show, when “no results” needs a suggestion
  • Perceived performance: optimistic UI, prefetching, image discipline
  • Reviewed with the team like a real feature PR
next.js · typescript · tailwind · shadcn/ui · tanstack-query
✓ Gate — a fast, searchable, mobile-first job feed live in production.
PHASE 3 · SHIP ROZGAR

Accounts & the job-alert engine

The feature that makes Rozgar sticky. A user saves a search (“BPS-17 govt jobs in Punjab”) and gets pinged the moment a match is scraped. This closes the loop: scraper → dedup → match → alert → user opens Rozgar.

Build
  • Auth (email / OTP) + user accounts, sessions, protected routes
  • Saved searches persisted as structured filter queries + alert preferences
  • A matching engine that runs on every batch of new jobs and enqueues matches
  • Delivery via email (Resend) and WhatsApp/SMS — whichever the user picked
  • Alert de-duplication + digest batching so nobody gets spammed
Why it matters
  • Your week-3 pipeline now directly triggers a user-facing event
  • This is the whole product working end-to-end, not a demo
  • Retention hinges on this — you'll instrument open/click rates
auth.js · resend · whatsapp api · celery · webhooks · redis
✓ Gate — accounts live and real alerts firing on newly-scraped matches.
PHASE 3 · SHIP ROZGAR

Harden it — performance, quality, observability

Real products don't fall over. Make Rozgar fast, monitored, and trustworthy: caching, query tuning, error tracking, tests on the critical paths, CI, and a data-quality dashboard so pipeline health is visible at a glance.

Build & measure
  • Caching layer (Redis) + targeted DB indexes → sub-200ms search P95
  • Sentry across frontend + backend + Celery workers; release tracking
  • Tests: scraper fixtures, API contract tests, matcher unit tests
  • An admin / data-quality dashboard: sources up, jobs/day, dedup rate, alert delivery rate
  • CI (GitHub Actions) running lint + type-check + tests on every PR
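To check a target like "sub-200ms search P95", you need an agreed way to compute the percentile from recorded latencies. The sketch below uses the nearest-rank definition, which is one of several common conventions. The function names and the 200 ms default are assumptions taken from the target above.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_budget(latencies_ms: list[float], budget_ms: float = 200.0) -> bool:
    """True when the P95 latency is strictly under the budget."""
    return percentile(latencies_ms, 95) < budget_ms
```

A check like `meets_budget(...)` could run against load-test output in CI, so a slow query shows up as a failed build rather than as a user complaint.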
The Codeaza standard
  • Every scraper failure surfaces in Sentry and is fixed same-day
  • The fix commit cites the short Sentry issue ID
  • You operate the product the way the team does
sentry · redis cache · github actions · pytest · mypy · grafana-style dash
✓ Gate — production-ready: monitored, tested, fast, CI-gated.
PHASE 3 · SHIP ROZGAR

Launch, own a metric, defend your work

Final week. Ship to real users, pick one metric you own (activation, alert open-rate, jobs indexed), move it, and present the full 8 weeks to the team like an engineer defending real work.

Deliverables
  • Your features live in production, used by real Pakistani job-seekers
  • A metric you moved, with before/after numbers and the change that did it
  • Clean repo, README, and a runbook for operating the pipeline
  • A 20-minute demo + architecture walkthrough to the team
What you walk away with
  • A real production system on your CV you can explain end-to-end
  • Scraping 50+ sources, a live product, and the judgement behind every call
  • The portfolio piece most engineers never get to build
production · metrics · runbook · architecture review
✓ Gate — shipped to users, a metric moved, work defended in a team demo.
The full stack, by layer

Backend & Data

Python 3.12 · FastAPI · Pydantic v2 · SQLModel · Postgres · Alembic · Celery · Redis

Scraping

httpx · selectolax · Playwright · rapidfuzz · tenacity · dateparser · structlog

Frontend

Next.js · TypeScript · Tailwind · shadcn/ui · TanStack Query · Auth.js

Infra & Ops

Docker · Dokploy · GitHub Actions · Sentry · Cloudflare · Resend

Quality

pytest · ruff · mypy · fixtures · CI gates

Ways of working

git flow · code review · daily standup · AI-paired · runbooks
Gates, not vibes

Each phase has a hard gate. Miss it and we course-correct fast — nobody drifts for 8 weeks. A strong finish converts to a full-time offer.

  • Week 1 · Stack fluency · Full-stack app: Next.js → FastAPI → Postgres, self-scraped data
  • Weeks 2–3 · Data engine · 50+ sources → one normalized, deduplicated schema; quality bar met
  • Week 4 · Live backend · Scheduled pipeline + read API deployed and self-refreshing
  • Weeks 5–6 · Product features · Searchable feed + accounts + working alerts on rozgar.codeaza.org
  • Weeks 7–8 · Hardened & shipped · Monitored, tested, fast; a metric moved; defended in a team demo