Codeaza Engineering Internship

PHASE 1 · FOUNDATIONS

Stack bootcamp — ship a working artifact every day

One week to become dangerous across the whole stack. No passive tutorials — each day ends with a committed, running artifact. By Friday: a Next.js frontend calling a typed FastAPI backend, persisting to Postgres, populated with data you scraped yourself.

Day-by-day

Mon · Python that isn't beginner Python — type hints & mypy, dataclasses vs Pydantic models, async/await and the event loop, generators, context managers, uv for envs, ruff for lint/format. Ship a typed async CLI.
Tue · FastAPI internals — path/query/body params, Pydantic v2 validation, dependency injection, routers, background tasks, auto OpenAPI. Ship a CRUD service with tests.
Wed · Data layer — relational modelling, SQLModel/SQLAlchemy, Alembic migrations, connection pooling, indexes & the query planner, N+1 traps. Wire the API to Postgres.
Thu · Scraping primitives — HTTP semantics, httpx sync/async, HTML parsing with selectolax/BeautifulSoup, CSS/XPath selectors, robots.txt & rate limits, Playwright for JS pages.
Fri · Next.js + glue — App Router, server vs client components, data fetching from FastAPI, env config, deploy. Assemble the full loop end-to-end.
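
As a rough illustration of Monday's deliverable, a typed async CLI might look like the sketch below. It uses only the standard library; the `check` coroutine, the `CheckResult` dataclass, and the `--delay` flag are hypothetical placeholders, not part of the bootcamp brief.

```python
import argparse
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckResult:
    # Hypothetical result type; a real CLI would carry whatever its task produces.
    name: str
    ok: bool


async def check(name: str, delay_s: float) -> CheckResult:
    # Stand-in for real I/O (HTTP call, DB query); sleeping yields to the event loop.
    await asyncio.sleep(delay_s)
    return CheckResult(name=name, ok=True)


async def run_checks(names: list[str], delay_s: float) -> list[CheckResult]:
    # gather() runs the coroutines concurrently and keeps results in input order.
    return list(await asyncio.gather(*(check(n, delay_s) for n in names)))


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run hypothetical async checks")
    parser.add_argument("names", nargs="+", help="names of the checks to run")
    parser.add_argument("--delay", type=float, default=0.01, help="simulated I/O delay")
    return parser


def main(argv: Optional[list[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    results = asyncio.run(run_checks(args.names, args.delay))
    for result in results:
        print(f"{result.name}: {'ok' if result.ok else 'failed'}")
    # Exit code 0 only when every check passed.
    return 0 if all(r.ok for r in results) else 1
```

Calling `main(["db", "cache"])` runs both checks concurrently; wiring it to a console entry point is left out here.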

Outcomes

Stand up a typed FastAPI service with a real DB and migrations
Turn messy HTML into clean, validated rows
Ship a Next.js frontend that consumes your own API
Work in git properly — feature branches, small commits, PRs, review

python 3.12 · uv · ruff · mypy · fastapi · pydantic v2 · postgres · sqlmodel · alembic · httpx · playwright · next.js

✓ Gate — a working full-stack app: Next.js → FastAPI → Postgres, seeded with self-scraped rows.

PHASE 2 · THE DATA ENGINE

Scraper framework — 10 platforms, one canonical shape

You don't write 50 throwaway scripts — you design a framework. A BaseScraper contract, a normalized Job schema every source maps into, a source registry, and a raw→parsed two-layer store. Adding platform #51 becomes a config + adapter, not a rewrite.

Architecture you build

BaseScraper ABC: fetch() → parse() → normalize() → validate() lifecycle
Canonical Job Pydantic model: title, org, location, salary_min/max, category, employment_type, source, source_url, posted_at, deadline, raw_hash
Per-source adapters that only implement selectors + field mapping
A source registry (declarative config: base URL, cadence, engine, enabled)
Two-layer persistence: immutable raw_pages + typed jobs, so re-parsing never re-fetches
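
The contract above could be sketched roughly as follows. This is a minimal sketch, not the framework itself: it swaps stdlib dataclasses in for the Pydantic model to stay dependency-free, keeps only a subset of the Job fields, and the `register` decorator, `run` method, and `ExampleBoard` adapter are hypothetical names.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from hashlib import sha256
from typing import Optional


@dataclass(frozen=True)
class Job:
    # Subset of the canonical fields; the remaining ones are omitted for brevity.
    title: str
    org: str
    source: str
    source_url: str
    raw_hash: str
    location: Optional[str] = None


class BaseScraper(ABC):
    source: str  # set by each adapter

    @abstractmethod
    def fetch(self) -> str:
        """Return the raw page content for this source."""

    @abstractmethod
    def parse(self, raw: str) -> list[dict[str, str]]:
        """Extract source-specific records from the raw content."""

    @abstractmethod
    def normalize(self, record: dict[str, str]) -> Job:
        """Map one source-specific record onto the canonical Job."""

    def validate(self, job: Job) -> bool:
        # Minimal rule (an assumption): title, org and URL must be present.
        return bool(job.title and job.org and job.source_url)

    def run(self) -> list[Job]:
        # fetch -> parse -> normalize -> validate, in that order.
        jobs = [self.normalize(r) for r in self.parse(self.fetch())]
        return [j for j in jobs if self.validate(j)]


REGISTRY: dict[str, type[BaseScraper]] = {}


def register(cls):
    # Hypothetical in-code registry; the document describes a declarative config.
    REGISTRY[cls.source] = cls
    return cls


@register
class ExampleBoard(BaseScraper):
    source = "example-board"

    def fetch(self) -> str:
        # Canned text instead of a network call, so the sketch runs offline.
        return ("Data Engineer|Acme|https://example.com/jobs/1\n"
                "|NoTitle Ltd|https://example.com/jobs/2")

    def parse(self, raw: str) -> list[dict[str, str]]:
        rows = []
        for line in raw.splitlines():
            title, org, url = line.split("|")
            rows.append({"heading": title, "company": org, "link": url})
        return rows

    def normalize(self, record: dict[str, str]) -> Job:
        digest = sha256(repr(sorted(record.items())).encode("utf-8")).hexdigest()
        return Job(title=record["heading"].strip(), org=record["company"].strip(),
                   source=self.source, source_url=record["link"], raw_hash=digest)
```

Here the adapter only supplies the field mapping; the lifecycle lives in the base class, and the second canned row is dropped by `validate()` because it has no title.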

The hard parts

Every site names the same field differently → a mapping layer
Dates in 6+ formats, salaries as free text → dedicated normalizers (dateparser, regex + heuristics)
Pagination, sessions, cookies, and pages that lie about their content
Fixture-based tests: saved HTML → asserted parse output, so a site change fails CI
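
To make the normalizer idea concrete, here is a small sketch using only the standard library. It swaps `datetime.strptime` in for dateparser, and the format list, the `parse_date` and `parse_salary` names, and the "k" suffix heuristic are all assumptions for illustration.

```python
import re
from datetime import date, datetime
from typing import Optional

# Formats tried in order; a hypothetical subset, real sources will need more.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%d %B %Y", "%d %b %Y", "%B %d, %Y")


def parse_date(text: str) -> Optional[date]:
    cleaned = " ".join(text.split())  # collapse stray whitespace
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None  # unknown format: leave the field empty rather than guess


# A number with optional thousands separators, optionally followed by "k".
_AMOUNT = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(k\b)?", re.IGNORECASE)


def parse_salary(text: str) -> tuple[Optional[int], Optional[int]]:
    amounts = []
    for digits, k_suffix in _AMOUNT.findall(text):
        value = float(digits.replace(",", ""))
        amounts.append(int(value * 1000) if k_suffix else int(value))
    if not amounts:
        return (None, None)
    # Heuristic: smallest number is the minimum, largest is the maximum.
    return (min(amounts), max(amounts))
```

A fixture-style test would feed saved strings through these functions and assert the output, for example that `parse_salary("Rs. 50,000 - 80,000 per month")` yields `(50000, 80000)`. The heuristic treats every number in the text as a salary, so inputs such as pay-scale grades need extra handling.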

ABCs · pydantic v2 · selectolax · httpx · dateparser · rapidfuzz · pytest · postgres

✓ Gate — 10 live sources feeding one normalized, validated schema.

PHASE 2 · THE DATA ENGINE

Scale to 50+ & beat the sites that fight back

Push past 50 sources, including JS-rendered boards and anti-bot walls. This is where you learn that scraping at scale is a reliability problem, not a parsing one. You also build deduplication, so the same job posted on 5 boards collapses to one clean record.

Resilience & scale

A managed Playwright browser pool for JS-heavy sites, with an httpx fast-path fallback
Rate limiting, exponential backoff + jitter (tenacity), proxy rotation, polite concurrency caps
Per-source health checks: expected row-count bands, schema drift detection, a “source returned 0 / structure changed” alert
Structured logging (structlog) with per-run, per-source correlation IDs
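
The backoff-plus-jitter policy can be sketched by hand as below. The document names tenacity for this; the sketch deliberately uses only the standard library to show the mechanics, and `backoff_delays`, `call_with_retry`, and their default values are hypothetical.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    rng = rng or random.Random()
    # "Full jitter": each delay is random between 0 and a capped exponential ceiling,
    # so many workers retrying at once do not hit the site in lockstep.
    return [rng.uniform(0.0, min(cap, base * 2 ** attempt)) for attempt in range(retries)]


def call_with_retry(fn: Callable[[], T], retries: int = 4, base: float = 0.5,
                    cap: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep,
                    retry_on: tuple = (ConnectionError, TimeoutError)) -> T:
    for delay in backoff_delays(retries, base, cap):
        try:
            return fn()
        except retry_on:
            sleep(delay)  # wait, then try again
    # Final attempt: any exception now propagates to the caller.
    return fn()
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in production the default `time.sleep` (or an async equivalent) would apply.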

Deduplication engine

Exact dedup via raw_hash + canonical URL
Near-duplicate detection: normalized title+org fuzzy match (rapidfuzz) + content shingling
Cross-source clustering → one canonical job with a list of source links
Quality bar: >95% of jobs carry title+org+URL; dedup rate reported each run
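
A rough sketch of the two dedup tiers follows. It swaps the standard library's `difflib.SequenceMatcher` in for rapidfuzz so it runs without extra dependencies; the `exact_key` helper, the thresholds, and the dict field names are assumptions, and the document's own `raw_hash` may be computed differently.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize_text(text: str) -> str:
    # Lowercase, drop punctuation, collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def exact_key(title: str, org: str, url: str) -> str:
    # Exact-duplicate key: normalized title + org + URL without a trailing slash.
    parts = [normalize_text(title), normalize_text(org), url.strip().rstrip("/")]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    # Overlapping k-word windows over the normalized text.
    words = normalize_text(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0  # no evidence of overlap
    return len(a & b) / len(a | b)


def is_near_duplicate(a: dict, b: dict, title_threshold: float = 0.85,
                      body_threshold: float = 0.5) -> bool:
    key_a = normalize_text(f"{a['title']} {a['org']}")
    key_b = normalize_text(f"{b['title']} {b['org']}")
    title_sim = SequenceMatcher(None, key_a, key_b).ratio()
    body_sim = jaccard(shingles(a["description"]), shingles(b["description"]))
    # Both signals must agree before two postings are clustered together.
    return title_sim >= title_threshold and body_sim >= body_threshold
```

Requiring both the title+org match and the shingle overlap is one possible design choice; it trades some recall for fewer false merges, and the thresholds would need tuning against real data.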

playwright pool · tenacity · rapidfuzz · structlog · proxies · content-hashing

✓ Gate — 50+ sources, dedup measured, per-source health monitored.

BRIDGE · ENGINE → PRODUCT

Orchestrate it, then expose it

The pipeline can't live on a laptop. It becomes infrastructure: scheduled, parallelised, containerised, and fronted by a clean read API that Rozgar's frontend will consume. This is the handoff from data engine to product backend.

Orchestration

Celery + Redis: one task per source, isolated failures, per-source cadence via Celery beat
Concurrency & retry policy so one dead site can't stall the run
Idempotent upserts — re-runs update, never duplicate
Dockerised services, deployed to the Codeaza server via Dokploy
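
The idempotent-upsert idea can be illustrated with an in-memory SQLite database, used here only so the sketch runs anywhere; Postgres accepts a very similar `INSERT ... ON CONFLICT ... DO UPDATE` statement. The table layout, the `seen_count` column, and the `upsert_jobs` helper are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (
    source_url TEXT PRIMARY KEY,
    title      TEXT NOT NULL,
    org        TEXT NOT NULL,
    seen_count INTEGER NOT NULL DEFAULT 1
)
"""

# On a key collision the existing row is updated instead of a new one being inserted.
UPSERT = """
INSERT INTO jobs (source_url, title, org)
VALUES (:source_url, :title, :org)
ON CONFLICT(source_url) DO UPDATE SET
    title = excluded.title,
    org = excluded.org,
    seen_count = jobs.seen_count + 1
"""


def upsert_jobs(conn: sqlite3.Connection, rows: list[dict]) -> None:
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(UPSERT, rows)
```

Running the same batch twice leaves one row per `source_url`, which is the property that makes re-runs of a scheduled task safe.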

Read API

GET /jobs with filters (category, location, gov/private, remote, deadline), cursor pagination, sort
Full-text + trigram search (pg_trgm, tsvector) over title/org/description
Response caching + ETags; typed Pydantic response models
By end of week 4 there is a live, self-refreshing, deduplicated feed of Pakistani jobs behind an API.
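
Cursor pagination and ETags can be sketched without a web framework, as below. The sort key (`posted_at`, then `id`), the opaque base64 cursor format, and the function names are assumptions; in the real API the filter would be a SQL `WHERE` clause rather than a list comprehension.

```python
import base64
import hashlib
import json
from typing import Optional


def encode_cursor(posted_at: str, job_id: int) -> str:
    # Opaque cursor: the sort key of the last item on the page.
    payload = json.dumps([posted_at, job_id]).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_cursor(cursor: str) -> tuple[str, int]:
    posted_at, job_id = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return (posted_at, job_id)


def page(jobs: list[dict], limit: int, cursor: Optional[str] = None) -> dict:
    # Newest first; id breaks ties so the ordering is total and stable.
    ordered = sorted(jobs, key=lambda j: (j["posted_at"], j["id"]), reverse=True)
    if cursor:
        after = decode_cursor(cursor)
        ordered = [j for j in ordered if (j["posted_at"], j["id"]) < after]
    items = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = encode_cursor(items[-1]["posted_at"], items[-1]["id"]) if has_more else None
    return {"items": items, "next_cursor": next_cursor}


def etag_for(body: dict) -> str:
    # Hash of a canonical JSON encoding; identical bodies give identical tags.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return '"' + hashlib.sha256(canonical).hexdigest()[:16] + '"'
```

Unlike offset pagination, a cursor keyed on the sort columns stays correct when new jobs are inserted between requests. A client that sends the tag back in `If-None-Match` can be answered with 304 when the tag still matches.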

celery · redis · docker · dokploy · pg_trgm · tsvector

✓ Gate — scheduled pipeline + read API deployed and self-refreshing on the server.

PHASE 3 · SHIP ROZGAR

The feed — search that actually works

Build the core of rozgar.codeaza.org: the job feed. Fast search, filters a real job-seeker understands, infinite scroll, and SEO-friendly detail pages — all served from your week-4 API.

Build

Next.js feed with server-side search & filtering, streamed rendering
Debounced search, URL-synced filter state (shareable/back-button-safe)
SEO-friendly job detail pages (metadata, structured data, OG tags)
Real empty / loading / error states; skeletons, not spinners
Mobile-first — most Pakistani users are on phones on slow networks

Product judgement

Which filters matter, what a job card must show, when “no results” needs a suggestion
Perceived performance: optimistic UI, prefetching, image discipline
Reviewed with the team like a real feature PR

next.js · typescript · tailwind · shadcn/ui · tanstack-query

✓ Gate — a fast, searchable, mobile-first job feed live in production.

PHASE 3 · SHIP ROZGAR

Accounts & the job-alert engine

The feature that makes Rozgar sticky. A user saves a search (“BPS-17 govt jobs in Punjab”) and gets pinged the moment a match is scraped. This closes the loop: scraper → dedup → match → alert → user opens Rozgar.

Build

Auth (email / OTP) + user accounts, sessions, protected routes
Saved searches persisted as structured filter queries + alert preferences
A matching engine that runs on every batch of new jobs and enqueues matches
Delivery via email (Resend) and WhatsApp/SMS — whichever the user picked
Alert de-duplication + digest batching so nobody gets spammed
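
One way the matching step and alert de-duplication might fit together is sketched below. The `SavedSearch` fields, the title-keyword rule, and the in-memory `already_sent` set are assumptions for illustration; a real system would persist sent alerts and enqueue delivery tasks instead of returning a dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SavedSearch:
    # A saved search stored as structured filters rather than a free-text query.
    id: int
    user_id: int
    keywords: tuple[str, ...] = ()
    category: Optional[str] = None
    location: Optional[str] = None


def matches(search: SavedSearch, job: dict) -> bool:
    if search.category and job.get("category") != search.category:
        return False
    if search.location and job.get("location") != search.location:
        return False
    title = job.get("title", "").lower()
    # Every keyword must appear in the title (a deliberately simple rule).
    return all(keyword.lower() in title for keyword in search.keywords)


def build_digests(searches: list[SavedSearch], new_jobs: list[dict],
                  already_sent: set[tuple[int, int]]) -> dict[int, list[int]]:
    # Returns {user_id: [job_id, ...]}: one digest per user for this batch.
    digests: dict[int, list[int]] = {}
    for search in searches:
        for job in new_jobs:
            key = (search.user_id, job["id"])
            if key in already_sent or not matches(search, job):
                continue
            already_sent.add(key)  # a user is alerted about a given job only once
            digests.setdefault(search.user_id, []).append(job["id"])
    return digests
```

Keying the sent-set on `(user_id, job_id)` rather than on the search means a job that matches two of a user's saved searches still produces a single alert.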

Why it matters

Your week-3 pipeline now directly triggers a user-facing event
This is the whole product working end-to-end, not a demo
Retention hinges on this — you'll instrument open/click rates

auth.js · resend · whatsapp api · celery · webhooks · redis

✓ Gate — accounts live and real alerts firing on newly-scraped matches.

PHASE 3 · SHIP ROZGAR

Harden it — performance, quality, observability

Real products don't fall over. Make Rozgar fast, monitored, and trustworthy: caching, query tuning, error tracking, tests on the critical paths, CI, and a data-quality dashboard so pipeline health is visible at a glance.

Build & measure

Caching layer (Redis) + targeted DB indexes → sub-200ms search P95
Sentry across frontend + backend + Celery workers; release tracking
Tests: scraper fixtures, API contract tests, matcher unit tests
An admin / data-quality dashboard: sources up, jobs/day, dedup rate, alert delivery rate
CI (GitHub Actions) running lint + type-check + tests on every PR
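
To check a target like the search P95 above, latency samples need to be reduced to a percentile. The sketch below uses the nearest-rank method, which is one of several common definitions; the `percentile` helper and the sample values are hypothetical.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    # Nearest-rank percentile: the smallest sample with at least pct% of values at or below it.
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical search latencies in milliseconds, with two slow outliers.
latencies_ms = [42.0, 55.0, 61.0, 48.0, 120.0, 75.0, 66.0, 58.0, 910.0, 52.0,
                47.0, 63.0, 71.0, 59.0, 50.0, 44.0, 68.0, 57.0, 53.0, 350.0]
p95 = percentile(latencies_ms, 95)
```

With these 20 samples the P95 is the 19th smallest value, so one slow outlier already pushes it above a 200 ms budget even though the median is far lower.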

The Codeaza standard

Every scraper failure surfaces in Sentry and is fixed same-day
The fix commit cites the short Sentry issue ID
You operate the product the way the team does

sentry · redis cache · github actions · pytest · mypy · grafana-style dash

✓ Gate — production-ready: monitored, tested, fast, CI-gated.

PHASE 3 · SHIP ROZGAR

Launch, own a metric, defend your work

Final week. Ship to real users, pick one metric you own (activation, alert open-rate, jobs indexed), move it, and present the full 8 weeks to the team like an engineer defending real work.

Deliverables

Your features live in production, used by real Pakistani job-seekers
A metric you moved, with before/after numbers and the change that did it
Clean repo, README, and a runbook for operating the pipeline
A 20-minute demo + architecture walkthrough to the team

What you walk away with

A real production system on your CV you can explain end-to-end
Experience scraping 50+ sources, shipping a live product, and defending the judgement behind every call
The portfolio piece most engineers never get to build

production · metrics · runbook · architecture review

✓ Gate — shipped to users, a metric moved, work defended in a team demo.

Train the stack. Build the engine.
Ship a real product.
