2026-04-22 · by HireIndex Staff · ai-hiring, data, how-it-works

How We Track AI Hiring Trends at HireIndex

Update — April 26, 2026: We expanded our data pipeline significantly this week. HireIndex now pulls from nine sources across three tiers: ATS boards (Greenhouse, Lever, Ashby, SmartRecruiters), aggregator APIs (RemoteOK, The Muse, Adzuna), and broad-coverage scrapers (LinkedIn, Indeed). Details on the new sources are at the bottom of this post.

Most job boards are downstream of recruiters. Someone posts a role, it gets distributed, and by the time you see it on LinkedIn or Indeed, thousands of other people have already seen it too. The signal is noisy, the data is shallow, and half the listings are stale.

HireIndex takes a different approach. We go upstream — directly to the applicant tracking systems (ATS) that companies use to publish roles in the first place. Greenhouse, Lever, Ashby, and a handful of others host the canonical job posting for most AI-forward companies. If a company is hiring an ML engineer, the listing almost certainly lives on one of those platforms before it gets syndicated anywhere else.

This post is a walkthrough of how we turn that raw data into the clean, city-by-skill index you see on the front page.

How does HireIndex’s weekly data pipeline work?

Every Monday morning, we run through the following sequence:

  1. Scrape. We pull from two types of sources simultaneously. First, we hit public ATS boards directly for every company in our curated list: Greenhouse and Lever via their open APIs, Ashby and SmartRecruiters via their public posting endpoints. This gives us structured, company-attributed data with clean fields. Second, we run keyword searches across aggregator APIs (RemoteOK, The Muse, and Adzuna, the last queried for US/GB/AU/CA) and broad scrapers (LinkedIn and Indeed, both via Apify) to catch roles at companies not yet in our curated list. The results are merged and deduplicated, with ATS data taking priority when the same role appears in multiple sources. We’re actively expanding the curated company list toward ~500 companies through Q2 2026.
  2. Classify. Job titles are surprisingly inconsistent. “Senior ML Engineer” at one company is “Staff Applied Scientist” at another is “Member of Technical Staff — AI Platform” at a third. We run each title through a keyword-matching pass that maps it to one of thirteen canonical skill categories: Machine Learning Engineer, Data Scientist, MLOps Engineer, AI Research, LLM Engineer, and so on. Roles that don’t fit any category get classified as generic “AI Software Engineer.”
  3. Normalize locations. Location strings are even messier than titles. A single role might be listed as “New York, NY; Remote (US); San Francisco, CA — hybrid.” We parse these strings against a regex-based alias map and assign each role to one or more canonical cities. Remote-only roles fall into a dedicated Remote bucket.
  4. Index by (skill, city) pair. Once every role has canonical skill and city tags, we generate an index: for every combination of skill and city that has at least one matching role, we create a landing page. This is the programmatic backbone of the site — roughly a hundred pages, each listing only the roles that actually match the criteria.
  5. Publish. The whole pipeline outputs a single JSON file that the static site generator consumes at build time. Every Monday, a new build ships with fresh data.
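
Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal illustration, not the production configuration: the keyword rules, alias patterns, and function names are all hypothetical, and only a handful of the thirteen categories are shown. It also assumes that "Remote" is used only when no physical city matches, which is one reading of the "Remote-only" rule.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
SKILL_RULES = [
    ("LLM Engineer", ("llm", "large language model")),
    ("MLOps Engineer", ("mlops", "ml platform", "ml infrastructure")),
    ("AI Research", ("research scientist", "research engineer")),
    ("Data Scientist", ("data scientist",)),
    ("Machine Learning Engineer", ("machine learning", "ml engineer", "applied scientist")),
]
FALLBACK_SKILL = "AI Software Engineer"

# Hypothetical alias map: regex pattern -> canonical city.
CITY_ALIASES = [
    (re.compile(r"\b(new york|nyc)\b", re.I), "New York"),
    (re.compile(r"\b(san francisco|sf|bay area)\b", re.I), "San Francisco"),
    (re.compile(r"\blondon\b", re.I), "London"),
]
REMOTE_PATTERN = re.compile(r"\bremote\b", re.I)


def classify_title(title: str) -> str:
    """Map a raw job title to a canonical skill category."""
    lowered = title.lower()
    for skill, keywords in SKILL_RULES:
        # Word boundaries stop "llm" from matching inside "fulfillment".
        if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in keywords):
            return skill
    return FALLBACK_SKILL


def normalize_location(raw: str) -> list[str]:
    """Return every canonical city found in a raw location string."""
    cities = [city for pattern, city in CITY_ALIASES if pattern.search(raw)]
    if cities:
        return cities
    # Assumption: only roles with no physical city go to the Remote bucket.
    return ["Remote"] if REMOTE_PATTERN.search(raw) else []


print(classify_title("Senior ML Engineer"))
print(normalize_location("New York, NY; Remote (US); San Francisco, CA"))
```

Ordered rules keep the behavior predictable when a title matches more than one category: whichever rule appears first in the list wins.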

The entire process runs unattended — scrape, classify, normalize, and ship — in roughly the time it takes to drink a cup of coffee. Zero manual intervention between weekly runs.
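
For readers curious about steps 4 and 5, here is a rough sketch of how a (skill, city) index could be built and serialized to a single JSON file. The role fields (`skill`, `cities`, `url`) and the `skill|city` key format are assumptions made for illustration; the post does not describe the actual schema.

```python
import json
from collections import defaultdict


def build_index(roles):
    """Group role URLs under every (skill, city) pair they belong to."""
    index = defaultdict(list)
    for role in roles:
        for city in role["cities"]:
            index[(role["skill"], city)].append(role["url"])
    return index


def to_json(index) -> str:
    """Serialize the index; JSON keys must be strings, so tuples become 'skill|city'."""
    payload = {f"{skill}|{city}": urls for (skill, city), urls in sorted(index.items())}
    return json.dumps(payload, indent=2)


# Hypothetical sample roles.
roles = [
    {"skill": "Data Scientist", "cities": ["Paris", "London"], "url": "https://example.com/jobs/1"},
    {"skill": "Data Scientist", "cities": ["Paris"], "url": "https://example.com/jobs/2"},
]
print(to_json(build_index(roles)))
```

Because a key only exists once a role has been appended to it, pairs with no matching roles never produce a page, which matches the "at least one matching role" rule.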

What does the current AI hiring data reveal?

Four weeks of weekly snapshots is a short timeframe — too short for sweeping trend claims. But a few patterns are already visible in the latest index (448 open roles across 89 companies as of 2026-04-20).

The “AI Engineer” title is climbing fast. Three weeks ago, generic “AI Engineer” (and close variants like “AI Software Engineer”) titles made up 6.6% of indexed roles. As of this week, they’re 10.9%. That’s a 65% relative increase in three weeks. “ML Engineer” titles are flat at around 14% of the index over the same period. If this trajectory continues, “AI Engineer” will overtake “ML Engineer” as the dominant title within a few months — even though, in practice, the two roles are often indistinguishable in their job descriptions.
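
The distinction between percentage points and relative change matters here, so the arithmetic is worth spelling out. A small sketch, using only the two shares quoted above (the helper name is ours):

```python
def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


old_share, new_share = 6.6, 10.9  # percent of indexed roles

points = new_share - old_share
change = relative_change(old_share, new_share)

print(f"{points:.1f} percentage points")  # 4.3 percentage points
print(f"{change:.0%} relative increase")  # 65% relative increase
```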

No single AI title dominates. The current breakdown is more fragmented than you’d expect: ML Engineer (13.8%), AI Engineer (10.9%), Data Scientist (10.5%), Research Scientist (6.7%), MLOps / AI Platform / Infrastructure (6.7%), Research Engineer (3.1%), and “LLM” in the title (just 0.9%). Those add up to 52.6%; the remaining ~47% are spread across more specialized titles such as Applied Scientist, Solutions Architect, and AI Product Manager. If you search only for the best-known job titles, you miss nearly half the market.

Most AI listings don’t specify a city at all. Of the 448 currently-indexed roles, 257 — about 57% — list their location as “Remote / Unknown” or some equivalent catch-all. That’s not the same as being remote-friendly: many of these listings are office-based but the source ATS field was left empty. The takeaway for job seekers is that filtering by city excludes more than half the market by default. The 191 roles that do specify a location skew heavily toward a handful of metros: Paris (21), San Francisco (15), London (~12 across name variants), Singapore (5), Toronto (5), and New York (4). Everything else is a long tail of one-off cities.
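
The location figures above can be checked with a few lines of arithmetic. All numbers come from the paragraph itself; London's count is approximate in the source, so the long-tail figure is approximate too.

```python
# Figures quoted in the post (snapshot of 2026-04-20).
total_roles = 448
unspecified = 257  # "Remote / Unknown" or an equivalent catch-all

specified = total_roles - unspecified
share_unspecified = unspecified / total_roles

# Named metros; London is listed as ~12 across name variants.
metros = {
    "Paris": 21,
    "San Francisco": 15,
    "London": 12,
    "Singapore": 5,
    "Toronto": 5,
    "New York": 4,
}
named_total = sum(metros.values())
long_tail = specified - named_total

print(f"unspecified share: {share_unspecified:.1%}")  # 57.4%
print(f"roles with a location: {specified}")          # 191
print(f"roles in named metros: {named_total}")        # 62
print(f"long-tail roles: {long_tail}")                # 129
```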

A small group of companies dominates remote-eligible hiring. Of the roles that actually parse cleanly as remote, five companies account for nearly half of them: Scale AI (53), Databricks (52), Reddit (28), Cresta (26), and Applied Intuition (19). If you want remote-first AI work, these five companies are statistically your best bet right now. One caveat: because a handful of companies post so heavily, they substantially skew how the remote market appears in our data.

What doesn’t HireIndex track or collect?

A few deliberate choices about what this site isn’t:

We don’t syndicate listings. Every role on HireIndex links directly back to the company’s own career page. You apply there. We don’t run an applicant tracking layer, don’t collect resumes, and don’t charge companies to post.

We don’t track compensation. Salary data is either self-reported (unreliable) or scraped from disclosure-mandated jurisdictions (biased toward a few US states). We’d rather show you no data than misleading data.

We don’t score candidates. The scoring we do is at the company level — how intensely a given company is investing in AI hiring, measured by headcount growth, role diversity, and team-page signals. That’s a signal for job seekers trying to decide where to focus, not a tool for companies to filter applicants.

What is HireIndex building next?

Coverage expansion is the main focus for the next couple of months. We’re targeting 500 companies by end of Q2, with better coverage of European and APAC markets (the current index skews heavily US). We’re also building a dashboard that shows hiring velocity over time — which companies are ramping up, which are slowing down, and how that correlates with public signals like earnings calls and funding announcements.

If you’re job-hunting in AI or ML right now, bookmark the skill-city page that matches what you’re looking for — try Machine Learning Engineer roles, Data Scientist positions, or remote AI jobs. Every Monday, it’ll have fresh roles.


Update: expanded data sources (April 26, 2026)

We added six new sources to the pipeline this week. Here’s what each one adds and why we added it.

Ashby is the ATS of choice for most VC-backed AI startups right now — it’s where companies like Cohere, Mistral, and a long tail of Series A/B AI companies post first. We now have 32 companies on Ashby in our index, all pulled directly from their public job board APIs. No scraping required, no rate limits.

SmartRecruiters covers a different segment — larger, more established tech companies that haven’t migrated to Greenhouse or Lever. Direct API access, same as Ashby.

RemoteOK is a free JSON API with no authentication required. It skews toward remote-first and distributed teams, which is a segment that ATS boards underrepresent. Useful for catching companies that don’t use traditional ATS platforms at all.

The Muse offers curated, higher-quality listings with good metadata — company size, culture data, and category tags. Signal-to-noise ratio is better than most aggregators. Requires a free API key.

Adzuna covers US, UK, Australia, and Canada with a single API. Useful for geographic coverage outside the US, where our ATS board data is thinner. Free tier is sufficient for our query volume.

Indeed via Apify fills in companies that don’t use any of the above ATS platforms. Indeed is still the largest job board by raw volume globally, and running targeted searches for ML/AI titles surfaces employers that don’t appear anywhere else in our pipeline. Each query runs against a direct Indeed search URL so we control the exact scope.

All six new sources feed into the same aggregation and deduplication step that previously handled LinkedIn, Greenhouse, and Lever. ATS data takes priority in deduplication — if a role appears on both Greenhouse and Indeed, the Greenhouse record (with its cleaner structured data) wins.
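
A priority-based deduplication step like the one described could look roughly like this. It is a sketch under stated assumptions: the post only says that ATS data wins, so ranking aggregators above scrapers, matching on a normalized (company, title) key, and the field names are all hypothetical choices.

```python
# Lower number = higher priority. Only "ATS first" comes from the post;
# the ordering of the other two tiers is an assumption.
SOURCE_PRIORITY = {
    "greenhouse": 0, "lever": 0, "ashby": 0, "smartrecruiters": 0,
    "remoteok": 1, "themuse": 1, "adzuna": 1,
    "linkedin": 2, "indeed": 2,
}


def dedup_key(role):
    """Hypothetical matching key: normalized company and title."""
    return (role["company"].strip().lower(), role["title"].strip().lower())


def deduplicate(roles):
    """Keep one record per key, preferring the highest-priority source."""
    best = {}
    for role in roles:
        key = dedup_key(role)
        current = best.get(key)
        if current is None or SOURCE_PRIORITY[role["source"]] < SOURCE_PRIORITY[current["source"]]:
            best[key] = role
    return list(best.values())


# Hypothetical sample: the same role seen on Indeed and Greenhouse.
roles = [
    {"company": "Acme AI", "title": "ML Engineer", "source": "indeed"},
    {"company": "acme ai ", "title": "ML engineer", "source": "greenhouse"},
]
print(deduplicate(roles))
```

In practice a (company, title) key is too coarse for companies that post the same title in several locations, so a real key would likely include location or the posting URL.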