Most organizations start lookalike detection with a single idea: "find domains that are one or two edits away from our brand". It is a helpful baseline, but it breaks quickly in the wild. Attackers optimize for humans, not algorithms, and they combine multiple tricks — typos, TLD swaps, keywords, and infrastructure reuse — to avoid simple similarity checks.
Why edit distance fails
Levenshtein distance (or similar string metrics) treats every character equally. Real abuse does not:
- Homoglyphs use characters that look alike in many fonts (for example, replacing "o" with "0", or substituting visually similar Unicode characters such as Cyrillic "а" for Latin "a").
- Word boundary tricks add keywords that shift meaning: "support", "secure", "login", "verify", "billing".
- TLD swapping registers the same label under a cheaper or less regulated TLD.
- Subdomain deception moves the brand into a subdomain, leaving the registrable domain unrelated.
- Campaign infrastructure reuse means the domain string can be novel while the hosting stack is not.
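To make the failure concrete, here is a minimal sketch using a plain Levenshtein implementation against a hypothetical brand, "acme". The homoglyph swap scores as trivially close, while the far more dangerous keyword combo scores as distant:

```python
# Sketch: why raw edit distance misrates lookalikes.
# The brand "acme" is a hypothetical example.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

brand = "acme"
# A homoglyph swap is only 1 edit away -- caught by the metric...
print(levenshtein(brand, "acm3"))               # 1
# ...but a high-risk keyword combo is "far" by the same metric:
print(levenshtein(brand, "acme-secure-login"))  # 13
```

Any threshold loose enough to catch the second case drowns you in false positives; any threshold tight enough to be usable misses it entirely.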
Lookalike detection works best when you treat the domain name as only one signal among many. The highest-confidence cases appear when multiple independent signals agree.
Generating lookalike candidates
Start with coverage. Your pipeline should be able to create and evaluate candidates across these patterns:
1. Typo variants (human errors)
Common typo families include transpositions, dropped characters, doubled characters, and adjacent keyboard substitutions. Many attackers register the most likely typos because they also capture accidental traffic.
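A minimal generator for these families might look like the sketch below. The keyboard adjacency map is deliberately abbreviated for illustration; a real pipeline would carry a full layout table (and ideally per-locale layouts):

```python
# Typo-candidate generator covering the families above.
# QWERTY_ADJACENT is abbreviated for illustration only.

QWERTY_ADJACENT = {
    "a": "qwsz", "e": "wrd", "o": "ip0l", "s": "awedxz",
}

def typo_variants(label: str) -> set[str]:
    out = set()
    for i in range(len(label) - 1):          # transpositions
        out.add(label[:i] + label[i + 1] + label[i] + label[i + 2:])
    for i in range(len(label)):              # dropped characters
        out.add(label[:i] + label[i + 1:])
    for i, ch in enumerate(label):
        out.add(label[:i] + ch + label[i:])  # doubled characters
        for adj in QWERTY_ADJACENT.get(ch, ""):
            out.add(label[:i] + adj + label[i + 1:])  # adjacent-key substitution
    out.discard(label)
    return out

print(sorted(typo_variants("acme"))[:5])
```

Feed each variant back through your resolution pipeline; registered variants with live DNS are the ones worth scoring.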
2. Brand + keyword combinations
These are high-risk because they map directly to phishing narratives. Examples: "{brand}-login", "{brand}-support", "{brand}-secure", "{brand}-billing". Even when the brand token is not a perfect match, the keyword strongly boosts malicious probability.
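Generating these is mechanical; the keyword list below is a hypothetical starting point that you would extend with narratives seen in your own alerts:

```python
# Hypothetical keyword list; extend with phishing narratives from your alerts.
PHISH_KEYWORDS = ["login", "support", "secure", "verify", "billing"]

def keyword_combos(brand: str) -> list[str]:
    combos = []
    for kw in PHISH_KEYWORDS:
        combos += [f"{brand}-{kw}", f"{kw}-{brand}", f"{brand}{kw}"]
    return combos

print(keyword_combos("acme")[:4])  # ['acme-login', 'login-acme', 'acmelogin', 'acme-support']
```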
3. TLD and IDN variants
Evaluate the same label across relevant TLDs (country and generic). If you process internationalized domains, normalize IDNs to a canonical form and score based on what users actually see in the browser UI.
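One way to canonicalize is to always keep both the punycode and the Unicode view of a domain, since the browser-visible form is what users are deceived by. The sketch below uses Python's built-in "idna" codec (IDNA 2003); a production pipeline should prefer the third-party `idna` package for IDNA 2008 / UTS-46 behavior:

```python
# Canonicalize a domain into (punycode, unicode) views using the stdlib
# "idna" codec. Note: this codec implements IDNA 2003; prefer the
# third-party "idna" package in production for IDNA 2008 / UTS-46.

def canonical_forms(domain: str) -> tuple[str, str]:
    """Return (punycode, unicode) views of the same domain."""
    if domain.startswith("xn--") or ".xn--" in domain:
        unicode_form = domain.encode("ascii").decode("idna")
        return domain, unicode_form
    return domain.encode("idna").decode("ascii"), domain

print(canonical_forms("exämple.com"))
print(canonical_forms("xn--bcher-kva.de"))
```

Score similarity against the Unicode view, but store and deduplicate on the punycode view.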
DNS & hosting signals
Once a domain exists, DNS behavior often reveals intent before the phishing page is live.
Nameserver patterns
Campaigns reuse DNS providers and nameserver pairs. Track clusters where many suspicious domains share the same NS configuration or change NS shortly after registration.
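Clustering on the sorted nameserver set is enough to surface these leads. The observations below are illustrative; in practice you would feed NS records collected by your resolver pipeline:

```python
# Group suspicious domains by their (sorted) nameserver pair to surface
# shared campaign infrastructure. Domains and nameservers are illustrative.
from collections import defaultdict

observations = [
    ("acme-login.top",  ("ns1.evil-dns.example", "ns2.evil-dns.example")),
    ("acme-verify.icu", ("ns2.evil-dns.example", "ns1.evil-dns.example")),
    ("unrelated.com",   ("ns1.benign.example", "ns2.benign.example")),
]

clusters = defaultdict(list)
for domain, nameservers in observations:
    clusters[tuple(sorted(nameservers))].append(domain)

# Any cluster with more than one member is a correlation lead.
for ns_pair, members in clusters.items():
    if len(members) > 1:
        print(ns_pair, "->", members)
```

Sorting the pair before keying matters: attackers list the same nameservers in different orders across registrations.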
A/AAAA hosting and ASN reputation
Malicious infrastructure concentrates in certain networks. Assign risk weights based on:
- Hosting provider and ASN history.
- IP co-hosting with known malicious domains.
- Fast-flux style changes over time.
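A toy weighting over these three factors might look like this. The ASN scores, increments, and caps are placeholders to tune against your own incident history, not recommended values:

```python
# Toy infrastructure-risk weighting over pre-collected observations.
# ASN numbers and all weights are hypothetical placeholders.

ASN_RISK = {64500: 0.8, 64501: 0.1}   # e.g. bulletproof host vs reputable cloud

def infra_risk(asn: int, cohosted_bad: int, ip_changes_30d: int) -> float:
    score = ASN_RISK.get(asn, 0.3)                # unknown ASN: neutral prior
    score += min(cohosted_bad * 0.1, 0.5)         # co-hosting with known-bad domains
    score += 0.3 if ip_changes_30d > 10 else 0.0  # fast-flux style churn
    return min(score, 1.0)

print(infra_risk(asn=64500, cohosted_bad=3, ip_changes_30d=14))  # 1.0
```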
MX and email posture
If a lookalike domain sets up MX early, it can be used for outbound impersonation or for receiving victim replies. Also check SPF/DMARC posture; poorly configured policies are common in throwaway domains.
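Posture can be classified from TXT records you have already resolved (SPF lives at the domain apex, DMARC at `_dmarc.<domain>`). The record strings below are illustrative; fetch real ones with your resolver of choice, such as dnspython:

```python
# Classify email posture from pre-resolved TXT records.
# Record strings are illustrative examples.

def email_posture(txt_records: list[str], dmarc_records: list[str]) -> dict:
    spf = next((r for r in txt_records if r.startswith("v=spf1")), None)
    dmarc = next((r for r in dmarc_records if r.startswith("v=DMARC1")), None)
    return {
        "has_spf": spf is not None,
        "spf_softfail_or_neutral": bool(spf) and ("~all" in spf or "?all" in spf),
        "has_dmarc": dmarc is not None,
        "dmarc_enforcing": bool(dmarc) and ("p=reject" in dmarc or "p=quarantine" in dmarc),
    }

print(email_posture(["v=spf1 include:_spf.example.com ~all"], ["v=DMARC1; p=none"]))
```

A lookalike with early MX plus a permissive (or absent) DMARC policy deserves a higher weight than either signal alone.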
TLS & Certificate Transparency signals
Certificate Transparency (CT) logs are a powerful early-warning source. Many phishing domains request certificates soon after they become resolvable.
What to watch in CT
- New SAN entries that include the brand token or suspicious keywords.
- Issuer patterns reused across campaigns (not inherently bad, but helpful for correlation).
- Certificate timing: a certificate issued within hours of registration is a common pre-launch pattern.
</gr-replace>
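The SAN-matching piece reduces to a filter over the names in each entry. The brand token "acme" and the keyword list are hypothetical; real pipelines consume a CT stream such as certstream or the log operators' APIs:

```python
# Filter CT entries whose SAN names contain the brand token or a
# phishing keyword. "acme" and the keyword list are hypothetical.
import re

BRAND = "acme"
KEYWORDS = ("login", "secure", "verify", "support", "billing")
PATTERN = re.compile(rf"(?:{BRAND}|{'|'.join(KEYWORDS)})", re.IGNORECASE)

def suspicious_sans(sans: list[str]) -> list[str]:
    return [name for name in sans if PATTERN.search(name)]

print(suspicious_sans(["acme-verify.top", "mail.example.org", "secure-pay.icu"]))
# ['acme-verify.top', 'secure-pay.icu']
```

Keyword-only hits (like "secure-pay.icu") are weaker than brand hits, so keep the two match classes separate in your scoring.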
Web content signals
When the domain serves content, combine lightweight checks (safe to run frequently) with deeper analysis when risk is already high.
Structural similarity
Instead of pixel comparisons, fingerprint page structure: form fields, DOM trees, key assets, and common library bundles. Kits copy templates, so you can detect them even across different brands.
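One simple stdlib-only fingerprint, sketched below, hashes the ordered tag sequence plus form-field names, so two re-skinned copies of the same kit collide even when all visible text differs. The sample HTML is illustrative:

```python
# Structural fingerprint: hash the ordered tag sequence plus input-field
# names so re-skinned copies of the same phishing kit collide.
import hashlib
from html.parser import HTMLParser

class StructureHasher(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        if tag == "input":  # form fields are strong kit markers
            self.tokens.append(f"input:{dict(attrs).get('name', '')}")

def structure_fingerprint(html: str) -> str:
    parser = StructureHasher()
    parser.feed(html)
    return hashlib.sha256("|".join(parser.tokens).encode()).hexdigest()[:16]

# Same kit, different visible branding -> identical fingerprint.
page_a = "<html><body><form>Sign in to Acme<input name='user'><input name='pass'></form></body></html>"
page_b = "<html><body><form>Log in to Globex<input name='user'><input name='pass'></form></body></html>"
print(structure_fingerprint(page_a) == structure_fingerprint(page_b))  # True
```

In practice you would also fold in asset URLs and script bundle hashes, and use a locality-sensitive hash rather than an exact one to tolerate minor template edits.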
Redirect chains and cloaking
Look for conditional behavior: redirects only for specific user agents, geofencing, or CAPTCHA gates. These are anti-analysis controls; they increase risk even when you cannot see the final phishing page.
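A cheap differential check is to fetch the URL under several client profiles and compare outcomes. The sketch below scores pre-fetched responses rather than making requests itself; profile names and weights are illustrative:

```python
# Heuristic cloaking score over responses fetched with different client
# profiles (e.g. crawler UA vs mobile browser UA). The response tuples
# and the 0.5 weights are illustrative placeholders.

def cloaking_score(responses: dict[str, tuple[int, str]]) -> float:
    """responses: profile name -> (status_code, final URL after redirects)."""
    statuses = {status for status, _ in responses.values()}
    finals = {final for _, final in responses.values()}
    score = 0.0
    if len(statuses) > 1:
        score += 0.5  # different status code per client profile
    if len(finals) > 1:
        score += 0.5  # different redirect destination per profile
    return score

print(cloaking_score({
    "crawler_ua": (403, "https://acme-verify.top/"),
    "mobile_ua":  (200, "https://acme-verify.top/login"),
}))  # 1.0
```

Note that a nonzero score here is evidence of conditional serving, not proof of phishing; it should raise priority for deeper analysis rather than trigger a verdict on its own.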
Scoring, triage and false positives
A practical scoring model typically has three stages:
- Name risk: brand similarity + keywords + TLD.
- Infrastructure risk: DNS, hosting, CT correlations.
- Content risk: similarity, credential capture indicators, cloaking.
False positives often come from resellers, affiliates, and legitimate "fan" or review domains. Reduce noise by whitelisting known partner domains and by requiring at least one non-name signal (infrastructure or content) before escalation.
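The three stages and the escalation gate can be combined in a few lines. All weights, thresholds, and the partner allowlist below are hypothetical placeholders to tune against your own confirmed incidents:

```python
# Three-stage score with the escalation gate described above: escalate
# only when the combined score clears a threshold AND at least one
# non-name stage corroborates. All weights/thresholds are placeholders.

PARTNER_ALLOWLIST = {"acme-resellers.example"}  # hypothetical known partners

def should_escalate(domain: str, name: float, infra: float, content: float,
                    threshold: float = 0.6) -> bool:
    if domain in PARTNER_ALLOWLIST:
        return False
    combined = 0.4 * name + 0.3 * infra + 0.3 * content
    non_name_signal = max(infra, content) >= 0.5  # independent corroboration
    return combined >= threshold and non_name_signal

print(should_escalate("acme-verify.top", name=0.9, infra=0.7, content=0.2))  # True
```

The gate is what keeps name-only hits (fan sites, reviewers, parked typos) out of the analyst queue: a perfect name score with no infrastructure or content signal never escalates.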
Key takeaways
- String similarity alone misses the most damaging lookalikes.
- CT logs and DNS posture frequently provide earlier detection than web crawling.
- Infrastructure correlation turns one confirmed incident into faster detection for the next.
- High-confidence cases come from multi-signal agreement, not one metric.