🧹

Scrub Unicode

Name: Scrub Unicode
Author: a-gnt Community

Remove invisible characters and bidi marks from any string, and flag homoglyph lookalikes for review. The five-line defense that stops a whole class of invisible-text smuggling attacks.

Price: Free (no login needed)

Works With

Claude, ChatGPT, Gemini, Copilot, Claude Mobile, ChatGPT Mobile, Gemini Mobile, VS Code, Cursor, Windsurf, and any other AI app

About

A defensive skill that takes any text and strips zero-width characters, bidirectional overrides, and other invisible Unicode gremlins. It also highlights Cyrillic and Greek homoglyphs (letters that look like Latin but aren't), then shows the cleaned output plus a report of what was removed.

a-gnt's Take

Our honest review

Think of this as teaching your AI a new trick. Once you add it, your AI can strip invisible characters and bidi marks from any string and flag homoglyph lookalikes, with no extra apps or complicated setup needed. It's verified by the creator and completely free.

Tips for getting started

Save this as a .md file in your project folder, or paste it into your CLAUDE.md file. Your AI will automatically use it whenever the skill is relevant.

Soul File

---
name: scrub-unicode
description: Scrub invisible and dangerous Unicode characters from any pasted text. Show the cleaned version plus a report of exactly what was removed and why.
---

The user will paste text. Your job: sanitize it and return both the cleaned version and a report of what was removed.

## What to remove

Scan the input character by character. Flag anything in these Unicode ranges:

**1. Zero-width and formatting (silently removed):**
- U+200B  ZERO WIDTH SPACE
- U+200C  ZERO WIDTH NON-JOINER
- U+200D  ZERO WIDTH JOINER
- U+2060  WORD JOINER
- U+FEFF  BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE

**2. Bidirectional overrides (silently removed):**
- U+202A  LEFT-TO-RIGHT EMBEDDING
- U+202B  RIGHT-TO-LEFT EMBEDDING
- U+202C  POP DIRECTIONAL FORMATTING
- U+202D  LEFT-TO-RIGHT OVERRIDE
- U+202E  RIGHT-TO-LEFT OVERRIDE
- U+2066-U+2069  ISOLATES

**3. Control characters (silently removed unless tab/newline):**
- U+0000-U+0008, U+000B-U+000C, U+000E-U+001F
- U+007F  DELETE
- U+0080-U+009F  C1 control codes

**4. Homoglyphs (flagged but NOT removed — the user should review):**
- Cyrillic letters that look like Latin: а (U+0430), е (U+0435), о (U+043E), р (U+0440), с (U+0441), х (U+0445), у (U+0443), etc.
- Greek letters that look like Latin: ο (U+03BF), ν (U+03BD), etc.
- Mathematical alphanumeric symbols (U+1D400-U+1D7FF) that render like Latin letters but aren't.
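
The flag-but-do-not-remove behavior for category 4 could be sketched roughly as below. The `LOOKALIKES` table and the `flag_homoglyphs` helper are hypothetical names introduced for illustration, and the table covers only the letters listed above, not a full confusables list. Positions are 0-based string indexes, which is an assumption.

```python
import unicodedata

# Illustrative lookalike table (an assumption, NOT an exhaustive confusables list).
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\u03bd": "v",  # GREEK SMALL LETTER NU
}


def flag_homoglyphs(text: str) -> list:
    """Return one record per suspected homoglyph; the text itself is never modified."""
    flags = []
    for pos, ch in enumerate(text):
        if ch in LOOKALIKES:
            looks_like = LOOKALIKES[ch]
        elif 0x1D400 <= ord(ch) <= 0x1D7FF:
            # Mathematical alphanumerics: NFKC folds them to the plain character.
            looks_like = unicodedata.normalize("NFKC", ch)
        else:
            continue
        flags.append({
            "position": pos,  # 0-based index into the input
            "char": ch,
            "codepoint": f"U+{ord(ch):04X}",
            "name": unicodedata.name(ch, "UNKNOWN"),
            "looks_like": looks_like,
        })
    return flags


for flag in flag_homoglyphs("appl\u0435"):
    print(flag["position"], flag["codepoint"], flag["name"], "looks like", flag["looks_like"])
```

A lookup like this flags every listed letter, including those in legitimate Cyrillic or Greek text, which is why the result is a list for the user to review and not a rewrite of the input.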

## How to output

### Section 1 — The cleaned text

```
✨ Cleaned text:
---
<text with invisible/bidi/control chars removed>
---
```

### Section 2 — The report

```
🔍 Removed:
  • U+200B ZERO WIDTH SPACE (3 occurrences) — at positions 12, 47, 89
  • U+202E RIGHT-TO-LEFT OVERRIDE (1 occurrence) — at position 23

⚠️ Flagged (review manually — these LOOK normal but are not):
  • Position 15: "е" is U+0435 CYRILLIC SMALL LETTER IE (looks like Latin 'e' but isn't). Context: "applе" ← note the last character.
  • Position 42: "о" is U+043E CYRILLIC SMALL LETTER O (looks like Latin 'o'). Context: "logоut"
```
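
For readers who want to reproduce the "Removed" part of the report locally, one possible sketch follows. `REMOVABLE` and `removal_report` are hypothetical names. The character class mirrors categories 1 to 3 of this skill. Positions are assumed to be 0-based indexes into the original text, since the template does not say which convention it uses.

```python
import re
import unicodedata
from collections import defaultdict

# Categories 1-3 of this skill: characters that are removed, not merely flagged.
REMOVABLE = re.compile(
    r"[\u200B-\u200D\u2060\uFEFF"      # zero-width and formatting
    r"\u202A-\u202E\u2066-\u2069"      # bidi embeddings, overrides, isolates
    r"\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]"  # controls except tab, LF, CR
)


def removal_report(text: str) -> tuple:
    """Return (cleaned_text, report_lines) for the removable characters in text."""
    positions = defaultdict(list)
    for match in REMOVABLE.finditer(text):
        positions[match.group()].append(match.start())

    lines = []
    for ch, where in positions.items():
        # C0/C1 control characters have no Unicode name, so supply a fallback.
        name = unicodedata.name(ch, "CONTROL CHARACTER")
        noun = "occurrence" if len(where) == 1 else "occurrences"
        label = "position" if len(where) == 1 else "positions"
        spots = ", ".join(str(p) for p in where)
        lines.append(f"U+{ord(ch):04X} {name} ({len(where)} {noun}) at {label} {spots}")
    return REMOVABLE.sub("", text), lines


cleaned, report = removal_report("a\u200bb\u200bc\u202ed")
print(cleaned)
for line in report:
    print("  •", line)
```

Grouping by character keeps the report short when the same invisible character appears many times, which matches the per-character layout of the template above.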

### Section 3 — The verdict

```
Summary: <N> characters removed, <M> homoglyphs flagged.

Verdict: [SAFE / REVIEW / HOSTILE]
  • SAFE = nothing suspicious found
  • REVIEW = characters removed but context looks benign (e.g. emoji encoding)
  • HOSTILE = removed chars were positioned in ways consistent with a smuggling attack (e.g. embedded inside a URL, between words, inside what looks like a system directive)
```
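
The verdict is ultimately a judgment call, but a deliberately narrow heuristic for one of the HOSTILE cases (an invisible character embedded inside a URL) might look like the sketch below. The `verdict` function, the `URL` pattern, and the `homoglyphs_flagged` parameter are all assumptions made for illustration. The other HOSTILE signals named above (between words, inside what looks like a system directive) are not covered and still need context-aware review.

```python
import re

# Zero-width and bidi characters from categories 1 and 2 of this skill.
INVISIBLE = re.compile(r"[\u200B-\u200D\u2060\uFEFF\u202A-\u202E\u2066-\u2069]")
# Hypothetical URL matcher: a scheme followed by a run of non-whitespace characters.
# These invisible characters are not whitespace, so they stay inside the match.
URL = re.compile(r"https?://\S+")


def verdict(text: str, homoglyphs_flagged: int = 0) -> str:
    """Classify text as SAFE, REVIEW, or HOSTILE using only the inside-a-URL signal."""
    hits = [m.start() for m in INVISIBLE.finditer(text)]
    if not hits and homoglyphs_flagged == 0:
        return "SAFE"
    url_spans = [m.span() for m in URL.finditer(text)]
    inside_url = any(start <= pos < end for pos in hits for (start, end) in url_spans)
    return "HOSTILE" if inside_url else "REVIEW"


print(verdict("hello world"))
print(verdict("family: \U0001F468\u200d\U0001F469"))
print(verdict("visit https://exam\u200bple.com now"))
```

Treating a homoglyph-only input as REVIEW is an assumption here; the definitions above only describe REVIEW in terms of removed characters.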

## If nothing was found

Just say:
```
✨ Clean — no invisible, bidirectional, or suspicious characters found.
```

## Python snippet for the user

At the end, offer to show them the underlying Python regex so they can run this locally:

```python
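import re

# Zero-width/bidi formatting (U+200B-U+200F, U+202A-U+202E, U+2060-U+206F, U+FEFF)
# plus C0/C1 controls; tab (U+0009), LF (U+000A), and CR (U+000D) are kept.
SUSPICIOUS = re.compile(
    r'[\u200B-\u200F\u202A-\u202E\u2060-\u206F\uFEFF\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]'
)

def scrub(text: str) -> str:
    """Return text with every SUSPICIOUS character deleted."""
    return SUSPICIOUS.sub('', text)
```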
```
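
A short usage example may help the user see what the snippet does and does not cover. The pattern and `scrub` are repeated so the example runs on its own; the sample strings are made up for illustration.

```python
import re

# Same pattern as the snippet above, repeated so this example is self-contained.
SUSPICIOUS = re.compile(
    r'[\u200B-\u200F\u202A-\u202E\u2060-\u206F\uFEFF\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]'
)

def scrub(text: str) -> str:
    return SUSPICIOUS.sub('', text)

# A zero-width space and a right-to-left override hidden in ordinary-looking text.
dirty = "pay\u200bpal.com\u202e/login\tok\n"
print(repr(scrub(dirty)))  # → 'paypal.com/login\tok\n'

# Homoglyphs are visible letters, so the regex leaves them alone.
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
print(scrub(spoofed) == spoofed)  # → True
```

Tabs and newlines survive, and homoglyphs pass through untouched, so the regex covers categories 1 to 3 only; category 4 still needs the separate flagging step.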

## Rules

- Never pretend to scan without actually scanning. If the user pastes 10 paragraphs, do the scan.
- Never remove visible content just because it's non-ASCII (emoji, CJK, Arabic, etc.). Only remove INVISIBLE/FORMATTING characters.
- When in doubt about a homoglyph, FLAG it, don't remove it. Removal could corrupt legitimate multilingual content.

What's New

Version 1.0.0 · 2 months ago

Initial release
