Arrtour
Technical Report
MangaTrans is an automated manga translation system that processes Japanese comic pages through a four-stage pipeline: text detection, inpainting, OCR, and translation & rendering. On RTX 4090 hardware the system processes a page in approximately 3.4 seconds. Key innovations include direction-aware textline merging, GPU-accelerated MangaOCR batch processing, and context-preserving full-page translation.
MangaTrans processes manga pages sequentially (one page at a time) through four stages:
Text Detection: DBNet detects textlines, which are merged into semantic blocks (speech bubbles)
Inpainting: LamaLarge removes detected text, producing clean background
OCR: MangaOCR (vertical) and RapidOCR (horizontal) extract Japanese text
Translation & Rendering: DeepL translates full-page context; text is rendered with direction-aware layout
Performance (RTX 4090, warm start):
Text detection + export: 0.4s
OCR: 0.7s
Inpainting (parallel with translation): 0.7s
Translation (DeepL): 1.0s
Rendering: 0.7s
Total: ~3.4s per page
Input Image
↓
[Stage 1: Detection]
DBNet → Textline Detection → Spatial Merging → Direction Voting
↓
merged.json (blocks with coordinates, direction)
↓
[Stage 2: Parallel Processing]
├─→ [OCR] MangaOCR (vertical) + RapidOCR (horizontal)
│ ↓
│ merged.json + text annotations
│ ↓
│ [Parallel Branch A: Translation]
│ DeepL full-page batch translation
│ ↓
│ merged.json + translations
│
└─→ [Parallel Branch B: Inpainting]
Mask Refinement (CRF) → LamaLarge Inpainting
↓
cleaned_image.png
↓
[Stage 3: Rendering]
Direction-aware text placement
↓
translated_image.png
Model: DBNet (Differentiable Binarization Network)
Architecture: ResNet-34 backbone
Input resolution: 2560px (configurable)
Output: Quadrilateral textline regions
Textline Merging Algorithm:
Spatial Clustering: Group nearby textlines based on:
Bounding box IoU (intersection-over-union)
Centroid proximity
Principal axis alignment
Bounding Quadrilateral Construction: Compute the minimum rotated rectangle enclosing each cluster's convex hull using Shapely polygon operations
Direction Estimation: For each quadrilateral with vertices p_i = (x_i, y_i):
Identify two longest sides (text flow direction)
Compute structural vector s from long-side vectors
Classify as vertical if |s_x| ≤ |s_y| (height ≥ width)
Voting Mechanism: Each textline votes for horizontal ('h') or vertical ('v'); majority determines block direction
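The direction estimation and voting steps above can be sketched as follows. This is an illustrative pure-Python reconstruction of the long-side heuristic, not the project's actual code; vertex ordering and tie-breaking details are assumptions.

```python
import math
from collections import Counter

def textline_direction(quad):
    """Classify one quadrilateral textline as 'h' or 'v'.

    quad: four (x, y) vertices in order. The two longest sides follow
    the text flow; their summed vector s gives the structural direction.
    """
    # Edge vectors between consecutive vertices.
    edges = []
    for i in range(4):
        x0, y0 = quad[i]
        x1, y1 = quad[(i + 1) % 4]
        edges.append((x1 - x0, y1 - y0))
    # Pick the two longest sides (the text flow direction).
    edges.sort(key=lambda v: math.hypot(*v), reverse=True)
    (ax, ay), (bx, by) = edges[0], edges[1]
    # Flip one vector if they point opposite ways, then sum into s.
    if ax * bx + ay * by < 0:
        bx, by = -bx, -by
    sx, sy = ax + bx, ay + by
    # Vertical if |s_x| <= |s_y|, i.e. height >= width.
    return 'v' if abs(sx) <= abs(sy) else 'h'

def block_direction(quads):
    """Majority vote across a block's textlines."""
    votes = Counter(textline_direction(q) for q in quads)
    return votes.most_common(1)[0][0]
```

A tall thin quadrilateral yields `'v'`, a wide one `'h'`; the block label follows the majority of its textlines.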
Output Format: Metadata stored in merged.json containing block list with quadrilateral coordinates, direction label, original text, and translation placeholder.
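A plausible shape for one merged.json block is shown below; the exact field names are assumptions inferred from the description above (quadrilateral coordinates, direction label, original text, translation placeholder).

```python
import json

# Illustrative merged.json entry; field names are assumed, not confirmed.
block = {
    "quad": [[412, 88], [530, 85], [533, 402], [415, 405]],
    "direction": "v",      # 'v' = vertical, 'h' = horizontal
    "text": "こんにちは",    # filled in by the OCR stage
    "translation": None,    # placeholder until the translation stage
}
print(json.dumps({"blocks": [block]}, ensure_ascii=False, indent=2))
```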
Mask Refinement:
CRF-based Boundary Alignment: Conditional Random Field aligns mask edges with text boundaries using image gradients
Morphological Dilation: 40px expansion ensures complete text coverage
Gaussian Smoothing: 5px kernel softens edges to reduce inpainting seams
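The dilation step can be illustrated with a toy pure-Python binary dilation over a square structuring element; a production pipeline would use cv2.dilate on a uint8 mask with the 40px radius stated above.

```python
def dilate(mask, radius):
    """Brute-force binary dilation with a (2*radius+1) square element.

    Toy stand-in for the morphological dilation step: every foreground
    pixel expands to cover its neighborhood, guaranteeing the mask
    fully covers the detected text strokes.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out
```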
Inpainting Model: LamaLarge (dreMaz/AnimeMangaInpainting)
Fourier convolution-based architecture
Resolution: 2560px (configurable)
Specialized for manga: preserves screentones, halftones, panel boundaries
Processing: Inpainting runs in parallel with translation to minimize latency.
Dual-OCR Architecture:
MangaOCR (Vertical Japanese Text, 70-90% of content):
Model: Vision Transformer (ViT) encoder + GPT decoder
Specialty: Native vertical text support, handles stylized manga fonts
GPU Batch Processing: Processes up to 32 text regions simultaneously
Performance gain: 2.3-2.9× speedup over sequential processing
RapidOCR (Horizontal Text, 10-30% of content):
Model: PaddleOCR ONNX Runtime
Sequential processing (ONNX limitation)
Use case: Sound effects, signs, horizontal captions
Processing Strategy:
Separate blocks by direction
MangaOCR: Batch process all vertical blocks together
RapidOCR: Sequential process horizontal blocks
Both run in parallel using thread-pool parallelism
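The processing strategy above can be sketched with the standard library's thread pool. The two recognizer callables are hypothetical stand-ins for MangaOCR batch inference and RapidOCR per-region inference.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dual_ocr(blocks, manga_ocr_batch, rapid_ocr_single):
    """Split blocks by direction and run both OCR engines concurrently.

    manga_ocr_batch: callable taking a list of vertical blocks (one
        batched GPU call). rapid_ocr_single: callable taking one
        horizontal block (looped sequentially, an ONNX limitation).
    Both names are illustrative placeholders.
    """
    vertical = [b for b in blocks if b["direction"] == "v"]
    horizontal = [b for b in blocks if b["direction"] == "h"]
    with ThreadPoolExecutor(max_workers=2) as pool:
        v_future = pool.submit(manga_ocr_batch, vertical)
        h_future = pool.submit(lambda: [rapid_ocr_single(b) for b in horizontal])
        return v_future.result(), h_future.result()
```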
Text Normalization: Remove all whitespace to produce canonical representations for translation.
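A minimal sketch of the normalization step:

```python
import re

def normalize(text):
    """Strip all whitespace to a canonical form before translation."""
    # In Python str patterns, \s matches Unicode whitespace, including
    # the ideographic space U+3000 common in Japanese text.
    return re.sub(r"\s+", "", text)
```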
Translation Provider: DeepL Pro API
Context-Aware Strategy:
Full-page batching: All text blocks from one page submitted as a single batch
Each block is a separate API parameter, maintaining structure
API returns translations in same order (one-to-one mapping)
Benefits:
Pronoun Resolution: Surrounding context disambiguates pronouns and references
Tone Consistency: Maintains formality level (casual vs. polite) across page
Idiomatic Expressions: Multi-bubble phrases translated appropriately, preserving confrontational or polite tones based on full dialogue flow
Robustness: Retry logic with exponential backoff; failed blocks retranslated individually.
Direction-Aware Framework: Switches between vertical and horizontal algorithms based on block direction metadata.
Vertical Rendering Algorithm (direction='v'):
Punctuation Rotation: Convert horizontal punctuation to vertical Unicode variants
"。" → "︒" (period), "(" → "︵" (left paren), etc.
80+ character mappings for authentic typesetting
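A few entries from such a mapping table, using Unicode vertical presentation forms; the full table described above would cover brackets, dashes, and ellipses as well.

```python
# Horizontal punctuation -> vertical presentation forms (a small sample).
VERTICAL_PUNCT = str.maketrans({
    "。": "︒",  # ideographic full stop -> U+FE12
    "、": "︑",  # ideographic comma -> U+FE11
    "(": "︵",  # fullwidth left paren -> U+FE35
    ")": "︶",  # fullwidth right paren -> U+FE36
    "「": "﹁",  # left corner bracket -> U+FE41
    "」": "﹂",  # right corner bracket -> U+FE42
})

def rotate_punctuation(text):
    """Swap horizontal punctuation for its vertical variant."""
    return text.translate(VERTICAL_PUNCT)
```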
Column-Based Layout (right-to-left):
Text flows top-to-bottom in columns
Newlines force new column
Character measurement accounts for variable-width CJK fonts
Vertical spacing: max(2, font_size/8) pixels
Font Sizing: Binary search over [12, max(w,h)×1.2] to find largest fitting size
Vertical Centering: The column layout is centered within the bounding box when the rendered text height is less than the box height
Horizontal Rendering Algorithm (direction='h'):
Text Wrapping:
CJK: Character-by-character wrapping (no word boundaries)
Latin: Word-based wrapping with even distribution
Font Sizing: Same binary search as vertical
Centered Placement: Anchor='mm' (middle-middle) positioning
Stroke Rendering: Black text with white outline (stroke_width = max(1, font_size/14)) for readability over complex backgrounds.
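The two wrapping modes can be sketched as one function; this simplified version breaks on character counts rather than measured pixel widths, which the real renderer would use.

```python
def wrap_line(text, max_chars, cjk):
    """Wrap one line of text to at most max_chars per row.

    CJK text may break between any two characters; Latin text breaks
    only at word boundaries (a word longer than max_chars gets its
    own row rather than being split).
    """
    if cjk:
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    rows, row = [], ""
    for word in text.split():
        candidate = (row + " " + word).strip()
        if len(candidate) <= max_chars or not row:
            row = candidate
        else:
            rows.append(row)
            row = word
    if row:
        rows.append(row)
    return rows
```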
Framework: FastAPI
Primary Endpoint: POST /process
Request Parameters (multipart form-data):
files: Image uploads (JPEG/PNG/WebP)
src_lang: Source language (default: ja)
tgt_lang: Target language (default: zh-CN)
provider: Translation provider (deepl/grok/deepseek)
device: Computation device (auto/cpu/cuda)
Response: Returns results array containing filename, base64-encoded rendered image, and metadata with detected text blocks.
API Key Authentication: X-API-Key header (configured via MANGA_API_KEYS)
Rate Limiting: Sliding window, default 1000 requests/minute per key
Concurrency Control: Semaphore-based job limiter prevents overload
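The per-key sliding-window rate limit can be sketched as an in-process structure; a deployment with multiple workers would need shared storage (e.g. Redis) instead.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Per-API-key sliding-window rate limiter (default 1000 req/min)."""

    def __init__(self, limit=1000, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # api_key -> request timestamps

    def allow(self, api_key, now=None):
        """Record one request; return False if the key is over its limit."""
        now = time.monotonic() if now is None else now
        q = self.hits[api_key]
        # Evict timestamps that fell out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```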
Target: Chrome/Edge (Manifest V3)
Components:
Content Script: Detects manga panels, extracts images via DOM/canvas
Background Worker: Manages API communication, caches translated pages
API Integration: Constructs FormData, POSTs to endpoint, receives base64 images
Deployment:
Local: http://localhost:8000
LAN: http://192.168.x.x:8000 (firewall configuration required)
Public: ngrok/cloudflared tunneling
Challenge: Client-side rendering prevents direct image URL extraction.
Solution: Playwright browser automation
Navigate to reader page, wait for DOM ready
Scroll to trigger lazy loading, wait for decode
Element-level screenshots (omitBackground: true) preserve quality
Batch POST to API endpoint
Performance: ~30 pages/minute (bottleneck: network download)
Alternative: Browser extension content scripts intercept image URLs directly (bypasses screenshot quality loss, requires CORS configuration).