Arrtour
Technical Report
MangaTrans is an automated manga translation system that processes Japanese comic pages through a four-stage pipeline: text detection, inpainting, OCR, and translation & rendering. On RTX 4090 hardware the system processes a page in approximately 3.4 seconds. Key innovations include direction-aware textline merging, GPU-accelerated MangaOCR batch processing, and context-preserving full-page translation.
MangaTrans processes manga pages sequentially (one page at a time) through four stages:
Text Detection: DBNet detects textlines, which are merged into semantic blocks (speech bubbles)
Inpainting: LamaLarge removes detected text, producing clean background
OCR: MangaOCR (vertical) and RapidOCR (horizontal) extract Japanese text
Translation & Rendering: DeepL translates full-page context; text is rendered with direction-aware layout
Performance (RTX 4090, warm start):
Text detection + export: 0.4s
OCR: 0.7s
Inpainting (parallel with translation): 0.7s
Translation (DeepL): 1.0s
Rendering: 0.7s
Total: ~3.4s per page
Input Image
↓
[Stage 1: Detection]
DBNet → Textline Detection → Spatial Merging → Direction Voting
↓
merged.json (blocks with coordinates, direction)
↓
[Stage 2: Parallel Processing]
├─→ [OCR] MangaOCR (vertical) + RapidOCR (horizontal)
│ ↓
│ merged.json + text annotations
│ ↓
│ [Parallel Branch A: Translation]
│ DeepL full-page batch translation
│ ↓
│ merged.json + translations
│
└─→ [Parallel Branch B: Inpainting]
Mask Refinement (CRF) → LamaLarge Inpainting
↓
cleaned_image.png
↓
[Stage 3: Rendering]
Direction-aware text placement
↓
translated_image.png
Model: DBNet (Differentiable Binarization Network)
Architecture: ResNet-34 backbone
Input resolution: 2560px (configurable)
Output: Quadrilateral textline regions
Textline Merging Algorithm:
Spatial Clustering: Group nearby textlines based on:
Bounding box IoU (intersection-over-union)
Centroid proximity
Principal axis alignment
Bounding Quadrilateral Construction: Compute the minimum rotated rectangle enclosing each cluster's convex hull using Shapely polygon operations
Direction Estimation: For each quadrilateral with vertices p_i = (x_i, y_i):
Identify two longest sides (text flow direction)
Compute structural vector s from long-side vectors
Classify as vertical if |s_x| ≤ |s_y| (height ≥ width)
Voting Mechanism: Each textline votes for horizontal ('h') or vertical ('v'); majority determines block direction
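The direction estimation and voting steps above can be sketched as follows. This is an illustrative pure-Python reconstruction of the long-side heuristic, not the project's actual code; vertex ordering and tie-breaking details are assumptions.

```python
import math
from collections import Counter

def textline_direction(quad):
    """Classify one quadrilateral textline as 'h' or 'v'.

    quad: four (x, y) vertices in order. The two longest sides follow
    the text flow; their summed vector s gives the structural direction.
    """
    # Edge vectors between consecutive vertices.
    edges = []
    for i in range(4):
        x0, y0 = quad[i]
        x1, y1 = quad[(i + 1) % 4]
        edges.append((x1 - x0, y1 - y0))
    # Pick the two longest sides (the text flow direction).
    edges.sort(key=lambda v: math.hypot(*v), reverse=True)
    (ax, ay), (bx, by) = edges[0], edges[1]
    # Flip one vector if they point opposite ways, then sum into s.
    if ax * bx + ay * by < 0:
        bx, by = -bx, -by
    sx, sy = ax + bx, ay + by
    # Vertical if |s_x| <= |s_y|, i.e. height >= width.
    return 'v' if abs(sx) <= abs(sy) else 'h'

def block_direction(quads):
    """Majority vote across a block's textlines."""
    votes = Counter(textline_direction(q) for q in quads)
    return votes.most_common(1)[0][0]
```

A tall thin quadrilateral yields `'v'`, a wide one `'h'`; the block label follows the majority of its textlines.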
Output Format: Metadata stored in merged.json containing block list with quadrilateral coordinates, direction label, original text, and translation placeholder.
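A plausible shape for one merged.json block is shown below; the exact field names are assumptions inferred from the description above (quadrilateral coordinates, direction label, original text, translation placeholder).

```python
import json

# Illustrative merged.json entry; field names are assumed, not confirmed.
block = {
    "quad": [[412, 88], [530, 85], [533, 402], [415, 405]],
    "direction": "v",      # 'v' = vertical, 'h' = horizontal
    "text": "こんにちは",    # filled in by the OCR stage
    "translation": None,    # placeholder until the translation stage
}
print(json.dumps({"blocks": [block]}, ensure_ascii=False, indent=2))
```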
Mask Refinement:
CRF-based Boundary Alignment: Conditional Random Field aligns mask edges with text boundaries using image gradients
Morphological Dilation: 40px expansion ensures complete text coverage
Gaussian Smoothing: 5px kernel softens edges to reduce inpainting seams
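The dilation step can be illustrated with a toy pure-Python binary dilation over a square structuring element; a production pipeline would use cv2.dilate on a uint8 mask with the 40px radius stated above.

```python
def dilate(mask, radius):
    """Brute-force binary dilation with a (2*radius+1) square element.

    Toy stand-in for the morphological dilation step: every foreground
    pixel expands to cover its neighborhood, guaranteeing the mask
    fully covers the detected text strokes.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out
```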
Inpainting Model: LamaLarge (dreMaz/AnimeMangaInpainting)
Fourier convolution-based architecture
Resolution: 2560px (configurable)
Specialized for manga: preserves screentones, halftones, panel boundaries
Processing: Inpainting runs in parallel with translation to minimize latency.
Dual-OCR Architecture:
MangaOCR (Vertical Japanese Text, 70-90% of content):
Model: Vision Transformer (ViT) encoder + GPT decoder
Specialty: Native vertical text support, handles stylized manga fonts
GPU Batch Processing: Processes up to 32 text regions simultaneously
Performance gain: 2.3-2.9× speedup over sequential processing
RapidOCR (Horizontal Text, 10-30% of content):
Model: PaddleOCR ONNX Runtime
Sequential processing (ONNX limitation)
Use case: Sound effects, signs, horizontal captions
Processing Strategy:
Separate blocks by direction
MangaOCR: Batch process all vertical blocks together
RapidOCR: Sequential process horizontal blocks
Both run in parallel using thread-pool parallelism
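The processing strategy above can be sketched with the standard library's thread pool. The two recognizer callables are hypothetical stand-ins for MangaOCR batch inference and RapidOCR per-region inference.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dual_ocr(blocks, manga_ocr_batch, rapid_ocr_single):
    """Split blocks by direction and run both OCR engines concurrently.

    manga_ocr_batch: callable taking a list of vertical blocks (one
        batched GPU call). rapid_ocr_single: callable taking one
        horizontal block (looped sequentially, an ONNX limitation).
    Both names are illustrative placeholders.
    """
    vertical = [b for b in blocks if b["direction"] == "v"]
    horizontal = [b for b in blocks if b["direction"] == "h"]
    with ThreadPoolExecutor(max_workers=2) as pool:
        v_future = pool.submit(manga_ocr_batch, vertical)
        h_future = pool.submit(lambda: [rapid_ocr_single(b) for b in horizontal])
        return v_future.result(), h_future.result()
```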
Text Normalization: Remove all whitespace to produce canonical representations for translation.
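A minimal sketch of the normalization step:

```python
import re

def normalize(text):
    """Strip all whitespace to a canonical form before translation."""
    # In Python str patterns, \s matches Unicode whitespace, including
    # the ideographic space U+3000 common in Japanese text.
    return re.sub(r"\s+", "", text)
```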
Translation Provider: DeepL Pro API
Context-Aware Strategy:
Full-page batching: All text blocks from one page submitted as a single batch
Each block is a separate API parameter, maintaining structure
API returns translations in same order (one-to-one mapping)
Benefits:
Pronoun Resolution: Surrounding context disambiguates pronouns and references
Tone Consistency: Maintains formality level (casual vs. polite) across page
Idiomatic Expressions: Multi-bubble phrases translated appropriately, preserving confrontational or polite tones based on full dialogue flow
Robustness: Retry logic with exponential backoff; failed blocks retranslated individually.
Direction-Aware Framework: Switches between vertical and horizontal algorithms based on block direction metadata.
Vertical Rendering Algorithm (direction='v'):
Punctuation Rotation: Convert horizontal punctuation to vertical Unicode variants
"。" → "︒" (period), "(" → "︵" (left paren), etc.
80+ character mappings for authentic typesetting
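A few entries from such a mapping table, using Unicode vertical presentation forms; the full table described above would cover brackets, dashes, and ellipses as well.

```python
# Horizontal punctuation -> vertical presentation forms (a small sample).
VERTICAL_PUNCT = str.maketrans({
    "。": "︒",  # ideographic full stop -> U+FE12
    "、": "︑",  # ideographic comma -> U+FE11
    "(": "︵",  # fullwidth left paren -> U+FE35
    ")": "︶",  # fullwidth right paren -> U+FE36
    "「": "﹁",  # left corner bracket -> U+FE41
    "」": "﹂",  # right corner bracket -> U+FE42
})

def rotate_punctuation(text):
    """Swap horizontal punctuation for its vertical variant."""
    return text.translate(VERTICAL_PUNCT)
```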
Column-Based Layout (right-to-left):
Text flows top-to-bottom in columns
Newlines force new column
Character measurement accounts for variable-width CJK fonts
Vertical spacing: max(2, font_size/8) pixels
Font Sizing: Binary search over [12, max(w,h)×1.2] to find largest fitting size
Vertical Centering: The column layout is centered within the bounding box when the rendered text height is less than the box height
Horizontal Rendering Algorithm (direction='h'):
Text Wrapping:
CJK: Character-by-character wrapping (no word boundaries)
Latin: Word-based wrapping with even distribution
Font Sizing: Same binary search as vertical
Centered Placement: Anchor='mm' (middle-middle) positioning
Stroke Rendering: Black text with white outline (stroke_width = max(1, font_size/14)) for readability over complex backgrounds.
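The two wrapping modes can be sketched as one function; this simplified version breaks on character counts rather than measured pixel widths, which the real renderer would use.

```python
def wrap_line(text, max_chars, cjk):
    """Wrap one line of text to at most max_chars per row.

    CJK text may break between any two characters; Latin text breaks
    only at word boundaries (a word longer than max_chars gets its
    own row rather than being split).
    """
    if cjk:
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    rows, row = [], ""
    for word in text.split():
        candidate = (row + " " + word).strip()
        if len(candidate) <= max_chars or not row:
            row = candidate
        else:
            rows.append(row)
            row = word
    if row:
        rows.append(row)
    return rows
```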
Framework: FastAPI
Primary Endpoint: POST /process
Request Parameters (multipart form-data):
files: Image uploads (JPEG/PNG/WebP)
src_lang: Source language (default: ja)
tgt_lang: Target language (default: zh-CN)
provider: Translation provider (deepl/grok/deepseek)
device: Computation device (auto/cpu/cuda)
Response: Returns results array containing filename, base64-encoded rendered image, and metadata with detected text blocks.
API Key Authentication: X-API-Key header (configured via MANGA_API_KEYS)
Rate Limiting: Sliding window, default 1000 requests/minute per key
Concurrency Control: Semaphore-based job limiter prevents overload
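The per-key sliding-window rate limit can be sketched as an in-process structure; a deployment with multiple workers would need shared storage (e.g. Redis) instead.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Per-API-key sliding-window rate limiter (default 1000 req/min)."""

    def __init__(self, limit=1000, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # api_key -> request timestamps

    def allow(self, api_key, now=None):
        """Record one request; return False if the key is over its limit."""
        now = time.monotonic() if now is None else now
        q = self.hits[api_key]
        # Evict timestamps that fell out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```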
Target: Chrome/Edge (Manifest V3)
Components:
Content Script: Detects manga panels, extracts images via DOM/canvas
Background Worker: Manages API communication, caches translated pages
API Integration: Constructs FormData, POSTs to endpoint, receives base64 images
Deployment:
Local: http://localhost:8000
LAN: http://192.168.x.x:8000 (firewall configuration required)
Public: ngrok/cloudflared tunneling
Challenge: Client-side rendering prevents direct image URL extraction.
Solution: Playwright browser automation
Navigate to reader page, wait for DOM ready
Scroll to trigger lazy loading, wait for decode
Element-level screenshots (omitBackground: true) preserve quality
Batch POST to API endpoint
Performance: ~30 pages/minute (bottleneck: network download)
Alternative: Browser extension content scripts intercept image URLs directly (bypasses screenshot quality loss, requires CORS configuration).