MangaTrans is an automated Japanese manga translation system designed to turn raw page images into readable translated pages with minimal manual intervention. At a high level, the service detects text regions, groups them into semantically useful blocks, performs OCR, translates the recovered dialogue, removes the original text from the artwork, and finally re-renders the translated result back into the page. The implementation is intentionally modular: detection, OCR, translation, inpainting, and rendering can each evolve independently as models or providers change.
In practice, the pipeline is optimized not merely for correctness but for usability. Direction-aware block merging improves OCR quality on mixed vertical and horizontal layouts, full-page translation preserves conversational context better than sentence-by-sentence dispatch, and the service overlaps inpainting with translation to reduce visible latency. On warm RTX 4090 runs, end-to-end processing is typically around 3.4 seconds per page, which is fast enough for interactive reading workflows while still preserving a reasonably high visual standard.
MangaTrans operates on one page at a time, but each page passes through a carefully staged pipeline rather than a monolithic black box. The first stage exports textline detections and converts them into merged text blocks that approximate speech balloons, captions, and other semantically coherent regions. The second stage performs OCR over those merged regions. Once text has been extracted, translation and image cleanup proceed concurrently: the system sends page text to the selected translation backend while a parallel branch prepares masks and runs inpainting to erase the original Japanese text. The final stage renders translated text back into the cleaned image using a direction-aware layout engine.
This organization reflects a practical trade-off. OCR must happen before translation because the translation request depends on recognized text, but inpainting can safely overlap with translation because both consume the same detection output and do not depend on one another. The result is a pipeline that remains conceptually simple while still taking advantage of parallelism where it matters most.
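This overlap can be sketched with asyncio: once OCR has produced the block text, translation and inpainting are awaited together and the pipeline only waits for whichever finishes last. The stage functions below (`translate_page`, `inpaint_page`) are illustrative stand-ins, not the actual MangaTrans internals.

```python
import asyncio

# Hypothetical stage stubs standing in for the real pipeline calls;
# the names and signatures are illustrative, not the MangaTrans API.
async def translate_page(blocks):
    await asyncio.sleep(0.01)          # stands in for a remote API call
    return [f"EN:{b}" for b in blocks]

async def inpaint_page(image, masks):
    await asyncio.sleep(0.01)          # stands in for GPU inpainting
    return f"cleaned({image})"

async def process_page(image, blocks, masks):
    # Translation and inpainting do not depend on each other once OCR
    # is done, so they run concurrently under asyncio.gather.
    translations, cleaned = await asyncio.gather(
        translate_page(blocks),
        inpaint_page(image, masks),
    )
    return translations, cleaned

translations, cleaned = asyncio.run(
    process_page("page.png", ["こんにちは"], ["mask"]))
```

With this shape, the latency of the middle phase is the maximum of the two branches rather than their sum, which is exactly the win described above.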
In a typical warm-start run on RTX 4090 hardware, text detection and export take about 0.4 seconds, OCR takes roughly 0.7 seconds, inpainting takes about 0.7 seconds, translation takes around 1.0 second for a fast remote provider, and rendering contributes another 0.7 seconds. The aggregate latency is therefore close to 3.4 seconds per page, though the exact number varies with image complexity, OCR load, and the behavior of the active translation provider.
The processing sequence can be summarized as follows:
1. Detect textlines and merge them into direction-labeled blocks.
2. Run OCR over the merged blocks, routing by reading direction.
3. Translate the recovered page text while, in parallel, refining masks and inpainting away the original Japanese text.
4. Render the translated text into the cleaned image.
Rather than treating OCR, translation, and rendering as unrelated scripts, MangaTrans keeps them tied together through a shared intermediate representation. The merged.json export stores block geometry, dominant reading direction, OCR text, and translation text, making it possible to inspect or replace each stage independently without rewriting the entire system.
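A minimal sketch of what one block entry in such an export might look like; the field names here are illustrative, and the real merged.json schema may differ:

```python
import json

# Illustrative shape of a single merged.json block entry: geometry,
# reading direction, an OCR slot, and a translation placeholder.
block = {
    "polygon": [[120, 40], [180, 40], [180, 260], [120, 260]],
    "direction": "vertical",    # dominant reading direction
    "ocr_text": "そうだね",      # filled by the OCR stage
    "translation": None,        # filled later by the translator
}

# Round-tripping through JSON keeps every stage inspectable on disk.
serialized = json.dumps(block, ensure_ascii=False)
restored = json.loads(serialized)
```

Because each stage only fills in its own slot, any stage's output can be dumped, inspected, or replaced without disturbing the others.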
Text detection begins with DBNet, configured for high-resolution manga pages and tuned to output quadrilateral textline regions rather than coarse rectangular boxes. This distinction matters because manga text often follows curved balloons, tilted captions, or narrow vertical columns that do not fit neatly into axis-aligned rectangles. The detector therefore serves primarily as a textline proposal stage rather than as a final semantic segmentation layer.
The more consequential logic happens immediately afterward. Nearby textlines are clustered using overlap, centroid proximity, and geometric alignment, then consolidated into block-level regions that better approximate how a reader interprets the page. Convex-hull and rotated-rectangle operations are used to stabilize the resulting geometry, and a lightweight voting procedure estimates whether the block should be treated as vertical or horizontal. This direction label is later reused by both OCR routing and text rendering, which means the quality of block construction has downstream impact on nearly every later stage.
The output of this step is not just a set of coordinates. It is the beginning of a page-level document model: each block has geometry, reading direction, original text slots, and translation placeholders. That representation allows the rest of the pipeline to stay synchronized even when execution becomes partially parallel.
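The direction vote can be illustrated with a deliberately simplified heuristic in which each textline votes based on its own aspect ratio; the real procedure uses richer geometric cues, so treat this as a sketch only:

```python
def vote_direction(lines):
    """Estimate a block's reading direction from per-line bounding
    boxes (width, height). Each line votes 'vertical' if it is taller
    than wide; a simple majority decides the block label. Simplified
    stand-in for the real voting procedure."""
    votes = sum(1 if h > w else -1 for (w, h) in lines)
    return "vertical" if votes > 0 else "horizontal"

# Three narrow columns outvote one wide caption line.
label = vote_direction([(20, 180), (22, 160), (18, 200), (120, 30)])
# → "vertical"
```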
OCR is handled through a hybrid strategy because manga pages rarely fit a single typographic pattern. Vertical Japanese dialogue dominates many works, but horizontal captions, effects, signs, and interface elements appear often enough that a one-model solution is unnecessarily brittle. MangaTrans therefore routes vertical blocks to MangaOCR and horizontal blocks to RapidOCR.
MangaOCR is especially valuable for vertically arranged Japanese dialogue, where its model architecture and training bias make it significantly more robust than general OCR engines. The implementation further improves throughput by batching vertical regions on the GPU rather than dispatching them one by one. RapidOCR, by contrast, is used as a pragmatic horizontal-text path and remains suitable for shorter annotations or sound effects that do not match MangaOCR's strengths. After recognition, whitespace is normalized so that downstream translation sees a cleaner and more stable text representation.
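A simplified sketch of this routing, with stand-in engine callables in place of the real MangaOCR and RapidOCR wrappers (the actual interfaces differ):

```python
def route_ocr(blocks, manga_ocr, rapid_ocr, batch_size=16):
    """Send vertical blocks to the manga_ocr callable in batches and
    horizontal blocks to rapid_ocr one at a time, then normalize
    whitespace. Engine callables are assumptions for illustration."""
    vertical = [b for b in blocks if b["direction"] == "vertical"]
    horizontal = [b for b in blocks if b["direction"] == "horizontal"]

    # Batch vertical crops to amortize per-call GPU overhead.
    for i in range(0, len(vertical), batch_size):
        chunk = vertical[i:i + batch_size]
        texts = manga_ocr([b["crop"] for b in chunk])
        for b, t in zip(chunk, texts):
            b["ocr_text"] = " ".join(t.split())   # normalize whitespace
    for b in horizontal:
        b["ocr_text"] = " ".join(rapid_ocr(b["crop"]).split())
    return blocks

# Stand-in engines; a real deployment would pass MangaOCR / RapidOCR
# wrappers here.
fake_manga = lambda crops: ["こん  にちは"] * len(crops)
fake_rapid = lambda crop: " STOP "
demo = route_ocr(
    [{"direction": "vertical", "crop": None},
     {"direction": "horizontal", "crop": None}],
    fake_manga, fake_rapid)
```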
Translation is deliberately page-centric rather than fragment-centric. Once OCR finishes, MangaTrans submits the recovered text blocks as a coordinated batch so that the translator can see neighboring lines and preserve discourse-level cues. This improves pronoun resolution, register consistency, and the handling of short reactive utterances that are ambiguous when translated in isolation.
The service supports multiple providers through the same /process interface. In addition to local and no-op modes, the current runtime can dispatch through DeepL, Grok, DeepSeek, ChatGPT-compatible endpoints, Gemini, Claude, and OpenRouter. The server also accepts provider-specific key, model, and endpoint overrides, which makes it possible to keep sensible defaults on the backend while still allowing frontends to request a different model ID when needed.
This indirection is important in practice because model churn is real. Providers rename models, deprecate aliases, and expose different base URLs over time. By treating model selection as configurable runtime data instead of hard-wired code, MangaTrans stays resilient without forcing every frontend to be rebuilt whenever a provider changes its preferred identifier.
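One way to express that resolution order in code: a per-request override wins, then an environment default, then a fallback table. The variable names and model IDs below are purely illustrative assumptions:

```python
import os

# Hypothetical fallback table; these IDs are examples, not the
# backend's real defaults.
DEFAULTS = {"deepseek": "deepseek-chat", "deepl": "default"}

def resolve_model(provider, request_override=None):
    """Request override > environment default > hard-coded fallback.
    The PROVIDER_MODEL env-var convention is an assumption."""
    env_key = f"{provider.upper()}_MODEL"
    return request_override or os.environ.get(env_key) or DEFAULTS.get(provider)

resolve_model("deepseek")                       # backend default
resolve_model("deepseek", "deepseek-reasoner")  # frontend override wins
```

Because the identifier is resolved at request time, a provider renaming its models only requires updating configuration, never client code.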
Image cleanup is performed in parallel with translation. Starting from the detected text regions, the system refines masks so that the inpainting branch removes text aggressively enough to clear the original dialogue while remaining conservative enough to preserve screen tones, panel borders, and local line art. The refinement path combines boundary-aware processing with controlled expansion and smoothing before the mask reaches the inpainting model.
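The controlled-expansion step can be illustrated with a numpy-only square dilation; the real path uses boundary-aware refinement and smoothing (typically via OpenCV morphology), so this is a toy stand-in:

```python
import numpy as np

def expand_mask(mask, radius=1):
    """Expand a binary text mask by `radius` pixels using a square
    dilation built from shifted copies. Note: np.roll wraps at the
    borders, which is fine for this demo but real code would pad."""
    out = mask.copy()
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
            out |= shifted
    return out

m = np.zeros((5, 5), dtype=bool)
m[2, 2] = True
expanded = expand_mask(m)   # the single pixel grows into a 3×3 block
```

The tension described above lives in `radius`: too small leaves text ghosts for the inpainter, too large erases screen tone that LamaLarge then has to hallucinate back.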
The inpainting backend is LamaLarge, chosen because it performs well on illustrated content and tends to preserve manga-specific textures more gracefully than generic photographic inpainting models. This branch produces a cleaned page image that is ready for translated text insertion. Because it runs concurrently with translation, it contributes relatively little to wall-clock latency once OCR is complete.
Rendering is where the system stops being a pure vision pipeline and becomes a page composition engine. The renderer reads the block geometry and direction metadata produced earlier and chooses between vertical and horizontal layout logic. Vertical text is set in right-to-left columns with punctuation rotation and spacing rules chosen to better match native manga conventions. Horizontal text uses a more conventional wrapping strategy while still respecting the available balloon shape and size.
Font sizing is determined dynamically rather than by a fixed template. The renderer searches for the largest size that fits the target region, then centers the result in a way that avoids obviously mechanical placement. To preserve readability against textured or noisy backgrounds, text is drawn with a stroke whose thickness scales with font size. The goal is not to imitate a professional letterer in every edge case, but to produce pages that are readable, visually stable, and coherent enough for real reading sessions.
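The size search amounts to finding the largest size whose measured extent fits the target region, which is naturally a binary search. The `measure` function below is a toy stand-in for a real PIL `textbbox` measurement:

```python
def largest_fitting_size(text, box_w, box_h, measure, lo=8, hi=72):
    """Binary-search the largest font size whose rendered text fits
    the box. `measure(text, size)` returns (width, height); here it
    is an assumed measurement callback, not the real renderer's."""
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        w, h = measure(text, mid)
        if w <= box_w and h <= box_h:
            best = mid
            lo = mid + 1        # fits → try larger
        else:
            hi = mid - 1        # overflows → try smaller
    return best

# Toy metric: each glyph is 0.6×size wide and the line is size tall.
measure = lambda t, s: (int(0.6 * s) * len(t), s)
best = largest_fitting_size("HELLO", 120, 40, measure)  # → 40
```

The same search works for either direction; only the measurement callback needs to know whether text flows in columns or rows.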
The backend is exposed through a FastAPI application whose primary entry point is POST /process. The endpoint accepts multipart form data and can process either uploaded files or image URLs, which allows browser clients, automation tools, and desktop applications to share the same service contract. The response returns the rendered image data together with block metadata so that clients can do more than simply swap pixels; they can also inspect OCR and translation results when needed.
The API surface has gradually evolved from a narrow demo endpoint into a broader compatibility layer. Core fields such as source language, target language, provider selection, and device preference remain central, but the service also accepts concurrency controls together with provider-specific key, model, and endpoint overrides. In practice, this means a frontend can submit either uploaded images or remote image URLs, choose a translation backend, and optionally request a different runtime model without needing a separate endpoint for each provider. The wire format is therefore somewhat more verbose than a minimal demo API, but the trade-off is worthwhile because every client can speak to the same backend in the same way.
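A hypothetical client-side helper that assembles the form fields described above; the field names are assumptions for illustration, not the confirmed wire format:

```python
def build_process_request(image_url, provider="deepl", target="en",
                          model=None, api_key=None):
    """Assemble form data for POST /process. Field names here are
    illustrative; consult the server for the actual contract."""
    data = {
        "image_url": image_url,
        "source_lang": "ja",
        "target_lang": target,
        "provider": provider,
    }
    # Provider-specific overrides are optional; omitting them keeps
    # the backend's environment defaults in effect.
    if model:
        data["model"] = model
    if api_key:
        data["api_key"] = api_key
    return data

payload = build_process_request("https://example.com/page.jpg",
                                provider="gemini", model="custom-id")
# e.g. requests.post("http://127.0.0.1:5000/process", data=payload)
```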
The current service is optimized for local or trusted-network deployment rather than for hardened multi-tenant exposure. CORS is permissive, credentials are typically passed through form fields for provider-specific requests, and overload protection relies mainly on internal semaphore-style concurrency controls instead of a full external gateway. That design is appropriate for a self-hosted reader or personal translation appliance, but it should not be mistaken for a production internet-facing API perimeter.
This distinction is worth stating plainly because the developer experience is good precisely because the system assumes a controlled environment. Browser extensions, userscripts, and the Electron client can all talk to the same local server without much ceremony. If the service is later exposed publicly, authentication, origin policy, and request accounting would need to be tightened correspondingly.
MangaTrans currently serves three distinct frontend modes: an Electron desktop reader, a browser extension, and a degraded userscript variant for environments where a full extension is not practical. All three ultimately converge on the same backend contract, but they differ in how they acquire image data and how much browser integration they can rely on.
The browser extension is the most capable web client because it can coordinate content scripts, background workers, storage, and request handling more freely. The userscript is intentionally more limited and must operate within the security constraints of whichever manager or host environment is running it. The Electron application sits at the opposite end of the spectrum, enjoying the most control over local files and UI state while still delegating actual translation work to the same backend service. This shared architecture keeps behavior aligned across platforms without forcing the UI codebases to become identical.
The default deployment model is a local FastAPI server running on port 5000, which is then reached by the desktop app, the extension, or a userscript over localhost, LAN, or an HTTPS tunnel. In a purely local setup, 127.0.0.1:5000 is the most straightforward choice. For cross-device use inside a home network, the service can bind to 0.0.0.0:5000 and be reached through a private LAN address, provided the firewall allows inbound traffic. For browser-based use on HTTPS sites, a public HTTPS tunnel is often the most frictionless option because it avoids mixed-content restrictions.
Configuration is environment-driven. The server keeps its baseline behavior in environment variables, including the default translation provider, concurrency settings, OCR batch size, provider API keys, and per-provider model identifiers. The frontends can still override selected values at request time, but they do not need to specify everything on every request. In effect, the backend defines a stable default operating profile, while the UI layers only send custom model IDs or credentials when the user wants to deviate from that default. This arrangement keeps deployment simple for local use while still preserving enough flexibility for rapid provider iteration.
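A sketch of that layering, with invented MT_* variable names standing in for whatever the server actually reads:

```python
import os

def load_config(env=os.environ):
    """Read the baseline operating profile from environment variables.
    Variable names and defaults are illustrative assumptions."""
    return {
        "provider": env.get("MT_PROVIDER", "deepl"),
        "ocr_batch_size": int(env.get("MT_OCR_BATCH", "16")),
        "max_concurrency": int(env.get("MT_MAX_CONCURRENCY", "2")),
    }

# A request-time override only needs to name the fields it changes.
cfg = load_config({"MT_PROVIDER": "gemini"})
```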
Performance in MangaTrans is shaped by a mixture of GPU work, network latency, and page structure. Detection, inpainting, and vertical OCR are largely compute-bound, while remote translation time varies with provider responsiveness and network conditions. This means the slowest stage is not always the same stage: on a powerful local GPU with a fast provider, rendering and OCR become more visible; on a slower network or a more variable API, translation dominates total latency.
The main consistent throughput win comes from batching. MangaOCR benefits noticeably when vertical regions are processed together, which reduces per-region overhead and improves GPU utilization. A representative measurement is shown below.
Batch Size    Time per Image    Relative Speedup
1             0.171 s           1.0x
8             0.074 s           2.3x
16            0.062 s           2.8x
32            0.058 s           2.9x
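The speedup column follows directly from the per-image times and is easy to verify:

```python
# Recompute the relative speedups in the table above from the
# measured per-image times (seconds), normalized to batch size 1.
times = {1: 0.171, 8: 0.074, 16: 0.062, 32: 0.058}
speedups = {bs: round(times[1] / t, 1) for bs, t in times.items()}
# → {1: 1.0, 8: 2.3, 16: 2.8, 32: 2.9}
```

The diminishing returns past batch size 16 suggest per-region overhead is mostly amortized by then, which is why ever-larger batches stop paying off.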
From an engineering perspective, the system's bottlenecks are heterogeneous. DBNet, LamaLarge, and batched MangaOCR lean heavily on the GPU. Remote translation remains network-bound and provider-dependent. Mask refinement and serialization are relatively smaller costs, but they still shape responsiveness when everything else is already optimized.
The next layer of optimization is therefore less about squeezing a single kernel and more about controlling pipeline shape. Quantization could reduce VRAM pressure, better overlap between pages could improve throughput in batch scenarios, and translation caching could suppress redundant external calls. At the same time, each optimization introduces its own maintenance cost. MangaTrans currently favors a design that is fast enough for interactive use while remaining inspectable and easy to modify.
MangaTrans demonstrates that a practical manga translation workflow does not require an end-to-end generative monolith. A staged pipeline, backed by explicit geometry, OCR routing, configurable translation providers, parallel inpainting, and direction-aware rendering, is sufficient to produce translated pages that are both readable and operationally useful. The system's architecture is also intentionally future-proof in a narrow, realistic sense: providers can change, models can be swapped, and frontend clients can differ, yet the shared service contract remains stable.
What makes the project most compelling is not any single model choice but the way the pieces are arranged. Detection creates a page representation that later stages can trust. OCR is split according to the typographic reality of manga rather than the convenience of a single engine. Translation is treated as a contextual page task, not just a bag of strings. Rendering respects reading direction instead of flattening everything into Western left-to-right text. Together, these decisions move the system closer to a reading tool than a mere demo pipeline.
There is still ample room to grow. Better multilingual typography, stronger balloon-aware relayout, cloud deployment patterns, and more robust public-facing service controls would all extend the system's reach. Even so, the current implementation already occupies a valuable middle ground: it is experimental enough to remain flexible, yet polished enough to support serious real-world use.
The implementation relies on a conventional Python service stack, with fastapi and uvicorn providing the API layer, opencv-python and Pillow handling image operations, and numpy and scipy supporting numerical routines. OCR is built around manga-ocr and rapidocr-onnxruntime, while translation requests are issued through OpenAI-compatible or provider-specific HTTP clients such as openai, httpx, and requests. Geometric processing uses libraries such as shapely and pyclipper, and general runtime utilities include packages like loguru and tqdm.
Several core model components are vendored directly into the repository, including DBNet, the inpainting stack, textline merging logic, and mask-refinement utilities. At the repository level, the main areas of interest are mangatrans/api for the FastAPI service, mangatrans/mangaocr for the core vision and rendering pipeline, and the frontend directories that expose the system through Electron, extension, and userscript interfaces.
Report Version: 1.1
Last Updated: February 22, 2026