Tool:Indicwiki Transliteration Tool

From Wikitech
Toolforge tools
Indicwiki Transliteration Tool
Website https://indicwiki-transliterate-api.toolforge.org
Description The Indicwiki Transliteration Tool consists of an API and a UserScript designed to facilitate transliteration between Indic languages, specifically focusing on Hindi and Urdu, with support for additional Indic scripts.
Keywords transliteration, hindi, urdu, indic, api, userscript, wikipedia
Author(s) Agamya Samuel
Maintainer(s) Agamya Samuel
Source code API: https://gitlab.wikimedia.org/toolforge-repos/indicwiki-transliterate-api
UserScript: https://meta.wikimedia.org/wiki/User:Agamyasamuel/Indicwiki-Transliterate-User-Script.js
License MIT License

Overview

The Indicwiki Transliteration Tool is a Toolforge-hosted service that provides transliteration capabilities for Indic languages, primarily between Hindi (Devanagari) and Urdu (Perso-Arabic) scripts, along with other related Indic scripts such as Gurmukhi, Shahmukhi, and Sindhi variants. It includes a REST API that acts as a proxy for transliteration requests and a UserScript for seamless integration into Wikipedia editing workflows.

The tool's purpose is to enhance language interoperability on Wikimedia platforms, allowing users to convert text between different scripts without leaving the wiki environment. This is particularly useful for contributors working on multilingual content, cross-wiki coordination, or content creation in related languages.

Key Features:

  • Proxy API for transliteration between specific Indic language pairs.
  • UserScript that adds an in-browser transliteration interface to Wikipedia pages.
  • Support for multiple transliteration directions, including auto-detection for certain scripts.
  • Designed for bots, tools, gadgets, and direct use in the browser through the UserScript.
  • Backed by open-source code for community contributions and custom deployments.

While the core focus is Hindi-Urdu, the API also supports Gurmukhi-Shahmukhi and Sindhi pairs, which covers more of the scripts used across South Asia. Two endpoints detect the input script automatically. The tool does not cover all Indic languages, and output may be imperfect for dialectal variation or for words whose transliteration is ambiguous.

Web Service and API

The web service is hosted on Toolforge and provides a RESTful API for transliteration. The API base URL is: https://indicwiki-transliterate-api.toolforge.org

Interactive API documentation and testing (likely Swagger UI) is available at: https://indicwiki-transliterate-api.toolforge.org/docs

The API is built to handle POST requests for transliteration, returning JSON responses suitable for scripts, bots, web frontends, and the accompanying UserScript. No authentication is required, but users should respect Toolforge usage guidelines to avoid excessive requests.

API Endpoints

All endpoints use the POST method and expect a JSON body with a "text" field containing the string to transliterate. Responses are expected to be JSON with a "transliterated_text" field. That field name is inferred from common practice rather than confirmed, so check the interactive documentation for the exact schema.

  • POST /transliterate/AutoDetectPersioArabicScript
    • Automatically detects and transliterates text in Perso-Arabic scripts.
    • Body: {"text": "string"}
    • Useful for mixed or unknown Perso-Arabic input.
  • POST /transliterate/AutoDetectSindhiHindiScript
    • Automatically detects and transliterates text in Sindhi or Hindi-related scripts.
    • Body: {"text": "string"}
    • Handles auto-detection for Sindhi Devanagari or Hindi variants.
  • POST /transliterate/GurmukhiToShahmukhi
    • Transliterates from Gurmukhi (Punjabi) to Shahmukhi (Punjabi in Perso-Arabic).
    • Body: {"text": "string"}
  • POST /transliterate/HindiToUrdu
    • Transliterates from Hindi (Devanagari) to Urdu (Perso-Arabic).
    • Body: {"text": "string"}
    • Core endpoint for Hindi-Urdu conversion.
  • POST /transliterate/ShahmukhiToGurmukhi
    • Transliterates from Shahmukhi to Gurmukhi.
    • Body: {"text": "string"}
  • POST /transliterate/SindhiDEVToRoman
    • Transliterates from Sindhi Devanagari to Roman (Latin) script.
    • Body: {"text": "string"}
  • POST /transliterate/SindhiDEVToSindhiUR
    • Transliterates from Sindhi in Devanagari to Sindhi in Perso-Arabic script (labelled "SindhiUR" by the API).
    • Body: {"text": "string"}
  • POST /transliterate/SindhiURToSindhiDEV
    • Transliterates from Sindhi in Perso-Arabic script to Sindhi in Devanagari.
    • Body: {"text": "string"}
  • POST /transliterate/UrduToHindi
    • Transliterates from Urdu (Perso-Arabic) to Hindi (Devanagari).
    • Body: {"text": "string"}

Refer to the interactive API documentation for full parameters, potential query options, and live testing.
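Because every endpoint listed above shares the same shape (POST, JSON body with a "text" field), one small wrapper can serve all of them. The following is a minimal sketch, not part of the tool itself: the names buildRequest and transliterate are hypothetical, and the response is returned as parsed JSON without assuming its schema.

```javascript
// Base URL and endpoint names are copied from the endpoint list on this page.
const BASE_URL = 'https://indicwiki-transliterate-api.toolforge.org';

const ENDPOINTS = new Set([
  'AutoDetectPersioArabicScript',
  'AutoDetectSindhiHindiScript',
  'GurmukhiToShahmukhi',
  'HindiToUrdu',
  'ShahmukhiToGurmukhi',
  'SindhiDEVToRoman',
  'SindhiDEVToSindhiUR',
  'SindhiURToSindhiDEV',
  'UrduToHindi',
]);

// Hypothetical helper: builds the URL and fetch options for one request.
function buildRequest(endpoint, text) {
  if (!ENDPOINTS.has(endpoint)) {
    throw new RangeError(`Unknown endpoint: ${endpoint}`);
  }
  if (typeof text !== 'string') {
    throw new TypeError('text must be a string');
  }
  return {
    url: `${BASE_URL}/transliterate/${endpoint}`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text }),
    },
  };
}

// Hypothetical helper: sends the request and returns the parsed JSON body.
// fetchImpl is injectable so the function can be exercised without a network.
async function transliterate(endpoint, text, fetchImpl = globalThis.fetch) {
  const { url, options } = buildRequest(endpoint, text);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} from ${url}`);
  }
  return response.json();
}
```

Validating the endpoint name on the client avoids a round trip that would most likely end in a 404.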

Example JSON Response

For a request to /transliterate/HindiToUrdu with {"text": "नमस्ते दुनिया"}:

{
  "transliterated_text": "نمستے دنیا"
}

Field meanings:

  • transliterated_text – The converted text in the target script.

Note: Actual response schema may vary; this is based on typical transliteration APIs. Check the docs for precise format. Edge cases like non-transliterable characters (e.g., emojis, numbers) might be preserved or handled specially.
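Since the response schema is not confirmed, client code can check the payload before using it. The sketch below assumes the "transliterated_text" field shown above; extractTransliteration is a hypothetical helper name, and the field name can be overridden if the live schema turns out to differ.

```javascript
// Hypothetical helper: pulls the result string out of a parsed response body.
// The default field name "transliterated_text" is an assumption taken from the
// example above; pass a different name if /docs shows another schema.
function extractTransliteration(payload, field = 'transliterated_text') {
  if (payload === null || typeof payload !== 'object') {
    throw new TypeError('Response body is not a JSON object');
  }
  const value = payload[field];
  if (typeof value !== 'string') {
    const keys = Object.keys(payload).join(', ') || '(none)';
    throw new Error(`No string field "${field}" in response; keys present: ${keys}`);
  }
  return value;
}

// Usage with the example response shown above.
const example = { transliterated_text: 'نمستے دنیا' };
console.log(extractTransliteration(example));
```

Listing the keys that were actually present makes a schema mismatch easy to spot in logs.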

Quick Start Examples

Using curl

1. Transliterate Hindi to Urdu:

curl -X POST "https://indicwiki-transliterate-api.toolforge.org/transliterate/HindiToUrdu" \
-H "Content-Type: application/json" \
-d '{"text": "नमस्ते"}'

Expected response (illustrative; the exact schema and output depend on the backend): {"transliterated_text": "نمستے"}

2. Transliterate Urdu to Hindi:

curl -X POST "https://indicwiki-transliterate-api.toolforge.org/transliterate/UrduToHindi" \
-H "Content-Type: application/json" \
-d '{"text": "اسلام علیکم"}'

Expected response (illustrative; the exact schema and output depend on the backend): {"transliterated_text": "इस्लाम अलैकुम"}

3. Auto-detect Perso-Arabic:

curl -X POST "https://indicwiki-transliterate-api.toolforge.org/transliterate/AutoDetectPersioArabicScript" \
-H "Content-Type: application/json" \
-d '{"text": "نمستے"}'

Using JavaScript (fetch)

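// Transliterates Hindi (Devanagari) text to Urdu (Perso-Arabic) via the API.
// The "transliterated_text" field name is assumed; confirm it against /docs.
async function transliterateHindiToUrdu(text) {
  const response = await fetch('https://indicwiki-transliterate-api.toolforge.org/transliterate/HindiToUrdu', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  });

  // Surface HTTP errors to the caller instead of silently returning undefined.
  if (!response.ok) {
    throw new Error(`Transliteration failed with HTTP ${response.status}`);
  }

  const data = await response.json();
  return data.transliterated_text;
}

transliterateHindiToUrdu('नमस्ते दुनिया')
  .then((urdu) => console.log(`Transliterated: ${urdu}`))
  .catch((error) => console.error(error.message));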

Error Handling

Standard HTTP status codes are used:

  • 200 OK – Successful transliteration.
  • 400 Bad Request – Invalid input (e.g., missing "text" field, unsupported script).
  • 404 Not Found – Endpoint not available.
  • 500 Internal Server Error – Backend issue, such as transliteration service failure.

Error responses may include JSON with a "detail" or "error" field explaining the issue, e.g., {"detail": "Invalid script detection"}. Validate input before sending it: an empty string might return 400, and very long texts may exceed a size or time limit. No limits are documented, so determine them empirically.
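The status codes and error fields described above can be folded into a single readable message for logs or bug reports. This is a sketch under the assumptions stated in this section: the "detail" and "error" field names are not confirmed, and describeError is a hypothetical helper.

```javascript
// Status codes listed in this section; anything else is reported generically.
const STATUS_NAMES = {
  400: 'Bad Request',
  404: 'Not Found',
  500: 'Internal Server Error',
};

// Hypothetical helper: turns an HTTP status plus an optional parsed error body
// into one message. The "detail" and "error" field names are assumptions.
function describeError(status, body) {
  const name = STATUS_NAMES[status] ?? 'Unexpected status';
  let reason;
  if (body !== null && typeof body === 'object') {
    const raw = body.detail ?? body.error;
    if (raw !== undefined) {
      // Some frameworks return structured details; keep them as JSON text.
      reason = typeof raw === 'string' ? raw : JSON.stringify(raw);
    }
  }
  return reason ? `${status} ${name}: ${reason}` : `${status} ${name}`;
}

console.log(describeError(400, { detail: 'Invalid script detection' }));
// prints "400 Bad Request: Invalid script detection"
```

In a fetch-based client this would typically be called in the `!response.ok` branch, after attempting `response.json()` inside a try/catch in case the error body is not JSON.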

Usage Notes and Best Practices

  • Cache responses for repeated transliterations to reduce API load.
  • For high-traffic applications, also rate-limit requests on the client side.
  • The tool proxies external transliteration services (possibly Google or AI4Bharat; the backend is not confirmed here). Accuracy depends on that backend, so report linguistic issues upstream where possible.
  • When using the UserScript:
    • Install on supported wikis (Hindi, Urdu, etc.).
    • It modifies page content in-place; use cautiously on live articles.
    • Supports dropdown selection for different transliteration modes.
  • Respect Wikimedia's terms: Do not use for bulk scraping or non-Wikimedia purposes without permission.
  • For large texts, split into chunks to avoid timeouts or limits.
  • Test with diverse inputs, including loanwords, proper names, and punctuation, as transliteration rules vary by language.
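Two of the practices above, caching and chunking, can be sketched in a few lines. Both helpers are hypothetical and independent of the tool: withCache wraps any function that takes an endpoint name and a text and returns a promise, and splitIntoChunks uses an arbitrary default limit of 500 UTF-16 code units because the API's real limit is not documented.

```javascript
// Hypothetical helper: memoizes an async transliteration function in memory.
// Failed requests are evicted so that a later call can retry.
function withCache(transliterateFn) {
  const cache = new Map();
  return (endpoint, text) => {
    const key = `${endpoint}\u0000${text}`;
    if (!cache.has(key)) {
      const pending = Promise.resolve(transliterateFn(endpoint, text));
      pending.catch(() => cache.delete(key));
      cache.set(key, pending);
    }
    return cache.get(key);
  };
}

// Hypothetical helper: splits text into chunks of at most maxLength UTF-16
// code units, cutting after whitespace where possible. Joining the chunks
// reproduces the input exactly. A hard cut (no whitespace found) can separate
// a base letter from its combining marks, so keep maxLength generous.
function splitIntoChunks(text, maxLength = 500) {
  if (!Number.isInteger(maxLength) || maxLength < 1) {
    throw new RangeError('maxLength must be a positive integer');
  }
  const chunks = [];
  let rest = text;
  while (rest.length > maxLength) {
    let cut = -1;
    for (let i = maxLength; i > 0; i--) {
      if (/\s/.test(rest[i - 1])) {
        cut = i;
        break;
      }
    }
    if (cut === -1) cut = maxLength; // no whitespace: fall back to a hard cut
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut);
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

console.log(splitIntoChunks('aa bb cc', 3)); // prints [ 'aa ', 'bb ', 'cc' ]
```

Whether the API preserves the trailing whitespace of each chunk is not documented, so check the joined output before relying on this for long passages.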

Development and Source Code

The API is likely implemented in Python (common for Toolforge), acting as a proxy to underlying transliteration libraries (e.g., AI4Bharat or similar). The UserScript is JavaScript-based, integrating with MediaWiki's API.
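User scripts hosted on Meta-Wiki are commonly loaded from a personal common.js page with mw.loader.load and a raw-action URL. The sketch below follows that general MediaWiki convention using the UserScript title listed in the infobox; it is not taken from the tool's own instructions, and rawScriptUrl is a hypothetical helper.

```javascript
// Page title of the UserScript, as listed under "Source code" on this page.
const SCRIPT_TITLE = 'User:Agamyasamuel/Indicwiki-Transliterate-User-Script.js';

// Hypothetical helper: builds the raw JavaScript URL for a Meta-Wiki page.
function rawScriptUrl(title) {
  const params = new URLSearchParams({
    title,
    action: 'raw',
    ctype: 'text/javascript',
  });
  return `https://meta.wikimedia.org/w/index.php?${params}`;
}

// On a wiki page the global "mw" object exists; elsewhere this is a no-op.
if (typeof mw !== 'undefined' && mw.loader) {
  mw.loader.load(rawScriptUrl(SCRIPT_TITLE));
}
```

Review the script source before loading it, since user scripts run with the permissions of the logged-in account.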

The GitLab repository may include setup instructions. Contributions are welcome as merge requests.

Reporting Bugs and Feature Requests

  • GitLab Issues: https://gitlab.wikimedia.org/toolforge-repos/indicwiki-transliterate-api/issues (open a new issue if no existing one matches)
  • Phabricator: Search for related tasks or create one tagged with Toolforge.
  • When reporting:
    • Include the endpoint, request body, and response.
    • Steps to reproduce, including input text.
    • Browser/OS details for UserScript issues.
    • Specify if the bug is in accuracy, performance, or functionality.

Accuracy bugs may originate in the backend transliteration libraries rather than in this tool, so include examples of expected versus actual output to help identify where the problem lies.

See Also