A diachronic study determining syntactic and semantic features of Urdu-English neural machine translation

Tamkeen Zehra Shah; Muhammad Imran; Sayed M. Ismail

doi:10.1016/j.heliyon.2023.e22883

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

Machine translation yields only marginal accuracy for low-resource languages, but its deep learning models are expected to improve with time. This longitudinal study investigates how Google Translate's Urdu-to-English output evolved between 2018 and 2021. The accuracy and acceptability of the translations were determined by (a) an interlinear gloss that identifies the core semantic units and grammatical functions to be translated and (b) a descriptive comparison of the translated text's syntactic and semantic properties with those of the source text. Overall, despite a 50% error rate that persists over the three-year interval, the research reports significant improvement in the overall intelligibility of the translations, in contrast to initial results from 2018, which exhibited rampant non-localized errors. Working backwards from instances of errors to the morphosyntactic and semantic patterns underlying them, the study concludes that the pro-drop feature of Urdu, Urdu's case-marking system, identification of clause boundaries, polysemous terms, and orthographically similar words pose the greatest difficulty in neural machine translation. These results point to the need for incorporating syntactic information in training data.

Original language	English
Article number	e22883
Journal	Heliyon
Volume	10
Issue number	1
DOIs	https://doi.org/10.1016/j.heliyon.2023.e22883
State	Published - 15 Jan 2024
Externally published	Yes

Keywords

Comparative syntax
Google translate
Interlinear gloss
Low-resource language
Neural machine translation
Urdu

Cite this

@article{dc9f07fc140c484aa9cb6223b21196b1,

title = "A diachronic study determining syntactic and semantic features of Urdu-English neural machine translation",

abstract = "Machine translation produces marginal accuracy rates for low-resource languages, but its deep learning model expects to yield improved accuracy with time. This longitudinal study investigates how Google Translate's Urdu-to-English translated output has evolved between 2018 and 2021. Accuracy and acceptability of the translations have been determined by, a) an interlinear gloss that identifies core semantic units and grammatical functions to be translated and, b) a descriptive comparison of the translated text's syntactic and semantic properties with those of the source text. Overall, despite a 50 \% error rate that persists over the three-year interval, the research reports significant improvement in the overall intelligibility of the translations, in contrast to initial results from 2018, which exhibited rampant non-localized errors. Working backwards from instances of errors to morphosyntactic and semantic patterns underlying them, the study concludes that the pro-drop feature of Urdu, Urdu's case-marking system, identification of clause boundaries, polysemous terms, and orthographically similar words pose the greatest difficulty in neural machine translation. These results point to the need for incorporating syntactic information in training data.",

keywords = "Comparative syntax, Google translate, Interlinear gloss, Low-resource language, Neural machine translation, Urdu",

author = "Shah, {Tamkeen Zehra} and Muhammad Imran and Ismail, {Sayed M.}",

note = "Publisher Copyright: {\textcopyright} 2023 The Authors",

year = "2024",

month = jan,

day = "15",

doi = "10.1016/j.heliyon.2023.e22883",

language = "English",

volume = "10",

journal = "Heliyon",

issn = "2405-8440",

publisher = "Elsevier Ltd",

number = "1",

}

TY - JOUR

T1 - A diachronic study determining syntactic and semantic features of Urdu-English neural machine translation

AU - Shah, Tamkeen Zehra

AU - Imran, Muhammad

AU - Ismail, Sayed M.

PY - 2024/1/15

Y1 - 2024/1/15

N2 - Machine translation produces marginal accuracy rates for low-resource languages, but its deep learning model expects to yield improved accuracy with time. This longitudinal study investigates how Google Translate's Urdu-to-English translated output has evolved between 2018 and 2021. Accuracy and acceptability of the translations have been determined by, a) an interlinear gloss that identifies core semantic units and grammatical functions to be translated and, b) a descriptive comparison of the translated text's syntactic and semantic properties with those of the source text. Overall, despite a 50 % error rate that persists over the three-year interval, the research reports significant improvement in the overall intelligibility of the translations, in contrast to initial results from 2018, which exhibited rampant non-localized errors. Working backwards from instances of errors to morphosyntactic and semantic patterns underlying them, the study concludes that the pro-drop feature of Urdu, Urdu's case-marking system, identification of clause boundaries, polysemous terms, and orthographically similar words pose the greatest difficulty in neural machine translation. These results point to the need for incorporating syntactic information in training data.

AB - Machine translation produces marginal accuracy rates for low-resource languages, but its deep learning model expects to yield improved accuracy with time. This longitudinal study investigates how Google Translate's Urdu-to-English translated output has evolved between 2018 and 2021. Accuracy and acceptability of the translations have been determined by, a) an interlinear gloss that identifies core semantic units and grammatical functions to be translated and, b) a descriptive comparison of the translated text's syntactic and semantic properties with those of the source text. Overall, despite a 50 % error rate that persists over the three-year interval, the research reports significant improvement in the overall intelligibility of the translations, in contrast to initial results from 2018, which exhibited rampant non-localized errors. Working backwards from instances of errors to morphosyntactic and semantic patterns underlying them, the study concludes that the pro-drop feature of Urdu, Urdu's case-marking system, identification of clause boundaries, polysemous terms, and orthographically similar words pose the greatest difficulty in neural machine translation. These results point to the need for incorporating syntactic information in training data.

KW - Comparative syntax

KW - Google translate

KW - Interlinear gloss

KW - Low-resource language

KW - Neural machine translation

KW - Urdu

UR - https://www.scopus.com/pages/publications/85179111778

U2 - 10.1016/j.heliyon.2023.e22883

DO - 10.1016/j.heliyon.2023.e22883

M3 - Article

AN - SCOPUS:85179111778

SN - 2405-8440

VL - 10

JO - Heliyon

JF - Heliyon

IS - 1

M1 - e22883

ER -
