Frequently Asked Questions List for GMA


last updated: May 17, 2004


Table of Contents:

I. Administrative

  1. How can I make sure that the GMA package I received is genuine?
  2. How do I sign up for the GMA email lists?
  3. I found a bug! How do I report it?
  4. What is a release candidate and which version of GMA should I choose?

II. Technical

  1. What is "mapping bitext correspondence" and how does it differ from inducing translation models?
  2. On what platforms does GMA run?
  3. How efficient is GMA?
  4. What language pairs can GMA be used for?
  5. What language-specific resources are required/desirable for use with GMA? What language-specific resources are included with this package?
  6. When should I re-optimize the GMA parameters?
  7. Where can I learn more about how SIMR and GSA work?

I. Administrative



  1. How can I make sure that the GMA package I received is genuine?
    Verify the md5sum of the package against the md5sum listed for that package on the main download page.
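
    If the md5sum command-line tool is not available, the same check can be sketched in Java with the standard java.security.MessageDigest class. This is only an illustration and not part of GMA; the class name Md5Check and its methods are hypothetical, and the expected checksum must be copied from the download page.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not shipped with GMA.
class Md5Check {
    /** Returns the MD5 digest of the given bytes as lowercase hex. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    /** Compares a file's MD5 with the checksum published for it. */
    static boolean matches(String path, String expectedHex) throws IOException {
        // Reads the whole file into memory, which is fine for a small package.
        byte[] data = Files.readAllBytes(Paths.get(path));
        return md5Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Md5Check <package-file> <expected-md5>");
            return;
        }
        System.out.println(matches(args[0], args[1]) ? "OK" : "MISMATCH");
    }
}
```

    Run it with the downloaded file and the published checksum as arguments. A MISMATCH result means the file differs from the one listed on the download page and should not be used.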

  2. How do I sign up for the GMA email lists?
    You can use the web-based interface to (un)subscribe to the moderated GMA-announce list and the unmoderated GMA list.

  3. I found a bug! How do I report it?
    You can report bugs on our Bugzilla server at http://nlp.cs.nyu.edu/bugzilla. When it asks you which component of the Proteus Project to use, pick 'GMA'.

  4. What is a release candidate and which version of GMA should I choose?
    A release candidate (RC) is a pre-release of a new version of the software. Its purpose is to facilitate further testing, and to give prospective users an idea of what to expect in the final release. Unless you are a tester or developer, you should download only final versions.

II. Technical

  1. What is "mapping bitext correspondence" and how does it differ from inducing translation models?
    A bitext map is a partial (ideally quite dense) relation between the tokens and token boundaries of a text and those of its translation. Translation models are relations between types, not tokens. E.g., GMA can tell you that the 3rd word in text X arose as a translation of the 4th word in X's translation Y, but it cannot tell you whether that pair of words would be a good entry in a bilingual dictionary. Methods exist for converting between bitext maps and translation models, but the reliable ones are not trivial.
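
    The token/type distinction can be made concrete with a small Java sketch. It is illustrative only: the Point class, the toy sentences, and the typePairCounts method are assumptions made for this example and do not reflect GMA's data structures or output format. The naive counting shown here is also one of the unreliable conversions, not a recommended method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not GMA code.
class BitextMapSketch {
    /** One point of correspondence between 1-based token positions in X and Y. */
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    /** Naively collapses token-level points into counts of word-type pairs. */
    static Map<String, Integer> typePairCounts(String[] textX, String[] textY, List<Point> map) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Point p : map) {
            String key = textX[p.x - 1] + "|" + textY[p.y - 1];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] x = {"le", "chat", "noir", "voit", "le", "chien"};
        String[] y = {"the", "black", "cat", "sees", "the", "dog"};

        // The bitext map relates positions: token 2 of X corresponds to token 3 of Y, etc.
        List<Point> map = new ArrayList<>();
        map.add(new Point(1, 1));
        map.add(new Point(2, 3));
        map.add(new Point(3, 2));
        map.add(new Point(4, 4));
        map.add(new Point(5, 5));
        map.add(new Point(6, 6));

        // A type-level view forgets positions and keeps only word pairs and counts.
        System.out.println(typePairCounts(x, y, map));
    }
}
```

    The list of points is the kind of information a bitext map carries. The counts are a first step toward a type-level model, but a raw count alone does not say whether a pair is a good dictionary entry.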

  2. On what platforms does GMA run?
    Starting from version 2.0, GMA has been thoroughly tested on Linux/i386 and Solaris/SPARC. Since it is all in Java, it should in theory run the same way on other platforms. We know of users who have successfully run it under Windows, but we have not done thorough testing ourselves.

  3. How efficient is GMA?
    The underlying algorithms are linear in the size of the input. However, GMA 2.0 is the first release of a complete rewrite (in Java), and we haven't got around to doing any serious optimization yet. Therefore the current implementation is still very slow and memory intensive.

  4. What language pairs can GMA be used for?
    We're not aware of any written languages that GMA cannot be used for. So far, GMA has been applied to:
    • French/English
    • Spanish/English
    • Korean/English
    • Chinese/English
    • Arabic/English
    • Czech/English
    • Malay/English
    • Russian/English
    The next version will include a module for retargeting GMA to new language pairs.

  5. What language-specific resources are required/desirable for use with GMA? What language-specific resources are included with this package?
    GMA is based on an implementation of the Smooth Injective Map Recognizer (SIMR) algorithm. SIMR works best when supplied with language-specific information such as seed translation lexicons and lists of stop words. The only such resources included with this distribution are stop word lists for English, French, and Malay (all encoded in ISO-8859-1) and an English-Malay translation lexicon (tralex), since these are used in the program's test suite. Even without seed lexicons, the software can be useful for language pairs that share many cognates, but performance will suffer without stop word lists. If you want to work with a language that does not use the Roman alphabet, then you definitely need a seed translation lexicon (see the HOWTO section on matching predicates). If you have resources of this type that you would like to share, we'd be happy to include them on our resources page and give you credit.
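
    To show why cognates and stop word lists matter, here is a minimal Java sketch of one possible cognate-based matching predicate, using the longest common subsequence ratio (LCSR) of two words. It is an assumption-laden illustration, not GMA's implementation: the class name, the method names, and the 0.7 threshold are hypothetical, and the HOWTO remains the reference for GMA's actual matching predicates.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration, not GMA code.
class CognateMatchSketch {
    /** Length of the longest common subsequence of a and b (standard dynamic program). */
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length()][b.length()];
    }

    /** LCS length divided by the length of the longer word; 0.0 if both are empty. */
    static double lcsr(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 0.0 : (double) lcsLength(a, b) / longer;
    }

    /** Two tokens match if neither is a stop word and their LCSR reaches the threshold. */
    static boolean matches(String a, String b, Set<String> stopX, Set<String> stopY,
                           double threshold) {
        if (stopX.contains(a) || stopY.contains(b)) {
            return false; // frequent function words tend to produce spurious matches
        }
        return lcsr(a, b) >= threshold;
    }

    public static void main(String[] args) {
        Set<String> stopEn = new HashSet<>(Arrays.asList("the", "of", "a"));
        Set<String> stopFr = new HashSet<>(Arrays.asList("le", "la", "de"));
        double threshold = 0.7; // assumed value; would need tuning per language pair
        System.out.println(matches("government", "gouvernement", stopEn, stopFr, threshold));
        System.out.println(matches("the", "le", stopEn, stopFr, threshold));
    }
}
```

    A predicate of this kind only works when the two languages share an alphabet and many cognates, which is why a seed translation lexicon becomes necessary for other language pairs.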

  6. When should I re-optimize the GMA parameters?
    SIMR has several numerical parameters that should be re-optimized every time you change a resource, the tokenization of the input, the matching predicate, etc. If you just use the default parameters, as many people have done with Gale & Church's algorithm, the accuracy of the output may suffer greatly. To learn how to re-optimize the parameters, read the tech report on "Porting..." mentioned below, and the HOWTO-train file.

  7. Where can I learn more about how SIMR and GSA work?
    To better understand what this software does, we suggest you read one or more of the following publications on this subject. Or just get the book: