
Project: apertium
Link: http://apertium.codepad.org/F9AlEyZS

 Apertium
 ========

  Idea by GenX
  
  Corpus-based lexicalised feature transfer
  -----------------------------------------
  Problem Statement (as on the wiki): Sometimes we produce translations that no native speaker would ever say. One such case is outputting something as definite when it is never used as definite in the target language. One way of dealing with this is a lot of rules and lists in transfer, but those are hard to write and maintain. So, how about looking at a corpus for information about features like definiteness, aspect, evidentiality, impersonal/reflexive pronoun use in Romance languages, etc.
  
  DEFINITENESS
  ------------
    
  I propose a module, the "Definiteness Adapter", which will sit just before morphological generation in the Apertium pipeline.
    
    
 1) What does the module do?
    ========================
    - Remove an explicit definiteness marker if it is not expected by the language.
    - Introduce an explicit definiteness marker if it is expected by the language (see the sketch below).
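    
    A minimal sketch of this behaviour on the Apertium stream, assuming
    lexical units in the usual ^lemma<tags>$ format, English as the target
    language, and '<def>' as the definiteness tag (these are assumptions
    for illustration); predict_definite() is a hypothetical stand-in for
    the trained model:
    
        import re
        
        # Hypothetical stand-in for the trained model: True when the token
        # should be preceded by an explicit definiteness marker.
        def predict_definite(lemma, tags):
            return False  # placeholder
        
        LU = re.compile(r'\^([^<$^]+)((?:<[^>]+>)*)\$')  # ^lemma<tags>$ units
        
        def adapt(stream):
            """Drop explicit definite articles, then re-insert one wherever
            the model predicts it (tag names here are assumptions)."""
            out = []
            for lemma, tags in LU.findall(stream):
                if lemma == 'the' and '<def>' in tags:
                    continue                            # remove the marker
                if predict_definite(lemma, tags):
                    out.append('^the<det><def><sp>$')   # insert a marker
                out.append('^' + lemma + tags + '$')
            return ' '.join(out)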
    
 2) Approach
    ========
    I plan to use a hybrid module which primarily relies on a machine learning approach to decide whether to remove or insert explicit 'definite' markers.
    
    On top of that (if required) sits a rule-based layer, which can include PROHIBITION rules among others.
   
 3) Architecture
    ============
    The task requires two modules.
    1) Module 1 uses the trained model to make predictions and sits just before the morphological generator in the Apertium pipeline. It contains a rule-based sub-module that lets the user add rules, based on error analysis, which are applied to the machine-predicted output.
    2) Module 2 is a stand-alone module which builds the model from a raw language corpus.
    
    
                          Trained Lang. Model
                                  |
                          ________V________
    sequence+feature --->|    Module 1     |--->sequence'+feature
                         |_________________| 
    
    
    
                    _________________                                _________________
    Raw Corpus --->|    Morph        |---> Morph output stream  --->|    Module 2     |--->Trained Lang. Model
                   |_________________|                              |_________________|
    
    
 4) Learn from Corpus
    =================
    The idea is to learn a model which can predict the presence or absence of an explicit 'definite' marker. This clearly demands a large training corpus.
    
    Since we only need to predict the definiteness markers, building a training corpus is comparatively easy:
        1. Every language has a finite number of definiteness markers.
        2. Raw text is easy to get.
   
   
 5) Corpus Preparation
    ==================
    Procedure for preparing training corpus.
    1. Choose a representative raw corpus (Wikipedia is a good option) for the language.
    2. Run the Apertium morphological analyzer for the respective language on this corpus.
    3. Select and populate the training features and the prediction label, and arrange them in a format that can be fed to a classifier.
    
    E.g. for English, where the definiteness marker is "the", the file would look something like this:
    
    Sentence: At the beginning of the year 2009
    Class Label: 1 if a definiteness marker occurs before the token, 0 otherwise
    
    LEMMA        FEATURE1(say POS)        FEATURE2(...)      CLASS
    -----        -----------------        -------------      -----
    at              pr                       X                 0  
     beginning       n                        Y                 1    
    of              pr                       Z                 0    
    year            n                        P                 1    
    2009            num                      N                 0    
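    
    A sketch of step 3, assuming the analyzer output has already been
    disambiguated into (lemma, POS) pairs; the marker list and feature
    names are illustrative, and case_class() is sketched later under
    "Choice of Features":
    
        def make_training_rows(tokens, markers=('the',)):
            """tokens: list of (lemma, pos) pairs for one sentence.
            Markers are removed from the sequence; each remaining token
            gets label '1' if a marker directly preceded it, else '0'."""
            rows = []
            preceded = False
            for lemma, pos in tokens:
                if lemma.lower() in markers:
                    preceded = True
                    continue
                feats = {'lemma': lemma.lower(), 'pos': pos}  # plus CASE etc.
                rows.append((feats, '1' if preceded else '0'))
                preceded = False
            return rows
        
        # The example sentence above:
        sent = [('At', 'pr'), ('the', 'det'), ('beginning', 'n'), ('of', 'pr'),
                ('the', 'det'), ('year', 'n'), ('2009', 'num')]
        rows = make_training_rows(sent)  # labels 0, 1, 0, 1, 0 as in the table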
    
    
 6) Choice of Classifier
    ====================
    I plan to use a CRF as the classifier to predict the presence of the 'definite' marker. CRFs have an established reputation for sequence labeling tasks. If accuracy suffers, other available options are SVMs, Bayes classifiers, etc.
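    
    One possible concrete setup, using the sklearn-crfsuite wrapper around
    CRFsuite (my choice for illustration, not fixed); the hyperparameters
    are placeholders, and labels are strings as CRFsuite expects:
    
        import sklearn_crfsuite
        
        # Feature dicts and labels as built in the Corpus Preparation step;
        # one inner list per sentence.
        X_train = [[{'lemma': 'at', 'pos': 'pr'}, {'lemma': 'beginning', 'pos': 'n'},
                    {'lemma': 'of', 'pos': 'pr'}, {'lemma': 'year', 'pos': 'n'},
                    {'lemma': '2009', 'pos': 'num'}]]
        y_train = [['0', '1', '0', '1', '0']]
        
        crf = sklearn_crfsuite.CRF(
            algorithm='lbfgs',      # L-BFGS training
            c1=0.1, c2=0.1,         # placeholder L1/L2 regularization weights
            max_iterations=100,
        )
        crf.fit(X_train, y_train)
        print(crf.predict(X_train))  # one '0'/'1' sequence per sentence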
    
 7) Choice of Features
    ==================
    1) LEMMA - the lemma in lower case.
               Lower-casing reduces the vocabulary size and hence sparsity; case information is kept as a separate feature instead.
    2) POS   - the POS tag of the token; use UNKNOWN if the POS tag is unknown.
    3) CASE  - alphabetic case of the lemma 
               a. Initial Capital(IC)
                      eg. England, Stanford
               b. All Upper(AU)
                     eg. MIT, CPU
               c. All Lower(AL)
                      eg. core, pen
               d. Number_Digits(Ni)
                     eg. 2009(N4), 100(N3)
               e. Alpha Numeric (AN)
                     eg. F16, b12, 99ace
               f. Others (O)
                     eg. . , - @ + '
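    
    A small helper implementing this classification (a sketch; boundary
    cases such as mixed-case words fall through to Others):
    
        def case_class(token):
            """Map a token to one of the CASE classes above."""
            if token.isdigit():
                return 'N%d' % len(token)          # Number_Digits: 2009 -> N4
            if token.isalpha():
                if token.isupper():
                    return 'AU'                    # All Upper: MIT, CPU
                if token[0].isupper() and token[1:].islower():
                    return 'IC'                    # Initial Capital: England
                if token.islower():
                    return 'AL'                    # All Lower: core, pen
            if any(c.isalpha() for c in token) and any(c.isdigit() for c in token):
                return 'AN'                        # Alpha Numeric: F16, 99ace
            return 'O'                             # Others: punctuation etc.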
    
 8) Learning Template
    =================
    A sensible baseline is to start with a state-of-the-art learning template for chunking, then tweak it to increase accuracy. This is hard to reduce to a fixed method and needs some trial and error; I have done this before for other problems. One possible starting point is sketched below.
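    
    For instance, borrowing the usual chunking recipe of unigram features
    over a +/-2 token window, expressed in terms of the per-token feature
    dicts above (the offsets and which columns to include are exactly the
    knobs to tweak):
    
        def window_features(sent, i):
            """Chunking-style template: the current token's features plus
            the lemma and POS of neighbours in a +/-2 window."""
            feats = dict(sent[i])                  # lemma/pos/case at i
            for off in (-2, -1, 1, 2):
                j = i + off
                if 0 <= j < len(sent):
                    feats['lemma[%+d]' % off] = sent[j]['lemma']
                    feats['pos[%+d]' % off] = sent[j]['pos']
                else:
                    feats['pos[%+d]' % off] = 'BOS' if j < 0 else 'EOS'
            return feats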
    
    
 9) Post Editing - A rule based approach
    ====================================
    This sub-module will be a rule base that lets the user override the system's predictions. Rules can be written by doing error analysis and finding frequent cases where the model makes mistakes. A sketch follows below.
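    
    A sketch of what such an override layer could look like; the example
    prohibition rule is purely illustrative:
    
        # Each rule is (condition on the token's features, forced label).
        # Illustrative only: never insert a marker before a proper noun.
        PROHIBITIONS = [
            (lambda feats: feats.get('pos') == 'np', '0'),
        ]
        
        def apply_rules(feat_seq, pred_seq):
            """Apply prohibition rules on top of the CRF's predictions."""
            out = []
            for feats, pred in zip(feat_seq, pred_seq):
                for cond, forced in PROHIBITIONS:
                    if cond(feats):
                        pred = forced
                        break
                out.append(pred)
            return out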
    
    
10) Comments
    ========
    1. The above description is given for definiteness, but the module can easily be ported to other concepts such as aspect.
    2. Language-specific linguistic features could be added as needed.
    

    

