Asian Applied Natural Language Processing for Linguistics Diversity and Language Resource Development (ADD)

HLT Asian Professional School and Workshop

Background

We have to admit that the inability to use computers skillfully and to access information on computer networks is a serious disadvantage today. Opportunity in the information society depends on the ability to access knowledge and to express one's own. The Internet has dramatically changed the way we communicate, both computer to computer and human to computer. Building on computers and the Internet, many research and development projects have been conducted to make the interaction between humans and computers natural and smart. Language, however, remains the communication tool that we use most efficiently to communicate and to record our experiences. Many advanced countries have invested considerable effort to enable their people to communicate with computers using their natural communication skills, mostly in their own mother tongue. To maintain language diversity, we recognize that the technology for digitizing and processing natural language plays an important role.


The Thai Computational Linguistics Laboratory of NICT Asia Research Center, NECTEC, the Asian Language Resource Network of NUT, and SIIT, which are at the forefront of research in natural language processing (NLP), are cooperating to organize a series of applied NLP lectures to enlarge the NLP expert community, especially through practical experience. The series covers a broad range of NLP research realized in many practical projects. Experts from successful projects are invited to present the fundamental theories and their implementations. Many systems and development tools will be introduced in hands-on exercises aimed at future adoption of the technologies. Enlarging the NLP expert community, collaboration, and the sharing of knowledge and resources are our primary interests.

Objectives

The School for Asian Applied NLP fundamentally aims at sharing experience among experts in the field of NLP. Connecting experts with less experienced regions will not only help those regions leverage technology for using their mother tongues, but also strengthen language research. The activities range from technology transfer to sharing research and language resource development, through a series of lectures and hands-on exercises. The objectives are as follows.

  1. Build experts in NLP
  2. Build a human network of NLP experts for sharing experience and expertise and for collaborating in studying and applying NLP
  3. Support the development of language resources for studying and evaluating the technology
  4. Support the development of standards for language resource development
  5. Support the research and development of NLP common utilities
  6. Support the implementation of the existing NLP utilities

Organizers and Supporters

The series of applied NLP lectures is supported, through technology transfer, experts, and financial assistance, by

  • NICT Asia Research Center
  • Asian Language Resources Network Project
  • National Electronics and Computer Technology Center (NECTEC)
  • Sirindhorn International Institute of Technology (SIIT)
  • Asia-Pacific Association for Machine Translation (AAMT)
  • Asian Federation of Natural Language Processing (AFNLP)
  • PAN Localization Project, CRULP

Activities

To increase the number of NLP experts and language resources, we propose a framework of activities consisting of the following two phases.

  1. A series of lectures on applied NLP
  2. Workshop on applied NLP and language resource development

1. A series of lectures on applied NLP

Topics of great relevance to researchers and developers in applied NLP have been selected for the program.  

The program is offered as 3 courses. Each course runs for 9 days and includes hands-on training. A 1-day workshop, in which participants update one another on their work, follows each course.

Course 1: Introduction to NLP

  • Overview of NLP
      o State of the art of NLP
  • Morphological Analysis for Non-segmenting Languages
      o Hidden Markov model
      o Conditional Random Field
      o Word Segmentation
          ▪ Longest matching approach
          ▪ Maximal matching approach
          ▪ Part-of-speech trigram approach
      o Word Extraction
  • Search Engine
      o History
      o Search Engine Types
      o Crawler
      o Indexer
      o Searcher
      o Lucene Agreement
      o Lucene Example
      o Lucene API
      o Eclipt
      o Analyzer
      o Administration Tools
      o Evaluation
  • Corpus Development
      o Introduction to Corpus Linguistics
          ▪ What is a corpus?
          ▪ What is in a corpus?
          ▪ What are the purposes of corpora?
          ▪ How many kinds of corpora are there?
          ▪ How to acquire source data?
      o Corpus for Machine Translation
          ▪ How does a corpus relate to MT?
          ▪ How to build a corpus for MT?
          ▪ Frequently found problems in corpus development
          ▪ What are corpus tools and what do they do?
          ▪ Part-of-speech tagger, practice, and ORCHID
  • KUI for Asian WordNet Initiative
      o KUI, the Knowledge Unifying Initiator
      o WordNet
      o EuroWordNet
      o Asian WordNet Initiative
  • Web Language Engineering
      o Collaborative crawler
      o Language identification
          ▪ N-gram model
          ▪ String kernel approach
          ▪ Web document archiving
      o Multi-lingual search engine
  • Machine Translation
      o Overview
          ▪ History of machine translation
          ▪ Problems in machine translation
          ▪ Techniques for machine translation
          ▪ Trends in research on machine translation
      o Parsing
          ▪ Introduction to parsing techniques
          ▪ Parser's components: "parser = mechanism + grammar rules"
          ▪ General characterization of parsing techniques
          ▪ Applying parsing to MT
      o Corpus-based MT
          ▪ Introduction to corpus-based MT
          ▪ Developing a corpus
          ▪ Example-based approach
          ▪ Statistics-based approach
          ▪ Improving the quality of MT by using a corpus
      o Experiment
          ▪ Developing your own corpus
          ▪ Testing with a toolkit
  • Speech Recognition
      o Fundamentals of language & speech
      o Overview of ASR
      o Acoustic modeling by HMM
      o Language modeling
      o Speech decoding
      o Introduction to toolkits for ASR
      o Corpus & ASR component preparation
      o Building your own ASR
      o Evaluation & improvement
  • Speech Synthesis
      o Overview of TTS
      o Text processing
      o Linguistic/prosodic processing
      o Waveform synthesis
      o Introduction to toolkits for TTS
      o Corpus preparation
      o Building your own TTS
  • Optical Character Recognition
      o Overview (Pre-processing, Processing, Post-processing)
      o Pre-processing
          ▪ Image Enhancement
          ▪ Alignment
          ▪ Page Layout
          ▪ Binarization
          ▪ Character Segmentation
      o Processing
          ▪ Feature Extraction (Thinning, Bitmap, Edge, Structural Extraction, Projection, Run Length Coding)
          ▪ Training (SVM, Neural Network)
      o Post-processing
          ▪ Pasting
          ▪ Language Model
          ▪ Spelling Correction
  • Advanced Research

Course 2: Advanced NLP

  • Information retrieval and information extraction
  • Search engine
  • Text summarization
  • Machine Translation
  • Interlingua and knowledge representation
  • Text, web and data mining
  • Text classification and language identification

Course 3: Image and Speech processing

  • OCR
  • Image recognition
  • Image search
  • Image corpus
  • Text-to-speech
  • Speech recognition
  • Speech search
  • Spoken language identification
  • Speech corpus

2. Workshop on Asian Applied NLP and language resource development

The Workshop on Asian Applied NLP and language resource development is planned for the second phase, after the school. The workshop aims to bring together researchers in NLP and computational linguistics to share updates and explore novel ideas in NLP.

Venue

Sirindhorn International Institute of Technology, Pathumthani, Thailand

Accommodation

Pathumthani Place

Date

The school will be organized as a sequence of the 3 courses above.

Schedule of the 1st School of Asian ANLP (Course 1)

June 1 - July 21, 2006           Call for participation
July 21, 2006                    Due date of application submission
July 31, 2006                    Notification of participant acceptance
August 21 - September 1, 2006    The 1st School of Asian ANLP (Course 1: Introduction to NLP)

Participation

The primary participants (trainees) are researchers, developers, and integrators from the Asian region who have high potential to become leaders and to continue establishing R&D in this field in their countries. Each course has a capacity of 30 participants. For each course, the project provides financial support for only 12 participants, who will be nominated by the institutional representatives of each country and selected by the program committee on the basis of their profiles. Applicants with a good background in NLP, a strong interest in NLP and language study, good computer programming skills, and a strong willingness to participate in and contribute to NLP implementation projects are especially welcome.

Application

Please complete the application form and send it to

virach@tcllab.org (subject: application of ADD)
or fax it to +66-2564-7992

Achievement

Participants are expected to gain a general background in NLP, together with some familiarity with advanced NLP, through theoretical and practical experience. After completing the school, participants should be able to act as leaders in their countries, implementing and establishing NLP-related research and development for at least their own native languages. Collaboration and language resource sharing are expected within the expert network.

Program committee

Yoshimoto Shigetoshi        NICT Asia Research Center, Thailand
Isahara Hitoshi             AAMT
Yoshiki Mikami              Nagaoka University of Technology, Japan
Thaweesak Koanantakool      NECTEC, Thailand
Benjamin Tsou               AFNLP
Sarmad Hussain              CRULP, Pakistan
Thanaruk Theeramunkong      SIIT, Thailand
Virach Sornlertlamvanich    TCL, NICT, Thailand

Contact person

Dr. Virach Sornlertlamvanich

Thai Computational Linguistics Laboratory (TCL), NICT Asia Research Center

112 Phahon Yothin Road, Klong 1, Klong Luang

Pathumthani 12120, Thailand

Email: virach@tcllab.org

Tel: +66-2564-7990        Fax: +66-2564-7992