Indonesian languages are underrepresented in the global NLP research community (Aji et al., 2022). Although Indonesia has more than 700 diverse languages, NLP research and technology focus on only a tiny fraction of them. IndoNLP is a research community with a mission to advance NLP research for Indonesian languages.

News

We released NusaX, a high-quality parallel sentiment corpus covering 10 low-resource languages. Check the paper.

"One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia" was presented as a long paper at ACL 2022.

Challenges we solve

  • Low-resource: The lack of resources (i.e., datasets or models) is one of the challenges of Indonesian NLP. Some existing resources are hard to find because they are not released publicly or are locked behind institutional bureaucracy. As a result, global communities (especially those that do not speak Indonesian) have difficulty adapting their research and technology to Indonesian languages.

  • Language and dialect diversity: Indonesian local languages are rich in dialects. The same local language may be spoken differently from one city or village to the next. Speakers also often use different language styles depending on whom they are speaking with (e.g., casual vs. polite).

  • Multilinguality: Most Indonesians are at least bilingual: they speak Indonesian and their own local language. In addition, many learn other languages, such as English, Arabic, or Chinese, for school or religious reasons. As a result, their daily conversations are filled with code-switching, and these languages are intertwined with each other.

How to contribute?

See our current project(s), and if any of them interest you, please reach out to us! You can also contact us if you have cool ideas or are looking for a collaboration with local researchers, as long as the projects are open source.

We are currently building crowd-sourced Indonesian resources covering as many local languages as possible. We also encourage local researchers to release existing resources that are locked behind private institutions, making them more accessible and thus advancing progress in Indonesian NLP. See here if you are interested.