
2023-01-16 – 2023-01-20 Natural Language Processing (NLP)

For CAS AML participants as elective module 6.


Natural Language Processing

Natural Language Processing (NLP) refers to the production, preparation and analysis of textual data in natural (as opposed to programming) languages. NLP serves a variety of purposes, such as identifying persons, places or dates (Named Entity Recognition), annotating word classes (Part-of-Speech tagging) or analysing dependencies within sentences (for example, subject-verb-object relations). In computational linguistics, NLP was long rule-based, with language models made explicit and formal. Over roughly the past five years, solutions based on machine-learned language models have become widely accepted. Within a very short time, they have not only matched the results of rule-based NLP but significantly exceeded them. The most recent deep learning architecture (the Transformer) and the pre-trained systems later built on it (e.g. OpenAI's GPT-3 or BERT) can produce texts that are hardly, if at all, distinguishable from texts written by humans.
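
To make the rule-based side of this contrast concrete, here is a minimal sketch of an explicit rule for recognising dates in text. The pattern, the function name `find_dates` and the example sentence are illustrative assumptions, not course material; a learned Named Entity Recognition model would infer such patterns from annotated data instead.

```python
import re

# Assumed, deliberately narrow rule: ISO-style dates (YYYY-MM-DD) only.
# Real NER systems cover many more formats and entity types.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_dates(text):
    """Return (start, end, matched_text) for every ISO-style date in text."""
    return [(m.start(), m.end(), m.group(0)) for m in DATE_PATTERN.finditer(text)]


# Hypothetical example sentence.
example = "The module runs from 2023-01-16 to 2023-01-20."
for start, end, date in find_dates(example):
    print(start, end, date)
```

A rule like this is transparent but brittle: it misses "16 January 2023" entirely, which is the kind of gap that learned models tend to handle better.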

In this module we prepare texts so that, in a second step, we can train self-built neural networks on them and use these networks for annotation. We focus on three levels: 1) the preparation and segmentation of text [pre-processing]; 2) forms of annotation and their evaluation [information extraction]; 3) epistemological foundations and theoretical classification [critical text analysis].
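
As an illustration of what evaluating annotations can look like, the following sketch compares predicted entity annotations against a gold standard using precision, recall and F1. The span format `(start, end, label)`, the function name and the toy data are assumptions made for this example; the course may use different metrics or tooling.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold ones by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations as (start_token, end_token, label) triples.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "DATE")}
predicted = {(0, 2, "PER"), (5, 6, "PER"), (9, 10, "DATE"), (12, 13, "LOC")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a span with the right boundaries but the wrong label counts as both a false positive and a missed gold annotation.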

The module can be followed with texts from participants' own field or topic area. Ideally, these texts are prepared in txt format and brought to the course. In addition, texts will be made available to the participants.

Methods: Tokenization of text (depending on language), pre-training of language models (vectorization), use of Jupyter Notebooks/Google Colab [details will be communicated on the first day of class] to train neural networks, practical studies on available (or self-constructed) text corpora.
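
A minimal sketch of the first two steps, tokenization and vectorization, using only the Python standard library. Note the substitution: the vectorization shown is a plain bag-of-words count, a much simpler technique than the pre-trained language models named above, and the regex tokenizer is an assumption that ignores language-specific rules. All function names and sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Return token counts for text, one entry per vocabulary item."""
    counts = Counter(tokenize(text))
    # Dict iteration follows insertion order, i.e. the column order above.
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical two-document corpus.
documents = ["The cat sat.", "The dog sat on the cat."]
vocabulary = build_vocabulary(documents)
for doc in documents:
    print(vectorize(doc, vocabulary), doc)
```

Count vectors like these discard word order and context, which is one reason learned embeddings from pre-trained models are usually preferred for neural network input.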

Goal: Application of a chosen NLP subtask to text data, followed by evaluation and presentation of the outcome

Learning outcomes

After the course, participants can
  • perform basic preprocessing and segmentation of text for NLP purposes
  • perform basic information extraction (know forms of annotation and corresponding evaluation)
  • perform basic critical text analysis (epistemological foundations and theoretical classification)
  • apply neural networks for NLP tasks
Target group
  • CAS Advanced Machine Learning participants
  • Other interested people
Prerequisites 
  • Basic familiarity with Python and Jupyter notebooks
  • Basic machine learning and neural network skills
  • Own laptop
Methods
  • Theoretical introductions
  • Hands-on tutorials with Jupyter notebooks 
  • Project work with presentation
Format
  • About 15-20 hours of attendance (online participation possible)
  • About 30 hours of project work
  • Assessment by oral presentation of the project work
Certificate 
  • A certificate with 2 ECTS will be issued to participants who have attended the whole training and successfully presented their project work
Coaches
  • The coaches are local or external experts
Schedule
  • Monday: 09:35 - 13:00
  • Tuesday: 09:00 - 12:30
  • Wednesday: 09:00 - 12:30
  • Thursday: 10:00 - 13:30
  • Friday: 09:00 - 13:00

Presentation day: To be agreed on during the course
Time: Mon to Fri, mornings
Physical location: HG 114
URL: Zoom link https://unibe-ch.zoom.us/j/63308769108?pwd=M1dhZGhySmpjTEtISjhrOW92NkFxQT09
Training language: English
Participants: Max 25
Registration: Mandatory via Ilias (login and sign up)
Coaches: Tao Fan, Jonas Widmer, Ismail Prada, Tobias Hodel & Christa Schneider
Certificate: Certificate for full training attendance
[Tuesday]
TBD

[Wednesday]
TBD

[Friday]
TBD

Optional: