Version: Next

Tokenizer (English)

Description

Segments a given text into Tokens (usually words, numbers, punctuations, ...). Works best with english text.

Required input

A stream with a string property which contains a text.

Configuration

Simply assign the correct output of the previous stream to the tokenizer input. To use this component you have to download or train an openNLP model: https://opennlp.apache.org/models.html

Output

Adds a list to the stream which contains all tokens of the corresponding text.

Example:

Input: (text: "Hi, how are you?")

Output: (text: "Hi, how are you?", tokens: ["Hi", ",", "how", "are", "you", "?"])

Description​

Required input​

Configuration​

Output​

Description

Required input

Configuration

Output