For your own research project, you basically only need to implement a DatasetReader and a Model, and then run your various experiments with config files. In short, we need to understand the three features below to start our project with AllenNLP:
- Define Your DatasetReader
- Define Your Model
- Setup Your Config Files
From <https://towardsdatascience.com/allennlp-startup-guide-24ffd773cd5b>
The DatasetReader takes a raw dataset as input and applies preprocessing such as lowercasing, tokenization, and so on. Finally, it outputs a list of Instance objects, each of which holds the preprocessed data as attributes. In this post, the Instance object has the document and the label information as attributes.
First, we inherit from the DatasetReader class to make our own reader. Then we need to implement three methods: __init__, _read, and text_to_instance. So let's look at how to implement our own DatasetReader. I'll skip the implementation of the _read method because it doesn't relate to the usage of AllenNLP so much, but if you're interested in it, you can refer to this link.
The implementation of __init__ will be as follows. We can control the arguments of this method via config files.
from typing import Dict
from allennlp.data import DatasetReader, TokenIndexer, Tokenizer

# Registering the reader under the name 'imdb' lets config files refer to it.
@DatasetReader.register('imdb')
class ImdbDatasetReader(DatasetReader):
    def __init__(self, token_indexers: Dict[str, TokenIndexer], tokenizer: Tokenizer) -> None:
        super().__init__()
        self._tokenizer = tokenizer
        self._token_indexers = token_indexers
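For reference, the reader can also be instantiated directly in Python. The snippet below is only a sketch assuming the WordTokenizer and SingleIdTokenIndexer classes from the AllenNLP 0.x API this guide was written against; in practice these objects are built from the config file.

# Sketch only: WordTokenizer / SingleIdTokenIndexer are the 0.x-era defaults,
# not necessarily what the original post uses.
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WordTokenizer

reader = ImdbDatasetReader(
    token_indexers={"tokens": SingleIdTokenIndexer()},
    tokenizer=WordTokenizer(),
)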
In this post, I set token_indexers and tokenizer as the arguments because I assume that we will change the way of indexing or tokenization across experiments. The token_indexers performs indexing and the tokenizer performs tokenization. The class I implemented has the decorator (DatasetReader.register('imdb')), which enables us to control it from config files.
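To make that concrete, here is a hedged sketch of how the registered reader could be built from configuration, written with AllenNLP's Params so it stays in Python; the same dictionary would sit under the dataset_reader key of a config file. The type names "word" and "single_id" are assumptions based on the tokenizer and token indexer that AllenNLP registers by default, not something taken from the original post.

from allennlp.common import Params

# "type": "imdb" matches the name passed to DatasetReader.register above.
# "word" and "single_id" are assumed default registered names.
reader = DatasetReader.from_params(Params({
    "type": "imdb",
    "tokenizer": {"type": "word"},
    "token_indexers": {"tokens": {"type": "single_id"}},
}))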
The implementation of text_to_instance will be as follows. This method is the main process of the DatasetReader: text_to_instance takes each piece of raw data as input, applies some preprocessing, and outputs it as an Instance. For IMDB, it takes the review string and the polarity label as input.
# Additional imports needed at the top of the file:
# from allennlp.data import Instance
# from allennlp.data.fields import Field, LabelField, TextField

def text_to_instance(self, string: str, label: int) -> Instance:
    fields: Dict[str, Field] = {}
    tokens = self._tokenizer.tokenize(string)
    fields["tokens"] = TextField(tokens, self._token_indexers)
    fields["label"] = LabelField(label, skip_indexing=True)  # label is already an integer index
    return Instance(fields)
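Finally, although the _read method was skipped above, a minimal sketch may help show how the pieces fit together. This version assumes the raw data has already been flattened into tab-separated "label<TAB>review text" lines, which is not necessarily how the original post loads IMDB; see the link mentioned earlier for the real implementation.

from typing import Iterator

def _read(self, file_path: str) -> Iterator[Instance]:
    # Sketch only: assumes one "label<TAB>review text" pair per line.
    with open(file_path, encoding="utf-8") as data_file:
        for line in data_file:
            label, string = line.rstrip("\n").split("\t", 1)
            yield self.text_to_instance(string, int(label))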