Tokenizer
- class igwn_ligolw.tokenizer.Tokenizer
Bases: object
A tokenizer for LIGO Light Weight XML Stream and Array elements. Converts (usually comma-) delimited text streams into sequences of Python objects.
An instance is created by calling the class with the delimiter character as the single argument. Text is appended to the internal buffer by passing it to the .append() method. Tokens are extracted by iterating over the instance.
The Tokenizer can extract tokens directly as various Python types. The .set_types() method is passed a sequence of the types to which tokens are to be converted. The types are used in order, cyclically. For example, passing [int] to .set_types() causes all tokens to be converted to integers, while [str, int] causes the first token to be returned as a string, the second as an integer, the third as a string again, and so on. The default is to extract all tokens as strings.
If a token type is set to None, the corresponding tokens are skipped. For example, invoking .set_types() with [int, None] causes the first token to be converted to an integer, the second to be skipped, the third to be converted to an integer, and so on. This can be used to improve parsing performance when only a subset of the input stream is required.
Example:
>>> from igwn_ligolw import tokenizer
>>> t = tokenizer.Tokenizer(u",")
>>> t.set_types([str, int])
>>> list(t.append("a,10,b,2"))
['a', 10, 'b']
>>> list(t.append("0,"))
[20]
Notes. The last token will not be extracted until a delimiter character is seen to terminate it; in the example above, the trailing 2 stays in the buffer until "0," is appended and completes the token 20. Tokens can be quoted with '"' characters, which are removed before conversion to the target type. An empty token (two delimiters with only whitespace between them) is returned as None regardless of the requested type. To prevent a zero-length string token from being interpreted as None, place it in quotes.
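The buffering, skipping, and empty-token rules described above can be modelled in a few lines of pure Python. The sketch below is only an illustration: SketchTokenizer is a hypothetical name, it is not the igwn_ligolw implementation, and it assumes that whitespace around a token is stripped and that quoted tokens contain neither delimiters nor escape sequences.

```python
from itertools import cycle


class SketchTokenizer:
    """Illustrative pure-Python model; NOT the igwn_ligolw implementation."""

    def __init__(self, delimiter):
        self.delimiter = delimiter
        self.data = ""               # unconsumed text (loosely models .data)
        self._types = cycle([str])   # default: every token is a string

    def set_types(self, types):
        # Converters are applied in order and wrap around.
        self._types = cycle(list(types))

    def append(self, text):
        self.data += text
        return self                  # allows list(t.append(...))

    def __iter__(self):
        # Everything before the last delimiter is complete; the remainder
        # stays buffered until a later delimiter terminates it.
        *complete, self.data = self.data.split(self.delimiter)
        for raw in complete:
            convert = next(self._types)
            if convert is None:
                continue             # a None type skips the token
            token = raw.strip()      # assumption: outer whitespace is ignored
            if not token:
                yield None           # empty token, regardless of type
            elif len(token) >= 2 and token[0] == token[-1] == '"':
                yield convert(token[1:-1])   # quotes removed before conversion
            else:
                yield convert(token)


# Skip every second token.
skipper = SketchTokenizer(",")
skipper.set_types([int, None])
skipped = list(skipper.append("1,x,3,y,"))

# Empty token versus quoted zero-length string.
plain = SketchTokenizer(",")
mixed = list(plain.append('a, ,"",b,'))
```

In this model, skipped holds only the two integers, and mixed distinguishes the whitespace-only token (None) from the quoted zero-length string ('').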
Attributes Summary
- data: The current contents of the internal buffer.
Methods Summary
- append(): Append a unicode string object to the tokenizer's internal buffer.
- set_types(): Set the types to be used cyclically for token parsing.
Attributes Documentation
- data
The current contents of the internal buffer.
Methods Documentation
- append()
Append a unicode string object to the tokenizer’s internal buffer.
- set_types()
Set the types to be used cyclically for token parsing. This function accepts an iterable of callables. Each callable will be passed the token to be converted as a unicode string. Special fast-paths are included to handle the Python builtin types float, int, long, and str. The default is to return all tokens as unicode string objects.
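Because the converters only need to be callables that accept a string, types other than the builtins can be mixed in. The snippet below is a minimal sketch of that idea in plain Python: convert_cyclically is a hypothetical helper that models only the cyclic application of converters, and the token values are made up for illustration. With the real tokenizer, the same list of callables would presumably be handed to set_types().

```python
from decimal import Decimal
from functools import partial
from itertools import cycle


def convert_cyclically(tokens, types):
    """Apply each converter to the matching token, wrapping around the list.

    Hypothetical helper for illustration; not part of igwn_ligolw.
    """
    return [convert(token) for token, convert in zip(tokens, cycle(types))]


# Any callable taking a single string qualifies as a "type":
converters = [
    str.upper,               # normalise a label
    partial(int, base=16),   # parse a hexadecimal integer
    Decimal,                 # exact decimal arithmetic instead of float
]

row = convert_cyclically(["h1", "ff", "0.25", "l1", "10", "1e-3"], converters)
# row == ['H1', 255, Decimal('0.25'), 'L1', 16, Decimal('0.001')]
```

Custom callables like these would not benefit from the fast paths mentioned above, so they are best reserved for columns that genuinely need special handling.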