docria.model¶

Docria document model (primary module)

Classes

`DataType`(typename, **kwargs)	Data type declaration
`DataTypeBinary`(typename, **kwargs)	Bytes field type, field with raw binary data
`DataTypeBool`(typename, **kwargs)	Boolean field type
`DataTypeEnum`(value)	Type names
`DataTypeFloat`(typename, **kwargs)	64 bit floating point (double) field type
`DataTypeInt32`(typename, **kwargs)	Signed 32 bit integer field type
`DataTypeInt64`(typename, **kwargs)	Signed 64 bit integer field type
`DataTypeNoderef`(typename, **kwargs)	Node reference field type in same or other layer
`DataTypeNoderefList`(typename, **kwargs)	Multi node reference field type in same or other layer
`DataTypeNodespan`(typename, **kwargs)	Nodespan field type, sequence of nodes
`DataTypeString`(typename, **kwargs)	String field type
`DataTypeTextspan`(typename, **kwargs)	Textspan field type, text sequence
`DataTypes`()	Layer field type factory
`Document`(**kwargs)	The document which contains all data
`ExtData`(type, data)	User-defined typed data container
`Node`(*args, **kwargs)	Basic building block of the document model
`NodeCollection`(fieldtypes)	Base class for all node collections
`NodeCollectionQuery`(collection, predicate)	Represents a query to document data
`NodeFieldCollection`(collection, field)	Field from a node collection
`NodeLayerCollection`(schema)	Node collection, internally a list with gaps which will compact when 25% of the list is empty.
`NodeLayerSchema`(name)	Node layer declaration
`NodeList`(*elems[, fieldtypes])	Python list enriched with extra indexing and presentation functionality for optimal use in Docria.
`NodeSpan`(left_most_node, right_most_node)	Represents a span of nodes in a layer
`Offset`(offset)	Text offset object
`Text`(name, text)	Text object, consisting of text and an index of current offsets
`TextSpan`(text, start_offset, stop_offset)	Text span, consisting of a start and stop offset.

Exceptions

`DataValidationError`(message)	Failed to validate document
`SchemaError`(message)	Failed to validate a part of the schema
`SchemaValidationError`(message, fields)	Schema validation failed

class docria.model.DataType(typename, **kwargs)[source]¶

Data type declaration

__init__(typename, **kwargs)[source]¶

cast_up(dtype)[source]¶

Find the largest type capable of representing both.

Parameters: dtype (DataType) – type to cast
Return type: DataType
Returns: self or dtype
Note

Strings and numbers are not considered equal.

cast_up_possible(dtype)[source]¶

Check if type can be merged with another type.

Return type: bool
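
The promotion rule described by cast_up and cast_up_possible can be pictured with a small standalone sketch. It does not use Docria: the names `NUMERIC_RANK`, `cast_up_possible`, and `cast_up` below are hypothetical stand-ins operating on plain type-name strings, and the assumed numeric ordering (32 bit integer, then 64 bit integer, then float) is an illustration rather than a statement of Docria's actual rules.

```python
from typing import Optional

# Hypothetical ranking of numeric type names; an assumption for illustration.
NUMERIC_RANK = {"i32": 0, "i64": 1, "f64": 2}


def cast_up_possible(a: str, b: str) -> bool:
    """Return True if the two type names can be merged into one type."""
    if a == b:
        return True
    # Only numeric types merge with each other; strings never merge with numbers.
    return a in NUMERIC_RANK and b in NUMERIC_RANK


def cast_up(a: str, b: str) -> Optional[str]:
    """Return whichever of a or b can represent both, or None if no merge exists."""
    if not cast_up_possible(a, b):
        return None
    if a == b:
        return a
    return a if NUMERIC_RANK[a] >= NUMERIC_RANK[b] else b


print(cast_up("i32", "i64"))  # the wider integer type
print(cast_up("str", "i32"))  # no common type
```

As in the documented method, the result is always one of the two inputs; no third type is ever introduced.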

class docria.model.DataTypeBinary(typename, **kwargs)[source]¶

Bytes field type, field with raw binary data

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypeBool(typename, **kwargs)[source]¶

Boolean field type

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypeEnum(value)[source]¶: Type names

class docria.model.DataTypeFloat(typename, **kwargs)[source]¶

64 bit floating point (double) field type

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypeInt32(typename, **kwargs)[source]¶

Signed 32 bit integer field type

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypeInt64(typename, **kwargs)[source]¶

Signed 64 bit integer field type

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypeNoderef(typename, **kwargs)[source]¶

Node reference field type in same or other layer

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypeNoderefList(typename, **kwargs)[source]¶

Multi node reference field type in same or other layer

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypeNodespan(typename, **kwargs)[source]¶

Nodespan field type, sequence of nodes

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypeString(typename, **kwargs)[source]¶

String field type

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypeTextspan(typename, **kwargs)[source]¶

Textspan field type, text sequence

__init__(typename, **kwargs)[source]¶

class docria.model.DataTypes[source]¶: Layer field type factory

exception docria.model.DataValidationError(message)[source]¶

Failed to validate document

__init__(message)[source]¶

class docria.model.Document(**kwargs)[source]¶

The document which contains all data

__getitem__(key)[source]¶

__delitem__(key)[source]¶

__contains__(item)[source]¶

__init__(**kwargs)[source]¶

Construct new document

Parameters: kwargs – property key, values

add_layer(_Document__name, **kwargs)[source]¶

Create and add layer with specified schema

Parameters

__name – the name of the layer
kwargs – key value pairs with e.g. name of field = type of field

Returns

NodeLayerCollection instance with the specified schema

add_text(name, text)[source]¶

Add text to the document

Parameters

name – name of the context
text – the raw string

Returns

Text instance that can be used to derive spans from

compile(extra_fields_ok=False, type_validation=True, **kwargs)[source]¶

Compile the document: validate it and assign compacted ids to nodes (internal use)

Parameters

extra_fields_ok – ignores extra fields in node if set to True
type_validation – perform type validation; if set to False and a value has the wrong type, behaviour is undefined and storage may be corrupted.

Return type

Dict[str, Tuple[Dict[int, int], List[int]]]

Returns

Dictionary of text id to Dict(offset, offset-id)

Raises

SchemaValidationError –

property layer: Dict[str, docria.model.NodeLayerCollection]¶

Layer dict

Return type: Dict[str, NodeLayerCollection]

property layers: Dict[str, docria.model.NodeLayerCollection]¶

Alias for layer

Return type: Dict[str, NodeLayerCollection]

printschema()[source]¶: Prints the full schema of this document to stdout, containing layer fields and typing information

remove_layer(name, fieldcascade=False)[source]¶

Remove layer from document if it exists.

Parameters

name – name of layer
fieldcascade – force removal and cascade the removal to fields in other layers that refer to this layer. Default: False, in which case an exception is raised if any layer refers to name.

Return type

bool

Returns

True if layer was removed, False if it does not exist

property text: Dict[str, docria.model.Text]¶

Text

Return type: Dict[str, Text]

property texts: Dict[str, docria.model.Text]¶

Alias for text

Return type: Dict[str, Text]

class docria.model.ExtData(type, data)[source]¶

User-defined typed data container

__init__(type, data)[source]¶

class docria.model.Node(*args, **kwargs)[source]¶

Basic building block of the document model

Example

>>> from docria.model import Document, DataTypes as T, Node
>>>
>>> doc = Document()
>>> tokens = doc.add_layer("token", pos=T.string)
>>>
>>> node = Node(pos="NN")
>>>
>>> tokens.add_many([ node ])
>>>
>>> print(node["pos"])  # Get the value of the pos field
>>> print(node.get("pos"))  # Node works like a dictionary
>>> print(node.keys())  # Return the fields that are set
>>> print("pos" in node)  # Check whether the pos field is set

__init__(*args, **kwargs)[source]¶

detach()[source]¶: Remove itself from the document model

property fld¶: Get a Pythonic wrapper for this node, e.g. node.fld.id == node["id"]

property i¶

Get the index of this node.

Returns: -1 if not bound to a layer, otherwise a non-negative index (0 or greater)

is_dangling()[source]¶

Check if this node is dangling i.e. is not attached to an existing layer, possibly removed or never added.

Return type: bool

is_valid(noexcept=True)[source]¶

Validate this node against schema

Parameters: noexcept – set to False to raise an exception on validation failure; the exception gives the exact cause of the failure.
Return type: bool
Returns: true if valid

iter_span(node)[source]¶

Return iterator which will give the span from this node to the given node

Parameters: node (Node) – target node (inclusive)
Note

This method corrects for order, i.e. if the target node is to the left of this node, the returned iterator will start at target node.
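
The order correction in this note can be sketched without Docria. The function `iter_span` below is a hypothetical stand-in that works on list positions instead of Node objects; it only illustrates the idea that the iteration always runs left to right, whichever endpoint is given first.

```python
from typing import Iterator, List


def iter_span(items: List[str], i: int, j: int) -> Iterator[str]:
    """Yield items between positions i and j, inclusive, always left to right."""
    lo, hi = (i, j) if i <= j else (j, i)  # correct for order
    for k in range(lo, hi + 1):
        yield items[k]


tokens = ["The", "quick", "brown", "fox"]
# Both calls yield the same left-to-right sequence.
print(list(iter_span(tokens, 1, 3)))
print(list(iter_span(tokens, 3, 1)))
```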

property left: Union[None, docria.model.Node]¶

Get the node left of this node

Return type: Optional[Node]

property right: Union[None, docria.model.Node]¶

Get the node right of this node

Return type: Optional[Node]

with_id(id)[source]¶

Utility method to set id and return this node. This is an unsafe method and should only be used when you know what you are doing.

Parameters: id – internal id
Returns: self

class docria.model.NodeCollection(fieldtypes)[source]¶

Base class for all node collections

__init__(fieldtypes)[source]¶

to_list()[source]¶

Convert this collection to a NodeList containing all node references

Return type: NodeList
Returns: NodeList with all nodes in this layer

class docria.model.NodeCollectionQuery(collection, predicate)[source]¶

Represents a query to document data

__init__(collection, predicate)[source]¶

class docria.model.NodeFieldCollection(collection, field)[source]¶

Field from a node collection

__init__(collection, field)[source]¶

covered_by(*range)[source]¶

Covered by predicate

Parameters: range – tuple of start, stop
Returns: covered by predicate

property dtype¶: Get the DataType for this field

filter(cond)[source]¶

Generic filter function.

Parameters: cond (Callable[[Any], bool]) – a callable which will be given the value of this field, it is expected to match filter semantics.
Returns: filter predicate

has_value()[source]¶

Has value predicate, does a field value exist

Returns: has value predicate

intersected_by(*range)[source]¶

Intersected by predicate

Parameters: range – tuple of start, stop
Returns: intersected by predicate

is_any(*item)[source]¶

Is any of predicate, does field value exist in given items.

Parameters: item – the items to verify against
Returns: is any predicate

is_none()[source]¶

Is none predicate, field value is none

Returns: is none predicate

to_list()[source]¶: Convert this node field collection to a python list with field elements.
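
The field predicates above (filter, has_value, is_any, is_none) can be thought of as small functions that are later applied to each node. The sketch below is a rough illustration of that idea and not Docria's implementation: nodes are modelled as plain dicts, every function name is hypothetical, and treating a missing field the same as a None value is an assumption.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]  # stand-in for a node
Predicate = Callable[[Record], bool]


def has_value(field: str) -> Predicate:
    """Match nodes where the field is present and not None."""
    return lambda node: node.get(field) is not None


def is_none(field: str) -> Predicate:
    """Match nodes where the field is missing or None."""
    return lambda node: node.get(field) is None


def is_any(field: str, *items: Any) -> Predicate:
    """Match nodes whose field value is one of the given items."""
    return lambda node: node.get(field) in items


def field_filter(field: str, cond: Callable[[Any], bool]) -> Predicate:
    """Match nodes for which cond(field value) is true."""
    return lambda node: cond(node.get(field))


def select(nodes: List[Record], pred: Predicate) -> List[Record]:
    """Apply a predicate to every node and keep the matches."""
    return [n for n in nodes if pred(n)]


nodes = [{"pos": "NN"}, {"pos": "VB"}, {"pos": None}, {}]
nouns_or_verbs = select(nodes, is_any("pos", "NN", "VB"))
print(nouns_or_verbs)
```

Separating predicate construction from evaluation is what lets a collection decide how to run the query.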

class docria.model.NodeLayerCollection(schema)[source]¶

Node collection, internally a list with gaps which will compact when 25% of the list is empty.
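
The "list with gaps" behaviour can be illustrated with a minimal standalone sketch. The class `GappedList` is hypothetical and is not Docria's data structure; it assumes that a removal leaves an empty slot and that compaction runs as soon as the share of empty slots reaches 25%.

```python
from typing import Any, List, Optional


class GappedList:
    """List that leaves None gaps on removal and compacts past a threshold."""

    def __init__(self, threshold: float = 0.25) -> None:
        self.slots: List[Optional[Any]] = []
        self.gaps = 0
        self.threshold = threshold

    def add(self, item: Any) -> None:
        self.slots.append(item)

    def remove(self, index: int) -> None:
        if self.slots[index] is None:
            return  # already a gap
        self.slots[index] = None
        self.gaps += 1
        # Compact once the share of empty slots reaches the threshold.
        if self.gaps >= self.threshold * len(self.slots):
            self.compact()

    def compact(self) -> None:
        """Drop all gaps so the remaining items are stored contiguously."""
        self.slots = [s for s in self.slots if s is not None]
        self.gaps = 0


layer = GappedList()
for name in ["a", "b", "c", "d", "e", "f", "g", "h"]:
    layer.add(name)
layer.remove(1)  # 1 of 8 slots empty: the gap is kept
layer.remove(5)  # 2 of 8 slots empty (25%): compaction runs
print(layer.slots)
```

Deferring compaction keeps single removals cheap, at the cost of indices shifting whenever a compaction does run.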

__getitem__(item)[source]¶

Get nodes

Example

>>> doc = ... # type: docria.model.Document
>>> token_layer = doc["token"]
>>>
>>> # Get sequence of nodes using node ids (tokens)
>>> token_layer[0:10]  # type: NodeList
>>>
>>> # Find all tokens with a particular field value
>>> tokens = token_layer[token_layer["pos"] == "NN"]
>>>
>>> token_layer["pos"]  # type: NodeFieldCollection

__init__(schema)[source]¶

add(*args, **kwargs)[source]¶

Add node to this layer.

Parameters

args – Node objects, if used then kwargs are ignored
kwargs – create nodes from given properties, ignored if len(args) > 0

Return type

Node

Returns

node if kwargs was used

Example

>>> layer = doc["layer-name"]  # type: NodeLayerCollection
>>> layer.add(field1="Data", field2=42, field3=text[0:12])
>>> layer.add(node1, node2)
>>> layer.add(*nodes)

add_field(name, dtype, init_with_default=True)[source]¶

Add new field to the schema

Parameters

name (str) – name of the field
dtype – type of the field
init_with_default – set this field to its default value on all existing nodes

Raises

SchemaValidationError – if the field conflicts with an existing field

add_many(nodes, default_fill=True, full_validation=True)[source]¶

Add many nodes

Parameters

nodes (Iterable[Node]) – list of node references to add
default_fill – set to True if default values should be added to nodes
full_validation – set to True to do full field validation

Note

If full_validation is set to True, all nodes are added first and validation is performed afterwards. Internal references between nodes in the nodes input are allowed.

compact()[source]¶

Compact this layer to have no gaps.

All node references will be stored sequentially in memory.

filter(*fields, fn)[source]¶

Create a node filter predicate

Parameters

fields – the fields for the predicate
fn – callable object which, given the field values, returns true or false

iter_nodespan(left_most, right_most)[source]¶

Iterator for node in given span

Parameters

left_most (Node) – left most, lowest index node
right_most (Node) – right most, highest index node, inclusive.

Return type

Iterator[Node]

Returns

iterator yielding zero or more elements

left(n)[source]¶

Return type: Optional[Node]
Returns: the node to the left of n (lower index), or None if none is available.

property name: str¶

Name of layer

Return type: str

remove(node)[source]¶

Remove nodes

Parameters: node (Union[Node, Iterable[Node]]) – the node or list of nodes to remove

remove_field(name, leave_data=False)[source]¶

Remove existing field

Parameters

name (str) – the name of the field to remove
leave_data – leave any existing data in nodes, validation fails with default settings if not cleaned out.

Return type

bool

Returns

true if the field was removed, false if the field could not be found

retain(nodes)[source]¶: Retain all nodes in the given list nodes, remove everything else.

right(n)[source]¶

Return type: Optional[Node]
Returns: the node to the right of n (higher index), or None if none is available.

property schema: docria.model.NodeLayerSchema¶

Get layer schema

Return type: NodeLayerSchema

sort(keyfn)[source]¶

Sort the nodes, rearrange the node reference order by the given key function

Parameters: keyfn – a function (input: Node) -> value to sort by.

to_pandas(fields=None, materialize_spans=False, include_ref_field=True)[source]¶

Convert this layer to a pandas DataFrame

Requires pandas, which is not a dependency of Docria.

Parameters

fields (Optional[List[str]]) – which fields to include, by default all fields are included.
materialize_spans – converts span fields to a materialized string
include_ref_field – include the python node reference as __ref field in the dataframe.

Return type

pandas.DataFrame

Returns

pandas.DataFrame with the contents of this layer

unsafe_initialize(nodes)[source]¶

Directly replaces all nodes with the provided list; no checks are performed, for performance reasons.

Note

Unsafe, used for direct initialization by codecs.

Return type: NodeLayerCollection
Returns: self

validate(node)[source]¶

Validate node against schema, will throw SchemaTypeError if not valid.

Return type: bool

class docria.model.NodeLayerSchema(name)[source]¶

Node layer declaration

Consists of name and field type declarations

__init__(name)[source]¶

class docria.model.NodeList(*elems, fieldtypes=None)[source]¶

Python list enriched with extra indexing and presentation functionality for optimal use in Docria.

__getitem__(item)[source]¶: Get field value by name, node by index, new lists using standard slices or a list of indices

__init__(*elems, fieldtypes=None)[source]¶

class docria.model.NodeSpan(left_most_node, right_most_node)[source]¶

Represents a span of nodes in a layer

Parameters

left_most_node (Node) – the node most to the left
right_most_node (Node) – the node most to the right

__init__(left_most_node, right_most_node)[source]¶

__getitem__(item)¶

__len__()[source]¶

Computes the number of nodes currently contained within this node span.

This function has complexity O(n).

text(field='text')[source]¶

Return text from left to right

Parameters: field – the text span field to use
Return type: str
Returns: string

textspan(field='text')[source]¶

Return text from left to right

Parameters: field – the text span field to use
Returns: string

class docria.model.Offset(offset)[source]¶

Text offset object

__init__(offset)[source]¶

exception docria.model.SchemaError(message)[source]¶

Failed to validate a part of the schema

__init__(message)[source]¶

exception docria.model.SchemaValidationError(message, fields)[source]¶

Schema validation failed

__init__(message, fields)[source]¶

class docria.model.Text(name, text)[source]¶

Text object, consisting of text and an index of current offsets

__init__(name, text)[source]¶

compile(offsets)[source]¶

Compiles text for serialization

Returns: List of segments
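
One plausible reading of "segments" is that the text is cut at each known offset. The sketch below illustrates only that reading, under stated assumptions: `split_at_offsets` is a hypothetical helper, offsets are taken to be character positions, and the real method may produce something different.

```python
from typing import Iterable, List


def split_at_offsets(text: str, offsets: Iterable[int]) -> List[str]:
    """Cut text at every interior offset and return the pieces in order."""
    # Keep only cut points strictly inside the text, deduplicated and sorted.
    cuts = sorted({o for o in offsets if 0 < o < len(text)})
    bounds = [0, *cuts, len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


segments = split_at_offsets("Hello world", [0, 5, 6, 11])
print(segments)
# Joining the segments restores the original text.
assert "".join(segments) == "Hello world"
```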

class docria.model.TextSpan(text, start_offset, stop_offset)[source]¶

Text span, consisting of a start and stop offset.

Note

Use str(span) to get a real string.

__init__(text, start_offset, stop_offset)[source]¶

covered_by(span)[source]¶

Checks if this span is covered by given span

Parameters: span – the span to be covered by
Returns: boolean indicating cover

intersected_by(span)[source]¶

Checks if this span is intersected by given span

Parameters: span – the span to be intersected by
Returns: boolean indicating intersection
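
The difference between the cover and intersection checks can be shown on plain integer ranges. This is a minimal sketch that does not use Docria: spans are modelled as hypothetical (start, stop) tuples, and treating stop as exclusive is an assumption.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, stop), stop exclusive: an assumption


def covered_by(inner: Span, outer: Span) -> bool:
    """True if inner lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersected_by(a: Span, b: Span) -> bool:
    """True if the two spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


word = (4, 9)
sentence = (0, 20)
neighbour = (9, 12)

print(covered_by(word, sentence))       # word sits inside sentence
print(intersected_by(word, neighbour))  # spans only touch at position 9
```

Under the exclusive-stop assumption, two spans that merely touch do not intersect; an inclusive convention would change that result.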

span_to(right_span)[source]¶

Helper function to return new TextSpan from this position to the given span

Parameters: right_span (TextSpan) – right most span
Return type: TextSpan
Returns: TextSpan

text_to(right_span)[source]¶

Helper function to return the text from this position to the given span

Parameters: right_span (TextSpan) – right most span
Return type: str
Returns: string

trim()[source]¶

Return a span trimmed of surrounding whitespace: the start moves forward and the stop moves backward until a non-whitespace character is encountered.

Returns: self, or a new instance if the span changed

trim_()[source]¶

Trim this span in-place by removing surrounding whitespace: the start moves forward and the stop moves backward until a non-whitespace character is encountered.

Returns: self
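
The trimming procedure described for trim and trim_ can be sketched on raw offsets. The function below is a hypothetical standalone helper, not Docria's code; it assumes a half-open [start, stop) range of character positions.

```python
from typing import Tuple


def trim(text: str, start: int, stop: int) -> Tuple[int, int]:
    """Shrink [start, stop) until neither end sits on whitespace."""
    while start < stop and text[start].isspace():
        start += 1
    while stop > start and text[stop - 1].isspace():
        stop -= 1
    return start, stop


text = "  Hello world \n"
start, stop = trim(text, 0, len(text))
print(repr(text[start:stop]))
```

A range that contains only whitespace collapses to an empty range instead of raising an error.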