docria.model
Docria document model (primary module)
Classes
DataType | Data type declaration
DataTypeBinary | Bytes field type, field with raw binary data
| Boolean field type
| Type names
DataTypeFloat | 64 bit floating point (double) field type
| Signed 32 bit integer field type
| Signed 64 bit integer field type
DataTypeNoderef | Node reference field type in same or other layer
DataTypeNoderefList | Multi node reference field type in same or other layer
DataTypeNodespan | Nodespan field type, sequence of nodes
| String field type
| Textspan field type, text sequence
| Layer field type factory
Document | The document which contains all data
| User-defined typed data container
Node | Basic building block of the document model
| Base class for all node collections
NodeCollectionQuery | Represents a query to document data
NodeFieldCollection | Field from a node collection
NodeLayerCollection | Node collection, internally a list with gaps which will compact when 25% of the list is empty.
NodeLayerSchema | Node layer declaration
NodeList | Python list enriched with extra indexing and presentation functionality for optimal use in Docria.
NodeSpan | Represents a span of nodes in a layer
| Text offset object
Text | Text object, consisting of text and an index of current offsets
TextSpan | Text span, consisting of a start and stop offset.
Exceptions
| Failed to validate document
| Failed to validate a part of the schema
| Schema validation failed
- class docria.model.DataType(typename, **kwargs)
Data type declaration
- class docria.model.DataTypeBinary(typename, **kwargs)
Bytes field type, field with raw binary data
- class docria.model.DataTypeFloat(typename, **kwargs)
64 bit floating point (double) field type
- class docria.model.DataTypeNoderef(typename, **kwargs)
Node reference field type in same or other layer
- class docria.model.DataTypeNoderefList(typename, **kwargs)
Multi node reference field type in same or other layer
- class docria.model.DataTypeNodespan(typename, **kwargs)
Nodespan field type, sequence of nodes
- class docria.model.Document(**kwargs)
The document which contains all data
- add_layer(__name, **kwargs)
Create and add layer with specified schema
- Parameters
__name – the name of the layer
kwargs – field declarations as key-value pairs, e.g. name of field = type of field
- Returns
NodeLayerCollection instance with the specified schema
- add_text(name, text)
Add text to the document
- Parameters
name – name of the context
text – the raw string
- Returns
Text instance that can be used to derive spans from
- compile(extra_fields_ok=False, type_validation=True, **kwargs)
Compile the document: validate it and assign compacted ids to nodes (internal use)
- Parameters
extra_fields_ok – if set to True, extra fields in nodes are ignored
type_validation – perform type validation; if set to False and a type is not correct, the result is undefined behaviour and possibly corrupt storage.
- Return type
Dict[str, Tuple[Dict[int, int], List[int]]]
- Returns
Dictionary of text id to Dict(offset, offset-id)
- Raises
- property layer: Dict[str, docria.model.NodeLayerCollection]
Layer dict
- Return type
Dict[str, NodeLayerCollection]
- property layers: Dict[str, docria.model.NodeLayerCollection]
Alias for layer
- Return type
Dict[str, NodeLayerCollection]
- printschema()
Prints the full schema of this document to stdout, including layer fields and typing information
- remove_layer(name, fieldcascade=False)
Remove layer from document if it exists.
- Parameters
name – name of layer
fieldcascade – force removal and cascade the removal to referring fields in other layers. Default: False, which raises an exception if any other layer refers to the layer being removed.
- Return type
bool
- Returns
True if layer was removed, False if it does not exist
- property text: Dict[str, docria.model.Text]
Text
- Return type
Dict[str, Text]
- property texts: Dict[str, docria.model.Text]
Alias for text
- Return type
Dict[str, Text]
- class docria.model.Node(*args, **kwargs)
Basic building block of the document model
- Example
>>> from docria.model import Document, DataTypes as T, Node
>>>
>>> doc = Document()
>>> tokens = doc.add_layer("token", pos=T.string)
>>>
>>> node = Node(pos="NN")
>>>
>>> tokens.add_many([node])
>>>
>>> print(node["pos"])      # get the value of the pos field
>>> print(node.get("pos"))  # Node works like a dictionary
>>> print(node.keys())      # return the fields that are set
>>> print("pos" in node)    # check if the pos field is set
- property fld
Get a Pythonic wrapper for this node, e.g. node.fld.id == node["id"]
- property i
Get the index of this node.
- Returns
-1 if not bound to a layer, otherwise a non-negative index (0 or greater) within the layer
- is_dangling()
Check if this node is dangling, i.e. not attached to an existing layer because it was removed or never added.
- Return type
bool
- is_valid(noexcept=True)
Validate this node against schema
- Parameters
noexcept – set to False to raise an exception on validation failure; the exception gives the exact cause of the failure.
- Return type
bool
- Returns
True if valid
- iter_span(node)
Return an iterator which yields the span from this node to the given node
- Parameters
node (Node) – target node (inclusive)
- Note
This method corrects for order, i.e. if the target node is to the left of this node, the returned iterator will start at the target node.
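The order correction described in the note can be illustrated with a minimal sketch that works on plain list indices instead of Node objects. The function name iter_span_by_index and the use of integer positions are assumptions made for illustration only; this is not the Docria implementation.

```python
from typing import Iterator, Sequence


def iter_span_by_index(layer: Sequence[str], start: int, target: int) -> Iterator[str]:
    """Yield items from start to target (inclusive), left to right.

    Hypothetical helper: if target lies to the left of start, the two
    positions are swapped so iteration always begins at the leftmost one.
    """
    lo, hi = (start, target) if start <= target else (target, start)
    for index in range(lo, hi + 1):
        yield layer[index]


tokens = ["The", "quick", "brown", "fox"]
forward = list(iter_span_by_index(tokens, 1, 3))
backward = list(iter_span_by_index(tokens, 3, 1))  # same result, order corrected
```

Both calls walk the same inclusive range, which is the behaviour the note describes for a target node to the left of the current node.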
- property left: Union[None, docria.model.Node]
Get the node left of this node
- Return type
Optional[Node]
- property right: Union[None, docria.model.Node]
Get the node right of this node
- Return type
Optional[Node]
- class docria.model.NodeCollectionQuery(collection, predicate)
Represents a query to document data
- class docria.model.NodeFieldCollection(collection, field)
Field from a node collection
- covered_by(*range)
Covered by predicate
- Parameters
range – tuple of start, stop
- Returns
covered by predicate
- property dtype
Get the DataType for this field
- filter(cond)
Generic filter function.
- Parameters
cond (Callable[[Any], bool]) – a callable which will be given the value of this field; it is expected to match filter semantics.
- Returns
filter predicate
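As a rough illustration of filter semantics on a single field, the sketch below applies a condition to the field value of each node and keeps the nodes for which it returns True. Plain dictionaries stand in for nodes, and filter_by_field is a hypothetical helper, not part of Docria; skipping nodes that lack the field is also an assumption.

```python
from typing import Any, Callable, Dict, List

NodeDict = Dict[str, Any]  # hypothetical stand-in for a node


def filter_by_field(
    nodes: List[NodeDict], field: str, cond: Callable[[Any], bool]
) -> List[NodeDict]:
    """Keep the nodes whose value for `field` satisfies `cond`.

    Assumption: nodes that do not have the field set are skipped.
    """
    return [node for node in nodes if field in node and cond(node[field])]


tokens = [
    {"pos": "NN", "text": "fox"},
    {"pos": "VB", "text": "jumps"},
    {"text": "?"},  # no pos field set
]
nouns = filter_by_field(tokens, "pos", lambda value: value == "NN")
verbs = filter_by_field(tokens, "pos", lambda value: value.startswith("V"))
```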
- intersected_by(*range)
Intersected by predicate
- Parameters
range – tuple of start, stop
- Returns
intersected by predicate
- class docria.model.NodeLayerCollection(schema)
Node collection, internally a list with gaps which will compact when 25% of the list is empty.
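The "list with gaps" idea can be sketched as follows: removals leave an empty slot behind, and the list is rebuilt without gaps once the share of empty slots reaches a threshold. GapList, its method names, and the exact comparison used for the 25% threshold are assumptions for illustration; the actual Docria data structure may differ.

```python
from typing import Generic, Iterator, List, Optional, TypeVar

T = TypeVar("T")


class GapList(Generic[T]):
    """Hypothetical list that leaves gaps on removal and compacts lazily."""

    def __init__(self, gap_ratio: float = 0.25) -> None:
        self._slots: List[Optional[T]] = []  # None marks a gap
        self._live = 0                       # number of non-gap slots
        self._gap_ratio = gap_ratio

    def add(self, item: T) -> int:
        """Append an item and return its slot index."""
        self._slots.append(item)
        self._live += 1
        return len(self._slots) - 1

    def remove(self, index: int) -> None:
        """Turn a slot into a gap; compact once enough gaps accumulate."""
        if self._slots[index] is None:
            return
        self._slots[index] = None
        self._live -= 1
        gaps = len(self._slots) - self._live
        # Assumption: compaction triggers when gaps reach the ratio.
        if gaps >= self._gap_ratio * len(self._slots):
            self.compact()

    def compact(self) -> None:
        """Rebuild the slot list so that it contains no gaps."""
        self._slots = [slot for slot in self._slots if slot is not None]

    @property
    def capacity(self) -> int:
        """Number of slots, including gaps."""
        return len(self._slots)

    def __len__(self) -> int:
        return self._live

    def __iter__(self) -> Iterator[T]:
        return (slot for slot in self._slots if slot is not None)


nodes: GapList[str] = GapList()
for name in "abcdefgh":
    nodes.add(name)

nodes.remove(1)                    # 1 gap in 8 slots: below 25%, no compaction
capacity_after_one = nodes.capacity
nodes.remove(2)                    # 2 gaps in 8 slots: 25% reached, compacts
capacity_after_two = nodes.capacity
```

Deferring compaction keeps single removals cheap, at the cost of indices shifting whenever a compaction does run.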
- __getitem__(item)
Get nodes
- Example
>>> doc = ...  # type: docria.model.Document
>>> token_layer = doc["token"]
>>>
>>> # Get sequence of nodes using node ids (tokens)
>>> token_layer[0:10]  # type: NodeList
>>>
>>> # Find all tokens with a particular field value
>>> tokens = token_layer[token_layer["pos"] == "NN"]
>>>
>>> token_layer["pos"]  # type: NodeFieldCollection
- add(*args, **kwargs)
Add node to this layer.
- Parameters
args – Node objects, if used then kwargs are ignored
kwargs – create a node from the given properties, ignored if len(args) > 0
- Return type
- Returns
the created node if kwargs were used
- Example
>>> layer = doc["layer-name"]  # type: NodeLayerCollection
>>> layer.add(field1="Data", field2=42, field3=text[0:12])
>>> layer.add(node1, node2)
>>> layer.add(*nodes)
- add_field(name, dtype, init_with_default=True)
Add new field to the schema
- Parameters
name (str) – name of the field
dtype – type of the field
init_with_default – set the field to its default value on all existing nodes
- Raises
SchemaValidationError – if the field conflicts with an existing field
- add_many(nodes, default_fill=True, full_validation=True)
Add many nodes
- Parameters
nodes (Iterable[Node]) – list of node references to add
default_fill – set to True if default values should be added to nodes
full_validation – set to True to do full field validation
- Note
If full_validation is set to True, all nodes are added first and validation is performed afterwards. Internal references between nodes in the nodes input are allowed.
- compact()
Compact this layer to have no gaps.
All node references will be stored sequentially in memory.
- filter(*fields, fn)
Create a node filter predicate
- Parameters
fields – the fields for the predicate
fn – callable object which, given the values of those fields, returns True or False
- left(n)
- Return type
Optional[Node]
- Returns
the node to the left of the given node n (at a lower index), or None if none is available.
- property name: str
Name of layer
- Return type
str
- remove_field(name, leave_data=False)
Remove existing field
- Parameters
name (str) – the name of the field to remove
leave_data – leave any existing data in nodes; validation fails with default settings if the data is not cleaned out.
- Return type
bool
- Returns
True if the field was removed, False if the field could not be found
- right(n)
- Return type
Optional[Node]
- Returns
the node to the right of the given node n (at a larger index), or None if none is available.
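Because the collection may contain gaps, finding a neighbour cannot simply use index ± 1. The sketch below shows one way such a lookup could skip gaps. It takes an index rather than a node, and left_of and right_of are hypothetical helpers, so treat it as an illustration of the idea only.

```python
from typing import List, Optional


def left_of(slots: List[Optional[str]], index: int) -> Optional[str]:
    """Return the nearest non-gap item at a lower index, or None."""
    for i in range(index - 1, -1, -1):
        if slots[i] is not None:
            return slots[i]
    return None


def right_of(slots: List[Optional[str]], index: int) -> Optional[str]:
    """Return the nearest non-gap item at a larger index, or None."""
    for i in range(index + 1, len(slots)):
        if slots[i] is not None:
            return slots[i]
    return None


# None marks a gap left behind by a removed node (assumption).
slots = ["a", None, "c", None]
left_neighbour = left_of(slots, 2)    # skips the gap at index 1
right_neighbour = right_of(slots, 2)  # only a gap remains to the right
```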
- property schema: docria.model.NodeLayerSchema
Get layer schema
- Return type
NodeLayerSchema
- sort(keyfn)
Sort the nodes by rearranging the node reference order according to the given key function
- Parameters
keyfn – a function (input: Node) -> value to sort by.
- to_pandas(fields=None, materialize_spans=False, include_ref_field=True)
Convert this layer to a pandas DataFrame
Requires pandas, which Docria itself does not require.
- Parameters
fields (Optional[List[str]]) – which fields to include, by default all fields are included.
materialize_spans – converts span fields to a materialized string
include_ref_field – include the Python node reference as the __ref field in the DataFrame.
- Return type
pandas.DataFrame
- Returns
pandas.DataFrame with the contents of this layer
- class docria.model.NodeLayerSchema(name)
Node layer declaration
Consists of name and field type declarations
- class docria.model.NodeList(*elems, fieldtypes=None)
Python list enriched with extra indexing and presentation functionality for optimal use in Docria.
- class docria.model.NodeSpan(left_most_node, right_most_node)
Represents a span of nodes in a layer
- Parameters
- __getitem__(item)
- __len__()
Computes the number of nodes currently contained within this node span.
This function has complexity O(n).
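One plausible reason for the O(n) cost is that a span only stores its two end nodes, so the length has to be counted by walking from one end to the other. The sketch below models that with a simple linked structure; LinkedNode and EndpointSpan are hypothetical names, and this is not how Docria is necessarily implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkedNode:
    """Hypothetical node that only knows its right neighbour."""
    value: str
    right: Optional["LinkedNode"] = None


@dataclass
class EndpointSpan:
    """Hypothetical span that stores only its two end nodes."""
    left_most: LinkedNode
    right_most: LinkedNode

    def __len__(self) -> int:
        # Walk from the left end to the right end: one step per node, O(n).
        count = 1
        node = self.left_most
        while node is not self.right_most:
            if node.right is None:
                raise ValueError("right_most is not reachable from left_most")
            node = node.right
            count += 1
        return count


c = LinkedNode("c")
b = LinkedNode("b", right=c)
a = LinkedNode("a", right=b)

span_length = len(EndpointSpan(a, c))
single_length = len(EndpointSpan(b, b))
```

Storing only the end nodes keeps a span valid when nodes between them are added or removed, which is the trade-off for the linear-time length.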
- class docria.model.Text(name, text)
Text object, consisting of text and an index of current offsets
- class docria.model.TextSpan(text, start_offset, stop_offset)
Text span, consisting of a start and stop offset.
- Note
Use str(span) to get a real string.
- covered_by(span)
Checks if this span is covered by given span
- Parameters
span – the span to be covered by
- Returns
boolean indicating cover
- intersected_by(span)
Checks if this span is intersected by given span
- Parameters
span – the span to be intersected by
- Returns
boolean indicating intersection
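The two predicates can be sketched with plain integer offsets. The example assumes half-open spans (start inclusive, stop exclusive); the source does not state how Docria treats the stop offset. OffsetSpan is a hypothetical stand-in, not the TextSpan class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetSpan:
    """Hypothetical span over text offsets; stop is exclusive (assumption)."""
    start: int
    stop: int

    def covered_by(self, other: "OffsetSpan") -> bool:
        # Covered: this span lies entirely inside the other span.
        return other.start <= self.start and self.stop <= other.stop

    def intersected_by(self, other: "OffsetSpan") -> bool:
        # Intersected: the two spans share at least one offset.
        return self.start < other.stop and other.start < self.stop


sentence = OffsetSpan(0, 20)
word = OffsetSpan(4, 9)
tail = OffsetSpan(15, 30)

word_in_sentence = word.covered_by(sentence)
tail_in_sentence = tail.covered_by(sentence)
tail_touches_sentence = tail.intersected_by(sentence)
word_touches_tail = word.intersected_by(tail)
```

Under these definitions every covered span is also intersected, but not the other way round, as the tail span shows.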
- span_to(right_span)
Helper function to return new TextSpan from this position to the given span
- Parameters
right_span (TextSpan) – right most span
- Return type
TextSpan
- Returns
TextSpan
- text_to(right_span)
Helper function to return the text from this position to the given span
- Parameters
right_span (TextSpan) – right most span
- Return type
str
- Returns
the text as a string