docria.algorithm

Functions for various processing purposes

Functions

`bfs`(start, children[, is_result])	Breadth first search
`chain`(*fns)	Create a new function for a sequence of functions which will be applied in sequence
`children_of`(layer, *props)	Get children of a given property
`dfs`(start, children[, is_result])	Depth first search
`dfs_leaves`(start, children[, is_result])	Depth first search, only returning the leaves i.e. those without children or outgoing links.
`dominant_right`(segments)	Resolves overlapping segments by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.
`dominant_right_span`(nodes[, spanfield])	Resolves overlapping spans by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.
`get_prop`(prop[, default])	Higher-order function: returns a function which can be used to extract a property of nodes
`group_by_span`(group_nodes, layer_nodes[, ...])	Groups all nodes in layer_nodes into the corresponding bucket_node
`is_covered_by`(span_a, span_b)	Covered by predicate
`sequence_to_textspans`(token_sequence, text)	Convert a sequence of strings, e.g. produced by a tokenizer, into matching textspans in a raw text.
`span_translate`(doc, mapping_layer, ...)	Translate span ranges from a partial extraction to the original data.

docria.algorithm.bfs(start, children, is_result=None)

Breadth first search

Parameters

start (Node) – the start node
children (Callable[[Node], Iterator[Node]]) – function returning children iterator for given node
is_result (Optional[Callable[[Node], bool]]) – optional, function indicating if node should be emitted, default is true for all.

Returns: iterator of (depth, node) tuples for the nodes found during search

Return type: Iterator[Tuple[int, Node]]
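The traversal described above can be sketched on a plain dictionary graph. This is an illustration of breadth-first search with depths, not docria's implementation: the name `bfs_sketch` and the toy `graph` are hypothetical, and emitting the start node at depth 0 and skipping already visited nodes are assumptions.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Iterator, Optional, Tuple


def bfs_sketch(
    start: Hashable,
    children: Callable[[Hashable], Iterable[Hashable]],
    is_result: Optional[Callable[[Hashable], bool]] = None,
) -> Iterator[Tuple[int, Hashable]]:
    """Yield (depth, node) pairs in breadth-first order (illustrative only)."""
    if is_result is None:
        is_result = lambda node: True  # default: emit every node
    seen = {start}
    queue = deque([(0, start)])
    while queue:
        depth, node = queue.popleft()
        if is_result(node):
            yield depth, node
        for child in children(node):
            if child not in seen:  # assumption: guard against cycles
                seen.add(child)
                queue.append((depth + 1, child))


# Hypothetical toy graph: a -> b, a -> c, b -> d
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
bfs_result = list(bfs_sketch("a", lambda n: graph[n]))
```

Nodes closer to the start are emitted first, so each depth level is exhausted before the next one begins.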

docria.algorithm.chain(*fns)

Create a new function from a sequence of functions, which will be applied in sequence
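A minimal sketch of this kind of composition follows. The name `chain_sketch` is hypothetical, and left-to-right application (the first function given runs first) is an assumption about the order, not something the description above confirms.

```python
from functools import reduce
from typing import Any, Callable


def chain_sketch(*fns: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose fns left to right: chain_sketch(f, g)(x) == g(f(x))."""

    def chained(value: Any) -> Any:
        # Feed the output of each function into the next one.
        return reduce(lambda acc, fn: fn(acc), fns, value)

    return chained


strip_then_upper = chain_sketch(str.strip, str.upper)
chained_value = strip_then_upper("  docria ")
```

With no functions given, this sketch returns its input unchanged.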

docria.algorithm.children_of(layer, *props)

Get children of a given property

Note: the code checks the schema to determine whether the property is an array or a single node.

docria.algorithm.dfs(start, children, is_result=None)

Depth first search

Parameters

start (Node) – start node
children (Callable[[Node], Iterator[Node]]) – function returning children iterator for given node
is_result (Optional[Callable[[Node], bool]]) – optional, function indicating if node should be emitted, default is true for all.

Returns: iterator of nodes found during search

Return type: Iterator[Node]

docria.algorithm.dfs_leaves(start, children, is_result=None)

Depth first search, only returning the leaves, i.e. those without children or outgoing links

Parameters

start (Node) – start node
children (Callable[[Node], Iterator[Node]]) – function returning children iterator for given node
is_result (Optional[Callable[[Node], bool]]) – optional, function indicating if node should be emitted, default is true for all.

Returns: iterator of nodes found during search

Return type: Iterator[Node]
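The leaf-only depth-first traversal can be sketched with an explicit stack. This is an illustration, not docria's implementation: `dfs_leaves_sketch` and `tree` are hypothetical names, and visiting children in their listed order is an assumption.

```python
from typing import Callable, Hashable, Iterable, Iterator, Optional


def dfs_leaves_sketch(
    start: Hashable,
    children: Callable[[Hashable], Iterable[Hashable]],
    is_result: Optional[Callable[[Hashable], bool]] = None,
) -> Iterator[Hashable]:
    """Yield nodes without children, in depth-first order (illustrative only)."""
    if is_result is None:
        is_result = lambda node: True  # default: emit every leaf
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:  # assumption: skip nodes reached twice
            continue
        seen.add(node)
        kids = list(children(node))
        if kids:
            # Reverse so the first listed child is popped first.
            stack.extend(reversed(kids))
        elif is_result(node):
            yield node


# Hypothetical toy tree: a -> (b, c), b -> (d, e)
tree = {"a": ["b", "c"], "b": ["d", "e"], "c": [], "d": [], "e": []}
leaves = list(dfs_leaves_sketch("a", lambda n: tree[n]))
```

Dropping the `if kids` distinction and yielding every accepted node would turn the same loop into a plain depth-first search.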

docria.algorithm.dominant_right(segments)

Resolves overlapping segments by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.

Parameters: segments (List[Tuple[int, int, Any]]) – tuple of (start, stop, data)
Return type: List[Any]
Returns: list of data
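One possible reading of the dominant right rule is sketched below. It is a greedy illustration, not docria's implementation: `dominant_right_sketch` is a hypothetical name, and treating segments as half-open ranges [start, stop) and returning the survivors in start order are both assumptions.

```python
from typing import Any, List, Tuple


def dominant_right_sketch(segments: List[Tuple[int, int, Any]]) -> List[Any]:
    """Keep non-overlapping segments, preferring longer, then rightmost."""
    # Rank candidates: longest first, ties broken by the larger start offset.
    ranked = sorted(segments, key=lambda s: (s[1] - s[0], s[0]), reverse=True)
    kept: List[Tuple[int, int, Any]] = []
    for start, stop, data in ranked:
        # A candidate survives only if it is disjoint from every kept segment.
        if all(stop <= k_start or start >= k_stop for k_start, k_stop, _ in kept):
            kept.append((start, stop, data))
    kept.sort(key=lambda s: s[0])  # assumption: report in text order
    return [data for _, _, data in kept]


# "C" is longest and wins over "A" and "B"; "D" does not overlap "C".
resolved = dominant_right_sketch([(0, 5, "A"), (3, 8, "B"), (4, 12, "C"), (12, 14, "D")])
# "A" and "B" have equal length, so the rightmost ("B") wins.
tie_resolved = dominant_right_sketch([(0, 5, "A"), (3, 8, "B")])
```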

docria.algorithm.dominant_right_span(nodes, spanfield='text')

Resolves overlapping spans by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.

Parameters

nodes (Iterable[Node]) – nodes to resolve
spanfield (str) – the name of the spanfield

Return type

List[Node]

Returns

list of nodes

docria.algorithm.get_prop(prop, default=None)

Higher-order function: returns a function which can be used to extract a property of nodes
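The pattern of returning a property getter can be sketched with a closure. Plain dictionaries stand in for nodes here, and `get_prop_sketch` is a hypothetical name. That a missing property yields `default` is an assumption based on the signature.

```python
from typing import Any, Callable, Mapping


def get_prop_sketch(prop: str, default: Any = None) -> Callable[[Mapping[str, Any]], Any]:
    """Return a function that reads `prop` from a node-like mapping."""

    def getter(node: Mapping[str, Any]) -> Any:
        # Assumption: fall back to `default` when the property is absent.
        return node.get(prop, default)

    return getter


# Dictionaries used as stand-ins for nodes.
fake_nodes = [{"pos": "NOUN"}, {"lemma": "run"}]
pos_values = list(map(get_prop_sketch("pos", "UNK"), fake_nodes))
```

A getter like this is convenient as an argument to `map`, `sorted(key=...)`, or a composed pipeline.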

docria.algorithm.group_by_span(group_nodes, layer_nodes, resolution='intersect', group_span_field='text', layer_span_field=None, include_empty_groups=True)

Groups all nodes in layer_nodes into the corresponding bucket_node

Nodes whose textspan is NIL/None are ignored.

Parameters

group_nodes (List[Node]) – the nodes to group by
layer_nodes (Dict[str, Iterable[Node]]) – the nodes to assign to zero or more groups
resolution – which resolution algorithm to use, "intersect" or "cover":
- "intersect": the identity function for resolutions (all intersects are grouped)
- "cover": imposes a requirement that the group node must fully cover the layer node (node_start >= group_start and node_stop <= group_stop)
group_span_field – name of the textspan property, default field is "text"
layer_span_field (Optional[Dict[str, str]]) – dictionary {layer: field name for textspan}, default field is "text"
include_empty_groups – include groups which do not contain any matching layer nodes

Returns

list of tuples (group node, dictionary with layer name -> [ content of group for this layer ])

Return type

List[Tuple[Node, Dict[str, List[Node]]]]
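The difference between the two resolution modes can be sketched in a simplified form. This is not docria's implementation: `group_by_span_sketch` is a hypothetical name, spans are plain (start, stop) tuples rather than nodes, a single flat list replaces the dictionary of layers, and "intersect" is assumed to mean overlap of half-open ranges.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, stop), assumed half-open


def group_by_span_sketch(
    groups: List[Span],
    items: List[Span],
    resolution: str = "intersect",
    include_empty_groups: bool = True,
) -> List[Tuple[Span, List[Span]]]:
    """Assign each item span to every group span it matches."""

    def intersects(group: Span, item: Span) -> bool:
        return item[0] < group[1] and group[0] < item[1]

    def covers(group: Span, item: Span) -> bool:
        # node_start >= group_start and node_stop <= group_stop
        return item[0] >= group[0] and item[1] <= group[1]

    match = {"intersect": intersects, "cover": covers}[resolution]
    result = []
    for group in groups:
        members = [item for item in items if match(group, item)]
        if members or include_empty_groups:
            result.append((group, members))
    return result


sentences = [(0, 10), (10, 20), (20, 30)]
tokens = [(0, 4), (8, 12), (12, 20)]
by_intersect = group_by_span_sketch(sentences, tokens, "intersect")
by_cover = group_by_span_sketch(sentences, tokens, "cover", include_empty_groups=False)
```

The token (8, 12) straddles the first two sentences, so "intersect" places it in both groups while "cover" places it in neither.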

docria.algorithm.is_covered_by(span_a, span_b)

Covered by predicate

Parameters

span_a (TextSpan) – the node that is tested for cover
span_b (TextSpan) – the node that might cover span_a

Return type: bool
Returns: true or false
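A cover predicate of this kind can be sketched on a minimal stand-in type. `SpanSketch` and `is_covered_by_sketch` are hypothetical names, and the attribute names `start` and `stop` are assumptions rather than the documented `TextSpan` interface.

```python
from typing import NamedTuple


class SpanSketch(NamedTuple):
    """Stand-in for a text span with start and stop offsets."""
    start: int
    stop: int


def is_covered_by_sketch(span_a: SpanSketch, span_b: SpanSketch) -> bool:
    """True if span_a lies entirely within span_b."""
    return span_b.start <= span_a.start and span_a.stop <= span_b.stop


inside = is_covered_by_sketch(SpanSketch(2, 5), SpanSketch(0, 10))
straddling = is_covered_by_sketch(SpanSketch(8, 12), SpanSketch(0, 10))
```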

docria.algorithm.sequence_to_textspans(token_sequence, text, start_offset=0, stop_offset=None, k=1)

Convert a sequence of strings, e.g. produced by a tokenizer, into matching textspans in a raw text.

Parameters

token_sequence (List[str]) – sequence of strings to find
text (Text) – the raw text to search in
start_offset (int) – the starting offset, default is from the start
stop_offset (Optional[int]) – the stop offset, default is to the end
k (int) – maximum number of tokens to skip when searching for better matching tokens (if a token is not present in text, k = 1 will test one token ahead and, if it is closer, select this one instead)

Return type

List[TextSpan]

Returns

list of spans; spans which could not be found will have zero length at the last position
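The basic alignment idea can be sketched with `str.find`. This is a simplified illustration, not docria's implementation: `align_tokens_sketch` is a hypothetical name, spans are returned as (start, stop) tuples instead of TextSpan objects, and the look-ahead controlled by `k` is omitted entirely.

```python
from typing import List, Optional, Tuple


def align_tokens_sketch(
    tokens: List[str],
    text: str,
    start_offset: int = 0,
    stop_offset: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Locate each token in text, scanning left to right without look-ahead."""
    if stop_offset is None:
        stop_offset = len(text)
    spans = []
    pos = start_offset
    for token in tokens:
        found = text.find(token, pos, stop_offset)
        if found == -1:
            # Token not found: zero-length span at the last matched position.
            spans.append((pos, pos))
        else:
            pos = found + len(token)
            spans.append((found, pos))
    return spans


# "wrld" does not occur in the text, so it gets a zero-length span.
aligned = align_tokens_sketch(["Hello", ",", "wrld", "!"], "Hello, world!")
```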

docria.algorithm.span_translate(doc, mapping_layer, target_source_map, layer_remap, source_target_remap)

Translate span ranges from a partial extraction to the original data.

Target is the original data, Source is the partial extraction ranges.

Parameters

doc (Document) – document
mapping_layer (str) – the layer which contains the mapping
target_source_map (Tuple[str, str]) – tuple of (target field, source field)
layer_remap (str) – the layer which should be mapped
source_target_remap (Tuple[str, str]) – tuple of (source field, target field)