docria.algorithm

Functions for various processing purposes

Functions

bfs(start, children[, is_result])

Breadth first search

chain(*fns)

Create a new function that applies the given functions in sequence

children_of(layer, *props)

Get children of a given property

dfs(start, children[, is_result])

Depth first search

dfs_leaves(start, children[, is_result])

Depth first search, only returning the leaves i.e. those without children or outgoing links.

dominant_right(segments)

Resolves overlapping segments by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.

dominant_right_span(nodes[, spanfield])

Resolves overlapping spans by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.

get_prop(prop[, default])

First-order function which can be used to extract a property from nodes

group_by_span(group_nodes, layer_nodes[, ...])

Groups all nodes in layer_nodes under the corresponding group node in group_nodes

is_covered_by(span_a, span_b)

Covered-by predicate: tests whether span_a is covered by span_b

sequence_to_textspans(token_sequence, text)

Convert a sequence of strings, e.g. produced by a tokenizer, into matching textspans in a raw text.

span_translate(doc, mapping_layer, ...)

Translate span ranges from a partial extraction to the original data.

docria.algorithm.bfs(start, children, is_result=None)[source]

Breadth first search

Parameters
  • start (Node) – the start node

  • children (Callable[[Node], Iterator[Node]]) – function returning children iterator for given node

  • is_result (Optional[Callable[[Node], bool]]) – optional, function indicating if node should be emitted, default is true for all.

Returns

iterator of found nodes with their depth during the search

Return type

Iterator[Tuple[int, Node]]
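The traversal contract described above can be sketched without docria installed. This is a minimal illustration, not the library's implementation: the name bfs_sketch is hypothetical, plain strings stand in for Node, and both the visited-set and the decision to emit the start node at depth 0 are assumptions.

```python
from collections import deque


def bfs_sketch(start, children, is_result=None):
    """Yield (depth, node) pairs in breadth-first order (illustrative only)."""
    seen = {start}                      # assumption: each node is visited once
    queue = deque([(0, start)])
    while queue:
        depth, node = queue.popleft()
        # is_result defaults to "emit everything", as in the parameter description
        if is_result is None or is_result(node):
            yield depth, node
        for child in children(node):
            if child not in seen:
                seen.add(child)
                queue.append((depth + 1, child))


# Hypothetical graph: node name -> list of child names
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
bfs_order = list(bfs_sketch("a", lambda n: iter(graph[n])))
bfs_even = list(bfs_sketch("a", lambda n: iter(graph[n]), lambda n: n in "bd"))
```

Here children is any callable that returns an iterator of child nodes, so the same sketch works for a dictionary lookup, an attribute access, or a property extractor.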

docria.algorithm.chain(*fns)[source]

Create a new function that applies the given functions in sequence
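As a rough illustration of this kind of composition, the sketch below builds one function from several. The name chain_sketch is hypothetical, and it assumes left-to-right application where each function receives the previous function's result; the library's exact calling convention is not stated here.

```python
from functools import reduce


def chain_sketch(*fns):
    """Return a function that feeds its argument through fns, left to right."""
    def chained(value):
        # Start from value and apply each function to the running result
        return reduce(lambda acc, fn: fn(acc), fns, value)
    return chained


strip_then_upper = chain_sketch(str.strip, str.upper)
chained_result = strip_then_upper("  token ")
identity_result = chain_sketch()(5)   # no functions: the value passes through
```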

docria.algorithm.children_of(layer, *props)[source]

Get children of a given property

Note: the code will check against schema if it is an array or single node.

docria.algorithm.dfs(start, children, is_result=None)[source]

Depth first search

Parameters
  • start (Node) – start node

  • children (Callable[[Node], Iterator[Node]]) – function returning children iterator for given node

  • is_result (Optional[Callable[[Node], bool]]) – optional, function indicating if node should be emitted, default is true for all.

Returns

iterator of nodes found during search

Return type

Iterator[Node]
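A depth-first traversal with the same parameters might look like the sketch below. It is an illustration under assumptions, not docria's code: dfs_sketch is a hypothetical name, strings stand in for Node, nodes are emitted in pre-order, and a visited-set guards against cycles.

```python
def dfs_sketch(start, children, is_result=None):
    """Yield nodes in depth-first pre-order (illustrative only)."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if is_result is None or is_result(node):
            yield node
        # Push children reversed so the first child is explored first
        stack.extend(reversed(list(children(node))))


# Hypothetical tree: node name -> list of child names
dfs_tree = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
dfs_order = list(dfs_sketch("a", lambda n: iter(dfs_tree[n])))
```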

docria.algorithm.dfs_leaves(start, children, is_result=None)[source]

Depth first search, only returning the leaves i.e. those without children or outgoing links

Parameters
  • start (Node) – start node

  • children (Callable[[Node], Iterator[Node]]) – function returning children iterator for given node

  • is_result (Optional[Callable[[Node], bool]]) – optional, function indicating if node should be emitted, default is true for all.

Returns

iterator of nodes found during search

Return type

Iterator[Node]
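The leaf-only variant differs from a plain depth-first search only in what it emits. The sketch below assumes a leaf is a node whose children iterator is empty and that is_result filters among leaves; dfs_leaves_sketch is a hypothetical name and the code is not taken from the library.

```python
def dfs_leaves_sketch(start, children, is_result=None):
    """Yield only nodes without children, in depth-first order (illustrative)."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        kids = list(children(node))
        if kids:
            # Inner node: keep descending, first child first
            stack.extend(reversed(kids))
        elif is_result is None or is_result(node):
            yield node


# Hypothetical tree: node name -> list of child names
leaf_tree = {"a": ["b", "c"], "b": ["d", "e"], "c": [], "d": [], "e": []}
leaves = list(dfs_leaves_sketch("a", lambda n: iter(leaf_tree[n])))
```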

docria.algorithm.dominant_right(segments)[source]

Resolves overlapping segments by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.

Parameters

segments (List[Tuple[int, int, Any]]) – tuple of (start, stop, data)

Return type

List[Any]

Returns

list of data
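One way to realize the dominant right rule is a greedy pass over segments ranked by length and then by start offset. The sketch below is an interpretation, not the library's algorithm: dominant_right_sketch is a hypothetical name, segments are assumed to be half-open (start, stop) ranges, and the result is assumed to be ordered by start offset.

```python
def dominant_right_sketch(segments):
    """Keep non-overlapping segments: longest first, ties go to the rightmost."""
    # Rank by (length, start) descending: longest wins, then rightmost wins
    ranked = sorted(segments, key=lambda s: (s[1] - s[0], s[0]), reverse=True)
    kept = []
    for start, stop, data in ranked:
        # Accept a segment only if it overlaps nothing already kept
        if all(stop <= k_start or start >= k_stop for k_start, k_stop, _ in kept):
            kept.append((start, stop, data))
    kept.sort(key=lambda s: s[0])       # assumption: output in text order
    return [data for _, _, data in kept]


# A and B overlap with equal length, so the rightmost (B) wins; D loses to B
resolved = dominant_right_sketch([(0, 5, "A"), (3, 8, "B"), (8, 10, "C"), (2, 4, "D")])
```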

docria.algorithm.dominant_right_span(nodes, spanfield='text')[source]

Resolves overlapping spans by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.

Parameters
  • nodes (Iterable[Node]) – nodes to resolve

  • spanfield (str) – the name of the spanfield

Return type

List[Node]

Returns

list of nodes

docria.algorithm.get_prop(prop, default=None)[source]

First-order function which can be used to extract a property from nodes
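Since the signature takes no node argument, one plausible reading is that the call returns an extractor function. The sketch below illustrates that pattern only; get_prop_sketch is a hypothetical name, and it assumes nodes behave like mappings with a get method, which plain dictionaries stand in for here.

```python
def get_prop_sketch(prop, default=None):
    """Return a function that reads prop from a mapping-like node."""
    def getter(node):
        # Assumption: nodes support dict-style .get(key, default)
        return node.get(prop, default)
    return getter


pos_of = get_prop_sketch("pos", default="UNK")
tags = [pos_of(n) for n in ({"pos": "NN"}, {"lemma": "run"})]
```

An extractor of this shape can be passed wherever a callable over nodes is expected, for example as a step in a function chain.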

docria.algorithm.group_by_span(group_nodes, layer_nodes, resolution='intersect', group_span_field='text', layer_span_field=None, include_empty_groups=True)[source]

Groups all nodes in layer_nodes under the corresponding group node in group_nodes

Nodes whose textspans are NIL/None are ignored.

Parameters
  • group_nodes (List[Node]) – the nodes to group by

  • layer_nodes (Dict[str, Iterable[Node]]) – the nodes to assign to zero or more groups

  • resolution –

    which resolution algorithm shall be used: “intersect” or “cover”

    • “intersect”: the identity function for resolutions (all intersects are grouped)

    • “cover”: imposes a requirement that the group node must fully cover the layer node (node_start >= group_start and node_stop <= group_stop)

  • group_span_field – name of the textspan property, default field is “text”

  • layer_span_field (Optional[Dict[str, str]]) – dictionary {layer: field name for textspan}, default field is “text”

  • include_empty_groups – include groups which do not contain any matching layer nodes

Returns

list of tuples (group node, dictionary with layer name -> [ content of group for this layer ])

Return type

List[Tuple[Node, Dict[str, List[Node]]]]
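The difference between the two resolution modes can be sketched on bare offsets. This is a simplified illustration, not docria's implementation: group_by_span_sketch is a hypothetical name, spans are (start, stop) tuples treated as half-open ranges instead of nodes with textspan fields, and the quadratic scan is chosen for clarity.

```python
def group_by_span_sketch(groups, layers, resolution="intersect",
                         include_empty_groups=True):
    """Assign each layer span to every group span it intersects or is covered by."""
    def intersects(group, node):
        return node[0] < group[1] and group[0] < node[1]

    def covers(group, node):
        # Same condition as in the parameter description for "cover"
        return node[0] >= group[0] and node[1] <= group[1]

    match = {"intersect": intersects, "cover": covers}[resolution]
    result = []
    for group in groups:
        content = {name: [n for n in nodes if match(group, n)]
                   for name, nodes in layers.items()}
        if include_empty_groups or any(content.values()):
            result.append((group, content))
    return result


# Hypothetical data: two sentence spans and three token spans
sentences = [(0, 10), (10, 20)]
tokens = {"token": [(0, 4), (8, 12), (15, 20)]}
by_intersect = group_by_span_sketch(sentences, tokens)
by_cover = group_by_span_sketch(sentences, tokens, resolution="cover")
```

The token at (8, 12) straddles both sentences, so it lands in both groups under intersect and in neither under cover.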

docria.algorithm.is_covered_by(span_a, span_b)[source]

Covered-by predicate

Parameters
  • span_a (TextSpan) – the node that is tested for cover

  • span_b (TextSpan) – the node that might cover span_a

Return type

bool

Returns

true or false

docria.algorithm.sequence_to_textspans(token_sequence, text, start_offset=0, stop_offset=None, k=1)[source]

Convert a sequence of strings, e.g. produced by a tokenizer, into matching textspans in a raw text.

Parameters
  • token_sequence (List[str]) – sequence of strings to find

  • text (Text) – the raw text to search in

  • start_offset (int) – the starting offset, default is from the start

  • stop_offset (Optional[int]) – the stop offset, default is to the end

  • k (int) – maximum number of tokens to skip when searching for better matching tokens (if a token is not present in text, k = 1 will test one token ahead and, if it is closer, select this one instead)

Return type

List[TextSpan]

Returns

list of spans; spans which could not be found will have zero length at the last position
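The basic alignment idea can be sketched with str.find. This is a deliberately reduced illustration: sequence_to_spans_sketch is a hypothetical name, spans are returned as (start, stop) offset tuples instead of TextSpan objects, and the k lookahead is omitted entirely.

```python
def sequence_to_spans_sketch(tokens, text, start_offset=0, stop_offset=None):
    """Locate each token in text, scanning left to right (no lookahead)."""
    stop = len(text) if stop_offset is None else stop_offset
    spans = []
    pos = start_offset
    for token in tokens:
        found = text.find(token, pos, stop)
        if found == -1:
            # Token not present: zero-length span at the last matched position
            spans.append((pos, pos))
        else:
            spans.append((found, found + len(token)))
            pos = found + len(token)
    return spans


aligned = sequence_to_spans_sketch(["Hello", ",", "world", "?"], "Hello, world!")
```

Without lookahead, a single missing token cannot cause later tokens to be skipped, but a spurious early match can still shift subsequent spans; the k parameter described above addresses that case.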

docria.algorithm.span_translate(doc, mapping_layer, target_source_map, layer_remap, source_target_remap)[source]

Translate span ranges from a partial extraction to the original data.

Target is the original data, Source is the partial extraction ranges.

Parameters
  • doc (Document) – document

  • mapping_layer (str) – the layer which contains the mapping

  • target_source_map (Tuple[str, str]) – tuple of (target field, source field)

  • layer_remap (str) – the layer which should be mapped

  • source_target_remap (Tuple[str, str]) – tuple of (source field, target field)