docria.algorithm
Functions for various processing purposes
Functions
| bfs() | Breadth first search |
| chain() | Create a new function that applies a sequence of functions in order |
| children_of() | Get children of a given property |
| dfs() | Depth first search |
| dfs_leaves() | Depth first search, only returning the leaves, i.e. those without children or outgoing links. |
| dominant_right() | Resolves overlapping segments by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins. |
| dominant_right_span() | Resolves overlapping spans by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins. |
| get_prop() | First order function which can be used to extract a property of nodes |
| group_by_span() | Groups all nodes in layer_nodes into the corresponding group node |
| is_covered_by() | Covered by predicate |
| sequence_to_textspans() | Convert a sequence of strings, e.g. produced by a tokenizer, into matching textspans in a raw text. |
| span_translate() | Translate span ranges from a partial extraction to the original data. |
- docria.algorithm.bfs(start, children, is_result=None)
Breadth first search
- Returns
iterator of found nodes with depth during search
- Return type
Iterator[Tuple[int, Node]]
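The traversal described above can be illustrated with a minimal, library-independent sketch. It is not docria's implementation: `bfs_sketch` is a hypothetical name, nodes are assumed to be hashable, `children` is assumed to be a callable returning an iterable of child nodes, and `is_result` is assumed to be an optional filter on which nodes are yielded. Tracking visited nodes is also an assumption of this sketch.

```python
from collections import deque


def bfs_sketch(start, children, is_result=None):
    """Yield (depth, node) pairs in breadth-first order (illustrative only)."""
    queue = deque([(0, start)])
    seen = {start}  # assumption: nodes are hashable and visited once
    while queue:
        depth, node = queue.popleft()
        # Yield every node unless a filter predicate is supplied.
        if is_result is None or is_result(node):
            yield depth, node
        for child in children(node):
            if child not in seen:
                seen.add(child)
                queue.append((depth + 1, child))


# Plain strings stand in for docria nodes here.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
visited = list(bfs_sketch("a", lambda n: graph[n]))
leaves_only = list(bfs_sketch("a", lambda n: graph[n], is_result=lambda n: not graph[n]))
```

The depth comes first in each tuple, matching the documented return type `Iterator[Tuple[int, Node]]`.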
- docria.algorithm.chain(*fns)
Create a new function that applies the given functions in sequence
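As a rough illustration of function chaining, the sketch below composes single-argument functions from left to right. The name `chain_sketch`, the left-to-right order, and the single-argument convention are assumptions made for this example and are not confirmed by the documentation above.

```python
from functools import reduce


def chain_sketch(*fns):
    """Return a function that feeds its input through fns, first to last."""
    def chained(value):
        # Each function receives the output of the previous one.
        return reduce(lambda acc, fn: fn(acc), fns, value)
    return chained


normalize = chain_sketch(str.strip, str.upper)
result = normalize("  hi ")
identity = chain_sketch()  # with no functions, the input passes through unchanged
```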
- docria.algorithm.children_of(layer, *props)
Get children of a given property
Note: the code checks against the schema whether the property is an array or a single node.
- docria.algorithm.dfs(start, children, is_result=None)
Depth first search
- Returns
iterator of nodes found during search
- Return type
Iterator[Node]
- docria.algorithm.dfs_leaves(start, children, is_result=None)
Depth first search, only returning the leaves, i.e. those without children or outgoing links
- Returns
iterator of nodes found during search
- Return type
Iterator[Node]
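A leaf-only depth-first traversal could look roughly like the following. This is a sketch under assumptions, not the library's code: `dfs_leaves_sketch` is a hypothetical name, nodes are assumed hashable, and `is_result` is assumed to act as an extra filter on the leaves.

```python
def dfs_leaves_sketch(start, children, is_result=None):
    """Yield leaf nodes (nodes without children) in depth-first order."""
    stack = [start]
    seen = {start}  # assumption: nodes are hashable and visited once
    while stack:
        node = stack.pop()
        kids = list(children(node))
        if not kids:
            # A node with no children is a leaf.
            if is_result is None or is_result(node):
                yield node
            continue
        # Push in reverse so the first child is explored first.
        for child in reversed(kids):
            if child not in seen:
                seen.add(child)
                stack.append(child)


tree = {"a": ["b", "c"], "b": ["d", "e"], "c": [], "d": [], "e": []}
leaves = list(dfs_leaves_sketch("a", lambda n: tree[n]))
```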
- docria.algorithm.dominant_right(segments)
Resolves overlapping segments by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.
- Parameters
segments (List[Tuple[int, int, Any]]) – tuple of (start, stop, data)
- Return type
List[Any]
- Returns
list of data
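To make the dominant right rule concrete, here is a minimal sketch that sweeps over segments sorted by start offset and keeps one winner per run of overlapping segments. The name `dominant_right_sketch`, the half-open `[start, stop)` interpretation, and the greedy sweep are assumptions of this example; the library may resolve chains of overlaps differently.

```python
def dominant_right_sketch(segments):
    """Resolve overlaps: longest wins; on equal length the rightmost wins."""
    if not segments:
        return []
    ordered = sorted(segments, key=lambda seg: seg[0])
    best = ordered[0]
    output = []
    for current in ordered[1:]:
        if current[0] < best[1]:  # current overlaps the present winner
            # ">=" lets a later (rightmost) segment win a tie on length.
            if current[1] - current[0] >= best[1] - best[0]:
                best = current
        else:
            output.append(best[2])
            best = current
    output.append(best[2])
    return output


resolved = dominant_right_sketch(
    [(0, 5, "a"), (3, 8, "b"), (10, 12, "c"), (11, 12, "d")]
)
```

Here "a" and "b" overlap with equal length, so the rightmost ("b") is kept, while "c" is longer than the overlapping "d" and therefore wins.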
- docria.algorithm.dominant_right_span(nodes, spanfield='text')
Resolves overlapping spans by using the dominant right rule, i.e. the longest wins and if equal length, the rightmost wins.
- docria.algorithm.get_prop(prop, default=None)
First order function which can be used to extract a property of nodes
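One plausible reading of this signature is a factory that returns a getter for a named property. The sketch below illustrates that idea with plain dictionaries standing in for nodes; `get_prop_sketch` is a hypothetical name, and the dictionary-style `.get` access is an assumption, not documented behaviour.

```python
def get_prop_sketch(prop, default=None):
    """Return a function that reads `prop` from a node, or `default` if absent."""
    def getter(node):
        # Assumption: the node supports dict-like .get(key, default).
        return node.get(prop, default)
    return getter


nodes = [{"pos": "NN"}, {}]
tags = list(map(get_prop_sketch("pos", "?"), nodes))
```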
- docria.algorithm.group_by_span(group_nodes, layer_nodes, resolution='intersect', group_span_field='text', layer_span_field=None, include_empty_groups=True)
Groups all nodes in layer_nodes into the corresponding group node
Nodes whose textspans are NIL/None are ignored.
- Parameters
group_nodes (List[Node]) – the nodes to group by
layer_nodes (Dict[str, Iterable[Node]]) – the nodes to assign to zero or more groups
resolution – which resolution algorithm to use: intersect or cover
"intersect": the identity function for resolutions (all intersects are grouped)
"cover": imposes a requirement that the group node must fully cover the layer node (node_start >= group_start and node_stop <= group_stop)
group_span_field – name of the textspan property, default field is "text"
layer_span_field (Optional[Dict[str, str]]) – dictionary {layer: field name for textspan}, default field is "text"
include_empty_groups – include groups that do not contain any matching layer nodes
- Returns
list of tuples (group node, dictionary with layer name -> [content of group for this layer])
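The difference between the two resolution modes can be sketched with bare `(start, stop)` tuples in place of nodes. Everything here is illustrative: `group_by_span_sketch` and its helpers are hypothetical names, spans are assumed to be half-open, and the span-field lookups of the real function are left out.

```python
def intersects(group, node):
    """True if the two half-open spans share at least one position."""
    return node[0] < group[1] and node[1] > group[0]


def covers(group, node):
    """True if the group span fully contains the node span."""
    return node[0] >= group[0] and node[1] <= group[1]


def group_by_span_sketch(groups, layers, resolution="intersect",
                         include_empty_groups=True):
    test = {"intersect": intersects, "cover": covers}[resolution]
    result = []
    for group in groups:
        content = {name: [n for n in nodes if test(group, n)]
                   for name, nodes in layers.items()}
        # Optionally drop groups where no layer contributed any node.
        if include_empty_groups or any(content.values()):
            result.append((group, content))
    return result


groups = [(0, 10), (10, 20), (30, 40)]
layers = {"token": [(0, 4), (8, 12), (15, 20)]}
by_intersect = group_by_span_sketch(groups, layers)
by_cover = group_by_span_sketch(groups, layers, resolution="cover")
non_empty = group_by_span_sketch(groups, layers, include_empty_groups=False)
```

The span `(8, 12)` straddles two groups, so it appears in both under "intersect" and in neither under "cover".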
- docria.algorithm.is_covered_by(span_a, span_b)
Covered by predicate
- Parameters
span_a (TextSpan) – the node that is tested for cover
span_b (TextSpan) – the node that might cover span_a
- Return type
bool
- Returns
true or false
- docria.algorithm.sequence_to_textspans(token_sequence, text, start_offset=0, stop_offset=None, k=1)
Convert a sequence of strings, e.g. produced by a tokenizer, into matching textspans in a raw text.
- Parameters
token_sequence (List[str]) – sequence of strings to find
text (Text) – the raw text to search in
start_offset (int) – the starting offset, default is from the start
stop_offset (Optional[int]) – the stop offset, default is to the end
k (int) – maximum number of tokens to skip to search for better matching tokens (if a token is not present in text, k = 1 will test one token ahead and, if it is closer, select this one instead)
- Return type
List[TextSpan]
- Returns
list of spans; spans that could not be found have zero length at the last position
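The basic idea of aligning tokens to a raw text can be sketched as a greedy left-to-right search. This simplified version deliberately omits the `k` lookahead and returns plain `(start, stop)` tuples rather than TextSpan objects; `align_tokens_sketch` is a hypothetical name and the behaviour is an assumption based only on the description above.

```python
def align_tokens_sketch(tokens, text, start_offset=0, stop_offset=None):
    """Greedily locate each token in text; missing tokens get zero-length spans."""
    stop = len(text) if stop_offset is None else stop_offset
    pos = start_offset
    spans = []
    for token in tokens:
        found = text.find(token, pos, stop)
        if found == -1:
            # Token not present: emit an empty span at the last position.
            spans.append((pos, pos))
        else:
            end = found + len(token)
            spans.append((found, end))
            pos = end  # continue searching after this token
    return spans


spans = align_tokens_sketch(["Hello", ",", "wrld", "!"], "Hello, world!")
```

The misspelled token "wrld" is not found, so it receives the zero-length span `(6, 6)` while the following "!" is still matched.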
- docria.algorithm.span_translate(doc, mapping_layer, target_source_map, layer_remap, source_target_remap)
Translate span ranges from a partial extraction to the original data.
Target is the original data, Source is the partial extraction ranges.
- Parameters
doc (Document) – document
mapping_layer (str) – the layer which contains the mapping
target_source_map (Tuple[str, str]) – tuple of (target field, source field)
layer_remap (str) – the layer which should be mapped
source_target_remap (Tuple[str, str]) – tuple of (source field, target field)