Docria

Docria provides a hypergraph document model implementation with a focus on NLP (Natural Language Processing) applications.

Docria provides:

  • In-memory object representations in Python and Java

  • Binary serialization format based on MessagePack

  • File formats optimized for storing and accessing millions of documents locally and in a cluster context

Quickstart

To install the Python version:

pip install docria

The first steps

from docria.model import Document, DataTypes as T
import re

# Stupid tokenizer
tokenizer = re.compile(r"[a-zA-Z]+|[0-9]+|[^\s]")
starts_with_uppercase = re.compile(r"[A-Z].*")

doc = Document()

# Create a new text context called 'main' with the text 'This code was written in Lund, Sweden.'
doc.maintext = "This code was written in Lund, Sweden."
#               01234567890123456789012345678901234567
#               0         1         2         3

# Create a new layer with fields: id, uppercase and text.
#
# Fields:
#   id is an int32
#   uppercase is a boolean indicating if the token starts with an uppercase letter
#   text is a textspan from context 'main'
#
tokens = doc.add_layer("token", id=T.int32(), uppercase=T.bool(), text=T.span())

# Adding nodes: Solution 1
i = 0
for m in tokenizer.finditer(str(doc.maintext)):
  token_node = tokens.add(id=i, text=doc.maintext[m.start():m.end()])

  # Check if the token starts with an uppercase letter
  token_node["uppercase"] = starts_with_uppercase.fullmatch(m.group()) is not None
  i += 1

# Reading nodes
for tok in tokens:
   print(tok["text"])

# Filtering, only uppercase tokens
for tok in tokens[tokens["uppercase"] == True]:
   print(tok["text"])
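The tokenizer pattern used above can be tried on its own before any Docria objects are involved. The sketch below uses only the standard library `re` module and the same two patterns, and collects one tuple per token. The tuple layout and the name `rows` are illustrative choices for this example, not part of Docria.

```python
import re

tokenizer = re.compile(r"[a-zA-Z]+|[0-9]+|[^\s]")
starts_with_uppercase = re.compile(r"[A-Z].*")

text = "This code was written in Lund, Sweden."

# One row per token: (surface form, start offset, stop offset, starts with uppercase)
rows = [
    (m.group(), m.start(), m.end(), starts_with_uppercase.fullmatch(m.group()) is not None)
    for m in tokenizer.finditer(text)
]

for row in rows:
    print(row)
```

The start and stop offsets are the same values that are used to slice doc.maintext when the token nodes are added.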

Concepts

The document model consists of the following concepts:

  • Document: The overall container for everything (all nodes, layers, texts must be contained within)

  • Document properties: a single dictionary per document to store metadata in.

  • Text: The basic text representation, a wrapped string to track spans.

  • Text Spans: Subsequence of a string, can always be converted into a hard string by using str(span)

  • Node Spans: Start and stop node in a layer which will produce a sequence of nodes.

  • Layer: Collection of nodes

  • Layer Schema: Definition of field names and types when document is serialized

  • Node: Single node with zero or more fields with values

  • Node fields: Key, value pairs.

from docria.model import Document

doc = Document()
doc.maintext # alias to doc.text["main"] with special support for
             # creating a main text via doc.maintext = "string"

doc.props  # Document metadata dictionary
doc.layers # Layer dictionary, layer name to node layer collection
doc.layer  # Alias to above
doc.texts  # Text dictionary.
doc.text   # Alias to above
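As a rough illustration of the Text Span concept listed above, a span can be thought of as a start and stop offset over a string that is turned into an ordinary string only on request. `SketchSpan` is a hypothetical class written for this example; it is not Docria's TextSpan and says nothing about how Docria stores spans internally.

```python
class SketchSpan:
    """Hypothetical stand-in for a text span: a (start, stop) view over a string."""

    def __init__(self, text, start, stop):
        self.text = text
        self.start = start
        self.stop = stop

    def __str__(self):
        # Materialize the view as an ordinary ("hard") string
        return self.text[self.start:self.stop]

    def __len__(self):
        return self.stop - self.start


main = "Lund is a city in Sweden."
span = SketchSpan(main, 18, 24)
print(str(span), len(span))
```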

Examples

Reading document collections

from docria.storage import MsgpackDocumentReader
from docria.codec import MsgpackDocument

with MsgpackDocumentReader(open("path_to_your_docria_file.docria", "rb")) as reader:
   for rawdoc in reader:
      # rawdoc is of type MsgpackDocument
      doc = rawdoc.document() #  type: docria.Document

      # Print the schema
      doc.printschema()

      for token in doc["token"]:
         # ... do something with the data contained within.
         pass

# You can also use MsgpackDocumentReader as a normal instance
# and call .close() manually when done, or rely on the GC to clean it up.

The principle is mostly the same with docria.storage.TarMsgpackReader, except that it expects a file path, not a file-like object.

Writing document collections

from docria.storage import MsgpackDocumentReader, MsgpackDocumentWriter
from docria.codec import MsgpackDocument

with MsgpackDocumentWriter(open("path_to_your_docria_file.docria", "wb")) as writer:
   # using the previous doc in "The first steps"
   writer.write(doc)

# Rewriting or filtering
with MsgpackDocumentWriter(open("path_to_your_output_docria_file.docria", "wb")) as writer:
   with MsgpackDocumentReader(open("path_to_your_input_docria_file.docria", "rb")) as reader:
      for rawdoc in reader:
         writer.write(rawdoc)  # this is decompression and memory copy of the raw data

The principle is mostly the same with docria.storage.TarMsgpackWriter, except that it expects a file path, not a file-like object.

Reading and writing documents to bytes

import io

from docria.codec import MsgpackCodec, MsgpackDocument

binarydata = bytes()  # raw bytes from any location
binarydata = io.BytesIO()  # or a file-like object

# To decode into a document
doc = MsgpackCodec.decode(binarydata)

# To encode a document into bytes
binarydata = MsgpackCodec.encode(doc)

# Access data without a full deserialization
rawdoc = MsgpackDocument(binarydata)
rawdoc.properties()  # Document metadata as dictionary

# Document texts, dictionary name to list of strings
# (each segment which potentially has annotation) which can be joined to get the full text.
rawdoc.texts()

schema = rawdoc.schema() # advanced access to the contents of this document, lists layers and fields.

doc = rawdoc.document() # full document deserialization

Layer and field query

from docria import Document, DataTypes as T, NodeSpan, NodeList

doc = Document()
doc.maintext = "Lund is a city in Sweden."
#               0123456789012345678901234
#               0         1         2

# Only ordered layers exist in docria, this means all nodes are added sequentially.
# T.span() is equivalent to T.span("main") which refers to the main text
token_layer = doc.add_layer("token", part_of_speech=T.string(), text=T.span(), head=T.noderef("token"))

# Annotation output by CoreNLP 3.9.2 and Basic dependencies
# We set node references later.
first = token_layer.add(part_of_speech="NNP", text=doc.maintext[0:4])
token_layer.add(part_of_speech="VBZ", text=doc.maintext[5:7])
token_layer.add(part_of_speech="DT", text=doc.maintext[8:9])
token_layer.add(part_of_speech="NN", text=doc.maintext[10:14])
token_layer.add(part_of_speech="IN", text=doc.maintext[15:17])
token_layer.add(part_of_speech="NNP", text=doc.maintext[18:24])
last = token_layer.add(part_of_speech=".", text=doc.maintext[24:])

# Create a node span and convert into a list
sent_tokens = NodeSpan(first, last).to_list()

# When setting heads, no validation takes place.
sent_tokens[0]["head"] = token_layer[3] # head = city
sent_tokens[1]["head"] = token_layer[3] # head = city
sent_tokens[2]["head"] = token_layer[3] # head = city
sent_tokens[4]["head"] = token_layer[5] # head = Sweden
sent_tokens[5]["head"] = token_layer[3] # head = city
sent_tokens[6]["head"] = token_layer[3] # head = city

sent_tokens.validate() # We can manually initiate validate for these nodes to fail faster.

# This first query finds all roots by checking if the head is None, and finally picks the first one.
first_root = token_layer[token_layer["head"].is_none()].first()

# This second query finds all nodes with the head equal to first_root
tokens_with_head_first_root = token_layer[token_layer["head"] == first_root]

# Then we print the tokens between each matching token and the root token (both included), in layer order
for tok in tokens_with_head_first_root:
    # iter_span is invariant to order, it will always produce low id to high id.
    print(NodeList(first_root.iter_span(tok))["text"].to_list())
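The two queries in the example above can be mirrored with plain Python lists, which may help when reasoning about what they select. This is only a sketch of the logic and makes no Docria calls; `words`, `heads` and `span_between` are names invented for this illustration.

```python
words = ["Lund", "is", "a", "city", "in", "Sweden", "."]
# heads[i] is the index of the head of token i, or None for the root
heads = [3, 3, 3, None, 5, 3, 3]

# First query: all roots (head is None), then pick the first one
roots = [i for i, h in enumerate(heads) if h is None]
first_root = roots[0]

# Second query: all tokens whose head is the first root
dependents = [i for i, h in enumerate(heads) if h == first_root]


def span_between(a, b):
    """Tokens from the lower index to the higher index, inclusive, whatever the argument order."""
    lo, hi = min(a, b), max(a, b)
    return words[lo:hi + 1]


spans = [span_between(first_root, i) for i in dependents]
for s in spans:
    print(s)
```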

Change presentation settings

The settings used for pretty printing are controlled by the global variable docria.printout.options, which is a docria.printout.PrintOptions instance.

By convention, pretty printing outputs [layer name]#[internal id], where the internal id can be used to get the node. However, this id is only guaranteed to be stable as long as the layer is not changed; once the layer changes, the id is invalid.

For references in general, use the Node object.

API Reference

docria.model

Docria document model (primary module)

docria.algorithm

docria.codec

Codecs, encoding/decoding documents to/from binary or text representations

docria.storage

docria.printout

docria.model

Docria document model (primary module)

Classes

DataType(typename, **kwargs)

Data type declaration

DataTypeBinary(typename, **kwargs)

DataTypeBool(typename, **kwargs)

DataTypeEnum

Type names

DataTypeFloat(typename, **kwargs)

DataTypeInt32(typename, **kwargs)

DataTypeInt64(typename, **kwargs)

DataTypeNoderef(typename, **kwargs)

DataTypeNoderefList(typename, **kwargs)

DataTypeNodespan(typename, **kwargs)

DataTypeString(typename, **kwargs)

DataTypeTextspan(typename, **kwargs)

DataTypes

Common datatypes and factory methods for parametrical types

Document(**kwargs)

The document which contains all data

ExtData(type, data)

User-defined typed data container

Node(*args, **kwargs)

Basic building block of the document model

NodeCollection(fieldtypes)

NodeCollectionQuery(collection, predicate)

Represents a query to document data

NodeFieldCollection(collection, field)

Field from a node collection

NodeLayerCollection(schema)

Node collection, internally a list with gaps which will compact when 25% of the list is empty.

NodeLayerSchema(name)

Node layer declaration

NodeList(*elems[, fieldtypes])

Python list enriched with extra indexing and presentation functionality for optimal use in Docria.

NodeSpan(left_most_node, right_most_node)

Represents a span of nodes in a layer

Offset(offset)

Text offset object

Text(name, text)

Text object, consisting of text and an index of current offsets

TextSpan(text, start_offset, stop_offset)

Text span, consisting of a start and stop offset.

Exceptions

DataValidationError(message)

Failed to validate document

SchemaError(message)

Failed to validate a part of the schema

SchemaValidationError(message, fields)

Schema validation failed

class docria.model.DataType(typename, **kwargs)[source]

Bases: object

Data type declaration

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

cast_up(dtype)[source]

Find the largest type capable of representing both.

Parameters

dtype (DataType) – type to cast

Return type

DataType

Returns

self or dtype

Note

Strings and numbers are not considered equal.

cast_up_possible(dtype)[source]

Check if type can be merged with another type.

Return type

bool
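The idea behind cast_up and cast_up_possible can be sketched with plain type names. The widening order below (int32, then int64, then float) is an assumption made for this example only, and Docria's actual rules may differ. The functions `cast_up_sketch` and `cast_up_possible_sketch` are hypothetical and operate on strings, not on DataType objects.

```python
# Assumed numeric widening order for this sketch only; Docria's actual rules may differ.
NUMERIC_RANK = {"int32": 0, "int64": 1, "float": 2}


def cast_up_possible_sketch(a, b):
    """True if the two type names are identical or both numeric."""
    return a == b or (a in NUMERIC_RANK and b in NUMERIC_RANK)


def cast_up_sketch(a, b):
    """Return whichever of a or b is assumed able to represent values of both."""
    if not cast_up_possible_sketch(a, b):
        raise ValueError(f"cannot merge {a!r} with {b!r}")
    if a == b:
        return a
    return a if NUMERIC_RANK[a] >= NUMERIC_RANK[b] else b


print(cast_up_sketch("int32", "int64"))
print(cast_up_possible_sketch("string", "int32"))
```

As in the note above, a string type and a numeric type are never merged in this sketch.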

class docria.model.DataTypeBinary(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypeBool(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypeEnum[source]

Bases: enum.Enum

Type names

class docria.model.DataTypeFloat(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypeInt32(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypeInt64(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypeNoderef(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypeNoderefList(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypeNodespan(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypeString(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypeTextspan(typename, **kwargs)[source]

Bases: docria.model.DataType

__init__(typename, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.DataTypes[source]

Bases: object

Common datatypes and factory methods for parametrical types

exception docria.model.DataValidationError(message)[source]

Bases: Exception

Failed to validate document

__init__(message)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.Document(**kwargs)[source]

Bases: object

The document which contains all data

__getitem__(key)[source]
__delitem__(key)[source]
__contains__(item)[source]
__init__(**kwargs)[source]

Construct new document

Parameters

kwargs – property key, values

add_layer(_Document__name, **kwargs)[source]

Create and add layer with specified schema

Parameters
  • __name – the name of the layer

  • kwargs – key value pairs with e.g. name of field = type of field

Returns

NodeLayerCollection instance with the specified schema

add_text(name, text)[source]

Add text to the document

Parameters
  • name – name of the context

  • text – the raw string

Returns

Text instance that can be used to derive spans from

compile(extra_fields_ok=False, type_validation=True, **kwargs)[source]

Compile the document, validates and assigns compacted ids to nodes (internal use)

Parameters
  • extra_fields_ok – ignores extra fields in node if set to True

  • type_validation – perform type validation; if set to False and a type is incorrect, the result is undefined behaviour, possibly corrupt storage.

Return type

Dict[str, Tuple[Dict[int, int], List[int]]]

Returns

Dictionary of text id to Dict(offset, offset-id)

Raises

SchemaValidationError

property layer

Layer dict

Return type

Dict[str, NodeLayerCollection]

property layers

Alias for layer()

Return type

Dict[str, NodeLayerCollection]

printschema()[source]

Prints the full schema of this document to stdout, containing layer fields and typing information

remove_layer(name, fieldcascade=False)[source]

Remove layer from document if it exists.

Parameters
  • name – name of layer

  • fieldcascade – force removal, and cascade removal of referring fields in other layers, default: false which will result in exception if any layer is referring to name

Return type

bool

Returns

True if layer was removed, False if it does not exist

property text

Text

Return type

Dict[str, Text]

property texts

Alias for text()

Return type

Dict[str, Text]

class docria.model.ExtData(type, data)[source]

Bases: object

User-defined typed data container

__init__(type, data)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.Node(*args, **kwargs)[source]

Bases: dict

Basic building block of the document model

Example

>>> from docria.model import Document, DataTypes as T, Node
>>>
>>> doc = Document()
>>> tokens = doc.add_layer("token", pos=T.string)
>>>
>>> node = Node(pos="NN")
>>>
>>> tokens.add_many([ node ])
>>>
>>> print(node["pos"])  # Gets the field of pos
>>> print(node.get("pos"))  # Node works like a dictionary
>>> print(node.keys())  # return set fields
>>> print("pos" in node)  # check if pos field is set.

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

detach()[source]

Remove itself from the document model

property fld

Get a Pythonic wrapper for this node, e.g. node.fld.id == node["id"]

property i

Get the index of this node.

Returns

-1 if not bound to a layer, otherwise a non-negative index within the layer

is_dangling()[source]

Check if this node is dangling i.e. is not attached to an existing layer, possibly removed or never added.

Return type

bool

is_valid(noexcept=True)[source]

Validate this node against schema

Parameters

noexcept – set to False to raise an exception on validation failure; the exception gives the exact cause of the failure.

Return type

bool

Returns

true if valid

iter_span(node)[source]

Return iterator which will give the span from this node to the given node

Parameters

node (Node) – target node (inclusive)

Note

This method corrects for order, i.e. if the target node is to the left of this node, the returned iterator will start at target node.

property left

Get the node left of this node

Return type

Optional[Node]

property right

Get the node right of this node

Return type

Optional[Node]

with_id(id)[source]

Utility method to set id and return this node. This is an unsafe method and should only be used when you know what you are doing.

Parameters

id – internal id

Returns

self

class docria.model.NodeCollectionQuery(collection, predicate)[source]

Bases: docria.model.NodeCollection

Represents a query to document data

__init__(collection, predicate)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.NodeFieldCollection(collection, field)[source]

Bases: collections.abc.Sized

Field from a node collection

__init__(collection, field)[source]

Initialize self. See help(type(self)) for accurate signature.

covered_by(*range)[source]

Covered by predicate

Parameters

range – tuple of start, stop

Returns

covered by predicate

property dtype

Get the DataType for this field

filter(cond)[source]

Generic filter function.

Parameters

cond (Callable[[Any], bool]) – a callable which will be given the value of this field, it is expected to match filter semantics.

Returns

filter predicate

has_value()[source]

Has value predicate, does a field value exist

Returns

has value predicate

intersected_by(*range)[source]

Intersected by predicate

Parameters

range – tuple of start, stop

Returns

intersected by predicate

is_any(*item)[source]

Is any of predicate, does field value exist in given items.

Parameters

item – the items to verify against

Returns

is any predicate

is_none()[source]

Is none predicate, field value is none

Returns

is none predicate

to_list()[source]

Convert this node field collection to a python list with field elements.

class docria.model.NodeLayerCollection(schema)[source]

Bases: docria.model.NodeCollection

Node collection, internally a list with gaps which will compact when 25% of the list is empty.

__init__(schema)[source]

Initialize self. See help(type(self)) for accurate signature.

add(*args, **kwargs)[source]

Add node to this layer.

Parameters
  • args – Node objects, if used then kwargs are ignored

  • kwargs – create nodes from given properties, ignored if len(args) > 0

Return type

Node

Returns

node if kwargs was used

Example

>>> layer = doc["layer-name"]  # type: NodeLayerCollection
>>> layer.add(field1="Data", field2=42, field3=text[0:12])
>>> layer.add(node1, node2)
>>> layer.add(*nodes)

add_field(name, type)[source]

Add new field to the schema

Parameters
  • name (str) – name of the field

  • type (DataType) – type of the field

Raises

SchemaValidationError – if the field conflicts with an existing field

add_many(nodes, default_fill=True, full_validation=True)[source]

Add many nodes

Parameters
  • nodes (Iterable[Node]) – list of node references to add

  • default_fill – set to True if default values should be added to nodes

  • full_validation – set to True to do full field validation

Note

If full_validation is set to True, it will first add all nodes, and then perform validation. Internal references between nodes in the nodes input are allowed.
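Why adding first and validating afterwards permits internal references can be seen in a small sketch. It uses plain dicts as nodes and a plain list as the layer. `add_many_sketch` is a hypothetical helper for illustration and does not reproduce Docria's validation.

```python
def add_many_sketch(layer, nodes):
    """Append every node first, then validate references (plain dicts stand in for nodes)."""
    layer.extend(nodes)

    # Validation runs only after all nodes are in place, so a node may
    # refer to another node that appears later in the same batch.
    members = {id(n) for n in layer}
    for node in nodes:
        head = node.get("head")
        if head is not None and id(head) not in members:
            raise ValueError("head refers to a node outside the layer")


layer = []
lund = {"text": "Lund"}
city = {"text": "city"}
lund["head"] = city              # forward reference inside the batch

add_many_sketch(layer, [lund, city])
print(len(layer))
```

Had each node been validated at the moment it was appended, the reference from the first node to the second would have been rejected.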

compact()[source]

Compact this layer to have no gaps.

All node references will be stored sequentially in memory.
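A list with gaps that compacts itself once a share of its slots is empty can be sketched as follows. `GapListSketch` is a hypothetical class. The 25% figure is taken from the class description, but triggering at "greater than or equal" and using None as the gap marker are assumptions of this example rather than Docria's implementation.

```python
class GapListSketch:
    """Hypothetical list with gaps: removal leaves a hole, holes are squeezed out in bulk."""

    def __init__(self, threshold=0.25):
        self.slots = []      # items, or None where an item was removed
        self.holes = 0       # number of None slots
        self.threshold = threshold

    def add(self, item):
        self.slots.append(item)
        return len(self.slots) - 1

    def remove(self, index):
        if self.slots[index] is None:
            return
        self.slots[index] = None
        self.holes += 1
        # Compact once the share of empty slots reaches the threshold
        if self.holes / len(self.slots) >= self.threshold:
            self.compact()

    def compact(self):
        self.slots = [x for x in self.slots if x is not None]
        self.holes = 0

    def __iter__(self):
        return (x for x in self.slots if x is not None)


layer = GapListSketch()
for word in ["a", "b", "c", "d", "e", "f", "g", "h"]:
    layer.add(word)

layer.remove(1)                     # 1 of 8 slots empty: the hole is kept
slots_after_first = list(layer.slots)
layer.remove(2)                     # 2 of 8 slots empty: the list is compacted
slots_after_second = list(layer.slots)
print(slots_after_first, slots_after_second)
```

In this sketch, positions shift after compaction, so an index taken before a removal may point at a different item afterwards.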

filter(*fields, fn)[source]

Create a node filter predicate

Parameters
  • fields – the fields for the predicate

  • fn – callable object which, given the field values, returns true/false

iter_nodespan(left_most, right_most)[source]

Iterator for node in given span

Parameters
  • left_most (Node) – left most, lowest index node

  • right_most (Node) – right most, highest index node, inclusive.

Return type

Iterator[Node]

Returns

iterator yielding zero or more elements

left(n)[source]

Return type

Optional[Node]

Returns

node to the left or lower index than given n or None if none available.

property name

Name of layer

Return type

str

remove(node)[source]

Remove nodes

Parameters

node (Union[Node, Iterable[Node]]) – the node or list of nodes to remove

remove_field(name, leave_data=False)[source]

Remove existing field

Parameters
  • name (str) – the name of the field to remove

  • leave_data – leave any existing data in nodes, validation fails with default settings if not cleaned out.

Return type

bool

Returns

true if the field was removed, false if the field could not be found

retain(nodes)[source]

Retain all nodes in the given list nodes, remove everything else.

right(n)[source]

Return type

Optional[Node]

Returns

node to the right or larger index than given n or None if none available.

property schema

Get layer schema

Return type

NodeLayerSchema

sort(keyfn)[source]

Sort the nodes, rearrange the node reference order by the given key function

Parameters

keyfn – a function (input: Node) -> value to sort by.

to_pandas(fields=None, materialize_spans=False, include_ref_field=True)[source]

Convert this layer to a pandas DataFrame

Requires pandas, which is not a dependency of Docria.

Parameters
  • fields (Optional[List[str]]) – which fields to include, by default all fields are included.

  • materialize_spans – converts span fields to a materialized string

  • include_ref_field – include the python node reference as __ref field in the dataframe.

Return type

pandas.DataFrame

Returns

pandas.DataFrame with the contents of this layer

unsafe_initialize(nodes)[source]

Directly replaces all nodes with the provided list, no checks for performance.

Note

Unsafe, used for direct initialization by codecs.

Return type

NodeLayerCollection

Returns

self

validate(node)[source]

Validate node against schema, will throw SchemaTypeError if not valid.

Return type

bool

class docria.model.NodeLayerSchema(name)[source]

Bases: object

Node layer declaration

Consists of name and field type declarations

__init__(name)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.NodeList(*elems, fieldtypes=None)[source]

Bases: list, docria.model.NodeCollection

Python list enriched with extra indexing and presentation functionality for optimal use in Docria.

__getitem__(item)[source]

Get field value by name, node by index, new lists using standard slices or a list of indices

__init__(*elems, fieldtypes=None)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.NodeSpan(left_most_node, right_most_node)[source]

Bases: docria.model.NodeCollection

Represents a span of nodes in a layer

__getitem__(item)
__len__()[source]

Computes the number of nodes currently contained within this node span.

This function has complexity O(n).

__init__(left_most_node, right_most_node)[source]

Initialize self. See help(type(self)) for accurate signature.

text(field='text')[source]

Return text from left to right

Parameters

field – the text span field to use

Return type

str

Returns

string

textspan(field='text')[source]

Return text from left to right

Parameters

field – the text span field to use

Returns

string

class docria.model.Offset(offset)[source]

Bases: object

Text offset object

__init__(offset)[source]

Initialize self. See help(type(self)) for accurate signature.

exception docria.model.SchemaError(message)[source]

Bases: Exception

Failed to validate a part of the schema

__init__(message)[source]

Initialize self. See help(type(self)) for accurate signature.

exception docria.model.SchemaValidationError(message, fields)[source]

Bases: Exception

Schema validation failed

__init__(message, fields)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.model.Text(name, text)[source]

Bases: object

Text object, consisting of text and an index of current offsets

__init__(name, text)[source]

Initialize self. See help(type(self)) for accurate signature.

compile(offsets)[source]

Compiles text for serialization

Returns

List of segments

class docria.model.TextSpan(text, start_offset, stop_offset)[source]

Bases: object

Text span, consisting of a start and stop offset.

Note

Use str(span) to get a real string.

__init__(text, start_offset, stop_offset)[source]

Initialize self. See help(type(self)) for accurate signature.

covered_by(span)[source]

Checks if this span is covered by given span

Parameters

span – the span to be covered by

Returns

boolean indicating cover

intersected_by(span)[source]

Checks if this span is intersected by given span

Parameters

span – the span to be intersected by

Returns

boolean indicating intersection
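The two predicates can be illustrated on bare offset pairs. The sketch assumes half-open ranges, with start included and stop excluded, matching Python slicing. Whether Docria treats touching spans the same way is not stated here, so the edge-case behaviour below is an assumption. The function names mirror the methods but are standalone helpers for this example.

```python
def covered_by(span, other):
    """True if `span` lies entirely inside `other` (both are half-open (start, stop) pairs)."""
    return other[0] <= span[0] and span[1] <= other[1]


def intersected_by(span, other):
    """True if the two half-open ranges share at least one position."""
    return span[0] < other[1] and other[0] < span[1]


# Offsets into "Lund is a city in Sweden."
lund = (0, 4)        # "Lund"
city = (10, 14)      # "city"
phrase = (8, 24)     # "a city in Sweden"

print(covered_by(city, phrase), intersected_by(lund, city))
```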

span_to(right_span)[source]

Helper function to return new TextSpan from this position to the given span

Parameters

right_span (TextSpan) – right most span

Return type

TextSpan

Returns

TextSpan

text_to(right_span)[source]

Helper function to return the text from this position to the given span

Parameters

right_span (TextSpan) – right most span

Return type

str

Returns

string

trim()[source]

Return trimmed span range by whitespace, move start forward, stop backward until something which is not whitespace is encountered.

Returns

self or new instance if new span

trim_()[source]

Trim this span in-place by removing whitespace, move start forward, stop backward until something which is not whitespace is encountered.

Returns

self
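The trimming rule described for trim and trim_ (start moves forward, stop moves backward, until a non-whitespace character is met) can be sketched on raw offsets. `trim_offsets` is a hypothetical helper that returns a new pair of offsets. It does not model the difference between returning a new span and modifying one in place.

```python
def trim_offsets(text, start, stop):
    """Move start forward and stop backward past whitespace; return the new (start, stop)."""
    while start < stop and text[start].isspace():
        start += 1
    while stop > start and text[stop - 1].isspace():
        stop -= 1
    return start, stop


text = "Lund is a city in Sweden."
trimmed = trim_offsets(text, 4, 8)       # the range 4..8 covers " is "
print(trimmed, repr(text[trimmed[0]:trimmed[1]]))
```

In this sketch, a range that contains only whitespace collapses to an empty range.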

docria.algorithm

docria.codec

Codecs, encoding/decoding documents to/from binary or text representations

Classes

Codec

Utility methods for all codecs

JsonCodec

JSON codec

MsgpackCodec

MessagePack document codec

MsgpackDocument(rawdata[, ref])

MessagePack Document, allows partial decoding

MsgpackDocumentExt(doc)

Embeddable document as an extended type

Exceptions

DataError(message)

Serialization/Deserialization failure

class docria.codec.Codec[source]

Bases: object

Utility methods for all codecs

static commit_layers(doc, types, schema, all_nodes)[source]

Do post-processing after deserialization phase, for instance replace node ids with node references.

Parameters
  • doc (Document) – the document

  • types (List[str]) – layer names

  • schema (Dict[str, List[Tuple[str, any]]]) – schema definition

  • all_nodes (Dict[str, List[Node]]) – dictionary of all nodes

exception docria.codec.DataError(message)[source]

Bases: Exception

Serialization/Deserialization failure

__init__(message)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.codec.JsonCodec[source]

Bases: object

JSON codec

class docria.codec.MsgpackCodec[source]

Bases: object

MessagePack document codec

static compute_text_offsets(doc, texts)[source]

Computes all offsets and inserts text into document

static decode(data, **kwargs)[source]

Decode message pack encoded document

Parameters

data – bytes or file-like object

Returns

Document instance

static encode(doc, **kwargs)[source]

Encode document using MessagePack encoder

Parameters
  • doc – the document to encode

  • kwargs – passed along to Codec.encode and Document.compile

Returns

bytes of the document

Raises

SchemaValidationError

class docria.codec.MsgpackDocument(rawdata, ref=None)[source]

Bases: object

MessagePack Document, allows partial decoding

__init__(rawdata, ref=None)[source]

Initialize self. See help(type(self)) for accurate signature.

binary()[source]

Get this document as binary value

Return type

bytes

document(*layers, **kwargs)[source]

Get fully decoded document

properties(*props)[source]

Get document properties

schema()[source]

Get document schema

texts(*texts)[source]

Get document text

class docria.codec.MsgpackDocumentExt(doc)[source]

Bases: docria.model.ExtData

Embeddable document as an extended type

__init__(doc)[source]

Initialize self. See help(type(self)) for accurate signature.

class docria.codec.XmlCodec[source]

Bases: object

XML Codec, only encoding support

static encode_intermediate(doc, **kwargs)[source]

Conversion of docria document into an intermediate form: texts, schema and layer data.

Parameters
  • doc – docria document

  • kwargs – options for compile

Returns

static encode_tree(doc, verbose=False, verbose_node_spans=False, document_id='', **kwargs)[source]

Encodes a docria document into an XML representation.

Parameters
  • doc (Document) – docria document

  • verbose – add extra attributes to the XML data for readability and simpler tooling

  • verbose_node_spans – add extra nodes for each node, materializing the span for readability

  • document_id – the global unique document id

  • kwargs – additional options, see XmlCodec.encode_intermediate for options

Return type

ElementTree

Returns

static encode_utf8string(doc, **kwargs)[source]

Encode docria document into an XML string.

Parameters
  • doc (Document) – docria document

  • kwargs – additional options, see XmlCodec.encode_tree and XmlCodec.encode_intermediate for options.

Returns

docria.storage

docria.printout
