docria.codec

Codecs, encoding/decoding documents to/from binary or text representations

Classes

Codec()

Utility methods for all codecs

JsonCodec()

JSON codec

MsgpackCodec()

MessagePack document codec

MsgpackDocument(data_or_document[, ref])

MessagePack Document, allows partial decoding

MsgpackDocumentExt(doc)

Embeddable document as an extended type

XmlCodec()

XML Codec, only encoding support

Exceptions

DataError(message)

Serialization/Deserialization failure

class docria.codec.Codec[source]

Utility methods for all codecs

static commit_layers(doc, types, schema, all_nodes)[source]

Do post-processing after deserialization phase, for instance replace node ids with node references.

Parameters
  • doc (Document) – the document

  • types (List[str]) – layer names

  • schema (Dict[str, List[Tuple[str, any]]]) – schema definition

  • all_nodes (Dict[str, List[Node]]) – dictionary of all nodes
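The id-to-reference replacement described above can be pictured with plain Python structures. This is a minimal sketch and not docria's implementation: the `("ref", layer, index)` placeholder encoding, the `FakeNode` alias, and the `resolve_node_refs` helper are all assumptions made for illustration. Only the shape of `all_nodes` (layer name mapped to a list of nodes) follows the parameter list above.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for docria's Node: a plain dict of field values.
FakeNode = Dict[str, Any]


def resolve_node_refs(all_nodes: Dict[str, List[FakeNode]]) -> None:
    """Replace ("ref", layer, index) placeholders with the nodes they name."""
    for nodes in all_nodes.values():
        for node in nodes:
            for field, value in node.items():
                # Assumed placeholder format; docria's wire format may differ.
                if isinstance(value, tuple) and len(value) == 3 and value[0] == "ref":
                    _, layer, index = value
                    node[field] = all_nodes[layer][index]


all_nodes = {
    "token": [{"pos": "NN"}, {"pos": "VB"}],
    "dependency": [{"head": ("ref", "token", 1), "label": "nsubj"}],
}
resolve_node_refs(all_nodes)
```

After the call, the `head` field of the dependency node holds the second token node itself rather than its id, which is the kind of fix-up a post-deserialization pass performs.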

exception docria.codec.DataError(message)[source]

Serialization/Deserialization failure

__init__(message)[source]

class docria.codec.JsonCodec[source]

JSON codec

class docria.codec.MsgpackCodec[source]

MessagePack document codec

static compute_text_offsets(doc, texts)[source]

Computes all offsets and inserts text into document

static decode(data, **kwargs)[source]

Decode a MessagePack-encoded document

Parameters

data – bytes or file-like object

Returns

Document instance

static encode(doc, **kwargs)[source]

Encode document using MessagePack encoder

Parameters
  • doc – the document to encode

  • kwargs – passed along to Codec.encode and Document.compile

Raises

SchemaValidationError

Returns

bytes of the document
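A round trip through `MsgpackCodec` might look like the sketch below. It uses only the `encode` and `decode` signatures listed above. The helper name `msgpack_roundtrip` is hypothetical, and no optional keyword arguments are passed because their names are not documented in this section.

```python
def msgpack_roundtrip(doc):
    """Encode a docria Document to MessagePack bytes and decode it back."""
    # Imported lazily so the helper can be defined without docria installed.
    from docria.codec import MsgpackCodec

    data = MsgpackCodec.encode(doc)       # documented to return bytes
    restored = MsgpackCodec.decode(data)  # documented to return a Document
    return data, restored
```

Since `encode` is documented to raise `SchemaValidationError`, callers that build documents from untrusted input may want to wrap the call in a `try` block.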

class docria.codec.MsgpackDocument(data_or_document, ref=None)[source]

MessagePack Document, allows partial decoding

Example

>>> from docria.model import Document, DataTypes as T, Node
>>> from docria.codec import MsgpackDocument
>>>
>>> doc = Document()
>>> tokens = doc.add_layer("token", pos=T.string)
>>> node = Node(pos="NN")
>>> tokens.add_many([ node ])
>>>
>>> # Convert document to msgpack encoded binary data
>>> msgdoc = MsgpackDocument(doc)
>>> bytes_data = msgdoc.binary()  # type: bytes
>>>
>>> # Convert from msgpack encoded binary data to document
>>> newdoc = MsgpackDocument(bytes_data)
>>> doc = newdoc.document()

__init__(data_or_document, ref=None)[source]

Create a MsgpackDocument

Parameters
  • data_or_document – Raw data (bytes, readable) or a Document instance.

  • ref – Used internally to add information about where this document came from.

binary()[source]

Get this document as a binary value

Return type

bytes

document(*layers, **kwargs)[source]

Get fully decoded document

properties(*props)[source]

Get document properties

schema()[source]

Get document schema

texts(*texts)[source]

Get document text
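The accessors above suggest a way to look at a stored document without materializing its layers. The following is a sketch under assumptions: the helper name `inspect_without_full_decode` is hypothetical, and the shapes of the values returned by `schema()`, `properties()` and `texts()` are not specified in this section, so they are passed through unchanged.

```python
def inspect_without_full_decode(bytes_data):
    """Collect schema, properties and texts from MessagePack-encoded bytes."""
    # Imported lazily so the helper can be defined without docria installed.
    from docria.codec import MsgpackDocument

    msgdoc = MsgpackDocument(bytes_data)
    return {
        "schema": msgdoc.schema(),
        "properties": msgdoc.properties(),  # no names given: assumed to mean "all"
        "texts": msgdoc.texts(),            # no names given: assumed to mean "all"
    }
```

Calling `msgdoc.document()` afterwards yields the fully decoded document when the layer data is actually needed.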

class docria.codec.MsgpackDocumentExt(doc)[source]

Embeddable document as an extended type

__init__(doc)[source]

class docria.codec.XmlCodec[source]

XML Codec, only encoding support

static encode_intermediate(doc, **kwargs)[source]

Conversion of docria document into an intermediate form: texts, schema and layer data.

Parameters
  • doc – docria document

  • kwargs – options for compile

Returns

static encode_tree(doc, verbose=False, verbose_node_spans=False, document_id='', **kwargs)[source]

Encodes a docria document into an XML representation.

Parameters
  • doc (Document) – docria document

  • verbose – add extra attributes to the XML data for readability and simpler tooling

  • verbose_node_spans – add extra nodes for each node, materializing the span for readability

  • document_id – the global unique document id

  • kwargs – additional options, see XmlCodec.encode_intermediate for options

Return type

ElementTree

Returns

static encode_utf8string(doc, **kwargs)[source]

Encode docria document into an XML string.

Parameters
  • doc (Document) – docria document

  • kwargs – additional options, see XmlCodec.encode_tree and XmlCodec.encode_intermediate for options.

Returns
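As a usage sketch for the XML encoder: the helper below forwards `verbose` through `encode_utf8string`, on the assumption that its keyword arguments reach `XmlCodec.encode_tree` as the parameter description above indicates. The helper name `to_xml_string` is hypothetical, and the exact return type is not stated in this section.

```python
def to_xml_string(doc, verbose=True):
    """Encode a docria Document as XML, optionally with readability attributes."""
    # Imported lazily so the helper can be defined without docria installed.
    from docria.codec import XmlCodec

    # verbose is documented on encode_tree; assumed to be forwarded via kwargs.
    return XmlCodec.encode_utf8string(doc, verbose=verbose)
```

Because `XmlCodec` supports encoding only, a document that must be read back later should also be stored with `MsgpackCodec` or `JsonCodec`.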