Docria

Docria provides a hypergraph document model implementation with a focus on NLP (Natural Language Processing) applications.

Docria provides:

  • In-memory object representations in Python and Java

  • Binary serialization format based on MessagePack

  • File formats optimized for storing and accessing millions of documents locally and in a cluster context

Quickstart

To install the Python version:

pip install docria

The first steps

from docria.model import Document, DataTypes as T
import regex as re

# Stupid tokenizer
tokenizer = re.compile(r"[a-zA-Z]+|[0-9]+|[^\s]")
starts_with_uppercase = re.compile(r"[A-Z].*")

doc = Document()

# Create a new text context called 'main' with the text 'This code was written in Lund, Sweden.'
doc.maintext = "This code was written in Lund, Sweden."
#               01234567890123456789012345678901234567
#               0         1         2         3
main_text = doc.maintext

# Create a new layer with fields: id, uppercase and text.
#
# Fields:
#   id is an int32
#   uppercase is a boolean indicating if the token is uppercase
#   text is a textspan from context 'main'
#
tokens = doc.add_layer("token", id=T.int32(), uppercase=T.bool(), text=T.span())

# Adding nodes: Solution 1
i = 0
for m in tokenizer.finditer(str(main_text)):
    token_node = tokens.add(id=i, text=main_text[m.start():m.end()])

    # Check whether the token starts with an uppercase letter
    token_node["uppercase"] = starts_with_uppercase.fullmatch(m[0]) is not None
    i += 1

# Reading nodes
for tok in tokens:
    print(tok["text"])

# Filtering, only uppercase tokens
for tok in tokens[tokens["uppercase"] == True]:
    print(tok["text"])
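As a sanity check, the tokenizer pattern above splits text into alphabetic runs, digit runs, and single non-whitespace characters. A minimal standalone sketch using the standard-library re module (which handles this particular pattern the same way as the third-party regex module):

```python
import re

# Same pattern as the quickstart: letters, digits, or a single non-space character
tokenizer = re.compile(r"[a-zA-Z]+|[0-9]+|[^\s]")

tokens = [m[0] for m in tokenizer.finditer("This code was written in Lund, Sweden.")]
print(tokens)
# ['This', 'code', 'was', 'written', 'in', 'Lund', ',', 'Sweden', '.']
```

Note that punctuation attached to a word ("Lund,") comes out as two tokens, which is why the span-based indexing above uses m.start() and m.end() rather than whitespace splitting.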

Concepts

The document model consists of the following concepts:

  • Document: The overall container for everything (all nodes, layers, texts must be contained within)

  • Document properties: a single dictionary per document to store metadata in.

  • Text: The basic text representation, a wrapped string to track spans.

  • Text Spans: Subsequence of a text; can always be converted into a plain string using str(span)

  • Node Spans: A start and stop node in a layer, producing a sequence of nodes.

  • Layer: Collection of nodes

  • Layer Schema: Definition of field names and types used when the document is serialized

  • Node: Single node with zero or more fields with values

  • Node fields: Key/value pairs.

from docria.model import Document

doc = Document()
doc.maintext # alias to doc.text["main"] with special support for
             # creating a main text via doc.maintext = "string"

doc.props  # Document metadata dictionary
doc.layers # Layer dictionary, layer name to node layer collection
doc.layer  # Alias to above
doc.texts  # Text dictionary.
doc.text   # Alias to above
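To make the span concept concrete, here is a toy stand-in (not docria's actual TextSpan class) showing how a span tracks offsets into a shared text and converts to a plain string:

```python
class ToySpan:
    """Toy illustration of a text span: a pair of offsets into a shared string."""

    def __init__(self, text, start, stop):
        self.text = text
        self.start = start
        self.stop = stop

    def __str__(self):
        # A span can always be converted into a plain string
        return self.text[self.start:self.stop]


text = "This code was written in Lund, Sweden."
span = ToySpan(text, 25, 29)
print(str(span))  # Lund
```

The real TextSpan additionally keeps offsets stable under serialization, which is what lets layers annotate the same text without copying it.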

Examples

Reading document collections

from docria.storage import MsgpackDocumentReader
from docria.codec import MsgpackDocument

with MsgpackDocumentReader(open("path_to_your_docria_file.docria", "rb")) as reader:
   for rawdoc in reader:
      # rawdoc is of type MsgpackDocument
      doc = rawdoc.document() #  type: docria.Document

      # Print the schema
      doc.printschema()

      for token in doc["token"]:
         # ... do something with the data contained within.
         pass

# You can also use MsgpackDocumentReader without a context manager,
# calling .close() manually when done, or letting the GC clean it up.

The principle is mostly the same with :class:`~docria.storage.TarMsgpackReader`, with the exception that it expects a file path rather than a file-like object.

Writing document collections

from docria.storage import MsgpackDocumentReader, MsgpackDocumentWriter
from docria.codec import MsgpackDocument

with MsgpackDocumentWriter(open("path_to_your_docria_file.docria", "wb")) as writer:
   # using the previous doc in "The first steps"
   writer.write(doc)

# Rewriting or filtering
with MsgpackDocumentWriter(open("path_to_your_output_docria_file.docria", "wb")) as writer:
   with MsgpackDocumentReader(open("path_to_your_input_docria_file.docria", "rb")) as reader:
      for rawdoc in reader:
         writer.write(rawdoc)  # this only decompresses and memory-copies the raw data

The principle is mostly the same with :class:`~docria.storage.TarMsgpackWriter`, with the exception that it expects a file path rather than a file-like object.

Reading and writing documents to bytes

import io

from docria.codec import MsgpackCodec, MsgpackDocument

binarydata = bytes()  # from any location
binarydata = io.BytesIO()  # or a file-like object

# To decode into a document
doc = MsgpackCodec.decode(binarydata)

# To encode a document into bytes
binarydata = MsgpackCodec.encode(doc)

# Access data without a full deserialization
rawdoc = MsgpackDocument(binarydata)
rawdoc.properties()  # Document metadata as dictionary

# Document texts: dictionary from name to list of strings
# (each segment potentially carrying annotations); join them to get the full text.
rawdoc.texts()

schema = rawdoc.schema() # advanced access to the contents of this document, lists layers and fields.

doc = rawdoc.document() # full document deserialization

Layer and field query

from docria import Document, DataTypes as T, NodeSpan, NodeList

doc = Document()
doc.maintext = "Lund is a city in Sweden."
#               0123456789012345678901234
#               0         1         2

# Only ordered layers exist in docria; all nodes are added sequentially.
# T.span() is equivalent to T.span("main"), which refers to the main text
token_layer = doc.add_layer("token", part_of_speech=T.string(), text=T.span(), head=T.noderef("token"))

# Annotation output by CoreNLP 3.9.2 and Basic dependencies
# We set node references later.
first = token_layer.add(part_of_speech="NNP", text=doc.maintext[0:4])
token_layer.add(part_of_speech="VBZ", text=doc.maintext[5:7])
token_layer.add(part_of_speech="DT", text=doc.maintext[8:9])
token_layer.add(part_of_speech="NN", text=doc.maintext[10:14])
token_layer.add(part_of_speech="IN", text=doc.maintext[15:17])
token_layer.add(part_of_speech="NNP", text=doc.maintext[18:24])
last = token_layer.add(part_of_speech=".", text=doc.maintext[24:])

# Create a node span and convert into a list
sent_tokens = NodeSpan(first, last).to_list()

# When setting heads, no validation takes place.
sent_tokens[0]["head"] = token_layer[3] # head = city
sent_tokens[1]["head"] = token_layer[3] # head = city
sent_tokens[2]["head"] = token_layer[3] # head = city
sent_tokens[4]["head"] = token_layer[5] # head = Sweden
sent_tokens[5]["head"] = token_layer[3] # head = city
sent_tokens[6]["head"] = token_layer[3] # head = city

sent_tokens.validate() # We can manually run validation for these nodes to fail faster.

# This first query finds all roots by checking if the head is None, and finally picks the first one.
first_root = token_layer[token_layer["head"].is_none()].first()

# This second query finds all nodes with the head equal to first_root
tokens_with_head_first_root = token_layer[token_layer["head"] == first_root]

# Then we print tokens in layer order, from the matching token up to and including the root token
for tok in tokens_with_head_first_root:
    # iter_span is invariant to argument order; it always yields nodes from low id to high id.
    print(NodeList(first_root.iter_span(tok))["text"].to_list())
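The character offsets used in the example above can be double-checked with plain string slicing:

```python
text = "Lund is a city in Sweden."
# The (start, stop) pairs passed to doc.maintext[...] in the example
offsets = [(0, 4), (5, 7), (8, 9), (10, 14), (15, 17), (18, 24), (24, 25)]

print([text[a:b] for a, b in offsets])
# ['Lund', 'is', 'a', 'city', 'in', 'Sweden', '.']
```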

Change presentation settings

The settings used for pretty printing are controlled by the global variable docria.printout.options, which is an instance of docria.printout.PrintOptions.

By convention, pretty printing outputs [layer name]#[internal id], where the internal id can be used to look up the node. However, this id is only guaranteed to be stable as long as the layer is unchanged; once the layer changes, the id becomes invalid.

For references in general, use the Node object.

API Reference

docria.model

Docria document model (primary module)

docria.algorithm

Functions for various processing purposes

docria.codec

Codecs, encoding/decoding documents to/from binary or text representations

docria.storage

docria.printout

Presentation module, utilities for formatting document objects.

docria.model

Docria document model (primary module)

Classes

DataType(typename, **kwargs)

Data type declaration

DataTypeBinary(typename, **kwargs)

DataTypeBool(typename, **kwargs)

DataTypeEnum(value)

Type names

DataTypeFloat(typename, **kwargs)

DataTypeInt32(typename, **kwargs)

DataTypeInt64(typename, **kwargs)

DataTypeNoderef(typename, **kwargs)

DataTypeNoderefList(typename, **kwargs)

DataTypeNodespan(typename, **kwargs)

DataTypeString(typename, **kwargs)

DataTypeTextspan(typename, **kwargs)

DataTypes()

Common datatypes and factory methods for parametrical types

Document(**kwargs)

The document which contains all data

ExtData(type, data)

User-defined typed data container

Node(*args, **kwargs)

Basic building block of the document model

NodeCollection(fieldtypes)

NodeCollectionQuery(collection, predicate)

Represents a query to document data

NodeFieldCollection(collection, field)

Field from a node collection

NodeLayerCollection(schema)

Node collection; internally a list with gaps which compacts when 25% of the list is empty.
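The "list with gaps" idea can be sketched as follows (a toy illustration only; docria's actual NodeLayerCollection is more involved):

```python
class GappedList:
    """Toy list-with-gaps: deletions leave None holes instead of shifting,
    and the list compacts once 25% of the slots are empty."""

    def __init__(self):
        self.items = []
        self.gaps = 0

    def add(self, item):
        self.items.append(item)

    def remove(self, index):
        self.items[index] = None  # leave a gap instead of shifting elements
        self.gaps += 1
        if self.gaps * 4 >= len(self.items):  # 25% or more empty: compact
            self.items = [x for x in self.items if x is not None]
            self.gaps = 0


gl = GappedList()
for i in range(8):
    gl.add(i)
gl.remove(0)  # 1 of 8 slots empty: below threshold, gap remains
gl.remove(1)  # 2 of 8 slots empty (25%): triggers compaction
print(gl.items)  # [2, 3, 4, 5, 6, 7]
```

Leaving gaps makes deletion O(1) while keeping node ids stable between compactions, which matches the note above that internal ids are only valid while the layer is unchanged.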

NodeLayerSchema(name)

Node layer declaration

NodeList(*elems[, fieldtypes])

Python list enriched with extra indexing and presentation functionality for optimal use in Docria.

NodeSpan(left_most_node, right_most_node)

Represents a span of nodes in a layer

Offset(offset)

Text offset object

Text(name, text)

Text object, consisting of text and an index of current offsets

TextSpan(text, start_offset, stop_offset)

Text span, consisting of a start and stop offset.

Exceptions

DataValidationError(message)

Failed to validate document

SchemaError(message)

Failed to validate a part of the schema

SchemaValidationError(message, fields)

Schema validation failed

docria.algorithm

Functions for various processing purposes

Functions

bfs(start, children[, is_result])

Breadth first search
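For orientation, a generic breadth-first search driven by a children function might look like this (a sketch of the general technique, not docria's exact implementation; the real bfs also takes an is_result filter):

```python
from collections import deque


def bfs(start, children):
    """Yield nodes in breadth-first order from start, where
    children(node) returns an iterable of neighbours."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        yield node
        for child in children(node):
            if child not in seen:
                seen.add(child)
                queue.append(child)


# A small diamond-shaped graph as an adjacency dict
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(list(bfs("a", graph.get)))  # ['a', 'b', 'c', 'd']
```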

chain(*fns)

Create a new function from a sequence of functions which will be applied in order

children_of(layer, *props)

Get children of a given property

dfs(start, children[, is_result])

Depth first search

dfs_leaves(start, children[, is_result])

Depth first search, only returning the leaves

dominant_right(segments)

Resolves overlapping segments by using the dominant right rule

dominant_right_span(nodes[, spanfield])

Resolves overlapping spans by using the dominant right rule

get_prop(prop[, default])

First order function which can be used to extract property of nodes

group_by_span(group_nodes, layer_nodes[, …])

Groups all nodes in layer_nodes into the corresponding bucket_node

is_covered_by(span_a, span_b)

Covered-by predicate: returns true if span_a (TextSpan) is covered by span_b (TextSpan)
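The predicate reduces to an interval-containment check; with plain (start, stop) tuples standing in for TextSpan objects, it might look like:

```python
def is_covered_by(span_a, span_b):
    # True if span_a lies entirely within span_b
    # ((start, stop) tuples stand in for TextSpan objects here)
    return span_b[0] <= span_a[0] and span_a[1] <= span_b[1]


print(is_covered_by((5, 7), (0, 10)))   # True
print(is_covered_by((5, 12), (0, 10)))  # False
```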

span_translate(doc, mapping_layer, …)

Translate span ranges from a partial extraction to the original data.

docria.codec

Codecs, encoding/decoding documents to/from binary or text representations

Classes

Codec()

Utility methods for all codecs

JsonCodec()

JSON codec

MsgpackCodec()

MessagePack document codec

MsgpackDocument(rawdata[, ref])

MessagePack Document, allows partial decoding

MsgpackDocumentExt(doc)

Embeddable document as a extended type

Exceptions

DataError(message)

Serialization/Deserialization failure

docria.storage

Functions

build_msgpack_directory_fileindex(path, *props)

Construct a document index spanning over multiple docria files.

build_msgpack_fileindex(path, *props)

Construct a document index

Classes

DocumentFileIndex(filepath, properties, docrefs)

In-memory index of a single docria file

DocumentIO()

DocumentIndex([basepath])

Multi-file in-memory index

DocumentReader(inputreader)

Utility reader, returns Docria documents.

MsgpackDocumentBlock(position, rawbuffer)

Represents a block of MessagePack docria documents

MsgpackDocumentIO()

MessagePack Document I/O class

MsgpackDocumentReader(inputio)

Reader for the blocked MessagePack document file format

MsgpackDocumentWriter(outputio[, …])

Writer for the blocked MessagePack document file format

TarMsgpackReader(inputpath[, mode])

Reader for the tar-based sequential MessagePack format.

TarMsgpackWriter(outputpath[, docformat, …])

Writer for the tar-based sequential MessagePack format.

docria.printout

Presentation module, utilities for formatting document objects.

Functions

get_representation(value)

set_large_screen()

Sets options to higher than default widths

truncate(text)

urn_link_wikidata(partial)

urn_link_wikipedia(partial)

Classes

PrintOptions()

Presentation settings

Table([caption, style, hide_index, hide_headers])

Table representation for text and HTML

TableCell([text, html])

TableRow(*elems[, index])

TableStyle([padding])
