docria.collection

I/O module, read/write collections of documents

Functions

build_msgpack_directory_fileindex(path, *props)

Construct a document index spanning multiple docria files.

build_msgpack_fileindex(path, *props)

Construct a document index

get_codec(name[, no_except])

Return type: Optional[CompressionCodec]

register_codec(name, compress, decompress[, ...])

unregister_codec(name)

Classes

CompressionCodec(name, compress, decompress)

DocumentFileIndex(filepath, properties, docrefs)

In-memory index of a single docria file

DocumentIO()

DocumentIndex([basepath])

Multi-file in-memory index

DocumentReader(inputreader)

Utility reader, returns Docria documents.

MsgpackDocumentBlock(position, rawbuffer)

Represents a block of MessagePack docria documents

MsgpackDocumentIO()

MessagePack Document I/O class

MsgpackDocumentReader(inputio)

Reader for the blocked MessagePack document file format

MsgpackDocumentWriter(outputio[, ...])

Writer for the blocked MessagePack document file format

TarMsgpackReader(inputpath[, mode])

Reader for the tar-based sequential MessagePack format.

TarMsgpackWriter(outputpath[, docformat, ...])

Writer for the tar-based sequential MessagePack format.


class docria.collection.DocumentFileIndex(filepath, properties, docrefs)[source]

In-memory index of a single docria file

__init__(filepath, properties, docrefs)[source]

Constructor of DocumentFileIndex

Parameters
  • filepath (str) – path to MessagePack Document file

  • properties (Dict[str, Dict[any, List[int]]]) – the property index, dictin

  • docrefs (List[Tuple[int, int]]) – list of document references
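The two structures above can be pictured with a small stand-alone sketch. It assumes that the integers in the property index are positions in the docrefs list and that each docref is a (file position, block position) pair. The source does not confirm either detail, and the lookup helper is hypothetical, not part of docria.

```python
from typing import Any, Dict, List, Tuple

# Toy document references: assumed to be (file position, block position).
docrefs: List[Tuple[int, int]] = [(0, 0), (0, 57), (4096, 0)]

# Property index: property name -> property value -> positions in docrefs
# (the meaning of the integers is an assumption made for this sketch).
properties: Dict[str, Dict[Any, List[int]]] = {
    "lang": {"en": [0, 2], "sv": [1]},
}


def lookup(prop: str, value: Any) -> List[Tuple[int, int]]:
    """Hypothetical helper: references of documents whose property equals value."""
    return [docrefs[i] for i in properties.get(prop, {}).get(value, [])]


english_refs = lookup("lang", "en")
missing_refs = lookup("lang", "de")
```

Keeping only integer positions in the per-value lists keeps the index compact, since each reference tuple is stored once.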

class docria.collection.DocumentIO[source]

Deprecated: use concrete variants instead, such as MsgpackDocumentIO.

class docria.collection.DocumentIndex(basepath='.')[source]

Multi-file in-memory index

__init__(basepath='.')[source]

static load(path)[source]

Load pickle index

save(path)[source]

Save index as a pickle file
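Since save and load are described as pickle-based, the round trip can be illustrated with the standard library alone. The sketch below uses a plain dictionary as a stand-in for an index; the dictionary layout and the file name are assumptions, not docria's internal representation.

```python
import os
import pickle
import tempfile

# Stand-in for an index object (layout is hypothetical).
index_data = {"basepath": ".", "files": ["a.docria", "b.docria"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.pickle")

    # Save: serialize the object to a binary file.
    with open(path, "wb") as fout:
        pickle.dump(index_data, fout)

    # Load: deserialize it again.
    with open(path, "rb") as fin:
        restored = pickle.load(fin)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.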

class docria.collection.DocumentReader(inputreader)[source]

Utility reader, returns Docria documents.

__init__(inputreader)[source]

class docria.collection.MsgpackDocumentBlock(position, rawbuffer)[source]

Represents a block of MessagePack docria documents

__iter__()[source]

Returns

self

__next__()[source]

Returns

MsgpackDocument with the encoded document

__init__(position, rawbuffer)[source]

documents()[source]

Return all documents as a list of tuples (position, MessagePack Docria document)

Return type

List[Tuple[int, MsgpackDocument]]

property position: int

Get the original byte position

Return type

int

tell()[source]

Get the current byte position within this block

Return type

int
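The iterator protocol, position and tell() described above can be mimicked by a toy class. This is a sketch under stated assumptions: ToyBlock and the 4-byte length-prefixed framing are invented for illustration and do not reflect docria's actual block encoding.

```python
import io
import struct
from typing import Iterable, List, Tuple


class ToyBlock:
    """Toy block: iterates over length-prefixed records in a byte buffer."""

    def __init__(self, position: int, rawbuffer: bytes):
        self._position = position          # byte position of the block in its file
        self._buffer = io.BytesIO(rawbuffer)

    @property
    def position(self) -> int:
        return self._position

    def tell(self) -> int:
        # Current byte offset inside this block.
        return self._buffer.tell()

    def __iter__(self):
        return self

    def __next__(self) -> bytes:
        header = self._buffer.read(4)
        if len(header) < 4:
            raise StopIteration
        (size,) = struct.unpack("<I", header)
        return self._buffer.read(size)

    def documents(self) -> List[Tuple[int, bytes]]:
        # Pair each record with the offset at which it starts.
        docs = []
        while True:
            start = self.tell()
            try:
                docs.append((start, next(self)))
            except StopIteration:
                return docs


def encode(records: Iterable[bytes]) -> bytes:
    """Hypothetical framing: 4-byte little-endian length, then the payload."""
    return b"".join(struct.pack("<I", len(r)) + r for r in records)


block = ToyBlock(position=1024, rawbuffer=encode([b"alpha", b"be"]))
docs = block.documents()
```

Recording the offset before each read is what makes a (block position, document) pair usable later as a reference into the block.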

class docria.collection.MsgpackDocumentIO[source]

MessagePack Document I/O class

static read(filepath, **kwargs)[source]

Read a document collection

Parameters
  • filepath – the source filepath

  • kwargs – arguments for reading

Return type

MsgpackDocumentReader

Returns

reader for collection

static readfile(filelike, **kwargs)[source]

Read a document collection from a file-like object

Parameters
  • filelike – the file-like reader

  • kwargs – arguments for reading

Return type

MsgpackDocumentReader

Returns

reader for collection

class docria.collection.MsgpackDocumentReader(inputio)[source]

Reader for the blocked MessagePack document file format

__init__(inputio)[source]

Construct a document reader

Parameters

inputio (Union[RawIOBase, str]) – path to a docria file for reading or a file object (e.g. object returned by open)

blocks()[source]

Get iterator for all document blocks

get(ref)[source]

Returns a specific document at position (file position, block position)

Parameters

ref – tuple of (raw file position, uncompressed block position)

Returns

MessagePack document instance

Note

This method assumes and requires that the underlying I/O supports seeking.
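Whether a given file object supports seeking can be checked up front with the standard seekable() method. The sketch below is independent of docria; ForwardOnly and supports_random_access are hypothetical names used only to show a stream that reports no seek support.

```python
import io


class ForwardOnly(io.RawIOBase):
    """Minimal stream that does not override seekable(), so it reports False."""


def supports_random_access(stream) -> bool:
    # io streams expose seekable(); random access needs it to be True.
    return stream.seekable()


in_memory = io.BytesIO(b"\x00" * 16)
forward_only = ForwardOnly()

in_memory_ok = supports_random_access(in_memory)
forward_only_ok = supports_random_access(forward_only)
```

Regular files opened with open(path, "rb") are seekable, while pipes and sockets typically are not.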

readblock()[source]

Read a single block if possible

Return type

Optional[MsgpackDocumentBlock]

seek(position)[source]

Seek to a block position

Parameters

position – raw file position

Note

This method assumes and requires that the underlying I/O supports seeking.

class docria.collection.MsgpackDocumentWriter(outputio, num_docs_per_block=128, codec=<docria.collection.CompressionCodec object>, mode='xb', **kwargs)[source]

Writer for the blocked MessagePack document file format

__init__(outputio, num_docs_per_block=128, codec=<docria.collection.CompressionCodec object>, mode='xb', **kwargs)[source]

Construct a document writer.

If a string path is provided, mode xb is used by default, meaning the call fails if the file already exists.

Parameters
  • outputio (Union[RawIOBase, str]) – path to a new docria file to write to or a file object

  • num_docs_per_block – the number of documents to cache before compressing the entire block and write to underlying storage.

  • codec – the compression codec to use for blocks

  • mode – if outputio is a string path, the mode to use, by default xb
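The effect of exclusive-creation mode can be seen with Python's built-in open(), which is presumably what handles a string path here (an assumption; the source only names the mode). The file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "collection.docria")

    # First open succeeds because the file does not exist yet.
    with open(path, "xb") as fout:
        fout.write(b"first")

    # Second open in 'xb' mode refuses to overwrite the existing file.
    try:
        with open(path, "xb") as fout:
            fout.write(b"second")
        second_open_failed = False
    except FileExistsError:
        second_open_failed = True

    with open(path, "rb") as fin:
        content = fin.read()
```

Exclusive creation protects an existing collection from being silently truncated; a mode such as wb would have to be chosen explicitly to overwrite.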

close()[source]

Flush data and close the underlying storage

flush()[source]

Flush data to the underlying storage.

Note

Forces the currently cached documents to be compressed and written to disk. This may produce blocks with fewer than the specified number of documents per block.
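The interaction between the block size and an explicit flush can be sketched with a toy buffer. ToyBlockWriter is a hypothetical stand-in that keeps blocks in memory and skips compression; it is not docria's writer.

```python
from typing import List


class ToyBlockWriter:
    """Toy writer: groups documents into blocks of a fixed maximum size."""

    def __init__(self, num_docs_per_block: int = 128):
        self.num_docs_per_block = num_docs_per_block
        self.pending: List[bytes] = []        # documents not yet emitted
        self.blocks: List[List[bytes]] = []   # emitted blocks

    def write(self, doc: bytes) -> None:
        self.pending.append(doc)
        if len(self.pending) >= self.num_docs_per_block:
            self.flush()

    def flush(self) -> None:
        # Emit whatever is pending, even if the block is not full.
        if self.pending:
            self.blocks.append(self.pending)
            self.pending = []


writer = ToyBlockWriter(num_docs_per_block=3)
for i in range(4):
    writer.write(b"doc%d" % i)
writer.flush()  # forces out the single leftover document

block_sizes = [len(block) for block in writer.blocks]
```

Frequent flushing therefore trades compression efficiency for durability, because small blocks give the codec less data to work with.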

write(doc, **kwargs)[source]

Write docria document

class docria.collection.TarMsgpackReader(inputpath, mode='r|gz', **kwargs)[source]

Reader for the tar-based sequential MessagePack format.

__init__(inputpath, mode='r|gz', **kwargs)[source]

TarMsgpackReader constructor

Parameters
  • inputpath – filepath to tar

  • mode – the tarball reading mode passed to tarfile.open(); can be used to select bz2 or lzma compression modes.
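The mode strings follow the conventions of tarfile.open(), where a pipe character selects sequential stream access. The sketch below uses only the standard library to write and read small archives with gzip and bz2 stream modes; write_tar, read_tar and the member names are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def write_tar(path, mode, members):
    """Write (name, payload) pairs into a tar archive opened with the given mode."""
    with tarfile.open(path, mode) as tar:
        for name, payload in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def read_tar(path, mode):
    """Read all members sequentially, as stream modes require."""
    out = []
    with tarfile.open(path, mode) as tar:
        for member in tar:
            out.append((member.name, tar.extractfile(member).read()))
    return out


members = [("doc00000.msgpack", b"a"), ("doc00001.msgpack", b"bb")]

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "docs.tar.gz")
    bz2_path = os.path.join(tmp, "docs.tar.bz2")
    write_tar(gz_path, "w|gz", members)
    write_tar(bz2_path, "w|bz2", members)
    from_gz = read_tar(gz_path, "r|gz")
    from_bz2 = read_tar(bz2_path, "r|bz2")
```

Stream modes do not allow seeking backwards, which matches the sequential nature of this format.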

class docria.collection.TarMsgpackWriter(outputpath, docformat='doc%05d.msgpack', rootdir=None, mode='w|gz', **kwargs)[source]

Writer for the tar-based sequential MessagePack format.

__init__(outputpath, docformat='doc%05d.msgpack', rootdir=None, mode='w|gz', **kwargs)[source]

TarMsgpackWriter constructor

Parameters
  • outputpath – filepath to tar

  • docformat – naming convention of files in the tarball; must include a single integer placeholder using old-style (%) string formatting.

  • rootdir – set to a string if a root directory within the tarfile should be used.

  • mode – the tarball writing mode passed to tarfile.open(); can be used to select bz2 or lzma compression modes.
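How the default docformat expands can be shown with plain old-style formatting. In this sketch, member_name is a hypothetical helper, and joining rootdir with a forward slash is an assumption about how the root directory is applied, not confirmed by the source.

```python
from typing import Optional

docformat = "doc%05d.msgpack"   # default shown in the signature above
rootdir = "corpus"              # hypothetical root directory name


def member_name(index: int, docformat: str, rootdir: Optional[str] = None) -> str:
    """Hypothetical helper: build the archive member name for a document number."""
    name = docformat % index    # old-style formatting with one integer placeholder
    return f"{rootdir}/{name}" if rootdir else name


names = [member_name(i, docformat) for i in (0, 7, 12345)]
nested = member_name(42, docformat, rootdir)
```

Zero-padding keeps the member names in numeric order when they are sorted as strings, as long as the document count fits the chosen width.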

write(doc)[source]

Write document

Parameters

doc (Union[Document, MsgpackDocument]) – accepts unencoded Document, and encoded MsgpackDocument for fast conversion.

docria.collection.build_msgpack_directory_fileindex(path, *props, basepath='.', num_workers=None)[source]

Construct a document index spanning multiple docria files.

Parameters
  • path – path to the directory containing docria files

  • props – the properties to index

  • basepath – the relative path to use when saving filepath locations

  • num_workers – the number of processes to spawn for multicore processing of files; default is the number of cores available as given by multiprocessing.cpu_count().

Return type

DocumentIndex

Returns

populated DocumentIndex

Note

basepath can be used to create an index which only has relative references and thus can be included with the document collection.
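The idea of storing locations relative to basepath can be illustrated with os.path.relpath. This is only a sketch of the concept: relative_location is a hypothetical helper, and the source does not state how docria computes the stored paths.

```python
import os


def relative_location(filepath: str, basepath: str) -> str:
    """Hypothetical helper: express filepath relative to basepath."""
    return os.path.relpath(filepath, start=basepath)


# A file inside data/corpus, indexed with basepath set to data.
stored = relative_location(os.path.join("data", "corpus", "part-0.docria"), "data")

# Resolving the stored location later against wherever the collection now lives.
resolved = os.path.join("mirror", stored)
```

Because the stored location carries no absolute prefix, the collection directory and its index can be moved or copied together without rebuilding the index.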

docria.collection.build_msgpack_fileindex(path, *props)[source]

Construct a document index

Parameters
  • path – path to file which can be read by MsgpackDocumentReader

  • props – the properties to index

Return type

DocumentFileIndex

Returns

built index