docria.collection
I/O module, read/write collections of documents
Functions
- build_msgpack_directory_fileindex: Construct a document index spanning over multiple docria files.
- build_msgpack_fileindex: Construct a document index.
Classes
- DocumentFileIndex: In-memory index of a single docria file.
- DocumentIndex: Multi-file in-memory index.
- DocumentReader: Utility reader, returns Docria documents.
- MsgpackDocumentBlock: Represents a block of MessagePack docria documents.
- MsgpackDocumentIO: MessagePack Document I/O class.
- MsgpackDocumentReader: Reader for the blocked MessagePack document file format.
- MsgpackDocumentWriter: Writer for the blocked MessagePack document file format.
- TarMsgpackReader: Reader for the tar-based sequential MessagePack format.
- TarMsgpackWriter: Writer for the tar-based sequential MessagePack format.
- class docria.collection.DocumentFileIndex(filepath, properties, docrefs)
In-memory index of a single docria file
- class docria.collection.DocumentIO
Deprecated: use concrete variants instead, such as MsgpackDocumentIO.
- class docria.collection.DocumentReader(inputreader)
Utility reader, returns Docria documents.
- class docria.collection.MsgpackDocumentBlock(position, rawbuffer)
Represents a block of MessagePack docria documents
- documents()
Return all documents as a list of tuples (position, MessagePack Docria document)
- Return type
List[Tuple[int, MsgpackDocument]]
- property position: int
Get the original byte position
- Return type
int
- class docria.collection.MsgpackDocumentIO
MessagePack Document I/O class
- static read(filepath, **kwargs)
Read a document collection
- Parameters
filepath – the source filepath
kwargs – arguments for reading
- Return type
MsgpackDocumentReader
- Returns
reader for collection
- static readfile(filelike, **kwargs)
Read a document collection from a file-like object
- Parameters
filelike – the file-like reader
kwargs – arguments for reading
- Return type
MsgpackDocumentReader
- Returns
reader for collection
- class docria.collection.MsgpackDocumentReader(inputio)
Reader for the blocked MessagePack document file format
- __init__(inputio)
Construct a document reader
- Parameters
inputio (Union[RawIOBase, str]) – path to a docria file for reading or a file object (e.g. object returned by open)
- get(ref)
Returns a specific document at position (file position, block position)
- Parameters
ref – tuple of (raw file position, uncompressed block position)
- Returns
MessagePack document instance
- Note
This method assumes and requires that the underlying I/O supports seeking.
- readblock()
Read a single block if possible
- Return type
Optional[MsgpackDocumentBlock]
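A minimal sketch of draining a reader block by block, using only the two documented calls `readblock()` and `documents()`. It assumes `readblock()` returns `None` once no further block is available, which the `Optional` return type suggests but the text above does not state outright. The helper name `iter_documents` and the stub classes are hypothetical stand-ins so the sketch runs without a docria file.

```python
from typing import Any, Iterator, List, Optional, Tuple


def iter_documents(reader: Any) -> Iterator[Tuple[int, Any]]:
    """Yield (position, document) pairs from every block of a reader.

    Assumption: readblock() returns None when the input is exhausted.
    """
    while True:
        block = reader.readblock()
        if block is None:
            break
        yield from block.documents()


# Hypothetical stand-ins, used only to exercise the loop without docria.
class StubBlock:
    def __init__(self, docs: List[Tuple[int, str]]) -> None:
        self._docs = docs

    def documents(self) -> List[Tuple[int, str]]:
        return list(self._docs)


class StubReader:
    def __init__(self, blocks: List[StubBlock]) -> None:
        self._blocks = list(blocks)

    def readblock(self) -> Optional[StubBlock]:
        return self._blocks.pop(0) if self._blocks else None
```

With a real collection, the object passed in would presumably be a `MsgpackDocumentReader`; the loop itself only depends on the two methods named above.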
- class docria.collection.MsgpackDocumentWriter(outputio, num_docs_per_block=128, codec=<docria.collection.CompressionCodec object>, mode='xb', **kwargs)
Writer for the blocked MessagePack document file format
- __init__(outputio, num_docs_per_block=128, codec=<docria.collection.CompressionCodec object>, mode='xb', **kwargs)
Construct a document writer.
If a string path is provided, mode xb is used by default, meaning construction fails if the file already exists.
- Parameters
outputio (Union[RawIOBase, str]) – path to a new docria file to write to, or a file object
num_docs_per_block – the number of documents to cache before compressing the entire block and writing it to the underlying storage
codec – the compression codec to use for blocks
mode – if outputio is a string path, the mode to use, by default xb
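The default mode `xb` is the standard Python exclusive-creation binary mode. The sketch below shows what that mode does with the built-in `open()` alone, without involving docria; the file name is a hypothetical placeholder.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.docria")  # hypothetical file name

    # First open succeeds: "x" creates the file, "b" selects binary mode.
    with open(path, "xb") as fh:
        fh.write(b"first")

    # Second open of the same path fails instead of overwriting.
    try:
        open(path, "xb")
        second_open_failed = False
    except FileExistsError:
        second_open_failed = True

    # The original content is untouched.
    with open(path, "rb") as fh:
        content_after = fh.read()
```

Going by the parameter description, a caller who does want to replace an existing file would presumably pass a different `mode`, such as `wb`.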
- flush()
Flush data to the underlying storage.
- Note
Will force currently cached blocks to be compressed and written to disk. This might result in blocks having fewer documents than the specified number per block.
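The effect described in the note can be illustrated with a toy buffer. This is not docria's implementation: `ToyBlockBuffer` is a hypothetical class, and `zlib` stands in for whatever codec the writer is configured with. It only shows why flushing early produces a short block.

```python
import zlib
from typing import List


class ToyBlockBuffer:
    """Illustrative only: buffers items, emits one compressed block per batch."""

    def __init__(self, num_docs_per_block: int = 128) -> None:
        self.num_docs_per_block = num_docs_per_block
        self._pending: List[bytes] = []
        self.block_sizes: List[int] = []  # documents per emitted block
        self.blocks: List[bytes] = []     # compressed payloads

    def write(self, doc: bytes) -> None:
        self._pending.append(doc)
        if len(self._pending) >= self.num_docs_per_block:
            self._emit()

    def flush(self) -> None:
        # Emit whatever is cached, even if the block is not full.
        if self._pending:
            self._emit()

    def _emit(self) -> None:
        self.blocks.append(zlib.compress(b"".join(self._pending)))
        self.block_sizes.append(len(self._pending))
        self._pending.clear()


buf = ToyBlockBuffer(num_docs_per_block=4)
for i in range(6):
    buf.write(b"doc%d" % i)
buf.flush()  # forces the two leftover documents out as a short block
```

Here six writes with a block size of four yield one full block and, after `flush()`, one block holding only two documents.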
- write(doc, **kwargs)
Write docria document
- Parameters
doc (Union[Document, MsgpackDocument]) – accepts an unencoded Document or a MessagePack document for fast writing
kwargs – options to pass to docria.codec.MsgpackCodec.encode()
- class docria.collection.TarMsgpackReader(inputpath, mode='r|gz', **kwargs)
Reader for the tar-based sequential MessagePack format.
- class docria.collection.TarMsgpackWriter(outputpath, docformat='doc%05d.msgpack', rootdir=None, mode='w|gz', **kwargs)
Writer for the tar-based sequential MessagePack format.
- __init__(outputpath, docformat='doc%05d.msgpack', rootdir=None, mode='w|gz', **kwargs)
TarMsgpackWriter
- Parameters
outputpath – filepath to tar
docformat – naming convention of files in the tarball, must include a single integer placeholder using old-style string formatting
rootdir – set to a string if a root directory within the tarfile should be used
mode – the tarball writing mode, see tarfile.open(); can be used to select bz2 or lzma compression modes
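The naming convention and the stream modes can be tried with the standard library alone. The sketch below does not use docria: it expands the default `docformat` with old-style formatting and writes placeholder payloads to an in-memory tar stream. How `rootdir` is combined with the member name is an assumption here, and the directory name is hypothetical.

```python
import io
import tarfile

docformat = "doc%05d.msgpack"  # default from the signature above
rootdir = "collection"         # hypothetical root directory name

# Old-style formatting substitutes a running document number.
names = [docformat % i for i in range(3)]

# Assumption: rootdir is joined in front of each member name.
member_names = ["%s/%s" % (rootdir, name) for name in names]

# Write a streamed, gzip-compressed tar to memory; "w|bz2" or "w|xz"
# would select bz2 or lzma compression instead.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w|gz") as tar:
    for member_name in member_names:
        payload = b"placeholder bytes"  # stands in for an encoded document
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Read it back sequentially with the matching stream mode.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r|gz") as tar:
    read_back = [member.name for member in tar]
```

The `|` in the mode string selects a non-seekable stream of tar blocks, which matches the "sequential" wording used for this format.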
- write(doc)
Write document
- Parameters
doc (Union[Document, MsgpackDocument]) – accepts unencoded Document, and encoded MsgpackDocument for fast conversion.
- docria.collection.build_msgpack_directory_fileindex(path, *props, basepath='.', num_workers=None)
Construct a document index spanning over multiple docria files.
- Parameters
path – path to the directory containing docria files
props – the properties to index
basepath – the relative path to use when saving filepath locations
num_workers – the number of processes to spawn for multicore processing of files, default is the number of cores available as given by multiprocessing.cpu_count()
- Return type
DocumentIndex
- Returns
populated DocumentIndex
- Note
basepath can be used to create an index which only has relative references and thus can be included with the document collection.
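The idea behind relative references can be sketched with `os.path` alone. This is an illustration under assumptions, not docria's code: it assumes stored locations are expressed relative to `basepath`, and every path below is a hypothetical placeholder.

```python
import multiprocessing
import os

# Default worker count described above: one process per available core.
default_workers = multiprocessing.cpu_count()

# Hypothetical layout: the index is meant to ship inside "corpus".
basepath = os.path.join("data", "corpus")
docria_file = os.path.join("data", "corpus", "part-000.docria")

# Assumption: stored locations are expressed relative to basepath.
stored_location = os.path.relpath(docria_file, start=basepath)

# Resolving a stored location later only needs the new corpus location.
moved_corpus = os.path.join("mnt", "shared", "corpus")
resolved = os.path.join(moved_corpus, stored_location)
```

Because the stored location carries no absolute prefix, the collection and its index can be moved together without rebuilding the index.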
- docria.collection.build_msgpack_fileindex(path, *props)
Construct a document index
- Parameters
path – path to a file which can be read by MsgpackDocumentReader
props – the properties to index
- Return type
- Returns
built index