API documentation#

Column#

class pyorc.Column(stream, index)#

An object that represents a column in an ORC file and holds statistics about it. If the stream is a Reader object, the column refers to the entire ORC file; if it's a Stripe, it refers only to the specified ORC stripe.

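Parameters:
  • stream (Reader|Stripe) – the Reader or Stripe object that the column belongs to.

  • index (int) – the index of the column.

Column.statistics#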

A dictionary of the Column’s statistics. It always contains the kind of the column, the number of non-null values, and a boolean that tells whether the column contains null values. Depending on the kind of the column, it may also contain other information, such as minimum and maximum values or sums.

ORCConverter#

class pyorc.ORCConverter#

An abstract class for implementing custom converters for date, decimal and timestamp types. These types are stored as integers in the ORC file and can be transformed into more convenient Python objects.

The converter can be set to a Reader or Writer with the converters parameter as a dictionary, where the key is one of TypeKind.DATE, TypeKind.DECIMAL, or TypeKind.TIMESTAMP, and the value is the converter itself.

static ORCConverter.from_orc(*args)#

Builds a high-level object from a basic ORC type. Its arguments depend on which ORC type the converter is bound to:

  • date: the number of days since the epoch as a single integer.

  • decimal: the decimal number formatted as a string.

  • timestamp: seconds and nanoseconds since the epoch as integers, and the ZoneInfo object passed to the Reader as timezone.

Returns:

the constructed Python object.

static ORCConverter.to_orc(*args)#

Converts a high-level Python object to a basic ORC type. When the converter is bound to a date type, its only argument is the Python object. When it’s bound to a decimal type, the precision and scale are also passed as integers along with the object. When it’s bound to a timestamp type, the Writer’s timezone is also passed as a ZoneInfo object.

Expected return value:

  • date: the number of days since the epoch as a single integer.

  • decimal: an integer adjusted to the set precision and scale.

  • timestamp: a tuple of seconds and nanoseconds since the epoch as integers.
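
To make the expected return values concrete, here is a minimal sketch of the arithmetic a to_orc implementation might perform for each bound type. These are plain, hypothetical helper functions rather than an ORCConverter subclass; the argument order in the decimal case (precision, scale, object) is an assumption, and datetime.timezone.utc stands in for a ZoneInfo object to keep the example dependency-free.

```python
from datetime import date, datetime, timezone
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DATETIME = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_orc(obj):
    """Return the number of days since the epoch."""
    return (obj - EPOCH_DATE).days


def decimal_to_orc(precision, scale, obj):
    """Return the value as an integer shifted by `scale` decimal places.

    `precision` is accepted but not used in this sketch.
    """
    return int(obj.scaleb(scale).to_integral_value())


def timestamp_to_orc(obj, tz):
    """Return (seconds, nanoseconds) since the epoch.

    Naive datetimes are interpreted in the timezone `tz`.
    """
    if obj.tzinfo is None:
        obj = obj.replace(tzinfo=tz)
    delta = obj - EPOCH_DATETIME
    seconds = delta.days * 86400 + delta.seconds
    return seconds, delta.microseconds * 1000


print(date_to_orc(date(1970, 1, 11)))           # 10
print(decimal_to_orc(10, 2, Decimal("12.34")))  # 1234
print(timestamp_to_orc(datetime(1970, 1, 1, 0, 0, 1, 500), timezone.utc))  # (1, 500000)
```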

Predicate#

class pyorc.Predicate(operator, left, right)#

An object that represents an expression for filtering row groups in an ORC file. The supported operators are NOT, AND and OR, and the operands can be a PredicateColumn or another Predicate. The simplest Predicate is built from a PredicateColumn, a literal value and a relation between the two.

Parameters:
  • operator (Operator) – an operator type.

  • left (Predicate|PredicateColumn) – the left operand.

  • right (Predicate|PredicateColumn) – the right operand.

Predicate.__or__(other)#

Set a logical OR connection between two predicate expressions.

Parameters:

other (Predicate) – the other predicate.

Predicate.__and__(other)#

Set a logical AND connection between two predicate expressions.

Parameters:

other (Predicate) – the other predicate.

Predicate.__invert__(other)#

Set logical NOT to a predicate expression.

Parameters:

other (Predicate) – the other predicate.

PredicateColumn#

class pyorc.PredicateColumn(type_kind, name=None, index=None, precision=None, scale=None)#

An object that represents a specific column to use in a predicate expression. It can be compared to a literal value to create a Predicate. A column can be addressed by either its name or its index.

A simple predicate example that filters row groups where the col0 column is less than 0:

>>> pred = PredicateColumn(TypeKind.INT, "col0") < 0

Parameters:
  • name (str) – the name of the column in the ORC file.

  • index (int) – the index of the column in the ORC file.

  • type_kind (TypeKind) – the type of the column.

  • precision (int) – the precision if the column’s type is decimal.

  • scale (int) – the scale if the column’s type is decimal.

PredicateColumn.__eq__(other)#
PredicateColumn.__ne__(other)#
PredicateColumn.__lt__(other)#
PredicateColumn.__le__(other)#
PredicateColumn.__gt__(other)#
PredicateColumn.__ge__(other)#

Simple comparison methods to compare a column and a literal value, and return a Predicate object.

Parameters:

other – a literal value for comparison.

Reader#

class pyorc.Reader(fileo, batch_size=1024, column_indices=None, column_names=None, timezone=zoneinfo.ZoneInfo('UTC'), struct_repr=StructRepr.TUPLE, converters=None, predicate=None, null_value=None)#

An object to read ORC files. The fileo must be a binary stream that supports seeking. Either column_indices or column_names can be used to select specific columns from the ORC file.

The object iterates over rows by calling Reader.__next__(). By default, the ORC struct type is represented as a tuple, but this can be changed by setting struct_repr to a valid StructRepr value.

For decimal, date and timestamp ORC types, the default converters to Python objects can be changed by passing a dictionary as the converters parameter. The dictionary’s keys must be TypeKind values and the values must implement the ORCConverter abstract class.

Parameters:
  • fileo (object) – a readable binary file-like object.

  • batch_size (int) – The size of a batch to read.

  • column_indices (list) – a list of column indices to read.

  • column_names (list) – a list of column names to read.

  • timezone (ZoneInfo) – a ZoneInfo object to use for parsing timestamp columns.

  • struct_repr (StructRepr) – An enum to set the representation for an ORC struct type.

  • converters (dict) – a dictionary, where the keys are TypeKind and the values are subclasses of ORCConverter.

  • predicate (Predicate) – a predicate expression to read only specified row groups.

  • null_value (object) – a singleton object to represent ORC null value.

Reader.__getitem__(col_idx)#

Get a Column object. The indexing is the same as in the ORC file: 0 is the top-level struct and 1 is its first field. Nested types take up further indices; for example, if the field at index n is a map, then index n+1 is the column of its keys and index n+2 is the column of its values, and so on.
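
The numbering described above is a pre-order walk of the schema tree. The sketch below reproduces it on a hypothetical nested-tuple description of a schema (this layout is invented for the illustration and is not how pyorc represents schemas).

```python
from itertools import count


def number_columns(node, path="root", counter=None, out=None):
    """Assign pre-order column indices to a nested schema description.

    A node is either a primitive type name (str) or a tuple:
    ("struct", {name: node}), ("map", key_node, value_node), ("list", item_node).
    """
    counter = count() if counter is None else counter
    out = {} if out is None else out
    out[next(counter)] = path  # the node itself gets the next free index
    if isinstance(node, tuple):
        kind, *parts = node
        if kind == "struct":
            children = list(parts[0].items())
        elif kind == "map":
            children = [("key", parts[0]), ("value", parts[1])]
        else:  # "list"
            children = [("item", parts[0])]
        for name, child in children:
            number_columns(child, f"{path}.{name}", counter, out)
    return out


schema = ("struct", {"a": "int", "b": ("map", "string", "double"), "c": "timestamp"})
for column_index, path in number_columns(schema).items():
    print(column_index, path)
# 0 root
# 1 root.a
# 2 root.b
# 3 root.b.key
# 4 root.b.value
# 5 root.c
```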

Reader.__len__()#

Get the number of rows in the file.

Reader.__next__()#

Get the next row from the file.

Reader.iter_stripes()#

Get an iterator with the Stripe objects from the file.

Returns:

an iterator of Stripe objects.

Return type:

iterator

Reader.read(rows=-1)#

Read the rows into memory. If rows is specified, at most that many rows will be read.

Returns:

A list of rows.

Return type:

list

Reader.read_stripe(idx)#

Read a specific Stripe object at idx from the ORC file.

Parameters:

idx (int) – the index of the stripe.

Returns:

a Stripe object.

Return type:

Stripe

Reader.seek(row, whence=0)#

Jump to a certain row position in the file. Values for whence are:
  • 0 – start of the file (the default); row should be zero or positive.

  • 1 – current file position; row may be negative.

  • 2 – end of the file; row should be negative.

Returns:

number of the absolute row position.

Return type:

int

Reader.bytes_lengths#

The size information of the opened ORC file in bytes returned as a read-only dictionary. It includes the total file size (file_length), the length of the data stripes (content_length), the file footer (file_footer_length), postscript (file_postscript_length) and the stripe statistics (stripe_statistics_length).

>>> example = open("deps/examples/demo-11-zlib.orc", "rb")
>>> reader = pyorc.Reader(example)
>>> reader.bytes_lengths
{'content_length': 396823, 'file_footer_length': 2476, 'file_postscript_length': 25, 'file_length': 408522, 'stripe_statistics_length': 9197}

Reader.compression#

Read-only attribute with the compression used by the file, returned as a CompressionKind.

Reader.compression_block_size#

Read-only attribute of compression block size.

Reader.current_row#

The current row position.

Reader.format_version#

The Hive format version of the ORC file, represented as a tuple of (MAJOR, MINOR) versions.

>>> reader.format_version
(0, 11)

Reader.user_metadata#

The user metadata information of the ORC file in a dictionary. The values are always bytes.

Reader.num_of_stripes#

The number of stripes in the ORC file.

Reader.row_index_stride#

The size of row index stride in the ORC file.

Reader.schema#

A TypeDescription object of the ORC file’s schema. Always represents the full schema of the file, regardless of which columns are selected to read.

Reader.selected_schema#

A TypeDescription object of the ORC file’s schema that only represents the selected columns. If no columns are specified then it’s the same as Reader.schema.

Reader.software_version#

The version of the writer that created the ORC file, including the used implementation as well.

>>> reader.software_version
'ORC C++ 1.7.0'

Reader.writer_id#

The identification of the writer that created the ORC file. The known writers are the official Java writer, the C++ writer and the Presto writer. Other possible writers are represented as "UNKNOWN_WRITER".

>>> reader.writer_id
'ORC_JAVA_WRITER'

Reader.writer_version#

The version of the writer that created the file, returned as a WriterVersion. This version is used to mark significant changes (that don’t change the file format) and helps the reader handle the corresponding file correctly.

Stripe#

class pyorc.Stripe(reader, idx)#

An object that represents a stripe in an ORC file. It’s iterable just like Reader, and inherits many of its methods, but the read rows are limited to the stripe.

Parameters:
  • reader (Reader) – a reader object.

  • idx (int) – the index of the stripe.

Stripe.__getitem__(col_idx)#

Get a Column object, just like Reader.__getitem__(), but only for the current stripe.

Stripe.__len__()#

Get the number of rows in the stripe.

Stripe.__next__()#

Get the next row from the stripe.

Stripe.seek(row, whence=0)#

Jump to a certain row position in the stripe. For possible whence values see Reader.seek().

Returns:

number of the absolute row position in the stripe.

Return type:

int

Stripe.read(rows=-1)#

Read the rows into memory. If rows is specified, at most that many rows will be read.

Returns:

A list of rows.

Return type:

list

Stripe.bloom_filter_columns#

The list of column indices that have a Bloom filter.

Stripe.bytes_length#

The length of the stripe in bytes.

Stripe.bytes_offset#

The bytes offset where the stripe starts in the file.

Stripe.current_row#

The current row position in the stripe.

Stripe.row_offset#

The row offset where the stripe starts in the file.

Stripe.writer_timezone#

The timezone information of the writer.

TypeDescription#

class pyorc.TypeDescription#

The base class for representing a type in an ORC schema. A schema consists of one or more instances of classes that inherit from TypeDescription.

static TypeDescription.from_string(schema)#

Return instances of TypeDescription objects from a string representation of an ORC schema.

TypeDescription.find_column_id(name)#

Find the id of a column by its name.

TypeDescription.set_attributes(attrs)#

Annotate the ORC type with custom attributes. The attrs parameter must be a dictionary with string keys and string values.

TypeDescription.attributes#

Return the attributes that the column is annotated with.

TypeDescription.column_id#

The id of the column.

TypeDescription.kind#

The kind of the current TypeDescription instance. It has to be one of the pyorc.TypeKind enum values.

class pyorc.Boolean#

Class for representing boolean ORC type.

class pyorc.TinyInt#

Class for representing tinyint ORC type.

class pyorc.SmallInt#

Class for representing smallint ORC type.

class pyorc.Int#

Class for representing int ORC type.

class pyorc.BigInt#

Class for representing bigint ORC type.

class pyorc.Float#

Class for representing float ORC type.

class pyorc.Double#

Class for representing double ORC type.

class pyorc.String#

Class for representing string ORC type.

class pyorc.Binary#

Class for representing binary ORC type.

class pyorc.Timestamp#

Class for representing timestamp ORC type.

class pyorc.TimestampInstant#

Class for representing timestamp with local time zone ORC type.

class pyorc.Date#

Class for representing date ORC type.

class pyorc.Char(max_length)#

Class for representing char ORC type with the parameter of the length of the character sequence.

Parameters:

max_length (int) – the maximal length of the character sequence.

class pyorc.VarChar(max_length)#

Class for representing varchar ORC type with the parameter of the maximal length of the variable character sequence.

Parameters:

max_length (int) – the maximal length of the character sequence.

class pyorc.Decimal(precision, scale)#

Class for representing decimal ORC type with the parameters of precision and scale.

Parameters:
  • precision (int) – the precision of the decimal number.

  • scale (int) – the scale of the decimal number.

class pyorc.Union(*cont_types)#

Class for representing uniontype ORC compound type. Its arguments must be TypeDescription instances for the possible type variants.

Parameters:

*cont_types (TypeDescription) – the list of TypeDescription instances for the possible type variants.

class pyorc.Array(cont_type)#

Class for representing array ORC compound type with the parameter of the contained ORC type.

Parameters:

cont_type (TypeDescription) – the instance of the contained type.

class pyorc.Map(key, value)#

Class for representing map ORC compound type with parameters for the key and value ORC types.

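Parameters:
  • key (TypeDescription) – the instance of the key type.

  • value (TypeDescription) – the instance of the value type.

class pyorc.Struct(**fields)#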

Class for representing struct ORC compound type with keyword arguments of its fields. The fields must be TypeDescription instances.

>>> schema = Struct(
...    field0=Int(),
...    field1=Map(key=String(),value=Double()),
...    field2=Timestamp(),
... )
>>> str(schema)
"struct<field0:int,field1:map<string,double>,field2:timestamp>"

Parameters:

**fields (TypeDescription) – the keywords of TypeDescription instances for the possible fields in the struct.

Writer#

class pyorc.Writer(fileo, schema, batch_size=1024, stripe_size=67108864, row_index_stride=10000, compression=CompressionKind.ZLIB, compression_strategy=CompressionStrategy.SPEED, compression_block_size=65536, bloom_filter_columns=None, bloom_filter_fpp=0.05, timezone=zoneinfo.ZoneInfo('UTC'), struct_repr=StructRepr.TUPLE, converters=None, padding_tolerance=0.0, dict_key_size_threshold=0.0, null_value=None)#

An object to write ORC files. The fileo must be a binary stream. The schema must be TypeDescription or a valid ORC schema definition as a string.

The bloom_filter_columns parameter takes a list of column ids or field names for which to create Bloom filters. Nested structure fields can be selected with dotted notation. For example, in a file with a struct<first:struct<second:int>> schema, the second column can be selected as ["first.second"].

For decimal, date and timestamp ORC types, the default converters from Python objects can be changed by passing a dictionary as the converters parameter. The dictionary’s keys must be TypeKind values and the values must implement the ORCConverter abstract class.

Parameters:
  • fileo (object) – a writeable binary file-like object.

  • schema (TypeDescription|str) – the ORC schema of the file.

  • batch_size (int) – the batch size for the ORC file.

  • stripe_size (int) – the stripes size in bytes.

  • row_index_stride (int) – the size of the row index stride.

  • compression (CompressionKind) – the compression kind for the ORC file.

  • compression_strategy (CompressionStrategy) – the compression strategy.

  • compression_block_size (int) – the compression block size in bytes.

  • bloom_filter_columns (list) – list of columns to use Bloom filter.

  • bloom_filter_fpp (float) – the false positive probability for the Bloom filter (must be greater than 0 and less than 1).

  • timezone (ZoneInfo) – a ZoneInfo object to use for writing timestamp columns.

  • struct_repr (StructRepr) – An enum to set the representation for an ORC struct type.

  • converters (dict) – a dictionary, where the keys are TypeKind and the values are subclasses of ORCConverter.

  • padding_tolerance (float) – tolerance for block padding.

  • dict_key_size_threshold (float) – threshold for dictionary encoding.

  • null_value (object) – a singleton object to represent ORC null value.

Writer.__enter__()#
Writer.__exit__()#

A context manager that automatically calls Writer.close() at the end of the with block.

Writer.close()#

Close an ORC file and write out the metadata after the rows have been added. Must be called to get a valid ORC file.

Writer.set_user_metadata(**kwargs)#

Set additional user metadata to the ORC file. The values must be bytes. The metadata is set when the Writer is closed.

>>> out = open("test_metadata.orc", "wb")
>>> wri = pyorc.Writer(out, "int")
>>> wri.set_user_metadata(extra="info".encode())
>>> wri.close()
>>> inp = open("test_metadata.orc", "rb")
>>> rdr = pyorc.Reader(inp)
>>> rdr.user_metadata
{'extra': b'info'}

Parameters:

**kwargs – keyword arguments to add as metadata to the file.

Writer.write(row)#

Write a row to the ORC file.

Parameters:

row – the row object to write.

Writer.writerows(rows)#

Write multiple rows with one function call. It iterates over the rows and calls Writer.write(). Returns the written number of rows.

Parameters:

rows (iterable) – an iterable with the rows.

Returns:

the written number of rows.

Return type:

int

Required ORC version: 1.9.0

Write an intermediate footer on the file. If the file is truncated to the returned offset, it would be a valid ORC file.

Returns:

the byte offset.

Return type:

int

Writer.current_row#

The current row position.

Writer.schema#

A read-only TypeDescription object of the ORC file’s schema.

Enums#

CompressionKind#

class pyorc.CompressionKind(value)#

The compression kind for the ORC file.

NONE = 0#
ZLIB = 1#
SNAPPY = 2#
LZO = 3#
LZ4 = 4#
ZSTD = 5#

CompressionStrategy#

class pyorc.CompressionStrategy(value)#

Compression strategy for the ORC file.

SPEED = 0#
COMPRESSION = 1#

TypeKind#

class pyorc.TypeKind(value)#

The type kinds for an ORC schema.

BOOLEAN = 0#
BYTE = 1#
SHORT = 2#
INT = 3#
LONG = 4#
FLOAT = 5#
DOUBLE = 6#
STRING = 7#
BINARY = 8#
TIMESTAMP = 9#
LIST = 10#
MAP = 11#
STRUCT = 12#
UNION = 13#
DECIMAL = 14#
DATE = 15#
VARCHAR = 16#
CHAR = 17#
TIMESTAMP_INSTANT = 18#

classmethod has_value(value: int) → bool#

StructRepr#

class pyorc.StructRepr(value)#

Enumeration for ORC struct representation.

TUPLE = 0#

Represent ORC structs as tuples.

DICT = 1#

Represent ORC structs as dictionaries.

WriterVersion#

class pyorc.WriterVersion(value)#

Writer version for an ORC file.

ORIGINAL = 0#
HIVE_8732 = 1#
HIVE_4243 = 2#
HIVE_12055 = 3#
HIVE_13083 = 4#
ORC_101 = 5#
ORC_135 = 6#
ORC_517 = 7#
ORC_203 = 8#
ORC_14 = 9#