API documentation#
Column#
- class pyorc.Column(stream, index)#
An object that represents a column in an ORC file and holds statistics about it. If the stream is a Reader object, the column refers to the entire ORC file; if it is a Stripe, it refers only to the specified ORC stripe.
- Column.statistics#
A dictionary of the Column’s statistics. It always contains the kind of the column, the number of values (not counting null values), and a boolean indicating whether the column contains null values. Depending on the kind of the column, it may also contain other information such as minimum and maximum values or sums.
ORCConverter#
- class pyorc.ORCConverter#
An abstract class for implementing custom converters for date, decimal and timestamp types. These types are stored as integers in the ORC file and can be transformed into more convenient Python objects.
The converter can be set on a Reader or Writer with the converters parameter as a dictionary, where the key is one of TypeKind.DATE, TypeKind.DECIMAL, or TypeKind.TIMESTAMP, and the value is the converter itself.
- static ORCConverter.from_orc(*args)#
Builds a high-level object from a basic ORC type. Its arguments depend on the ORC type the converter is bound to:
date: the number of days since the epoch as a single integer.
decimal: the decimal number formatted as a string.
timestamp: seconds and nanoseconds since the epoch as integers, and the ZoneInfo object passed to the Reader as timezone.
- Returns:
the constructed Python object.
- static ORCConverter.to_orc(*args)#
Converts a high-level Python object to a basic ORC type. Its main argument is a single Python object. When the converter is bound to a decimal type, the precision and scale are also passed to this method as integers along with the object. When it is bound to a timestamp type, the Writer’s timezone is also passed as a ZoneInfo object.
Expected return value:
date: the number of days since the epoch as a single integer.
decimal: an integer adjusted to the set precision and scale.
timestamp: a tuple of seconds and nanoseconds since the epoch as integers.
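To illustrate the argument and return shapes listed above, here is a minimal standalone sketch of a date and a decimal conversion. It uses only the standard library and does not subclass ORCConverter so that it can run on its own. The class names DateSketch and DecimalSketch are hypothetical, and the argument order of the decimal to_orc is an assumption. A real converter would inherit from ORCConverter and be passed through the converters parameter.

```python
from datetime import date, timedelta
from decimal import Decimal

EPOCH = date(1970, 1, 1)


class DateSketch:
    """Hypothetical stand-in mirroring the converter interface for dates."""

    @staticmethod
    def from_orc(days):
        # ORC hands over the number of days since the epoch.
        return EPOCH + timedelta(days=days)

    @staticmethod
    def to_orc(obj):
        # Return the number of days since the epoch as a single integer.
        return (obj - EPOCH).days


class DecimalSketch:
    """Hypothetical stand-in mirroring the converter interface for decimals."""

    @staticmethod
    def from_orc(text):
        # ORC hands over the decimal number formatted as a string.
        return Decimal(text)

    @staticmethod
    def to_orc(precision, scale, obj):
        # Assumed argument order. Shift the value by `scale` digits and
        # return it as an integer; `precision` is not checked in this sketch.
        return int(obj.scaleb(scale).to_integral_value())


days = DateSketch.to_orc(date(1970, 1, 11))
unscaled = DecimalSketch.to_orc(10, 2, Decimal("1.25"))
```

Here days is the day offset of the given date from the epoch, and unscaled is the decimal value shifted by two digits and stored as an integer.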
Predicate#
- class pyorc.Predicate(operator, left, right)#
An object that represents an expression for filtering row groups in an ORC file. The supported operators are NOT, AND and OR, and the operands can be a PredicateColumn or another Predicate. A Predicate is built from a PredicateColumn, a literal value and a relation between the two.
- Predicate.__or__(other)#
Set a logical OR connection between two predicate expressions.
- Parameters:
other (Predicate) – the other predicate.
PredicateColumn#
- PredicateColumn(type_kind, name=None, index=None, precision=None, scale=None)
An object that represents a specific column to use in a predicate expression. It can be compared to a literal value to create a Predicate. A column can be addressed by either its name or its index.
A simple predicate example that filters row groups where the col0 column is less than 0:
>>> pred = PredicateColumn(TypeKind.INT, "col0") < 0
- Parameters:
type_kind (TypeKind) – the type of the column.
name (str) – the name of the column in the ORC file.
index (int) – the index of the column in the ORC file.
precision (int) – the precision if the column’s type is decimal.
scale (int) – the scale if the column’s type is decimal.
- PredicateColumn.__eq__(other)#
- PredicateColumn.__ne__(other)#
- PredicateColumn.__lt__(other)#
- PredicateColumn.__le__(other)#
- PredicateColumn.__gt__(other)#
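The way comparison operators on a column produce a predicate, and the way predicates combine with OR, can be pictured with a small toy model. This is only a sketch of the operator-overloading idea under stated assumptions: ToyColumn and ToyPredicate are hypothetical names, and nothing here reflects how pyorc implements PredicateColumn or Predicate internally.

```python
class ToyPredicate:
    """Hypothetical expression node: an operator plus its operands."""

    def __init__(self, operator, left, right):
        self.operator = operator
        self.left = left
        self.right = right

    def __or__(self, other):
        # Combining two predicates yields a new OR node.
        return ToyPredicate("OR", self, other)

    def __repr__(self):
        return f"({self.left!r} {self.operator} {self.right!r})"


class ToyColumn:
    """Hypothetical column reference; comparing it to a literal builds a predicate."""

    def __init__(self, name):
        self.name = name

    def __lt__(self, literal):
        return ToyPredicate("<", self, literal)

    def __gt__(self, literal):
        return ToyPredicate(">", self, literal)

    def __repr__(self):
        return self.name


# Two relations between a column and a literal, joined by a logical OR.
expr = (ToyColumn("col0") < 0) | (ToyColumn("col1") > 10)
print(expr)
```

The parentheses around each comparison matter in Python, because `|` binds more tightly than `<` and `>`.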
Reader#
- class pyorc.Reader(fileo, batch_size=1024, column_indices=None, column_names=None, timezone=zoneinfo.ZoneInfo('UTC'), struct_repr=StructRepr.TUPLE, converters=None, predicate=None, null_value=None)#
An object to read ORC files. The fileo must be a binary stream that supports seeking. Either column_indices or column_names can be used to select specific columns from the ORC file.
The object iterates over rows by calling Reader.__next__(). By default, the ORC struct type is represented as a tuple, but this can be changed by setting struct_repr to a valid StructRepr value.
For decimal, date and timestamp ORC types, the default converters to Python objects can be changed by passing a dictionary to the converters parameter. The dictionary’s keys must be TypeKind values and the values must implement the ORCConverter abstract class.
- Parameters:
fileo (object) – a readable binary file-like object.
batch_size (int) – the size of a batch to read.
column_indices (list) – a list of column indices to read.
column_names (list) – a list of column names to read.
timezone (ZoneInfo) – a ZoneInfo object to use for parsing timestamp columns.
struct_repr (StructRepr) – an enum to set the representation for an ORC struct type.
converters (dict) – a dictionary, where the keys are TypeKind values and the values are subclasses of ORCConverter.
predicate (Predicate) – a predicate expression to read only specified row groups.
null_value (object) – a singleton object to represent ORC null value.
- Reader.__getitem__(col_idx)#
Get a Column object. The indexing is the same as in the ORC file: 0 is the top-level struct, 1 is the first field of the top-level struct, and if the column at index n is a map, then index n+1 is the column of its keys and index n+2 is the column of its values, and so on.
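The numbering described above can be sketched as a pre-order walk over a nested schema. The schema encoding used here (plain tuples, dictionaries and strings) and the function name column_ids are hypothetical and exist only for this illustration; they are not part of pyorc.

```python
from itertools import count


def column_ids(schema):
    """Number the columns of a nested schema in pre-order, starting at 0.

    Hypothetical encoding: ("struct", {name: child}), ("map", key, value),
    ("list", child), or a plain string such as "int" for a primitive type.
    """
    ids = {}
    counter = count()

    def visit(node, path):
        # Every column gets the next free id before its children are visited.
        ids[next(counter)] = path
        if isinstance(node, str):
            return
        kind = node[0]
        if kind == "struct":
            for name, child in node[1].items():
                visit(child, f"{path}.{name}")
        elif kind == "map":
            visit(node[1], f"{path}.key")
            visit(node[2], f"{path}.value")
        elif kind == "list":
            visit(node[1], f"{path}.element")

    visit(schema, "root")
    return ids


schema = ("struct", {"id": "int", "tags": ("map", "string", "double"), "ts": "timestamp"})
ids = column_ids(schema)
for idx, path in ids.items():
    print(idx, path)
```

In this sketch the map column tags sits at index 2, so its keys and values are at indices 3 and 4, and the following field ts continues at index 5.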
- Reader.__len__()#
Get the number of rows in the file.
- Reader.__next__()#
Get the next row from the file.
- Reader.iter_stripes()#
Get an iterator over the Stripe objects in the file.
- Returns:
an iterator of Stripe objects.
- Return type:
iterator
- Reader.read(rows=-1)#
Read the rows into memory. If rows is specified, at most that many rows will be read.
- Returns:
A list of rows.
- Return type:
list
- Reader.seek(row, whence=0)#
Jump to a certain row position in the file. Values for whence are:
0 – start of the file (the default); row should be zero or positive.
1 – current file position; row may be negative.
2 – end of the file; row should be negative.
- Returns:
number of the absolute row position.
- Return type:
int
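The whence semantics above follow the familiar file-seek convention, which can be sketched as a small pure-Python helper. The function resolve_row and its parameters are hypothetical; how out-of-range positions are handled is not covered by the description above, so this sketch does not clamp or validate the result.

```python
def resolve_row(row, whence, current_row, total_rows):
    """Return the absolute row position for a (row, whence) pair."""
    if whence == 0:
        base = 0  # relative to the start of the file
    elif whence == 1:
        base = current_row  # relative to the current position
    elif whence == 2:
        base = total_rows  # relative to the end of the file
    else:
        raise ValueError(f"invalid whence: {whence}")
    return base + row


# In a 100-row file with the cursor at row 5:
from_start = resolve_row(10, 0, current_row=5, total_rows=100)
from_current = resolve_row(-2, 1, current_row=5, total_rows=100)
from_end = resolve_row(-10, 2, current_row=5, total_rows=100)
```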
- Reader.bytes_lengths#
The size information of the opened ORC file in bytes returned as a read-only dictionary. It includes the total file size (file_length), the length of the data stripes (content_length), the file footer (file_footer_length), postscript (file_postscript_length) and the stripe statistics (stripe_statistics_length).
>>> example = open("deps/examples/demo-11-zlib.orc", "rb")
>>> reader = pyorc.Reader(example)
>>> reader.bytes_lengths
{'content_length': 396823, 'file_footer_length': 2476, 'file_postscript_length': 25, 'file_length': 408522, 'stripe_statistics_length': 9197}
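The figures in this example can be cross-checked with a little arithmetic. The sketch below copies the dictionary from the example output and sums every part except file_length. The interpretation of the remaining byte as the trailing postscript-length byte of the ORC file layout is an assumption based on the ORC format, not something stated on this page.

```python
lengths = {
    "content_length": 396823,
    "file_footer_length": 2476,
    "file_postscript_length": 25,
    "file_length": 408522,
    "stripe_statistics_length": 9197,
}

# Sum every reported section except the total file size.
parts = sum(size for key, size in lengths.items() if key != "file_length")

# Whatever is left over is not covered by the reported sections
# (assumed here to be the single byte holding the postscript length).
remainder = lengths["file_length"] - parts
print(parts, remainder)
```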
- Reader.compression#
Read-only attribute with the compression used in the file, returned as a CompressionKind.
- Reader.compression_block_size#
Read-only attribute with the compression block size.
- Reader.current_row#
The current row position.
- Reader.format_version#
The Hive format version of the ORC file, represented as a tuple of (MAJOR, MINOR) versions.
>>> reader.format_version
(0, 11)
- Reader.user_metadata#
The user metadata information of the ORC file in a dictionary. The values are always bytes.
- Reader.num_of_stripes#
The number of stripes in the ORC file.
- Reader.row_index_stride#
The size of row index stride in the ORC file.
- Reader.schema#
A TypeDescription object of the ORC file’s schema. It always represents the full schema of the file, regardless of which columns are selected to read.
- Reader.selected_schema#
A TypeDescription object of the ORC file’s schema that only represents the selected columns. If no columns are specified, it is the same as Reader.schema.
- Reader.software_version#
The version of the writer that created the ORC file, including the used implementation as well.
>>> reader.software_version
'ORC C++ 1.7.0'
- Reader.writer_id#
The identification of the writer that created the ORC file. The known writers are the official Java writer, the C++ writer and the Presto writer. Other possible writers are represented as "UNKNOWN_WRITER".
>>> reader.writer_id
'ORC_JAVA_WRITER'
- Reader.writer_version#
The version of the writer that created the file, returned as a WriterVersion. This version marks significant changes that do not change the file format, and helps the reader handle the corresponding file correctly.
Stripe#
- class pyorc.Stripe(reader, idx)#
An object that represents a stripe in an ORC file. It is iterable just like Reader and inherits many of its methods, but the rows read are limited to the stripe.
- Parameters:
reader (Reader) – a reader object.
idx (int) – the index of the stripe.
- Stripe.__getitem__(col_idx)#
Get a Column object, just like Reader.__getitem__(), but only for the current stripe.
- Stripe.__len__()#
Get the number of rows in the stripe.
- Stripe.__next__()#
Get the next row from the stripe.
- Stripe.seek(row, whence=0)#
Jump to a certain row position in the stripe. For possible whence values see Reader.seek().
- Returns:
number of the absolute row position in the stripe.
- Return type:
int
- Stripe.read(rows=-1)#
Read the rows into memory. If rows is specified, at most that many rows will be read.
- Returns:
A list of rows.
- Return type:
list
- Stripe.bloom_filter_columns#
The list of column indices that have a Bloom filter.
- Stripe.bytes_length#
The length of the stripe in bytes.
- Stripe.bytes_offset#
The bytes offset where the stripe starts in the file.
- Stripe.current_row#
The current row position in the stripe.
- Stripe.row_offset#
The row offset where the stripe starts in the file.
- Stripe.writer_timezone#
The timezone information of the writer.
TypeDescription#
- class pyorc.TypeDescription#
The base class for representing a type of ORC schema. A schema consists of one or more instances of classes that inherit from the TypeDescription class.
- static TypeDescription.from_string(schema)#
Build and return TypeDescription instances from a string representation of an ORC schema.
- TypeDescription.find_column_id(name)#
Find the id of a column by its name.
- TypeDescription.set_attributes(attrs)#
Annotate the ORC type with custom attributes. The attrs parameter must be a dictionary with string keys and string values.
- TypeDescription.attributes#
Return the attributes that the column is annotated with.
- TypeDescription.column_id#
The id of the column.
- TypeDescription.kind#
The kind of the current TypeDescription instance. It has to be one of the pyorc.TypeKind enum values.
- class pyorc.Boolean#
Class for representing boolean ORC type.
- class pyorc.TinyInt#
Class for representing tinyint ORC type.
- class pyorc.SmallInt#
Class for representing smallint ORC type.
- class pyorc.Int#
Class for representing int ORC type.
- class pyorc.BigInt#
Class for representing bigint ORC type.
- class pyorc.Float#
Class for representing float ORC type.
- class pyorc.Double#
Class for representing double ORC type.
- class pyorc.String#
Class for representing string ORC type.
- class pyorc.Binary#
Class for representing binary ORC type.
- class pyorc.Timestamp#
Class for representing timestamp ORC type.
- class pyorc.TimestampInstant#
Class for representing timestamp with local time zone ORC type.
- class pyorc.Date#
Class for representing date ORC type.
- class pyorc.Char(max_length)#
Class for representing char ORC type with the parameter of the length of the character sequence.
- Parameters:
max_length (int) – the maximal length of the character sequence.
- class pyorc.VarChar(max_length)#
Class for representing varchar ORC type with the parameter of the maximal length of the variable character sequence.
- Parameters:
max_length (int) – the maximal length of the character sequence.
- class pyorc.Decimal(precision, scale)#
Class for representing decimal ORC type with the parameters of precision and scale.
- Parameters:
precision (int) – the precision of the decimal number.
scale (int) – the scale of the decimal number.
- class pyorc.Union(*cont_types)#
Class for representing uniontype ORC compound type. Its arguments must be TypeDescription instances for the possible type variants.
- Parameters:
*cont_types (TypeDescription) – the list of TypeDescription instances for the possible type variants.
- class pyorc.Array(cont_type)#
Class for representing array ORC compound type with the parameter of the contained ORC type.
- Parameters:
cont_type (TypeDescription) – the instance of the contained type.
- class pyorc.Map(key, value)#
Class for representing map ORC compound type with parameters for the key and value ORC types.
- Parameters:
key (TypeDescription) – the instance type of the key in the map.
value (TypeDescription) – the instance type of the value in the map.
- class pyorc.Struct(**fields)#
Class for representing struct ORC compound type with keyword arguments of its fields. The fields must be TypeDescription instances.
>>> schema = Struct(
...     field0=Int(),
...     field1=Map(key=String(),value=Double()),
...     field2=Timestamp(),
... )
>>> str(schema)
"struct<field0:int,field1:map<string,double>,field2:timestamp>"
- Parameters:
**fields (TypeDescription) – the keywords of TypeDescription instances for the possible fields in the struct.
Writer#
- class pyorc.Writer(fileo, schema, batch_size=1024, stripe_size=67108864, row_index_stride=10000, compression=CompressionKind.ZLIB, compression_strategy=CompressionStrategy.SPEED, compression_block_size=65536, bloom_filter_columns=None, bloom_filter_fpp=0.05, timezone=zoneinfo.ZoneInfo('UTC'), struct_repr=StructRepr.TUPLE, converters=None, padding_tolerance=0.0, dict_key_size_threshold=0.0, null_value=None)#
An object to write ORC files. The fileo must be a binary stream. The schema must be a TypeDescription or a valid ORC schema definition as a string.
With bloom_filter_columns, a list of column ids or field names can be set to create a Bloom filter for each listed column. Nested structure fields can be selected with dotted format. For example, in a file with a struct<first:struct<second:int>> schema the second column can be selected as ["first.second"].
For decimal, date and timestamp ORC types, the default converters from Python objects can be changed by passing a dictionary to the converters parameter. The dictionary’s keys must be TypeKind values and the values must implement the ORCConverter abstract class.
- Parameters:
fileo (object) – a writeable binary file-like object.
schema (TypeDescription|str) – the ORC schema of the file.
batch_size (int) – the batch size for the ORC file.
stripe_size (int) – the stripe size in bytes.
row_index_stride (int) – the size of the row index stride.
compression (CompressionKind) – the compression kind for the ORC file.
compression_strategy (CompressionStrategy) – the compression strategy.
compression_block_size (int) – the compression block size in bytes.
bloom_filter_columns (list) – list of columns to use Bloom filter.
bloom_filter_fpp (float) – the false positive probability for the Bloom filter (must be greater than 0 and less than 1).
timezone (ZoneInfo) – a ZoneInfo object to use for writing timestamp columns.
struct_repr (StructRepr) – an enum to set the representation for an ORC struct type.
converters (dict) – a dictionary, where the keys are TypeKind values and the values are subclasses of ORCConverter.
padding_tolerance (float) – tolerance for block padding.
dict_key_size_threshold (float) – threshold for dictionary encoding.
null_value (object) – a singleton object to represent ORC null value.
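The relationship between a dotted field name and a column id for bloom_filter_columns can be sketched with a toy lookup. This is an illustration only: the nested-dictionary schema model and the function dotted_to_column_id are hypothetical, only struct nesting is modeled, and the pre-order numbering with the root struct at 0 is the convention described for Reader.__getitem__(), not a statement about how the Writer resolves names internally.

```python
def dotted_to_column_id(schema, dotted):
    """Map a dotted field path to its pre-order column id (root struct is 0).

    `schema` is a hypothetical model: nested dicts are structs, strings are
    primitive types. Returns None if the path does not exist.
    """
    target = dotted.split(".")
    counter = 0

    def visit(node, path):
        nonlocal counter
        current = counter
        counter += 1
        if path == target:
            return current
        if isinstance(node, dict):
            for name, child in node.items():
                found = visit(child, path + [name])
                if found is not None:
                    return found
        return None

    return visit(schema, [])


# Toy model of struct<first:struct<second:int>>
schema = {"first": {"second": "int"}}
second_id = dotted_to_column_id(schema, "first.second")
first_id = dotted_to_column_id(schema, "first")
```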
- Writer.__enter__()#
- Writer.__exit__()#
A context manager that automatically calls Writer.close() at the end of the with block.
- Writer.close()#
Close an ORC file and write out the metadata after the rows have been added. Must be called to get a valid ORC file.
- Writer.set_user_metadata(**kwargs)#
Set additional user metadata to the ORC file. The values must be bytes. The metadata is set when the Writer is closed.
>>> out = open("test_metadata.orc", "wb")
>>> wri = pyorc.Writer(out, "int")
>>> wri.set_user_metadata(extra="info".encode())
>>> wri.close()
>>> inp = open("test_metadata.orc", "rb")
>>> rdr = pyorc.Reader(inp)
>>> rdr.user_metadata
{'extra': b'info'}
- Parameters:
**kwargs – keyword arguments to add as metadata to the file.
- Writer.write(row)#
Write a row to the ORC file.
- Parameters:
row – the row object to write.
- Writer.writerows(rows)#
Write multiple rows with one function call. It iterates over the rows and calls Writer.write() for each one.
- Parameters:
rows (iterable) – an iterable with the rows.
- Returns:
the number of rows written.
- Return type:
int
Required ORC version: 1.9.0
Write an intermediate footer on the file. If the file is truncated to the returned offset, it would be a valid ORC file.
- Returns:
the byte offset.
- Return type:
int
- Writer.current_row#
The current row position.
- Writer.schema#
A read-only TypeDescription object of the ORC file’s schema.
Enums#
CompressionKind#
CompressionStrategy#
TypeKind#
- class pyorc.TypeKind(value)#
The type kinds for an ORC schema.
- BOOLEAN = 0#
- BYTE = 1#
- SHORT = 2#
- INT = 3#
- LONG = 4#
- FLOAT = 5#
- DOUBLE = 6#
- STRING = 7#
- BINARY = 8#
- TIMESTAMP = 9#
- LIST = 10#
- MAP = 11#
- STRUCT = 12#
- UNION = 13#
- DECIMAL = 14#
- DATE = 15#
- VARCHAR = 16#
- CHAR = 17#
- TIMESTAMP_INSTANT = 18#
- classmethod has_value(value: int) -> bool#