Tutorial
========

At this point you have an installed pyorc module.

Reading
-------

Let's open one of the example ORC files in Python::

    >>> import pyorc
    >>> example = open("./deps/examples/demo-12-zlib.orc", "rb")
    >>> reader = pyorc.Reader(example)

See the schema of the selected file::

    >>> reader.schema

The Reader's read-only schema property is a :class:`TypeDescription` object
that represents the ORC file's type hierarchy. Printing its string form gives
a more human-friendly view::

    >>> str(reader.schema)
    'struct<_col0:int,_col1:string,_col2:string,_col3:string,_col4:int,_col5:string,_col6:int,_col7:int,_col8:int>'

We can check the number of rows in the file by calling len() on the Reader::

    >>> len(reader)
    1920800

The Reader is an iterable object that yields a new row on every iteration::

    >>> next(reader)
    (1, 'M', 'M', 'Primary', 500, 'Good', 0, 0, 0)
    >>> next(reader)
    (2, 'F', 'M', 'Primary', 500, 'Good', 0, 0, 0)

Iterating over the file is the preferred way to process its rows, but we can
also load rows into memory with the read method. Called without arguments it
reads every remaining row; its optional parameter sets the maximum number of
rows to read::

    >>> rows = reader.read(10000)
    >>> rows
    ...
    (10000, 'F', 'U', 'Advanced Degree', 1500, 'Unknown', 1, 0, 0),
    (10001, 'M', 'M', 'Unknown', 1500, 'Unknown', 1, 0, 0),
    (10002, 'F', 'M', 'Unknown', 1500, 'Unknown', 1, 0, 0)]
    >>> reader.read()  # This call froze the interpreter for several minutes!
    ...
    (1920799, 'M', 'U', 'Unknown', 10000, 'Unknown', 6, 6, 6),
    (1920800, 'F', 'U', 'Unknown', 10000, 'Unknown', 6, 6, 6)]

Using this optional parameter for larger ORC files is highly recommended!
After all the rows are read, the Reader object has no more rows to yield.
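
A minimal sketch of processing rows in fixed-size batches, which keeps memory
use bounded for large files. It relies only on the Reader being an iterator,
so a plain Python iterator stands in for it here and the snippet runs without
an ORC file. The ``iter_batches`` helper and the ``fake_rows`` data are
hypothetical names for this example, not part of pyorc:

.. code-block:: python

    from itertools import islice


    def iter_batches(rows, batch_size):
        """Yield lists of at most batch_size rows from any iterator of rows."""
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        rows = iter(rows)
        while True:
            # islice consumes the next batch_size items without loading the rest.
            batch = list(islice(rows, batch_size))
            if not batch:
                return
            yield batch


    # Stand-in for a Reader: any iterator of row tuples behaves the same way.
    fake_rows = iter([(i, "row %06d" % i) for i in range(7)])

    batch_sizes = [len(batch) for batch in iter_batches(fake_rows, 3)]
    print(batch_sizes)  # [3, 3, 1]

With a real file, ``iter_batches(reader, 10000)`` would presumably play a
similar role to calling ``reader.read(10000)`` repeatedly until it returns no
rows.
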
There's a seek method to jump to a specific row in the file and continue
reading from that point::

    >>> reader.seek(1000)
    1000
    >>> next(reader)
    (1001, 'M', 'M', 'College', 7500, 'Good', 0, 0, 0)

By default all fields are loaded from an ORC file, but that can be changed by
passing either the `column_indices` or the `column_names` parameter to
Reader::

    >>> reader = pyorc.Reader(example, column_names=("_col0", "_col5"))
    >>> next(reader)
    (1, 'Good')

We can also change the representation of a struct from tuple to dictionary::

    >>> from pyorc.enums import StructRepr
    >>> reader = pyorc.Reader(example, column_indices=(1, 5), struct_repr=StructRepr.DICT)
    >>> next(reader)
    {'_col1': 'M', '_col5': 'Good'}

Stripes
-------

ORC files are divided into stripes. Stripes are independent of each other.
Let's open another ORC file that has multiple stripes in it::

    >>> example = open("./deps/examples/TestOrcFile.testStripeLevelStats.orc", "rb")
    >>> reader = pyorc.Reader(example)
    >>> reader.num_of_stripes
    3

The `num_of_stripes` property of the Reader shows how many stripes are in the
file. We can read a specific stripe using the `read_stripe` method::

    >>> stripe2 = reader.read_stripe(2)
    >>> stripe2

The stripe object is also an iterable object and has the same methods for
reading and seeking rows, but only within the boundaries of the selected
stripe::

    >>> next(stripe2)
    (3, 'three')
    >>> len(stripe2)
    1000
    >>> len(reader)
    11000
    >>> stripe2.row_offset
    10000

The `row_offset` property returns the absolute position of the first row in
the stripe.

Filtering row groups
--------------------

It is possible to skip certain records in an ORC file using simple filter
predicates (or search arguments).
Setting a predicate expression on the Reader excludes row groups that don't
satisfy the condition during reading::

    >>> example = open("./deps/examples/TestStringDictionary.testRowIndex.orc", "rb")
    >>> reader = pyorc.Reader(example)
    >>> next(reader)
    ('row 000000',)
    >>> reader = pyorc.Reader(example, predicate=pyorc.predicates.PredicateColumn(pyorc.TypeKind.STRING, "str") > "row 004096")
    >>> next(reader)
    ('row 004096',)

The predicate can be used to select a single row group, but not an individual
record. The size of the row group is determined by the `row_index_stride`,
which is set when the file is written. You can create more complex predicates
using logical expressions::

    >>> from pyorc import TypeKind
    >>> from pyorc.predicates import PredicateColumn
    >>> pred = (PredicateColumn(TypeKind.INT, "c0") > 300) & (PredicateColumn(TypeKind.STRING, "c1") == "A")

One of the comparands must always be a literal value; two columns cannot be
compared to each other.

Writing
-------

To write a new ORC file we need to open a binary file-like object and pass it
to a Writer object together with an ORC schema description. The schema can be
a TypeDescription or a simple string ORC schema definition::

    >>> output = open("./new.orc", "wb")
    >>> writer = pyorc.Writer(output, "struct<col0:int,col1:string>")
    >>> writer

We can add rows to the file with the `write` method::

    >>> writer.write((0, "Test 0"))
    >>> writer.write((1, "Test 1"))

Don't forget to close the writer to write out the necessary metadata,
otherwise the result won't be a valid ORC file::

    >>> writer.close()

For simpler use, the Writer object can act as a context manager. You can also
change the struct representation so that rows are dictionaries instead of
tuples:

.. code-block:: python

    import pyorc
    from pyorc.enums import StructRepr

    with open("./new.orc", "wb") as output:
        with pyorc.Writer(output, "struct<col0:int,col1:string>", struct_repr=StructRepr.DICT) as writer:
            writer.write({"col0": 0, "col1": "Test 0"})

Using custom converters
-----------------------

It's possible to change the default converters that handle the
transformations from ORC `date`, `decimal`, and `timestamp` types to Python
objects, and back. To create your own converter you need to implement the
:class:`ORCConverter` abstract class with two methods: ``from_orc`` and
``to_orc``. The following example returns the ORC timestamp values as a pair
of seconds and nanoseconds:

.. code-block:: python

    import pyorc
    from pyorc.converters import ORCConverter

    class TSConverter(ORCConverter):
        @staticmethod
        def to_orc(*args):
            seconds, nanoseconds, timezone = args
            return (seconds, nanoseconds)

        @staticmethod
        def from_orc(seconds, nanoseconds, timezone):
            return (seconds, nanoseconds)

To use the converter, set the Reader's or Writer's converters parameter to a
dictionary with one of the supported types as key::

    from pyorc import TypeKind

    data = open("./timestamps.orc", "rb")
    reader = pyorc.Reader(data, converters={TypeKind.TIMESTAMP: TSConverter})
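
As a rough illustration of the arithmetic a timestamp converter might perform
on the seconds and nanoseconds pair, the sketch below converts between that
pair and a timezone-aware ``datetime`` in UTC. It uses plain functions rather
than an :class:`ORCConverter` subclass and ignores the timezone argument, so
it runs without pyorc. The names ``pair_to_datetime`` and
``datetime_to_pair`` are hypothetical helpers, not part of pyorc:

.. code-block:: python

    from datetime import datetime, timedelta, timezone

    EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


    def pair_to_datetime(seconds, nanoseconds):
        """Build a UTC datetime; precision below one microsecond is dropped."""
        return EPOCH + timedelta(seconds=seconds, microseconds=nanoseconds // 1000)


    def datetime_to_pair(value):
        """Split a timezone-aware datetime into (seconds, nanoseconds)."""
        delta = value - EPOCH
        seconds = delta.days * 86400 + delta.seconds
        return (seconds, delta.microseconds * 1000)


    stamp = pair_to_datetime(1, 500000000)
    print(stamp.isoformat())        # 1970-01-01T00:00:01.500000+00:00
    print(datetime_to_pair(stamp))  # (1, 500000000)

Because ``datetime`` only stores microseconds, a round trip through these
helpers loses the last three digits of the nanosecond value; returning the
raw pair, as ``TSConverter`` does, avoids that loss.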