Orca Data#

orca.data.XShards#

class bigdl.orca.data.XShards[source]#

Bases: object

A collection of data which can be pre-processed in parallel.

transform_shard(func: Callable, *args)[source]#

Transform each shard in the XShards using the specified function.

Parameters
  • func – pre-processing function

  • args – arguments for the pre-processing function

Returns

DataShard
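
A minimal sketch of a per-shard function (the function name and data below are hypothetical; each shard of an XShards created from csv files holds a pandas DataFrame):

```python
import pandas as pd

# Hypothetical per-shard pre-processing function: drop rows with missing values.
def drop_missing(df):
    return df.dropna()

# On a cluster you would apply it to every shard in parallel:
#   shards = shards.transform_shard(drop_missing)
# Locally, the same function works on a plain DataFrame:
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, 6.0]})
cleaned = drop_missing(df)
print(len(cleaned))  # 2 rows remain
```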

collect()[source]#

Returns a list that contains all of the elements in this XShards.

Returns

list of elements

num_partitions()[source]#

Return the number of partitions in this XShards.

Returns

an int

classmethod load_pickle(path: str, minPartitions: Optional[int] = None) bigdl.orca.data.shard.SparkXShards[source]#

Load XShards from pickle files.

Parameters
  • path – The pickle file path/directory

  • minPartitions – The minimum number of partitions for the XShards

Returns

SparkXShards object
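
A local sketch of the kind of pickle files load_pickle expects (the file name and directory layout here are hypothetical, for illustration only):

```python
import os
import pickle
import tempfile

import pandas as pd

# Write a DataFrame to a pickle file in a directory, emulating one shard.
df = pd.DataFrame({"a": [1, 2, 3]})
path = tempfile.mkdtemp()
with open(os.path.join(path, "part-0.pkl"), "wb") as f:
    pickle.dump(df, f)

# On a running OrcaContext the directory could be reloaded in parallel:
#   shards = XShards.load_pickle(path, minPartitions=2)
# A local sketch of what one partition would then contain:
with open(os.path.join(path, "part-0.pkl"), "rb") as f:
    restored = pickle.load(f)
print(restored.equals(df))  # True
```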

static partition(data: Union[ndarray, List[ndarray], Tuple[ndarray, ndarray], Dict[str, Union[ndarray, Tuple[ndarray], List[ndarray]]]], num_shards: Optional[int] = None) SparkXShards[source]#

Partition local in-memory data and form a SparkXShards.

Parameters
  • data – np.ndarray, a tuple, list, or dict of np.ndarray, or a nested structure made of tuple, list, dict with ndarray as the leaf value

  • num_shards – the number of shards that the data will be partitioned into

Returns

a SparkXShards

orca.data.pandas#

bigdl.orca.data.pandas.preprocessing.read_csv(file_path: str, **kwargs) bigdl.orca.data.shard.SparkXShards[source]#

Read csv files to SparkXShards of pandas DataFrames.

Parameters
  • file_path – A csv file path, a list of multiple csv file paths, or a directory containing csv files. Local file system, HDFS, and AWS S3 are supported.

  • kwargs – You can specify read_csv options supported by pandas.

Returns

An instance of SparkXShards.
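
A sketch of forwarding pandas options through kwargs (the path in the comment is hypothetical; the local part applies the same options to one file's content):

```python
import io

import pandas as pd

# Keyword arguments are forwarded to pandas.read_csv, e.g. sep, usecols, dtype:
#   shards = bigdl.orca.data.pandas.read_csv("hdfs://path/to/data", sep="|", usecols=["a"])
# A local sketch with the same options:
csv = io.StringIO("a|b\n1|2\n3|4\n")
df = pd.read_csv(csv, sep="|", usecols=["a"])
print(df["a"].tolist())  # [1, 3]
```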

bigdl.orca.data.pandas.preprocessing.read_json(file_path: str, **kwargs) bigdl.orca.data.shard.SparkXShards[source]#

Read json files to SparkXShards of pandas DataFrames.

Parameters
  • file_path – A json file path, a list of multiple json file paths, or a directory containing json files. Local file system, HDFS, and AWS S3 are supported.

  • kwargs – You can specify read_json options supported by pandas.

Returns

An instance of SparkXShards.
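
A sketch of a pandas option passed through kwargs (the path in the comment is hypothetical; lines=True reads JSON Lines input, one object per line):

```python
import io

import pandas as pd

# Keyword arguments are forwarded to pandas.read_json:
#   shards = bigdl.orca.data.pandas.read_json("s3://bucket/logs", lines=True)
# A local sketch with the same option:
data = io.StringIO('{"a": 1}\n{"a": 2}\n')
df = pd.read_json(data, lines=True)
print(df["a"].tolist())  # [1, 2]
```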

bigdl.orca.data.pandas.preprocessing.read_file_spark(file_path: str, file_type: str, **kwargs) bigdl.orca.data.shard.SparkXShards[source]#

bigdl.orca.data.pandas.preprocessing.read_parquet(file_path: str, columns: Optional[List[str]] = None, schema: Optional[StructType] = None, **options) SparkXShards[source]#

Read parquet files to SparkXShards of pandas DataFrames.

Parameters
  • file_path – A parquet file path, a list of multiple parquet file paths, or a directory containing parquet files. Local file system, HDFS, and AWS S3 are supported.

  • columns – A list of column names, default=None. If not None, only these columns will be read from the file.

  • schema – A pyspark.sql.types.StructType for the input schema or a DDL-formatted string (for example, col0 INT, col1 DOUBLE).

  • options – Other options for reading parquet.

Returns

An instance of SparkXShards.