Orca Data#
orca.data.XShards#
- class bigdl.orca.data.XShards[source]#
Bases: object
A collection of data which can be pre-processed in parallel.
- transform_shard(func: Callable, *args)[source]#
Transform each shard in the XShards using the specified function.
- Parameters
func – pre-processing function
args – arguments for the pre-processing function
- Returns
A new XShards with the transformed data
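For illustration, a minimal sketch of transform_shard; the scale helper, the "price" key, and the default local OrcaContext are assumptions, not part of the API:

    import numpy as np
    from bigdl.orca import init_orca_context
    from bigdl.orca.data import XShards

    init_orca_context()  # assumed: local mode with default settings

    # Build a small XShards from local data; after partitioning, each
    # shard here is a dict of sub-arrays.
    shards = XShards.partition({"price": np.arange(8, dtype=float)})

    # Hypothetical pre-processing function applied to every shard;
    # extra positional arguments after func are forwarded to it.
    def scale(shard, factor):
        shard["price"] = shard["price"] * factor
        return shard

    scaled = shards.transform_shard(scale, 0.5)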
- collect()[source]#
Returns a list that contains all of the elements in this XShards.
- Returns
list of elements
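For example, continuing the sketch above:

    # Gather all shards back to the driver as one Python list, with one
    # element per shard. This materializes the full dataset in driver
    # memory, so it is best reserved for small data or debugging.
    local_shards = scaled.collect()
    print(len(local_shards))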
- classmethod load_pickle(path: str, minPartitions: Optional[int] = None) bigdl.orca.data.shard.SparkXShards [source]#
Load XShards from pickle files.
- Parameters
path – The pickle file path/directory
minPartitions – The minimum partitions for the XShards
- Returns
SparkXShards object
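A minimal sketch, assuming an OrcaContext is initialized and that pickled shards already exist under the (illustrative) path:

    from bigdl.orca.data import XShards

    # Load previously saved pickle files into a SparkXShards with at
    # least 8 partitions.
    shards = XShards.load_pickle("hdfs://path/to/pickle_dir", minPartitions=8)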
- static partition(data: Union[ndarray, List[ndarray], Tuple[ndarray, ndarray], Dict[str, Union[ndarray, Tuple[ndarray], List[ndarray]]]], num_shards: Optional[int] = None) SparkXShards [source]#
Partition local in-memory data and form a SparkXShards.
- Parameters
data – np.ndarray, a tuple, list, dict of np.ndarray, or a nested structure
made of tuple, list, dict with ndarray as the leaf value
num_shards – The number of shards that the data will be partitioned into
- Returns
A SparkXShards object
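For illustration, partitioning a dict of NumPy arrays (a common features/labels layout; the keys are hypothetical and an OrcaContext is assumed to be initialized):

    import numpy as np
    from bigdl.orca.data import XShards

    data = {"x": np.random.randn(100, 10),           # features
            "y": np.random.randint(0, 2, size=100)}  # labels

    # Each leaf ndarray is split along its first axis into 4 shards.
    shards = XShards.partition(data, num_shards=4)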
orca.data.pandas#
- bigdl.orca.data.pandas.preprocessing.read_csv(file_path: str, **kwargs) bigdl.orca.data.shard.SparkXShards [source]#
Read csv files to SparkXShards of pandas DataFrames.
- Parameters
file_path – A csv file path, a list of multiple csv file paths, or a directory
containing csv files. Local file system, HDFS, and AWS S3 are supported.
kwargs – You can specify read_csv options supported by pandas.
- Returns
An instance of SparkXShards.
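A minimal sketch, assuming an OrcaContext is initialized; the path and the pandas options are illustrative:

    from bigdl.orca.data.pandas.preprocessing import read_csv

    # Any pandas read_csv option (e.g. usecols, dtype, sep) is passed
    # through to the per-file reads.
    shards = read_csv("hdfs://path/to/csv_dir", usecols=["id", "value"])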
- bigdl.orca.data.pandas.preprocessing.read_json(file_path: str, **kwargs) bigdl.orca.data.shard.SparkXShards [source]#
Read json files to SparkXShards of pandas DataFrames.
- Parameters
file_path – A json file path, a list of multiple json file paths, or a directory
containing json files. Local file system, HDFS, and AWS S3 are supported.
kwargs – You can specify read_json options supported by pandas.
- Returns
An instance of SparkXShards.
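For example (path illustrative; orient and lines are standard pandas read_json options that are forwarded):

    from bigdl.orca.data.pandas.preprocessing import read_json

    # Read newline-delimited JSON records into shards of DataFrames.
    shards = read_json("s3://bucket/json_dir", orient="records", lines=True)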
- bigdl.orca.data.pandas.preprocessing.read_file_spark(file_path: str, file_type: str, **kwargs) bigdl.orca.data.shard.SparkXShards [source]#
- bigdl.orca.data.pandas.preprocessing.read_parquet(file_path: str, columns: Optional[List[str]] = None, schema: Optional[StructType] = None, **options) SparkXShards [source]#
Read parquet files to SparkXShards of pandas DataFrames.
- Parameters
file_path – A parquet file path, a list of multiple parquet file paths, or a directory
containing parquet files. Local file system, HDFS, and AWS S3 are supported.
columns – A list of column names, default=None. If not None, only these columns
will be read from the file.
schema – pyspark.sql.types.StructType for the input schema, or a DDL-formatted
string (for example, col0 INT, col1 DOUBLE).
options – Other options for reading parquet.
- Returns
An instance of SparkXShards.
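A minimal sketch; the path and column names are illustrative, and an OrcaContext is assumed to be initialized:

    from bigdl.orca.data.pandas.preprocessing import read_parquet

    # Read only the listed columns from the parquet files.
    shards = read_parquet("hdfs://path/to/parquet_dir",
                          columns=["id", "value"])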