Friesian Feature API

friesian.feature.table

class bigdl.friesian.feature.table.Table(df)[source]

Bases: object

property schema

compute()[source]

Trigger computation of the Table.

to_spark_df()[source]

Convert the current Table to a Spark DataFrame.

Returns

The converted Spark DataFrame.

size()[source]

Returns the number of rows in this Table.

Returns

The number of rows in the current Table.

broadcast()[source]

Marks the Table as small enough for use in broadcast join.

select(*cols)[source]

Select specific columns.

Parameters

cols – str or a list of str that specifies column names. If it is ‘*’, select all the columns.

Returns

A new Table that contains the specified columns.

drop(*cols)[source]

Returns a new Table with the specified column(s) dropped. This is a no-op if the schema doesn’t contain the given column name(s).

Parameters

cols – str or a list of str that specifies the name of the columns to drop.

Returns

A new Table with the specified column(s) dropped.

limit(num)[source]

Limits the result count to the number specified.

Parameters

num – int that specifies the number of results.

Returns

A new Table that contains at most num rows.

repartition(num_partitions)[source]

Return a new Table that has exactly num_partitions partitions.

Parameters

num_partitions – target number of partitions

Returns

a new Table that has num_partitions partitions.

get_partition_row_number()[source]

Return a Table that contains partitionId and corresponding row number.

Returns

a new Table that contains partitionId and corresponding row number.

fillna(value, columns)[source]

Replace null values.

Parameters
  • value – int, long, float, string, or boolean. Value to replace null values with.

  • columns – list of str, the target columns to be filled. If columns=None and value is int, all columns of integer type will be filled. If columns=None and value is long, float, str or boolean, all columns will be filled.

Returns

A new Table that replaced the null values with the specified value.

dropna(columns, how='any', thresh=None)[source]

Drops the rows containing null values in the specified columns.

Parameters
  • columns – str or a list of str that specifies column names. If it is None, it will operate on all columns.

  • how – If how is “any”, then drop rows containing any null values in columns. If how is “all”, then drop rows only if every column in columns is null for that row.

  • thresh – int, if specified, drop rows that have fewer than thresh non-null values. Default is None.

Returns

A new Table that drops the rows containing null values in the specified columns.

distinct()[source]

Select the distinct rows of the Table.

Returns

A new Table that only contains distinct rows.

filter(condition)[source]

Filters the rows that satisfy condition. For instance, filter(“col_1 == 1”) will filter the rows that have value 1 at column col_1.

Parameters

condition – str that gives the condition for filtering.

Returns

A new Table with filtered rows.

random_split(weights, seed=None)[source]

Randomly splits with the provided weights.

Parameters
  • weights – list of doubles as weights with which to split the table. Weights will be normalized if they don’t sum up to 1.0.

  • seed – The seed for sampling.

Returns

A list of split Tables.

clip(columns, min=None, max=None)[source]

Clips continuous values so that they are within the range [min, max]. For instance, by setting the min value to 0, all negative values in columns will be replaced with 0.

Parameters
  • columns – str or a list of str, the target columns to be clipped.

  • min – numeric, the minimum value to clip values to. Values less than this will be replaced with this value.

  • max – numeric, the maximum value to clip values to. Values greater than this will be replaced with this value.

Returns

A new Table in which values less than min are replaced with the specified min and values greater than max are replaced with the specified max.
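
The clipping behavior can be sketched in plain Python (illustrative only; the actual operation runs on Spark columns):

```python
def clip_value(x, min=None, max=None):
    # Mirror the documented semantics: values below min become min,
    # values above max become max.
    if min is not None and x < min:
        return min
    if max is not None and x > max:
        return max
    return x

# Setting min=0 replaces all negative values with 0:
[clip_value(v, min=0) for v in [-3, -1, 0, 2, 5]]  # -> [0, 0, 0, 2, 5]
```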

log(columns, clipping=True)[source]

Calculates the log of continuous columns.

Parameters
  • columns – str or a list of str, the target columns to calculate log.

  • clipping – boolean. Default is True, and in this case the negative values in columns will be clipped to 0 and log(x+1) will be calculated. If False, log(x) will be calculated.

Returns

A new Table with the values in columns replaced by their log values.
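
The per-value transformation described above can be illustrated in plain Python (a sketch of the documented semantics, not the Spark implementation):

```python
import math

def log_transform(x, clipping=True):
    # With clipping=True (the default), negative values are clipped to 0
    # and log(x + 1) is computed; with clipping=False, log(x) is computed.
    if clipping:
        return math.log(max(x, 0) + 1)
    return math.log(x)

log_transform(-5)  # -> 0.0, since -5 is clipped to 0 and log(0 + 1) == 0
```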

fill_median(columns)[source]

Replaces null values with the median in the specified numeric columns. Any column to be filled should not contain only null values.

Parameters

columns – str or a list of str that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that replaces null values with the median in the specified numeric columns.

median(columns)[source]

Returns a new Table that has two columns, column and median, containing the column names and the medians of the specified numeric columns.

Parameters

columns – str or a list of str that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that contains the medians of the specified columns.

merge_cols(columns, target)[source]

Merge the values of the target columns into a list in a new column. The original columns will be dropped.

Parameters
  • columns – a list of str, the target columns to be merged.

  • target – str, the new column name of the merged column.

Returns

A new Table that replaces columns with a new target column of merged list values.

rename(columns)[source]

Rename columns with new column names.

Parameters

columns – dict of name pairs. For instance, {‘old_name1’: ‘new_name1’, ‘old_name2’: ‘new_name2’}.

Returns

A new Table with new column names.

show(n=20, truncate=True)[source]

Prints the first n rows to the console.

Parameters
  • n – int, the number of rows to show.

  • truncate – If set to True, truncate strings longer than 20 chars by default. If set to a number greater than one, truncate long strings to length truncate and align cells right.

get_stats(columns, aggr)[source]

Calculate the statistics of the values over the target column(s).

Parameters
  • columns – str or a list of str that specifies the name(s) of the target column(s). If columns is None, then the function will return statistics for all numeric columns.

  • aggr – str or a list of str or dict to specify aggregate functions, min/max/avg/sum/count are supported. If aggr is a str or a list of str, it contains the name(s) of aggregate function(s). If aggr is a dict, the key is the column name, and the value is the aggregate function(s).

Returns

dict, the key is the column name, and the value is aggregate result(s).

min(columns)[source]

Returns a new Table that has two columns, column and min, containing the column names and the minimum values of the specified numeric columns.

Parameters

columns – str or a list of str that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that contains the minimum values of the specified columns.

max(columns)[source]

Returns a new Table that has two columns, column and max, containing the column names and the maximum values of the specified numeric columns.

Parameters

columns – str or a list of str that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that contains the maximum values of the specified columns.

to_list(column)[source]

Convert all values of the target column to a list. Only call this if the Table is small enough.

Parameters

column – str, specifies the name of target column.

Returns

list, contains all values of the target column.

to_dict()[source]

Convert the Table to a dictionary. Only call this if the Table is small enough.

Returns

dict, the key is the column name, and the value is the list containing all values in the corresponding column.

add(columns, value=1)[source]

Increase all values of the target numeric column(s) by a constant value.

Parameters
  • columns – str or a list of str, the target columns to be increased.

  • value – numeric (int/float/double/short/long), the constant value to be added.

Returns

A new Table with updated numeric values on specified columns.

property columns

Get column names of the Table.

Returns

A list of strings that specify column names.

sample(fraction, replace=False, seed=None)[source]

Return a sampled subset of Table.

Parameters
  • fraction – float, fraction of rows to generate, should be within the range [0, 1].

  • replace – allow or disallow sampling of the same row more than once.

  • seed – seed for sampling.

Returns

A new Table with sampled rows.

ordinal_shuffle_partition()[source]

Shuffle each partition of the Table by adding a random ordinal column for each row and sort by this ordinal column within each partition.

Returns

A new Table with shuffled partitions.

write_parquet(path, mode='overwrite')[source]

Write the Table to Parquet file.

Parameters
  • path – str, the path to the Parquet file.

  • mode – str. One of “append”, “overwrite”, “error” or “ignore”. append: Append contents to the existing data. overwrite: Overwrite the existing data. error: Throw an exception if the data already exists. ignore: Silently ignore this operation if data already exists.

cast(columns, dtype)[source]

Cast columns to the specified type.

Parameters
  • columns – str or a list of str that specifies column names. If it is None, then cast all of the columns.

  • dtype – str (“string”, “boolean”, “int”, “long”, “short”, “float”, “double”) that specifies the data type.

Returns

A new Table that casts all of the specified columns to the specified type.

write_csv(path, delimiter=',', mode='overwrite', header=True, num_partitions=None)[source]

Write the Table to csv file.

Parameters
  • path – str, the path to the csv file.

  • delimiter – str, the delimiter to use for separating fields. Default is “,”.

  • mode – str. One of “append”, “overwrite”, “error” or “ignore”. append: Append the contents of this Table to the existing data. overwrite: Overwrite the existing data. error: Throw an exception if the data already exists. ignore: Silently ignore this operation if the data already exists.

  • header – boolean, whether to include the schema at the first line of the csv file. Default is True.

  • num_partitions – positive int. The number of files to write.

concat(tables, mode='inner', distinct=False)[source]

Concatenate a list of Tables into one Table in the dimension of row.

Parameters
  • tables – a Table or a list of Tables.

  • mode – str, either inner or outer. For inner mode, the new Table would only contain columns that are shared by all Tables. For outer mode, the resulting Table would contain all the columns that appear in any of the Tables.

  • distinct – boolean. If True, the result Table would only contain distinct rows. Default is False.

Returns

A single concatenated Table.

drop_duplicates(subset=None, sort_cols=None, keep='min')[source]

Return a new Table with duplicate rows removed.

Parameters
  • subset – str or a list of str, specifies which column(s) to be considered when referring to duplication. If subset is None, all the columns will be considered.

  • sort_cols – str or a list of str, specifies the column(s) to determine which item to keep when duplicated. If sort_cols is None, duplicate rows will be dropped randomly.

  • keep – str, the strategy to keep duplicates, either min or max. Default is min. It only takes effect when sort_cols is not None. If keep is min, rows with the smallest values in sort_cols will be kept. If keep is max, rows with the largest values in sort_cols will be kept.

Returns

A new Table with duplicate rows removed.
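
The keep-min/keep-max behavior can be sketched with plain Python dicts standing in for rows (illustrative only; the real operation runs on Spark):

```python
def dedup(rows, subset, sort_col, keep="min"):
    # rows is a list of dicts. For each distinct combination of the
    # subset columns, keep the row with the smallest (keep="min") or
    # largest (keep="max") value in sort_col.
    pick = min if keep == "min" else max
    best = {}
    for row in rows:
        key = tuple(row[c] for c in subset)
        best[key] = row if key not in best else pick(
            best[key], row, key=lambda r: r[sort_col])
    return list(best.values())

rows = [{"user": 1, "time": 5}, {"user": 1, "time": 2}, {"user": 2, "time": 7}]
dedup(rows, subset=["user"], sort_col="time", keep="min")
# -> keeps {"user": 1, "time": 2} and {"user": 2, "time": 7}
```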

append_column(name, column)[source]

Append a column with a constant value to the Table.

Parameters
  • name – str, the name of the new column.

  • column – pyspark.sql.column.Column, new column to be added into the table.

Returns

A new Table with the appended column.

subtract(other)[source]

Return a new Table containing rows in this Table but not in another Table.

Parameters

other – Table.

Returns

A new Table.

col(name)[source]

Get the target column of the Table.

sort(*cols, **kwargs)[source]

Sort the Table by specified column(s).

Parameters
  • cols – list of Column or column names to sort by.

  • ascending – boolean or list of boolean (default True). Sort ascending vs. descending. Specify list for multiple sort orders. If a list is specified, length of the list must equal length of the cols.

order_by(*cols, **kwargs)

Sort the Table by specified column(s).

Parameters
  • cols – list of Column or column names to sort by.

  • ascending – boolean or list of boolean (default True). Sort ascending vs. descending. Specify list for multiple sort orders. If a list is specified, length of the list must equal length of the cols.

to_pandas()[source]

Convert the current Table to a pandas DataFrame.

cache()[source]

Persist this Table in memory.

Returns

This Table.

uncache()[source]

Mark this Table as non-persistent and remove all its blocks from memory.

Returns

This Table.

coalesce(num_partitions)[source]

Return a new Table that has exactly num_partitions partitions. coalesce uses existing partitions to minimize the amount of data that’s shuffled.

Parameters

num_partitions – target number of partitions

Returns

a new Table that has num_partitions partitions.

intersect(other)[source]

Return a new Table containing only the rows in both this Table and another Table.

Parameters

other – Table.

Returns

A new Table.

collect()[source]

Returns all the records as a list of Row.

property dtypes

Returns all column names and their data types as a list.

class bigdl.friesian.feature.table.FeatureTable(df)[source]

Bases: Table

classmethod read_parquet(paths)[source]

Loads Parquet files as a FeatureTable.

Parameters

paths – str or a list of str, the path(s) to Parquet file(s).

Returns

A FeatureTable for recommendation data.

classmethod read_json(paths, cols=None)[source]

Loads json files as a FeatureTable.

Parameters
  • paths – str or a list of str, the path(s) to the json file(s).

  • cols – str or a list of str. The columns to select from the json file(s). Default is None and in this case all the columns will be considered.

Returns

A FeatureTable for recommendation data.

classmethod read_csv(paths, delimiter=',', header=False, names=None, dtype=None)[source]

Loads csv files as a FeatureTable.

Parameters
  • paths – str or a list of str, the path(s) to the csv file(s).

  • delimiter – str, the delimiter to use for parsing the csv file(s). Default is “,”.

  • header – boolean, whether the first line of the csv file(s) will be treated as the header for column names. Default is False.

  • names – str or a list of str, the column names for the csv file(s). You need to provide this if the header cannot be inferred. If specified, names should have the same length as the number of columns.

  • dtype – str or a list of str or dict, the column data type(s) for the csv file(s). You may need to provide this if you want to change the default inferred types of specified columns. If dtype is a str, then all the columns will be cast to the target dtype. If dtype is a list of str, then it should have the same length as the number of columns and each column will be cast to the corresponding str dtype. If dtype is a dict, then the key should be the column name and the value should be the str dtype to cast the column to.

Returns

A FeatureTable for recommendation data.

classmethod read_text(paths, col_name='value')[source]

Loads text files as a FeatureTable.

Parameters
  • paths – str or a list of str, the path(s) to the text file(s).

  • col_name – the column name of the text. Default is “value”.

Returns

A FeatureTable for recommendation data.

static from_pandas(pandas_df)[source]

Returns the contents of a pandas DataFrame as a FeatureTable.

Parameters

pandas_df – a pandas DataFrame.

Returns

A FeatureTable for recommendation data.

encode_string(columns, indices, broadcast=True, do_split=False, sep=',', sort_for_array=False, keep_most_frequent=False)[source]

Encode columns with the provided list of StringIndex. Unknown strings will be None after the encoding, and you may need to fillna with 0.

Parameters
  • columns – str or a list of str, the target columns to be encoded.

  • indices – StringIndex or a list of StringIndex, StringIndexes of target columns. The StringIndex should at least have two columns: id and the corresponding categorical column. Or it can be a dict or a list of dicts. In this case, the keys of the dict should be within the categorical column and the values are the target ids to be encoded.

  • broadcast – bool, whether to broadcast the index when encoding strings. Default is True.

  • do_split – bool, whether to split the column value into an array before encoding. Default is False.

  • sep – str, a string representing a regular expression to split a column value. Default is ‘,’.

  • sort_for_array – bool, whether to sort the array columns. Default is False.

  • keep_most_frequent – bool, whether to keep only the most frequent value as the column value. Default is False.

Returns

A new FeatureTable which transforms categorical features into unique integer values with provided StringIndexes.
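
The core lookup semantics can be sketched in plain Python, with a dict standing in for a StringIndex (illustrative only):

```python
def encode(values, index):
    # index maps category -> id (ids start from 1 in a StringIndex);
    # unknown categories become None, which can later be filled with 0.
    return [index.get(v) for v in values]

encode(["a", "b", "zzz"], {"a": 1, "b": 2})  # -> [1, 2, None]
```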

filter_by_frequency(columns, min_freq=2)[source]

Filter the FeatureTable by the given minimum frequency on the target columns.

Parameters
  • columns – str or a list of str, column names which are considered for filtering.

  • min_freq – int, the minimum frequency. Values in the target columns with occurrence below this value will be filtered out.

Returns

A new FeatureTable with filtered records.

hash_encode(columns, bins, method='md5')[source]

Hash encode for categorical column(s).

Parameters
  • columns – str or a list of str, the target columns to be encoded. For dense features, you need to cut them into discrete intervals beforehand.

  • bins – int, defines the number of equal-width bins in the range of column(s) values.

  • method – hashlib supported method, like md5, sha256 etc.

Returns

A new FeatureTable with hash encoded columns.
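
The idea of hash encoding can be sketched with hashlib in plain Python; the exact string conversion and digest folding used internally are assumptions here:

```python
import hashlib

def hash_encode(value, bins, method="md5"):
    # Hash the string form of the value with the given hashlib method,
    # then fold the digest into one of `bins` equal-width buckets.
    digest = hashlib.new(method, str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % bins

hash_encode("cat", bins=100)  # deterministic bucket in [0, 100)
```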

cross_hash_encode(columns, bins, cross_col_name=None, method='md5')[source]

Hash encode for cross column(s).

Parameters
  • columns – a list of str, the categorical columns to be encoded as cross features. For dense features, you need to cut them into discrete intervals beforehand.

  • bins – int, defines the number of equal-width bins in the range of the column(s)’ values.

  • cross_col_name – str, the column name for output cross column. Default is None, and in this case the default cross column name will be ‘crossed_col1_col2’ for [‘col1’, ‘col2’].

  • method – hashlib supported method, like md5, sha256 etc.

Returns

A new FeatureTable with the target cross column.

category_encode(columns, freq_limit=None, order_by_freq=False, do_split=False, sep=',', sort_for_array=False, keep_most_frequent=False, broadcast=True)[source]

Category encode the given columns.

Parameters
  • columns – str or a list of str, target columns to encode from string to index.

  • freq_limit – int, dict or None. Categories with a count/frequency below freq_limit will be omitted from the encoding. Can be represented as either an integer or a dict. For instance, 15, {‘col_4’: 10, ‘col_5’: 2}, etc. Default is None, and in this case all the categories that appear will be encoded.

  • order_by_freq – boolean, whether the resulting StringIndex will assign smaller indices to values with higher frequency. Default is False, and in this case frequency order may not be preserved when assigning indices.

  • do_split – bool, whether to split the column value into an array before encoding. Default is False.

  • sep – str, a string representing a regular expression to split a column value. Default is ‘,’.

  • sort_for_array – bool, whether to sort the array columns. Default is False.

  • keep_most_frequent – bool, whether to keep only the most frequent value as the column value. Default is False.

  • broadcast – bool, whether to broadcast the index when encoding strings. Default is True.

Returns

A tuple of a new FeatureTable which transforms categorical features into unique integer values, and a list of StringIndex for the mapping.

one_hot_encode(columns, sizes=None, prefix=None, keep_original_columns=False)[source]

Convert categorical features into one hot encodings. If the features are strings, you should first call category_encode to encode them into indices before one hot encoding. For each input column, a one hot vector will be created, expanding into multiple output columns, with the value of each one hot column being either 0 or 1. Note that you should only use one hot encoding on columns with small dimensions due to memory concerns.

For example, for column ‘x’ with size 5:

Input:

|x|
|1|
|3|
|0|

Output will contain 5 one hot columns:

|prefix_0|prefix_1|prefix_2|prefix_3|prefix_4|
|    0   |    1   |    0   |    0   |    0   |
|    0   |    0   |    0   |    1   |    0   |
|    1   |    0   |    0   |    0   |    0   |

Parameters
  • columns – str or a list of str, the target columns to be encoded.

  • sizes – int or a list of int, the size(s) of the one hot vectors of the column(s). Default is None, and in this case, the sizes will be calculated by the maximum value(s) of the column(s) + 1, namely the one hot vector will cover 0 to the maximum value. You are recommended to provide the sizes if they are known beforehand. If specified, sizes should have the same length as columns.

  • prefix – str or a list of str, the prefix of the one hot columns for the input column(s). Default is None, and in this case, the prefix will be the input column names. If specified, prefix should have the same length as columns. The one hot columns for each input column will have column names: prefix_0, prefix_1, … , prefix_maximum

  • keep_original_columns – boolean, whether to keep the original index column(s) before the one hot encoding. Default is False, and in this case the original column(s) will be replaced by the one hot columns. If True, the one hot columns will be appended to each original column.

Returns

A new FeatureTable which transforms categorical indices into one hot encodings.
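
The expansion for a single index, matching the ‘x’ example above, can be sketched in plain Python:

```python
def one_hot(index, size):
    # Expand a categorical index into `size` 0/1 columns.
    vec = [0] * size
    vec[index] = 1
    return vec

[one_hot(i, 5) for i in [1, 3, 0]]
# -> [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [1, 0, 0, 0, 0]]
```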

gen_string_idx(columns, freq_limit=None, order_by_freq=False, do_split=False, sep=',')[source]

Generate unique index value of categorical features. The resulting index would start from 1 with 0 reserved for unknown features.

Parameters
  • columns – str, dict or a list of str or dict, the target column(s) to generate StringIndex for. A dict is a mapping of source column names -> target column name if you need to combine multiple source columns to generate the index. For example: {‘src_cols’: [‘a_user’, ‘b_user’], ‘col_name’: ‘user’}.

  • freq_limit – int, dict or None. Categories with a count/frequency below freq_limit will be omitted from the encoding. Can be represented as either an integer or a dict. For instance, 15, {‘col_4’: 10, ‘col_5’: 2}, etc. Default is None, and in this case all the categories that appear will be encoded.

  • order_by_freq – boolean, whether the resulting StringIndex will assign smaller indices to values with higher frequency. Default is False, and in this case frequency order may not be preserved when assigning indices.

  • do_split – bool, whether to split the column value into an array to generate the index. Default is False.

  • sep – str, a string representing a regular expression to split a column value. Default is ‘,’.

Returns

A StringIndex or a list of StringIndex.
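
The index-building semantics (ids starting from 1 with 0 reserved for unknowns, optional frequency filtering and frequency ordering) can be sketched in plain Python, with a dict standing in for the StringIndex:

```python
from collections import Counter

def build_string_index(values, freq_limit=None, order_by_freq=False):
    # Build a value -> id mapping. Ids start from 1, with 0 reserved
    # for unknown values. With order_by_freq=True, more frequent values
    # receive smaller ids.
    counts = Counter(values)
    if freq_limit is not None:
        counts = {v: c for v, c in counts.items() if c >= freq_limit}
    keys = sorted(counts, key=lambda v: -counts[v]) if order_by_freq else list(counts)
    return {v: i for i, v in enumerate(keys, start=1)}

build_string_index(["a", "b", "a", "c", "a", "b"], order_by_freq=True)
# -> {'a': 1, 'b': 2, 'c': 3}
```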

cross_columns(crossed_columns, bucket_sizes)[source]

Cross columns and hash them into the specified bucket sizes.

Parameters
  • crossed_columns – a list of column name pairs to be crossed, e.g. [[‘a’, ‘b’], [‘c’, ‘d’]].

  • bucket_sizes – a list of hash bucket sizes for the crossed pairs, e.g. [1000, 300].

Returns

A new FeatureTable with crossed columns.
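
The crossing-then-hashing idea can be sketched for a single row in plain Python; the separator and hash function below are assumptions for illustration:

```python
import hashlib

def cross_hash(row, columns, bucket_size):
    # Concatenate the values of the crossed columns, then hash the
    # joined string into bucket_size buckets.
    joined = "_".join(str(row[c]) for c in columns)
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return int(digest, 16) % bucket_size

cross_hash({"a": "x", "b": "y"}, ["a", "b"], bucket_size=1000)  # in [0, 1000)
```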

min_max_scale(columns, min=0.0, max=1.0)[source]

Rescale each column individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or rescaling.

Parameters
  • columns – str or a list of str, the column(s) to be rescaled.

  • min – float, the lower bound after transformation, shared by all columns. Default is 0.0.

  • max – float, the upper bound after transformation, shared by all columns. Default is 1.0.

Returns

A tuple of a new FeatureTable with rescaled column(s), and a dict of the original min and max values of the input column(s).
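
Min-max rescaling of a single column can be sketched in plain Python (illustrative only; the actual operation uses Spark column summary statistics):

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    # Linearly rescale a column into [new_min, new_max] using its own
    # min/max statistics, returning the original (min, max) so the same
    # scaling can later be reapplied via transform_min_max_scale.
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values], (lo, hi)

scaled, stats = min_max_scale([10, 20, 30])
# scaled is approximately [0.0, 0.5, 1.0]; stats is (10, 30)
```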

transform_min_max_scale(columns, min_max_dict)[source]

Rescale each column individually with the given [min, max] range of each column.

Parameters
  • columns – str or a list of str, the column(s) to be rescaled.

  • min_max_dict – dict, the key is the column name, and the value is the tuple of min and max values of this column.

Returns

A new FeatureTable with rescaled column(s).

add_negative_samples(item_size, item_col='item', label_col='label', neg_num=1)[source]

Generate negative records for each record in the FeatureTable. All the records in the original FeatureTable will be treated as positive samples with value 1 for label_col and the negative samples will be randomly generated with value 0 for label_col.

Parameters
  • item_size – int, the total number of items in the FeatureTable.

  • item_col – str, the name of the item column. Whether the record is positive or negative will be based on this column. Default is ‘item’.

  • label_col – str, the name of the label column. Default is ‘label’.

  • neg_num – int, the number of negative records for each positive record. Default is 1.

Returns

A new FeatureTable with negative samples.
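
The sampling scheme can be sketched in plain Python with dicts standing in for rows; the exact sampling details used internally are an assumption here:

```python
import random

def with_negative_samples(records, item_size, item_col="item",
                          label_col="label", neg_num=1, seed=0):
    # Each input record becomes a positive sample (label 1); for each,
    # neg_num negatives are generated with a random item different from
    # the positive one (label 0).
    rng = random.Random(seed)
    out = []
    for rec in records:
        out.append({**rec, label_col: 1})
        for _ in range(neg_num):
            neg = rng.randrange(item_size)
            while neg == rec[item_col]:
                neg = rng.randrange(item_size)
            out.append({**rec, item_col: neg, label_col: 0})
    return out

with_negative_samples([{"user": 1, "item": 3}], item_size=10)
# -> one positive record plus one randomly generated negative record
```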

add_hist_seq(cols, user_col, sort_col='time', min_len=1, max_len=100, num_seqs=2147483647)[source]

Add a column of history visits of each user.

Parameters
  • cols – str or a list of str, the column(s) to be treated as histories.

  • user_col – str, the column to be treated as the user.

  • sort_col – str, the column to sort by for each user. Default is ‘time’.

  • min_len – int, the minimal length of a history sequence. Default is 1.

  • max_len – int, the maximal length of a history sequence. Default is 100.

  • num_seqs – int, default is 2147483647 (the maximum value of a 4-byte integer), which means to keep all the histories. You can set num_seqs=1 to only keep the last history.

Returns

A new FeatureTable with history sequences.

add_neg_hist_seq(item_size, item_history_col, neg_num)[source]

Generate a list of negative samples for each item in the history sequence.

Parameters
  • item_size – int, the total number of items in the FeatureTable.

  • item_history_col – str, the history column to generate negative samples.

  • neg_num – int, the number of negative items for each history (positive) item.

Returns

A new FeatureTable with negative history sequences.

mask(mask_cols, seq_len=100)[source]

Add mask on specified column(s).

Parameters
  • mask_cols – str or a list of str, the column(s) to be masked with 1s and 0s. Each column should be of list type.

  • seq_len – int, the length of the masked column. Default is 100.

Returns

A new FeatureTable with masked columns.

pad(cols, seq_len=100, mask_cols=None, mask_token=0)[source]

Add padding on specified column(s).

Parameters
  • cols – str or a list of str, the column(s) to be padded with mask_tokens. Each column should be of list type.

  • seq_len – int, the length to be padded to for cols. Default is 100.

  • mask_cols – str or a list of str, the column(s) to be masked with 1s and 0s.

  • mask_token – numeric types or str, should be consistent with element’s type of cols. Default is 0.

Returns

A new FeatureTable with padded columns.
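
The padding and masking of a single list value (covering both mask and pad above) can be sketched in plain Python:

```python
def pad_and_mask(seq, seq_len=100, mask_token=0):
    # Truncate or pad the list to exactly seq_len elements and build a
    # 1/0 mask marking real vs. padded positions.
    padded = (seq + [mask_token] * seq_len)[:seq_len]
    mask = ([1] * len(seq) + [0] * seq_len)[:seq_len]
    return padded, mask

pad_and_mask([5, 7, 9], seq_len=5)
# -> ([5, 7, 9, 0, 0], [1, 1, 1, 0, 0])
```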

apply(in_col, out_col, func, dtype='string')[source]

Transform a FeatureTable using a user-defined Python function.

Parameters
  • in_col – str or a list of str, the column(s) to be transformed.

  • out_col – str, the name of output column.

  • func – the Python function that takes in_col as input and outputs out_col. When in_col is a list of str, func should take a list as input, and in this case you are generating out_col given multiple input columns.

  • dtype – str, the data type of out_col. Default is string type.

Returns

A new FeatureTable after column transformation.

join(table, on=None, how=None, lsuffix=None, rsuffix=None)[source]

Join a FeatureTable with another FeatureTable.

Parameters
  • table – A FeatureTable.

  • on – str or a list of str, the column(s) to join.

  • how – str, default is inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.

  • lsuffix – The suffix to use for the original Table’s overlapping columns.

  • rsuffix – The suffix to use for the input Table’s overlapping columns.

Returns

A joined FeatureTable.

add_value_features(columns, dict_tbl, key, value)[source]

Add features based on key columns and the key value Table. For each column in columns, it adds a value column using key-value pairs from dict_tbl.

Parameters
  • columns – str or a list of str, the key columns in the original FeatureTable.

  • dict_tbl – A Table for the key value mapping.

  • key – str, the name of the key column in dict_tbl.

  • value – str, the name of value column in dict_tbl.

Returns

A new FeatureTable with value columns.

reindex(columns=[], index_tbls=[])[source]

Replace the values in each column of columns using the corresponding mapping in index_tbls, with 0 as the default for unmatched values.

Parameters
  • columns – str or a list of str.

  • index_tbls – a Table or a list of Tables, each containing a mapping from the old index to the new one.

Returns

FeatureTable

gen_reindex_mapping(columns=[], freq_limit=10)[source]

Generate a mapping from the old index to a new one for each column, based on popularity count in descending order.

Parameters
  • columns – str or a list of str.

  • freq_limit – int, dict or None. Indices with a count below freq_limit will be omitted. Can be represented as either an integer or a dict. For instance, 15, {‘col_4’: 10, ‘col_5’: 2} etc. Default is 10.

Returns

A list of FeatureTables, each containing a mapping from the old index to the new index. The new index starts from 1, with 0 reserved for the default.

group_by(columns=[], agg='count', join=False)[source]

Group the Table with specified columns and then run aggregation. Optionally join the result with the original Table.

Parameters
  • columns – str or a list of str. Columns to group the Table. If it is an empty list, aggregation is run directly without grouping. Default is [].

  • agg

    str, list or dict. Aggregate functions to be applied to the grouped Table. Default is “count”. Supported aggregate functions are: “max”, “min”, “count”, “sum”, “avg”, “mean”, “sumDistinct”, “stddev”, “stddev_pop”, “variance”, “var_pop”, “skewness”, “kurtosis”, “collect_list”, “collect_set”, “approx_count_distinct”, “first”, “last”. If agg is a str, then agg is the aggregate function and the aggregation is performed on all columns that are not in columns. If agg is a list of str, then agg is a list of aggregate functions and the aggregation is performed on all columns that are not in columns. If agg is a single dict mapping from str to str, then the key is the column to perform aggregation on, and the value is the aggregate function. If agg is a single dict mapping from str to list, then the key is the column to perform aggregation on, and the value is a list of aggregate functions.

    Examples: agg=“sum”; agg=[“last”, “stddev”]; agg={“*”: “count”}; agg={“col_1”: “sum”, “col_2”: [“count”, “mean”]}.

  • join – boolean. If True, join the aggregation result with original Table.

Returns

A new Table with aggregated column fields.

split(ratio, seed=None)[source]

Split the FeatureTable into multiple FeatureTables for train, validation and test.

Parameters
  • ratio – a list of portions as weights with which to split the FeatureTable. Weights will be normalized if they don’t sum up to 1.0.

  • seed – The seed for sampling.

Returns

A tuple of FeatureTables split by the given ratio.

target_encode(cat_cols, target_cols, target_mean=None, smooth=20, kfold=2, fold_seed=None, fold_col='__fold__', drop_cat=False, drop_fold=True, out_cols=None)[source]

For each categorical column or column group in cat_cols, calculate the mean of target columns in target_cols and encode the FeatureTable with the target mean(s) to generate new features.

Parameters
  • cat_cols – str, a list of str or a nested list of str. Categorical column(s) or column group(s) to target encode. To encode categorical column(s), cat_cols should be a str or a list of str. To encode categorical column group(s), cat_cols should be a nested list of str.

  • target_cols – str or a list of str. Numeric target column(s) to calculate the mean. If target_cols is a list, then each target_col would be used separately to encode the cat_cols.

  • target_mean – dict of {target column: mean}. Provides the global mean of the target column(s) if known beforehand, to save computation. Default is None, in which case the global mean(s) will be calculated on demand.

  • smooth – int. The mean of each category is smoothed by the overall mean. Default is 20.

  • kfold – int. Specifies number of folds for cross validation. The mean values within the i-th fold are calculated with data from all other folds. If kfold is 1, global-mean statistics are applied; otherwise, cross validation is applied. Default is 2.

  • fold_seed – int. Random seed used for generating folds. Default is None and in this case folds will be generated with row number in each partition.

  • fold_col – str. Name of integer column used for splitting folds. If fold_col exists in the FeatureTable, then this column is used; otherwise, it is randomly generated within the range [0, kfold). Default is “__fold__”.

  • drop_cat – boolean, whether to drop the original categorical columns. Default is False.

  • drop_fold – boolean, whether to drop the fold column. Default is True.

  • out_cols – str, a list of str or a nested list of str. When both cat_cols and target_cols have only one element, out_cols can be a single str. When either cat_cols or target_cols has only one element, out_cols can be a list of str, and each element in out_cols corresponds to an element in target_cols or cat_cols respectively. When out_cols is a nested list of str, each inner list corresponds to the categorical column in the same position of cat_cols, and each element in the inner list corresponds to the target column in the same position of target_cols. Default is None, in which case each output column will be named cat_col + "_te_" + target_col.

Returns

A tuple of a new FeatureTable with target encoded columns and a list of TargetCodes which contains the target encode values of the whole FeatureTable.
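The smoothing step can be sketched in plain Python, assuming the common additive-smoothing formula (the category mean blended with the global mean, weighted by the category count and smooth); this is an illustration, and the actual Friesian implementation may differ in details:

```python
def smoothed_target_mean(values_by_cat, smooth=20):
    """For each category, blend its target mean with the global mean:
    (count * cat_mean + smooth * global_mean) / (count + smooth)."""
    all_values = [v for vs in values_by_cat.values() for v in vs]
    global_mean = sum(all_values) / len(all_values)
    encoded = {}
    for cat, vs in values_by_cat.items():
        count = len(vs)
        cat_mean = sum(vs) / count
        encoded[cat] = (count * cat_mean + smooth * global_mean) / (count + smooth)
    return encoded, global_mean

# Two categories with opposite targets; smooth=2 pulls both toward 0.5.
enc, g = smoothed_target_mean({"a": [1.0, 1.0], "b": [0.0, 0.0]}, smooth=2)
```

A larger smooth value shrinks rare categories harder toward the global mean, which is the usual defense against overfitting low-count categories.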

encode_target(targets, target_cols=None, drop_cat=True)[source]

Encode columns with the provided TargetCode(s).

Parameters
  • targets – TargetCode or a list of TargetCode.

  • target_cols – str or a list of str. Selects part of target columns of which target encoding will be applied. Default is None and in this case all target columns contained in targets will be encoded.

  • drop_cat – boolean, whether to drop the categorical column(s). Default is True.

Returns

A new FeatureTable which encodes each categorical column into group-specific mean of target columns with provided TargetCodes.
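The per-row substitution that encode_target performs can be sketched as a dictionary lookup; encode_with_target_code is an illustrative name, and the global-mean fallback for unseen categories is an assumption rather than documented behavior:

```python
def encode_with_target_code(categories, code, global_mean):
    """Replace each categorical value with its target-encoded mean,
    falling back to the global mean for categories absent from the code."""
    return [code.get(c, global_mean) for c in categories]
```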

difference_lag(columns, sort_cols, shifts=1, partition_cols=None, out_cols=None)[source]

Calculate the difference between two consecutive rows, or two rows separated by a given interval, for the specified continuous columns. The Table is first partitioned by partition_cols if it is not None, and then sorted by sort_cols before the calculation.

Parameters
  • columns – str or a list of str. Continuous columns to calculate the difference.

  • sort_cols – str or a list of str. Columns by which the table is sorted.

  • shifts – int or a list of int. Intervals between two rows.

  • partition_cols – Columns by which the table is partitioned.

  • out_cols – str, a list of str, or a nested list of str. When both columns and shifts have only one element, out_cols can be a single str. When either columns or shifts has only one element, out_cols can be a list of str, and each element in out_cols corresponds to an element in shifts or columns respectively. When out_cols is a nested list of str, each inner list corresponds to a column in columns, and each element in the inner list corresponds to a shift in shifts. Default is None, in which case each output column will be named sort_cols + "_diff_lag_" + column + "_" + shift.

Returns

A new FeatureTable with difference columns.
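On a single partition, the computation can be sketched in plain Python (difference_lag_sketch is an illustrative stand-in; the real method runs on a distributed Table):

```python
def difference_lag_sketch(rows, sort_key, value_key, shift=1):
    """Sort rows by sort_key, then for each row emit the difference
    between its value and the value `shift` rows earlier (None when
    no such earlier row exists)."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    diffs = []
    for i, row in enumerate(ordered):
        if i - shift >= 0:
            diffs.append(row[value_key] - ordered[i - shift][value_key])
        else:
            diffs.append(None)
    return diffs
```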

cut_bins(columns, bins, labels=None, out_cols=None, drop=True)[source]

Segment values of the target column(s) into bins, which is also known as bucketization.

Parameters
  • columns – str or a list of str, the numeric column(s) to segment into intervals.

  • bins – int, a list of int or dict. If bins is a list, it defines the bins to be used. NOTE that for bins of length n, there will be n+1 buckets. For example, if bins is [0, 6, 18, 60], the resulting buckets are (-inf, 0), [0, 6), [6, 18), [18, 60), [60, inf). If bins is an int, it defines the number of equal-width bins in the range of all the column values, i.e. from column min to max. NOTE that there will be bins+2 resulting buckets in total to take the values below min and beyond max into consideration. For example, if bins is 2, the resulting buckets are (-inf, col_min), [col_min, (col_min+col_max)/2), [(col_min+col_max)/2, col_max), [col_max, inf). If bins is a dict, the key should be the input column(s) and the value should be int or a list of int to specify the bins as described above.

  • labels – a list of str or dict, the labels for the returned bins. Default is None, and in this case the new bin column would use the integer index to encode the interval. Index would start from 0. If labels is a list of str, then the corresponding label would be used to replace the integer index at the same position. The number of elements in labels should be the same as the number of bins. If labels is a dict, the key should be the input column(s) and the value should be a list of str as described above.

  • out_cols – str or a list of str, the name(s) of the output bucketized column(s). Default is None, in which case each output column will be named after its input column with "_bin" appended.

  • drop – boolean, whether to drop the original column(s). Default is True.

Returns

A new FeatureTable with feature bucket column(s).
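For an explicit bin list, the bucket assignment described above can be sketched with the standard bisect module (assign_bucket is an illustrative helper, not part of the API):

```python
import bisect

def assign_bucket(value, bins):
    """Return the bucket index of value for edges `bins`, using the
    intervals (-inf, b0), [b0, b1), ..., [b_{n-1}, inf): the index is
    simply the number of edges that are <= value."""
    return bisect.bisect_right(bins, value)
```

With bins = [0, 6, 18, 60], a value of -1 lands in bucket 0, 0 in bucket 1, and anything >= 60 in bucket 4, matching the n+1 buckets described above.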

get_vocabularies(columns)[source]

Create a vocabulary for each column and return a dict of vocabularies.

Parameters

columns – str or a list of str. Columns to generate vocabularies.

Returns

A dict of vocabularies.

sample_listwise(columns, num_sampled_list, num_sampled_item, random_seed=None, replace=True)[source]

Convert the FeatureTable to a sample listwise FeatureTable. The columns should be of list type and have the same length. Note that the rows with list length < num_sampled_item will be dropped since they don’t have enough examples.

You can use group_by to aggregate records under the same key before calling sample_listwise.

>>> tbl
+----+----+----+
|name|   a|   b|
+----+----+----+
|   a|   1|   1|
|   a|   2|   2|
|   b|   1|   1|
+----+----+----+
>>> tbl.group_by("name", agg="collect_list")
+----+---------------+---------------+
|name|collect_list(a)|collect_list(b)|
+----+---------------+---------------+
|   a|         [1, 2]|         [1, 2]|
|   b|            [1]|            [1]|
+----+---------------+---------------+
>>> tbl
+----+------------+------------+--------------------+
|name|     int_arr|     str_arr|         int_arr_arr|
+----+------------+------------+--------------------+
|   a|   [1, 2, 3]|   [1, 2, 3]|     [[1], [2], [3]]|
|   b|[1, 2, 3, 4]|[1, 2, 3, 4]|[[1], [2], [3], [4]]|
|   c|         [1]|         [1]|               [[1]]|
+----+------------+------------+--------------------+
>>> tbl.sample_listwise(["int_arr", "str_arr", "int_arr_arr"], num_sampled_list=4,
...                     num_sampled_item=2)
+----+-------+-------+-----------+
|name|int_arr|str_arr|int_arr_arr|
+----+-------+-------+-----------+
|   a| [1, 3]| [1, 3]| [[1], [3]]|
|   a| [2, 1]| [2, 1]| [[2], [1]]|
|   a| [3, 2]| [3, 2]| [[3], [2]]|
|   a| [2, 3]| [2, 3]| [[2], [3]]|
|   b| [4, 1]| [4, 1]| [[4], [1]]|
|   b| [2, 3]| [2, 3]| [[2], [3]]|
|   b| [2, 3]| [2, 3]| [[2], [3]]|
|   b| [2, 3]| [2, 3]| [[2], [3]]|
+----+-------+-------+-----------+
>>> tbl.sample_listwise(["int_arr", "str_arr"], num_sampled_list=2,
...                     num_sampled_item=2, replace=False)
+----+------------+------------+--------------------+---------------+---------------+
|name|     int_arr|     str_arr|         int_arr_arr|sampled_int_arr|sampled_str_arr|
+----+------------+------------+--------------------+---------------+---------------+
|   a|   [1, 2, 3]|   [1, 2, 3]|     [[1], [2], [3]]|         [3, 2]|         [3, 2]|
|   a|   [1, 2, 3]|   [1, 2, 3]|     [[1], [2], [3]]|         [2, 1]|         [2, 1]|
|   b|[1, 2, 3, 4]|[1, 2, 3, 4]|[[1], [2], [3], [4]]|         [2, 4]|         [2, 4]|
|   b|[1, 2, 3, 4]|[1, 2, 3, 4]|[[1], [2], [3], [4]]|         [4, 2]|         [4, 2]|
+----+------------+------------+--------------------+---------------+---------------+

Parameters
  • columns – str or a list of str. Columns to convert to sampled list. Each column should be of list type. The list length of specified columns in the same row must be the same.

  • num_sampled_list – int. The number of lists that should be sampled for each row.

  • num_sampled_item – int. The number of elements to be sampled for each list from the list of each column.

  • random_seed – int. The seed for creating np.random.RandomState. Default is None.

  • replace – bool. Whether the sampled lists should replace the original columns. If replace=False, a corresponding column "sampled_" + col will be generated for each sampled column.

Returns

A new sampled listwise FeatureTable.
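The per-row sampling can be sketched in plain Python: the same positions are drawn once and applied to every list column, which keeps elements aligned across columns (sample_listwise_row is an illustrative, single-row stand-in; the real method samples num_sampled_list such rows per input row):

```python
import random

def sample_listwise_row(row_lists, num_sampled_item, seed=None):
    """Sample num_sampled_item positions without replacement, then
    take those positions from every list column of one row so that
    elements stay aligned across columns."""
    length = len(row_lists[0])
    assert all(len(lst) == length for lst in row_lists), \
        "all list columns in a row must have the same length"
    rng = random.Random(seed)
    idx = rng.sample(range(length), num_sampled_item)
    return [[lst[i] for i in idx] for lst in row_lists]

out = sample_listwise_row([[1, 2, 3], ["x", "y", "z"]], 2, seed=0)
```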

class bigdl.friesian.feature.table.StringIndex(df, col_name)[source]

Bases: Table

classmethod read_parquet(paths, col_name=None)[source]

Loads Parquet files as a StringIndex.

Parameters
  • paths – str or a list of str, the path(s) to Parquet file(s).

  • col_name – str. The column name of the corresponding categorical column. If col_name is None, the file name will be used as col_name.

Returns

A StringIndex.

classmethod from_dict(indices, col_name)[source]

Create the StringIndex from a dict of indices.

Parameters
  • indices – dict. The key is the categorical feature and the value is the corresponding index. We assume the key is a str and the value is an int.

  • col_name – str. The column name of the categorical column.

Returns

A StringIndex.

to_dict()[source]

Convert the StringIndex to a dict, with the categorical features as keys and indices as values. Note that you may only call this if the StringIndex is small.

Returns

A dict for the mapping from string to index.

write_parquet(path, mode='overwrite')[source]

Write the StringIndex to Parquet file.

Parameters
  • path – str, the path to the Parquet file. Note that the col_name will be used as basename of the Parquet file.

  • mode – str. One of "append", "overwrite", "error" or "ignore". "append": append the contents of this StringIndex to the existing data. "overwrite": overwrite the existing data. "error": throw an exception if the data already exists. "ignore": silently ignore this operation if the data already exists.

cast(columns, dtype)[source]

Cast columns to the specified type.

Parameters
  • columns – str or a list of str that specifies column names. If it is None, then cast all of the columns.

  • dtype – str (“string”, “boolean”, “int”, “long”, “short”, “float”, “double”) that specifies the data type.

Returns

A new Table that casts all of the specified columns to the specified type.

class bigdl.friesian.feature.table.TargetCode(df, cat_col, out_target_mean)[source]

Bases: Table

Target Encoding output used for encoding new FeatureTables, which consists of the encoded categorical column or column group and the target encoded columns (mean statistics of the categorical column or column group).

Parameters
  • df – Target encoded data.

  • cat_col – str or list of str. The categorical column or column group encoded in the original FeatureTable.

  • out_target_mean – dict, the key is the target encoded output column in this TargetCode, and the value is a tuple of the target column in the original FeatureTable together with the target column’s global mean in the original FeatureTable. For example: {“col3_te_target1”: (“target1”, 3.0)}, and in this case cat_col for this TargetCode should be “col3”.

rename(columns)[source]

Rename columns with new column names.

Parameters

columns – dict. Name pairs. For instance, {'old_name1': 'new_name1', 'old_name2': 'new_name2'}.

Returns

A new Table with new column names.