shapelets.SandBox.from_csv

SandBox.from_csv(path: Union[str, Path], *, auto_detect: Optional[bool] = True, hive_partitioning: Optional[bool] = False, delimiter: Optional[str] = None, quote: Optional[str] = None, escape: Optional[str] = None, has_header: Optional[bool] = None, null_string: Optional[str] = None, date_format: Optional[str] = None, timestamp_format: Optional[str] = None, sample_size: Optional[int] = None, skip_detection: Optional[bool] = None, compression: Optional[Literal['None', 'GZIP', 'ZSTD']] = None, include_filename: Optional[bool] = None, skip_top_lines: Optional[int] = None) -> DataSet

Imports one or more CSV files as a DataSet.

All files must have the same schema.

Parameters:
path: str or Path, required

Path to the files to load. It accepts either a single string or a Path object.

Use a string value when using wildcards in your path (*) to match a directory tree structure. These paths may contain references to environment variables ($var or ${var}) and home directory expressions (~).

Use Path objects to specify valid, resolvable paths.

Paths, whether given as strings or as Path objects, should include the file pattern to load (e.g. *.csv).
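The following standard-library sketch illustrates what expanding environment variables, the home directory and wildcards in a path string can look like. It is not the SandBox implementation; `resolve_pattern` and the `CSV_DEMO_ROOT` variable are hypothetical names used only for this illustration.

```python
import glob
import os
import tempfile
from typing import List


def resolve_pattern(pattern: str) -> List[str]:
    """Expand $VAR / ${VAR} and ~, then match wildcards (hypothetical helper)."""
    expanded = os.path.expanduser(os.path.expandvars(pattern))
    return sorted(glob.glob(expanded, recursive=True))


# Demo: a throwaway directory exposed through an environment variable.
with tempfile.TemporaryDirectory() as root:
    os.environ["CSV_DEMO_ROOT"] = root  # hypothetical variable name
    for name in ("a.csv", "b.csv", "notes.txt"):
        with open(os.path.join(root, name), "w") as fh:
            fh.write("x\n1\n")
    matches = resolve_pattern("${CSV_DEMO_ROOT}/*.csv")
    matched_names = [os.path.basename(m) for m in matches]
```

Only the two files matching the `*.csv` pattern are selected; `notes.txt` is ignored.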

auto_detect: Optional bool; defaults to True

When set to True, delimiter, quote, escape and header options will be automatically inferred from the files.

hive_partitioning: Optional bool; defaults to False

Set to True when the path contains Hive partition expressions that should be incorporated into the loaded dataset.
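Hive-style layouts conventionally encode values in directory names of the form key=value. The sketch below shows how such segments could be read from a path; `hive_partitions` is a hypothetical helper, and how from_csv actually exposes these values is not described in this page.

```python
from pathlib import PurePosixPath


def hive_partitions(path: str) -> dict:
    """Collect key=value directory segments from a path (hypothetical helper)."""
    parts = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:  # keep only segments shaped like key=value
            parts[key] = value
    return parts


example = hive_partitions("sales/year=2023/month=07/part-0.csv")
```

Here `example` holds the `year` and `month` values taken from the directory names, as strings.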

delimiter: Optional str

Specifies the string that separates columns within each line of the file.

quote: Optional str

Specifies the quoting string to be used when a data value is quoted.

escape: Optional str

Specifies the string that should appear before a data character sequence that matches the quote value. The default is the same as the quote value (so that the quoting string is doubled if it appears in the data).
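To make the default escape behaviour concrete, the sketch below uses Python's built-in csv module (not shapelets) to parse a value in which the quote character is doubled. It assumes the same convention as described above: the escape string equals the quote string.

```python
import csv
import io

# The comment field contains quotes, so each embedded quote is doubled.
raw = 'id,comment\n1,"She said ""hi"", then left"\n'

reader = csv.reader(
    io.StringIO(raw),
    delimiter=",",
    quotechar='"',
    doublequote=True,  # a doubled quote inside a quoted field means one quote
)
rows = list(reader)
```

The parsed comment contains single quote characters again, and the comma inside the quoted field is not treated as a delimiter.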

has_header: Optional bool

Specifies that the file contains a header line with the names of each column in the file.

null_string: Optional str

Specifies the string that represents a NULL value. The default is an empty string.

date_format: Optional str

Specifies the date format to use when parsing dates. See the notes section.

timestamp_format: Optional str

Specifies the timestamp format to use when parsing timestamps. See the notes section.

sample_size: Optional int

Number of sample rows to use for automatic CSV type detection.

skip_detection: Optional bool

When set to True, skips type detection during CSV parsing and assumes all columns are of type string.
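The sketch below illustrates why the sample size matters for type detection. It is a deliberately simplified stand-in, not the detection logic used by from_csv: `infer_types` is a hypothetical helper that only distinguishes integers from strings.

```python
import csv
import io
from itertools import islice


def _is_int(text):
    try:
        int(text)
        return True
    except ValueError:
        return False


def infer_types(text, sample_size):
    """Guess 'int' or 'str' per column from the first sample_size data rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    sample = list(islice(reader, sample_size))
    return {
        name: "int" if sample and all(_is_int(row[i]) for row in sample) else "str"
        for i, name in enumerate(header)
    }


data = "id,label\n1,a\n2,b\nx7,c\n"
small_sample = infer_types(data, sample_size=2)  # never sees the 'x7' row
full_sample = infer_types(data, sample_size=3)   # sees every row
```

With two sample rows the `id` column looks numeric; with three it does not. Skipping detection altogether corresponds to treating every column as a string.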

compression: Optional. One of ‘None’, ‘GZIP’, ‘ZSTD’

When not set, the compression is detected automatically; otherwise, the value must be one of 'None', 'GZIP' or 'ZSTD'.
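This page does not say how the automatic detection works. One common approach, sketched here as an assumption, is to inspect the first bytes of a file for the well-known gzip and Zstandard magic numbers. `sniff_compression` is a hypothetical helper.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"          # first two bytes of a gzip stream
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # first four bytes of a Zstandard frame


def sniff_compression(head: bytes) -> str:
    """Return 'GZIP', 'ZSTD' or 'None' based on leading magic bytes."""
    if head.startswith(GZIP_MAGIC):
        return "GZIP"
    if head.startswith(ZSTD_MAGIC):
        return "ZSTD"
    return "None"


plain = b"a,b\n1,2\n"
packed = gzip.compress(plain)

detected_plain = sniff_compression(plain)
detected_packed = sniff_compression(packed)
```

The return values mirror the literals accepted by the compression parameter.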

include_filename: Optional bool

Adds an additional column whose value is the file name.

skip_top_lines: Optional int

The number of lines at the top of the file to skip.
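Skipping top lines is useful when a file starts with comments or export metadata before the header. The sketch below shows the idea with the standard library only; the sample content and the `SKIP_TOP_LINES` constant are illustrative.

```python
import csv
import io
from itertools import islice

raw = (
    "# exported by a logger\n"
    "# source: sensor A\n"
    "ts,value\n"
    "0,1.5\n"
    "1,1.7\n"
)

SKIP_TOP_LINES = 2  # illustrative value: two preamble lines precede the header

# Drop the first lines, then parse what remains as ordinary CSV.
remaining = islice(io.StringIO(raw), SKIP_TOP_LINES, None)
rows = list(csv.reader(remaining))
```

After skipping, the first parsed row is the header line.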

Notes

The following table lists the format specifiers accepted by date_format and timestamp_format:

Specifier  Description
---------  -----------
%a         Abbreviated weekday name.
%A         Full weekday name.
%w         Weekday as a decimal number.
%d         Day of the month as a zero-padded decimal.
%-d        Day of the month as a decimal number.
%b         Abbreviated month name.
%B         Full month name.
%m         Month as a zero-padded decimal number.
%-m        Month as a decimal number.
%y         Year without century as a zero-padded decimal number.
%-y        Year without century as a decimal number.
%Y         Year with century as a decimal number.
%H         Hour (24-hour clock) as a zero-padded decimal number.
%-H        Hour (24-hour clock) as a decimal number.
%I         Hour (12-hour clock) as a zero-padded decimal number.
%-I        Hour (12-hour clock) as a decimal number.
%p         Locale’s AM or PM.
%M         Minute as a zero-padded decimal number.
%-M        Minute as a decimal number.
%S         Second as a zero-padded decimal number.
%-S        Second as a decimal number.
%g         Millisecond as a decimal number, zero-padded on the left.
%f         Microsecond as a decimal number, zero-padded on the left.
%z         UTC offset in the form +HHMM or -HHMM.
%Z         Time zone name.
%j         Day of the year as a zero-padded decimal number.
%-j        Day of the year as a decimal number.
%U         Week number of the year (Sunday as the first day of the week).
%W         Week number of the year (Monday as the first day of the week).
%c         ISO date and time representation.
%x         ISO date representation.
%X         ISO time representation.
%%         A literal ‘%’ character.
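Many of these specifiers share their letters with Python's datetime.strptime, so a format string can be sanity-checked locally before it is passed to from_csv. This is only an approximation: Python's strptime does not accept every specifier listed above (for example %g and the dash-prefixed variants), and identical behaviour for the shared ones is an assumption.

```python
from datetime import datetime

# Candidate format strings, using only specifiers Python also understands.
date_format = "%d/%m/%Y"
timestamp_format = "%Y-%m-%d %H:%M:%S.%f %z"

parsed_date = datetime.strptime("07/03/2024", date_format).date()
parsed_ts = datetime.strptime("2024-03-07 14:05:09.250000 +0100", timestamp_format)
```

If strptime raises ValueError for a sample value, the format string most likely does not describe the data.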

Examples

>>> df = session.from_csv("my_data.csv")
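A call with explicit parsing options might look like the sketch below. The keyword arguments come from the signature above; `load_semicolon_csv` is a hypothetical wrapper, `session` is assumed to be a SandBox instance as in the example, and the chosen option values are illustrative only.

```python
from pathlib import Path


def load_semicolon_csv(session, path):
    """Load a semicolon-separated file with explicit options (hypothetical wrapper)."""
    return session.from_csv(
        Path(path),
        auto_detect=False,       # rely on the explicit options below
        delimiter=";",
        quote='"',
        has_header=True,
        null_string="NA",        # illustrative NULL marker
        date_format="%d/%m/%Y",
        include_filename=True,
    )
```

All options after the path are keyword-only in the signature, so they must be passed by name.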