shapelets.SandBox.from_csv
- SandBox.from_csv(path: Union[str, Path], *, auto_detect: Optional[bool] = True, hive_partitioning: Optional[bool] = False, delimiter: Optional[str] = None, quote: Optional[str] = None, escape: Optional[str] = None, has_header: Optional[bool] = None, null_string: Optional[str] = None, date_format: Optional[str] = None, timestamp_format: Optional[str] = None, sample_size: Optional[int] = None, skip_detection: Optional[bool] = None, compression: Optional[Literal['None', 'GZIP', 'ZSTD']] = None, include_filename: Optional[bool] = None, skip_top_lines: Optional[int] = None) → DataSet
Imports one or more CSV files as a Relation.
All files must have the same schema.
- Parameters:
- path: str or Path, required
Path to the files to load. It accepts either a single string or a Path object.
Use a string value when using wildcards (*) in your path to match a directory tree structure. These paths may contain references to environment variables ($var or ${var}) and home directory expressions (~).
Use Path objects to specify valid, resolvable paths.
Paths, whether given as strings or as Path objects, should include the file pattern to load (e.g., *.csv).
- auto_detect: Optional bool; defaults to True
When set to True, delimiter, quote, escape and header options will be automatically inferred from the files.
- hive_partitioning: Optional bool; defaults to False
Set to True when the path contains Hive partitioning expressions that should be incorporated into the loaded dataset.
- delimiter: Optional str
Specifies the string that separates columns within each line of the file.
- quote: Optional str
Specifies the quoting string to be used when a data value is quoted.
- escape: Optional str
Specifies the string that should appear before a data character sequence that matches the quote value. The default is the same as the quote value (so that the quoting string is doubled if it appears in the data).
- has_header: Optional bool
Specifies that the file contains a header line with the names of each column in the file.
- null_string: Optional str
Specifies the string that represents a NULL value. The default is an empty string.
- date_format: Optional str
Specifies the date format to use when parsing dates. See the notes section.
- timestamp_format: Optional str
Specifies the timestamp format to use when parsing timestamps. See the notes section.
- sample_size: Optional int
Number of sample rows to use for automatic CSV type detection.
- skip_detection: Optional bool
When set to True, skips type detection during CSV parsing and treats all columns as strings.
- compression: Optional; one of 'None', 'GZIP' or 'ZSTD'
When not set, the compression is detected automatically; otherwise, the value should be one of 'None', 'GZIP' or 'ZSTD'.
- include_filename: Optional bool
Adds an additional column whose value is the file name.
- skip_top_lines: Optional int
The number of lines at the top of the file to skip.
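The parsing options above (delimiter, quote, escape, has_header, null_string and skip_top_lines) can be illustrated with Python's standard csv module. This is only a sketch of what the options mean, not how from_csv is implemented; the sample text, the read_rows helper and the "NA" null marker are hypothetical.

```python
import csv

# Hypothetical file content: two lines to skip, a header, then two data rows.
RAW = (
    "exported by some tool\n"
    "\n"
    "id;name;comment\n"
    '1;Ada;"said ""hello"""\n'
    "2;Bob;NA\n"
)


def read_rows(text, *, delimiter=";", quote='"', null_string="NA",
              skip_top_lines=2, has_header=True):
    """Parse CSV text with the stdlib csv module, mimicking the options above."""
    # skip_top_lines: drop the leading lines before parsing starts.
    lines = text.splitlines(keepends=True)[skip_top_lines:]
    # doublequote=True matches the documented default for escape:
    # a quote character inside a quoted value is written twice.
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quote,
                        doublequote=True)
    rows = list(reader)
    # has_header: the first remaining line holds the column names.
    header = rows.pop(0) if has_header else None
    # null_string: values equal to the marker become None.
    rows = [[None if value == null_string else value for value in row]
            for row in rows]
    return header, rows


header, rows = read_rows(RAW)
```

Here the doubled quote in the first data row collapses to a single quote character, and the "NA" marker in the second row is mapped to a null value.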
Notes
The following table lists the format specifiers accepted by date_format and timestamp_format:
| Specifier | Description |
| --- | --- |
| %a | Abbreviated weekday name. |
| %A | Full weekday name. |
| %w | Weekday as a decimal number. |
| %d | Day of the month as a zero-padded decimal. |
| %-d | Day of the month as a decimal number. |
| %b | Abbreviated month name. |
| %B | Full month name. |
| %m | Month as a zero-padded decimal number. |
| %-m | Month as a decimal number. |
| %y | Year without century as a zero-padded decimal number. |
| %-y | Year without century as a decimal number. |
| %Y | Year with century as a decimal number. |
| %H | Hour (24-hour clock) as a zero-padded decimal number. |
| %-H | Hour (24-hour clock) as a decimal number. |
| %I | Hour (12-hour clock) as a zero-padded decimal number. |
| %-I | Hour (12-hour clock) as a decimal number. |
| %p | Locale’s AM or PM. |
| %M | Minute as a zero-padded decimal number. |
| %-M | Minute as a decimal number. |
| %S | Second as a zero-padded decimal number. |
| %-S | Second as a decimal number. |
| %g | Millisecond as a decimal number, zero-padded on the left. |
| %f | Microsecond as a decimal number, zero-padded on the left. |
| %z | UTC offset in the form +HHMM or -HHMM. |
| %Z | Time zone name. |
| %j | Day of the year as a zero-padded decimal number. |
| %-j | Day of the year as a decimal number. |
| %U | Week number of the year (Sunday as the first day of the week). |
| %W | Week number of the year (Monday as the first day of the week). |
| %c | ISO date and time representation. |
| %x | ISO date representation. |
| %X | ISO time representation. |
| %% | A literal ‘%’ character. |
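Several of these specifiers (%Y, %m, %d, %H, %M, %S) have the same meaning in Python's datetime.strptime, so a candidate format string can be checked against a sample value before it is passed as date_format or timestamp_format. This is a sketch under that assumption: Python's parser is a separate implementation, and specifiers such as %-d or %g from the table may behave differently or be unsupported there. The sample values are hypothetical.

```python
from datetime import date, datetime

# Candidate format strings, built only from specifiers listed in the table.
DATE_FORMAT = "%d/%m/%Y"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

# Parse one sample value with each format to confirm the pattern fits the data.
parsed_date = datetime.strptime("25/12/2021", DATE_FORMAT).date()
parsed_ts = datetime.strptime("2021-12-25 18:30:05", TIMESTAMP_FORMAT)
```

If a sample value does not match its format, strptime raises ValueError, which makes a mismatch easy to spot before loading a large set of files.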
Examples
>>> df = session.from_csv("my_data.csv")
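To preview which files a wildcard string such as "${DATA_DIR}/*.csv" would cover, the expansion of environment variables, home directory expressions and wildcards described for the path parameter can be approximated with the standard library. This is a sketch only and does not claim to reproduce how from_csv resolves paths; the expand_pattern helper, the DATA_DIR variable and the file names are hypothetical.

```python
import glob
import os
import tempfile


def expand_pattern(pattern):
    """Expand $var / ${var} and ~, then return the files matching the wildcard."""
    expanded = os.path.expanduser(os.path.expandvars(pattern))
    return sorted(glob.glob(expanded))


with tempfile.TemporaryDirectory() as tmp:
    # Create two CSV files and one unrelated file in a scratch directory.
    for name in ("a.csv", "b.csv", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("id,value\n1,2\n")

    # Hypothetical environment variable pointing at the data directory.
    os.environ["DATA_DIR"] = tmp

    # Only the files that match the *.csv pattern are returned.
    matches = [os.path.basename(p) for p in expand_pattern("${DATA_DIR}/*.csv")]
```

The same pattern string could then be passed as the path argument, for example session.from_csv("${DATA_DIR}/*.csv"), given that all matched files share one schema.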