Dataset

Learn how to upload datasets using the Python SDK with pandas DataFrames, CSV files, or Excel sheets.

The Dataset class in the Python SDK provides methods to upload data from pandas DataFrames, CSV files, and Excel workbooks. Choose the upload method that best matches your data format and workflow.

DataFrame Upload

Upload pandas DataFrames directly without saving to disk

CSV Upload

Upload CSV files from your local file system

Excel Upload

Upload specific sheets from Excel workbooks

SDK Upload Methods

upload_dataframe(dataframe: DataFrame, name: str) -> Dataset | None

Upload a pandas DataFrame directly using the SDK client.

Parameters:

  • dataframe: pandas.DataFrame to upload
  • name: User-friendly dataset name

Returns:

  • Dataset object on success (with id populated)
  • None on failure

Example:

import pandas as pd
from fount import Fount

# Initialize SDK client
client = Fount(api_key="your-api-key")

# Create or load your DataFrame
df = pd.read_csv('local_data.csv')

# Upload using SDK
dataset = client.upload_dataframe(df, name="Sales Data")

if dataset:
    print(f"Upload successful! Dataset ID: {dataset.id}")

Use Cases:

  • When you've already processed data in pandas
  • For dynamic data generation in your Python scripts
  • When working with in-memory transformations

SDK Implementation Notes

Error Handling

All SDK upload methods return None on failure. Always check the return value before proceeding:

# Good practice: Check return value
pathname = "my_data.csv"  # path to your local CSV file
dataset = client.upload_csv(pathname, name="My Data")
if dataset is None:
    print("Upload failed - check your file path and permissions")
    # Handle error appropriately
else:
    print(f"Success! Dataset ID: {dataset.id}")
    # Continue with dataset operations
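
One way to make the None check hard to forget is to turn it into an exception at the call site. The sketch below is an illustration only and is not part of the SDK: UploadError and require_dataset are hypothetical names, and the only assumption about the SDK is the behavior stated above, that upload methods return None on failure.

```python
class UploadError(RuntimeError):
    """Hypothetical exception: an upload call returned None."""


def require_dataset(result, source):
    """Return `result` unchanged, or raise if the upload returned None.

    `result` is whatever an upload method returned; `source` is a label
    (file path or dataset name) used only in the error message.
    """
    if result is None:
        raise UploadError(f"Upload of {source!r} failed: no Dataset was returned")
    return result


# Usage (assumes `client` is an initialized SDK client):
# dataset = require_dataset(client.upload_csv("data.csv", name="My Data"), "data.csv")
# print(dataset.id)
```

With this pattern a failed upload stops the script immediately instead of surfacing later as an AttributeError on None.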

Dataset Object

The SDK returns a Dataset object containing:

  • id: Unique identifier for the uploaded dataset
  • name: The friendly name you provided
  • Additional metadata about the upload

You'll use the id field for subsequent SDK operations like querying or updating the dataset:

# Use dataset.id for further operations
results = client.query_dataset(dataset.id)

SDK Best Practices

  • Client Initialization: Always initialize the SDK client with proper authentication
  • Naming: Use descriptive names that clearly identify your data
  • File Paths: Use absolute paths or ensure relative paths are correct
  • Memory Management: For very large files, prefer upload_csv() over upload_dataframe() to avoid memory issues
  • Excel Sheets: Verify sheet names exactly match (case-sensitive)
  • Error Handling: Implement proper error handling for production code

Troubleshooting

upload_csv

Upload failed - check your file path and permissions / dataset is None

The CSV upload did not create a Dataset object. The method returns None on failure.

Common causes: wrong local file path, file not readable, or using a relative path from the wrong working directory.

Fix:

from pathlib import Path
print(Path("my_data.csv").resolve())  # check absolute path
dataset = client.upload_csv("my_data.csv", name="My Data")
if dataset is None:
    print("Upload failed — check path and file permissions")

FileNotFoundError: [Errno 2] No such file or directory

The SDK cannot find the file at the given path.

Fix: Use an absolute path or verify the relative path from os.getcwd():

import os
print(os.getcwd())  # confirm working directory
dataset = client.upload_csv("/absolute/path/to/data.csv", name="My Data")
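
The path checks described in the two entries above can be combined into a single pre-flight step. This is a minimal sketch using only the standard library; resolve_readable_file is a hypothetical helper name, not an SDK function, and it only inspects the local file system.

```python
import os
from pathlib import Path


def resolve_readable_file(path):
    """Return the absolute Path for `path`, or raise with a specific reason."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.exists():
        # Include the working directory, since relative paths resolve against it
        raise FileNotFoundError(f"No such file: {resolved} (cwd is {Path.cwd()})")
    if not resolved.is_file():
        raise ValueError(f"Not a regular file: {resolved}")
    if not os.access(resolved, os.R_OK):
        raise PermissionError(f"File is not readable: {resolved}")
    return resolved


# Usage (assumes `client` is an initialized SDK client):
# csv_path = resolve_readable_file("my_data.csv")
# dataset = client.upload_csv(str(csv_path), name="My Data")
```

Running the check before the upload separates local path problems from failures that happen during the upload itself.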

ParserError: Error tokenizing data / Expected X fields in line Y, saw Z

The CSV file is malformed (for example, some rows have more fields than the header) or uses a delimiter other than the comma that pandas expects by default.

Fix: Verify the file opens cleanly in pandas before uploading:

import pandas as pd
df = pd.read_csv("my_data.csv")  # must succeed before upload
print(df.head())

Re-save the file as a clean UTF-8 CSV if needed.
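
To locate the rows behind an "Expected X fields in line Y, saw Z" message, it can help to compare each row's field count against the header. The following is a rough sketch built on the standard csv module; find_ragged_rows is a hypothetical helper, and the delimiter and encoding defaults are assumptions to adjust for your file.

```python
import csv


def find_ragged_rows(path, delimiter=",", encoding="utf-8"):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    ragged = []
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            return ragged  # empty file: nothing to compare against
        expected = len(header)
        for row in reader:
            # Blank lines come back as empty lists and are skipped here.
            # line_num is the last physical line read for this row.
            if row and len(row) != expected:
                ragged.append((reader.line_num, len(row)))
    return ragged


# Usage:
# for line_number, count in find_ragged_rows("my_data.csv"):
#     print(f"line {line_number}: {count} fields")
```

If every row is reported as ragged, the delimiter is probably wrong; try passing delimiter=";" or delimiter="\t".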

upload_dataframe

AttributeError: object has no attribute 'columns' / dataframe must be a pandas DataFrame

The object passed as dataframe is not a valid pandas DataFrame.

Fix: Load the data into pandas first:

import pandas as pd
df = pd.read_csv("my_data.csv")
assert isinstance(df, pd.DataFrame)
dataset = client.upload_dataframe(df, name="My Data")

Object of type ... is not JSON serializable / Upload failed

The DataFrame contains complex Python objects (lists, dicts, arrays, or mixed nested types) that cannot be serialized.

Fix: Check df.dtypes and convert object columns to strings, numerics, or datetimes before uploading:

print(df.dtypes)
# Convert problematic columns
df["my_col"] = df["my_col"].astype(str)

MemoryError / Kernel died / Upload failed for large dataframe

The in-memory DataFrame is too large for the upload path to handle.

Fix: Use upload_csv() for large files instead:

# Check memory usage first
df.info(memory_usage="deep")
# Use file-based upload for large data
dataset = client.upload_csv("/path/to/large_file.csv", name="Large Data")

upload_excel

Worksheet named '...' not found

The sheet name passed does not exist in the workbook (sheet names are case-sensitive).

Fix: List available sheet names and copy the exact value:

import pandas as pd
print(pd.ExcelFile("my_workbook.xlsx").sheet_names)
# Then use the exact name:
dataset = client.upload_excel("my_workbook.xlsx", sheet_name="Sheet1", name="My Data")

Unsupported format / Excel file format cannot be determined / BadZipFile

The file is not a valid Excel workbook (e.g., it is a CSV renamed to .xlsx, corrupted, or password-protected).

Fix: Open the file locally to confirm it is a real .xlsx workbook. If it is a CSV, use upload_csv() instead:

dataset = client.upload_csv("my_data.csv", name="My Data")
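
A quick programmatic check is also possible, because .xlsx workbooks are ZIP archives. The sketch below uses only the standard zipfile module; looks_like_xlsx is a hypothetical helper name, and the check for an xl/workbook.xml entry relies on the conventional layout of .xlsx files rather than on anything specific to the SDK.

```python
import zipfile


def looks_like_xlsx(path):
    """Return True if `path` is a ZIP archive that contains a workbook part."""
    if not zipfile.is_zipfile(path):
        # Typical for a renamed CSV, a legacy .xls file, or an encrypted workbook
        return False
    with zipfile.ZipFile(path) as archive:
        # Conventional location of the workbook part in .xlsx files
        return "xl/workbook.xml" in archive.namelist()


# Usage:
# if not looks_like_xlsx("my_workbook.xlsx"):
#     print("Not a readable .xlsx workbook; if it is a CSV, use upload_csv()")
```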

All Upload Methods

EmptyDataError / No columns to parse from file / Uploaded dataset has 0 rows

The dataset is empty or could not be parsed into rows and columns.

Fix: Verify the data before uploading:

print(df.shape)   # must be (rows > 0, cols > 0)
print(df.head())
print(df.columns.tolist())

ValidationError: name field required / name must be a valid string

The name parameter is missing, None, blank, or not a string.

Fix: Always pass a descriptive non-empty string:

dataset = client.upload_dataframe(df, name="q4_sales_weekly")
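
When the name comes from user input or a config file, it can be validated locally before the upload call. This is a small sketch under the rule stated above (the name must be a non-blank string); clean_dataset_name is a hypothetical helper, and any further server-side naming rules are not covered.

```python
def clean_dataset_name(name):
    """Return `name` with surrounding whitespace removed, or raise if it is unusable."""
    if not isinstance(name, str):
        raise TypeError(f"name must be a string, got {type(name).__name__}")
    stripped = name.strip()
    if not stripped:
        raise ValueError("name must not be empty or whitespace only")
    return stripped


# Usage (assumes `client` and `df` already exist):
# dataset = client.upload_dataframe(df, name=clean_dataset_name(raw_name))
```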

Dataset not found / 404 / NoneType object has no attribute 'id'

The dataset reference used in a downstream call (train(), tune(), inference()) is invalid.

Fix: Confirm that the upload returned a Dataset before using dataset.id in downstream calls:

dataset = client.upload_dataframe(df, name="My Data")
if dataset is None:
    raise RuntimeError("Upload failed: no Dataset was returned")
print(dataset.id)  # valid ID to pass to downstream calls

Duplicate column names / columns overlap / ambiguous column selection

The uploaded dataset has repeated column headers, which makes column selection in training/tuning ambiguous.

Fix: Rename columns so every column is unique before uploading:

print(df.columns[df.columns.duplicated()].tolist())  # find duplicates
df.columns = [f"{c}_{i}" if df.columns.tolist().count(c) > 1 else c
              for i, c in enumerate(df.columns)]
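
An alternative renaming scheme keeps the first occurrence of each name unchanged and numbers only the repeats, while also avoiding collisions with suffixed names that already exist (for example, columns a, a, a_1). The helper below is a sketch with a hypothetical name, make_unique; it works on any sequence of column labels and does not require pandas.

```python
def make_unique(names):
    """Return a list of names where repeats get _1, _2, ... suffixes."""
    seen = set()
    counts = {}
    result = []
    for name in names:
        candidate = name
        n = counts.get(name, 0)
        # Keep incrementing the suffix until the candidate is unused
        while candidate in seen:
            n += 1
            candidate = f"{name}_{n}"
        counts[name] = n
        seen.add(candidate)
        result.append(candidate)
    return result


# Usage (assumes `df` is a pandas DataFrame):
# df.columns = make_unique(df.columns)
```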