
How to Create a Dataset

Overview

Any kind of data, in any schema, can be pushed into the Narrative Data Collaboration Platform as a dataset, exactly as it is stored in your own system. The platform supports multiple file formats, but we recommend Parquet for the best performance and efficiency.

Creating a dataset defines a container for your raw data. Once the container exists, you can add data to it.

Follow these instructions to make your data available for use in My Data.

New Dataset

Navigate to My Data.
Click "New Dataset" to create a new dataset.

1. Dataset Upload Process

Set a name and upload a sample file that represents the schema and values of the future (larger) file. The sample should be smaller than 10 MB.
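One way to produce a small sample from a larger line-oriented file (CSV or JSON Lines) is to copy whole lines from the top until a byte budget is reached. The sketch below is only an illustration: the function name `write_sample` is hypothetical, and it assumes that "10MB" means 10,000,000 bytes and that no CSV field contains an embedded newline.

```python
MAX_SAMPLE_BYTES = 10 * 1000 * 1000  # assumption: "10MB" read as 10,000,000 bytes


def write_sample(source_path, sample_path, max_bytes=MAX_SAMPLE_BYTES):
    """Copy whole lines from source_path while the sample stays below max_bytes.

    Hypothetical helper, not part of any Narrative tooling. The first line
    (for example, a CSV header) is always copied first, so the sample keeps
    the same columns as the full file.
    """
    written = 0
    with open(source_path, "rb") as src, open(sample_path, "wb") as dst:
        for line in src:
            # Stop before the sample would reach the size limit.
            if written + len(line) >= max_bytes:
                break
            dst.write(line)
            written += len(line)
    return written
```

Because the sample is taken from the top of the file, check that those first rows are representative of the values in the rest of the data.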

Supported Data Types

The Narrative platform supports the following file types:

Parquet (Recommended)
CSV and other delimited text files (for example, pipe- or tab-delimited)
JSON Lines (also known as newline-delimited JSON)

Why Parquet?

We prefer Parquet because it is a columnar storage file format that provides efficient data compression and encoding schemes. This results in better performance for both storage and querying.

Parquet supports rich data types, including nested structures, arrays, and maps. This allows for more complex data representations compared to the flat structures of CSV files.
Parquet enforces data types and schemas, reducing the likelihood of data inconsistencies. This ensures that data ingested into Narrative maintains integrity and is reliable for downstream use.

Comparing Parquet with CSV and JSON Lines:

CSV Files:
- Pros: Simple and human-readable.
- Cons: Larger file sizes, no support for complex nested data, and inefficient for large-scale data processing.
JSON Lines Files:
- Pros: Supports nested data structures, simple to process line by line, and is semi-structured.
- Cons: Larger file sizes due to verbose syntax compared to Parquet, slower parsing, and not optimized for columnar operations.
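The "verbose syntax" point can be seen with a small experiment: JSON Lines repeats every field name on every line, while CSV states the names once in the header. The sketch below serializes the same two records both ways using only the Python standard library. The helper names are hypothetical, and Parquet is not included because writing it requires a third-party library.

```python
import csv
import io
import json

# Same records as the JSON examples in this article.
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]


def to_csv(rows):
    """Serialize rows as CSV text with a single header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_lines(rows):
    """Serialize rows as JSON Lines text: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


csv_size = len(to_csv(records).encode("utf-8"))
jsonl_size = len(to_json_lines(records).encode("utf-8"))
print(f"CSV: {csv_size} bytes, JSON Lines: {jsonl_size} bytes")
```

On flat records like these, the JSON Lines output is larger because of the repeated keys and quoting. The trade-off is that JSON Lines can carry nested values that CSV cannot.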

Difference Between JSON and JSON Lines:

JSON: Typically represents data as a single JSON object or an array of objects. Standard JSON files are hard to process efficiently at scale, because most parsers must read the entire file into memory before any record can be used.
Example of standard JSON:
```
[
  {"name": "Alice", "age": 30},
  {"name": "Bob", "age": 25}
]
```
JSON Lines: Also known as newline-delimited JSON, this format represents data as individual JSON objects separated by newlines. Each line is a complete, valid JSON object, so large files can be processed one line at a time.
Example of JSON Lines:
```
{"name": "Alice", "age": 30}
{"name": "Bob", "age": 25}
```

Important: When uploading JSON data to Narrative, ensure the file is in JSON Lines format (one JSON object per line). Standard JSON files that wrap records in an array of objects or in other nested structures are not accepted. Read more about JSON Lines here: https://jsonlines.org/.
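If your data currently lives in a standard JSON file containing a top-level array, it can be rewritten as JSON Lines with the Python standard library. This is a minimal sketch: the function name `json_array_to_json_lines` is hypothetical, and it loads the whole array into memory, so it suits files that fit comfortably in RAM.

```python
import json


def json_array_to_json_lines(array_path, jsonl_path):
    """Rewrite a JSON file holding a top-level array as JSON Lines.

    Hypothetical helper for illustration. Returns the number of records written.
    """
    with open(array_path, encoding="utf-8") as src:
        records = json.load(src)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    with open(jsonl_path, "w", encoding="utf-8", newline="\n") as dst:
        for record in records:
            # json.dumps without indent emits one line per record; newlines
            # inside string values are escaped, so each record stays on one line.
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

For example, `json_array_to_json_lines("people.json", "people.jsonl")` would turn the standard JSON example above into the JSON Lines example (the file names are placeholders).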

2. Dataset Details

Once your dataset is ingested, the new dataset appears at the top of your dataset list in My Data.

Below are the available metadata options for each dataset. Some metadata will have placeholders which you can update:

Name: The name of the dataset.
Description: Provide information about the dataset.
Schema:
- Required: Specify if a field must contain a value for a record to be added to your dataset.
- Queryable: Determine if this field can be made available to buyers.
  - Marking a field as not queryable will ensure that this field cannot be transmitted to other parties on the Data Collaboration Platform.
- Sensitive: Decide if data in this field should be redacted when displayed in sample UIs and APIs.
  - Marking a field as sensitive ensures that the data in this field is never shown to users within the Narrative platform unless that user purchases and exports it. To mark a field as sensitive, select the toggle when you first upload the dataset.
  - Unlike data in fields that are not queryable, data in sensitive fields is delivered in its original format when purchased.
File Type: Inferred from the sample upload. For best performance, we recommend using Parquet.
Write Mode: Determines how new data uploaded to your dataset is treated. Two options are supported:
- append: Use if new data should be added to your dataset.
- overwrite: Use if new data should replace your entire dataset.
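The effect of the Required flag and the two write modes can be pictured with a small in-memory model. This is purely illustrative and not how the platform is implemented. The function `apply_upload` is hypothetical, and it assumes that a record lacking a value (missing or `None`) for any required field is skipped rather than causing the whole upload to fail.

```python
def apply_upload(existing, incoming, write_mode, required_fields=()):
    """Return the dataset rows after an upload, under the assumptions above.

    existing and incoming are lists of dicts. Neither list is modified.
    """
    # Assumption: a record is added only if every required field has a value.
    accepted = [
        row for row in incoming
        if all(row.get(field) is not None for field in required_fields)
    ]
    if write_mode == "append":
        return existing + accepted   # new data is added to the dataset
    if write_mode == "overwrite":
        return accepted              # new data replaces the entire dataset
    raise ValueError(f"unsupported write mode: {write_mode!r}")
```

With one existing row and two incoming rows, of which one lacks a required field, "append" yields two rows (the existing row plus the complete incoming row), while "overwrite" yields only the complete incoming row.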

Add Data

To add data to your dataset, either use the "Upload Files" button or configure an ingestion connector, such as the AWS S3 Connector.

Uploading Parquet or JSON Lines Files Manually:

Click "Upload Files".
Select your Parquet or JSON Lines files.
Confirm the upload and wait for the ingestion process to complete.
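Before uploading a JSON Lines file, it can help to check locally that every line is a standalone JSON object. The sketch below uses only the Python standard library. The function name `find_invalid_json_lines` is hypothetical, and treating blank lines as problems is an assumption made here, not a documented platform rule.

```python
import json


def find_invalid_json_lines(path):
    """Return (line_number, reason) pairs for lines that are not JSON objects."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                # Assumption: blank lines are reported rather than ignored.
                problems.append((number, "blank line"))
                continue
            try:
                value = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(value, dict):
                # For example, a whole array of records on a single line.
                problems.append((number, "not a JSON object"))
    return problems
```

An empty result means every line parsed as an object. A pretty-printed standard JSON file will typically be reported on its first line, which is a quick signal that the file needs converting to JSON Lines.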

Using AWS S3 Connector with Parquet or JSON Lines Files:

Ensure your data files in S3 are in Parquet or JSON Lines format for optimal performance.
Configure the connector to point to your data files in S3.

For more information on ingesting data via S3, see Setting Up a Managed S3 Bucket.