LIVESTREAMING

LOADING & QUERYING LARGE DATASETS

CATEGORY

Shapelets

CATEGORY

Livestreaming

DATE

21 December

TIME

5 Minutes

Introducing Shapelets, a platform for data scientists that allows you to digest and manipulate large datasets in the most efficient way.  

In our recent Livestream, we focused on data management and how Shapelets stands out compared to other tools. You can watch the live demonstration in the video above.

 

DATA INGESTION

Article by Clement Mercier

 

First, let us begin with a detailed explanation of how Shapelets champions data ingestion: 

1. Shapelets from a Jupyter Notebook

In our first example, we demonstrated Shapelets’ outstanding ingestion capabilities by using the New York Taxi dataset from 2009, which is 5.3 GB in size. To put this into perspective, a month’s worth of data is approximately 400 MB. We compared the computing time of a 400 MB dataset using Pandas and Polars:  

  • Pandas took 15 seconds to load the dataset and calculate the mean number of passengers.
  • Polars took 12 seconds.
  • The Shapelets platform completed the same calculation in just 3 seconds.

We then applied the same comparison to the full 5.3 GB dataset. Pandas and Polars were unable to compute it due to the file size. The Shapelets API, on the other hand, completed the computation in 9.72 seconds, which is faster than either of the other libraries managed on the smaller 400 MB file. This shows a significant improvement in terms of time and resources.
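A comparison like the one above can be reproduced with a small timing harness. The following is a minimal sketch, not the benchmark used in the Livestream: it assumes a CSV with a column named passenger_count (a hypothetical name; check the header of the real file), and it uses a tiny synthetic file so that it runs anywhere. The function names are illustrative. The Shapelets API is not shown because its calls are not described in this article.

```python
import csv
import os
import statistics
import tempfile
import time


def write_sample_csv(passenger_counts):
    """Write a tiny synthetic CSV and return its path (stand-in for a taxi file)."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", delete=False, newline=""
    ) as handle:
        writer = csv.writer(handle)
        writer.writerow(["passenger_count"])  # hypothetical column name
        writer.writerows([count] for count in passenger_counts)
        return handle.name


def mean_passengers_stdlib(path):
    """Compute the mean passenger count with the standard library only."""
    with open(path, newline="") as handle:
        return statistics.fmean(
            int(row["passenger_count"]) for row in csv.DictReader(handle)
        )


def mean_passengers_pandas(path):
    """Same calculation with Pandas, loading the whole file into memory."""
    import pandas as pd  # imported lazily so the harness also runs without pandas

    return float(pd.read_csv(path)["passenger_count"].mean())


def time_call(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


sample_path = write_sample_csv([1, 2, 3, 2])
mean_value, seconds = time_call(mean_passengers_stdlib, sample_path)
print(f"mean passengers: {mean_value}, elapsed: {seconds:.4f} s")
os.remove(sample_path)
```

To time Pandas on a real file, pass mean_passengers_pandas and the file path to time_call. A single run is noisy, so repeating each measurement several times gives a fairer comparison.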

 

Shapelets outperforms the other tools both in the resources it uses and in the time it takes to load and compute a dataset.

Although the syntax differs from that of those libraries, it is designed to be intuitive for data scientists. For example, with the Shapelets API, we created a sandbox (where we can query the data) for the same use case. We were able to mix and match any dataset in the organization (CSV, Parquet, Arrow) and then query the data with a simple, intuitive expression. To make our syntax comprehensible for everyone, we took our customers' feedback into account.

In terms of performance, we have been running studies in a formal test environment on the New York Taxi dataset, varying the file size, the query, and the library. For medium-sized files, the Shapelets API was faster than the rest in almost all cases, by a wide margin compared with Pandas and Dask. For larger files, most libraries were unable to complete the computation, and there was no significant difference between Spark and Shapelets.

Because we want to be the most competitive data management library, we’re working hard to improve our performance on large files.

2. Shapelets Sandbox playground  

In this example, we wanted to explore the Sandbox and the relationships created along the way.

Firstly, Sandbox generates a unified execution plan for your data regardless of its type. For example, we can quickly and clearly return the format of the data and the rows in the datasets, as well as return the schema and select a region of the data. Our free examples include a wide range of functions. 

The playground enables you to perform transformations on large files. Its key concept is that transformations are defined interactively, while the computation itself occurs only when the results are produced.

 

This means that if we create a new transformation from a previous one, all of the transformations are chained together and executed in one go, resulting in a very efficient computation.
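The idea of deferring computation until results are requested can be illustrated with a few lines of plain Python. This is a minimal sketch of lazy evaluation in general, not the Shapelets Sandbox implementation. The class name LazyPipeline and its methods are hypothetical and exist only for this illustration.

```python
class LazyPipeline:
    """Records transformations and applies them only when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable, e.g. a list or a range
        self._steps = tuple(steps)   # recorded ("map" | "filter", function) pairs

    def map(self, fn):
        # Return a new pipeline with one more step; nothing is computed here.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Same as map: only the plan grows, the data is untouched.
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Single pass over the data, applying every recorded step to each item.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []


def double(value):
    calls.append(value)  # track when the function actually runs
    return value * 2


pipeline = LazyPipeline(range(5)).map(double).filter(lambda value: value > 2)
print(calls)               # [] because no step has run yet
print(pipeline.collect())  # [4, 6, 8]
```

Because each step returns a new pipeline, an earlier pipeline can be reused as the starting point for several different chains without recomputing anything until collect() is called.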

 

3. Questions about Shapelets

What would be the first steps to start working with the data ingestion feature?  

First, Shapelets can be found on PyPI, where you will find all of the instructions for this feature, and there is also a Docker image. Shapelets currently supports Python 3.7, 3.8, 3.9, and 3.10. Support for Python 3.11 is in development and should be available very soon.
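Before installing, it can help to confirm that your interpreter is one of the versions listed above. The following is a small sketch using only the standard library. The names SUPPORTED_VERSIONS and is_supported are illustrative and are not part of Shapelets.

```python
import sys

# Versions listed in this article as supported at the time of writing.
SUPPORTED_VERSIONS = {(3, 7), (3, 8), (3, 9), (3, 10)}


def is_supported(version_info=None):
    """Return True if the (major, minor) version is in the supported set."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) in SUPPORTED_VERSIONS


current = ".".join(str(part) for part in sys.version_info[:2])
status = "supported" if is_supported() else "not listed as supported"
print(f"Python {current} is {status}")
```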

 

What can we expect from the next releases?    

We have been working on adding data streaming capabilities to the platform for the next release of the product.  

If you have sensors in an industrial or medical environment, you will soon be able to use Shapelets to collect all of their data and store it securely, where it will be directly available for you to use.

 

 

CONCLUSION

To summarize, Shapelets is a powerful platform for data scientists that allows you to efficiently manage and analyze large datasets.    

When compared to other tools like Pandas or Polars, our API saves time and resources, and its user-friendly syntax makes it simple to use.  

Shapelets’ capabilities prove that it is a highly competitive option for data management and processing. Fast, easy and intuitive. Shapelets is the solution for data teams in business. 

  

Contact us for a demo. To get started with Shapelets, please see our website and PyPI page for instructions and examples.
