Towards a universal scientific data importer
Dealing with scientific data is hard
How much time do you spend preparing your data for analysis? If the answer is more than 5 minutes – that's too long. Whether you're writing one-off scripts to parse a file, using a cloud-hosted set of instrument connectors, or just manually copying and pasting data, you're going to end up wasting valuable time. These tools might eventually solve your problem, but they require significant configuration and effort before they provide value. This leaves scientists with a difficult choice:
- Wait for others to help them use their data.
- Write some hacky code to clean up data themselves.
- Manually copy and paste data (typically from a spreadsheet) as an interim solution.
All come with a time cost that puts scientists – and their work – at a disadvantage.
At Sphinx, we are working to improve all steps in the analysis of scientific data. We started with democratizing data transformations and plot creation – enabling scientists to operate autonomously during data analysis. Our new system lets scientists describe their data in simple terms, so that data import is no longer a bottleneck for analysis creation. This feature has entered Open Beta and is ready for all Sphinx users (and if you aren’t a Sphinx user, sign up here).
Focus on value, avoid the hype
In our past posts we’ve highlighted value-creating ways to parse data from files (see our post on extracting data from spreadsheets). The downside is that existing tools require coding expertise and a strong understanding of abstract theory – exactly what we don’t want to burden scientists with. Further, these tools can produce incorrect or brittle results, operate only in limited use cases, and give the user no way to intervene and make corrections during data import.
At Sphinx Bio we want to make experiences that feel magical to our users. This is why we decided to create an assistant on top of our existing data import tool that performs a number of actions on the user’s behalf. We took the best parts of generalist LLM tools and made them functional for users during data import. If the assistant is wrong, there is a fallback where the user can manually define the data, and if it is right it saves an enormous amount of time.
We call this assistant Metis, after the Greek titan known for wisdom and thought – the same traits we want to empower in our users.
Immediate value during data import via Metis
This feature was made to be as simple as possible, only requiring that a user upload and describe a file. Our system will inspect the file’s contents and perform the best set of data transformations to create an Analysis-ready Dataset. After a user makes a request, they are provided with three things:
- A reply confirming that we have received the request and are working on it.
- An indicator of how and why we will transform the file.
- A single, merged table of the file’s contents in a tidy format (sketched below).
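To make “tidy format” concrete: a tidy table has one row per observation and one column per variable. Below is a minimal pandas sketch of the kind of reshaping this implies – purely illustrative, with invented column names and values, not Metis’s internal implementation:

```python
import pandas as pd

# A wide, plate-style fragment: one row per plate row, one column per plate column.
wide = pd.DataFrame({"row": ["A", "B"], "1": [0.12, 0.55], "2": [0.34, 0.71]})

# Melt into tidy form: one row per well, one column per measured variable.
tidy = wide.melt(id_vars="row", var_name="column", value_name="signal")
tidy["well"] = tidy["row"] + tidy["column"]
print(tidy[["well", "signal"]])
#   well  signal
# 0   A1    0.12
# 1   B1    0.55
# 2   A2    0.34
# 3   B2    0.71
```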
Let's take a look at three examples that showcase Metis's ability to handle diverse data import cases. Many more are possible with your own data!
Upload a tabular file
This is the simplest case, where a file is already tabular and needs to be uploaded to Sphinx. Previously, this would take a user about a minute. With Metis that is reduced to seconds, while remaining just as robust as a manual upload. An additional benefit of Metis is that the table can be located anywhere in the file – reducing the effort for a user to find and select the table.
“Help me import this data file to create a Dataset.”
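To illustrate what “located anywhere in the file” involves, here is a rough sketch of one way to find an embedded table: compute the bounding box of non-empty cells in a sheet and treat its first row as the header. The `find_table` helper is hypothetical and assumes a single table per sheet – it is not Metis’s actual logic:

```python
import pandas as pd

def find_table(path: str, sheet_name=0) -> pd.DataFrame:
    """Locate a single table embedded somewhere in a spreadsheet tab."""
    raw = pd.read_excel(path, sheet_name=sheet_name, header=None)
    mask = raw.notna()
    rows, cols = mask.any(axis=1), mask.any(axis=0)
    # Bounding box of non-empty cells (assumes the sheet is not entirely blank).
    top, bottom = rows.idxmax(), rows[::-1].idxmax()
    left, right = cols.idxmax(), cols[::-1].idxmax()
    block = raw.iloc[top : bottom + 1, left : right + 1]
    block.columns = block.iloc[0]          # first row of the block is the header
    return block.iloc[1:].reset_index(drop=True)
```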
Many screening plates across tabs
This is a case we see with many of our customers, where data for a sample are spread across many plate-formatted data regions. This can take scientists several minutes per plate to identify and configure, plus time to decide how to merge the data together. With Metis we can send the file and define the merge rules all at once, resulting in a Dataset in only 15 seconds (depending on the number of plates).
“This file has plates in multiple tabs. Join all the plates by the well to create a Dataset.”
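For intuition, here is a simplified sketch of the kind of merge this case calls for: reshape each plate-formatted tab into one row per well, then join the tabs on the well identifier. The file name, sheet layout, and `plate_to_long` helper are assumptions made for the example, not Metis internals:

```python
import string
import pandas as pd

def plate_to_long(plate: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Convert a plate grid (rows A, B, ... and columns 1, 2, ...) to one row per well."""
    plate = plate.copy()
    plate.index = list(string.ascii_uppercase[: len(plate)])     # A, B, C, ...
    plate.columns = [str(c + 1) for c in range(plate.shape[1])]  # 1, 2, 3, ...
    long = plate.stack().rename(value_name).reset_index()
    long["well"] = long["level_0"] + long["level_1"]             # e.g. "A1"
    return long[["well", value_name]]

# Read every tab (sheet name doubles as the measurement name), then join by well.
sheets = pd.read_excel("screen.xlsx", sheet_name=None, header=None)
tables = [plate_to_long(df, name) for name, df in sheets.items()]
dataset = tables[0]
for table in tables[1:]:
    dataset = dataset.merge(table, on="well")
```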
Heterogeneous data
This is the most complex case, where multiple data formats are present in the same file. Depending on the complexity, this can take users many minutes per tab to identify. With Metis we can identify the data regions far faster than a user – resulting in a Dataset in about 15 seconds (depending on the number of data regions).
“This file has multiple tabs that contain my data. Put all of the data into one Dataset.”
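To give a sense of what region identification involves, here is a simplified sketch – not Metis’s actual approach – that splits a sheet into candidate data regions wherever a fully blank row separates them:

```python
import pandas as pd

def split_regions(raw: pd.DataFrame) -> list[pd.DataFrame]:
    """Split a sheet (read with header=None) into regions delimited by blank rows."""
    blank = raw.isna().all(axis=1)
    regions, start = [], None
    for i, is_blank in enumerate(blank):
        if not is_blank and start is None:
            start = i                              # a new region begins
        elif is_blank and start is not None:
            regions.append(raw.iloc[start:i])      # the current region ends
            start = None
    if start is not None:
        regions.append(raw.iloc[start:])
    # Drop fully blank columns within each region.
    return [r.dropna(axis=1, how="all") for r in regions]
```

Real files are messier – regions can sit side by side, carry stray titles, or end in footnotes – which is exactly the gap Metis is built to close.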
Metis streamlines data import, offering significant time savings and enhanced Dataset quality. Users can review the results, inquire about the process, modify operations, or opt for manual import if desired. This approach not only accelerates work but also promotes 'right first time' outcomes, minimizing rework due to import errors.
The system's transparency allows for the creation of reusable templates, which is where its true power lies. By saving the import process as a template, Metis enables users to dynamically adjust and optimize the handling of future files. Templates can evolve over time, continually improving the import process. Using a template instead of a pure LLM-driven import addresses issues with LLM-generated results, such as hallucinations or non-deterministic outputs. This ensures consistency and interpretability across successive imports. By leveraging LLMs to expedite template creation, Metis strikes the optimal balance between speed, rigor, and reproducibility.
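As a sketch of what a saved template might look like – the schema below is invented for illustration, not Metis’s actual format – think of it as a declarative list of operations that the LLM proposes once and that later imports replay deterministically:

```python
import pandas as pd

# Hypothetical template: a declarative recipe proposed by the LLM during the
# first import, then replayed verbatim (no LLM in the loop) on future files.
template = [
    {"op": "read_sheet", "sheet": "Plate 1"},
    {"op": "melt", "id_vars": ["row"], "var_name": "column", "value_name": "signal"},
    {"op": "rename", "columns": {"signal": "absorbance"}},
]

def run_template(path: str, steps: list[dict]) -> pd.DataFrame:
    df = None
    for step in steps:
        if step["op"] == "read_sheet":
            df = pd.read_excel(path, sheet_name=step["sheet"])
        elif step["op"] == "melt":
            df = df.melt(id_vars=step["id_vars"],
                         var_name=step["var_name"],
                         value_name=step["value_name"])
        elif step["op"] == "rename":
            df = df.rename(columns=step["columns"])
    return df
```

Because the recipe itself is plain data, it can be inspected, versioned, and edited – which is what makes successive imports reproducible.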
Users estimate that Metis results in a time savings of up to 7 hours per week (going from 8 hours of importing data down to less than an hour) – effectively freeing up an entire workday for more important tasks.
Limitations
Placing Metis in Open Beta allows all users to experience the feature and provide feedback. Users can expect frequent improvements and updates as our team improves Metis.
Metis only supports using one file to create one Dataset. We currently don’t support cases where multiple files are used to create one or many Datasets; this is an extension we want to add in the future.
Conversational data import ends at the creation of a Dataset. It cannot edit existing Datasets or use a Dataset to create an Analysis. We see many users struggle with making their data ready for use in an Analysis, so we are focusing on data import first.
Data import is not aware of the larger context of your work and your data. We can’t suggest what to do with your specific data, and instead suggest general approaches and methods for handling data and parsing it from files. This method of import does not yet integrate with analysis functions presented in our past post Building a Declarative UX at Sphinx with LLMs.
Future direction
We are working to integrate all our past work into a comprehensive experience for users. See our posts on a similar approach for editing an Analysis here.
If you are excited by working on these kinds of problems and want to build better software for scientists, we’re hiring! You can see our latest code on Github at https://github.com/sphinxbio.
Reach out with any questions to hello@sphinxbio.com and thanks for reading.
Note: Science is a team sport – this work would not have been possible without the work of many others. Work by Microsoft (https://doi.org/10.48550/arXiv.2407.09025) on using heuristics to parse file content and by the Nielsen Norman Group on generative UI (https://www.nngroup.com/articles/generative-ui/) has inspired us to make our tools more powerful for users.