Show HN: An open-source tool that semantically profiles your data using LLMs

https://github.com/Cocoon-Data-Transformation/cocoon

Cocoon Logo

License: MIT

Building a chatbot for your data and pipelines is challenging because they are often too large (e.g., 1,000+ tables) to fit within the LLM context window. Cocoon addresses this by creating a RAG layer for your data and pipelines. With Cocoon's RAG layer, we offer a Cursor-style chatbot for your data tasks.

  • Live Demo for Data Warehouse RAG (enter your question and we will generate a response live): https://cocoon-data-transformation.github.io/page/database (video: https://youtu.be/xdmRXs0UnfE)
  • Live Demo for Data Pipeline RAG (enter your question and we will generate a response live): https://cocoon-data-transformation.github.io/page/pipeline (video: https://youtu.be/kv5mwTkpfY0)
  • Learn more about all the features: https://cocoon-data-transformation.github.io/page/
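To make the RAG layer above concrete, here is a minimal sketch of the underlying idea, not Cocoon's actual implementation: embed a short summary of each table's schema, then retrieve only the most relevant tables for a question before prompting the LLM. The helper names, the table summaries, the embedding model, and the legacy openai<1.0 client are all illustrative assumptions.

import numpy as np
import openai

openai.api_key = 'sk-...'  # placeholder key

def embed(text):
    """Sketch: embed a short schema summary with an OpenAI embedding model (legacy client)."""
    resp = openai.Embedding.create(model='text-embedding-ada-002', input=[text])
    return np.array(resp['data'][0]['embedding'])

# One short natural-language summary per table (e.g., generated once, offline).
table_summaries = {
    'patients': 'patients: id, name, birthdate, deathdate, gender',
    'encounters': 'encounters: id, patient_id, start, stop, reason',
    # ... imagine 1,000+ more tables here
}
table_vectors = {name: embed(s) for name, s in table_summaries.items()}

def retrieve_tables(question, k=3):
    """Return the k tables whose schema summaries best match the question."""
    q = embed(question)
    score = lambda v: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
    ranked = sorted(table_vectors, key=lambda name: score(table_vectors[name]), reverse=True)
    return ranked[:k]

# Only the retrieved schemas (not all 1,000+ tables) go into the LLM prompt.
# retrieve_tables("Which patients are still alive?")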


Get Started

  • 👉 Try this Google Colab notebook for Data Warehouse RAG: https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_Stage_Demo.ipynb
  • 👉 Try this Google Colab notebook for Data Pipeline RAG: https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_RAG_pipeline.ipynb

Cocoon is available on PyPI. Create a virtual environment and then:

pip install cocoon_data -U

To get started, you need to connect to

  • LLMs (e.g., GPT-4, Claude-3, Gemini-Ultra, or your local LLMs)
  • Data Warehouses (e.g., Snowflake, BigQuery, DuckDB, ...)

from cocoon_data import *
# if you use OpenAI GPT-4
openai.api_key = 'xycabc'
# if you use Snowflake
con = snowflake.connector.connect(...)
query_widget, cocoon_workflow = create_cocoon_workflow(con)
# a helper widget to query your data warehouse
query_widget.display()
# the main panel to interact with Cocoon
cocoon_workflow.start()

🎉 You should see the Cocoon workflow panel appear in your notebook.
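If your data is in DuckDB rather than Snowflake, the same workflow should carry over. Here is a minimal sketch, assuming create_cocoon_workflow accepts a DuckDB connection object the same way it accepts a Snowflake one (DuckDB is listed among the supported warehouses above); the database path is illustrative:

import duckdb
from cocoon_data import *

# Sketch only: assumes create_cocoon_workflow takes a DuckDB connection
# just like a Snowflake one; 'my_data.duckdb' is a placeholder path.
openai.api_key = 'xycabc'
con = duckdb.connect('my_data.duckdb')
query_widget, cocoon_workflow = create_cocoon_workflow(con)
query_widget.display()
cocoon_workflow.start()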

We also offer a browser UI (for the chat-over-RAG feature only). Simply run:

pip install cocoon_data -U
cocoon_data

You should see the Cocoon chat UI open in your browser.

The problem we solve is profiling tables: this is the initial step where you need to understand the table and identify any anomalies.

During the process, many small decisions require semantic understanding. For example, missing values are normal for 'deathdate' (still alive) but abnormal for 'name'. For outliers, 100 for ages is fine, but some are -1, which is impossible! We use LLMs to semantically understand your tables and detect anomalies.

You can try it by uploading a CSV, and we will email back the profile: https://cocoon-data-transformation.github.io/page/

Let me know your feedback. Thanks!
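As a rough illustration of the kind of check described above (a sketch only, not Cocoon's actual code), you can send a column's name and summary statistics to an LLM and ask whether its missing values and extremes are semantically plausible. The helper name, the prompt wording, and the legacy openai<1.0 client (matching the openai.api_key style used earlier) are all assumptions:

import json
import openai
import pandas as pd

openai.api_key = 'sk-...'  # placeholder key

def semantic_profile_column(df, column):
    """Sketch: ask an LLM whether a column's missing values and extremes look plausible."""
    col = df[column]
    summary = {
        'column_name': column,
        'dtype': str(col.dtype),
        'missing_ratio': round(float(col.isna().mean()), 3),
        'sample_values': col.dropna().astype(str).head(5).tolist(),
    }
    if pd.api.types.is_numeric_dtype(col):
        summary['min'] = float(col.min())
        summary['max'] = float(col.max())
    prompt = (
        "You are profiling a database table. Given this column summary, say whether "
        "the missing values and extreme values are semantically plausible "
        "(e.g., a missing 'deathdate' is fine, an age of -1 is not), and flag any anomalies:\n"
        + json.dumps(summary, indent=2)
    )
    resp = openai.ChatCompletion.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
    )
    return resp['choices'][0]['message']['content']

# Example usage:
# df = pd.read_csv('patients.csv')
# print(semantic_profile_column(df, 'age'))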