What is Snowflake Cortex?

Snowflake Cortex is a fully managed service from Snowflake that lets you analyze data and build AI applications entirely within Snowflake's secure environment [1]. Here's a breakdown of its key functionalities:

  • Large Language Models (LLMs): Cortex provides access to industry-leading LLMs through its LLM Functions. These functions let you use LLMs for tasks like answering questions about text, generating text from a prompt, translating between languages, and summarizing information, all from inside your data platform.

  • Machine Learning (ML) Functions: Cortex offers built-in ML functions that use SQL syntax. This makes it easier for data analysts, even those without extensive machine learning expertise, to perform tasks like anomaly detection and classification directly on their data in Snowflake.

  • Security and Scalability: Because Cortex functions within the Snowflake environment, it benefits from Snowflake's robust security features and scalability. This ensures your data remains secure while allowing you to handle large datasets efficiently.

Overall, Snowflake Cortex aims to bring the power of AI and machine learning to data analysis within the familiar Snowflake platform. It allows data analysts and business users to leverage cutting-edge AI models and functionalities without needing to become machine learning experts themselves.
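
To make the "LLM functions inside Snowflake" idea concrete, here is a minimal sketch that builds the SQL text for a translation query. It assumes the SNOWFLAKE.CORTEX.TRANSLATE(text, source_language, target_language) function form; the table support_tickets and column ticket_text are hypothetical names chosen for illustration, and the helper only produces a query string rather than connecting to Snowflake.

```python
def sql_literal(text: str) -> str:
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


def translate_query(table: str, column: str, source_lang: str, target_lang: str) -> str:
    """Build a query that translates one text column with Cortex.

    `table` and `column` are inserted as-is, so they must be trusted identifiers.
    """
    return (
        f"SELECT {column}, "
        f"SNOWFLAKE.CORTEX.TRANSLATE({column}, {sql_literal(source_lang)}, "
        f"{sql_literal(target_lang)}) AS translated "
        f"FROM {table}"
    )


# Hypothetical table and column names, for illustration only.
query = translate_query("support_tickets", "ticket_text", "de", "en")
print(query)
```

The resulting string can be run in a worksheet or passed to whichever Snowflake client you already use.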

Do I need to purchase a separate product / license to use Snowflake Cortex? How is it priced?

You can use all the LLM functions without any additional subscriptions or agreements. This applies even to free trial accounts.

  • Pay-per-use model: You only pay for what you use. Snowflake credits are used to cover the cost of LLM functions. These functions are priced based on the number of tokens processed for each task.
  • Transparent pricing: Snowflake's documentation provides a clear table showing the cost per 1 million tokens for each LLM function, so you can easily estimate your usage costs.
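
The per-token pricing above reduces to simple arithmetic. The sketch below is an estimate only: the rate of 0.5 credits per 1 million tokens is a made-up placeholder, not a real Snowflake price, and it assumes that both input and output tokens are billed for the function you are using. Look up the actual rate for each function and model in Snowflake's pricing table.

```python
def estimate_credits(input_tokens: int, output_tokens: int,
                     credits_per_million_tokens: float) -> float:
    """Estimate credits consumed, assuming input and output tokens are both billed."""
    total_tokens = input_tokens + output_tokens
    return total_tokens / 1_000_000 * credits_per_million_tokens


# Placeholder rate for illustration; replace with the documented rate.
HYPOTHETICAL_RATE = 0.5

estimated = estimate_credits(input_tokens=2_000_000, output_tokens=500_000,
                             credits_per_million_tokens=HYPOTHETICAL_RATE)
print(f"Estimated credits: {estimated}")
```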

Are all the LLMs available only for text, or does Snowflake Cortex support other types?

Snowflake Cortex is expanding its capabilities beyond text-based LLMs. In March 2024, Snowflake announced support for multimodal LLMs. This means Snowflake is incorporating models that can handle not just text, but also images and potentially other data formats.

Here's a breakdown of what this means:

  • Multimodal LLMs: These advanced models go beyond text and can understand the relationship between different data types. For instance, an LLM might analyze an image and its accompanying text description to provide a more comprehensive understanding of the content.
  • Snowflake's Partnership: Snowflake has partnered with Reka.ai, a company offering multimodal models such as Flash and Core [1]. These models can be used within Snowflake Cortex to unlock new possibilities for data analysis.

While the full range of supported data types beyond text might not be explicitly documented yet, the introduction of multimodal LLMs signifies a shift towards handling various data formats within Cortex.

Here are some potential use cases for image-based LLMs in Snowflake Cortex:

  • Automated Image Captioning: Generate captions for product images in an e-commerce platform, improving accessibility and searchability.
  • Content Moderation: Identify inappropriate content within images based on pre-defined criteria.
  • Image Classification and Tagging: Automatically categorize and tag images based on their content, facilitating image organization and retrieval.

Remember, this is a rapidly evolving field. As Snowflake Cortex progresses, we can expect even more capabilities for working with diverse data types using LLMs.

What are some of the use cases or jobs I can do with LLMs on my enterprise data?

Here are some exciting use cases for LLMs on your enterprise data:

Enhancing Data Analysis and Exploration:

  • Automated Summarization: LLMs can analyze vast amounts of data and generate concise summaries, highlighting key trends, insights, and anomalies. This saves analysts time and helps them focus on deeper exploration.
  • Data Quality Improvement: LLMs can identify inconsistencies and errors within your data by recognizing patterns and relationships. They can also suggest data cleaning strategies for more reliable analysis.
  • Generating Research Questions: LLMs can analyze existing data and research to identify potential new research avenues or questions worth exploring. This can fuel innovation and lead to new discoveries.

Boosting Content Creation and Communication:

  • Automated Report Generation: LLMs can take analyzed data and automatically generate reports in a clear and concise format, saving time and resources.
  • Personalized Content Creation: LLMs can personalize marketing materials, customer support responses, or internal communications based on user data and preferences.
  • Document Summarization and Translation: LLMs can quickly summarize lengthy documents or translate them into different languages, improving accessibility and international communication.

Optimizing Business Processes and Decision Making:

  • Customer Service Chatbots: LLMs can power advanced chatbots that understand natural language, answer customer queries effectively, and even personalize interactions.
  • Market Research and Trend Analysis: LLMs can analyze social media data, customer reviews, and market research reports to identify customer sentiment, emerging trends, and potential areas of growth.
  • Risk Assessment and Fraud Detection: LLMs can analyze financial data and identify patterns that might indicate fraudulent activity or potential financial risks.
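
As one concrete instance of the customer-sentiment use case, the sketch below builds a query that scores reviews and lists the most negative ones first. It assumes the SNOWFLAKE.CORTEX.SENTIMENT(text) function, which returns a numeric score; the customer_reviews table and its review_id and review_text columns are hypothetical.

```python
def most_negative_reviews_query(table: str, id_column: str,
                                text_column: str, limit: int) -> str:
    """Build a query that ranks rows by Cortex sentiment score, lowest first.

    Identifiers are inserted as-is, so they must be trusted names.
    """
    return (
        f"SELECT {id_column}, {text_column}, "
        f"SNOWFLAKE.CORTEX.SENTIMENT({text_column}) AS sentiment_score "
        f"FROM {table} "
        f"ORDER BY sentiment_score ASC "
        f"LIMIT {int(limit)}"
    )


# Hypothetical table and column names, for illustration only.
reviews_query = most_negative_reviews_query(
    "customer_reviews", "review_id", "review_text", 20
)
print(reviews_query)
```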

Important Considerations:

  • Data Security and Privacy: Ensure proper data governance and anonymization techniques when using LLMs on sensitive enterprise data.
  • Model Explainability and Bias: Understand how LLMs arrive at their conclusions and be aware of potential biases within the training data.
  • Focus on Business Needs: Choose LLM applications that directly address specific business challenges and contribute to measurable goals.
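
One simple way to act on the anonymization point is to mask obvious identifiers before text is sent to any LLM function. The sketch below redacts email addresses with a regular expression. It is a deliberately narrow illustration, not a complete anonymization strategy: the pattern is an approximation and other identifiers (names, phone numbers, account IDs) would need their own handling.

```python
import re

# Approximate email pattern; good enough for illustration, not exhaustive.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a fixed token."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


sample = "Contact jane.doe@example.com about the refund."
redacted = redact_emails(sample)
print(redacted)
```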

Remember, LLMs are a powerful tool but require careful integration and ongoing monitoring to ensure responsible and effective use within your enterprise data landscape.

How do I know which LLM I should choose in the complete function?

Snowflake publishes guidance on which Large Language Model to use for a given task. Check the following guide: Large Language Model (LLM) Functions (Snowflake Cortex) | Snowflake Documentation

When I choose a model, am I downloading the model into my account to run inference?
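
In practice, the model is chosen by name in the first argument of COMPLETE, so comparing candidates means running the same prompt with different names. The sketch below builds one query per model, assuming the SNOWFLAKE.CORTEX.COMPLETE(model, prompt) form. The model names in the list are assumptions based on names used around March 2024; check the guide above for the names currently available in your region.

```python
def sql_literal(text: str) -> str:
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


def complete_query(model: str, prompt: str) -> str:
    """Build a query that sends one prompt to one hosted model."""
    return (
        f"SELECT SNOWFLAKE.CORTEX.COMPLETE({sql_literal(model)}, "
        f"{sql_literal(prompt)}) AS response"
    )


# Assumed model names; verify against the current documentation.
CANDIDATE_MODELS = ["llama2-70b-chat", "mistral-7b"]

prompt = "Summarize what's new in this release in one sentence."
queries = [complete_query(model, prompt) for model in CANDIDATE_MODELS]
for q in queries:
    print(q)
```

Running each query on a small sample of representative prompts gives a quick side-by-side view of quality, latency, and cost before you commit to one model.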

No. Choosing a model in the COMPLETE function does not download anything into your account. The LLMs behind Cortex are fully hosted and managed by Snowflake, and the model name is simply an argument that tells Cortex which hosted model should handle your request.

Here's a breakdown to clarify the difference:

  • Choosing a Model:

    • Purpose: The model name selects which hosted LLM processes your prompt.
    • Functionality: You pass the name as a string argument to COMPLETE alongside the prompt. There is nothing to install, deploy, or keep up to date in your account.
  • Running Inference:

    • Purpose: Inference is the step in which the model generates a response from your input.
    • Functionality: Snowflake runs inference on infrastructure it manages and returns the result to your query, just like the result of any other function call.

Here's an analogy:

Imagine ordering from a restaurant menu.

  • Choosing a model is like picking a dish: you only say which one you want.
  • Running inference is like the kitchen preparing it: the equipment and the cooking stay in the restaurant, and you receive the finished plate.

In summary, you select a model by name, while Snowflake hosts the model and runs inference for you.

What is the difference between the complete() function and specialized functions like summarize()?

The main difference between the complete() function and the summarize() function lies in how general they are and how much you have to specify:

  • Complete Function:

    • Goal: Generates a response to any prompt you write, using the model you choose.
    • Output: Text produced by the selected model in response to your prompt.
    • Focus: Maximum flexibility. You control the model and the wording of the prompt, so the quality of the result depends on both.
  • Summarize Function:

    • Goal: Performs one fixed task: producing a summary of the text you pass in.
    • Output: A summary of the input text.
    • Focus: Convenience. There is no prompt to write and no model to choose, and the other task-specific functions (for example translation and sentiment scoring) follow the same pattern.

Here's an analogy:

Imagine you have a workshop.

  • complete() is like a general-purpose workbench: it can handle almost any job, but you have to describe exactly what you want done.
  • summarize() is like a single-purpose tool: it does one job and needs no instructions.

In short, complete() is the general-purpose function that you steer with a prompt and a model choice, while summarize() and the other specialized functions each handle one routine task with no prompt engineering.
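
In Cortex terms, the same summarization job can be phrased either with the task-specific function or as a prompt to the general-purpose one. The sketch below builds both queries side by side, assuming the SNOWFLAKE.CORTEX.SUMMARIZE(text) and SNOWFLAKE.CORTEX.COMPLETE(model, prompt) forms. The articles table, body column, and model name are hypothetical placeholders.

```python
def sql_literal(text: str) -> str:
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


def summarize_query(table: str, column: str) -> str:
    """Task-specific function: no model and no prompt to specify."""
    return f"SELECT SNOWFLAKE.CORTEX.SUMMARIZE({column}) AS summary FROM {table}"


def complete_summary_query(table: str, column: str, model: str, instruction: str) -> str:
    """General-purpose function: the caller supplies the model and the prompt."""
    return (
        f"SELECT SNOWFLAKE.CORTEX.COMPLETE({sql_literal(model)}, "
        f"CONCAT({sql_literal(instruction)}, {column})) AS summary "
        f"FROM {table}"
    )


# Hypothetical table, column, and model name, for illustration only.
specialized = summarize_query("articles", "body")
general = complete_summary_query(
    "articles", "body", "llama2-70b-chat", "Summarize in two sentences: "
)
print(specialized)
print(general)
```

The second form is longer, but it lets you change the length, tone, or format of the summary by editing the instruction.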

Do the LLMs learn from my data?

In general, no, the LLMs you use through Snowflake Cortex likely won't learn directly from your data. Here's why:

  • Pre-Trained Models: LLMs are typically pre-trained on massive datasets before being deployed for use. These datasets encompass a vast amount of text and code, aiming to give the LLM a strong foundation for understanding language.
  • Cortex as an Access Point: Snowflake Cortex acts as an intermediary, providing a secure environment to run pre-trained LLMs on your data. Your data is used for the specific task at hand (like translation or text analysis) but isn't incorporated into the core LLM itself.

However, there are some nuances to consider:

  • Privacy Preserving Techniques: There's a possibility that Snowflake Cortex might use privacy-preserving techniques to leverage your data while protecting its confidentiality. This could involve anonymized versions of your data used to improve the LLM's performance within Cortex without compromising sensitive information.
  • Future Advancements: The field of LLMs is constantly evolving. As the technology progresses, there might be future scenarios where LLMs can be fine-tuned on user data within secure environments like Cortex. But this is not the current standard practice.

Which LLMs are supported in Snowflake Cortex?

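The list of supported LLMs as of March 2024 is published in the Snowflake documentation: Large Language Model (LLM) Functions (Snowflake Cortex) | Snowflake Documentation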


Is Snowflake running LLMs on an external platform?

No, everything happens within Snowflake's secure environment. This means your data remains under your control and adheres to your existing security and governance policies.

Also, there's no need to worry about setting up separate environments, APIs, integrations, or managing different governance rules. Snowflake takes care of everything for a smooth experience.

What LLMs does Snowflake have?

Snowflake's emphasis is less on building its own large language models (LLMs) and more on offering a platform for using them. Here's a breakdown of the approach:

  • Snowflake Cortex (Private Preview): This service allows you to securely run LLMs from various providers directly within the Snowflake environment. You can think of it as a one-stop shop for accessing AI models for tasks like text translation. For instance, Snowflake mentions using Meta's Llama 2 model for such purposes [3].
  • Focus on Integration: Snowflake's approach is to integrate industry-leading LLMs rather than to create its own. This gives users flexibility and access to a wider range of capabilities.

Overall, Snowflake acts as a facilitator for using LLMs in data analysis and manipulation tasks. It provides the cloud infrastructure and user interface through which you use these language models.

What does LLM stand for?

LLM stands for large language model. LLMs are computer programs trained on massive amounts of text data to communicate and generate human-like text in response to a wide range of prompts and questions.

