6 Important Things to Know About the Snowflake Connector for SharePoint
Snowflake introduced the Snowflake connector for Microsoft SharePoint in its November 2024 release. The connector connects directly to a Microsoft 365 SharePoint site from within Snowflake, letting users query the files stored there using SQL, Python, or REST APIs and use that data for conversational analysis.
This post outlines common pitfalls that can arise while setting up the SharePoint connector in your Snowflake account and how to resolve them. These were real issues we faced, resolved through suggestions from Snowflake partner support and insights from the official Snowflake documentation.
This article is intended for readers with a basic understanding of Snowflake and its object types, such as warehouses, roles, and the Cortex Search Service. It shares our experiences to help readers better understand the nuances of setting up the SharePoint connector for the first time.
After configuration, the connector establishes a sync pipeline between SharePoint and Snowflake. The synchronization can be scheduled to refresh daily, weekly, or monthly, and new files are available for querying immediately after synchronization. As part of the connector setup, Snowflake internally creates a Cortex Search Service in the application catalog, which enables seamless querying of the documents placed in that location. The connector currently supports PDF, DOCX, and PPTX files.
For instance, let’s say you frequently have business documents, typically PDFs or DOCX files, landing in a SharePoint location. This document explores using the Snowflake SharePoint connector to converse with these files on a regular basis.
Pre-requisites to get started with the connector
- Keep the SharePoint site URL where your business files are stored handy, e.g., https://xxx.sharepoint.com/sites/Xyz
- Ensure you have a Microsoft Graph application and obtain the credentials from your Azure administrator. These include client ID, client secret, and directory (tenant) ID. You can find details on access requirements here. Sample credentials are as shown in the image below.
Figure 1: Sample Azure Credentials
- Note down the name of the Snowflake warehouse for the connector and the associated Cortex Search Service. Also, record the role name required to access the connector. A Snowflake user with the accountadmin role is needed to install the connector.
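If the dedicated warehouse and role do not exist yet, they can be created up front. The snippet below is a minimal sketch; the names SHAREPOINT_CONNECTOR_WH, SHAREPOINT_CONNECTOR_ROLE, and YOUR_USER_NAME are placeholders, and the sizing and grants should follow your organization's policies.
-- Hypothetical names; adjust the size, role, and grants to your environment.
CREATE WAREHOUSE IF NOT EXISTS SHAREPOINT_CONNECTOR_WH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60        -- suspend quickly to limit idle credit usage
  AUTO_RESUME = TRUE;
CREATE ROLE IF NOT EXISTS SHAREPOINT_CONNECTOR_ROLE;
-- Allow the role to use the warehouse; this role will also be given access to the Cortex Search Service during configuration.
GRANT USAGE ON WAREHOUSE SHAREPOINT_CONNECTOR_WH TO ROLE SHAREPOINT_CONNECTOR_ROLE;
GRANT ROLE SHAREPOINT_CONNECTOR_ROLE TO USER YOUR_USER_NAME;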
Installing and verifying the connector state
Install the connector application from the Snowflake Marketplace using the accountadmin role. Once the installation is complete, you will receive a confirmation email. View detailed installation instructions here.
After installation, configure the connector by specifying:
- The warehouse for data ingestion (used to run the connector)
- The warehouse for the Cortex Search Service
- The role that will use the Cortex Search Service
Next, enter the SharePoint site URL, client ID, client secret, and directory (tenant) ID. This step will verify connectivity to the SharePoint site.
Once connected, select the folder or subfolder to be read by the connector. Choosing a specific folder within the SharePoint site can improve sync speed.
After the source is validated, file ingestion begins, and the UI displays the content refresh status as 'Syncing_data'. The sync frequency can be set to daily, weekly, or monthly. The app UI will initially display a null sync status (no last sync time), indicating that no sync has completed yet and that the process should be monitored.
Alternatively, run the query below to check the sync status.
SELECT * FROM <APPLICATION_INSTANCE_DATABASE>.PUBLIC.SYNC_STATUS;
Once the first sync is completed, the same UI will show the last sync time. Sometimes the connector may show the status as syncing for quite a long time; this is covered in the 'Potential pitfalls' section below.
Multiple connector instances can be configured to read from various locations to connect to multiple sites or folders.
The connector configuration can also be viewed by running the following query:
SELECT * FROM <APPLICATION_INSTANCE_DATABASE>.PUBLIC.CONNECTOR_CONFIGURATION;
Use the following commands to fetch details about the Cortex Search Service created by the connector. Here, SNOWFLAKE_CONNECTOR_FOR_SHAREPOINT is the application instance database; this database name is also used in the queries in later sections.
use database SNOWFLAKE_CONNECTOR_FOR_SHAREPOINT;
use schema cortex;
show cortex search services;
Querying using the connector
Before starting to query the connector, recheck the synchronization status.
If the sync status shows 'Paused', syncing has stopped and must be resumed. If the sync status is still null, no successful synchronization has occurred. The first sync is complete once the value changes from null to a definite timestamp; after that, only delta changes (new or modified files) are synced.
You can query the Cortex Search Service using SQL, Python, or the REST API. The following is a SQL example; the documentation provides more samples.
SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
    'SNOWFLAKE_CONNECTOR_FOR_SHAREPOINT.cortex.search_service',
    '{
      "query": "List the Portfolio Managers for XYZ Bond Fund",
      "columns": ["chunk", "web_url"],
      "limit": 3
    }'
  )
);
This query returns a JSON response containing the chunks from the documents that match the query, the URL of the source document, and other details.
[
  {
    "chunk": "XYZ Core Bond Fund – Class A Shares Release Date\n09-30-24 ...\nCategory\nIntermediate Core Bond\nInvestment Objective & Strategy Portfolio Analysis",
    "web_url": "https://xxx.sharepoint.com/sites/Xyz/Shared%20Documents/sub_folder_name/abcd.pdf"
  },
  {
    ...
  }
]
These response chunks can be further combined and restructured using other Cortex functions, like SNOWFLAKE.CORTEX.COMPLETE, to return the result as an answer to a conversational bot's question, as sketched below.
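The following is a minimal sketch of that pattern, not the connector's own API: it passes the top chunk returned by SEARCH_PREVIEW to SNOWFLAKE.CORTEX.COMPLETE as context. The model name ('mistral-large'), the prompt wording, and the JSON path into the response (written to match the shape of the sample output above) are assumptions to adjust for your account and use case.
-- Sketch: retrieve chunks, then ask an LLM to answer from them.
WITH search_results AS (
  SELECT PARSE_JSON(
    SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
      'SNOWFLAKE_CONNECTOR_FOR_SHAREPOINT.cortex.search_service',
      '{"query": "List the Portfolio Managers for XYZ Bond Fund",
        "columns": ["chunk", "web_url"],
        "limit": 3}'
    )
  ) AS response
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
  'mistral-large',   -- illustrative model choice
  'Answer the question using only the context below.\n\nContext:\n' ||
  response[0]:chunk::STRING ||
  '\n\nQuestion: List the Portfolio Managers for XYZ Bond Fund'
) AS answer
FROM search_results;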
If the response is an empty array ([]), it could indicate that syncing is incomplete or that the files were skipped due to conflicts. The following section covers potential failures in detail.
Since Cortex Search is a new feature, the results returned may sometimes be out of context with respect to the content of the files, and it is up to the user to apply checks that maintain the correctness and quality of the results. Snowflake has an experimental Cortex Search confidence score and many other customization features that can be enabled for an account on demand.
Potential pitfalls for synchronization
During setup, you may run into challenges that require some tuning.
First, check the sync status in <APPLICATION_INSTANCE_DATABASE>.PUBLIC.SYNC_STATUS.
If the sync status is null, there has not been a successful synchronization yet.
Next, run the following SQL queries to get the connector error logs.
SELECT * FROM <APPLICATION_INSTANCE_DATABASE>.PUBLIC.CONNECTOR_ERRORS;
CALL SNOWFLAKE_CONNECTOR_FOR_SHAREPOINT.PUBLIC.GET_TROUBLESHOOTING_DATA('2024-12-30', '2024-12-31');
Listed below are some of the challenges we faced on our Snowflake account:
- A large number of m365 groups to be synchronized
Check the number of m365 groups being fetched. These are the groups the connector is currently observing. The connector syncs all available m365 groups, and there is no way to choose only the groups relevant to the site, so synchronizing a massive number of m365 groups can cause a considerable delay.
To find out how many m365 groups there are, use any HTTP client, such as curl or Postman, to perform the two requests below. You need to pass in the following details, which are the same ones used for the connector configuration:
- tenant ID
- client ID
- secret value
1st Request – Authentication and Retrieval of Access Token:
URL: https://login.microsoftonline.com/{tenant ID}/oauth2/v2.0/token
Request type: POST
Headers: Content-Type: application/x-www-form-urlencoded
Body (form-urlencoded):
client_id={client ID}
client_secret={secret value}
scope=https://graph.microsoft.com/.default
grant_type=client_credentials
The response returns an access token under the JSON key "access_token", which should be used in the second request.
2nd Request – Retrieval of Groups
Request type: GET
URL: https://graph.microsoft.com/v1.0/groups/delta
Headers: Authorization: Bearer {access_token from 1st request's response}
Body: none
The response lists the m365 groups being synchronized, which gives you their count. Synchronization can take considerable time since the connector syncs all the available m365 groups.
In our case, we had around 200 m365 groups, so syncing took a long time to complete. Even if the number of files in the SharePoint location is small, synchronizing many m365 groups can take a long time.
- Query timeout challenges
The first synchronization is the longest because Snowflake downloads everything at once, so the value of the STATEMENT_TIMEOUT_IN_SECONDS parameter plays a significant role here. If the parameter's value is low, the synchronization process can time out after the specified period. To test it out, you can start with 3600 seconds and raise the value up to the maximum limit of 172800. Once the first sync is completed, only the delta since the last content sync is downloaded, which is expected to take much less time. Use the statement below to increase the timeout value on the warehouse.
alter warehouse YOUR_WH set STATEMENT_TIMEOUT_IN_SECONDS = 86400;
If the value is also set at the account level, increase it there as well, because the account-level value overrides the warehouse setting when a Python procedure runs.
alter account set STATEMENT_TIMEOUT_IN_SECONDS = 86400;
Similarly, the user task timeout needs to be increased. Use the statement below to increase it.
alter account set USER_TASK_TIMEOUT_MS = 86400000;
This increase is required when the job takes more than an hour, because the timeout kicks in and the job restarts. Once the first synchronization completes, subsequent runs do not take as long.
In our case, the group sync process easily exceeded 1 hour because it needed to perform around 200 HTTP requests for the groups and probably a few thousand more, one for each unique user detected in those groups.
- Snowflake partner support adjustments
If the set of synced groups is large and timeouts are observed, some of Snowflake's internal parameters may need to be modified. In our case, the Snowflake support team adjusted their internal parameter PYTHON_SP_PROGRESS_TIMEOUT_SECONDS, which controls the Python stored procedure timeout, to 86400 at the account level. Again, this is not needed in every account. The change was necessary because the SharePoint connector internally calls a Python stored procedure, which has a default timeout of 1 hour; since our synchronization exceeded an hour, the support team had to fine-tune this for our account. After initialization, the parameters were returned to their lower values from both ends.
- DOCX and PPTX file formats not recognized
For DOCX and PPTX files, the error logs in our CONNECTOR_ERRORS table were as follows:
Failed to parse a file with ID: 1234abcdxyz. Error: The provided format .zip isn't supported. Supported formats: ['.docx', '.pptx', '.pdf'].
This parsing conflict occurred because the .docx and .pptx files on SharePoint carried the header of a .zip file. The Snowflake support team changed the parsing mechanism and fixed the issue where the actual extension was masked or overwritten by another extension (ZIP for DOCX). This fix was applied globally, so installing the connector in any other Snowflake account should not run into this challenge.
- "Python Interpreter Error"
If this message is observed in any of the logs, there is currently no recovery in place, and the error will eventually repeat. Once the connector fails to parse a downloaded file, the file is skipped and an error entry is written to the CONNECTOR_ERRORS object. The file will not be re-ingested on the next sync, as only incremental changes are processed. Even if a fix is applied later, the file won't be picked up by the CDC process unless it changes. In this case, the file needs to be added to the SharePoint location again so it is picked up as a delta file.
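One possible way to work through this is sketched below: list the connector errors, re-upload the affected file to SharePoint, and then trigger a content refresh so the file is detected as a delta. The filtering step is an assumption; check the actual column names exposed by CONNECTOR_ERRORS in your account before relying on them.
-- 1. List connector errors and look for entries containing 'Python Interpreter Error'
--    (inspect the table first; the message column name may differ in your account).
SELECT * FROM <APPLICATION_INSTANCE_DATABASE>.PUBLIC.CONNECTOR_ERRORS;
-- 2. Re-add the affected file to the SharePoint location, then trigger a refresh
--    so that it is picked up as a new (delta) file.
call SNOWFLAKE_CONNECTOR_FOR_SHAREPOINT.PUBLIC.REFRESH_SHAREPOINT_CONTENT();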
- Resource Monitors crossing the credit quota
For every connector instance created, a workflow consisting of a couple of procedures, tasks, and streams is created. These include CALL and MERGE statements that run as part of the syncing activity. These queries can sometimes run for more than 10-12 hours, consuming a lot of credits, so be mindful of the cost while setting up instances. The complete cost details are covered in the Cost Implications section below.
Once the root causes are identified and resolved, run the following command to resume the connector.
call SNOWFLAKE_CONNECTOR_FOR_SHAREPOINT.PUBLIC.RESUME_CONNECTOR();
If the connector is not paused, pause it first
call SNOWFLAKE_CONNECTOR_FOR_SHAREPOINT.PUBLIC.PAUSE_CONNECTOR();
and then run the earlier command to resume it.
The following command will trigger the synchronization.
call SNOWFLAKE_CONNECTOR_FOR_SHAREPOINT.PUBLIC.REFRESH_SHAREPOINT_CONTENT();
Cost Implications
A corresponding Cortex Search Service is created for every instance of the Snowflake connector for SharePoint. For each Cortex Search Service, two dynamic table entities, named in the format _CORTEX_SEARCH_SOURCE_* and _CORTEX_SEARCH_REFRESH_*, appear in Snowsight's dynamic table monitoring UI. These run every few milliseconds against the warehouse configured for the connector, and they keep running even if the connector instance is paused, so credits continue to be consumed. The connector also runs two CALL statements and one MERGE statement daily against each instance; these queries run long, causing credit consumption to grow. The Cortex Search Service internally uses the PARSE_DOCUMENT function, which contributes to the compute cost.
Overall, cost consumption spans compute, storage, the Cortex Search Service, data transfer, and health-check task costs.
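To see how many credits the connector's warehouse is actually consuming, a query like the one below against ACCOUNT_USAGE can help. The warehouse name SHAREPOINT_CONNECTOR_WH is a placeholder for whichever warehouse you configured for the connector.
-- Daily credit usage for the connector's warehouse over the last 30 days.
SELECT DATE_TRUNC('day', start_time) AS usage_day,
       SUM(credits_used)             AS credits_used
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE warehouse_name = 'SHAREPOINT_CONNECTOR_WH'
  AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1;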
Currently, pausing the connector does not stop all credit consumption; if you want to stop it entirely, uninstalling the connector is the only option. Using a resource monitor and a dedicated warehouse for the connector can help monitor and control the costs, as sketched below.
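The following is a minimal sketch of that setup, reusing the dedicated warehouse from the prerequisites; the monitor name, credit quota, and trigger thresholds are placeholders to tune to your budget.
-- Hypothetical quota and thresholds; adjust them to your budget.
CREATE RESOURCE MONITOR SHAREPOINT_CONNECTOR_RM
  WITH CREDIT_QUOTA = 50
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
       TRIGGERS ON 80 PERCENT DO NOTIFY
                ON 100 PERCENT DO SUSPEND;
-- Attach the monitor to the dedicated connector warehouse.
ALTER WAREHOUSE SHAREPOINT_CONNECTOR_WH SET RESOURCE_MONITOR = SHAREPOINT_CONNECTOR_RM;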
More details on the cost can be found here.
Summary
As of this writing, the Snowflake connector for SharePoint is a preview feature. The connector makes it easy to query files in a SharePoint site from within Snowflake and use them for conversational analysis.
It also keeps user permissions and access controls in place while returning results. Once this feature matures, it will significantly enhance how we use our conversational bots. With time, this could be a game-changer in providing an innovative, secure, easy-to-use assistant for SharePoint.