Integration Guide

Best practices and process for No-ETL integration.

Introduction

This document provides the guidance to start a successful integration with Kubit’s Self-Service Analytics. It covers the contexts and details steps required to share your data with Kubit and also how to get your metrics configured properly. It also includes the best practices in data model design and implementation as a reference.

With secure cloud data warehouse sharing, you can share your data with Kubit in real-time without the hassle of copying them, or introducing complicated batch jobs. While having full control over your data and maintaining a Single Source of Truth, all the computation cost involved with Self-Service Analytics will be covered by Kubit. This is the essence of No-Code Analytics offering from Kubit.

Requirements

  • You already have all your analytics related data stored in a Cloud Data Warehouse such as Snowflake, BigQuery or RedShift.
  • Your data engineers will configure secure data share in your cloud data warehouse to grant Kubit read-only access to your data.
  • If live analytics is in the project scope, please make sure that your data is up to date all the time.
  • You have detailed documentation about your data model and event definitions (e.g. a Data Dictionary). If full documentation is not yet available, you will need to invest more time to pass along the knowledge and answering questions. This process often requires various members from your team to be involved in. Don’t worry, Kubit has a built-in Data Dictionary feature to help you to collaboratively build this knowledge base dynamically on the fly.
  • There are some basic understanding and agreement of the metrics (KPIs) and dimensions that your team wants. The integration project will cover a limited number of metrics and dimensions. Though all the tools and features are enabled for you to build your own. That’s the essence of Self-Service Analytics.
  • Keep a constant communication on any changes to the data model, data dictionary and metrics definition with the Kubit team. Some changes are trivial to incorporate if planned ahead of time. Some may require invalidation and updates to existing analyses.
  • Provide feedback and share your thoughts with Kubit.

Process

  • Inventory and get your data ready in your data warehouse in case it's not already there.
  • Discuss the project scope with Kubit: use cases, KPIs, metrics and dimensions.
  • Share access to your data with Kubit through secure data share.
  • Get access to Kubit: integrate Single Sign-on (SSO) or provide a list of users that need access.
  • Test use cases with Self-Service Analytics and provide feedback.

About Your Data

Your data should already be stored in data warehouse tables for analytics purposes. They should also be updated frequently to stay live. Here are the key considerations for the integration and some best practices.

Data Model

Data model is sometimes presented as an ER (Entity Relationship) diagram. It captures the relationship between all the entities in your database.

It is never a good idea to just dump your transactional data (i.e. database tables used for serving your users) into the data warehouse for analytics purposes. A star schema is the most recommended pattern for storing analytics data.

  • Fact tables: Events, Transactions (e.g. Purchases, Subscriptions)
  • Dimension tables: Users, Campaigns, AttributionsBelow is a design sample of the proposed data model.

If your current data model is different, don’t worry, there are many creative ways to address that by collaborating with your data engineers.

13621362

The fact tables store information about when something happens (events), some inline properties (eg timestamp, country, app_id, app_version), and some dimension keys which can be used to lookup for the dimension values in dimension tables (eg user_id to lookup user’s email in users table, campaign_id to lookup for a campaign’s start and end dates).

In most scenarios, it is recommended to use inline properties on the events as much as possible, i.e. capture most dimensions’ values as properties and store them on every event. It simplifies the query and also improves analytics performance dramatically. Modern analytical data warehouses like Snowflake have built-in optimizations to store duplicated column values very efficiently. If you are interested in this topic, read more about Columnar Database.

How to implement ETL (Extract, Transform and Load) to populate the star schema from your transactional data is out of the scope for this guide. Though feel free to contact Kubit support to get additional information or advice on this topic.

User Identification

Every user, regardless of registered or anonymous (guest), requires a unique identifier for the analytics to work. Counting unique users and following a user’s behavior (action events) are the most common analytical measures.

👍

User ID :woman:

Ideally there should just be one unified ID field to identify any user.

In reality it could be much more complicated because the relationship between a person, different apps, their devices and credentials can be complex. Especially with considerations of privacy, fraud, hacking, device and credential sharing. It is a business decision for your organization to choose the best strategy.

Conceptually, there are several IDs:

  • Device ID: IDFV or Android ID (deprecated). These IDs are Used to identify a physical device though this concept is going away because of privacy concerns. In most places, people use Advertiser ID for Device ID.
  • Advertiser ID: IDFA or AAID (Android Advertising ID). Usually used to match with attribution or advertising data from third-party vendors. Keep in mind these IDs may be empty or have a special value when ad tracking is disabled. Since iOS 14, ad tracking is now disabled by default.
  • User ID: always generate a User ID as soon as the app is launched so every guest/anonymous user can be identified too.
  • Account ID: some developers introduce this to identify a registered user across different apps and/or devices. In most cases it brings more pain and confusions. It is recommended to only use it in special cases like analyzing how users use multiple apps on different platforms (iOS, Android and Web).
  • Credential (username, email): due to privacy concerns, it is a bad idea to store user credentials in any analytics data.

Generally speaking, the pair of User ID and Advertiser ID can satisfy most of the product analytics use cases.

Events

Events are generated when a certain action takes place. For example: Login, Stream, Vote Up, Page View, Button Click. It is the most critical data structure for analytics.

Data Dictionary

👍

Data Dictionary :books:

It is strongly recommended to have a Data Dictionary (as simple as a Google Sheet) in place to capture the definition of every event.

  • Event name: a key (unique short name) and a business name (easy to read and understand).
  • Trigger condition: when this event is triggered.
  • Properties: the content and meaning of each property stored in this event.
  • Owner: who to talk to if a change is needed.
  • Version history: what has changed to this event over time. This information is often critical to communicate.

Instrumentation

Events need to be instrumented in the code on both mobile (in the app) and server (backend) sides because some events are triggered by user actions, some by backend processes (eg subscription renewal, push notifications sent).

If you are starting from scratch, there is no need to reinvent the wheel. It is recommended to use a CDP (Customer Data Platform) like Segment, mParticle, RudderStack or Snowplow to give you a mature SDK and the maximum flexibility to control where the data goes.

Technology aside, the key of a successful instrument project is communication. Keep in mind that the engineers who are injecting the events into code flow don’t usually know how these events are used in analytics. Providing this context information is often critical to avoid bugs and surprises down the line.

Properties

There are two kinds of properties for the events: Common Properties and Event Properties. A property can be categorical (enumeration of values) or numerical (unbounded continuous values).

Common Properties

These are the properties that every event should have. Usually they are used in filters or breakdowns as dimension attributes. Here are some examples:

  • Timestamp: when an event happens. Sometimes it is necessary to have both Client Timestamp (client clock when action is taken) and Server Timestamp (server clock when the event is processed) since the mobile device clock may not be trusted.
  • Date Week, Month, Quarter, Year: It is strongly recommended to also store these parts of the timestamp separately since analytics is often based on these time units.
  • Fiscal Quarter, Fiscal Year: only if your organization has the fiscal concept and wants to use them in analytics.
  • Device ID: often this is Advertiser ID
  • User ID: the user identifier
  • App: the id or name of the app
  • App Version: the version of the app
  • Country: country where the app is installed (usually from the device)
  • Locale: device locale (language + country)
  • Gender: user’s gender information
  • Birth year: used to compute user age at query time
  • Device model: the model of the device, e.g. iPhone 11 Pro, Pixel 4  
  • OS Version: version of the OS on the device, e.g. iOS 13.6.7
  • A/B Test: test id and group assignment of an A/B test
  • Install Date: when the app was installed, used to compute Install Days and group users by how long they have installed the app. It is also critical for Dn/Wn metrics.

Event Properties

Each event can have its own set of properties depending on the context. It is recommended to store them in a set of named generic properties instead of introducing ad-hoc properties on the fly. This way the database table schema can be kept clean and efficient. These generic properties can be explained and managed using Data Dictionary. Here are some examples:

  • Context: capture the context information of the event. For example, it can be the flow where the user is in (e.g. Registration, Purchase), or the feature section (e.g. Wishlist, Checkout).
  • Target: the target of the event. For example: the content id for a Favorite event; the social channel (e.g. Facebook, Instagram) for Share event; the SKU for a subscription event.
  • Value: the value associated with the event. For example, the price for a Subscription or Purchase event.
  • K1, K2, …Kn: more named properties slotted for different events.

Timezone

Any timezone conversion during query time will kill the performance.

👍

Performance :rocket:

For simplicity and efficiency, all the data related to time should be stored as the Timestamp or Date data type, which indicates a certain point in time and has no timezone concept.

The complex issue of timezone arises when any date or time is presented in text format, as in every analytics report with dates (MM/DD/YYYY). It is recommended for the organization to decide on using one certain timezone across the board for all analytics purposes. Usually the selected timezone is where the headquarter is based since it is much easier to talk about local dates in business settings.  

Share Your Data

There are multiple ways to share your data in such a way that requires no ETL integration. Please, check the guides in the Secure Data Sharing section to find the one applicable to your tech stack.

Define KPIs

This section provides guidance on how to convey your organization’s definition on metrics (KPIs) and dimensions. ‍

Metrics

Please provide a list of top metrics for your product analytics needs. Most of these metrics can be presented in simple math formulas with measure functions, along with some filter conditions. For example:

  • Engagement: DAU, MAU - unique users in a day or month
  • Retention: D2, W2 Retention - % of users who retained (came back) in day 2 or week 2
  • Activity: Likes/DAU, Shares/DAU, Streams/DAU - count an action event divided by DAU.

👍

Tip :bulb:

It is always recommended to use DAU or MAU as denominator to measure rate instead of absolute volume

  • Monetization: ARPU, Attach Rate - AVG(revenue)/DAU, % of users who subscribed
  • Funnel: Registration, Purchase - unique users taking action at each steps of the funnel during a certain time period
  • Performance: Download Failure Rate - COUNT(failure)/TOTAL(start, fail, cancel, end)

Dimensions

Please provide the definitions of all the dimensions needed in the analytics. Most of these dimensions can come from the event's properties. Some dimensions require joining with dimension tables during query time. Here are the information needed:

  • Name: display name of the dimension.
  • For inline dimensions:
    • Name of the event property
  • For join dimensions:
    • Dimension table name
    • Dimension value column: the column on the dimension table to be used for dimension values
    • Join key: event property and dimension table column for join
    • Join condition: inner (only matching rows returns) or outer (for rows without match, use NULL)
    • Special join conditions: first touch, last touch, time window

Cohorts

Cohort is a group of users matching certain criteria. You can define these cohorts to be used for breakdowns in any Formula, Prediction and Braze Integration using very similar definitions like a query with measures, filters and conditions. For example:

  • New Users: COUNT(First Launched) = 1 in last 7 days  
  • Frequent Streamers: COUNT(Stream) > 5 in last 1 day
  • Whales: SUM(Revenue on Purchases) > 100 in last 30 days

Security

For a SaaS solution, security is always on top of the priority list at Kubit. ‍

Data Security

  • Through Secure Data Sharing, Kubit doesn’t store or change your data. You have full control over what data is shared and the permissions.
  • You can avoid exposing the tables or columns which are related to user privacy (like email, username, phone number etc) to Kubit. Analytics mostly just work with aggregated data instead of every user’s private information.
  • All network communication goes through the HTTPS encrypted channel, including when Kubit talks to your data warehouse and the analytics web interface in the browser.    
  • Kubit’s cloud infrastructure takes all the necessary steps to enforce data security and isolation, including VPC (Virtual Private Cloud) and encrypted network traffic between all components.
  • Strict access control and processes are enforced within Kubit organizations to limit and monitor employees who have access to customer data.‍

User Management

Kubit utilizes Auth0, the leading Enterprise Authentication and Authorization Platform to handle all user security and management.

If your organization is using enterprise user identity services like Google G-Suite, Active Directory or LDAP already, Single Sign On can be enabled with Kubit through Auth0. There are no extra credentials for users to remember and your IT department has full control over who has access to Kubit.

To start simple, you can also just provide several users’ email addresses to get them provisioned on Kubit directly. They can login to Kubit with email and password. Also MFA (Multi-Factor Authentication) can be enabled for these users too to add more security.

Support

If you have questions or need help you can:

  • send an email to [email protected]
  • reach out to us in your dedicated Slack support channel
  • contact your Account Manager