news

Implementing Your First BQML Algorithm in Looker

Datatonic Christelle Xu

 

Author: Christelle Xu

 

 

Leveraging BQML in tandem with Looker’s semantic layer to build, visualise, and deliver powerful predictions

About a year ago, Google announced the release of BigQuery ML (BQML) which at the time of writing, allows users to create and run fundamental models like linear regression, logistic regression, and k-means clustering directly within BigQuery using SQL syntax. This effectively eliminates the need to export data to generate a model accelerating the data science workflow.

Because BigQuery integrates with Looker, we now have the power to leverage the Looker platform’s semantic layer to build, visualise, and deliver the predictions much like any other kind of business analytics. For those of you that are new to Looker, Looker is a data platform that enables business users to access, analyse, and explore data easily.

There are a number of ways these models can be built out and imported, but my goal today is to walk you through one specific use case — building out a linear regression model directly from Looker. By the end of this walkthrough you should have a solid understanding of how Looker and BQML integrate, and how to build and tweak a simple model directly from Looker.

Prerequisites

Connecting BigQuery: If you don’t already have connection built out from BigQuery to Looker you can follow instructions here.

LookML knowledge: There is generalised LookML code to follow along below, but I hope to make the principles intuitive for someone not steeped in Looker development. That being said, I do assume you’ve worked with LookML before.

Why build out the model in Looker at all?

Assuming your company already uses Looker, your database connection should already be built out, and your analysts should already be comfortable working out of and visualising through Looker. Creating a BQML model through Looker is no more complicated than creating a new view and explore.

No data needs to be exported, streamlining and reducing the resources generally required of a data science pipeline. Modelling through Looker gives your analysts another tool to model and visualise data in an environment they’re familiar with.

The primary difference in syntax between building out a model directly in BigQuery and building the model out in Looker directly is that Looker will use a LookML wrapper around a BQML query. This will define the relationship between the model and the rest of the data. The query itself is altered to reference the appropriate table using a substitution parameter but is structurally the same as if you were to build out the query in BigQuery itself.

The Setup

When building out any new Looker project, I’ll typically start by autogenerating views from tables in my database. What Looker’s doing at this stage is building dimensions and pulling field types based on the nascent structure of the data, effectively creating the mappings to the database for us. The primary advantage of doing so at this point is that if you build out business logic from an extended view, you can regenerate the autogenerated mappings without losing business logic.

Equally, if you want this model and predictions to live within a live project, you can autogenerate a view from a table in your database with data that will serve as your model inputs.

These autogenerated views serve as the foundation for a Looker native derived table (NDT). NDTs are easier to read, manipulate, and do not require a SQL query to build out. We’ll build an NDT from fields we’ve pulled through in the database via the autogenerated views. The fields we choose will be those that we want to use as input values for our model, and as you’ll see below, we’ll refer to this input table repeatedly to access the necessary fields. The input view looks something like this:

view: descriptive_name_input {     

     derived_table: {
          explore_source: input_view_name {
               column: first_input {}
               column: second_input {}
               column: third_input {}
               column: fourth_input {}
               column: fifth_input {}
               column: sixth_input {}
               }
     }
}

This will be your autogenerated table, keeping in mind that the view you’re using has to be exposed as an explore if it’s going to be used as a derived table.

I save all views related to this BQML model in the same ‘descriptive_name.view’ file. This includes the inputs defined above, the BQML model, its evaluation, its predictions, etc. They can also each have their own view file, but at this point, I find it easier to make changes with all relevant views in the same file.

Creating the Model in Looker

We’re now ready to build the model. If you were building this model directly within BQML, the syntax would be:

CREATE OR REPLACE MODEL
     `MODEL_DATASET.MODEL_NAME`
OPTIONS(
    model_type = ‘LINEAR_REG’,
    input_label_cols = [‘your_dependent_variable_name’]
) AS
     SELECT
       *
     FROM `MODEL_DATASET.INPUT_TABLE`

A whole host of model options and variations can be found here. Once executed, the model is saved under the model name defined in the second line of the query and is nested within the dataset listed in that same line.

This query changes very little when integrating with Looker, and it is the Looker wrapper, built out using LookML, that allows for this integration. The command sql_create replaces the CREATE OR REPLACE statement in BQML and tells Looker to execute the command as is, without Looker’s usual error checking. This command also requires the creation of a persistent derived table (PDT), and when used in tandem with a datagroup (a caching policy) becomes a powerful way to control how long the results of this model should be saved before being rebuilt with new data — saving both on cost and runtime.

view: model_view_name {     

      derived_table: {
          datagroup_trigger: your_datagroup_name
          sql_create:
             CREATE OR REPLACE MODEL 
                  ${SQL_TABLE_NAME}
             OPTIONS (
                  model_type = ‘LINEAR_REG’,
                  input_label_cols =  [‘your_dependent_variable_name’]
              ) AS
                  SELECT
                     *FROM ${descriptive_name_input.SQL_TABLE_NAME};;
      }
}

From sql_create down, you’ll notice that the syntax is the same as that of the one built in BQML. All other variations you see in the code block are attributable to the Looker wrapper.

The differences you might notice are:

  • The model names and table references have been replaced by ${SQL_TABLE_NAME}. This is a substitution operator which essentially references the input view (input_view_name) derived previously.
  • The view parameter assigns the model the name model_view_name and will be referenced down the line using a substitution parameter much like the one referenced in the previous point.
  • The datagroup_trigger parameter assigns a caching policy to the table which we need when calling the sql_create parameter.
  • The derived_table parameter is specified when creating a custom table in Looker.

How do we evaluate our model?

To evaluate the model in BigQuery, we’d run the following:

SELECT
  *
FROM ml.EVALUATE(
   MODEL `MODEL_DATASET.MODEL_NAME`, 
   (SELECT
      *
    FROM `MODEL_DATASET.INPUT_TABLE`))

The asterisks, in particular the second asterisk in the code block above, can be replaced with specific inputs.

The Looker wrapper in this case varies on two related points. First, we call the BQML query using sql rather than sql_create since we’re just performing a query. Next, the absence of a datagroup_trigger means that we do not create a PDT (nor do we need to).

view: model_name_eval {   
  
     derived_table: {
          sql:
            SELECT
              *
            FROM ml.EVALUATE(
              MODEL ${model_view_name.SQL_TABLE_NAME},
              (SELECT
                 *
               FROM ${descriptive_name_input.SQL_TABLE_NAME}));;
            }dimension: mean_absolute_error {type: number}
dimension: mean_squared_error {type: number}
dimension: mean_squared_log_error {type: number}
dimension: median_absolute_error {type: number}
dimension: r2_score {type: number}
dimension: explained_variance {type: number}
}

Again, the sql is the same as that found within BQML except for the substitution operators which take the places of the tables.

You’ll notice that I’ve defined dimensions for the columns the ml.EVALUATE function returns in BQML. Dimensions, once defined as I have here, can be used to explore data and build visualisations in the usual way.

If you do have visualisations built out and you happen to tweak your model, you can update the visualisations with the new evaluation metrics by clicking on “Clear Cache and Refresh”.

The dimensions we pull through in this case should look familiar — R², mean squared error, etc. These evaluation metrics vary based on the model being evaluated, and those fields can be found here.

Tuning your model

BQML offers the ‘ml.TRAINING_INFO’ function returns hyperparameters to help tune your model. The query is built in BQML as follows:

SELECT
  *
FROM ml.TRAINING_INFO(
   MODEL `MODEL_DATASET.MODEL_NAME`, 
   (SELECT
      *
    FROM `MODEL_DATASET.INPUT_TABLE`))

Which in Looker looks like this:

view: model_name_training {     
     derived_table: {
          sql: 
            SELECT 
              * 
            FROM ml.TRAINING_INFO(
               MODEL ${model_view_name.SQL_TABLE_NAME}) ;;
}dimension: training_run {type: number}
dimension: iteration {type: number}
dimension: loss {type: number}
dimension: eval_loss {type: number}
dimension: duration_ms {label: “Duration (ms)” type: number}
dimension: learning_rate {type: number}
dimension: iterations {type: number}
measure: total_loss {
             type: sum
             sql: ${loss} ;;
}
}

The syntax at this point should look familiar since it’s the syntax we saw when evaluating our model. The dimensions can also be built out into a dashboard for real time tweaking and updated by clicking on “Clear Cache and Refresh”.

Let’s Predict

The query used to run a prediction follows the same pattern we’ve seen above, and variations on this pattern can be found here. In BigQuery, we’d run the following:

SELECT
   *
FROM ml.PREDICT(
     MODEL `MODEL_DATASET.MODEL_NAME`, 
     (SELECT
        label,
        column1,
        column2
      FROM `DATASET.TABLE`))

Whatever the label name or column was specified when creating your model, that label name will gain a prefix predicted. So in our case, the column would be named predicted_your_dependent_variable_name. We can also query any of the inputs from the inputs table alongside the predicted value.

This is how it would look in Looker:

view: model_name_prediction {  
   
     derived_table: {
          sql:
            SELECT
              *
            FROM ml.PREDICT(
               MODEL ${model_view_name.SQL_TABLE_NAME},
               (SELECT
                  *
                FROM ${descriptive_name_input.SQL_TABLE_NAME}));;
                }dimension: predicted_your_dependent_variable_name {type: number}
dimension: residual {
              type: number
              sql: ${predicted_your_dependent_variable_name} —     ${TABLE}.your_dependent_variable_name;;}
dimension: input_one {type: string}
dimension: input_two {type: string}
dimension: input_three {type: string}
}

The prediction outputs can be built into dimensions which again, can be used directly to build out visualisations and delivered to stakeholders as easily as any other piece of analytics.

Congrats!

You’ve built your first BQML model in Looker. Hopefully by now you feel comfortable navigating your way through BQML syntax and integrating it into Looker using LookML. We began with a linear regression but these same principles apply to other models BQML offers. I will cover these variations in later articles.

Do you have questions or projects you’d like our help with? Get in touch –   we’d love to hear from you.

Up next
Case Studies
View now