INFORMATION_SCHEMA in BigQuery: A Critical Tool for Auditing and Data Governance
A Complete Tutorial
If you’re not familiar with INFORMATION_SCHEMA
, don’t worry! It’s essentially a metadata repository that contains information about the various tables, views, and other objects in a database. In BigQuery, this metadata can be accessed through a series of views collectively known as INFORMATION_SCHEMA
views.
Now, you might be wondering, “Why should I care about metadata and information_schema views?” Well, for starters, information_schema views can give you a wealth of information about your dataset, including column names, data types, and even the creation time of tables. This information can be incredibly useful for tasks like auditing, data discovery, and optimizing query performance.
Plus, since INFORMATION_SCHEMA
views are themselves tables, they can be queried just like any other table in BigQuery. This means you can use SQL to extract the exact information you need from the metadata repository, without having to rely on external tools or manual processes.
What is INFORMATION_SCHEMA?
In essence, it’s a set of SQL views that provide a standardized way of accessing metadata about the objects in a database. Each view in INFORMATION_SCHEMA
corresponds to a different type of object in the database. For example, the TABLES
view contains information about all the tables in the database, while the COLUMNS
view contains information about the columns in each table.
One thing to note is that INFORMATION_SCHEMA
views are read-only, meaning you can’t modify the metadata through these views. However, you can use the information in the views to inform other operations in BigQuery, such as querying or creating tables.
Another important feature is that it provides a standardized way of accessing metadata across different database management systems. This means that if you’re familiar with INFORMATION_SCHEMA
in one system, you should be able to use it in other systems as well, with some minor variations depending on the specific implementation.
Understanding BigQuery’s INFORMATION_SCHEMA Views
First up, we have the TABLES
view. As you might expect, this view contains information about all the tables in your BQ dataset, such as the table name, creation time, etc. Additionally, it also provides information about the table schema, including the column names, data types, and mode (e.g. nullable or required).
Next, we have the VIEWS
view. This view contains information about all the views in your dataset, including the view name, query definition, and creation time.
The COLUMNS
view provides information about the columns in each table, including the column name, data type, and whether the column is nullable or required.
The ROUTINES
view contains information about stored procedures and functions in your BigQuery dataset, including the routine name, routine type, and routine definition.
The SCHEMATA
view provides information about the schemas in your dataset, including the schema name and the project and dataset ID.
Lastly, the PARTITIONS
view contains information about the partitions in partitioned tables, including the partition ID, partition start and end times, and partition size.
Keep in mind that this is just a brief overview of the different INFORMATION_SCHEMA
views available in BQ. Each view provides a wealth of information that can be used for various purposes, depending on your specific needs.
Querying BigQuery’s INFORMATION_SCHEMA Views
Here, I try to outline various useful SQL codes you can use to extract useful information from INFORMATION_SCHEMA.
Extract all table names and creation time in a dataset
SELECT table_name, creation_time
FROM `project_name.dataset_name.INFORMATION_SCHEMA.TABLES`
The other columns in the TABLES
view are: table_catalog, table_schema, table_type, is_insertable_into, is_typed, base_table_catalog, base_table_schema, base_table_name, snapshot_time_ms, ddl, default_collation_name, upsert_stream_apply_watermark.
Extract all view names in a dataset
SELECT table_name
FROM `project_name.dataset_name.INFORMATION_SCHEMA.VIEWS`
The other columns in the VIEWS view are: table_catalog, table_schema, view_definition, check_option, use_standard_sql.
Extract all columns in a table
SELECT column_name, data_type, is_nullable
FROM `project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'table_name'
As you can see, querying INFORMATION_SCHEMA
views in BQ is very similar to querying regular tables in BQ. The only difference is that you’re querying metadata instead of data.
The other columns in the COLUMNS
view are: ordinal_position, data_type, is_generated, generation_expression, is_stored, is_hidden, is_updatable, is_system_defined, is_partitioning_column, clustering_ordinal_position, collation_name, column_default, rounding_mode.
Extract all columns in all tables
SELECT table_name, column_name, data_type, is_nullable
FROM `project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS`
Extract all routines in a dataset
SELECT routine_name, routine_type, routine_definition
FROM `project_name.dataset_name.INFORMATION_SCHEMA.ROUTINES`
Extract all schemas in a project
SELECT schema_name
FROM `project_name.INFORMATION_SCHEMA.SCHEMATA`
Extract partition info for a partitioned table
SELECT *
FROM `project_name.dataset_name.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'partitioned_table_name'
Extract all tables in a dataset with a certain creation date
SELECT table_name, creation_time
FROM `project_name.dataset_name.INFORMATION_SCHEMA.TABLES`
WHERE creation_time = '2023-01-01'
Extract all columns in a table with a certain data type
SELECT column_name, data_type
FROM `project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'table_name'
AND data_type = 'STRING'
Extract all views in a dataset with a certain query definition
SELECT table_name, view_definition
FROM `project_name.dataset_name.INFORMATION_SCHEMA.VIEWS`
WHERE view_definition LIKE '%search_term%'
Extract all access grantees for a table with a list of privilege type
SELECT *
FROM `region-us`.INFORMATION_SCHEMA.OBJECT_PRIVILEGES
WHERE object_name = 'table_name' and object_schema = 'dataset_name'
Advanced features of INFORMATION_SCHEMA
in BigQuery
First up, we have the ability to use INFORMATION_SCHEMA
to optimize query performance. For example, let’s say you want to join two tables. By using INFORMATION_SCHEMA
to examine the tables’ schemas and metadata, you can optimize the join to be more efficient. You can do this by using the EXACT_COUNT_DISTINCT()
function to determine the exact number of distinct values in a column, and then using that information to estimate the size of the join.
SELECT EXACT_COUNT_DISTINCT(column_name)
FROM `project_id.dataset_id.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'table_name' AND column_name = 'column_name'
Another advanced feature of INFORMATION_SCHEMA
is the ability to examine partitioned tables. By using the PARTITIONS
view, you can gain insights into the partitions in your table, including partition ID, start and end times, and partition size. This information can be incredibly useful for optimizing queries on partitioned tables, such as using partition pruning to only access the partitions that are relevant to your query.
Additionally, it can be used to audit and monitor your dataset. By using the VIEWS
view, you can examine the query definitions for all the views in your dataset, and ensure that they are still valid and up-to-date. This can be especially useful if you have a large number of views in your dataset, or if your dataset is updated frequently.
Finally, it can be used to gain deeper insights into your dataset. For example, by using the TABLES
view, you can examine the creation time and last modified time for all the tables in your dataset. This can be useful for understanding the lifecycle of your data, and determining when tables were created or updated.
Use Cases for INFORMATION_SCHEMA
in BigQuery
One use case is auditing and data governance. By using the VIEWS
view, you can examine the query definitions for all the views in your dataset, and ensure that they are still valid and up-to-date.
Another use case is data discovery. By using the TABLES
and COLUMNS
views, you can quickly and easily discover the tables and columns in your dataset, and understand their structure and relationships. This can be especially useful for new team members or for exploring a dataset that you are not familiar with.
INFORMATION_SCHEMA
can also be used for optimizing query performance. As mentioned before, by using the EXACT_COUNT_DISTINCT()
function in conjunction with the COLUMNS
view, you can estimate the size of a join, and optimize your query accordingly.
Another use case is gaining insights into your dataset. For example, by using the TABLES
view, you can examine the creation time and last modified time for all the tables in your dataset. This can help you understand the lifecycle of your data, and determine when tables were created or updated.
Finally, it can be used for creating documentation for your dataset. By using the views, you can generate documentation that includes information about tables, columns, views, and more.
Conclusion
In conclusion, INFORMATION_SCHEMA
views in BQ are a powerful tool for anyone working with databases, whether you’re a data analyst, engineer, or administrator. These views provide you with valuable insights into your dataset’s structure, relationships, and history, making your data management tasks easier and more efficient.
These views enable you to explore, audit, optimize, and document your dataset in a standardized and accessible way. With just a few SQL queries, you can dive deep into your dataset’s metadata, discovering everything from table schemas and column data types to partition information and view definitions. This newfound knowledge can help you optimize your queries, ensure data integrity, and maintain up-to-date documentation for your dataset.
Moreover, BQ’s implementation of INFORMATION_SCHEMA
provides a familiar ground for those who have worked with other database management systems, allowing you to leverage your existing knowledge and skills. However, don’t forget that each system may have its nuances, so always keep an eye out for any BigQuery-specific features or differences.
References
BigQuery Metadata | How to retrieve metadata information in BigQuery (link)
Query Optimization 101: Techniques and Best Practices (link)
Advanced Strategies for Partitioning and Clustering in BigQuery (link)
Fine-Tuning BigQuery Costs: Best Practices and Advanced Techniques (link)
The Evolution of SQL: A Look at the Past, Present, and Future of SQL Standards (link)
Advanced Dynamic SQL Topics: Stored Procedures, ORM, and BI Tools (link)
How to Build a Data Platform: A Comprehensive Guide for Technical Teams (link)
Designing a data warehouse from the ground up: Tips and Best Practices (link)
My pick for top 48 advanced database systems interview questions (link)
Five must-read books for data engineers (link)
Data Solution Architects: The Future of Data Management (link)
Metadata Management: A Key Component of Data Governance (link)
Ensuring Data Quality: Best Practices, Challenges and Solutions (link)
Data Security: Essential Considerations for Data Engineers (link)
I hope you enjoyed reading this 🙂. If you’d like to support me as a writer consider signing up to become a Medium member. It’s just $5 a month and you get unlimited access to Medium 🙏 .
Before leaving this page, I appreciate if you follow me on Medium and Linkedin 👉
Also, if you are a medium writer yourself, you can join my Linkedin group. In that group, I share curated articles about data and technology. You can find it: Linkedin Group. Also, if you like to collaborate, please join me as a group admin.