[Python Big Data] Google BigQuery and SQL

With BigQuery, you can work with large datasets using SQL.

I had already studied SQL to some extent in a database course, so here I looked at how to use BigQuery from Python together with pandas.

First, the imports:

import pandas as pd
from google.cloud import bigquery

This imports pandas and bigquery!

# Create a "Client" object
client = bigquery.Client()
 
# Construct a reference to the "stackoverflow" dataset
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")
 
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

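I created a client object and a reference to the dataset I want to use.

Passing that dataset reference to the client's get_dataset fetches the dataset, which I store in dataset.

This dataset is what the rest of the post works with.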
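
As a side note, newer releases of google-cloud-bigquery mark client.dataset() as deprecated, and get_dataset also accepts a plain "project.dataset" string. Below is a minimal sketch of that alternative, assuming a client created as above. The helper name fetch_dataset is my own and not part of the library.

```python
def fetch_dataset(client, project, dataset_id):
    """Fetch dataset metadata using a "project.dataset" string ID.

    `client` is expected to be a google.cloud.bigquery.Client.
    `fetch_dataset` is a hypothetical helper, used here only for illustration.
    """
    full_id = f"{project}.{dataset_id}"
    # get_dataset accepts a string ID as well as a DatasetReference
    return client.get_dataset(full_id)


# Usage (needs Google Cloud credentials):
# from google.cloud import bigquery
# dataset = fetch_dataset(bigquery.Client(), "bigquery-public-data", "stackoverflow")
```

The string form avoids building a separate reference object when all you need is the dataset itself.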

# List all tables in the dataset
tables = list(client.list_tables(dataset))
 
for table in tables:
    print(table.table_id)

First I checked what data the dataset contains.

The loop prints the ID of every table in the dataset.

That gives the list of tables in the dataset for Stack Overflow, a site I use often.
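
If you want the table names as a value rather than printed output, the loop above can be wrapped in a small function. This is only a sketch: table_ids is a hypothetical helper name, and it assumes client and dataset are the objects created earlier.

```python
def table_ids(client, dataset):
    """Return the sorted table IDs found in `dataset`.

    `client` is expected to be a google.cloud.bigquery.Client and `dataset`
    anything list_tables accepts (a Dataset, a reference, or a string ID).
    """
    return sorted(table.table_id for table in client.list_tables(dataset))


# Usage (needs Google Cloud credentials):
# print(table_ids(client, dataset))
```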

Next I looked at the users table, the most basic one.

# Construct a reference to the "users" table
table_ref = dataset_ref.table("users")
 
# API request - fetch the table
table = client.get_table(table_ref)

Just as with the dataset, I build a reference to the table and then fetch it.

To see what schema the table of user information has, I checked table.schema.

It shows each column's name, type, and other details.
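
The raw table.schema output is a list of SchemaField objects, which can be hard to scan. Here is a minimal sketch that turns it into one line per column, assuming each field exposes the name, field_type, and mode attributes that SchemaField provides. describe_schema is a hypothetical helper name.

```python
def describe_schema(schema):
    """Return one "name: TYPE (MODE)" string per field in a table schema."""
    lines = []
    for field in schema:
        # name, field_type, and mode are attributes of bigquery.SchemaField
        lines.append(f"{field.name}: {field.field_type} ({field.mode})")
    return lines


# Usage, with `table` fetched as above:
# for line in describe_schema(table.schema):
#     print(line)
```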

# Preview the first five rows of the "users" table
client.list_rows(table, max_results=5).to_dataframe()

Beyond the schema, the list_rows function also shows the values stored in the table.

Passing selected_fields to list_rows restricts the preview to the columns you want to see.

For example, selected_fields=table.schema[:1] shows only the users' IDs.
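
Slicing the schema by position only works when the columns you want happen to be adjacent. A small sketch of picking them by name instead; fields_by_name is a hypothetical helper, and the column names in the usage comment are taken from the query later in this post.

```python
def fields_by_name(schema, names):
    """Return the schema fields whose name is in `names`, in schema order."""
    wanted = set(names)
    return [field for field in schema if field.name in wanted]


# Usage, with `client` and `table` created as above:
# fields = fields_by_name(table.schema, ["id", "display_name"])
# client.list_rows(table, selected_fields=fields, max_results=5).to_dataframe()
```

The result keeps the order of the schema, not the order of the requested names.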

query = """
        SELECT display_name, about_me, age
        FROM `bigquery-public-data.stackoverflow.users`
        WHERE id < 100
        """
 
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)
 
# API request - run the query, and return a pandas DataFrame
results = query_job.to_dataframe()
 
# View top few rows of results
print(results.head())

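Finally, firing off a query!

The query is written as a string, and client.query runs it and returns the results.

Printing the first five rows of the result shows the requested columns.

I set job_config so that the query is cancelled if it would scan too much data (10 GB here).

This is optional: client.query(query) also works on its own.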
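
To pick a sensible byte limit, it can help to check how much data a query would scan before running it. The sketch below uses a dry run for that, assuming QueryJobConfig is given dry_run=True and use_query_cache=False. estimate_query_bytes and its config_factory parameter are hypothetical names I introduced so the function can be tried without credentials.

```python
def estimate_query_bytes(client, sql, config_factory=None):
    """Return the number of bytes `sql` would process, without running it.

    `client` is expected to be a google.cloud.bigquery.Client.
    `config_factory` is a hypothetical hook for testing; by default the
    real bigquery.QueryJobConfig is used.
    """
    if config_factory is None:
        # Imported lazily so the function can be exercised with a stand-in
        from google.cloud import bigquery
        config_factory = bigquery.QueryJobConfig

    # A dry run validates the query and reports its size but bills nothing
    config = config_factory(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


# Usage (needs Google Cloud credentials):
# print(estimate_query_bytes(client, query))
```

If the estimate is far below the 10 GB limit used above, the limit could be lowered to catch mistakes earlier.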
