In Clickhouse we can use internal dictionaries as well as external dictionaries, they can be an alternative to JSON that doesn’t always work fine. DIctionaries store information in memory and can be invoked with the dictGet method. Let’s review how we can create one in Clickhouse and use it for our queries.
We will illustrate an example of data using the Untappd API. Untappd is a social network for everyone who loves craft beer. We are going to use сheck-ins of Russian-based craft breweries and start collecting information about them to analyze this data later on and to draw some conclusions. in today’s article, we will analyze how to receive metadata on Russian breweries with Untappd and store it in a Clickhouse dictionary.
Collecting data with Untappd
First off, we need to create a new app to receive client_id and client_secret_key to make API calls. Follow this link and fill in the fields:
Usually, it takes about 1 to 3 weeks to wait for approval.
import requests
import pandas as pd
import time
We’ll be using the requests library to make API calls, view results in a Pandas DataFrame, and save them in a CSV file before sending it to a Clickhouse dictionary. Untappd has strict limits on the number of requests, prohibiting us to make more than 100 calls per hour. Therefore, we need to make our script wait for 38 seconds using the Python time module.
client_id = 'your_client_id'
client_secret = 'your_client_secret'
all_brewery_of_russia = []
We want to get data for one thousand Russian breweries. One request to the Brewery Search method enables us to view up to 50 breweries. The website gave us 3369 breweries when searching the word “Russia” manually.
Let’s check this, scroll down to the bottom, and open the page code.
Each brewery received is stored in the beer-item class. This means we can the number of references to beer-item:
And as it turned out, we have exactly 1000 breweries, not 3369. When searching the word “Russia” manually, the results also contain some American breweries. So, we need to make 20 calls, getting 50 breweries at a time:
for offset in range(0, 1000, 50):
try:
print('offset = ', offset)
print('remained:', 1000 - offset, '\n')
response = requests.get(f'https://api.untappd.com/v4/search/brewery?client_id={client_id}&client_secret={client_secret}',
params={
'q':'Russia',
'offset':offset,
'limit':50
})
item = response.json()
print(item, '\n')
all_brewery_of_russia.append(item)
time.sleep(37)
except Exception:
print(Exception)
continue
The Brewery Search method includes several parameters, q – a string with a country name (specify specify “Russia” to get all the breweries based in Russia), offset – allows us to shift by 50 lines in the search to get the next list of breweries, limit – restricts the number of breweries received and can not be more than 50. Convert the answer to JSON and append data sotred in the item object to the all_brewery_of_russia list.
Our data may also include breweries from other countries. That’s why we need to filter the data. Iterate through the all_brewery_of_russia list and keep only those breweires, which country_name is Russia.
brew_list = []
for element in all_brewery_of_russia:
brew = element['response']['brewery']
for i in range(brew['count']):
if brew['items'][i]['brewery']['country_name'] == 'Russia':
brew_list.append(brew['items'][i])
Print out the first element in our brew_list:
print(brew_list[0])
Create a DataFrame with the following columns: brewery_id, beer_count, brewery_name, brewery_slug, brewery_page_url, brewery_city, lat и lng. And several lists to sort out the data stored in the brewery_list:
df = pd.DataFrame()
brewery_id_list = []
beer_count_list = []
brewery_name_list = []
brewery_slug_list = []
brewery_page_url_list = []
brewery_location_city = []
brewery_location_lat = []
brewery_location_lng = []
for brewery in brew_list:
brewery_id_list.append(brewery['brewery']['brewery_id'])
beer_count_list.append(brewery['brewery']['beer_count'])
brewery_name_list.append(brewery['brewery']['brewery_name'])
brewery_slug_list.append(brewery['brewery']['brewery_slug'])
brewery_page_url_list.append(brewery['brewery']['brewery_page_url'])
brewery_location_city.append(brewery['brewery']['location']['brewery_city'])
brewery_location_lat.append(brewery['brewery']['location']['lat'])
brewery_location_lng.append(brewery['brewery']['location']['lng'])
Assign them as column values:
df['brewery_id'] = brewery_id_list
df['beer_count'] = beer_count_list
df['brewery_name'] = brewery_name_list
df['brewery_slug'] = brewery_slug_list
df['brewery_page_url'] = brewery_page_url_list
df['brewery_city'] = brewery_location_city
df['brewery_lat'] = brewery_location_lat
df['brewery_lng'] = brewery_location_lng
And view our DataFrame:
df.head()
Let’s sort the values by brewery_id and store our DataFrame as a CSV file without index column and headings:
df = df.sort_values(by='brewery_id')
df.to_csv('brewery_data.csv', index=False, header=False)
Creating a Clickhouse dictionary
You can create Clickouse dictionaries in many different ways. We will try to structure it in an XML file, configure the server files, and access it through our client. The XML file structure will be the following:
Learn more about other ways you can create Clickhouse dictionaries in the documentation
<yandex>
<dictionary>
<name>breweries</name>
<source>
<file>
<path>/home/ubuntu/brewery_data.csv</path>
<format>CSV</format>
</file>
</source>
<layout>
<flat />
</layout>
<structure>
<id>
<name>brewery_id</name>
</id>
<attribute>
<name>beer_count</name>
<type>UInt64</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>brewery_name</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>brewery_slug</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>brewery_page_url</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>brewery_city</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>lat</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>lng</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
</structure>
<lifetime>300</lifetime>
</dictionary>
</yandex>
name is a dictionary name, attribute holds the properties of the columns, id is a key field, file stores file path and format. We are going to store our file in this directory: /home/ubuntu.
Let’s upload our CSV and XML files to the server, it can be done using an FTP like FileZilla. We explained how to deploy Clickhouse on an Amazon instance in our previous article, this time need to do the same. Open your FileZilla client and go to SFTP settings to add a private key:
Connect to your server address, it can be found in the EC2 management console. Specify SFTP as a protocol, your Host, and Ubuntu as a username.
Your Public DNS may change in case of overload
After connecting we will wind up in this location /home/ubuntu. Let’s put the files in that folder and connect via SSH using Termius. Then we need to move the files to /etc/clickhouse-server to view them in Clickhouse:
Learn how you can connect to an AWS server using SSH client from our previous material “Installing Clickhouse on AWS”
sudo mv breweries_dictionary.xml /etc/clickhouse server/
Go to the config file:
cd /etc/clickhouse-server
sudo nano config.xml
We need the tag, it’s the path to a file that describes the dictionaries structure. Specify the path to our XML file:
<dictionaries_config>/etc/clickhouse-server/breweries_dictionary.xml</dictionaries_config>
Save our file and run the Clickhouse client:
clickhouse client
Let’s check that the dictionary really loaded:
SELECT * FROM system.dictionaries\G
In case of success you will get the following:
Now, let’s write a query with the dictGet function to get the name of the brewery with ID 999. Pass in the dictionary name, as the first argument, then the filed name and ID.
SELECT dictGet('breweries', 'brewery_name', toUInt64(999))
And our query returns this:
Similarly, we could use this function to get a beer name, when the table contains only IDs.