{
    "version": "https:\/\/jsonfeed.org\/version\/1",
    "title": "LEFT JOIN: blog on analytics, visualisation & data science, posts tagged: clickhouse",
    "home_page_url": "https:\/\/en.leftjoin.ru\/tags\/clickhouse\/",
    "feed_url": "https:\/\/en.leftjoin.ru\/tags\/clickhouse\/json\/",
    "icon": "https:\/\/en.leftjoin.ru\/user\/userpic@2x.jpg",
    "author": {
        "name": "Nikolay Valiotti",
        "url": "https:\/\/en.leftjoin.ru\/",
        "avatar": "https:\/\/en.leftjoin.ru\/user\/userpic@2x.jpg"
    },
    "items": [
        {
            "id": "65",
            "url": "https:\/\/en.leftjoin.ru\/all\/altinity-clickhouse-training-101\/",
            "title": "Clickhouse Training 101 by Altinity",
            "content_html": "<p>Just recently I have completed a <a href=\"https:\/\/altinity.com\/clickhouse-training\/?utm_source=leftjoin\">Clickhouse Training by Altinity (101 Series Training). <\/a> For those who are just getting to know Clickhouse, Altinity offers free basic training: <a href=\"https:\/\/altinity.com\/data-warehouse-basics\/?utm_source=leftjoin\">Data Warehouse Basics<\/a>. I recommend starting with it if you are planning to dive into learning.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/altinity-clickhouse-developer-300px.png\" width=\"300\" height=\"300\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Certification by Altinity<\/div>\n<\/div>\n<p>I would like to share my experience on the training as well as the <a href=\"https:\/\/capable-stream-f18.notion.site\/Clickhouse-Training-101-by-Altinity-notes-120f1b6467f44a30956d6d7ffeff7b08\">training notes<\/a>.<br \/>\nThe training costs $500 and lasts 4 days for 2 hours. It is carried out in the evenings Moscow time (starting from 19:00 GMT +3).<\/p>\n<h2>Session # 1.<\/h2>\n<p>The first day mostly revises everything covered in Data Warehouse Basics, but it has several new ideas on how to get useful information on queries from system tables.<\/p>\n<p>For example, this query will show which commands are running and their status.<\/p>\n<pre class=\"e2-text-code\"><code>SELECT command, is_done\r\nFROM system.mutations\r\nWHERE table = 'ontime'<\/code><\/pre><p>Besides, for me it was useful to learn about column compression with the use of codecs:<\/p>\n<pre class=\"e2-text-code\"><code>ALTER TABLE ontime\r\n MODIFY COLUMN TailNum LowCardinality(String) CODEC(ZSTD(1))<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-08-01--11.53.59.png\" width=\"1732\" height=\"1048\" alt=\"\" \/>\n<\/div>\n<p>For those who are just starting with Clickhouse, the first day will be super useful as it will help in understanding table engines and syntax for their creation, partitions, inserting data (for example directly from S3).<\/p>\n<pre class=\"e2-text-code\"><code>INSERT INTO sdata\r\nSELECT * FROM s3(\r\n 'https:\/\/s3.us-east-1.amazonaws.com\/d1-altinity\/data\/sdata*.csv.gz',\r\n 'aws_access_key_id',\r\n 'aws_secret_access_key',\r\n 'Parquet',\r\n 'DevId Int32, Type String, MDate Date, MDatetime\r\nDateTime, Value Float64')<\/code><\/pre><h2>Session # 2.<\/h2>\n<p>I found the second day the most intense and useful as within this session Robert from Altinity talks about aggregate functions and materialized views ( <a href=\"https:\/\/www.notion.so\/Session-2-35af1ed8d2c54c6fa7fcbea3c9385810#f36adc3df7d74deebedcb3c04e019661\">detailed scheme for the creation of materialized views<\/a> ) in Clickhouse in more detail.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--18.11.45.png\" width=\"877\" height=\"495\" alt=\"\" \/>\n<\/div>\n<p>It was super useful for me to learn about index types in Clickhouse.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--18.20.35.png\" width=\"874\" height=\"472\" alt=\"\" \/>\n<\/div>\n<h2>Session # 3.<\/h2>\n<p>During the third day, colleagues share their knowledge on how to work with Kafka and JSON objects stored in the tables.<br \/>\nIt was interesting to find out that working with arrays in Clickhouse is very similar to arrays in Python:<\/p>\n<pre class=\"e2-text-code\"><code>WITH [1, 2, 4] AS array\r\nSELECT\r\n array[1] AS First,\r\n array[2] AS Second,\r\n array[3] AS Third,\r\n array[-1] AS Last,\r\n length(array) AS Length<\/code><\/pre><p>When working with arrays, there is a great feature called ARRAY JOIN which “unrolls” arrays to rows.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--19.14.28.png\" width=\"812\" height=\"518\" alt=\"\" \/>\n<\/div>\n<p>Clickhouse allows you to efficiently interact with JSON objects stored in a table:<\/p>\n<pre class=\"e2-text-code\"><code>-- Get a JSON string value\r\nSELECT JSONExtractString(row, 'request') AS request\r\nFROM log_row LIMIT 3\r\n-- Get a JSON numeric value\r\nSELECT JSONExtractInt(row, 'status') AS status\r\nFROM log_row LIMIT 3<\/code><\/pre><p>This piece of code is an example of how to extract the elements of the JSON array “request” and “status” separately.<\/p>\n<pre class=\"e2-text-code\"><code>ALTER TABLE log_row\r\n ADD COLUMN\r\nstatus Int16 DEFAULT\r\n JSONExtractInt(row, 'status')\r\nALTER TABLE log_row\r\nUPDATE status = status WHERE 1 = 1<\/code><\/pre><h2>Session # 4.<\/h2>\n<p>The most difficult topic from my point of view was saved for the last day – <a href=\"https:\/\/www.notion.so\/Session-4-f2aa33b6fe434a4e8542f0f64f9439bc#3a3038e94dbf4b47a10284dc1dc226ec\">building sharding and replication patterns <\/a> and building queries on distributed Clickhouse servers.<\/p>\n<div class=\"e2-text-picture\">\n<div class=\"fotorama\" data-width=\"938\" data-ratio=\"1.8073217726397\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--19.43.22.png\" width=\"938\" height=\"519\" alt=\"\" \/>\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--19.47.52.png\" width=\"934\" height=\"520\" alt=\"\" \/>\n<\/div>\n<\/div>\n<p>Special respect to Altinity for an excellent collection of labs during the training.<\/p>\n<p><b>Links<\/b>:<\/p>\n<ul>\n<li><a href=\"https:\/\/capable-stream-f18.notion.site\/Clickhouse-Training-101-by-Altinity-notes-120f1b6467f44a30956d6d7ffeff7b08\">Notes in Notion<\/a><\/li>\n<li><a href=\"https:\/\/altinity.com\/clickhouse-training\/?utm_source=leftjoin\">ClickHouse 101 Training by Altinity<\/a><\/li>\n<\/ul>\n",
            "date_published": "2021-08-09T08:41:15+03:00",
            "date_modified": "2021-08-09T08:49:34+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/altinity-clickhouse-developer-300px.png",
            "_date_published_rfc2822": "Mon, 09 Aug 2021 08:41:15 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "65",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/fotorama\/fotorama.css",
                    "system\/library\/fotorama\/fotorama.js"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/altinity-clickhouse-developer-300px.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-08-01--11.53.59.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--18.11.45.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--18.20.35.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--19.14.28.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--19.43.22.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-08-02--19.47.52.png"
                ]
            }
        },
        {
            "id": "46",
            "url": "https:\/\/en.leftjoin.ru\/all\/collecting-social-media-data-for-top-ml-ai-data-science-related\/",
            "title": "Collecting Social Media Data for Top ML, AI &amp; Data Science related accounts on Instagram",
            "content_html": "<p>Instagram is in the top 5 most visited websites, perhaps not for our industry. Nevertheless, we are going to test this hypothesis using Python and our data analytics skills. In this post, we will share how to collect social media data using the Instagram API.<\/p>\n<p><b>Data collection method<\/b><br \/>\nThe Instagram API won’t let us collect data about other platform users for no reason, but there is always a way. Try sending the following request:<\/p>\n<pre class=\"e2-text-code\"><code>https:\/\/instagram.com\/leftjoin\/?__a=1<\/code><\/pre><p>The request returns a JSON object with detailed user information, for instance, we can easily get an account name, number of posts, followers, subscriptions, as well as the first ten user posts with likes count, comments and etc. The <a href=\"https:\/\/github.com\/OlegYurchik\/pyInstagram\">pyInstagram<\/a> library allows sending such requests.<\/p>\n<p><b>SQL schema<\/b><br \/>\nData will be collected into thee Clickhouse tables: users,  posts, comments. The users table will contain user data, such as user id, username,  user’s first and last name, account description, number of followers, subscriptions, posts, comments, and likes, whether an account is verified or not, and so on.<\/p>\n<pre class=\"e2-text-code\"><code>CREATE TABLE instagram.users\r\n(\r\n    `added_at` DateTime,\r\n    `user_id` UInt64,\r\n    `user_name` String,\r\n    `full_name` String,\r\n    `base_url` String,\r\n    `biography` String,\r\n    `followers_count` UInt64,\r\n    `follows_count` UInt64,\r\n    `media_count` UInt64,\r\n    `total_comments` UInt64,\r\n    `total_likes` UInt64,\r\n    `is_verified` UInt8,\r\n    `country_block` UInt8,\r\n    `profile_pic_url` Nullable(String),\r\n    `profile_pic_url_hd` Nullable(String),\r\n    `fb_page` Nullable(String)\r\n)\r\nENGINE = ReplacingMergeTree\r\nORDER BY added_at<\/code><\/pre><p>The posts table will be populated with the post owner name, post id, caption, comments coun, and so on. To check whether a post is an advertisement,  Instagram carousel, or a video we can use these fields: <span class=\"inline-code\">is_ad<\/span>, <span class=\"inline-code\">is_album<\/span> and <span class=\"inline-code\">is_video<\/span>.<\/p>\n<pre class=\"e2-text-code\"><code>CREATE TABLE instagram.posts\r\n(\r\n    `added_at` DateTime,\r\n    `owner` String,\r\n    `post_id` UInt64,\r\n    `caption` Nullable(String),\r\n    `code` String,\r\n    `comments_count` UInt64,\r\n    `comments_disabled` UInt8,\r\n    `created_at` DateTime,\r\n    `display_url` String,\r\n    `is_ad` UInt8,\r\n    `is_album` UInt8,\r\n    `is_video` UInt8,\r\n    `likes_count` UInt64,\r\n    `location` Nullable(String),\r\n    `recources` Array(String),\r\n    `video_url` Nullable(String)\r\n)\r\nENGINE = ReplacingMergeTree\r\nORDER BY added_at<\/code><\/pre><p>In the comments table, we store each comment separately with the comment owner and text.<\/p>\n<pre class=\"e2-text-code\"><code>CREATE TABLE instagram.comments\r\n(\r\n    `added_at` DateTime,\r\n    `comment_id` UInt64,\r\n    `post_id` UInt64,\r\n    `comment_owner` String,\r\n    `comment_text` String\r\n)\r\nENGINE = ReplacingMergeTree\r\nORDER BY added_at<\/code><\/pre><p><b>Writing the script<\/b><br \/>\nImport the following classes from the library: <span class=\"inline-code\">Account<\/span>, <span class=\"inline-code\">Media<\/span>, <span class=\"inline-code\">WebAgent<\/span> and <span class=\"inline-code\">Comment<\/span>.<\/p>\n<pre class=\"e2-text-code\"><code>from instagram import Account, Media, WebAgent, Comment\r\nfrom datetime import datetime\r\nfrom clickhouse_driver import Client\r\nimport requests\r\nimport pandas as pd<\/code><\/pre><p>Next, create an instance of the <span class=\"inline-code\">WebAgent<\/span> class required for some library methods and data updating. To collect any meaningful information we need to have at least account names. Since we don’t have them yet, send the following request to search for porifles by the  keywords specified in queries_list. The search results will be composed of Instagram pages that match any keyword in the list.<\/p>\n<pre class=\"e2-text-code\"><code>agent = WebAgent()\r\nqueries_list = ['machine learning', 'data science', 'data analytics', 'analytics', 'business intelligence',\r\n                'data engineering', 'computer science', 'big data', 'artificial intelligence',\r\n                'deep learning', 'data scientist','machine learning engineer', 'data engineer']\r\nclient = Client(host='12.34.56.789', user='default', password='', port='9000', database='instagram')\r\nurl = 'https:\/\/www.instagram.com\/web\/search\/topsearch\/?context=user&amp;count=0'<\/code><\/pre><p>Let’s iterate the keywords collecting all matching accounts. Then remove duplicates from the obtained list by converting it to set and back.<\/p>\n<pre class=\"e2-text-code\"><code>response_list = []\r\nfor query in queries_list:\r\n    response = requests.get(url, params={\r\n        'query': query\r\n    }).json()\r\n    response_list.extend(response['users'])\r\ninstagram_pages_list = []\r\nfor item in response_list:\r\n    instagram_pages_list.append(item['user']['username'])\r\ninstagram_pages_list = list(set(instagram_pages_list))<\/code><\/pre><p>Now we need to loop through the list of pages and request detailed information about an account if it’s not in the table yet. Create an instance of the Account class and pass username as a parameter.<br \/>\nThen update the account information using the agent.update()<br \/>\nmethod. We will collect only the first 100 posts to keep it moving. Next, create a list named  <span class=\"inline-code\">media_list<\/span> to store received post ids after calling the <span class=\"inline-code\">agent.get_media()<\/span> method.<\/p>\n<p><details><br \/>\n<summary><span style=\"color:#7ea9b8\">Collecting user media data<\/span><\/summary><\/p>\n<pre class=\"e2-text-code\"><code>all_posts_list = []\r\nusername_count = 0\r\nfor username in instagram_pages_list:\r\n    if client.execute(f&quot;SELECT count(1) FROM users WHERE user_name='{username}'&quot;)[0][0] == 0:\r\n        print('username:', username_count, '\/', len(instagram_pages_list))\r\n        username_count += 1\r\n        account_total_likes = 0\r\n        account_total_comments = 0\r\n        try:\r\n            account = Account(username)\r\n        except Exception as E:\r\n            print(E)\r\n            continue\r\n        try:\r\n            agent.update(account)\r\n        except Exception as E:\r\n            print(E)\r\n            continue\r\n        if account.media_count &lt; 100:\r\n            post_count = account.media_count\r\n        else:\r\n            post_count = 100\r\n        print(account, post_count)\r\n        media_list, _ = agent.get_media(account, count=post_count, delay=1)\r\n        count = 0<\/code><\/pre><p><\/details><\/p>\n<p>Because we need to count the total number of likes and comments  before adding a new user to our database, we’ll start with them first. Almost all required fields belong to the <span class=\"inline-code\">Media<\/span> class:<\/p>\n<p><details><br \/>\n<summary><span style=\"color:#7ea9b8\">Collecting user posts<\/span><\/summary><\/p>\n<pre class=\"e2-text-code\"><code>for media_code in media_list:\r\n            if client.execute(f&quot;SELECT count(1) FROM posts WHERE code='{media_code}'&quot;)[0][0] == 0:\r\n                print('posts:', count, '\/', len(media_list))\r\n                count += 1\r\n\r\n                post_insert_list = []\r\n                post = Media(media_code)\r\n                agent.update(post)\r\n                post_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\r\n                post_insert_list.append(str(post.owner))\r\n                post_insert_list.append(post.id)\r\n                if post.caption is not None:\r\n                    post_insert_list.append(post.caption.replace(&quot;'&quot;,&quot;&quot;).replace('&quot;', ''))\r\n                else:\r\n                    post_insert_list.append(&quot;&quot;)\r\n                post_insert_list.append(post.code)\r\n                post_insert_list.append(post.comments_count)\r\n                post_insert_list.append(int(post.comments_disabled))\r\n                post_insert_list.append(datetime.fromtimestamp(post.date).strftime('%Y-%m-%d %H:%M:%S'))\r\n                post_insert_list.append(post.display_url)\r\n                try:\r\n                    post_insert_list.append(int(post.is_ad))\r\n                except TypeError:\r\n                    post_insert_list.append('cast(Null as Nullable(UInt8))')\r\n                post_insert_list.append(int(post.is_album))\r\n                post_insert_list.append(int(post.is_video))\r\n                post_insert_list.append(post.likes_count)\r\n                if post.location is not None:\r\n                    post_insert_list.append(post.location)\r\n                else:\r\n                    post_insert_list.append('')\r\n                post_insert_list.append(post.resources)\r\n                if post.video_url is not None:\r\n                    post_insert_list.append(post.video_url)\r\n                else:\r\n                    post_insert_list.append('')\r\n                account_total_likes += post.likes_count\r\n                account_total_comments += post.comments_count\r\n                try:\r\n                    client.execute(f'''\r\n                        INSERT INTO posts VALUES {tuple(post_insert_list)}\r\n                    ''')\r\n                except Exception as E:\r\n                    print('posts:')\r\n                    print(E)\r\n                    print(post_insert_list)<\/code><\/pre><p><\/details><\/p>\n<p>Store comments in the variable with the same name after calling the <span class=\"inline-code\">get_comments()<\/span> method:<br \/>\n<details><br \/>\n<summary><span style=\"color:#7ea9b8\">Collecting post comments<\/span><\/summary><\/p>\n<pre class=\"e2-text-code\"><code>comments = agent.get_comments(media=post)\r\n                for comment_id in comments[0]:\r\n                    comment_insert_list = []\r\n                    comment = Comment(comment_id)\r\n                    comment_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\r\n                    comment_insert_list.append(comment.id)\r\n                    comment_insert_list.append(post.id)\r\n                    comment_insert_list.append(str(comment.owner))\r\n                    comment_insert_list.append(comment.text.replace(&quot;'&quot;,&quot;&quot;).replace('&quot;', ''))\r\n                    try:\r\n                        client.execute(f'''\r\n                            INSERT INTO comments VALUES {tuple(comment_insert_list)}\r\n                        ''')\r\n                    except Exception as E:\r\n                        print('comments:')\r\n                        print(E)\r\n                        print(comment_insert_list)<\/code><\/pre><p><\/details><\/p>\n<p>And now, when we have obtained user posts and comments new information can be added to the table.<br \/>\n<details><br \/>\n<summary><span style=\"color:#7ea9b8\">Collecting user data<\/span><\/summary><\/p>\n<pre class=\"e2-text-code\"><code>user_insert_list = []\r\n        user_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\r\n        user_insert_list.append(account.id)\r\n        user_insert_list.append(account.username)\r\n        user_insert_list.append(account.full_name)\r\n        user_insert_list.append(account.base_url)\r\n        user_insert_list.append(account.biography)\r\n        user_insert_list.append(account.followers_count)\r\n        user_insert_list.append(account.follows_count)\r\n        user_insert_list.append(account.media_count)\r\n        user_insert_list.append(account_total_comments)\r\n        user_insert_list.append(account_total_likes)\r\n        user_insert_list.append(int(account.is_verified))\r\n        user_insert_list.append(int(account.country_block))\r\n        user_insert_list.append(account.profile_pic_url)\r\n        user_insert_list.append(account.profile_pic_url_hd)\r\n        if account.fb_page is not None:\r\n            user_insert_list.append(account.fb_page)\r\n        else:\r\n            user_insert_list.append('')\r\n        try:\r\n            client.execute(f'''\r\n                INSERT INTO users VALUES {tuple(user_insert_list)}\r\n            ''')\r\n        except Exception as E:\r\n            print('users:')\r\n            print(E)\r\n            print(user_insert_list)<\/code><\/pre><p><\/details><\/p>\n<p><b>Conclusion<\/b><br \/>\nTo sum up, we have collected data of 500 users, with nearly 20K posts and 40K comments. As the database will be updated, we can write a simple query to get the top 10 ML, AI & Data Science related most followed accounts for today.<\/p>\n<pre class=\"e2-text-code\"><code>SELECT *\r\nFROM users\r\nORDER BY followers_count DESC\r\nLIMIT 10<\/code><\/pre><p>And as a bonus, here is a list of the most interesting Instagram accounts on this  topic:<\/p>\n<ol start=\"1\">\n<li><a href=\"https:\/\/www.instagram.com\/ai_machine_learning\/\">@ai_machine_learning<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/neuralnine\/\">@neuralnine<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/datascienceinfo\/\">@datascienceinfo<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/compscistuff\/\">@compscistuff<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/computersciencelife\/\">@computersciencelife<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/welcome.ai\/\">@welcome.ai<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/papa_programmer\/\">@papa_programmer<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/data_science_learn\/\">@data_science_learn<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/neuralnet.ai\/\">@neuralnet.ai<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/techno_thinkers\/\">@techno_thinkers<\/a><\/li>\n<\/ol>\n<p><i>View the code on <a href=\"https:\/\/github.com\/valiotti\/leftjoin\/tree\/master\/instagram\">GitHub<\/a><\/i><\/p>\n",
            "date_published": "2020-09-30T16:06:11+03:00",
            "date_modified": "2020-09-30T16:13:40+03:00",
            "_date_published_rfc2822": "Wed, 30 Sep 2020 16:06:11 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "46",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": []
            }
        },
        {
            "id": "33",
            "url": "https:\/\/en.leftjoin.ru\/all\/handling-website-buttons-in-selenium\/",
            "title": "Handling website buttons in Selenium",
            "content_html": "<p>In our previous article, <a href=\"https:\/\/www.valiotti.com\/leftjoin\/all\/parse-website-with-python-p2\/\" class=\"nu\">“<u>Parsing the data of site’s catalogue, using Beautiful Soup and Selenium<\/u>”<\/a> we have addressed the problem of working with dynamic pages, but sometimes this method doesn’t work,  as with “Show more” buttons. Today we will show how you can imitate button click with Selenium to load a whole page, collect beer IDs, ratings, and send the data to Clickhouse.<\/p>\n<h2>Webpage structure<\/h2>\n<p>Let’s take a random brewery that has 105 check-ins, or customer feedbacks. One page with check-ins displays up to 25 records and looks like this:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/1-8.png\" width=\"1186\" height=\"735\" alt=\"\" \/>\n<\/div>\n<p>If we try to scroll down to the bottom, we will encounter the same button that prevents us from getting all 105 records at once:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2-8.png\" width=\"350\" height=\"65\" alt=\"\" \/>\n<\/div>\n<p>First off, to address this task, let’s find out the button class and just click it until it works. Since Selenium launches the browser and the next “Show more”  button may not be loaded in time, that’s why we set  2-second intervals between the clicks. As soon as the page is loaded we will take its content and parse the relevant data.<br \/>\nLet’s view the source code and  find the button, it’s assigned to the <span class=\"inline-code\">more_checkins<\/span> class.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/3-9.png\" width=\"849\" height=\"94\" alt=\"\" \/>\n<\/div>\n<p>The button has style attributes, such as <span class=\"inline-code\">display<\/span>. When the button is displayed this attribute takes the <span class=\"inline-code\">block<\/span> value. But when we scroll the page to the buttom and there is nothing left to display, the attribute takes the <span class=\"inline-code\">none<\/span> value and we can stop clicking.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/4-8.png\" width=\"562\" height=\"107\" alt=\"\" \/>\n<\/div>\n<h2>Writing our code<\/h2>\n<p>Let’s import the necessary libraries<\/p>\n<pre class=\"e2-text-code\"><code>import time\r\nfrom selenium import webdriver\r\nfrom bs4 import BeautifulSoup as bs\r\nimport re\r\nfrom datetime import datetime\r\nfrom clickhouse_driver import Client<\/code><\/pre><p class=\"note\">Chromedriver is used to run Selenium tests on Chrome and can be downloaded from <a href=\"https:\/\/chromedriver.chromium.org\/downloads\">the official website<\/a><\/p>\n<p>Connect to the database and create cookies:<\/p>\n<pre class=\"e2-text-code\"><code>client = Client(host='ec1-23-456-789-10.us-east-2.compute.amazonaws.com', user='', password='', port='9000', database='')\r\ncount = 0\r\ncookies = {\r\n    'domain':'untappd.com',\r\n    'expiry':1594072726,\r\n    'httpOnly':True,\r\n    'name':'untappd_user_v3_e',\r\n    'path':'\/',\r\n    'secure':False,\r\n    'value':'your_value'\r\n}<\/code><\/pre><p class=\"note\">You can find out more about working with cookies in Selenium from <a href=\"https:\/\/www.valiotti.com\/leftjoin\/all\/parse-website-with-python-p2\/\" class=\"nu\">“<u>Parsing the data of site’s catalogue, using Beautiful Soup and Selenium<\/u>”<\/a>. We will need the untappd_user_v3_e parameter.<\/p>\n<p>As we are going to work with pages that have more than hundreds of thousands of records,  it’s pretty heavy and our instance may be overloaded. To prevent this, we will shut down unnecessary parts and then enable authentication cookie:<\/p>\n<pre class=\"e2-text-code\"><code>options = webdriver.ChromeOptions()\r\nprefs = {'profile.default_content_setting_values': {'images': 2, \r\n                            'plugins': 2, 'fullscreen': 2}}\r\noptions.add_experimental_option('prefs', prefs)\r\noptions.add_argument(&quot;start-maximized&quot;)\r\noptions.add_argument(&quot;disable-infobars&quot;)\r\noptions.add_argument(&quot;--disable-extensions&quot;)\r\ndriver = webdriver.Chrome(options=options)\r\ndriver.get('https:\/\/untappd.com\/TooSunnyBrewery')\r\ndriver.add_cookie(cookies)<\/code><\/pre><p>We  will need a function that would take a link, open it in the browser, load a whole page and return a soup object to be parsed. Get the  <span class=\"inline-code\">display<\/span> attribute, assign it to the <span class=\"inline-code\">more_checkins<\/span>: variable and click the button until the attribute is <span class=\"inline-code\">none<\/span>. Let’s set  2-second intervals between the clicks, to wait for the page to load. As soon as we received the page, converth it into a <span class=\"inline-code\">soup<\/span> object using the <span class=\"inline-code\">bs4<\/span> library.<\/p>\n<pre class=\"e2-text-code\"><code>def get_html_page(url):\r\n    driver.get(url)\r\n    driver.maximize_window()\r\n    more_checkins = driver.execute_script(&quot;var more_checkins=document.getElementsByClassName('more_checkins_logged')[0].style.display;return more_checkins;&quot;)\r\n    print(more_checkins)\r\n    while more_checkins != &quot;none&quot;:\r\n        driver.execute_script(&quot;document.getElementsByClassName('more_checkins_logged')[0].click()&quot;)\r\n        time.sleep(2)\r\n        more_checkins = driver.execute_script(&quot;var more_checkins=document.getElementsByClassName('more_checkins_logged')[0].style.display;return more_checkins;&quot;)\r\n        print(more_checkins)\r\n    source_data = driver.page_source\r\n    soup = bs(source_data, 'lxml')\r\n    return soup<\/code><\/pre><p>Write the following function that will take a page <span class=\"inline-code\">url<\/span>, pass it in the <span class=\"inline-code\">get_html_page<\/span> and receive a soup object to parse. The function returns zipped lists with beer IDs and ratings.<\/p>\n<p class=\"note\"> See how you can use <a href=\"https:\/\/www.valiotti.com\/leftjoin\/all\/parse-website-with-python-p1\/\">Beautiful Soup to retrieve data from a website catalogue<\/a><\/p>\n<pre class=\"e2-text-code\"><code>def parse_html_page(url):\r\n    soup = get_html_page(url)\r\n    brewery_id = soup.find_all('a', {'class':'label',\r\n                                     'href':re.compile('https:\/\/untappd.com\/brewery\/*')})[0]['href'][28:]\r\n    items = soup.find_all('div', {'class':'item',\r\n                                  'id':re.compile('checkin_*')})\r\n    checkin_rating_list = []\r\n    beer_id_list = []\r\n    count = 0\r\n    print('Filling the lists')\r\n    for checkin in items:\r\n        print(count, '\/', len(items))\r\n        try:\r\n            checkin_rating_list.append(float(checkin.find('div', {'class':'caps'})['data-rating']))\r\n        except Exception:\r\n            checkin_rating_list.append('cast(Null as Nullable(Float32))')\r\n        try:\r\n            beer_id_list.append(int(checkin.find('a', {'class':'label'})['href'][-7:]))\r\n        except Exception:\r\n            beer_id_list.append('cast(Null as Nullable(UInt64))')\r\n        count += 1 \r\n    return zip(checkin_rating_list, beer_id_list)<\/code><\/pre><p>Finally, write a function call for the breweries. We’ve covered how to receive a list of Russian brewery IDs in this article: <a href=\"https:\/\/www.valiotti.com\/leftjoin\/all\/example-of-using-dictionaries-in-clickhouse-with-untappd\/\">Example of using dictionaries in Clickhouse with Untappd<\/a>.<br \/>\nLet’s fetch it from the Clickhouse table.<\/p>\n<pre class=\"e2-text-code\"><code>brewery_list = client.execute('SELECT brewery_id FROM brewery_info')<\/code><\/pre><p>If we print out the <span class=\"inline-code\">brewery_list<\/span>,  we will find out that the data is stored in a list of tuples.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/5-6.png\" width=\"378\" height=\"141\" alt=\"\" \/>\n<\/div>\n<p>Let’s make it a bit prettier with the lambda expression:<\/p>\n<pre class=\"e2-text-code\"><code>flatten = lambda lst: [item for sublist in lst for item in sublist]\r\nbrewery_list = flatten(brewery_list)<\/code><\/pre><p>That’s much better:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/6-5.png\" width=\"252\" height=\"139\" alt=\"\" \/>\n<\/div>\n<p>Create a <span class=\"inline-code\">url<\/span> for each brewery in the list, it includes a standard link and a brewery ID in the end. Pass it to the <span class=\"inline-code\">parse_html_page<\/span> function that fetches the <span class=\"inline-code\">get_html_page<\/span> and return lists with <span class=\"inline-code\">beer_id<\/span> and <span class=\"inline-code\">rating_score<\/span>. Since the lists are zipped, we can iterate throught them, create a tuple and send it to Clickhouse.<\/p>\n<pre class=\"e2-text-code\"><code>for brewery_id in brewery_list:\r\n    print('Fetching the brewery with id', brewery_id, count, '\/', len(brewery_list))\r\n    url = 'https:\/\/untappd.com\/brewery\/' + str(brewery_id)\r\n    returned_checkins = parse_html_page(url)\r\n    for rating, beer_id in returned_checkins:\r\n        tuple_to_insert = (rating, beer_id)\r\n        try:\r\n            client.execute(f'INSERT INTO beer_reviews VALUES {tuple_to_insert}')\r\n        except errors.ServerException as E:\r\n            print(E)\r\n    count += 1<\/code><\/pre><p>That’s it about the way we can handle “Show more” buttons. Over time we will form a large dataset for further analysis, to work with in our next series.<\/p>\n",
            "date_published": "2020-06-22T10:54:56+03:00",
            "date_modified": "2020-06-22T11:13:03+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/1-8.png",
            "_date_published_rfc2822": "Mon, 22 Jun 2020 10:54:56 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "33",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/1-8.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2-8.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/3-9.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/4-8.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/5-6.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/6-5.png"
                ]
            }
        },
        {
            "id": "32",
            "url": "https:\/\/en.leftjoin.ru\/all\/example-of-using-dictionaries-in-clickhouse-with-untappd\/",
            "title": "Example of using dictionaries in Clickhouse with Untappd",
            "content_html": "<p>In Clickhouse we can use internal dictionaries as well as external dictionaries, they can be an alternative to JSON that doesn’t always work fine. DIctionaries store information in memory and can be invoked with the dictGet method. Let’s review how we can create one in Clickhouse and use it for our queries.<\/p>\n<p>We will illustrate an example of data using the Untappd API. Untappd is a social network for everyone who loves craft beer. We are going to use сheck-ins of Russian-based craft breweries and start collecting information about them to analyze this data later on and to draw some conclusions. in today’s article, we will analyze how to receive metadata on Russian breweries with Untappd and store it in a Clickhouse dictionary.<\/p>\n<h2>Collecting data with Untappd<\/h2>\n<p>First off, we need to create a new app to receive <span class=\"inline-code\">client_id<\/span> and  <span class=\"inline-code\">client_secret_key<\/span> to make API calls. Follow  <a href=\"https:\/\/untappd.com\/api\/register?register=new\">this link<\/a> and fill in the fields:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/1-7.png\" width=\"734\" height=\"478\" alt=\"\" \/>\n<\/div>\n<p>Usually, it takes about 1 to 3 weeks to wait for approval.<\/p>\n<pre class=\"e2-text-code\"><code>import requests\r\nimport pandas as pd\r\nimport time<\/code><\/pre><p>We’ll be using the requests library to make API calls, view results in a Pandas DataFrame, and save them in a CSV file before sending it to a Clickhouse dictionary. Untappd has strict limits on the number of requests, prohibiting us to make more than 100 calls per hour. Therefore, we need to make our script wait for 38 seconds using the Python time module.<\/p>\n<pre class=\"e2-text-code\"><code>client_id = 'your_client_id'\r\nclient_secret = 'your_client_secret'\r\nall_brewery_of_russia = []<\/code><\/pre><p>We want to get data for one thousand Russian breweries. One request to the <a href=\"https:\/\/untappd.com\/api\/docs#brewerysearch\">Brewery Search<\/a> method enables us to view up to 50 breweries. The website gave us 3369 breweries when searching the word “Russia” manually.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2-7.png\" width=\"728\" height=\"223\" alt=\"\" \/>\n<\/div>\n<p>Let’s check this, scroll down to the bottom, and open the page code.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/3-8.png\" width=\"405\" height=\"199\" alt=\"\" \/>\n<\/div>\n<p>Each brewery received is stored in the <span class=\"inline-code\">beer-item<\/span> class. This means we can the number of references to <span class=\"inline-code\">beer-item<\/span>:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/4-7.png\" width=\"395\" height=\"21\" alt=\"\" \/>\n<\/div>\n<p>And as it turned out, we have exactly 1000 breweries, not 3369. When searching the word “Russia” manually, the results also contain some American breweries. So, we need to make 20 calls, getting 50 breweries at a time:<\/p>\n<pre class=\"e2-text-code\"><code>for offset in range(0, 1000, 50):\r\n    try:\r\n        print('offset = ', offset)\r\n        print('remained:', 1000 - offset, '\\n')\r\n        response = requests.get(f'https:\/\/api.untappd.com\/v4\/search\/brewery?client_id={client_id}&amp;client_secret={client_secret}',\r\n                               params={\r\n                                   'q':'Russia',\r\n                                   'offset':offset,\r\n                                   'limit':50\r\n                               })\r\n        item = response.json()\r\n        print(item, '\\n')\r\n        all_brewery_of_russia.append(item)\r\n        time.sleep(37)\r\n    except Exception:\r\n        print(Exception)\r\n        continue<\/code><\/pre><p>The <span class=\"inline-code\">Brewery Search<\/span> method includes several parameters, q – a string with a country name (specify specify “Russia” to get all the breweries based in Russia),  offset – allows us to shift by 50 lines in the search to get the next list of breweries, limit – restricts the number of breweries received and can not be more than 50. Convert the answer to JSON and append data sotred in the <span class=\"inline-code\">item<\/span> object to the  <span class=\"inline-code\">all_brewery_of_russia<\/span>  list.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/5-5.png\" width=\"911\" height=\"323\" alt=\"\" \/>\n<\/div>\n<p>Our data may also include breweries from other countries. That’s why we need to filter the data. Iterate through the <span class=\"inline-code\">all_brewery_of_russia<\/span>  list and keep only those breweires, which <span class=\"inline-code\">country_name<\/span> is Russia.<\/p>\n<pre class=\"e2-text-code\"><code>brew_list = []\r\nfor element in all_brewery_of_russia:\r\n    brew = element['response']['brewery']\r\n    for i in range(brew['count']):\r\n        if brew['items'][i]['brewery']['country_name'] == 'Russia':\r\n            brew_list.append(brew['items'][i])<\/code><\/pre><p>Print out the first element in our <span class=\"inline-code\">brew_list<\/span>:<\/p>\n<pre class=\"e2-text-code\"><code>print(brew_list[0])<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/11.png\" width=\"1009\" height=\"79\" alt=\"\" \/>\n<\/div>\n<p>Create a DataFrame with the following columns: <span class=\"inline-code\">brewery_id<\/span>,  <span class=\"inline-code\">beer_count<\/span>,  <span class=\"inline-code\">brewery_name<\/span>,  <span class=\"inline-code\">brewery_slug<\/span>,  <span class=\"inline-code\">brewery_page_url<\/span>,  <span class=\"inline-code\">brewery_city<\/span>,  <span class=\"inline-code\">lat<\/span> и  <span class=\"inline-code\">lng<\/span>. And several lists to sort out the data stored in the <span class=\"inline-code\">brewery_list<\/span>:<\/p>\n<pre class=\"e2-text-code\"><code>df = pd.DataFrame()\r\nbrewery_id_list = []\r\nbeer_count_list = []\r\nbrewery_name_list = []\r\nbrewery_slug_list = []\r\nbrewery_page_url_list = []\r\nbrewery_location_city = []\r\nbrewery_location_lat = []\r\nbrewery_location_lng = []\r\nfor brewery in brew_list:\r\n    brewery_id_list.append(brewery['brewery']['brewery_id'])\r\n    beer_count_list.append(brewery['brewery']['beer_count'])\r\n    brewery_name_list.append(brewery['brewery']['brewery_name'])\r\n    brewery_slug_list.append(brewery['brewery']['brewery_slug'])\r\n    brewery_page_url_list.append(brewery['brewery']['brewery_page_url'])\r\n brewery_location_city.append(brewery['brewery']['location']['brewery_city'])\r\n    brewery_location_lat.append(brewery['brewery']['location']['lat'])\r\n    brewery_location_lng.append(brewery['brewery']['location']['lng'])<\/code><\/pre><p>Assign them as column values:<\/p>\n<pre class=\"e2-text-code\"><code>df['brewery_id'] = brewery_id_list\r\ndf['beer_count'] = beer_count_list\r\ndf['brewery_name'] = brewery_name_list\r\ndf['brewery_slug'] = brewery_slug_list\r\ndf['brewery_page_url'] = brewery_page_url_list\r\ndf['brewery_city'] = brewery_location_city\r\ndf['brewery_lat'] = brewery_location_lat\r\ndf['brewery_lng'] = brewery_location_lng<\/code><\/pre><p>And view our DataFrame:<\/p>\n<pre class=\"e2-text-code\"><code>df.head()<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/6-4.png\" width=\"866\" height=\"185\" alt=\"\" \/>\n<\/div>\n<p>Let’s sort the values by <span class=\"inline-code\">brewery_id<\/span>  and store our DataFrame as a CSV file without index column and headings:<\/p>\n<pre class=\"e2-text-code\"><code>df = df.sort_values(by='brewery_id')\r\ndf.to_csv('brewery_data.csv', index=False, header=False)<\/code><\/pre><h2>Creating a Clickhouse dictionary<\/h2>\n<p>You can create Clickouse dictionaries in many different ways. We will try to structure it in an XML file, configure the server files, and access it through our client. The XML file structure will be the following:<\/p>\n<p class=\"note\">Learn more about other ways you can create Clickhouse dictionaries <a href=\"https:\/\/clickhouse.tech\/docs\/ru\/engines\/table-engines\/special\/dictionary\/\">in the documentation<\/a><\/p>\n<pre class=\"e2-text-code\"><code>&lt;yandex&gt;\r\n&lt;dictionary&gt;\r\n        &lt;name&gt;breweries&lt;\/name&gt;\r\n        &lt;source&gt;\r\n                &lt;file&gt;\r\n                        &lt;path&gt;\/home\/ubuntu\/brewery_data.csv&lt;\/path&gt;\r\n                        &lt;format&gt;CSV&lt;\/format&gt;\r\n                &lt;\/file&gt;\r\n        &lt;\/source&gt;\r\n        &lt;layout&gt;\r\n                &lt;flat \/&gt;\r\n        &lt;\/layout&gt;\r\n        &lt;structure&gt;\r\n                &lt;id&gt;\r\n                        &lt;name&gt;brewery_id&lt;\/name&gt;\r\n                &lt;\/id&gt;\r\n                &lt;attribute&gt;\r\n                        &lt;name&gt;beer_count&lt;\/name&gt;\r\n                        &lt;type&gt;UInt64&lt;\/type&gt;\r\n                        &lt;null_value&gt;Null&lt;\/null_value&gt;\r\n                &lt;\/attribute&gt;\r\n                &lt;attribute&gt;\r\n                        &lt;name&gt;brewery_name&lt;\/name&gt;\r\n                        &lt;type&gt;String&lt;\/type&gt;\r\n                        &lt;null_value&gt;Null&lt;\/null_value&gt;\r\n                &lt;\/attribute&gt;\r\n                &lt;attribute&gt;\r\n                        &lt;name&gt;brewery_slug&lt;\/name&gt;\r\n                        &lt;type&gt;String&lt;\/type&gt;\r\n                        &lt;null_value&gt;Null&lt;\/null_value&gt;\r\n                &lt;\/attribute&gt;\r\n                &lt;attribute&gt;\r\n                        &lt;name&gt;brewery_page_url&lt;\/name&gt;\r\n                        &lt;type&gt;String&lt;\/type&gt;\r\n                        &lt;null_value&gt;Null&lt;\/null_value&gt;\r\n                &lt;\/attribute&gt;\r\n                &lt;attribute&gt;\r\n                        &lt;name&gt;brewery_city&lt;\/name&gt;\r\n                        &lt;type&gt;String&lt;\/type&gt;\r\n                        &lt;null_value&gt;Null&lt;\/null_value&gt;\r\n                &lt;\/attribute&gt;\r\n                &lt;attribute&gt;\r\n                        &lt;name&gt;lat&lt;\/name&gt;\r\n                        &lt;type&gt;String&lt;\/type&gt;\r\n                        &lt;null_value&gt;Null&lt;\/null_value&gt;\r\n                &lt;\/attribute&gt;\r\n                &lt;attribute&gt;\r\n                        &lt;name&gt;lng&lt;\/name&gt;\r\n                        &lt;type&gt;String&lt;\/type&gt;\r\n                        &lt;null_value&gt;Null&lt;\/null_value&gt;\r\n                &lt;\/attribute&gt;\r\n        &lt;\/structure&gt;\r\n        &lt;lifetime&gt;300&lt;\/lifetime&gt;\r\n&lt;\/dictionary&gt;\r\n&lt;\/yandex&gt;<\/code><\/pre><p><span class=\"inline-code\">name<\/span> is a dictionary name,  <span class=\"inline-code\">attribute<\/span>  holds the properties of the columns, <span class=\"inline-code\">id<\/span>  is a key field,   <span class=\"inline-code\">file<\/span> stores file path and format. We are going to store our file in this directory: \/home\/ubuntu.<\/p>\n<p>Let’s upload our CSV and XML files to the server, it can be done using an FTP like FileZilla. We explained how to deploy Clickhouse on an Amazon instance in our <a href=\"http:\/\/leftjoin.ru\/all\/stavim-clickhouse-na-aws\/\">previous article<\/a>, this time need to do the same. Open your FileZilla client and go to SFTP settings to add a private key:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/7-4.png\" width=\"777\" height=\"557\" alt=\"\" \/>\n<\/div>\n<p>Connect to your server address, it can be found in the EC2 management console. Specify SFTP as a protocol, your Host, and Ubuntu as a username.<\/p>\n<p class=\"note\">Your Public DNS may change in case of overload<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/8-3.png\" width=\"452\" height=\"451\" alt=\"\" \/>\n<\/div>\n<p>After connecting we will wind up in this location \/home\/ubuntu. Let’s put the files in that folder and connect via SSH using Termius. Then we need to move the files to \/etc\/clickhouse-server to view them in Clickhouse:<\/p>\n<p class=\"note\">Learn how you can connect to an AWS server using SSH client from our previous material <a href=\"http:\/\/leftjoin.ru\/all\/stavim-clickhouse-na-aws\/\" class=\"nu\">“<u>Installing Clickhouse on AWS<\/u>”<\/a><\/p>\n<pre class=\"e2-text-code\"><code>sudo mv breweries_dictionary.xml \/etc\/clickhouse server\/<\/code><\/pre><p>Go to the config file:<\/p>\n<pre class=\"e2-text-code\"><code>cd \/etc\/clickhouse-server\r\nsudo nano config.xml<\/code><\/pre><p>We need the <dictionaries_config> tag, it’s the path to a file that describes the dictionaries structure. Specify the path to our XML file:<\/p>\n<pre class=\"e2-text-code\"><code>&lt;dictionaries_config&gt;\/etc\/clickhouse-server\/breweries_dictionary.xml&lt;\/dictionaries_config&gt;<\/code><\/pre><p>Save our file and run the Clickhouse client:<\/p>\n<pre class=\"e2-text-code\"><code>clickhouse client<\/code><\/pre><p>Let’s check that the dictionary really loaded:<\/p>\n<pre class=\"e2-text-code\"><code>SELECT * FROM system.dictionaries\\G<\/code><\/pre><p>In case of success you will get the following:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/9-3.png\" width=\"854\" height=\"344\" alt=\"\" \/>\n<\/div>\n<p>Now, let’s write a query with the  <a href=\"https:\/\/clickhouse.tech\/docs\/ru\/sql-reference\/functions\/ext-dict-functions\/#dictget\">dictGet<\/a> function to get the name of the brewery with ID 999. Pass in the dictionary name, as the first argument, then the filed name and ID.<\/p>\n<pre class=\"e2-text-code\"><code>SELECT dictGet('breweries', 'brewery_name', toUInt64(999))<\/code><\/pre><p>And our query returns this:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/10.png\" width=\"400\" height=\"60\" alt=\"\" \/>\n<\/div>\n<p>Similarly, we could use this function to get a beer name, when the table contains only IDs.<\/p>\n",
            "date_published": "2020-06-16T10:25:07+03:00",
            "date_modified": "2020-06-16T10:18:15+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/1-7.png",
            "_date_published_rfc2822": "Tue, 16 Jun 2020 10:25:07 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "32",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/1-7.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2-7.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/3-8.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/4-7.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/5-5.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/11.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/6-4.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/7-4.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/8-3.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/9-3.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/10.png"
                ]
            }
        },
        {
            "id": "26",
            "url": "https:\/\/en.leftjoin.ru\/all\/working-with-materialized-views-in-clickhouse\/",
            "title": "Working with Materialized Views in Clickhouse",
            "content_html": "<p>This time we’ll illustrate how you can pass data on Facebook ad campaigns to Clickhouse tables with Python and implement Materialized Views. What is materialized views, you may ask. Oftentimes Clickhouse is used to handle large amounts of data and the time spent waiting for a response from a table with raw data is constantly increasing. Usually, we would use ETL-process to address this task efficiently or create aggregate tables, which are not that useful because we have to regularly update them. Clickhouse system offers a new way to meet the challenge using materialized views.<br \/>\nMaterialized Views allow us to store and update data on a hard drive in line with the <span class=\"inline-code\">SELECT<\/span> query that was used to get a view. When we need to insert data into a table, the <span class=\"inline-code\">SELECT<\/span> method transforms our data and populates a materialized view.<\/p>\n<p><b>Setting Up Amazon EC2 instance<\/b><br \/>\nWe need to connect our Python script that we created <a href=\"https:\/\/www.valiotti.com\/leftjoin\/all\/collecting-data-on-facebook-ad-campaigns\/\/\">in this article<\/a>  to Cickhouse. The script will make queries, so let’s open several ports. In your AWS Dashboard go to Network & Security — Security Groups. Our instance belongs to the launch-wizard-1 group. Сlick it and pay attention to the Inbound rules, you need to set them as shown in this screenshot:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/1-22.png\" width=\"969\" height=\"293\" alt=\"\" \/>\n<\/div>\n<p><b>Setting up Clickhouse<\/b><br \/>\nIt’s time to set up Clickhouse. Let’s edit the config.xml file using nano text editor:<\/p>\n<pre class=\"e2-text-code\"><code>cd \/etc\/clickhouse-server\r\nsudo nano config.xml<\/code><\/pre><p class=\"note\"> Learn more about <a href=\"https:\/\/linuxize.com\/post\/how-to-use-nano-text-editor\/\">the shortcuts here <\/a> if you didn’t get how to exit nano too :)<\/p>\n<p>Uncomment this line:<\/p>\n<pre class=\"e2-text-code\"><code>&lt;listen_host&gt;0.0.0.0&lt;\/listen_host&gt;<\/code><\/pre><p>to access your database from any IP-address:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2-3.png\" width=\"479\" height=\"194\" alt=\"\" \/>\n<\/div>\n<p><b>Create a table and its materialized view<\/b><br \/>\nOpen a terminal window to create our database with tables:<\/p>\n<pre class=\"e2-text-code\"><code>CREATE DATABASE db1\r\nUSE db1<\/code><\/pre><p>We’ll refer to the same example of data collection from Facebook. The data on Ad Campaigns may often change and be updated, with this in mind we want to create a materialized view that would automatically update aggregate tables containing the costs data. Our Clickhouse table will look almost the same as the <span class=\"inline-code\">DataFrame<\/span> used in the previous post. We picked <span class=\"inline-code\">ReplacingMergeTree <\/span> as an engine for our table, it will remove duplicates by sorting key:<\/p>\n<pre class=\"e2-text-code\"><code>CREATE TABLE facebook_insights(\r\n\tcampaign_id UInt64,\r\n\tclicks UInt32,\r\n\tspend Float32,\r\n\timpressions UInt32,\r\n\tdate_start Date,\r\n\tdate_stop\t Date,\r\n\tsign Int8\r\n) ENGINE = ReplacingMergeTree\r\nORDER BY (date_start, date_stop)<\/code><\/pre><p>And then,  create a materialized view:<\/p>\n<pre class=\"e2-text-code\"><code>CREATE MATERIALIZED VIEW fb_aggregated\r\nENGINE = SummingMergeTree()\r\nORDER BY date_start\r\n\tAS\r\n\tSELECT campaign_id,\r\n\t\t      date_start,\r\n\t\t      sum(spend * sign) as spent,\r\n\t\t      sum(impressions * sign) as impressions,\r\n\t\t      sum(clicks * sign) as clicks\r\n\tFROM facebook_insights\r\n\tGROUP BY date_start, campaign_id<\/code><\/pre><p class=\"note\">More details are available in <a href=\"https:\/\/clickhouse.yandex\/blog\/en\/how-to-update-data-in-clickhouse\">the Clickhouse blog<\/a>.<\/p>\n<p>Unfortunately for us, Clikhouse system doesn’t include a familiar <span class=\"inline-code\">UPDATE<\/span> method. So we need to find a workaround. Thanks to the Yandex team, these guys offered to insert rows with a negative <span class=\"inline-code\">sign<\/span> first,  and then use <span class=\"inline-code\">sign<\/span> for reversing. According to this principle, the old data will be ignored when summing.<\/p>\n<p><b>Script<\/b><br \/>\nLet’s start writing the script and import a new library, which is called clickhouse_driver. It allows to make queries to Clickhouse in Python:<\/p>\n<p class=\"note\">We are using the updated version of the script from <a href=\"https:\/\/www.valiotti.com\/leftjoin\/all\/collecting-data-on-facebook-ad-campaigns\/\"> “Collecting Data on Facebook Ad Campaigns”<\/a>. But it will work fine if you just combine this code with the previous one.<\/p>\n<pre class=\"e2-text-code\"><code>from datetime import datetime, timedelta\r\nfrom clickhouse_driver import Client\r\nfrom clickhouse_driver import errors<\/code><\/pre><p>An object of the <span class=\"inline-code\">Client<\/span>class enables us to make queries with the <span class=\"inline-code\">execute()<\/span> method. Type in your public DNS in the <span class=\"inline-code\">host<\/span> field,  <span class=\"inline-code\">port<\/span> – 9000, specify default as a <span class=\"inline-code\">user<\/span>, and a <span class=\"inline-code\">database<\/span>for the connection.<\/p>\n<pre class=\"e2-text-code\"><code>client = Client(host='ec1-2-34-56-78.us-east-2.compute.amazonaws.com', user='default', password=' ', port='9000', database='db1')<\/code><\/pre><p>To ensure that everything works as expected, we need to write the following query that will print out names of all databases stored on the server:<\/p>\n<pre class=\"e2-text-code\"><code>client.execute('SHOW DATABASES')<\/code><\/pre><p>In case of success the query will return this list:<\/p>\n<pre class=\"e2-text-code\"><code>[('_temporary_and_external_tables',), ('db1',), ('default',), ('system',)]<\/code><\/pre><p>For example, we want to get data for the past three days. Create several <span class=\"inline-code\">datetime objects <\/span> with the datetime library and convert them to strings using the<span class=\"inline-code\">strftime()<\/span><br \/>\nmethod:<\/p>\n<pre class=\"e2-text-code\"><code>date_start = datetime.now() - timedelta(days=3)\r\ndate_end = datetime.now() - timedelta(days=1)\r\ndate_start_str = date_start.strftime(&quot;%Y-%m-%d&quot;)\r\ndate_end_str = date_end.strftime(&quot;%Y-%m-%d&quot;)<\/code><\/pre><p>This query returns all table columns for a certain period:<\/p>\n<pre class=\"e2-text-code\"><code>SQL_select = f&quot;select campaign_id, clicks, spend, impressions, date_start, date_stop, sign from facebook_insights where date_start &gt; '{date_start_str}' AND date_start &lt; '{date_end_str}'&quot;<\/code><\/pre><p>Make a query and pass the data to the <span class=\"inline-code\">old_data_list<\/span>.  And then, replace their <span class=\"inline-code\">sign<\/span>for -1 and append elements to the <span class=\"inline-code\">new_data_list<\/span>:<\/p>\n<pre class=\"e2-text-code\"><code>new_data_list = []\r\nold_data_list = []\r\nold_data_list = client.execute(SQL_select)\r\n\r\nfor elem in old_data_list:\r\n    elem = list(elem)\r\n    elem[len(elem) - 1] = -1\r\n    new_data_list.append(elem)<\/code><\/pre><p>Finally,  write our algorithm: insert the data with the <span class=\"inline-code\">sign =-1<\/span>, optimize it with <span class=\"inline-code\">ReplacingMergeTree<\/span>, remove duplicates, and <span class=\"inline-code\">INSERT<\/span> new data with the <span class=\"inline-code\">sign = 1<\/span>.<\/p>\n<pre class=\"e2-text-code\"><code>SQL_query = 'INSERT INTO facebook_insights VALUES'\r\nclient.execute(SQL_query, new_data_list)\r\nSQL_optimize = &quot;OPTIMIZE TABLE facebook_insights&quot;\r\nclient.execute(SQL_optimize)\r\nfor i in range(len(insight_campaign_id_list)):\r\n    client.execute(SQL_query, [[insight_campaign_id_list[i],\r\n                                insight_clicks_list[i],\r\n                                insight_spend_list[i],\r\n                                insight_impressions_list[i],\r\n                                datetime.strptime(insight_date_start_list[i], '%Y-%m-%d').date(),\r\n                                datetime.strptime(insight_date_start_list[i], '%Y-%m-%d').date(),\r\n                                1]])\r\n    client.execute(SQL_optimize)<\/code><\/pre><p>Get back to Clickhouse and make the next query to view the first 20 rows:<br \/>\n<span class=\"inline-code\">SELECT * FROM facebook_insights LIMIT 20<\/span><\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/3-4.png\" width=\"754\" height=\"375\" alt=\"\" \/>\n<\/div>\n<p>And  <span class=\"inline-code\">SELECT * FROM fb_aggregated LIMIT 20<\/span> to compare our materialized view:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/4-4.png\" width=\"748\" height=\"385\" alt=\"\" \/>\n<\/div>\n<p>Nice work! Now we have a materialized view that will be updated each time when the data in the <span class=\"inline-code\">facebook_insights <\/span> table changes. The trick with the <span class=\"inline-code\">sign<\/span> operator allows to differ already processed data and prevent its summation, while <span class=\"inline-code\">ReplacingMergeTree <\/span>engine helps us to remove duplicates.<\/p>\n",
            "date_published": "2020-05-20T15:06:48+03:00",
            "date_modified": "2020-05-21T07:35:35+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/1-22.png",
            "_date_published_rfc2822": "Wed, 20 May 2020 15:06:48 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "26",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/1-22.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2-3.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/3-4.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/4-4.png"
                ]
            }
        },
        {
            "id": "25",
            "url": "https:\/\/en.leftjoin.ru\/all\/installing-clickhouse-on-aws\/",
            "title": "Installing Clickhouse on AWS",
            "content_html": "<p>In today’s article, we’ll work with Clickhouse and install it on a free Amazon EC2 instance.<\/p>\n<p><b>AWS account and Ubuntu Instance<\/b><br \/>\nThe easiest way to install Clickouse on a virtual Ubuntu server is to use .deb packages. There is no need to worry if you don’t have one – Amazon Web Services provide Free Tier offers that you can enjoy for 12 months. Just go to <a href=\"https:\/\/aws.amazon.com\">https:\/\/aws.amazon.com<\/a> and sign up.<br \/>\nOnce registered, go to your Dashboard, find the “Build a solution” option and click «Launch a virtual machine», and choose one that comes with Ubuntu Server pre-installed.<\/p>\n<div class=\"e2-text-picture\">\n<div class=\"fotorama\" data-width=\"730\" data-ratio=\"2.5\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/1-2.png\" width=\"730\" height=\"292\" alt=\"\" \/>\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2-2.png\" width=\"986\" height=\"146\" alt=\"\" \/>\n<\/div>\n<\/div>\n<p>Create a key pair – one is a public key and another is a private key that you need to store locally, it secures our connection.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/3-3.png\" width=\"683\" height=\"500\" alt=\"\" \/>\n<\/div>\n<p>After this step, we’ll see the EC2 Management Console with our EC2 instance up and running. It has a public DNS that we need to save.<\/p>\n<p><b>Connect with Termius<\/b><br \/>\nWe connect to our virtual server via SSH protocol. The majority of clients support this protocol, and for our case, we’ll be using Termius. Click «+ NEW HOST» and complete the fields.<br \/>\nType your public DNS in the address field,  «ubuntu» as a Username and leave the password field empty. Now,  in order to complete the Key field, we need to specify a file with the .pem extension, the one that was received after creating an Instance. Your result should be much the same:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/4-3.png\" width=\"451\" height=\"419\" alt=\"\" \/>\n<\/div>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/5-2.png\" width=\"450\" height=\"673\" alt=\"\" \/>\n<\/div>\n<p>Connect to our Instance after authentication and we’ll get a new console screen:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/6-2.png\" width=\"677\" height=\"403\" alt=\"\" \/>\n<\/div>\n<p>Now we can install Clickhouse. Run the following command to add the Clickhouse repository:<\/p>\n<p class=\"note\">Learn more about other ways you can install Clickhouse in the <a href=\"https:\/\/clickhouse.tech\/docs\/en\/getting-started\/install\/\">documentation<\/a><\/p>\n<pre class=\"e2-text-code\"><code>echo &quot;deb http:\/\/repo.yandex.ru\/clickhouse\/deb\/stable\/ main\/&quot; | sudo tee \/etc\/apt\/sources.list.d\/clickhouse.list<\/code><\/pre><p>Make sure to update the packages:<\/p>\n<pre class=\"e2-text-code\"><code>sudo apt-get update<\/code><\/pre><p>Finally, install our client and server by running:<\/p>\n<pre class=\"e2-text-code\"><code>sudo apt-get install -y clickhouse-server clickhouse-client<\/code><\/pre><p>And it’s done! The client and Clickhouse server were installed on our instance. Run the server:<\/p>\n<pre class=\"e2-text-code\"><code>sudo service clickhouse-server start<\/code><\/pre><p>Test our Clickhouse server to ensure that everything works:<\/p>\n<pre class=\"e2-text-code\"><code>sudo service clickhouse-server status:<\/code><\/pre><p>And if everything works fine, we’ll get the following output:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/7-2.png\" width=\"881\" height=\"125\" alt=\"\" \/>\n<\/div>\n<p>Type in the next command to connect to our client:<\/p>\n<pre class=\"e2-text-code\"><code>clickhouse client<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/8-2.png\" width=\"505\" height=\"122\" alt=\"\" \/>\n<\/div>\n<p>Run another check as suggested in the documentation:<\/p>\n<pre class=\"e2-text-code\"><code>SELECT 1<\/code><\/pre><p>If everything was done right we’ll get the following:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/9-2.png\" width=\"455\" height=\"218\" alt=\"\" \/>\n<\/div>\n<p>This is it! Next time we’ll share how to work with Python and  Clickhouse,  return to our script that retrieves data on Ad Campaigns and push it into a table to visualize after.<\/p>\n",
            "date_published": "2020-05-18T14:28:02+03:00",
            "date_modified": "2020-05-18T14:21:17+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/1-2.png",
            "_date_published_rfc2822": "Mon, 18 May 2020 14:28:02 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "25",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/fotorama\/fotorama.css",
                    "system\/library\/fotorama\/fotorama.js",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/1-2.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2-2.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/3-3.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/4-3.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/5-2.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/6-2.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/7-2.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/8-2.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/9-2.png"
                ]
            }
        },
        {
            "id": "21",
            "url": "https:\/\/en.leftjoin.ru\/all\/clickhouse-as-a-consumer-for-amazon-msk\/",
            "title": "Clickhouse as a consumer for Amazon MSK",
            "content_html": "<p class=\"note\"><b>Disclaimer<\/b>: the note is of a technical nature, therefore it might be interesting to fewer people with business background.<\/p>\n<p>This blog hasn’t addressed the topic of <a href=\"https:\/\/clickhouse.yandex\/\">Clickhouse<\/a> yet, however it’s one of the fastest databases from Yandex company. Brief account without going into details: Clickhouse – is the most efficiently written DBMS of a column type with respect to program code, information about the DBMS is quite thoroughly <a href=\"https:\/\/clickhouse.tech\/docs\/en\/\">described in the documentation<\/a> and in multiple videos on Youtube (<a href=\"https:\/\/www.youtube.com\/watch?v=bSyQahMVZ7w\">one<\/a>, <a href=\"https:\/\/www.youtube.com\/watch?v=Ac2C2G2g8Cg\">two<\/a>, <a href=\"https:\/\/www.youtube.com\/watch?v=ltg8vstuHUU\">three<\/a>).<\/p>\n<p>Over the last four years, I’ve been using Clickhouse in my practice as an analyst and expert in building analytical reporting. Mostly, I’ve been using <a href=\"all\/redash-full-fledged-on-demand-analytics\/\">Redash<\/a> for solution of tasks on reporting visualization \/ reports with parameters \/ dashboards, as it is the most convenient interface for access to Clickhouse data.<br \/>\nHowever, just recently, in Looker, <a href=\"all\/looker-overview\/\">that I spoke about previously<\/a>, an opportunity to connect Clickhouse as a data source appeared. It’s worth noting, that in Tableau connection to Clickhouse has existed for quite a while.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/setup-clickhouse-new-connection-624@2x.png\" width=\"635\" height=\"761\" alt=\"\" \/>\n<\/div>\n<p>The architecture of the analytical service, based on Clickhouse, is predominantly cloud one. That’s how it was in the task reviewed. Let’s assume you have an allocated instance EC2 in Amazon (on which you’ve installed Clickhouse) and a separate Kafka-cluster (solution <a href=\"https:\/\/aws.amazon.com\/ru\/msk\/\">of Amazon MSK<\/a>).<\/p>\n<p><b>The task<\/b>: is to connect Clickhouse as a  <i>consumer<\/i>  in order to obtain information from brokers of your Kafka cluster. In fact, it’s quite thoroughly described  <a href=\"https:\/\/docs.aws.amazon.com\/msk\/latest\/developerguide\/create-client-machine.html\">how exactly one can connect to Kafka cluster<\/a> in documentation on the site of Amazon MSK, so I won’t repeat this information. In my case, the guide helped: the topics were created by a producer from machine with installed Clickhouse, and were read from it by a consumer.<\/p>\n<p>However, a problem arose: at <a href=\"https:\/\/clickhouse.tech\/docs\/en\/operations\/table_engines\/kafka\/\">connection of Clickhouse to Kafka as a consumer<\/a>, the following error occurred:<\/p>\n<pre class=\"e2-text-code\"><code>020.02.02 18:01:56.209132 [ 46 ] {e7124cd5-2144-4b1d-bd49-8a410cdbd607} &lt;Error&gt; executeQuery: std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 20.1.2.4 (official build) (from 127.0.0.1:46586) (in query: SELECT * FROM events), Stack trace (when copying this message, always include the lines below):<\/code><\/pre><p>For a long time I’ve been searching information in Clickhouse documentation regarding a possible cause of this error, however it was in vain. The next idea that I had was checking the work of a local Kafka broker from the same machine. I installed Kafka client, connected Clickhouse, sent the data to topic and to Clickhouse and managed to read it easily, so Clickhouse consumer works with a local broker, meaning that it works in general.<\/p>\n<p>Having spoken with all my acquaintances who are experts in the fields of infrastructure and Clickhouse, we weren’t able to identify the cause of the problem in stride. We’ve checked firewall, network settings,- everything was opened. It was also confirmed by the fact, that messages could be sent from a local machine to the topic of remote browser by the command <i>bin\/kafka-console-producer.sh<\/i> and could be also read from there <i>bin\/kafka-console-consumer.sh<\/i>.<\/p>\n<p>Thereafter, I got the idea to appeal to the main guru and developer of Clickhouse – Alexey Milovidov. Alexey eagerly tried to reply to the questions arisen and proposed a number of hypothesis, that we checked (for instance, tracing of network connections, etc.), however, even after more low-level audit we didn’t manage to localize the problem. then, Alexey recommended to turn to Michail Philimonov from the company <a href=\"https:\/\/www.altinity.com\/\">Altinity<\/a>. Michail turned out to be an extremely responsive expert, and proposed one hypothesis after another in order to conduct testing (in parallel, providing tips on a better way of testing).<\/p>\n<p>As a result of our joint efforts, we discovered that the problem arises at the library <i>librdkafka<\/i>, since the other package <i>kafkacat<\/i>, that uses the same library, falls off the connection to the broker with the very same problem (<i>Local: timed out<\/i>).<\/p>\n<p>After examination of connection through <i>bin\/kafka-console-consumer.sh<\/i> and connection parameters, Michail advised to add the following line into \/<i>etc\/clickhouse-server\/config.xml<\/i>:<\/p>\n<pre class=\"e2-text-code\"><code>&lt;kafka&gt;&lt;security_protocol&gt;ssl&lt;\/security_protocol&gt;&lt;\/kafka&gt;<\/code><\/pre><p>And, oh, what a miracle! Clickhouse connected to the cluster and pulled the required data from the broker.<\/p>\n<p>I hope, this recipe and my experience will allow you to save time and powers in studying the similar problem :)<\/p>\n<p>P.S. Actually Clickhouse has a very friendly community and <a href=\"https:\/\/t.me\/clickhouse_en\">telegram-chat<\/a> where you can ask for advice and more likely to get help.<\/p>\n",
            "date_published": "2020-02-04T11:55:18+03:00",
            "date_modified": "2020-05-12T11:21:46+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/setup-clickhouse-new-connection-624@2x.png",
            "_date_published_rfc2822": "Tue, 04 Feb 2020 11:55:18 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "21",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/setup-clickhouse-new-connection-624@2x.png"
                ]
            }
        }
    ],
    "_e2_version": 3386,
    "_e2_ua_string": "E2 (v3386; Aegea)"
}