7 posts tagged

clickhouse


Installing Clickhouse on AWS

Reading time: 5 minutes

In today’s article, we’ll work with Clickhouse and install it on a free Amazon EC2 instance.

AWS account and Ubuntu Instance
The easiest way to install Clickhouse on a virtual Ubuntu server is to use the .deb packages. If you don’t have a server, there is no need to worry: Amazon Web Services offers a Free Tier that you can use for 12 months. Just go to https://aws.amazon.com and sign up.
Once registered, go to your Dashboard, find the “Build a solution” section, click «Launch a virtual machine», and choose an image that comes with Ubuntu Server pre-installed.

Create a key pair. It consists of a public key, which stays with AWS, and a private key, which you download and need to store locally. The key pair secures our connection to the instance.

After this step, we’ll see the EC2 Management Console with our EC2 instance up and running. It has a public DNS that we need to save.

Connect with Termius
We connect to our virtual server over SSH. Most clients support this protocol; in our case, we’ll use Termius. Click «+ NEW HOST» and fill in the fields.
Type your public DNS in the address field, enter «ubuntu» as the Username, and leave the password field empty. To fill in the Key field, select the file with the .pem extension that you received when creating the instance. Your result should look much like this:

After authentication we are connected to the instance and get a new console screen.
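
If you prefer a plain terminal to Termius, the same connection can be made with the OpenSSH client. The sketch below is only an illustration: the function name `connect`, the `SSH_BIN` override, and the file and host names in the usage line are hypothetical and not part of the original setup.

```shell
# connect KEY_FILE HOST: open an SSH session as the "ubuntu" user.
# SSH_BIN is a hypothetical override that makes the function easy to dry-run.
connect() {
  key_file="$1"
  host="$2"
  # OpenSSH rejects private keys that are readable by other users.
  chmod 400 "$key_file" || return 1
  "${SSH_BIN:-ssh}" -i "$key_file" "ubuntu@$host"
}

# Usage (placeholders, replace with your own values):
#   connect my-key.pem <your-public-dns>
```

Restricting the key file to its owner first avoids the "unprotected private key file" refusal that OpenSSH gives for freshly downloaded .pem files.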

Now we can install Clickhouse. The documentation describes other ways to install it; here we use the .deb packages. Run the following command to add the Clickhouse repository:

echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list

Then refresh the package index:

sudo apt-get update

Finally, install our client and server by running:

sudo apt-get install -y clickhouse-server clickhouse-client

And it’s done! The Clickhouse client and server are now installed on our instance. Start the server:

sudo service clickhouse-server start

Check the server status to make sure that everything works:

sudo service clickhouse-server status

If everything works fine, the output reports that the server is running.

Run the next command to start the client and connect to the server:

clickhouse client

Run another check as suggested in the documentation:

SELECT 1

If everything was done right, the client returns a single row containing 1.
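
The same check can be scripted, which is handy right after provisioning an instance. A minimal sketch, assuming `clickhouse-client` is on the PATH; the function name and the `CLICKHOUSE_CLIENT` override are hypothetical helpers introduced here for illustration.

```shell
# check_select_one: run SELECT 1 non-interactively and compare the answer.
# CLICKHOUSE_CLIENT is a hypothetical override, useful for dry runs.
check_select_one() {
  result="$("${CLICKHOUSE_CLIENT:-clickhouse-client}" --query "SELECT 1")" || return 1
  [ "$result" = "1" ]
}

# Usage:
#   check_select_one && echo "Clickhouse answers queries"
```

The function returns a non-zero status both when the client cannot reach the server and when the answer is not 1, so it can be used directly in a provisioning script.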

That’s it! Next time we’ll show how to work with Python and Clickhouse: we’ll return to our script that retrieves data on ad campaigns and push that data into a table so we can visualize it later.

 No comments    779   2020   Amazon Web Services   AWS   clickhouse   data analytics

Clickhouse as a consumer for Amazon MSK

Reading time: 4 minutes

Disclaimer: this note is technical, so it may be of less interest to readers with a business background.

This blog hasn’t covered Clickhouse yet, although it is one of the fastest databases around and was developed at Yandex. In brief, without going into details: Clickhouse is a column-oriented DBMS with arguably the most efficiently written code of its kind. It is described quite thoroughly in the documentation and in multiple videos on YouTube (one, two, three).

Over the last four years, I’ve used Clickhouse in my work as an analyst and as an expert in building analytical reporting. For visualization, reports with parameters, and dashboards I’ve mostly used Redash, as it is the most convenient interface to Clickhouse data.
However, Looker, which I wrote about previously, has recently gained the ability to connect to Clickhouse as a data source. It’s worth noting that Tableau has supported Clickhouse connections for quite a while.

The architecture of an analytical service based on Clickhouse is predominantly cloud-based, and that was the case in the task described here. Let’s assume you have a dedicated EC2 instance in Amazon (with Clickhouse installed on it) and a separate Kafka cluster (Amazon MSK).

The task is to connect Clickhouse as a consumer so that it receives data from the brokers of your Kafka cluster. The Amazon MSK documentation describes quite thoroughly how to connect to a Kafka cluster, so I won’t repeat it here. In my case the guide worked: topics were created by a producer running on the machine with Clickhouse installed and were read from that same machine by a consumer.
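
For orientation, consuming from Kafka in Clickhouse means creating a table with the Kafka engine and pointing it at the brokers. The sketch below only prints such a statement. The table name `events` matches the query in the error quoted in this post, while the columns, the `JSONEachRow` format, and the broker, topic, and group values in the usage line are assumptions for illustration, not the schema actually used.

```shell
# kafka_table_ddl BROKERS TOPIC GROUP: print DDL for a Kafka-engine table.
# Columns and message format are assumed; adjust them to your data.
kafka_table_ddl() {
  cat <<SQL
CREATE TABLE events (ts DateTime, message String)
ENGINE = Kafka
SETTINGS kafka_broker_list = '$1',
         kafka_topic_list = '$2',
         kafka_group_name = '$3',
         kafka_format = 'JSONEachRow'
SQL
}

# Usage (placeholder values):
#   clickhouse-client --query "$(kafka_table_ddl 'b-1.example.com:9094' my_topic my_group)"
```

Printing the statement first makes it easy to review the broker list and topic before anything is created on the server.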

However, a problem arose: when connecting Clickhouse to Kafka as a consumer, the following error occurred:

2020.02.02 18:01:56.209132 [ 46 ] {e7124cd5-2144-4b1d-bd49-8a410cdbd607} <Error> executeQuery: std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 20.1.2.4 (official build) (from 127.0.0.1:46586) (in query: SELECT * FROM events), Stack trace (when copying this message, always include the lines below):

I spent a long time searching the Clickhouse documentation for a possible cause of this error, but in vain. My next idea was to test against a local Kafka broker on the same machine. I installed a Kafka client, connected Clickhouse, sent data to a topic, and managed to read it from Clickhouse easily. So the Clickhouse consumer works with a local broker, which means that it works in general.

Even after talking to all my acquaintances who are experts in infrastructure and Clickhouse, we couldn’t identify the cause of the problem right away. We checked the firewall and network settings: everything was open. This was also confirmed by the fact that messages could be sent from the local machine to a topic on the remote broker with bin/kafka-console-producer.sh and read back from there with bin/kafka-console-consumer.sh.

Then I got the idea to turn to the main guru and developer of Clickhouse, Alexey Milovidov. Alexey readily answered my questions and proposed a number of hypotheses that we checked (for instance, tracing network connections), but even after this lower-level audit we couldn’t localize the problem. Alexey then recommended contacting Michail Philimonov from Altinity. Michail turned out to be an extremely responsive expert and proposed one hypothesis after another for testing (while also giving tips on better ways to test).

As a result of our joint efforts, we discovered that the problem arises in the librdkafka library: another tool, kafkacat, which uses the same library, fails to connect to the broker with the very same error (Local: Timed out).

After examining the connection made through bin/kafka-console-consumer.sh and its connection parameters, Michail advised adding the following line to /etc/clickhouse-server/config.xml:

<kafka><security_protocol>ssl</security_protocol></kafka>

And, oh, what a miracle! Clickhouse connected to the cluster and pulled the required data from the broker.
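
The same setting can be tried at the librdkafka level with kafkacat, by listing cluster metadata with and without an explicit security protocol. A minimal sketch: `-b`, `-L`, and `-X` are kafkacat’s broker, metadata-listing, and librdkafka-property options, while the function name, the `KAFKACAT_BIN` override, and the broker address in the usage lines are hypothetical.

```shell
# list_metadata BROKER [PROTOCOL]: ask a broker for cluster metadata.
# PROTOCOL defaults to plaintext, which is also librdkafka's default.
# KAFKACAT_BIN is a hypothetical override that makes dry runs possible.
list_metadata() {
  broker="$1"
  protocol="${2:-plaintext}"
  "${KAFKACAT_BIN:-kafkacat}" -b "$broker" -L -X "security.protocol=$protocol"
}

# Usage (placeholder address):
#   list_metadata b-1.example.com:9094        # the failing case described in this post
#   list_metadata b-1.example.com:9094 ssl    # the case that matches the config fix
```

If the first call times out and the second one lists brokers and topics, the broker listener expects TLS, and the client-side security protocol is the thing to change.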

I hope this recipe and my experience will save you time and effort when investigating a similar problem :)

P.S. Clickhouse has a very friendly community and a Telegram chat where you can ask for advice and are likely to get help.

 No comments    1192   2020   Analytics engineering   clickhouse   expert   troubleshooting   yandex