How to: Google App Script

Время чтения текста – 4 минуты

Today’s post will shed some light on how you can use one of the very handy basic analytics tools – Google Sheets – even more effectively.
Often, analysts need to work a lot with tables and create selections for various conditions. In working with tables, the problem usually lies in the fact that you have to process and organize large amounts of data stored on different sheets. For example, you keep track of customers on one sheet of a spreadsheet, uploading data from third-party services, and you would like to add or change the data on another sheet. Let’s see how easy and simple we suggest to solve this problem.

Script editor

If you have a Google account and data sheets uploaded to Google Sheets, then you can create a script for this table. Select the table in which you want to automate the transfer of information from one sheet to another, open it and select the “Script Editor” item from the “Tools” menu. The browser will redirect you to the Apps Script page, where you can create and edit scripts for the table.

Automating line transfer to another sheet

So, our script should automatically perform the task of line transfer to another sheet. To do this, we create one more column in the table, in which it will be possible to check the box to transfer the line or remove it to cancel this action. Let’s see how exactly this is done.
When writing a script, you can use a trigger function that fires when a certain condition is met. You can read more about trigger functions in the documentation. In our script, we use the OnEdit(e) function, which is triggered when a table cell is edited (toggle the checkbox on and off). We will call this function at the end of the script, but for now, we will write everything that should happen in the function.

function myScript(e) {   
  // We set the following condition for the function: you only need to react to clicking the checkmark in the eighth column on the "Лиды-воронка" sheet.
  if (e.source.getActiveSheet().getName() == "Лиды- воронка" && e.range.getColumn() == 8)
  {
    // Saving the objects of the source sheet and the destination sheet
    destSheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('test');
    sourceSheet = e.source.getActiveSheet();
    // We clear the destination sheet, starting from the second line, since we have a title in the table.
    destSheet.getRange(2, 1, destSheet.getLastRow(), destSheet.getLastColumn()).clearContent();
    // We go through all the cells with check marks, looking for those cells in which the check marks are put down.
    range = sourceSheet.getRange(1, 1, sourceSheet.getLastRow(), sourceSheet.getLastColumn());       
    for (i = 2; i <= range.getNumRows(); i++) 
    {      
      // We get all the checkboxes.
      if (range.getCell(i,8).getValue())
      {        
        // If the checkbox is filled, the current line is transferred to a new sheet.
        currentRow = sourceSheet.getRange(i, 1, i, sourceSheet.getLastColumn());           
        destSheet.appendRow(currentRow.getValues()[0]);
      }      
    }    
  }

// Then, we call a trigger function that will call our script every time we edit a cell.
function onEdit(e) {
  myScript(e)
}

As a result, if a check mark in a column is selected, then the corresponding row will be transferred to another sheet. Also, it is important to remember that when you remove a check mark, the line is removed from another sheet.
In our example, we got the following two tables:

Conclusions

Today we showed you an example of a script that will simplify, and maybe even improve, the routine work with tables. Of course, many more processes can be automated with Google Sheets scripts. You can read about other features yourself, and we’ll cover some of them in detail in future posts!

 No comments    512   2021   google analytics

Mean VS median: how to choose a target metric?

Время чтения текста – 8 минут

In today’s article, we would like to highlight a simple, but important topic – how to choose a simple metric to evaluate a particular dataset. Everyone has been familiar with the arithmetic mean for a long time, almost every student knows very well that you should sum up all the available values, divide by their number and get the average value. However, school knowledge does not include any alternative options, of which, in fact, there are many in statistics – almost for every occasion. In solving research and marketing problems, people often take mean as a target. Is this legitimate or is there a better option? Let’s figure it out.

To begin with, it is worth remembering the definitions of the two metrics that we will talk about today.
Mean is the most popular statistic used to calculate a data center. What is the median? Median is a value that splits data, sorted in order of increasing values, into two equal parts. This means that the median shows the central value in the sample if the number of cases is odd and the arithmetic mean of the two values ​​if the number of cases in the sample is even.

Research tasks

So, the estimation of the sample mean is often important in many research questions. For instance, specialists studying demography often study changes in the number of regions in Russia in order to track the dynamics and reflect it in reports. Let’s try to calculate the average size of the Russian city, as well as the median, and then compare the results.
First, you need to find and load data by connecting the pandas library for this.

import pandas as pd
city = pd.read_csv('city.csv', sep = ';')

Then, you need to calculate the mean and median of the sample.

mean_pop = round (city.population_2020.mean (), 0)
median_pop = round (city.population_2020.median (), 0)

The values, of course, are different, since the distribution of observations in the sample is different from the normal one. In order to understand whether they are very different, let’s build a distribution graph and display the mean and median.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette('rainbow')
fig = plt.figure(figsize = (20, 15))
ax = fig.add_subplot(1, 1, 1)
g = sns.histplot(data = city, x= 'population_2020', alpha=0.6, bins = 100, ax=ax)

g.axvline(mean_pop, linewidth=2, color='r', alpha=0.9, linestyle='--', label = 'Mean = {:,.0f}'.format(mean_pop).replace(',', ' '))
g.axvline(median_pop, linewidth=2, color='darkgreen', alpha=0.9, linestyle='--', label = 'Median = {:,.0f}'.format(median_pop).replace(',', ' '))

plt.ticklabel_format(axis='x', style='plain')
plt.xlabel("Population", fontsize=20)
plt.ylabel("Number of cities", fontsize=20)
plt.title("Distribution of population of russian cities", fontsize=20)
plt.legend(fontsize="xx-large")
plt.show()

Also, on this data it is worth building a boxplot for more accurate visualization with the main distribution quantiles, median, mean and outliers.

fig = plt.figure(figsize = (10, 10))
sns.set_theme(style="whitegrid")
sns.set_palette(palette="pastel")

sns.boxplot(y = city['population_2020'], showfliers = False)

plt.scatter(0, 550100, marker='*', s=100, color = 'black', label = 'Outlier')
plt.scatter(0, 560200, marker='*', s=100, color = 'black')
plt.scatter(0, 570300, marker='*', s=100, color = 'black')
plt.scatter(0, mean_pop, marker='o', s=100, color = 'red', edgecolors = 'black', label = 'Mean')
plt.legend()

plt.ylabel("Population", fontsize=15)
plt.ticklabel_format(axis='y', style='plain')
plt.title("Boxplot of population of russian cities", fontsize=15)
plt.show()

It follows from the graphs that the median is significantly less than the average, and it is also clear that this is a consequence of the presence of outliers which are Moscow and St. Petersburg. Since the arithmetic mean is an extremely sensitive metric to outliers, it is not worth relying on conclusions about the mean if they are present in the sample. An increase or decrease in the population of Moscow can greatly change the average population in Russia, but this will not affect the real regional trend.
Using the arithmetic mean, we say that the number of a typical (average) city in the Russian Federation is 268 thousand people. However, this misleads us, since the mean value is significantly higher than the median solely due to the size of the population of Moscow and St. Petersburg. In fact, the number of a typical Russian city is significantly less (2 times less!) and is approximately 104 thousand citizens.

Marketing tasks

In a business tasks, the difference between the mean and the median is also important, as using the wrong metric can seriously affect the results of the advertising campaign or make it difficult to achieve the goal. Let’s take a look at a real example of the difficulties an entrepreneur can face in retail if he chooses the wrong target metric.
To begin with, as in the previous example, let’s load a dataset about supermarket purchases. Let’s select the dataset columns necessary for analysis and rename them to simplify the code in the future. Since this data is not as well prepared as the previous ones, it is necessary to group all purchased items by receipts. In this case, it is necessary to group by two variables: by the customer’s id and by the date of purchase (the date and time is determined by the moment of closing the bill, therefore, all purchases within one bill coincide by date variable). Then, let’s name the resulting column “total_bill”, that is, the check amount and calculate the average and median.

df = pd.read_excel ('invoice_data.xlsx')
df.columns = ['user', 'total_price', 'date']
groupped_df = pd.DataFrame (df.groupby (['user', 'date']). total_price.sum ())
groupped_df.columns = ['total_bill']
mean_bill = groupped_df.total_bill.mean ()
median_bill = groupped_df.total_bill.median ()

Now, as in the previous example, you need to plot the distribution of customer checks and boxplot, and also display the median and mean on each of them.

sns.set_palette('rainbow')
fig = plt.figure(figsize = (20, 15))
ax = fig.add_subplot(1, 1, 1)
sns.histplot(groupped_df, x = 'total_bill', binwidth=200, alpha=0.6, ax=ax)
plt.xlabel("Purchases", fontsize=20)
plt.ylabel("Total bill", fontsize=20)
plt.title("Distribution of total bills", fontsize=20)
plt.axvline(mean_bill, linewidth=2, color='r', alpha=1, linestyle='--', label = 'Mean = {:.0f}'.format(mean_bill))
plt.axvline(median_bill, linewidth=2, color='darkgreen', alpha=1, linestyle='--', label = 'Median = {:.0f}'.format(median_bill))
plt.legend(fontsize="xx-large")
plt.show()
fig = plt.figure(figsize = (10, 10))
sns.set_theme(style="whitegrid")
sns.set_palette(palette="pastel")

sns.boxplot(y = groupped_df['total_bill'], showfliers = False)

plt.scatter(0, 1800, marker='*', s=100, color = 'black', label = 'Outlier')
plt.scatter(0, 1850, marker='*', s=100, color = 'black')
plt.scatter(0, 1900, marker='*', s=100, color = 'black')
plt.scatter(0, mean_bill, marker='o', s=100, color = 'red', edgecolors = 'black', label = 'Mean')
plt.legend()

plt.ticklabel_format(axis='y', style='plain')
plt.ylabel("Total bill", fontsize=15)
plt.title("Boxplot of total bills", fontsize=15)
plt.show()

The graphs show that the distribution is different from normal, which means that the median and mean are not equal. The median value is smaller than the average by about 220 rubles.
Now, imagine that marketers have a task to increase the average buyer’s bill. A marketer may decide that since the average check is 601 rubles, the following promotion can be offered: “All buyers who make a purchase over 600 rubles, will get 20% discount on any good for 100 rubles.” In general, it is a reasonable offer, however, in reality, the average check is lower – 378 rubles. Thus, the majority of buyers will not be interested in the offer, since their purchase usually does not reach the proposed threshold. This means, that they will not take advantage of the offer and will not receive a discount, and the company will not be able to achieve its goal and increase the profit. The point is that the initial assumptions were wrong.

Conclusions

As you already understood, the mean often shows a more pleasant result, both for business and for research tasks, because it is always nicer to imagine the situation with the average check or the demographic situation in the country better than it really is. However, one must always remember about the shortcomings of mean in order to be able to correctly choose the appropriate analogue for assessing a particular situation.

 No comments    413   2021   analysis   matplotlib   pandas   python

Graph of telegram channels related to analytics

Время чтения текста – 3 минуты

The authors of various Telegram-blogs often publish a selection of their favorite channels so as to share their professional choice with their audience. The idea, of course, is not new, but I decided not just to make a rating of interesting analytical telegram blogs, but to solve this problem analytically.

As part of the current course of my studies, I am learning many modern approaches to data analysis and visualization. At the very beginning of the course, there was a warm-up exercise: object-oriented programming in Python for collecting and iterative building a graph with TMDB API. Usually, this method is used to construct a connection graph of actors, where the connection is a role in the same film. However, I decided that I could apply it to another problem: building a graph of connections for the analytic community.

Since my time has been particularly limited recently, and I have already completed a similar task for the course, I decided to transfer this knowledge to someone else who is interested in analytics. Fortunately, at that moment, a candidate for the vacancy of a junior data analyst, Andrey, texted me in direct messages. He is now in the process of comprehending all the intricacies of analytics, so we agreed on an internship, in which Andrey parsed data from telegram channels.

Andrey’s main task was to collect all texts from the Internet analyst’s telegram channel, select the channels, which Aleksey Nikushin linked, collect texts from these telegram channels and links on these channels. “Link” means any mention of the channel: through @, through a link or repost. As a result of parsing, Andrey got two files: nodes and edges.
Now I will present to you the graph that I got based on this data and comment on the results.

I would like to take this opportunity to express my compliments to the karpov.courses team, as Andrey has excellent knowledge of the Python language!

As a result, the top 10 channels in terms of degree (number of connections) looks like this:

  1. Интернет-аналитика
  2. Reveal The Data
  3. Инжиниринг Данных
  4. Data Events
  5. Datalytics
  6. Чартомойка
  7. LEFT JOIN
  8. Epic Growth
  9. RTD: ссылки и репосты
  10. Дашбордец

In my opinion, it turned out extremely exciting and visually interesting, and Andrey is a great fellow! By the way, he also started his own channel ”Это разве аналитика?”, where analytics news is published.

Looking ahead, this problem has a continuation. With the help of the Markov chain, we modeled where the user ends up if he iteratively navigates through all the mentions in the channels. It turned out unexpectedly, but we will tell about it next time!

 No comments    440   2021   data analytics   python

Bubble charts basics: area vs radius

Время чтения текста – 5 минут

Data visualization is a skill used in any industry where data is present, because tables are only good for storing information. When there is a need to present data, or rather certain conclusions derived from them, the data must be presented on graphs of a suitable type. So, here you are faced with two tasks: the first is to choose the right type of graph, the second is to display the results in a plausible way. Today we will tell you about one mistake that designers sometimes make when visualizing data on bubble-charts and how this mistake can be avoided.

The crux of building a bubble-chart

Firstly, let us tell you a bit of boring theory before we start analyzing the data. Bubble-chart is a convenient way to show three numerical variables without building a 3D model. The usual X and Y axes indicate the values ​​of two parameters, and the third is shown by the size of the circle that corresponds to each observation. This is what makes it possible to avoid the need to build a complex 3D chart, that is, anyone who sees a bubble-chart will be able to draw conclusions about the data much faster.

A mistake that a designer, but not a data analyst, can make

With the metrics that are displayed on the axes of the graph, no questions arise. This is the usual way of visualizing data, but with the data shown by the size of the circles there is some difficulty: how to correctly and accurately display changes in the values ​​of a variable, if the control is not a point on the axis, but the size of this point?
The fact is that when building such a graph without using analytical tools, for example, in a graphics editor, the author can draw circles, taking the radius of the circle as its size. At first glance, everything seems to be absolutely correct – the larger the value of the variable, the larger the radius of the circle. However, in this case, the area of ​​the circle will increase not as a linear, but as a power function, because S = π × r2. For instance, the figure below shows that if you double the radius of a circle, the area will quadruple.


Draw a circle in Matplotlib  

fig = plt.figure (figsize = (10, 10))
ax = fig.add_subplot (1, 1, 1)
s = 4 * 10e3


ax.scatter (100, 100, s = s, c = 'r')
ax.scatter (100, 100, s = s / 4, c = 'b')
ax.scatter (100, 100, s = 10, c = 'g')
plt.axvline (99, c = 'black')
plt.axvline (101, c = 'black')
plt.axvline (98, c = 'black')
plt.axvline (102, c = 'black')


ax.set_xticks (np.arange (95, 106, 1))
ax.grid (alpha = 1)

plt.show ()

This means that the graph will look implausible, because the dimensions will not reflect the real change in the variable, and the viewer pays attention and compares exactly the area of ​​the circles on the graph.

How to build such a graph correctly?

Fortunately, if you build bubble-charts using Python libraries (Matplotlib and Seaborn), then the size of the circle will be determined by the area, which is absolutely correct from in terms of visualization.
Now, using the example of real data found on Kaggle, we will show how to build a bubble-chart. The data contains the following variables: country, population size, percentage of literate population. For the chart to be readable, let’s take a subsample of the top 10 countries after sorting all the data in order of increasing GDP.

First, let’s load all the necessary libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Then, load the data, clear it from all rows with missing values ​​and transform the population of countries to millions of people:

data = pd.read_csv ('countries of the world.csv', sep = ',')
data = data.dropna ()
data = data.sort_values ​​(by = 'Population', ascending = False)
data = data.head (10)
data ['Population'] = data ['Population']. apply (lambda x: x / 1000000)

Now that all the preparations are complete, you can build a bubble-chart:

sns.set (style = "darkgrid")
fig, ax = plt.subplots (figsize = (10, 10))
g = sns.scatterplot (data = data, x = "Literacy (%)", y = "GDP ($ per capita)", size = "Population", sizes = (10,1500), alpha = 0.5)
plt.xlabel ("Literacy (Percentage of literate citizens)")
plt.ylabel ("GDP per Capita")
plt.title ('Chart with bubbles as area', fontdict = {'fontsize': 'x-large'})

def label_point (x, y, val, ax):
    a = pd.concat ({'x': x, 'y': y, 'val': val}, axis = 1)
    for i, point in a.iterrows ():
        ax.text (point ['x'], point ['y'] + 500, str (point ['val']))

label_point (data ['Literacy (%)'], data ['GDP ($ per capita)'], data ['Country'], plt.gca ())

ax.legend (loc = 'upper left', fontsize = 'medium', title = 'Population (in mln)', title_fontsize = 'large', labelspacing = 1)

plt.show ()

This graph displays three metrics in an understandable way: the level of GDP per capita on the Y axis, the percentage of the literate population on the X axis, and the population – by the area of ​​the circle.

We recommend using size of the circle to show one of the variables, if there is a need to show 3 or more variables on one chart.

 No comments    549   2021   python   visualisation

How and why should you export reports from Jupyter Notebook to PDF

Время чтения текста – 5 минут

If you are a data analyst and you need to present a report to a client, if you are looking for a job and do not know how to draw up a test task in such a way that people will pay attention to you, if you have a lot of educational projects related to data analytics and visualization, this post will be very, very useful to you.
Looking at someone else’s code in a Jupyter Notebook can be problematic, because the result is often lost between lines of code with data preparation, importing the necessary libraries and a series of attempts to implement the idea. That is why a method such as exporting results to a PDF file in LaTeX format is a great option for final visualization. It will save time and look presentable. In scientific circles, articles and reports are very often formatted using LaTeX, since it has a number of advantages:

  • Math equations and formulas look neater.
  • The bibliography is automatically generated based on all references used in the document.
  • The author can focus on the content (not on the appearance of the document), since the layout of the text and other data is set automatically by specifying the necessary parameters in the code.

Today we will talk in detail about how to export such beautiful reports from Jupyter Notebook to PDF using LaTeX.

Installing LaTeX

The most important point in generating a report from a Jupyter Notebook in Python is exporting it to the final file. The main library you need to install is – nbconvert – which converts your notebook into any convenient document format: pdf (as in our case), html, latex, etc. This library needs not only to be installed, but some preinstalling of several other packages as well: Pandoc, TeX, and Chromium. According to the link to the library, the whole process is described in detail for each software, so we will not dwell on it here.
Once you have completed all the preliminary steps, you need to install and import the library into your Jupyter Notebook.

! pip install nbconvert
import nbconvert

Export tables to Markdown format

Usually, tables look a bit odd in reports, as they can be difficult to read quickly, but sometimes it is still necessary to add a small table to the final document. In order for the table to look neat, you need to save it in Markdown format. This can be done manually, but if there is a lot of data in the table, it is better to come up with a more convenient method. We suggest using the following simple pandas_df_to_markdown_table () function, which converts any dataframe to a markdown-table. Note: after the conversion, indices disappear, therefore, if they are important (as in our example), it is worth saving them into a variable in the first column of the dataframe.

data_g = px.data.gapminder ()
summary = round (data_g.describe (), 2)
summary.insert (0, 'metric', summary.index)

# Function to convert dataframe to Markdown Table
def pandas_df_to_markdown_table (df):
    from IPython.display import Markdown, display
    fmt = ['---' for i in range (len (df.columns))]
    df_fmt = pd.DataFrame ([fmt], columns = df.columns)
    df_formatted = pd.concat ([df_fmt, df])
    display (Markdown (df_formatted.to_csv (sep = "|", index = False)))

pandas_df_to_markdown_table (summary)

Export image to report

In this example, we will build a bubble-chart, the construction method of which was described in a recent post. Previously we used the Seaborn library, which shown that the display of data with the size of circles on the graph is correct. The same graphs can be created using the Plotly library.
In order to display the plot in the final report, you also need to complete an additional step. The point is that plt.show () will not help to display the graph when exporting. Therefore, you need to save the graph in the working directory, and then, using the iPython.display library, display it using the Image () function.

from IPython.display import Image
import plotly.express as px
fig = px.scatter (data_g.query ("year == 2007"), x = "gdpPercap", y = "lifeExp",
                 size = "pop", color = "continent",
                 log_x = True, size_max = 70)
fig.write_image ('figure_1.jpg')
Image (data = 'figure_1.jpg', width = 1000)

Formation and export of the report

When all stages of data analysis are completed, the report can be exported. If you need headings or text in the report, then write them in the cells of the notebook, changing the format from Code to Markdown. For export, you can use the terminal, running the second line there without an exclamation mark, or you can run the code written below in the cell of the Jupiter Notebook. We advise you not to load the report with code and use TemplateExporter.exclude_input = True parameter so that the cells with the code are not exported. Also, when you run this cell in your notebook, the code produces a standard output, and you need to write %% capture at the beginning of the cell not to export it.

%% capture
! jupyter nbconvert --to pdf --TemplateExporter.exclude_input = True ~ / Desktop / VALIOTTI / Reports / Sample \ LaTeX \ Report.ipynb
! open ~ / Desktop / VALIOTTI / Reports / Sample \ LaTeX \ Report.pdf

If you did everything correctly and methodically, then you will end up with a report similar to this one!
Present your data nicely :)

 No comments    3042   2021   data analytics   python
Earlier Ctrl + ↓