<?xml version="1.0" encoding="utf-8"?> 
<rss version="2.0">

<channel>

<title>LEFT JOIN: blog on analytics, visualisation &amp; data science, posts tagged: pandas-profiling</title>
<link>https://en.leftjoin.ru/tags/pandas-profiling/</link>
<description></description>
<generator>E2 (v3386; Aegea)</generator>

<item>
<title>Pandas Profiling in action: reviewing a new EDA library on the Superstore Sales dataset</title>
<guid isPermaLink="false">39</guid>
<link>https://en.leftjoin.ru/all/pandas-profiling-in-action/</link>
<comments>https://en.leftjoin.ru/all/pandas-profiling-in-action/</comments>
<description>
&lt;p&gt;Before moving on to data analysis, we need to understand what kind of data we are going to work with. In today’s material, we will take a closer look at the SuperStore Sales dataset, specifically at the &lt;i&gt;Orders&lt;/i&gt; sheet. It contains customer shopping data from a Canadian online supermarket: order, product and customer ids, shipping type, prices, product categories, names, etc. You can find the dataset on &lt;a href="https://github.com/PacktPublishing/Tableau-10-Best-Practices/blob/master/Chapter%205/Sample%20-%20Superstore%20Sales%20(Excel).xls"&gt;GitHub&lt;/a&gt;. After creating a pandas DataFrame, we can simply use the &lt;span class="inline-code"&gt;describe()&lt;/span&gt; method to get a sense of our data.&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv('superstore_sales_orders.csv', decimal=',')
df.describe(include='all')&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The result is often hard to read:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/1-15.png" width="984" height="427" alt="" /&gt;
&lt;/div&gt;
&lt;p class="note"&gt;The source code of the pandas-profiling library is available on &lt;a href="https://github.com/pandas-profiling/pandas-profiling"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If we spend some time studying this descriptive table, we can find out that “Regular air” is the most common shipping type and that more orders came from Ontario than from anywhere else. Nevertheless, there is a better tool for describing the dataset in more detail: the pandas-profiling library. Just pass a DataFrame to it and it will generate an HTML page with a detailed description of the dataset:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;import pandas_profiling

# Build the report from the DataFrame loaded above, then save it as HTML
profile = pandas_profiling.ProfileReport(df)
profile.to_file(&amp;quot;output.html&amp;quot;)&lt;/code&gt;&lt;/pre&gt;&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/2-15.png" width="973" height="621" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;As you can see, it produced a page with 6 sections: overview, variables, interactions, correlations, missing values, and samples.&lt;/p&gt;
&lt;p class="note"&gt;View a full version of the &lt;a href="http://leftjoin.ru/files/superstore.html"&gt;Pandas Profiling Report&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Data overview&lt;/h2&gt;
&lt;p&gt;Let’s start with the first section, “Overview”. It reports the following stats: number of variables, number of observations, missing cells, duplicate rows, and total size in memory. The &lt;span class="inline-code"&gt;Variable types&lt;/span&gt; column shows that our DataFrame consists of 12 categorical and 9 numerical variables.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/3-14.png" width="737" height="356" alt="" /&gt;
&lt;/div&gt;
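&lt;p&gt;To get a feel for where these numbers come from, here is a rough sketch that computes similar figures in plain pandas. The toy DataFrame and the &lt;span class="inline-code"&gt;overview&lt;/span&gt; dictionary are our own illustration, not the real Superstore data and not how the library works internally:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy data standing in for the Superstore orders table
df = pd.DataFrame({
    'Order ID': [1, 2, 2, 3],
    'Province': ['Ontario', 'Alberta', 'Alberta', None],
    'Sales': [100.0, 250.5, 250.5, 80.0],
})

# Figures similar to those in the Overview section
overview = {
    'variables': df.shape[1],
    'observations': df.shape[0],
    'missing_cells': int(df.isna().sum().sum()),
    'duplicate_rows': int(df.duplicated().sum()),
    'memory_bytes': int(df.memory_usage(deep=True).sum()),
}
print(overview)
```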
&lt;p&gt;The “Reproduction” subsection stores technical information: how long the analysis took, the installed version of the library, configuration info, etc.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/4-12.png" width="725" height="293" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;The “Warnings” subsection reports possible issues in the dataset. Here, it warns us that the “Order Date” column has too many distinct values.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/5_5.png" width="712" height="485" alt="" /&gt;
&lt;/div&gt;
&lt;h2&gt;Variables&lt;/h2&gt;
&lt;p&gt;This section contains a detailed description of each variable, including the number of duplicate and missing values, memory size, and minimum and maximum values. Right next to the stats you can see the distribution of the column’s values.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/6_6.png" width="722" height="282" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;Click &lt;span class="inline-code"&gt;Toggle details&lt;/span&gt; to see more information: quartiles, the median, and other useful descriptive statistics. The remaining tabs contain the histogram shown on the main screen, the 10 most frequent values, and the extreme values.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/7_7.png" width="737" height="404" alt="" /&gt;
&lt;/div&gt;
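&lt;p&gt;If you want to check a single column by hand, the same kind of details can be approximated with a few pandas calls. This is only a sketch: the &lt;span class="inline-code"&gt;sales&lt;/span&gt; values below are made up for illustration:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical values standing in for a numeric column such as Sales
sales = pd.Series([10.0, 20.0, 20.0, 30.0, 40.0, 500.0], name='Sales')

# Quartiles and the median, as in the expanded statistics view
quartiles = sales.quantile([0.25, 0.5, 0.75])

# Most frequent values (the report lists the top 10)
top_values = sales.value_counts().head(10)

# Extremes: the smallest and largest observations
lowest = sales.nsmallest(3)
highest = sales.nlargest(3)

print(quartiles, top_values, lowest, highest, sep='\n\n')
```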
&lt;h2&gt;Interactions&lt;/h2&gt;
&lt;p&gt;This section shows how pairs of variables relate to each other on a hexbin plot. The graph is not easy to read, since it has no legend.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/8_8.png" width="716" height="548" alt="" /&gt;
&lt;/div&gt;
&lt;h2&gt;Correlations&lt;/h2&gt;
&lt;p&gt;This section shows correlations between variables, calculated in several ways. For example, the first tab shows Pearson’s r. We can see that &lt;span class="inline-code"&gt;Profit&lt;/span&gt; is positively correlated with &lt;span class="inline-code"&gt;Sales&lt;/span&gt;. You can get a detailed explanation of each coefficient by clicking on the &lt;span class="inline-code"&gt;Toggle correlation descriptions&lt;/span&gt; button.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/9_9.png" width="739" height="554" alt="" /&gt;
&lt;/div&gt;
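&lt;p&gt;For comparison, a correlation matrix can also be computed directly in pandas with &lt;span class="inline-code"&gt;DataFrame.corr()&lt;/span&gt;. The numbers below are invented for illustration, and Spearman’s rank correlation is shown only as one example of an alternative measure:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical numbers; the real dataset has many more rows and columns
df = pd.DataFrame({
    'Sales': [100.0, 200.0, 300.0, 400.0, 500.0],
    'Profit': [12.0, 18.0, 35.0, 38.0, 60.0],
    'Discount': [0.10, 0.08, 0.05, 0.06, 0.01],
})

# Pearson measures linear association, Spearman works on ranks
pearson = df.corr(method='pearson')
spearman = df.corr(method='spearman')

print(pearson.round(2))
print(spearman.round(2))
```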
&lt;h2&gt;Missing values&lt;/h2&gt;
&lt;p&gt;This section includes a bar chart, a matrix, and a dendrogram that show how completely each variable is filled in. For instance, the &lt;span class="inline-code"&gt;Product Base Margin&lt;/span&gt; column is missing three values.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/10_10.png" width="739" height="411" alt="" /&gt;
&lt;/div&gt;
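&lt;p&gt;The counts behind such a chart are easy to reproduce in pandas. Here is a minimal sketch on a made-up DataFrame (the values are hypothetical, only the column names mirror the dataset):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy data with gaps in one column
df = pd.DataFrame({
    'Sales': [100.0, 200.0, 300.0, 400.0],
    'Product Base Margin': [0.5, None, 0.4, None],
})

# Number of missing and present values per column
missing_per_column = df.isna().sum()
present_per_column = df.notna().sum()

# Rows where the margin is missing, for a closer look
rows_with_missing = df[df['Product Base Margin'].isna()]

print(missing_per_column)
print(rows_with_missing)
```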
&lt;h2&gt;Samples&lt;/h2&gt;
&lt;p&gt;The final section shows the first and last 10 rows of the dataset, much like the &lt;span class="inline-code"&gt;head()&lt;/span&gt; and &lt;span class="inline-code"&gt;tail()&lt;/span&gt; methods in pandas.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/11_11.png" width="661" height="469" alt="" /&gt;
&lt;/div&gt;
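&lt;p&gt;The pandas equivalent is a one-liner in each direction. A tiny sketch on a hypothetical 25-row DataFrame:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical table with 25 rows
df = pd.DataFrame({'Row ID': range(1, 26)})

# First and last 10 rows, as in the Samples section
first_rows = df.head(10)
last_rows = df.tail(10)

print(first_rows)
print(last_rows)
```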
&lt;h2&gt;Key Takeaways&lt;/h2&gt;
&lt;p&gt;The library is definitely more focused on statistics than pandas: you get useful descriptive stats for each variable and can see how the variables correlate. It provides a comprehensive report on a dataset in a user-friendly way, which makes it easy to carry out an initial investigation and get a sense of the data.&lt;br /&gt;
Still, the library has its shortcomings. If your dataset is fairly large, report generation may take up to several hours. It’s a great tool for automating EDA tasks; however, it can’t do all the work for you, and some details may be overlooked. If you are just getting started with data analysis, we highly recommend starting with pandas. It will solidify your knowledge and boost your confidence in working with data.&lt;/p&gt;
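&lt;p&gt;One common way to keep generation time manageable on a large dataset is to profile a random sample instead of the whole table. The sketch below assumes a hypothetical &lt;span class="inline-code"&gt;big_df&lt;/span&gt; and an arbitrary sample size of 5,000 rows. Keep in mind that a sample can hide rare values:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical large table; replace with your own DataFrame
big_df = pd.DataFrame({
    'Order ID': range(100_000),
    'Sales': [float(i % 500) for i in range(100_000)],
})

# Draw a reproducible random sample of rows (size chosen arbitrarily)
sample_df = big_df.sample(n=5_000, random_state=42)

# sample_df can then be passed to ProfileReport instead of big_df
print(sample_df.shape)
```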
</description>
<pubDate>Fri, 18 Sep 2020 15:37:40 +0300</pubDate>
</item>


</channel>
</rss>