{
    "version": "https:\/\/jsonfeed.org\/version\/1",
    "title": "LEFT JOIN: blog on analytics, visualisation & data science, posts tagged: instagram",
    "home_page_url": "https:\/\/en.leftjoin.ru\/tags\/instagram\/",
    "feed_url": "https:\/\/en.leftjoin.ru\/tags\/instagram\/json\/",
    "icon": "https:\/\/en.leftjoin.ru\/user\/userpic@2x.jpg",
    "author": {
        "name": "Nikolay Valiotti",
        "url": "https:\/\/en.leftjoin.ru\/",
        "avatar": "https:\/\/en.leftjoin.ru\/user\/userpic@2x.jpg"
    },
    "items": [
        {
            "id": "46",
            "url": "https:\/\/en.leftjoin.ru\/all\/collecting-social-media-data-for-top-ml-ai-data-science-related\/",
            "title": "Collecting Social Media Data for Top ML, AI & Data Science related accounts on Instagram",
            "content_html": "<p>Instagram is one of the top 5 most visited websites, though perhaps not among people in our industry. We are going to test this hypothesis using Python and our data analytics skills. In this post, we will share how to collect social media data using the Instagram API.<\/p>\n<p><b>Data collection method<\/b><br \/>\nThe Instagram API won’t let us collect data about other users of the platform just like that, but there is always a way. Try sending the following request:<\/p>\n<pre class=\"e2-text-code\"><code>https:\/\/instagram.com\/leftjoin\/?__a=1<\/code><\/pre><p>The request returns a JSON object with detailed user information. For instance, we can easily get the account name, the number of posts, followers, and followed accounts, as well as the first ten user posts with their like counts, comments, and so on. The <a href=\"https:\/\/github.com\/OlegYurchik\/pyInstagram\">pyInstagram<\/a> library allows sending such requests.<\/p>\n<p><b>SQL schema<\/b><br \/>\nData will be collected into three ClickHouse tables: users, posts, and comments. 
The users table will contain user data, such as the user id, username, first and last name, account description, number of followers, followed accounts, posts, comments, and likes, whether the account is verified, and so on.<\/p>\n<pre class=\"e2-text-code\"><code>CREATE TABLE instagram.users\r\n(\r\n    `added_at` DateTime,\r\n    `user_id` UInt64,\r\n    `user_name` String,\r\n    `full_name` String,\r\n    `base_url` String,\r\n    `biography` String,\r\n    `followers_count` UInt64,\r\n    `follows_count` UInt64,\r\n    `media_count` UInt64,\r\n    `total_comments` UInt64,\r\n    `total_likes` UInt64,\r\n    `is_verified` UInt8,\r\n    `country_block` UInt8,\r\n    `profile_pic_url` Nullable(String),\r\n    `profile_pic_url_hd` Nullable(String),\r\n    `fb_page` Nullable(String)\r\n)\r\nENGINE = ReplacingMergeTree\r\nORDER BY added_at<\/code><\/pre><p>The posts table will be populated with the post owner name, post id, caption, comments count, and so on. To check whether a post is an advertisement, an Instagram carousel, or a video, we can use these fields: <span class=\"inline-code\">is_ad<\/span>, <span class=\"inline-code\">is_album<\/span> and <span class=\"inline-code\">is_video<\/span>.<\/p>\n<pre class=\"e2-text-code\"><code>CREATE TABLE instagram.posts\r\n(\r\n    `added_at` DateTime,\r\n    `owner` String,\r\n    `post_id` UInt64,\r\n    `caption` Nullable(String),\r\n    `code` String,\r\n    `comments_count` UInt64,\r\n    `comments_disabled` UInt8,\r\n    `created_at` DateTime,\r\n    `display_url` String,\r\n    `is_ad` UInt8,\r\n    `is_album` UInt8,\r\n    `is_video` UInt8,\r\n    `likes_count` UInt64,\r\n    `location` Nullable(String),\r\n    `recources` Array(String),\r\n    `video_url` Nullable(String)\r\n)\r\nENGINE = ReplacingMergeTree\r\nORDER BY added_at<\/code><\/pre><p>In the comments table, we store each comment separately with the comment owner and text.<\/p>\n<pre class=\"e2-text-code\"><code>CREATE TABLE 
instagram.comments\r\n(\r\n    `added_at` DateTime,\r\n    `comment_id` UInt64,\r\n    `post_id` UInt64,\r\n    `comment_owner` String,\r\n    `comment_text` String\r\n)\r\nENGINE = ReplacingMergeTree\r\nORDER BY added_at<\/code><\/pre><p><b>Writing the script<\/b><br \/>\nImport the following classes from the library: <span class=\"inline-code\">Account<\/span>, <span class=\"inline-code\">Media<\/span>, <span class=\"inline-code\">WebAgent<\/span> and <span class=\"inline-code\">Comment<\/span>.<\/p>\n<pre class=\"e2-text-code\"><code>from instagram import Account, Media, WebAgent, Comment\r\nfrom datetime import datetime\r\nfrom clickhouse_driver import Client\r\nimport requests\r\nimport pandas as pd<\/code><\/pre><p>Next, create an instance of the <span class=\"inline-code\">WebAgent<\/span> class, which is required by some library methods and for updating data. To collect any meaningful information, we need at least the account names. Since we don’t have them yet, we will search for profiles by the keywords specified in queries_list. The search results will consist of Instagram pages that match any keyword in the list.<\/p>\n<pre class=\"e2-text-code\"><code>agent = WebAgent()\r\nqueries_list = ['machine learning', 'data science', 'data analytics', 'analytics', 'business intelligence',\r\n                'data engineering', 'computer science', 'big data', 'artificial intelligence',\r\n                'deep learning', 'data scientist','machine learning engineer', 'data engineer']\r\nclient = Client(host='12.34.56.789', user='default', password='', port='9000', database='instagram')\r\nurl = 'https:\/\/www.instagram.com\/web\/search\/topsearch\/?context=user&amp;count=0'<\/code><\/pre><p>Let’s iterate over the keywords, collecting all matching accounts. 
Then remove duplicates from the obtained list by converting it to a set and back.<\/p>\n<pre class=\"e2-text-code\"><code>response_list = []\r\nfor query in queries_list:\r\n    response = requests.get(url, params={\r\n        'query': query\r\n    }).json()\r\n    response_list.extend(response['users'])\r\ninstagram_pages_list = []\r\nfor item in response_list:\r\n    instagram_pages_list.append(item['user']['username'])\r\ninstagram_pages_list = list(set(instagram_pages_list))<\/code><\/pre><p>Now we need to loop through the list of pages and request detailed information about an account if it’s not in the table yet. Create an instance of the <span class=\"inline-code\">Account<\/span> class and pass the username as a parameter.<br \/>\nThen update the account information using the <span class=\"inline-code\">agent.update()<\/span> method. We will collect only the first 100 posts to keep things moving. Next, create a list named <span class=\"inline-code\">media_list<\/span> to store the post ids returned by the <span class=\"inline-code\">agent.get_media()<\/span> method.<\/p>\n<p><details><br \/>\n<summary><span style=\"color:#7ea9b8\">Collecting user media data<\/span><\/summary><\/p>\n<pre class=\"e2-text-code\"><code>all_posts_list = []\r\nusername_count = 0\r\nfor username in instagram_pages_list:\r\n    if client.execute(f&quot;SELECT count(1) FROM users WHERE user_name='{username}'&quot;)[0][0] == 0:\r\n        print('username:', username_count, '\/', len(instagram_pages_list))\r\n        username_count += 1\r\n        account_total_likes = 0\r\n        account_total_comments = 0\r\n        try:\r\n            account = Account(username)\r\n        except Exception as E:\r\n            print(E)\r\n            continue\r\n        try:\r\n            agent.update(account)\r\n        except Exception as E:\r\n            print(E)\r\n            continue\r\n        if account.media_count &lt; 100:\r\n            post_count = account.media_count\r\n        else:\r\n            post_count = 100\r\n        print(account, 
post_count)\r\n        media_list, _ = agent.get_media(account, count=post_count, delay=1)\r\n        count = 0<\/code><\/pre><p><\/details><\/p>\n<p>Because we need to count the total number of likes and comments  before adding a new user to our database, we’ll start with them first. Almost all required fields belong to the <span class=\"inline-code\">Media<\/span> class:<\/p>\n<p><details><br \/>\n<summary><span style=\"color:#7ea9b8\">Collecting user posts<\/span><\/summary><\/p>\n<pre class=\"e2-text-code\"><code>for media_code in media_list:\r\n            if client.execute(f&quot;SELECT count(1) FROM posts WHERE code='{media_code}'&quot;)[0][0] == 0:\r\n                print('posts:', count, '\/', len(media_list))\r\n                count += 1\r\n\r\n                post_insert_list = []\r\n                post = Media(media_code)\r\n                agent.update(post)\r\n                post_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\r\n                post_insert_list.append(str(post.owner))\r\n                post_insert_list.append(post.id)\r\n                if post.caption is not None:\r\n                    post_insert_list.append(post.caption.replace(&quot;'&quot;,&quot;&quot;).replace('&quot;', ''))\r\n                else:\r\n                    post_insert_list.append(&quot;&quot;)\r\n                post_insert_list.append(post.code)\r\n                post_insert_list.append(post.comments_count)\r\n                post_insert_list.append(int(post.comments_disabled))\r\n                post_insert_list.append(datetime.fromtimestamp(post.date).strftime('%Y-%m-%d %H:%M:%S'))\r\n                post_insert_list.append(post.display_url)\r\n                try:\r\n                    post_insert_list.append(int(post.is_ad))\r\n                except TypeError:\r\n                    post_insert_list.append('cast(Null as Nullable(UInt8))')\r\n                post_insert_list.append(int(post.is_album))\r\n                
post_insert_list.append(int(post.is_video))\r\n                post_insert_list.append(post.likes_count)\r\n                if post.location is not None:\r\n                    post_insert_list.append(post.location)\r\n                else:\r\n                    post_insert_list.append('')\r\n                post_insert_list.append(post.resources)\r\n                if post.video_url is not None:\r\n                    post_insert_list.append(post.video_url)\r\n                else:\r\n                    post_insert_list.append('')\r\n                account_total_likes += post.likes_count\r\n                account_total_comments += post.comments_count\r\n                try:\r\n                    client.execute(f'''\r\n                        INSERT INTO posts VALUES {tuple(post_insert_list)}\r\n                    ''')\r\n                except Exception as E:\r\n                    print('posts:')\r\n                    print(E)\r\n                    print(post_insert_list)<\/code><\/pre><p><\/details><\/p>\n<p>Store comments in the variable with the same name after calling the <span class=\"inline-code\">get_comments()<\/span> method:<br \/>\n<details><br \/>\n<summary><span style=\"color:#7ea9b8\">Collecting post comments<\/span><\/summary><\/p>\n<pre class=\"e2-text-code\"><code>comments = agent.get_comments(media=post)\r\n                for comment_id in comments[0]:\r\n                    comment_insert_list = []\r\n                    comment = Comment(comment_id)\r\n                    comment_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\r\n                    comment_insert_list.append(comment.id)\r\n                    comment_insert_list.append(post.id)\r\n                    comment_insert_list.append(str(comment.owner))\r\n                    comment_insert_list.append(comment.text.replace(&quot;'&quot;,&quot;&quot;).replace('&quot;', ''))\r\n                    try:\r\n                        client.execute(f'''\r\n         
                   INSERT INTO comments VALUES {tuple(comment_insert_list)}\r\n                        ''')\r\n                    except Exception as E:\r\n                        print('comments:')\r\n                        print(E)\r\n                        print(comment_insert_list)<\/code><\/pre><p><\/details><\/p>\n<p>Now that we have obtained the user’s posts and comments, the new information can be added to the users table.<br \/>\n<details><br \/>\n<summary><span style=\"color:#7ea9b8\">Collecting user data<\/span><\/summary><\/p>\n<pre class=\"e2-text-code\"><code>user_insert_list = []\r\n        user_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))\r\n        user_insert_list.append(account.id)\r\n        user_insert_list.append(account.username)\r\n        user_insert_list.append(account.full_name)\r\n        user_insert_list.append(account.base_url)\r\n        user_insert_list.append(account.biography)\r\n        user_insert_list.append(account.followers_count)\r\n        user_insert_list.append(account.follows_count)\r\n        user_insert_list.append(account.media_count)\r\n        user_insert_list.append(account_total_comments)\r\n        user_insert_list.append(account_total_likes)\r\n        user_insert_list.append(int(account.is_verified))\r\n        user_insert_list.append(int(account.country_block))\r\n        user_insert_list.append(account.profile_pic_url)\r\n        user_insert_list.append(account.profile_pic_url_hd)\r\n        if account.fb_page is not None:\r\n            user_insert_list.append(account.fb_page)\r\n        else:\r\n            user_insert_list.append('')\r\n        try:\r\n            client.execute(f'''\r\n                INSERT INTO users VALUES {tuple(user_insert_list)}\r\n            ''')\r\n        except Exception as E:\r\n            print('users:')\r\n            print(E)\r\n            print(user_insert_list)<\/code><\/pre><p><\/details><\/p>\n<p><b>Conclusion<\/b><br \/>\nTo sum up, we have collected data 
on 500 users, with nearly 20K posts and 40K comments. Since the database will keep being updated, a simple query returns the 10 most followed ML, AI &amp; Data Science accounts as of today.<\/p>\n<pre class=\"e2-text-code\"><code>SELECT *\r\nFROM users\r\nORDER BY followers_count DESC\r\nLIMIT 10<\/code><\/pre><p>And as a bonus, here is a list of the most interesting Instagram accounts on this topic:<\/p>\n<ol start=\"1\">\n<li><a href=\"https:\/\/www.instagram.com\/ai_machine_learning\/\">@ai_machine_learning<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/neuralnine\/\">@neuralnine<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/datascienceinfo\/\">@datascienceinfo<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/compscistuff\/\">@compscistuff<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/computersciencelife\/\">@computersciencelife<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/welcome.ai\/\">@welcome.ai<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/papa_programmer\/\">@papa_programmer<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/data_science_learn\/\">@data_science_learn<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/neuralnet.ai\/\">@neuralnet.ai<\/a><\/li>\n<li><a href=\"https:\/\/www.instagram.com\/techno_thinkers\/\">@techno_thinkers<\/a><\/li>\n<\/ol>\n<p><i>View the code on <a href=\"https:\/\/github.com\/valiotti\/leftjoin\/tree\/master\/instagram\">GitHub<\/a><\/i><\/p>\n",
            "date_published": "2020-09-30T16:06:11+03:00",
            "date_modified": "2020-09-30T16:13:40+03:00",
            "_date_published_rfc2822": "Wed, 30 Sep 2020 16:06:11 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "46",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": []
            }
        }
    ],
    "_e2_version": 3386,
    "_e2_ua_string": "E2 (v3386; Aegea)"
}