Is this a reasonable project topic, and is the implementation plan sound?

1 vote
I have been assigned a project, and naturally I have to choose the topic myself. I am far from mathematics, statistics, and big data, but I had an idea: find out what interests the active subscribers of group X on VKontakte have.
What is an active subscriber? Someone who likes posts. (This immediately raises the question of how to measure activity. By the ratio of a user's likes to the number of posts? If so, what threshold should I set? I think these questions will be resolved after the data is collected.)
Data collection method
We take the original group, parse its last N posts, and get the list of users who liked them. Then we get the list of groups those users belong to, parse those groups' posts, get the list of users who liked those posts, and so on. This can go on indefinitely, but resources are limited.
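The expansion described above is essentially a breadth-first crawl with a budget. A minimal sketch, assuming two hypothetical callables supplied by the caller (`likers_of` returns the users who liked a group's recent posts, `groups_of` returns the groups a user belongs to); neither is a real VKontakte API call, and `max_groups` stands in for the "resources are limited" cut-off:

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seed_group: str,
          likers_of: Callable[[str], Iterable[str]],
          groups_of: Callable[[str], Iterable[str]],
          max_groups: int) -> dict[str, set[str]]:
    """Breadth-first expansion: group -> likers -> their groups -> ...

    Stops once `max_groups` groups have been processed.
    Returns a mapping group -> set of users who liked its posts.
    """
    likers_by_group: dict[str, set[str]] = {}
    queue = deque([seed_group])
    queued = {seed_group}  # groups already scheduled, to avoid repeats
    while queue and len(likers_by_group) < max_groups:
        group = queue.popleft()
        users = set(likers_of(group))
        likers_by_group[group] = users
        for user in sorted(users):  # sorted only to make the order reproducible
            for other in groups_of(user):
                if other not in queued:
                    queued.add(other)
                    queue.append(other)
    return likers_by_group


# Toy data standing in for real API responses (hypothetical).
LIKES = {"X": ["u1", "u2"], "A": ["u1"], "B": ["u2", "u3"], "C": ["u3"]}
MEMBERSHIP = {"u1": ["X", "A"], "u2": ["X", "B"], "u3": ["B", "C"]}

result = crawl("X",
               lambda g: LIKES.get(g, []),
               lambda u: MEMBERSHIP.get(u, []),
               max_groups=3)
print(sorted(result))
```

With the budget set to 3, the crawl processes X, A and B and never reaches C, which is the kind of cut-off a limited time budget forces.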
I have already collected 5,000 posts from one group and fetched their likes (it took 2.75 hours). Surprisingly, the total number of likes is 22 * 10^6, and the number of unique users is 9 * 10^5.
From this I can build a graph of interests: for each pair of groups, count the active users they have in common, and use that count as the edge weight. Then I can label the groups manually with a topic and, from these two things, conclude what interests the active subscribers of group X have.
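The edge weights of such a graph can be computed by inverting the data to "user -> groups" and counting every pair of groups a user appears in. A minimal sketch, assuming the collected data is already available as a dictionary from group name to the set of its active users (the names and the toy data are illustrative, not from the real collection):

```python
from collections import Counter
from itertools import combinations


def shared_user_weights(active_by_group: dict[str, set[str]]) -> Counter:
    """Edge weight for a pair of groups = number of active users they share."""
    # Invert group -> users into user -> groups.
    groups_by_user: dict[str, set[str]] = {}
    for group, users in active_by_group.items():
        for user in users:
            groups_by_user.setdefault(user, set()).add(group)

    # Every pair of groups a user belongs to gets +1.
    weights: Counter = Counter()
    for groups in groups_by_user.values():
        for a, b in combinations(sorted(groups), 2):
            weights[(a, b)] += 1
    return weights


# Toy example (hypothetical groups and users).
active_by_group = {
    "X": {"u1", "u2", "u3"},
    "music": {"u1", "u2"},
    "games": {"u2", "u3"},
}
weights = shared_user_weights(active_by_group)
for pair, weight in weights.most_common():
    print(pair, weight)
```

Each pair is stored once with its group names in sorted order, so the graph is undirected. Counting per user avoids intersecting every pair of groups, but a user who belongs to very many groups still produces a quadratic number of pairs.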
What do you think of the idea? Is this a sound piece of research?


These are questions for the person who set the task. Or at least you need to know the level of the project: some kind of research paper, lab work, coursework, ECR? For the last of these it is clearly too weak.
d-sem There is no information about the level of the project; it is a school project (grades 10-11).
In general, I think I can work something out given plenty of time (although maybe I overestimate myself and this kind of research is much harder than I think; probably so, since I have not studied statistics or similar subjects). What I don't know is whether the study and the research method are sound. I'm afraid I may have made a mistake somewhere that will lead me to the wrong conclusions.

1 Answer

2 votes
In my opinion you are doing a lot of unnecessary work. Your goal is to find out what interests the active subscribers of group X have. The first problem is finding the active subscribers, and that is where you need to focus: decide exactly which criteria define activity. That gives you a set of active users. Then parse those users, not other groups (the Pareto principle will help you here). You want to know their interests, so there is no point in pairing up other groups. The remaining problem is how to determine a group's topic. Manually? Then it is easier to just collect the groups of the users you examined, sort that list from most to least common, and assign the topics yourself (remember the Pareto principle).


Forever Extreme, that sounds plausible. But how would I analyze inactive members, for example? It is hard to tell whether they are subscribed because they are interested, or are not interested but too lazy to unsubscribe. That is probably an edge case, though, and there are few of them. By the same logic, I could take the users, get their groups, take the groups with the most members in common, label them, and get the main interests. But what if there are a lot of users, say 2 million? (Which 400'000 subscribers should I take to make the sample representative? Should I select them at random? I could, of course, check everyone, but that would take a very long time. I got the list of groups of the active members and noticed that very few groups have >10'000 members. With each additional member checked, I get fewer and fewer new unique groups. This, by the way, is what Pareto's law says.) I could also look at each user's most recent activity right away: if the user has not logged in for a long time, there is no need to check them.
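One hedged way to approach the sample-size question is the textbook formula for estimating a single proportion (for example, "what share of subscribers is also in group Y") with a finite population correction. This is a standard statistics technique brought in for illustration, not something proposed in the thread. It assumes simple random sampling, and the 95% confidence level (`z = 1.96`), the 5-point margin of error, and the worst-case proportion `p = 0.5` are assumptions:

```python
import math
import random


def sample_size(population: int, margin: float = 0.05,
                z: float = 1.96, p: float = 0.5) -> int:
    """Sample size for estimating one proportion to within +/- `margin`.

    Uses n0 = z^2 * p * (1 - p) / margin^2, then applies the
    finite population correction n = n0 / (1 + (n0 - 1) / N).
    """
    n0 = z * z * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def draw_sample(user_ids, size: int, seed: int = 0) -> list:
    """Simple random sample without replacement (seeded for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(list(user_ids), size)


needed = sample_size(2_000_000)
print(needed)  # 385
sample = draw_sample(range(2_000_000), needed)
print(len(sample))
```

Under these assumptions a few hundred randomly chosen subscribers are enough for a rough estimate of one proportion, far fewer than 400'000. The figure grows quickly if a tighter margin is needed or if many rare groups have to be compared, and it says nothing about bias from closed profiles.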


Could you elaborate on that?
Akina, I took the last 2,500 posts (about six months' worth). I think that is representative enough. I should have specified: an average of N posts out of every 100.
kerosin228 That's great. Let's take it one step at a time.
1) Discard all users who liked fewer than N posts per 100.
2) Sort the remaining users in descending order by number of likes.
3) Take from their profiles the groups these users belong to (for at least the top 20% of users, starting with the most active).
4) You now have the active users and the list of groups each of them belongs to.
5) Build a summary table of groups and users. You get a list of groups headed by group Y, the one with the largest number of YOUR active subscribers.
6) Analyze the topics of at least the first 20% of this list of groups to get the interests.
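The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: `likes_by_user` (likes per user over the examined posts) and the `groups_of` callable are hypothetical inputs that the data collection would have to provide, and labeling the topics in step 6 is still done by hand:

```python
import math
from collections import Counter
from typing import Callable, Iterable


def rank_groups(likes_by_user: dict[str, int],
                posts_total: int,
                min_likes_per_100: float,
                groups_of: Callable[[str], Iterable[str]],
                top_share: float = 0.2) -> list[tuple[str, int]]:
    """Rank groups by how many of the most active users belong to them."""
    # 1) Discard users below N likes per 100 posts.
    threshold = min_likes_per_100 * posts_total / 100
    active = {u: n for u, n in likes_by_user.items() if n >= threshold}
    # 2) Sort the remaining users by number of likes, descending.
    ranked = sorted(active, key=active.get, reverse=True)
    # 3) Keep at least the top share of them (rounded up).
    keep = math.ceil(len(ranked) * top_share)
    # 4-5) Count, for every group, how many kept users belong to it.
    counts: Counter = Counter()
    for user in ranked[:keep]:
        counts.update(sorted(set(groups_of(user))))
    # The head of this list is what gets labeled by topic in step 6.
    return counts.most_common()


# Toy data (hypothetical): 200 posts examined, N = 10 likes per 100 posts.
likes_by_user = {"u1": 150, "u2": 90, "u3": 40, "u4": 25, "u5": 22, "u6": 5}
membership = {"u1": ["music", "games"], "u2": ["music", "memes"]}

ranking = rank_groups(likes_by_user, posts_total=200, min_likes_per_100=10,
                      groups_of=lambda u: membership.get(u, []),
                      top_share=0.4)
print(ranking)
```

In the toy run, u6 falls below the threshold, the top 40% of the five remaining users are u1 and u2, and "music" comes out first because both of them belong to it.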
P.S. There will be some error in any case. You may run into closed profiles, so you will have to use only open ones. But as a rule, the most active users have open profiles. It is a lot of work, but there are keys to link everything on: user IDs and group IDs, plus the number of likes. If I'm not mistaken, that is third normal form.
Alternatively, you could go all in: parse all the groups of all YOUR active users, find out in which groups besides yours these users are most active, and then build the user -> active groups mapping. Then analyze that data by topic. But I think the results will not differ much from the first approach.
P.S. I'm doing about the same thing right now, only I don't need to parse anything, but my data is unstructured :(
Forever Extreme, this is how I determine active subscribers: if a user likes more than N posts out of 100, they are active.
Weeding out bots is an interesting task. We need to think about which traits distinguish a bot from a person.
What attributes do we have? First and last name, date of birth (my target audience is teenagers; I roughly understand what their interests are, but that has to be proven, and since VKontakte allows registration only from age 14, the stated age is wrong in most cases), country, city, number of friends, links to other social networks, number of followers, interests, favorite music and movies, political views (also rarely filled in, but usually correct when they are), and gender (usually correct). And how do I use this data to identify interests?

By the way, why did I want to parse other groups? I assume that an active user likes the posts that match their interests, so we would find out which groups they find interesting. Of course, most users behave passively (just compare the number of likes with the number of views), but they can be analyzed too. Then again, how do I analyze a profile when it contains so little information?
Even better would be to identify the unique users in the entire pool of subscribers and weed out the bots.
Then measure the activity of the resulting list: how many likes and comments each user has (here again we get Pareto), and only then find out their interests. If you parse other groups for their active users, you will get a user activity rating rather than their interests.
If a user has liked more than N posts out of 100, he is active.
Well, well... A user drops in, likes 80% of the posts without really knowing what they are about, and vanishes into the fog for another week. And that counts as active...

You need to monitor the dynamics: take a snapshot at least daily (say, every day at 12:00), or even more often, over a fairly long period, say a month. Only with data like that can you talk about activity.
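A minimal sketch of what such monitoring could look like, assuming one snapshot per day stored as a dictionary from user to cumulative like count (the snapshot format, the function names, and the `min_days` cut-off are illustrative assumptions). Comparing consecutive snapshots shows on how many days each user actually added likes, which separates a one-off burst from regular activity:

```python
def days_with_new_likes(snapshots: list[dict[str, int]]) -> dict[str, int]:
    """Count, per user, the days on which their cumulative likes increased.

    `snapshots` is a chronological list, one {user: cumulative likes}
    dictionary per day.
    """
    days: dict[str, int] = {}
    for prev, curr in zip(snapshots, snapshots[1:]):
        for user, total in curr.items():
            if total > prev.get(user, 0):
                days[user] = days.get(user, 0) + 1
    return days


def regular_users(days: dict[str, int], min_days: int) -> list[str]:
    """Users who added likes on at least `min_days` separate days."""
    return sorted(user for user, count in days.items() if count >= min_days)


# Toy snapshots (hypothetical): "burst" likes 80 posts in one visit,
# "steady" likes a few posts every day.
snapshots = [
    {"burst": 0, "steady": 0},
    {"burst": 80, "steady": 3},
    {"burst": 80, "steady": 5},
    {"burst": 80, "steady": 9},
]
days = days_with_new_likes(snapshots)
print(days)
print(regular_users(days, min_days=3))
```

Here "burst" has far more likes in total but was active on only one day, while "steady" was active on all three, so a threshold on active days ranks them the opposite way from a threshold on total likes.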