Link Prediction in a Social Network
My approach for Facebook’s user recommendation contest
Table of Contents
- Introduction
- Some intriguing insights on social networks
- Understanding the data
- Data Visualization
- Data Preparation
- Feature Engineering
- Model building
- Evaluation
- Conclusion
1. Introduction
A couple of years ago, Facebook launched a link prediction contest on Kaggle, with the goal of recommending missing edges in a social network graph. Social networks focus on building social relations among users who share common interests, backgrounds, real-life connections, etc. People may or may not want to maximize their social influence; for example, business page owners on Instagram want to influence as many people as possible for commercial advantage. The network also evolves over time: new users join, users add friends, new connections form between old users, and so on. Based on the current network, we want to be able to predict the upcoming changes in the network and make recommendations accordingly.
Facebook provided a snapshot of its social network at a point in time (say ‘t’), and based on it we need to predict the future possible links. In this blog, I will be sharing my approach to solving this case study.
2. Some intriguing insights on social networks
What does a social network look like? I wanted to play around with the data first just to get a rough feel of what I was working with, so I used an app called SocNetV to interact with the network.
Here’s a sample of a small network.
The node in black is the selected node from the training set, connected to 3 other nodes in the network whose local networks are unfolded. The nodes are sized according to their number of connections.
We can see that our selected node is friends with 3 other nodes (in yellow), all having fairly large networks. Note that the edges in this network are undirected, i.e., an edge between two nodes indicates that both follow each other.
Since the above image does not depict the distinction between a follower and a followee, let’s take a directed graph. In this example, we can see the followers and followees of a particular user.
The selected user is the black node, its friends are the red nodes (users that follow and are followed back by the selected user), its followees are pink, and its followers are green.
Here’s another network.
While the previous user had many only-followers (green nodes), this one has none. This can be considered a node-level feature indicating that the user is follower-hungry.
And here’s another user who has nothing but followers(maybe a celebrity).
The Lone Wolf
Let’s take a look at another user, whose network is small.
The Social Butterfly
And one whose network is a bit larger.
3. Understanding the Data
After downloading the data, it looks something like this:
#reading the train data
import pandas as pd

df = pd.read_csv('train.csv')
df.head()
Let’s check the data for missing values and duplicates.
sum(df.isna().any(axis=1))   # 0
sum(df.duplicated())         # 0
Seems like there was no missing data or any duplicates.
- So from the data, we can infer that it is a directed graph in which each row gives two nodes: a source and a destination node. I tried to visualize the given network using the NetworkX Python library. NetworkX is a tool for creating, manipulating, and studying the structure of complex networks.
#saving the train data after removing the headers and indexes
df.to_csv('data/train_woheader.csv', header=False, index=False)

#reading the list of edges into a graph
import networkx as nx

g = nx.read_edgelist('data/train_woheader.csv', delimiter=',',
                     create_using=nx.DiGraph(), nodetype=int)

#printing the information of the graph
print(nx.info(g))
Name:
Type: DiGraph
Number of nodes: 1862220
Number of edges: 9437519
Average in degree: 5.0679
Average out degree: 5.0679
So, the total number of unique nodes in this network is 1862220 and the total number of edges is 9437519. Note that we are provided with edges/links, not standalone nodes.
Therefore, the total number of possible edges/links/connections in this network could be C(1862220, 2). Out of these, we are provided with only 9437519. The remaining C(1862220, 2) − 9437519 edges do not exist in the network.
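To get a feel for the scale, here is a quick back-of-the-envelope calculation (illustrative only; what matters is the ratio, not the exact digits):
#rough arithmetic on the scale of the problem
from math import comb

n_nodes = 1862220
n_edges = 9437519

possible_pairs = comb(n_nodes, 2)   #number of unordered node pairs, roughly 1.7 trillion
print(n_edges / possible_pairs)     #~5e-06, so existing links are a tiny fraction of all pairs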
What do we have to predict?
Well, friend recommendation can be framed as a binary classification problem: given a pair of users and a set of features, map the pair to ‘1’ if a link exists between them and ‘0’ otherwise.
So, let’s create an indicator variable for a link, which will be ‘1’ if there is a link between two nodes and ‘0’ if there is no link. However, our data contains only the links that exist in the given network, so the remaining C(1862220, 2) − 9437519 combinations of nodes have no links between them. We cannot take all of these remaining combinations, because that would give us an extremely imbalanced dataset. So, we will randomly sample 9437519 of them to get a balanced dataset. Therefore, the final size of the data will be 9437519 x 2 = 18875038.
So, our final data will have 9437519 rows with indicator variable ‘1’ and 9437519 rows with indicator variable ‘0’.
4. Data Visualization
I took a sample of 50 edges just to view the network using the NetworkX library.
#creating a sample of 50 data points
pd.read_csv('train.csv', nrows=50).to_csv('data/train_woheader_sample.csv', header=False, index=False)

#reading the edge list into a graph using NetworkX
import matplotlib.pyplot as plt

subgraph = nx.read_edgelist('data/train_woheader_sample.csv', delimiter=',',
                            create_using=nx.DiGraph(), nodetype=int)

#plotting the graph
pos = nx.spring_layout(subgraph)
nx.draw(subgraph, pos, node_color='#A0CBE2', edge_color='#00bb5e', width=1,
        edge_cmap=plt.cm.Blues, with_labels=True)
plt.savefig("graph_sample.pdf")
print(nx.info(subgraph))
Name:
Type: DiGraph
Number of nodes: 66
Number of edges: 50
Average in degree: 0.7576
Average out degree: 0.7576
Notice how the source nodes are placed towards the center of the graph and the destination nodes are placed outwards by the NetworkX layout.
Let’s have a look at some distributions.
Here’s the distribution of the number of followers of each node in the training set, as well as the number of followees of each node.
#Followers
indegree_dist = list(dict(g.in_degree()).values())
indegree_dist.sort()
plt.figure(figsize=(10,6))
plt.plot(indegree_dist)
plt.xlabel('Index No')
plt.ylabel('No Of Followers')
plt.show()
### 90-100 percentile
import numpy as np

for i in range(0, 11):
    print(90 + i, 'percentile value is', np.percentile(indegree_dist, 90 + i))
90 percentile value is 12.0
91 percentile value is 13.0
92 percentile value is 14.0
93 percentile value is 15.0
94 percentile value is 17.0
95 percentile value is 19.0
96 percentile value is 21.0
97 percentile value is 24.0
98 percentile value is 29.0
99 percentile value is 40.0
100 percentile value is 552.0
### 99-100 percentile
for i in range(10, 110, 10):
    print(99 + (i / 100), 'percentile value is', np.percentile(indegree_dist, 99 + (i / 100)))
99.1 percentile value is 42.0
99.2 percentile value is 44.0
99.3 percentile value is 47.0
99.4 percentile value is 50.0
99.5 percentile value is 55.0
99.6 percentile value is 61.0
99.7 percentile value is 70.0
99.8 percentile value is 84.0
99.9 percentile value is 112.0
100.0 percentile value is 552.0
Observations:
- 99% of the users in this network have 40 or fewer followers.
- The maximum number of followers for a user is 552.
#Followees
outdegree_dist = list(dict(g.out_degree()).values())
outdegree_dist.sort()
plt.figure(figsize=(10,6))
plt.plot(outdegree_dist)
plt.xlabel('Index No')
plt.ylabel('No Of Followee')
plt.show()
Similarly, here are the 90–100 and 99–100 percentiles for followees:
### 90-100 percentile
for i in range(0, 11):
    print(90 + i, 'percentile value is', np.percentile(outdegree_dist, 90 + i))
90 percentile value is 12.0
91 percentile value is 13.0
92 percentile value is 14.0
93 percentile value is 15.0
94 percentile value is 17.0
95 percentile value is 19.0
96 percentile value is 21.0
97 percentile value is 24.0
98 percentile value is 29.0
99 percentile value is 40.0
100 percentile value is 1566.0
### 99-100 percentile
for i in range(10, 110, 10):
    print(99 + (i / 100), 'percentile value is', np.percentile(outdegree_dist, 99 + (i / 100)))
99.1 percentile value is 42.0
99.2 percentile value is 45.0
99.3 percentile value is 48.0
99.4 percentile value is 52.0
99.5 percentile value is 56.0
99.6 percentile value is 63.0
99.7 percentile value is 73.0
99.8 percentile value is 90.0
99.9 percentile value is 123.0
100.0 percentile value is 1566.0
Observations:
- 99% of users follow 40 or fewer people.
- The maximum number of followees for a user is 1566.
print('No of persons not following anyone:', sum(np.array(outdegree_dist) == 0),
      'and % is', sum(np.array(outdegree_dist) == 0) * 100 / len(outdegree_dist))
print('No of persons with no followers:', sum(np.array(indegree_dist) == 0),
      'and % is', sum(np.array(indegree_dist) == 0) * 100 / len(indegree_dist))
No of persons not following anyone: 274512 and % is 14.741115442858524
No of persons with no followers: 188043 and % is 10.097786512871734
Notice that 14% of the users in our data have 0 followees and 10% of users have 0 followers.
Now let’s find, for each user, the total number of people they follow plus the number of people following them (in-degree + out-degree).
#total degree of each node: followers + followees
in_degree_dict = dict(g.in_degree())
out_degree_dict = dict(g.out_degree())
in_out_degree = [in_degree_dict[n] + out_degree_dict[n] for n in g.nodes()]

in_out_degree_sort = sorted(in_out_degree)
plt.figure(figsize=(10,6))
plt.plot(in_out_degree_sort)
plt.xlabel('Index No')
plt.ylabel('No Of people each person is following + followers')
plt.show()
Observations:
- Most values lie in the range of 80–100.
Note that the follower and followee distributions above do not separate out friends: a follower of a user may or may not be followed back by the user, and similarly a followee of the user may or may not follow the user back.
5. Data Preparation
Now, let’s generate those 9437519 missing edges as discussed earlier (this may take some time).
%%time
###generating missing edges from the given graph
import csv
import random

#getting all existing edges
r = csv.reader(open('data/train_woheader.csv', 'r'))

#the dict has a tuple of 2 nodes as key and 1 as value for every existing edge;
#edges.get((a,b), -1) then returns -1 for node pairs that are not connected
edges = dict()
for edge in r:
    edges[(int(edge[0]), int(edge[1]))] = 1   #cast to int so lookups with sampled integer ids work

missing_edges = set([])
while (len(missing_edges) < 9437519):
    a = random.randint(1, 1862220)
    b = random.randint(1, 1862220)
    tmp = edges.get((a, b), -1)
    if tmp == -1 and a != b:
        try:
            # only adding pairs that are less likely to be friends
            if nx.shortest_path_length(g, source=a, target=b) > 2:
                missing_edges.add((a, b))
            else:
                continue
        except:
            missing_edges.add((a, b))
    else:
        continue
Here, we create a dictionary keyed by pairs of connected nodes; looking up any pair that is not in the dictionary returns the default value ‘-1’.
We then run a while loop to select exactly 9437519 missing edges and add them to a set named ‘missing_edges’.
Now, take a look at the if statement. We first randomly pick two nodes and check that there is no edge between them in the dictionary and that the two nodes are not the same node (a node cannot be linked with itself).
Next, we check whether the shortest path between the two nodes is greater than two, and only then add the pair to our missing-edge set. But WHY?
Understand the intuition behind this.
Consider nodes 6, 4 and 3. Nodes 6 and 3 are both friends of node 4, so the shortest path between nodes 6 and 3 is two. Such pairs are quite likely to become friends with each other, so we leave them out of the negative class to help the model learn from cleaner examples. This is not strictly necessary, though.
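A tiny sanity check of this intuition on a toy graph (the node ids here are purely illustrative):
import networkx as nx

#6 and 3 are both friends of 4, so the shortest path between 6 and 3 goes via 4
toy = nx.DiGraph([(6, 4), (4, 6), (4, 3), (3, 4)])
print(nx.shortest_path_length(toy, source=6, target=3))   #2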
Since that’s done, let’s save it in a pickle file.
import pickle

len(missing_edges)   #9437519
pickle.dump(missing_edges, open('data/missing_edges_final.p', 'wb'))
Now, let’s save the two lists as positive and negative data points and perform a train-test split.
#reading the total data
from sklearn.model_selection import train_test_split

df_pos = pd.read_csv('train.csv')
df_neg = pd.DataFrame(list(missing_edges), columns=['source_node', 'destination_node'])

print("Number of nodes in the graph with edges", df_pos.shape[0])
print("Number of nodes in the graph without edges", df_neg.shape[0])

#train-test split: data split 80-20
#positive and negative links are split separately because we need only the positive
#training data for creating the graph and for feature generation
X_train_pos, X_test_pos, y_train_pos, y_test_pos = train_test_split(df_pos, np.ones(len(df_pos)), test_size=0.2, random_state=9)
X_train_neg, X_test_neg, y_train_neg, y_test_neg = train_test_split(df_neg, np.zeros(len(df_neg)), test_size=0.2, random_state=9)

print('='*60)
print("Number of nodes in the train data graph with edges", X_train_pos.shape[0], "=", y_train_pos.shape[0])
print("Number of nodes in the train data graph without edges", X_train_neg.shape[0], "=", y_train_neg.shape[0])
print('='*60)
print("Number of nodes in the test data graph with edges", X_test_pos.shape[0], "=", y_test_pos.shape[0])
print("Number of nodes in the test data graph without edges", X_test_neg.shape[0], "=", y_test_neg.shape[0])

#removing headers and saving
X_train_pos.to_csv('data/train_pos_after_eda.csv', header=False, index=False)
X_test_pos.to_csv('data/test_pos_after_eda.csv', header=False, index=False)
X_train_neg.to_csv('data/train_neg_after_eda.csv', header=False, index=False)
X_test_neg.to_csv('data/test_neg_after_eda.csv', header=False, index=False)
Number of nodes in the graph with edges 9437519
Number of nodes in the graph without edges 9437519
============================================================
Number of nodes in the train data graph with edges 7550015 = 7550015
Number of nodes in the train data graph without edges 7550015 = 7550015
============================================================
Number of nodes in the test data graph with edges 1887504 = 1887504
Number of nodes in the test data graph without edges 1887504 = 1887504
As the final data has been prepared and saved in four csv files, let’s visualize the final data using the NetworkX library and highlight some of its characteristics.
import os

if (os.path.isfile('data/train_pos_after_eda.csv')) and (os.path.isfile('data/test_pos_after_eda.csv')):
    train_graph = nx.read_edgelist('data/train_pos_after_eda.csv', delimiter=',', create_using=nx.DiGraph(), nodetype=int)
    test_graph = nx.read_edgelist('data/test_pos_after_eda.csv', delimiter=',', create_using=nx.DiGraph(), nodetype=int)
    print(nx.info(train_graph))
    print(nx.info(test_graph))

    # finding the unique nodes in both the train and test graphs
    train_nodes_pos = set(train_graph.nodes())
    test_nodes_pos = set(test_graph.nodes())

    trY_teY = len(train_nodes_pos.intersection(test_nodes_pos))
    trY_teN = len(train_nodes_pos - test_nodes_pos)
    teY_trN = len(test_nodes_pos - train_nodes_pos)

    print('no of people common in train and test -- ', trY_teY)
    print('no of people present in train but not present in test -- ', trY_teN)
    print('no of people present in test but not present in train -- ', teY_trN)
    print(' % of people not there in Train but exist in Test in total Test data are {} %'.format(teY_trN/len(test_nodes_pos)*100))
Name:
Type: DiGraph
Number of nodes: 1780722
Number of edges: 7550015
Average in degree: 4.2399
Average out degree: 4.2399
Name:
Type: DiGraph
Number of nodes: 1144623
Number of edges: 1887504
Average in degree: 1.6490
Average out degree: 1.6490
no of people common in train and test -- 1063125
no of people present in train but not present in test -- 717597
no of people present in test but not present in train -- 81498
% of people not there in Train but exist in Test in total Test data are 7.1200735962845405 %
Around 81000 people are present in the test set but not in the train set; this is a slight cold-start problem.
Now, let’s concatenate the positive and negative edges together and save the final test and train data.
#final train and test data sets
if (not os.path.isfile('data/train_after_eda.csv')) and \
   (not os.path.isfile('data/test_after_eda.csv')) and \
   (not os.path.isfile('data/train_y.csv')) and \
   (not os.path.isfile('data/test_y.csv')) and \
   (os.path.isfile('data/train_pos_after_eda.csv')) and \
   (os.path.isfile('data/test_pos_after_eda.csv')) and \
   (os.path.isfile('data/train_neg_after_eda.csv')) and \
   (os.path.isfile('data/test_neg_after_eda.csv')):

    X_train_pos = pd.read_csv('data/train_pos_after_eda.csv', names=['source_node', 'destination_node'])
    X_test_pos = pd.read_csv('data/test_pos_after_eda.csv', names=['source_node', 'destination_node'])
    X_train_neg = pd.read_csv('data/train_neg_after_eda.csv', names=['source_node', 'destination_node'])
    X_test_neg = pd.read_csv('data/test_neg_after_eda.csv', names=['source_node', 'destination_node'])

    print('='*60)
    print("Number of nodes in the train data graph with edges", X_train_pos.shape[0])
    print("Number of nodes in the train data graph without edges", X_train_neg.shape[0])
    print('='*60)
    print("Number of nodes in the test data graph with edges", X_test_pos.shape[0])
    print("Number of nodes in the test data graph without edges", X_test_neg.shape[0])

    X_train = X_train_pos.append(X_train_neg, ignore_index=True)
    y_train = np.concatenate((y_train_pos, y_train_neg))
    X_test = X_test_pos.append(X_test_neg, ignore_index=True)
    y_test = np.concatenate((y_test_pos, y_test_neg))

    X_train.to_csv('data/train_after_eda.csv', header=False, index=False)
    X_test.to_csv('data/test_after_eda.csv', header=False, index=False)
    pd.DataFrame(y_train.astype(int)).to_csv('data/train_y.csv', header=False, index=False)
    pd.DataFrame(y_test.astype(int)).to_csv('data/test_y.csv', header=False, index=False)
============================================================
Number of nodes in the train data graph with edges 7550015
Number of nodes in the train data graph without edges 7550015
============================================================
Number of nodes in the test data graph with edges 1887504
Number of nodes in the test data graph without edges 1887504
#final shape of the train and test data
print("Data points in train data", X_train.shape)
print("Data points in test data", X_test.shape)
print("Shape of target variable in train", y_train.shape)
print("Shape of target variable in test", y_test.shape)
Data points in train data (15100030, 2)
Data points in test data (3775008, 2)
Shape of target variable in train (15100030,)
Shape of target variable in test (3775008,)
6. Feature Engineering
Since we have no ready-made features in our data, we will generate our own.
1. Jaccard Distance
The Jaccard index measures the similarity between finite sets and is defined as the size of the intersection divided by the size of the union of the sample sets. The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard index and is obtained by subtracting the Jaccard index from 1.
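As a tiny illustrative example (the sets here are made up):
#two users whose neighbor sets overlap in 2 of 4 nodes
A = {1, 2, 3}
B = {2, 3, 4}
jaccard_index = len(A & B) / len(A | B)   #2 / 4 = 0.5
jaccard_distance = 1 - jaccard_index      #0.5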
We will calculate the Jaccard index over the followers and over the followees of each node pair in the training set.
#for followees
def jaccard_for_followees(a, b):
    try:
        if len(set(train_graph.successors(a))) == 0 or len(set(train_graph.successors(b))) == 0:
            return 0
        sim = (len(set(train_graph.successors(a)).intersection(set(train_graph.successors(b))))) / \
              (len(set(train_graph.successors(a)).union(set(train_graph.successors(b)))))
    except:
        return 0
    return sim

#for followers
def jaccard_for_followers(a, b):
    try:
        if len(set(train_graph.predecessors(a))) == 0 or len(set(train_graph.predecessors(b))) == 0:
            return 0
        sim = (len(set(train_graph.predecessors(a)).intersection(set(train_graph.predecessors(b))))) / \
              (len(set(train_graph.predecessors(a)).union(set(train_graph.predecessors(b)))))
        return sim
    except:
        return 0
2. Cosine Distance
Cosine similarity measures how similar two vectors are irrespective of their size: it is the cosine of the angle between the two vectors in a multi-dimensional space. For two sets of neighbors, it reduces to the size of their intersection divided by the square root of the product of their sizes.
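Using the same toy sets as before:
import math

A = {1, 2, 3}
B = {2, 3, 4}
cosine_sim = len(A & B) / math.sqrt(len(A) * len(B))   #2 / 3 ≈ 0.667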
Let’s calculate this for the followers and the followees of all the node pairs.
#for followees
import math

def cosine_for_followees(a, b):
    try:
        if len(set(train_graph.successors(a))) == 0 or len(set(train_graph.successors(b))) == 0:
            return 0
        sim = (len(set(train_graph.successors(a)).intersection(set(train_graph.successors(b))))) / \
              (math.sqrt(len(set(train_graph.successors(a))) * len(set(train_graph.successors(b)))))
        return sim
    except:
        return 0

#for followers
def cosine_for_followers(a, b):
    try:
        if len(set(train_graph.predecessors(a))) == 0 or len(set(train_graph.predecessors(b))) == 0:
            return 0
        sim = (len(set(train_graph.predecessors(a)).intersection(set(train_graph.predecessors(b))))) / \
              (math.sqrt(len(set(train_graph.predecessors(a))) * len(set(train_graph.predecessors(b)))))
        return sim
    except:
        return 0
3. Page rank
PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. PageRank was named after Larry Page, one of the founders of Google.
PageRank works by counting the number and quality of links to a page (a node in our case) to determine a rough estimate of how important the website (node) is.
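A toy example of what it computes (the graph and the approximate values below are purely illustrative):
import networkx as nx

#node 3 receives links from two nodes and ends up with the highest score
toy = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 3)])
print(nx.pagerank(toy, alpha=0.85))
#roughly {1: 0.32, 2: 0.31, 3: 0.33, 4: 0.04}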
Here, we compute the PageRank of every node in the training graph and later use it as a feature for both the source and the destination node of each pair.
pr = nx.pagerank(train_graph, alpha=0.85)
pickle.dump(pr, open('data/page_rank.p', 'wb'))

mean_pr = float(sum(pr.values())) / len(pr)
print('min', pr[min(pr, key=pr.get)])
print('max', pr[max(pr, key=pr.get)])
print('mean', mean_pr)
min 1.6556497245737814e-07
max 2.7098251341935827e-05
mean 5.615699699389075e-07
Data points that appear in the test set but not in the training set will not have a PageRank value; for these we will impute the mean PageRank.
4. Shortest path
The shortest path between two nodes is the path with the minimum number of edges (or, in a weighted graph, the minimum total edge weight).
We will be calculating the shortest path between every node pair in our network.
def compute_shortest_path_length(a, b):
    p = -1
    try:
        if train_graph.has_edge(a, b):
            train_graph.remove_edge(a, b)
            p = nx.shortest_path_length(train_graph, source=a, target=b)
            train_graph.add_edge(a, b)
        else:
            p = nx.shortest_path_length(train_graph, source=a, target=b)
        return p
    except:
        return -1
For every node pair that already has a link between them, we temporarily remove that link before computing the shortest path (and add it back afterwards). We do this so that the feature does not trivially leak the very link we are trying to predict.
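A small demonstration of the difference this makes, on a toy graph:
import networkx as nx

#with the direct edge kept, the shortest path between linked nodes is trivially 1
toy = nx.DiGraph([(1, 2), (2, 3), (1, 3)])
print(nx.shortest_path_length(toy, source=1, target=3))   #1 (the edge itself)

toy.remove_edge(1, 3)
print(nx.shortest_path_length(toy, source=1, target=3))   #2 (via node 2)
toy.add_edge(1, 3)   #restore the edge, as compute_shortest_path_length() does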
5. Weakly connected components
A component is said to be strongly connected if there is a directed path from every node to every other node in it. To borrow an example from Wikipedia,
https://en.wikipedia.org/wiki/Strongly_connected_component#/media/File:Scc.png
In the first component (consisting of a, b and e), every node is reachable from every other node in that component: we can go from a → b, b → e, e → a, and b → a via e.
A directed graph is weakly connected if replacing all of its directed edges with undirected edges produces a connected graph. Every strongly connected component is therefore also weakly connected. If a component is not strongly connected, you can check whether it is weakly connected by ignoring the edge directions and seeing whether there is still at least one path between every pair of nodes.
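A quick illustration with NetworkX (toy graph):
import networkx as nx

#weakly connected but not strongly connected: ignoring directions it is one piece,
#but there is no directed path from node 3 back to node 1
toy = nx.DiGraph([(1, 2), (2, 3)])
print(nx.is_weakly_connected(toy))     #True
print(nx.is_strongly_connected(toy))   #False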
#getting the weakly connected components of the graph
wcc = list(nx.weakly_connected_components(train_graph))

def belongs_to_same_wcc(a, b):
    index = []
    if train_graph.has_edge(b, a):
        return 1
    if train_graph.has_edge(a, b):
        for i in wcc:
            if a in i:
                index = i
                break
        if (b in index):
            train_graph.remove_edge(a, b)
            if compute_shortest_path_length(a, b) == -1:
                train_graph.add_edge(a, b)
                return 0
            else:
                train_graph.add_edge(a, b)
                return 1
        else:
            return 0
    else:
        for i in wcc:
            if a in i:
                index = i
                break
        if (b in index):
            return 1
        else:
            return 0
6. Adar Index
The Adamic/Adar index for a pair of vertices is defined as the sum, over their common neighbors, of the inverse logarithm of each common neighbor’s degree.
This metric measures the closeness of two nodes based on their shared neighbors: a value of 0 indicates that the two nodes are not close, while higher values indicate they are close, with rarer (low-degree) common neighbors weighted more heavily.
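A small worked example, using base-10 logs as in the function below (the follower counts are made up):
import numpy as np

#two common neighbors with 10 and 100 followers respectively
score = 1 / np.log10(10) + 1 / np.log10(100)
print(score)   #1.0 + 0.5 = 1.5 -> the rarer common neighbor contributes more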
def calc_adar_in(a, b):
    sum = 0
    try:
        n = list(set(train_graph.successors(a)).intersection(set(train_graph.successors(b))))
        if len(n) != 0:
            for i in n:
                sum = sum + (1 / np.log10(len(list(train_graph.predecessors(i)))))
            return sum
        else:
            return 0
    except:
        return 0
7. Follow back
This is a simple feature that checks whether the destination node already follows the source node back.
def follows_back(a, b):
    if train_graph.has_edge(b, a):
        return 1
    else:
        return 0
8. Katz Centrality
Katz centrality of a node is a measure of centrality in a network. Unlike typical centrality measures which consider only the shortest path between a pair of actors, Katz centrality measures influence by taking into account the total number of walks between a pair of actors.
It is similar to Google’s PageRank.
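A toy example of what it produces (illustrative graph and parameters only):
import networkx as nx

#same small graph as before; Katz centrality counts all incoming walks,
#damped by alpha, plus a constant beta for every node
toy = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 3)])
print(nx.katz_centrality(toy, alpha=0.1, beta=1.0))
#node 3, with the most incoming links, gets the highest score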
Let’s compute the Katz centrality for our network.
katz = nx.katz.katz_centrality(train_graph, alpha=0.005, beta=1)
pickle.dump(katz, open('data/fea_sample/katz.p', 'wb'))

mean_katz = float(sum(katz.values())) / len(katz)
print('min', katz[min(katz, key=katz.get)])
print('max', katz[max(katz, key=katz.get)])
print('mean', mean_katz)
min 0.0007314327794326148
max 0.003394923825996258
mean 0.0007483801279873423
9. HITS
Hyperlink-Induced Topic Search (HITS) identifies good authorities and hubs for a topic by assigning two numbers to each node: an authority weight and a hub weight. The authority score estimates the value of a node based on its incoming links; the hub score estimates its value based on its outgoing links.
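A toy example (illustrative graph):
import networkx as nx

#node 1 links out to everything (a hub); node 3 is linked to by several nodes (an authority)
toy = nx.DiGraph([(1, 2), (1, 3), (1, 4), (2, 3), (4, 3)])
hubs, authorities = nx.hits(toy)
print(hubs)          #node 1 has the largest hub score
print(authorities)   #node 3 has the largest authority score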
Let us compute the hubs and authorities for our nodes.
hits = nx.hits(train_graph, max_iter=100, tol=1e-08, nstart=None, normalized=True)
pickle.dump(hits, open('data/fea_sample/hits.p', 'wb'))

print('min', hits[0][min(hits[0], key=hits[0].get)])
print('max', hits[0][max(hits[0], key=hits[0].get)])
print('mean', float(sum(hits[0].values())) / len(hits[0]))
min 0.0
max 0.004868653378780953
mean 5.615699699344123e-07
Now, we will create a sample of 500000 points for training and a sample of 50000 for testing.
filename = "data/train_after_eda.csv"
n_train = sum(1 for line in open(filename)) #number of records in file (excludes header)
s = 500000 #desired sample size
skip_train = sorted(random.sample(range(1,n_train+1),n_train-s))filename = "data/test_after_eda.csv"
n_test = sum(1 for line in open(filename)) #number of records in file (excludes header)
s = 50000 #desired sample size
skip_test = sorted(random.sample(range(1,n_test+1),n_test-s))
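The sampled frames df_final_train and df_final_test are built from these skip lists. The exact loading code is not shown above, so here is a minimal sketch (not the original code) of one way it could be done with pandas’ skiprows, assuming the edge and label files saved earlier:
#a minimal sketch: load only the rows that are not in the skip lists
df_final_train = pd.read_csv('data/train_after_eda.csv', skiprows=skip_train,
                             names=['source_node', 'destination_node'])
df_final_test = pd.read_csv('data/test_after_eda.csv', skiprows=skip_test,
                            names=['source_node', 'destination_node'])

#attach the class label (indicator_link), sampled with the same skip lists
df_final_train['indicator_link'] = pd.read_csv('data/train_y.csv', skiprows=skip_train,
                                               names=['indicator_link'])['indicator_link'].values
df_final_test['indicator_link'] = pd.read_csv('data/test_y.csv', skiprows=skip_test,
                                              names=['indicator_link'])['indicator_link'].values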
Now we will be computing the above features for our sampled data.
from time import time

start_time = time()

#mapping jaccard followers to the train and test data
df_final_train['jaccard_followers'] = df_final_train.apply(lambda row:
    jaccard_for_followers(row['source_node'], row['destination_node']), axis=1)
df_final_test['jaccard_followers'] = df_final_test.apply(lambda row:
    jaccard_for_followers(row['source_node'], row['destination_node']), axis=1)

#mapping jaccard followees to the train and test data
df_final_train['jaccard_followees'] = df_final_train.apply(lambda row:
    jaccard_for_followees(row['source_node'], row['destination_node']), axis=1)
df_final_test['jaccard_followees'] = df_final_test.apply(lambda row:
    jaccard_for_followees(row['source_node'], row['destination_node']), axis=1)

#mapping cosine followers to the train and test data
df_final_train['cosine_followers'] = df_final_train.apply(lambda row:
    cosine_for_followers(row['source_node'], row['destination_node']), axis=1)
df_final_test['cosine_followers'] = df_final_test.apply(lambda row:
    cosine_for_followers(row['source_node'], row['destination_node']), axis=1)

#mapping cosine followees to the train and test data
df_final_train['cosine_followees'] = df_final_train.apply(lambda row:
    cosine_for_followees(row['source_node'], row['destination_node']), axis=1)
df_final_test['cosine_followees'] = df_final_test.apply(lambda row:
    cosine_for_followees(row['source_node'], row['destination_node']), axis=1)

print("--- %s seconds ---" % (time() - start_time))
Let’s also compute the number of followers and followees for both the source and the destination node, and the number of common followers and common followees between them.
def compute_features_stage1(df_final):
    #calculating the number of followers and followees for source and destination
    #and the intersection of followers and followees between source and destination
    num_followers_s = []
    num_followees_s = []
    num_followers_d = []
    num_followees_d = []
    inter_followers = []
    inter_followees = []
    for i, row in df_final.iterrows():
        try:
            s1 = set(train_graph.predecessors(row['source_node']))
            s2 = set(train_graph.successors(row['source_node']))
        except:
            s1 = set()
            s2 = set()
        try:
            d1 = set(train_graph.predecessors(row['destination_node']))
            d2 = set(train_graph.successors(row['destination_node']))
        except:
            d1 = set()
            d2 = set()
        num_followers_s.append(len(s1))
        num_followees_s.append(len(s2))
        num_followers_d.append(len(d1))
        num_followees_d.append(len(d2))
        inter_followers.append(len(s1.intersection(d1)))
        inter_followees.append(len(s2.intersection(d2)))
    return num_followers_s, num_followers_d, num_followees_s, num_followees_d, inter_followers, inter_followees

df_final_train['num_followers_s'], df_final_train['num_followers_d'], \
df_final_train['num_followees_s'], df_final_train['num_followees_d'], \
df_final_train['inter_followers'], df_final_train['inter_followees'] = compute_features_stage1(df_final_train)

df_final_test['num_followers_s'], df_final_test['num_followers_d'], \
df_final_test['num_followees_s'], df_final_test['num_followees_d'], \
df_final_test['inter_followers'], df_final_test['inter_followees'] = compute_features_stage1(df_final_test)
And finally the remaining features.
#mapping adar index on train
df_final_train['adar_index'] = df_final_train.apply(lambda row: calc_adar_in(row['source_node'], row['destination_node']), axis=1)
#mapping adar index on test
df_final_test['adar_index'] = df_final_test.apply(lambda row: calc_adar_in(row['source_node'], row['destination_node']), axis=1)
#--------------------------------------------------------------------------------------------------------
#mapping follows back or not on train
df_final_train['follows_back'] = df_final_train.apply(lambda row: follows_back(row['source_node'], row['destination_node']), axis=1)
#mapping follows back or not on test
df_final_test['follows_back'] = df_final_test.apply(lambda row: follows_back(row['source_node'], row['destination_node']), axis=1)
#--------------------------------------------------------------------------------------------------------
#mapping same wcc component or not on train
df_final_train['same_comp'] = df_final_train.apply(lambda row: belongs_to_same_wcc(row['source_node'], row['destination_node']), axis=1)
#mapping same wcc component or not on test
df_final_test['same_comp'] = df_final_test.apply(lambda row: belongs_to_same_wcc(row['source_node'], row['destination_node']), axis=1)
#--------------------------------------------------------------------------------------------------------
#mapping shortest path on train
df_final_train['shortest_path'] = df_final_train.apply(lambda row: compute_shortest_path_length(row['source_node'], row['destination_node']), axis=1)
#mapping shortest path on test
df_final_test['shortest_path'] = df_final_test.apply(lambda row: compute_shortest_path_length(row['source_node'], row['destination_node']), axis=1)
#================================================================================
#PageRank score for source and destination in train and test
#if a node is not in the train graph, the mean PageRank is used
df_final_train['page_rank_s'] = df_final_train.source_node.apply(lambda x: pr.get(x, mean_pr))
df_final_train['page_rank_d'] = df_final_train.destination_node.apply(lambda x: pr.get(x, mean_pr))
df_final_test['page_rank_s'] = df_final_test.source_node.apply(lambda x: pr.get(x, mean_pr))
df_final_test['page_rank_d'] = df_final_test.destination_node.apply(lambda x: pr.get(x, mean_pr))
#================================================================================
#Katz centrality score for source and destination in train and test
#if a node is not in the train graph, the mean Katz score is used
df_final_train['katz_s'] = df_final_train.source_node.apply(lambda x: katz.get(x, mean_katz))
df_final_train['katz_d'] = df_final_train.destination_node.apply(lambda x: katz.get(x, mean_katz))
df_final_test['katz_s'] = df_final_test.source_node.apply(lambda x: katz.get(x, mean_katz))
df_final_test['katz_d'] = df_final_test.destination_node.apply(lambda x: katz.get(x, mean_katz))
#================================================================================
#HITS hub score for source and destination in train and test
#if a node is not in the train graph, 0 is used
df_final_train['hubs_s'] = df_final_train.source_node.apply(lambda x: hits[0].get(x, 0))
df_final_train['hubs_d'] = df_final_train.destination_node.apply(lambda x: hits[0].get(x, 0))
df_final_test['hubs_s'] = df_final_test.source_node.apply(lambda x: hits[0].get(x, 0))
df_final_test['hubs_d'] = df_final_test.destination_node.apply(lambda x: hits[0].get(x, 0))
#================================================================================
#HITS authority score for source and destination in train and test
#if a node is not in the train graph, 0 is used
df_final_train['authorities_s'] = df_final_train.source_node.apply(lambda x: hits[1].get(x, 0))
df_final_train['authorities_d'] = df_final_train.destination_node.apply(lambda x: hits[1].get(x, 0))
df_final_test['authorities_s'] = df_final_test.source_node.apply(lambda x: hits[1].get(x, 0))
df_final_test['authorities_d'] = df_final_test.destination_node.apply(lambda x: hits[1].get(x, 0))
7. Model Building
Now that we have mapped this problem to a binary classification problem and prepared some features, we can proceed with training a machine learning model using a Random Forest classifier.
We will store our target variable (indicator_link) in a separate variable for both the train and test data, so that we can later compare the actual and predicted results to evaluate our model.
y_train = df_final_train.indicator_link
y_test = df_final_test.indicator_link

df_final_train.drop(['source_node', 'destination_node', 'indicator_link'], axis=1, inplace=True)
df_final_test.drop(['source_node', 'destination_node', 'indicator_link'], axis=1, inplace=True)

#training the model
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

start_time = time()
param_dist = {"n_estimators": sp_randint(105, 125),
              "max_depth": sp_randint(10, 15),
              "min_samples_split": sp_randint(110, 190),
              "min_samples_leaf": sp_randint(25, 65)}

clf = RandomForestClassifier(random_state=25, n_jobs=-1)
rf_random = RandomizedSearchCV(clf, param_distributions=param_dist,
                               n_iter=5, cv=10, scoring='f1', random_state=25)
rf_random.fit(df_final_train, y_train)

print('mean test scores', rf_random.cv_results_['mean_test_score'])
print('mean train scores', rf_random.cv_results_['mean_train_score'])
print("--- %s seconds ---" % (time() - start_time))
mean test scores [0.94654557 0.94543735 0.94158429 0.94623061 0.9493983 ]
mean train scores [0.94709695 0.94588733 0.94188146 0.94681792 0.95008015]
--- 1306.9546287059784 seconds ---
#checking the best parameters
print(rf_random.best_estimator_)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=14, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=28, min_samples_split=111,
            min_weight_fraction_leaf=0.0, n_estimators=121, n_jobs=-1,
            oob_score=False, random_state=25, verbose=0, warm_start=False)
Building the model with the best parameters.
clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=14, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=28, min_samples_split=111,
min_weight_fraction_leaf=0.0, n_estimators=121, n_jobs=-1,
oob_score=False, random_state=25, verbose=0, warm_start=False)
8. Evaluation
Now let’s fit the model with the best parameters and evaluate it on both the train and test data.
#predicting the train and test results
clf.fit(df_final_train,y_train)
y_train_pred = clf.predict(df_final_train)
y_test_pred = clf.predict(df_final_test)
F1 score
The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.
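A tiny worked example (the precision and recall values are made up):
#F1 is the harmonic mean of precision and recall
precision, recall = 0.9, 0.6
f1 = 2 * precision * recall / (precision + recall)
print(f1)   #0.72, lower than the arithmetic mean (0.75)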
from sklearn.metrics import f1_score

print('Train f1 score', f1_score(y_train, y_train_pred))
print('Test f1 score', f1_score(y_test, y_test_pred))
Train f1 score 0.9505666448977238
Test f1 score 0.9246177952822598
Confusion Matrix
A confusion matrix is often used to describe the performance of a classification model: it is a summary of the prediction results on a classification problem.
The numbers of correct and incorrect predictions are summarized with count values and broken down by class.
We will write a script that computes the confusion matrix, precision matrix and recall matrix for our model, and visualizes the results so they are easy to understand.
from sklearn.metrics import confusion_matrix
import seaborn as sns

def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)
    A = (((C.T) / (C.sum(axis=1))).T)   #recall matrix: rows normalized by actual class counts
    B = (C / C.sum(axis=0))             #precision matrix: columns normalized by predicted class counts
    plt.figure(figsize=(20, 4))
    labels = [0, 1]
    cmap = sns.light_palette("blue")

    #confusion matrix heatmap
    plt.subplot(1, 3, 1)
    sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Confusion matrix")

    #precision matrix heatmap
    plt.subplot(1, 3, 2)
    sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Precision matrix")

    #recall matrix heatmap
    plt.subplot(1, 3, 3)
    sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Recall matrix")
    plt.show()

print('Train confusion_matrix')
plot_confusion_matrix(y_train, y_train_pred)
print('Test confusion_matrix')
plot_confusion_matrix(y_test, y_test_pred)
Observations:
- From the test results, 490 points were falsely classified as ‘negative’ whereas they were actually ‘positive’.
- Similarly, 3070 points were wrongly classified as ‘positive’.
ROC curve
The ROC curve is another way of measuring the performance of a classification model: the higher the area under the curve (AUC), the better the model is at predicting 0s as 0s and 1s as 1s.
The ROC curve plots the TPR on the y-axis against the FPR on the x-axis; the objective is to maximize the area under the curve (AUC).
Let’s take an example
This figure consists of two ROC curves(red one and a blue one). The AUC for the red curve is greater than the AUC for the blue curve. Suppose the red curve was for Random Forest and the blue one was for logistic regression, we will choose the Random Forest as our winning model as it gives a better AUC.
from sklearn.metrics import roc_curve, auc

#note: using the hard 0/1 predictions gives only a single operating point;
#passing predicted probabilities (clf.predict_proba) would give a smoother curve
fpr, tpr, ths = roc_curve(y_test, y_test_pred)
auc_sc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='navy', label='ROC curve (area = %0.2f)' % auc_sc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic with test data')
plt.legend()
plt.show()
9. Conclusion
I hope you got at least some insight into how friend suggestions on social media platforms like Facebook and Instagram can work using machine learning.
If you have any doubts, suggestions or corrections, do mention them in the comments and if you liked it, do share it with others .😘
I have put up a github repository containing some code if you are interested.
Here are some articles I referred to:
a. https://www.cs.cornell.edu/home/kleinber/link-pred.pdf
b. https://hackernoon.com/link-prediction-in-large-scale-networks-f836fcb05c88
Acknowledgements:
Writing this blog has been a great exercise. Special thanks to Prof. Harshall Lamba, whose encouragement and support helped me sustain my efforts.