diff --git a/docs/experiences_shared.html b/docs/experiences_shared.html index d0771ba..587c5b2 100644 --- a/docs/experiences_shared.html +++ b/docs/experiences_shared.html @@ -152,8 +152,8 @@

Heidi, Paula and Dietlind started the search by looking for job ads manually. This process can be scaled, and that is what we describe here.

From the start, Paula picked Indeed as the job database for the search. This decision was based on the availability of examples of how to scrape Indeed. By default, a search on Indeed uses your location, hence jobs were only searched in Belgium. The first problem encountered with the location is that each European country has a different URL for searching ads on Indeed, for example be.indeed.com, de.indeed.com and indeed.co.uk. Just for the purpose of testing, the URL www.indeed.com/q-Europe-jobs.html was also used. The second problem is that not all job ads in Europe are in English, which makes them difficult to parse. After filtering those out, the search found either no matches or only 2 or 3, and in some instances these were the same ad posted on two different platforms.

Because of those obstacles, the decision was to search only in the United States, using the URL indeed.com. With that location, enough job ads in English were found for several job titles.

-

Once we found some jobs to parse, the other question was, what to look for. For text mining, there are a few options, sentences, words or group of words. The first test was with sentences, but that came up to be the first problem. Many ads use bullet points, which don’t end up with a period .. In any case, I continue to look for words in an attempt to see which ones the most common words and what does that tell me. Most common words for ads related to “Data Steward” are data, business, management and experience, with 96% and above occurrence in the job ads searched. That was a bit interesting but didn’t say much to continue. I’ve also looked for word groups, 2 or 3 up to 6, trying to make sense of the search. Apart from not giving me any interesting results, I came across with the problem of duplicated ads, which I just decided to avoid.

-

Finally after some weeks of leaving it aside. The idea of cleaning the HTML before parsing it was what lead to the current implementation. From the HTML code the process is:

+

Once we found some jobs to parse, the next question was what to look for. For text mining there are a few options: sentences, words or groups of words. The first test was with sentences, but that turned out to be the first problem: many ads use bullet points, which don’t end with a period. In any case, I continued to look for words in an attempt to see which were the most common and what that tells me. The most common words for ads related to “Data Steward” are data, business, management and experience, each occurring in 96% or more of the job ads searched. That was a bit interesting, but didn’t say enough to continue. I’ve also looked for word groups of 2 or 3 up to 6 words, trying to make sense of the search. Apart from not giving me any interesting results, I came across the problem of duplicated ads, which I just decided to avoid.

+

Finally, after some weeks of leaving it aside, the idea of cleaning the HTML before parsing it was what led to the current implementation. Starting from the HTML code, the process is:

  1. To replace the end of lines with a period (which is super useful with lists, and helps down the road to have sentences).
  2. To replace all strange characters like \t, \r, /, <, >, |, : and \ with a simple space.
  3. diff --git a/experiences_shared.md b/experiences_shared.md index db43f2e..d93d051 100644 --- a/experiences_shared.md +++ b/experiences_shared.md @@ -35,16 +35,16 @@ found for several job titles. Once we found some jobs to parse, the other question was, what to look for. For text mining, there are a few options, sentences, words or group of words. The first test was with sentences, but that came up to be the first problem. Many ads use -bullet points, which don't end up with a period `.`. In any case, I continue to look -for words in an attempt to see which ones the most common words and what does that +bullet points, which don't end up with a period. In any case, I continued to look +for words in an attempt to see which ones were the most common words and what does that tell me. Most common words for ads related to "Data Steward" are data, business, management and experience, [with 96% and above occurrence in the job ads searched](https://github.com/orchid00/jobsScrapping/blob/master/figures/top20words.pdf). -That was a bit interesting but didn't say much to continue. I've also looked for +That was a bit interesting, but didn't say much to continue. I've also looked for word groups, 2 or 3 up to 6, trying to make sense of the search. Apart from not giving me any interesting results, I came across with the problem of duplicated ads, which I just decided to avoid. -Finally after some weeks of leaving it aside. The idea of cleaning the HTML before +Finally, after some weeks of leaving it aside. The idea of cleaning the HTML before parsing it was what lead to the current implementation. From the HTML code the process is: 1. to replace the end of lines with a period (which is super useful with lists, and helps down to the road to have sentences).
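
The first two cleaning steps described in the text (turning ends of lines into periods, then replacing the listed characters with a space) could be sketched roughly as follows. This is only an illustration in Python, not the project's actual implementation: the function name `clean_ad_text`, the normalisation of Windows line endings and the final whitespace collapse are assumptions that do not appear in the original.

```python
import re

# The characters the text lists as "strange": \t, \r, /, <, >, |, : and \
STRANGE_CHARS = re.compile(r"[\t\r/<>|:\\]")


def clean_ad_text(raw: str) -> str:
    """Hypothetical helper applying the two cleaning steps in order."""
    # Assumption: treat Windows line endings as a single end of line.
    text = raw.replace("\r\n", "\n")
    # Step 1: replace each end of line with a period, so that bullet
    # points without final punctuation become sentence-like units.
    text = text.replace("\n", ". ")
    # Step 2: replace each listed character with a simple space.
    text = STRANGE_CHARS.sub(" ", text)
    # Assumption: collapse the runs of spaces left behind by step 2.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "Manage data\nWork with business|IT"
    print(clean_ad_text(sample))
```

Doing step 1 before step 2 matters in this sketch: the period marks where each bullet ended before the remaining separators are flattened into spaces.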