Given the potentially high load and the machine learning features of this project, we chose Google Cloud Platform, Docker, Python, Flask, TensorFlow, Keras, Jupyter, and React.js.
We analyzed the dataset of publications from the website using the following methods:
- Correlation analysis of views, measured at two points: 2 hours and 24 hours after the piece was posted
- TF-IDF analysis to determine the importance of specific words to a document in the corpus
- LDA topic modeling to round up the relevant tag keywords.
The analysis was performed with Polyglot, a Python-based natural language pipeline.
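To make the TF-IDF step more concrete, here is a minimal sketch in plain Python. It assumes one common textbook variant (term frequency as count divided by document length, inverse document frequency as log(N / df)), which is not necessarily the weighting used in the project. The function name `tf_idf` and the sample headlines are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical headlines, not taken from the project dataset.
headlines = [
    "mayor opens new bridge",
    "new park opens downtown",
    "mayor visits new school",
]
docs = [headline.lower().split() for headline in headlines]
scores = tf_idf(docs)
```

With this weighting, a word that occurs in every headline (here "new") scores zero, while a word unique to one headline (such as "bridge") scores highest within it.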
Results of Analysis:
As a result, we constructed a correlation matrix that illustrates how each of these factors relates to the success rate of the headline. The graph includes the following correlations:
- Publication time (2 and 24 hours after posting)
- Entity count (including person, organization, and location count)
- Title length, polarity, digits, and sentiment score
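A correlation matrix of this kind could be built roughly as follows. This is a sketch that assumes Pearson correlation, which the analysis may or may not have used. The column names and all values are hypothetical and only stand in for the factors listed above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations as a nested dict: matrix[a][b]."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names}
        for a in names
    }


# Hypothetical per-headline features; not the project's data.
features = {
    "views_2h": [120, 340, 90, 410, 200],
    "views_24h": [900, 2100, 700, 2600, 1500],
    "entity_count": [1, 3, 0, 2, 2],
    "title_length": [42, 65, 30, 58, 51],
}
matrix = correlation_matrix(features)
```

Each entry `matrix[a][b]` lies between -1 and 1, and the matrix is symmetric, so `matrix["views_2h"]["title_length"]` equals `matrix["title_length"]["views_2h"]`.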